Stripping non-printable characters from a string in Python

88

I used to run

$s =~ s/[^[:print:]]//g;

in Perl to get rid of non-printable characters.

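In Python there are no POSIX regex classes, so I can't write [:print:] and have it mean what I want. I know of no way in Python to detect whether a character is printable.

What would you do?

EDIT: It has to support Unicode characters as well. The string.printable approach will happily strip them out of the output, and curses.ascii.isprint will return False for any Unicode character.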

python string non-printable

— Vinko Vrsalovic

83

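Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See the Unicode Character Database for descriptions of the categories.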

import unicodedata, re, itertools, sys

all_chars = (chr(i) for i in range(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

For Python 2:

import unicodedata, re, sys

all_chars = (unichr(i) for i in xrange(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

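For some use cases, additional categories (e.g. all of the categories in the control group) might be preferable, although this might slow down processing and increase memory usage significantly. Number of characters per category:

Cc (control): 65
Cf (format): 161
Cs (surrogate): 2048
Co (private use): 137468
Cn (unassigned): 836601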
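
The counts above can be reproduced with a short script. This is only a sketch: `count_categories` is a hypothetical helper name, and the numbers for Cf and Cn in particular depend on the Unicode version bundled with your Python build, so they may differ from the figures quoted here.

```python
import sys
import unicodedata
from collections import Counter


def count_categories(prefix="C"):
    """Count code points per general category whose name starts with `prefix`."""
    counts = Counter()
    # sys.maxunicode is the last valid code point, so include it in the range.
    for code_point in range(sys.maxunicode + 1):
        category = unicodedata.category(chr(code_point))
        if category.startswith(prefix):
            counts[category] += 1
    return counts


counts = count_categories()
for category in sorted(counts):
    print(category, counts[category])
```

The scan visits every code point once, so it takes a moment to run, but it only needs to be done when you are deciding which categories to filter.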

Edit: added suggestions from the comments.

— Ants Aasma

4

Is 'Cc' enough here? I don't know, I'm just asking. It seems to me that some of the other 'C' categories may be candidates for this filter as well.

— Patrick Johnmeyer

1

This function, as published, removes half of the Hebrew characters. I get the same effect with both of the methods given.

— dotancohen, 2012

1

From a performance perspective, wouldn't string.translate() be faster in this case? See stackoverflow.com/questions/265960/...

— Kashyap

3

Use all_chars = (unichr(i) for i in xrange(sys.maxunicode)) to avoid the narrow-build error.

— danmichaelo

4

For me, control_chars == '\x00-\x1f\x7f-\x9f' (tested on Python 3.5.2).

— AXO
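
Following up on the range form noted above: when more categories than 'Cc' are wanted, the character class can be written as code point ranges instead of listing every character, which keeps the pattern small. The sketch below is one possible way to do that, not the answer author's code. `category_ranges`, `build_class` and `remove_control_and_format_chars` are hypothetical names, and the choice of {'Cc', 'Cf'} is only an assumption for illustration.

```python
import re
import sys
import unicodedata


def category_ranges(categories):
    """Yield (first, last) code point pairs for runs of characters in `categories`."""
    start = None
    for code_point in range(sys.maxunicode + 1):
        if unicodedata.category(chr(code_point)) in categories:
            if start is None:
                start = code_point
        elif start is not None:
            yield start, code_point - 1
            start = None
    if start is not None:
        yield start, sys.maxunicode


def build_class(categories):
    """Build a regex character class from ranges, using \\UXXXXXXXX escapes."""
    parts = []
    for first, last in category_ranges(categories):
        parts.append('\\U{:08x}'.format(first))
        if last > first:
            parts.append('-\\U{:08x}'.format(last))
    return '[' + ''.join(parts) + ']'


# Assumption: strip control (Cc) and format (Cf) characters only.
control_and_format_re = re.compile(build_class({'Cc', 'Cf'}))


def remove_control_and_format_chars(text):
    return control_and_format_re.sub('', text)
```

Because the pattern is built from escapes rather than raw characters, no re.escape() call is needed, and letters such as Hebrew or accented Latin characters are left untouched.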

72

As far as I know, the most Pythonic/efficient method would be:

import string

filtered_string = filter(lambda x: x in string.printable, myStr)

— William Keller

10

You probably want filtered_string = ''.join(filter(lambda x: x in string.printable, myStr)) so that you get back a string.

— Nathan Shively-Sanders

12

Sadly, string.printable does not contain Unicode characters, so ü or ó will not be in the output... maybe there is something else?

— Vinko Vrsalovic

17

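You should be using a list comprehension or a generator expression, not filter + lambda. One of these will be faster 99.9% of the time: ''.join(s for s in myStr if s in string.printable)

— habnabit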

3

@AaronGallagher: 99.9% faster? Where did you pluck that figure from? The performance comparison is nowhere near that bad.

— Chris Morgan

4

Hi William. This method seems to remove all non-ASCII characters. There are many printable non-ASCII characters in Unicode!

— dotancohen, 2012

17

You could try setting up a filter using the unicodedata.category() function:

import unicodedata
printable = {'Lu', 'Ll'}
def filter_non_printable(str):
  return ''.join(c for c in str if unicodedata.category(c) in printable)

See Table 4-9 on page 175 of the Unicode database character properties for the available categories.

— Ber

You started a list comprehension that did not end in your final line. I suggest you remove the opening bracket completely.

— tzot

Thank you for pointing this out. I edited the post accordingly.

— Ber

1

This seems to be the most direct, straightforward method. Thanks.

— dotancohen

1

@CsabaToth All three are valid and yield the same set. Yours is maybe the nicest way to specify a set literal.

— Ber

1

@AnubhavJhalani You can add more Unicode categories to the filter. To keep spaces and digits in addition to letters, use printable = {'Lu', 'Ll', 'Zs', 'Nd'}

— Ber
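
Listing every two-letter category gets long once punctuation and symbols should survive too. A minimal sketch of the same whitelist idea, keyed on the first letter of the category, is shown below. The set of major classes treated as printable and the name `keep_printable_categories` are assumptions for illustration, not part of the answer above.

```python
import unicodedata

# Assumption: letters (L), marks (M), numbers (N), punctuation (P) and
# symbols (S) count as printable, plus ordinary space separators (Zs).
PRINTABLE_MAJOR_CLASSES = {'L', 'M', 'N', 'P', 'S'}


def keep_printable_categories(text):
    """Drop every character whose category is outside the whitelist."""
    def is_printable(ch):
        category = unicodedata.category(ch)
        return category[0] in PRINTABLE_MAJOR_CLASSES or category == 'Zs'

    return ''.join(ch for ch in text if is_printable(ch))


print(keep_printable_categories('H\u00e9llo, w\u00f6rld 123\x07\u200b'))
```

Note that newlines and tabs are in category Cc, so this version removes them as well.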

10

In Python 3,

def filter_nonprintable(text):
    import itertools
    # Use characters of control category
    nonprintable = itertools.chain(range(0x00,0x20),range(0x7f,0xa0))
    # Use translate to remove all non-printable characters
    return text.translate({character:None for character in nonprintable})

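See this StackOverflow post on removing punctuation for how .translate() compares to regex and .replace().

The ranges can also be generated with nonprintable = (ord(c) for c in (chr(i) for i in range(sys.maxunicode)) if unicodedata.category(c)=='Cc'), using the Unicode character database categories as shown by @Ants Aasma.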
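
To check the translate-versus-regex comparison on your own data, something like the following sketch can be used. The function names, the sample text and the repeat count are all made up for illustration, and timings will vary by machine and Python version, so no expected numbers are given.

```python
import itertools
import re
import timeit

CONTROL_CODE_POINTS = list(itertools.chain(range(0x00, 0x20), range(0x7f, 0xa0)))

# Build both helpers once, outside the functions, so only the stripping is timed.
TRANSLATE_TABLE = dict.fromkeys(CONTROL_CODE_POINTS)  # every code point maps to None
CONTROL_RE = re.compile(r'[\x00-\x1f\x7f-\x9f]')


def strip_with_translate(text):
    return text.translate(TRANSLATE_TABLE)


def strip_with_regex(text):
    return CONTROL_RE.sub('', text)


if __name__ == '__main__':
    sample = 'Hello\x00 w\u00f6rld\x1f!\n' * 1000  # hypothetical test input
    for func in (strip_with_translate, strip_with_regex):
        seconds = timeit.timeit(lambda: func(sample), number=200)
        print('{}: {:.3f}s'.format(func.__name__, seconds))
```

Both functions remove the same 65 control characters, so any difference in the printed times reflects the mechanism rather than the character set.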

— shawnrad

It would be better to use Unicode ranges (see @Ants Aasma's answer). The result would be text.translate({c:None for c in itertools.chain(range(0x00,0x20),range(0x7f,0xa0))}).

— darkdragon

8

The following will work with Unicode input and is rather fast...

import sys

# build a table mapping all non-printable characters to None
NOPRINT_TRANS_TABLE = {
    i: None for i in range(0, sys.maxunicode + 1) if not chr(i).isprintable()
}

def make_printable(s):
    """Replace non-printable characters in a string."""

    # the translate method on str removes characters
    # that map to None from the string
    return s.translate(NOPRINT_TRANS_TABLE)


assert make_printable('Café') == 'Café'
assert make_printable('\x00\x11Hello') == 'Hello'
assert make_printable('') == ''

My own testing suggests this approach is faster than functions that iterate over the string and return a result using str.join.

— Chris

This is the only answer that works for me with Unicode characters. Awesome that you provided test cases!

— pir

1

If you want to allow line breaks, add LINE_BREAK_CHARACTERS = set(["\n", "\r"]) and append the condition and not chr(i) in LINE_BREAK_CHARACTERS when building the table.

— pir
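
Put together, the line-break variant might look like the sketch below. `KEPT_CHARACTERS`, `NOPRINT_KEEP_BREAKS_TABLE` and `make_printable_keep_breaks` are hypothetical names, and keeping the tab character as well is an extra assumption that goes beyond the comment above.

```python
import sys

# Assumption: line breaks and tabs should survive even though
# str.isprintable() reports them as non-printable.
KEPT_CHARACTERS = {'\n', '\r', '\t'}

# Map every non-printable code point, except the kept ones, to None.
NOPRINT_KEEP_BREAKS_TABLE = {
    i: None
    for i in range(sys.maxunicode + 1)
    if not chr(i).isprintable() and chr(i) not in KEPT_CHARACTERS
}


def make_printable_keep_breaks(s):
    """Remove non-printable characters but keep line breaks and tabs."""
    return s.translate(NOPRINT_KEEP_BREAKS_TABLE)
```

One thing to be aware of: str.isprintable() treats every separator other than the ASCII space as non-printable, so a no-break space (U+00A0) is removed by this table too.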

5

This function uses a generator expression and str.join, so it runs in linear time instead of O(n^2):

from curses.ascii import isprint

def printable(input):
    return ''.join(char for char in input if isprint(char))

— Kirk Strauser

2

filter(isprint,input)

— yingted

5

Yet another option in Python 3:

re.sub(f'[^{re.escape(string.printable)}]', '', my_string)

— c6401

This worked really well for me, and it's one line. Thanks.

— Chop Labalagun

1

For some reason this works great on Windows but I can't use it on Linux. I had to change the f to an r, but I'm not sure that is the right solution.

— Chop Labalagun

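Sounds like your Linux Python was too old to support f-strings. r-strings are quite different, though you could write r'[^' + re.escape(string.printable) + r']'. (I don't think re.escape() is entirely correct here, but if ...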

— 可行的
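
For interpreters older than Python 3.6, where f-strings are a syntax error, the same pattern can be assembled by plain concatenation and compiled once. This is a sketch only: `NOT_ASCII_PRINTABLE_RE` and `strip_non_ascii_printable` are hypothetical names.

```python
import re
import string

# Build the negated class by concatenation so no f-string is needed.
NOT_ASCII_PRINTABLE_RE = re.compile('[^' + re.escape(string.printable) + ']')


def strip_non_ascii_printable(text):
    """Remove everything that is not in string.printable (ASCII only)."""
    return NOT_ASCII_PRINTABLE_RE.sub('', text)


print(strip_non_ascii_printable('Caf\u00e9\x00 ok'))
```

As the question points out, string.printable is ASCII-only, so accented and other non-ASCII characters are stripped along with the control characters.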

2

The best I've come up with so far (thanks to the python-izers above) is

def filter_non_printable(str):
  return ''.join([c for c in str if ord(c) > 31 or ord(c) == 9])

This is the only way I've found that works with Unicode characters/strings.

Any better options?

— Vinko Vrsalovic

1

Unless you're on Python 2.3, the inner []s are redundant: "return ''.join(c for c ...)"

— habnabit

Not quite redundant: they have different meanings (and performance characteristics), although the end result is the same.

— Miles

Shouldn't the other end of the range be protected too? "ord(c) <= 126"

— Gearoid Murphy, 2011

7

But some Unicode characters are non-printable too.

— 2012

2

The one below performs faster than the others above. Take a look:

''.join([x if x in string.printable else '' for x in Str])

— Nilav Baran Ghosh

"".join([c if 0x21<=ord(c) and ord(c)<=0x7e else "" for c in ss])

— evandrix

2

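In Python there are no POSIX regex classes

There are when you use the regex library: https://pypi.org/project/regex/

It is well maintained and supports Unicode regex, POSIX regex and more. Its usage (method signatures) is very similar to Python's re.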

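From the docs:

[[:alpha:]]; [[:^alpha:]]

POSIX character classes are supported. These are normally treated as an alternative form of \p{...}.

(I have no affiliation, I'm just a user.)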

— Risadinha

1

Based on @Ber's answer, I suggest removing only the control characters as defined in the Unicode character database categories:

import unicodedata
def filter_non_printable(s):
    return ''.join(c for c in s if not unicodedata.category(c).startswith('C'))

— darkdragon

This is a great answer!

— tdc

0

To remove "whitespace",

import re
t = """
\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>&nbsp;</p>\n\t<p>
"""
pat = re.compile(r'[\t\n]')
print(pat.sub("", t))

— 知识公园

Actually, you don't need the square brackets then either.

— 三人

0

Adapted from the answers by Ants Aasma and shawnrad:

nonprintable = set(map(chr, list(range(0,32)) + list(range(127,160))))
ord_dict = {ord(character):None for character in nonprintable}
def filter_nonprintable(text):
    return text.translate(ord_dict)

# usage (named `text` so the built-in str is not shadowed)
text = "this is my string"
text = filter_nonprintable(text)
print(text)

Tested on Python 3.7.7

— Joe