How do I check whether a string is Unicode or ASCII?

271

What do I have to do in Python to figure out which encoding a string has?

— TIMEX

56

Unicode is not an encoding.

— ulidtko, 2011

More importantly, why do you care?

— Johnsyweb

@Johnsyweb Because of {UnicodeDecodeError} 'ascii' codec can't decode byte 0xc2

— alex

295

In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.

In Python 2, a string may be of type str or of type unicode. You can tell which with code like this:

def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

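This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist purely of characters in the ASCII range, and a byte string may contain ASCII, encoded Unicode, or even non-textual data.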
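
For Python 3, the same kind of type check might look like the sketch below. The function name describe_string is made up for illustration, and it returns a label instead of printing so the result is easy to test.

```python
def describe_string(s):
    """Return a label for the Python 3 type of s (hypothetical helper)."""
    if isinstance(s, str):
        # str holds Unicode text in Python 3
        return "text string (str)"
    if isinstance(s, bytes):
        # bytes holds raw bytes; Python does not know their encoding
        return "byte string (bytes)"
    return "not a string"


print(describe_string("abc"))
print(describe_string(b"abc"))
print(describe_string(42))
```

As with the Python 2 version, this only reports the Python type. It says nothing about whether the content is limited to the ASCII range.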

— Greg Hewgill

3

@ProsperousHeart: You're probably using Python 3.

— Greg Hewgill

124

How to tell if an object is a unicode string or a byte string

You can use type or isinstance.

In Python 2:

>>> type(u'abc')  # Python 2 unicode string literal
<type 'unicode'>
>>> type('abc')   # Python 2 byte string literal
<type 'str'>

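In Python 2, str is just a sequence of bytes. Python doesn't know what its encoding is. The unicode type is the safer way to store text. If you want to understand this better, I recommend http://farmdev.com/talks/unicode/.

In Python 3: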

>>> type('abc')   # Python 3 unicode string literal
<class 'str'>
>>> type(b'abc')  # Python 3 byte string literal
<class 'bytes'>

In Python 3, str is like Python 2's unicode and is used to store text. What was called str in Python 2 is called bytes in Python 3.
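
To make the relationship between the two Python 3 types concrete, here is a small sketch of a round trip between str and bytes. It assumes UTF-8 as the encoding; the variable names are only illustrative.

```python
text = "Ü"                      # a str: one Unicode code point, U+00DC
data = text.encode("utf-8")     # a bytes object: the UTF-8 encoding of that text

print(type(text).__name__, type(data).__name__)

# Decoding with the same encoding gives back the original text.
assert data == b"\xc3\x9c"
assert data.decode("utf-8") == text
```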

How to tell if a byte string is valid utf-8 or ascii

You can call decode. If it raises a UnicodeDecodeError exception, it wasn't valid.

>>> u_umlaut = b'\xc3\x9c'   # UTF-8 representation of the letter 'Ü'
>>> u_umlaut.decode('utf-8')
u'\xdc'
>>> u_umlaut.decode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
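
If you want a yes/no answer instead of a traceback, the decode attempt can be wrapped in a small helper. This is a sketch for Python 3; the name decodes_as is made up for illustration.

```python
def decodes_as(data, encoding):
    """Return True if the bytes in data can be decoded with the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


u_umlaut = b"\xc3\x9c"  # UTF-8 representation of the letter 'Ü'
print(decodes_as(u_umlaut, "utf-8"))
print(decodes_as(u_umlaut, "ascii"))
```

A True result only means the bytes are valid in that encoding, not that they were actually produced with it.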

— Mikel

Just for other people's reference: str.decode doesn't exist in Python 3. Looks like you have to use unicode(s, "ascii") or something.

— Shadow

3

Sorry, I meant str(s, "ascii")

— Shadow

1

This is not correct for Python 3.

— ProsperousHeart

2

@ProsperousHeart Updated to cover Python 3, and to try to explain the difference between byte strings and unicode strings.

— Mikel

44

In Python 3.x, all strings are sequences of Unicode characters, so an isinstance check against str (which means a unicode string by default) is enough.

isinstance(x, str)

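For Python 2.x, most people seem to use an if statement with two checks: one for str and one for unicode.

If you want to check whether you have a "string-like" object with a single statement, you can do the following: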

isinstance(x, basestring)
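
If the same check has to run on both Python 2 and Python 3, one possible sketch is to fall back to str when basestring does not exist. The names string_types and is_string_like are chosen here for illustration.

```python
try:
    string_types = (basestring,)  # Python 2: covers both str and unicode
except NameError:
    string_types = (str,)         # Python 3: basestring is gone, str is text


def is_string_like(x):
    """Return True if x is a string type on the running Python version."""
    return isinstance(x, string_types)


print(is_string_like("abc"))
print(is_string_like(42))
```

Note that on Python 3 a bytes object is not string-like under this check, whereas on Python 2 a byte string is a str and therefore is.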

— ThinkBonobo

This is wrong. In Python 2.7, isinstance(u"x", basestring) returns True.

— PythonNut, 2014

11

@PythonNut: I believe that was the point. Using isinstance(x, basestring) is enough to replace the two separate tests above.

— KQ.

5

It's useful in many cases, but evidently not what the questioner meant.

— mhsmith

3

This is the answer to the question. All the others misunderstood what the OP said and gave generic answers about type checking in Python.

— fiatjaf, 2015

1

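Doesn't answer the OP's question. The title of the question (alone) could be read in a way that makes this answer correct. However, the OP specifically says "figure out which" in the question's description, and this answer does not address that.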

— MD004

31

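Unicode is not an encoding. To quote Kumar McMillan:

If ASCII, UTF-8, and other byte strings are "text" ...

...then Unicode is "text-ness";

it is the abstract form of text

Have a read of McMillan's Unicode In Python, Completely Demystified talk from PyCon 2008. It explains things a lot better than most of the related answers on Stack Overflow.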

— Alex Dean

Those slides are probably the best introduction to Unicode I've come across to date.

— Jonny

23

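If your code needs to be compatible with both Python 2 and Python 3, you can't directly use things like isinstance(s, bytes) or isinstance(s, unicode) without wrapping them in either try/except or a Python version test, because unicode is undefined in Python 3 and bytes is undefined in Python 2 before 2.6 (from 2.6 on it is only an alias for str).

There are some ugly workarounds. An extremely ugly one is to compare the name of the type instead of comparing the type itself. Here's an example: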

# convert bytes (python 3) or unicode (python 2) to str
if str(type(s)) == "<class 'bytes'>":
    # only possible in Python 3
    s = s.decode('ascii')  # or  s = str(s)[2:-1]
elif str(type(s)) == "<type 'unicode'>":
    # only possible in Python 2
    s = str(s)

An arguably slightly less ugly workaround is to check the Python version number, e.g.:

import sys

if sys.version_info >= (3,0,0):
    # for Python 3
    if isinstance(s, bytes):
        s = s.decode('ascii')  # or  s = str(s)[2:-1]
else:
    # for Python 2
    if isinstance(s, unicode):
        s = str(s)

Neither of these is Pythonic, and most of the time there's probably a better way.
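
The try/except wrapping mentioned at the start of this answer could be sketched roughly as follows. This is only an illustration: the names unicode_type, PY2 and to_native_str are made up here, and ASCII is assumed as the default encoding to mirror the snippets above.

```python
try:
    unicode_type = unicode  # only defined on Python 2
    PY2 = True
except NameError:
    unicode_type = str      # on Python 3, str is the Unicode text type
    PY2 = False


def to_native_str(s, encoding="ascii"):
    """Convert s to the interpreter's native str type (hypothetical helper)."""
    if PY2:
        # Python 2: native str is a byte string, so encode unicode objects
        if isinstance(s, unicode_type):
            return s.encode(encoding)
        return s
    # Python 3: native str is text, so decode bytes objects
    if isinstance(s, bytes):
        return s.decode(encoding)
    return s


print(to_native_str(b"abc"))
print(to_native_str("abc"))
```

Both branches raise a UnicodeError subclass if the data does not fit the chosen encoding, so callers dealing with non-ASCII input would need to pass a different encoding or handle the exception.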

— Dave Burton

6

A better way is probably to use six, and test against six.binary_type and six.text_type.

— Ian Clelland

1

You can use type(s).__name__ to probe type names.

— Paulo Freitas

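I'm not quite sure about the use case for that bit of code, unless there is a logic error. I think there should be a "not" in the Python 2 code. Otherwise you are converting everything to unicode strings on Python 3 and doing the opposite on Python 2!

— oligofren, 2014

Yes, oligofren, that's what it does. The standard internal strings are Unicode in Python 3 and ASCII in Python 2. So the code snippets convert text to the standard internal string type (be it Unicode or ASCII).

— Dave Burton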

12

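Use:

import six
if isinstance(obj, six.text_type):
    print("obj is a text (unicode) string")

Inside the six library, the string types are defined along these lines (this excerpt shows string_types; text_type is defined next to it as str on Python 3 and unicode on Python 2):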

if PY3:
    string_types = str,
else:
    string_types = basestring,

— Madjardi

2

It should be if isinstance(obj, six.text_type). But yes, this is the correct answer.

— karantan

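Doesn't answer the OP's question. The title of the question (alone) could be read in a way that makes this answer correct. However, the OP specifically says "figure out which" in the question's description, and this answer does not address that.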

— MD004

4

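Note that on Python 3, it's not really fair to say any of:

strs are UTFx for any x (e.g. UTF8)
strs are Unicode
strs are ordered collections of Unicode characters

Python's str type is (normally) a sequence of Unicode code points, some of which map to characters.

Even on Python 3, answering this question is not as simple as you might imagine.

An obvious way to test for ASCII-compatible strings is to attempt an encode: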

"Hello there!".encode("ascii")
#>>> b'Hello there!'

"Hello there... ☃!".encode("ascii")
#>>> Traceback (most recent call last):
#>>>   File "", line 4, in <module>
#>>> UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 15: ordinal not in range(128)

The error distinguishes the cases.

In Python 3, there are even some strings that contain invalid Unicode code points:

"Hello there!".encode("utf8")
#>>> b'Hello there!'

"\udcc3".encode("utf8")
#>>> Traceback (most recent call last):
#>>>   File "", line 19, in <module>
#>>> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 0: surrogates not allowed

The same method is used to distinguish them.
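
The encode attempts above can be folded into a single helper that returns a boolean. This is a sketch for Python 3; the name can_encode is made up for illustration.

```python
def can_encode(text, encoding):
    """Return True if the str in text can be encoded with the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(can_encode("Hello there!", "ascii"))
print(can_encode("Hello there... \u2603!", "ascii"))
print(can_encode("\udcc3", "utf8"))
```

On Python 3.7 and later, str.isascii() gives the ASCII answer directly without going through an exception.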

— Veedrac

3

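This may help someone else. I started out testing the string type of the variable s, but for my application it made more sense to simply return s as utf-8. The process calling return_utf then knows what it is dealing with and can handle the string appropriately. The code is not pristine, but I intend it to be Python version agnostic, without a version test or importing six. Please comment with improvements to the sample code below to help other people.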

def return_utf(s):
    if isinstance(s, str):
        return s.encode('utf-8')
    if isinstance(s, (int, float, complex)):
        return str(s).encode('utf-8')
    try:
        return s.encode('utf-8')
    except TypeError:
        try:
            return str(s).encode('utf-8')
        except AttributeError:
            return s
    except AttributeError:
        return s
    return s # assume it was already utf-8
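
If only Python 3 needs to be supported, the idea can be written more compactly. The sketch below is one possible simplification, not a drop-in replacement: the name to_utf8_bytes is made up, and unlike return_utf it converts every non-string object through str() instead of returning some of them unchanged.

```python
def to_utf8_bytes(value):
    """Return value as UTF-8 encoded bytes (Python 3 only, hypothetical helper)."""
    if isinstance(value, bytes):
        # Assumption: bytes passed in are already UTF-8, so leave them alone
        return value
    if isinstance(value, str):
        return value.encode("utf-8")
    # Numbers and other objects: use their text representation
    return str(value).encode("utf-8")


print(to_utf8_bytes("\u00dc"))
print(to_utf8_bytes(42))
```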

— fl

You, my friend, deserve to be the accepted answer! I'm using Python 3 and I was still having problems until I found this treasure!

— mnsr

2

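You can use Universal Encoding Detector, but be aware that it will only give you its best guess, not the actual encoding, because it is impossible to know the encoding of a string such as "abc", for example. You will need to get the encoding information from somewhere else; the HTTP protocol, for example, uses the Content-Type header for that.

— Seb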
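
As an illustration of getting the encoding from somewhere else, here is a sketch that reads the charset parameter out of a Content-Type header value using the standard library's email.message.Message. The function name charset_from_content_type and the sample header values are assumptions made for this example.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset named in a Content-Type header value, or default."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset lowercased, or the fallback
    return msg.get_content_charset(default)


body = b"\xc3\x9c"  # bytes as they might arrive in a response body
charset = charset_from_content_type("text/plain; charset=UTF-8")
print(charset)
print(body.decode(charset))
```

When the header carries no charset parameter the function returns the default, and the caller still has to decide what to assume.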

0

For py2/py3 compatibility simply use

import six
if isinstance(obj, six.text_type):
    print("obj is a text (unicode) string")

— Vishvajit Pathak

0

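A simple approach is to check whether unicode is a builtin function. If it is, you're on Python 2 and your string will be a byte string. To make sure everything is unicode, you can do:

try:
    import builtins                 # Python 3
except ImportError:
    import __builtin__ as builtins  # Python 2 name of the same module

i = 'cats'
if 'unicode' in dir(builtins):     # True in python 2, False in 3
  i = unicode(i)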

— duhaime