Python Unicode Encode Error

104

I am reading and parsing an Amazon XML file, and while the XML file shows a ', I get the following error when I try to print it:

'ascii' codec can't encode character u'\u2019' in position 16: ordinal not in range(128)

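From what I have read online so far, the error comes from the XML file being in UTF-8 while Python wants to handle it as ASCII-encoded characters. Is there a simple way to make the error go away and have my program print the XML as it reads it?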

— Alex B

I was just coming here to post this question. Is there an easy way to sanitize a string for unicode()?

— Nick Heiner

Please also check the answer to this related question: "Python UnicodeDecodeError - Am I misunderstanding encode?"

— tzot

193

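Most likely, you parsed the file fine and are now trying to print the contents of the XML, which fails because it contains some non-ASCII Unicode characters. Try encoding your unicode string as ascii first: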

unicodeData.encode('ascii', 'ignore')

The 'ignore' part tells it to just skip those characters. From the Python docs:

>>> u = unichr(40960) + u'abcd' + unichr(1972)
>>> u.encode('utf-8')
'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
'abcd'
>>> u.encode('ascii', 'replace')
'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
'&#40960;abcd&#1972;'
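The session above is Python 2. As a minimal sketch, assuming Python 3 (where `chr` replaces `unichr` and `encode` returns `bytes`), the same three error handlers can be tried like this. The variable names are illustrative only.

```python
# Python 3 sketch of the error handlers from the session above.
sample = chr(40960) + "abcd" + chr(1972)

# 'ignore' silently drops characters that ASCII cannot represent.
ignored = sample.encode("ascii", "ignore")

# 'replace' substitutes a question mark for each such character.
replaced = sample.encode("ascii", "replace")

# 'xmlcharrefreplace' writes numeric XML character references instead.
xml_refs = sample.encode("ascii", "xmlcharrefreplace")

print(ignored)
print(replaced)
print(xml_refs)
```

Of the three, only `xmlcharrefreplace` keeps enough information to recover the original characters later.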

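You might want to read this article: http://www.joelonsoftware.com/articles/Unicode.html, which I found very useful as a basic tutorial on what is going on. After reading it, you will stop feeling like you are just guessing which commands to use (or at least that is what happened to me).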

— Scott Stafford

1

I am trying to make the following string safe: 'foo “bar bar” df' (note the curly quotes), but the above still fails for me.

— Nick Heiner

@Rosarch: Fails how? The same error? And which error-handling rule did you use?

— Scott Stafford

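@Rosarch, your problem is probably earlier. Try this code: # -*- coding: latin-1 -*- u = u'foo “bar bar” df' print u.encode('ascii', 'ignore') In your case, the error was probably raised while converting your string INTO unicode, using the encoding you declared for the Python script.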

— Scott Stafford

I went ahead and turned my issue into its own question: stackoverflow.com/questions/3224427/...

— Nick Heiner

1

.encode('ascii', 'ignore') loses data unnecessarily, even though the OP's environment may support non-ascii characters (as most do)

— jfs, 2015

16

A better solution:

if type(value) == str:
    # Ignore errors even if the string is not proper UTF-8 or has
    # broken marker bytes.
    # Python built-in function unicode() can do this.
    value = unicode(value, "utf-8", errors="ignore")
else:
    # Assume the value object has proper __unicode__() method
    value = unicode(value)

If you want to read more about why:

http://docs.plone.org/manage/troubleshooting/unicode.html#id1
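The snippet in this answer relies on the Python 2 `unicode()` built-in. A rough Python 3 counterpart might look like the following sketch, under the assumption that incoming values are either `bytes` or objects with a sensible `str()` form. The helper name `to_text` is hypothetical and not part of the original answer.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, dropping bytes that are not valid in encoding."""
    if isinstance(value, bytes):
        # errors="ignore" skips malformed byte sequences instead of raising.
        return value.decode(encoding, errors="ignore")
    # Assume any other object has a meaningful string representation.
    return str(value)


clean = to_text(b"caf\xc3\xa9\xff")  # the trailing \xff is not valid UTF-8
number = to_text(42)
print(ascii(clean), number)
```

As with the original, `errors="ignore"` hides corrupt input rather than reporting it, so it is only appropriate when losing a few bytes is acceptable.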

— Paxwell

3

It does not help with the OP's issue: "can't encode character u'\u2019'". u'\u2019' is already Unicode.

— jfs, 2015

6

Do not hardcode the character encoding of your environment inside your script. Print Unicode text directly instead:

assert isinstance(text, unicode) # or str on Python 3
print(text)

If your output is redirected to a file (or a pipe), you can use the PYTHONIOENCODING envvar to specify the character encoding:

$ PYTHONIOENCODING=utf-8 python your_script.py >output.utf8

Otherwise, python your_script.py should work as is: your locale settings are used to encode the text (on POSIX, check the LC_ALL, LC_CTYPE and LANG envvars, and set LANG to a UTF-8 locale if necessary).
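To see what a given output encoding would do to a piece of text before printing it, something like the following sketch may help. It assumes Python 3, and the helper name `printable` is hypothetical. Rather than dropping characters the target encoding cannot represent, it shows them as backslash escapes.

```python
import sys


def printable(text, encoding=None):
    """Return text with characters the encoding cannot represent escaped."""
    # Fall back to ASCII when stdout reports no encoding at all.
    encoding = encoding or sys.stdout.encoding or "ascii"
    # Round-trip through the target encoding; backslashreplace keeps
    # unencodable characters visible as escapes such as \u2019.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


as_ascii = printable("Svoboda\u2019s", "ascii")
as_utf8 = printable("Svoboda\u2019s", "utf-8")
print(as_ascii)
```

Calling `printable(text)` without an explicit encoding uses whatever `sys.stdout.encoding` reports, which is the value that PYTHONIOENCODING and the locale settings influence.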

To print Unicode on Windows, see this answer, which shows how to print Unicode to the Windows console, to a file, or using IDLE.

— f

1

Excellent post: http://www.carlosble.com/2010/12/understanding-python-and-unicode/

# -*- coding: utf-8 -*-

def __if_number_get_string(number):
    converted_str = number
    if isinstance(number, int) or \
            isinstance(number, float):
        converted_str = str(number)
    return converted_str


def get_unicode(strOrUnicode, encoding='utf-8'):
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode
    return unicode(strOrUnicode, encoding, errors='ignore')


def get_string(strOrUnicode, encoding='utf-8'):
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode.encode(encoding)
    return strOrUnicode

— Ranvijay Sachan

0

You could use something of the form

s.decode('utf-8')

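It converts a UTF-8-encoded byte string to a Python Unicode string. The exact procedure depends on how you load and parse the XML file, though. For example, if you never access the XML string directly, you might have to use a decoder object from the codecs module.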
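As an illustration of a decoder object from the codecs module, here is a minimal Python 3 sketch, assuming the data arrives in byte chunks that may split a multi-byte character. The helper name `decode_chunks` and the sample chunks are made up for the example.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks into one text string."""
    decoder = codecs.getincrementaldecoder(encoding)()
    # The decoder buffers an incomplete multi-byte sequence until the
    # bytes that finish it arrive in a later chunk.
    pieces = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises if the input ended in the middle of a character.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


# U+2019 is b"\xe2\x80\x99" in UTF-8; it is split across two chunks here.
chunks = [b"Svoboda\xe2\x80", b"\x99s text"]
decoded = decode_chunks(chunks)
print(ascii(decoded))
```

Decoding each chunk separately with `bytes.decode` would fail on this input, because neither chunk is valid UTF-8 on its own.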

— David Z

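It is already encoded in UTF-8. The error specifically is: myStrings = deque([u'Dorf and Svoboda\u2019s text builds on the str ... and Computer Engineering\u2019s subdisciplines.']) The string is in UTF-8, as you can see, but it gets mad at the internal '\u2019'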

— Alex B, 2010

Oh, OK, I thought you were having a different problem.

— David Z

7

@Alex B: No, the string is Unicode, not UTF-8. To encode it as UTF-8 use '...'.encode('utf-8')
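The difference between Unicode text and its UTF-8 encoding can be made concrete with a short sketch, assuming Python 3, where text is `str` and encoded data is `bytes`. The sample string is illustrative.

```python
# Unicode text: a sequence of code points, with no byte encoding implied.
text = "Svoboda\u2019s"

# UTF-8 data: the same text serialized to bytes. U+2019 takes three bytes.
data = text.encode("utf-8")

print(len(text), len(data))
print(data)

# Decoding with the same codec restores the original text.
round_trip = data.decode("utf-8")
```

A `u'...'` literal in Python 2 corresponds to `text` here, which is why the encode error can occur even though the XML file itself was stored as UTF-8.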

— sth

0

I wrote the following to deal with the annoying non-ascii quotes and force conversion to something usable.

# Map of non-ASCII quote characters to ASCII stand-ins; extend as needed.
unicodeToAsciiMap = {u'\u2019': "'", u'\u2018': "`"}

def unicodeToAscii(inStr):
    # Fast path: the whole string is already ASCII-safe.
    try:
        return str(inStr)
    except UnicodeError:
        pass
    outStr = ""
    for i in inStr:
        try:
            outStr = outStr + str(i)
        except UnicodeError:
            if i in unicodeToAsciiMap:
                outStr = outStr + unicodeToAsciiMap[i]
            else:
                # Report the unmapped character, then substitute "_".
                try:
                    print "unicodeToAscii: add to map:", i, repr(i), "(encoded as _)"
                except UnicodeError:
                    print "unicodeToAscii: unknown code (encoded as _)", repr(i)
                outStr = outStr + "_"
    return outStr
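The same mapping idea can be written more compactly on Python 3. This is a sketch, not the answer author's code: the names `QUOTE_MAP` and `to_ascii` are hypothetical, and the table is an assumption that also covers curly double quotes and maps the left single quote to a plain apostrophe rather than a backtick.

```python
# Translation table for common curly quotes (assumed mapping, extend freely).
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def to_ascii(text, placeholder="_"):
    """Replace known quotes, then mask any remaining non-ASCII character."""
    text = text.translate(QUOTE_MAP)
    return "".join(ch if ord(ch) < 128 else placeholder for ch in text)


result = to_ascii("Svoboda\u2019s \u201ctext\u201d \u00e9")
print(result)
```

`str.translate` applies the whole table in a single pass, so no per-character exception handling is needed.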

— username

0

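If you need to print an approximate representation of the string to the screen, rather than ignoring the non-printable characters, try the unidecode package here: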

https://pypi.python.org/pypi/Unidecode

The explanation can be found here:

https://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/

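This is better than using u.encode('ascii', 'ignore') on a given string u, and it can save you unnecessary headaches when exact characters are not what you are after but you still want the output to be human-readable.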

Wirawan

— Wirawan Purwanto

-1

Try adding the following line at the top of your Python script.

# _*_ coding:utf-8 _*_

— Abnvanand

-1

Python 3.5, 2018

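If you do not know what the encoding is but the unicode parser is having issues, you can open the file in Notepad++ and select Encoding->Convert to ANSI in the top bar. Then you can write your Python like this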

with open('filepath', 'r', encoding='ANSI') as file:
    for word in file.read().split():
        print(word)
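If converting the file by hand is not an option, one alternative is to try a short list of candidate encodings in code. The following is a sketch only, assuming Python 3. The helper name `read_text` and the default candidates (UTF-8, then Windows-1252) are assumptions, not something the answer prescribes.

```python
import os
import tempfile


def read_text(path, encodings=("utf-8", "cp1252")):
    """Read a file as text, trying each candidate encoding in order."""
    with open(path, "rb") as handle:
        raw = handle.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing matched: decode anyway and mark the undecodable bytes.
    return raw.decode(encodings[0], errors="replace")


# Demo: byte 0x92 is a right single quote in Windows-1252, invalid in UTF-8.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Svoboda\x92s text")
demo_text = read_text(tmp.name)
os.remove(tmp.name)
print(ascii(demo_text))
```

The order of candidates matters: some single-byte encodings accept every byte sequence, so a strict encoding such as UTF-8 should be tried first.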

— Atomar94