Convert Unicode to ASCII without errors in Python

177

My code just scrapes a web page, then converts it to Unicode.

html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)

But I get a UnicodeDecodeError:

Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
    handler.get(*groups)
  File "/Users/greg/clounce/main.py", line 55, in get
    html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

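I assume that means the HTML contains some wrongly-formed attempt at Unicode somewhere. Can I just drop whatever code bytes are causing the problem instead of getting an error?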

— 镜子

2

If important characters get dropped, I'd consider that an error! (Also, where is the question?)

— Arafangion 2010

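It looks like you may have run into a "no-break space" in the web page? In UTF-8 it needs to be preceded by a c2 byte, otherwise you may get a decode error: hexutf8.com/?q=C2A0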

— jar

105

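2018 update:

As of February 2018, compression such as gzip has become quite popular (around 73% of all websites use it, including large sites like Google, YouTube, Yahoo, Wikipedia, Reddit, Stack Overflow and the Stack Exchange Network sites).
If you do a simple decode like in the original answer on a gzipped response, you'll get an error like this:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: unexpected code byte

To decode a gzipped response you need to add the following modules (in Python 3):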

import gzip
import io
from urllib.request import urlopen  # needed for the urlopen() call below

Note: In Python 2 you'd use StringIO instead of io

Then you can parse the content out like this:

response = urlopen("https://example.com/gzipped-ressource")
buffer = io.BytesIO(response.read()) # Use StringIO.StringIO(response.read()) in Python 2
gzipped_file = gzip.GzipFile(fileobj=buffer)
decoded = gzipped_file.read()
content = decoded.decode("utf-8") # Replace utf-8 with the source encoding of your requested resource

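This code reads the response and places the bytes in a buffer. The gzip module then reads the buffer through the gzip.GzipFile class. After that, the decompressed data can be read back as bytes and finally decoded into normally readable text.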
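Not every response is compressed, and handing uncompressed bytes to GzipFile raises an error. Here is a minimal Python 3 sketch that checks for the gzip magic bytes before decompressing. The helper name decode_body and its default encoding are assumptions made for illustration, not part of the answer above:

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw, encoding="utf-8"):
    """Return text from raw response bytes, gunzipping first if needed.

    `encoding` is an assumption; use the charset the server declares.
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw.decode(encoding)


plain = "caf\u00e9".encode("utf-8")
print(decode_body(plain))                 # uncompressed input passes through
print(decode_body(gzip.compress(plain)))  # compressed input is gunzipped first
```

The 0x8b byte at position 1 in the error message above is the second of those magic bytes, which is a quick way to recognise a gzipped body.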

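Original answer from 2010:

Can we get the actual value used for link?

In addition, we usually run into this problem when we try to .encode() a byte string that is already encoded. So you might try to decode it first, as in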

html = urllib.urlopen(link).read()
unicode_str = html.decode(<source encoding>)
encoded_str = unicode_str.encode("utf8")

As an example:

html = '\xa0'
encoded_str = html.encode("utf8")

fails with

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

while:

html = '\xa0'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")

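succeeds without error. Note that "windows-1252" is only something I used as an example. I got it from chardet, which had 0.5 confidence that it is right! (Then again, with a string that is 1 character long, what do you expect?) You should change it to whatever encoding applies to the byte string returned by .urlopen().read() for the content you retrieved.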
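The same decode-then-encode round trip can be sketched in Python 3, where the bytes/str split is explicit. The windows-1252 source encoding is still just an assumed example:

```python
raw = b"\xa0"                      # one byte, as read() would return it
text = raw.decode("windows-1252")  # bytes -> str, using the assumed source encoding
utf8_bytes = text.encode("utf-8")  # str -> UTF-8 bytes

print(utf8_bytes)  # b'\xc2\xa0'
```

The two-byte result shows why a lone 0xa0 is not valid UTF-8: the no-break space is encoded as c2 a0.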

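Another problem I see is that the .encode() string method returns the modified string and does not modify the source in place. So calling self.response.out.write(html) is of little use, because html is not the encoded string produced by html.encode (if that is what you were originally aiming for).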
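A small Python 3 illustration of that point (the sample string is made up):

```python
html = "caf\u00e9"

html.encode("utf-8")            # return value is thrown away, html is unchanged
unchanged = html

encoded = html.encode("utf-8")  # keep the result under its own name
print(encoded)                  # b'caf\xc3\xa9'
```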

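As Ignacio suggested, check the source web page for the actual encoding of the string returned by read(). It is either in one of the meta tags or in the Content-Type header of the response. Then use that as the parameter for .decode().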
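One way to pull the charset out of a Content-Type header value in Python 3 is the standard library's email.message.Message. This is a sketch; charset_from_content_type is a hypothetical helper name and the header strings are made-up samples:

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the lower-cased charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


charset = charset_from_content_type("text/html; charset=ISO-8859-1")
print(charset)                                 # iso-8859-1
print(b"caf\xe9".decode(charset))              # café
print(charset_from_content_type("text/html"))  # None: no charset declared
```

With urllib in Python 3, the response's headers object is itself a Message subclass, so response.headers.get_content_charset() gives the same information directly.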

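Do note, however, that you should not assume other developers are responsible enough to make sure the header and/or meta charset declarations match the actual content. (Which is a PITA, yeah, I should know, I used to be one of them.)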

— Vin-G

1

In your example, I think you meant the last line to be encoded_str = decoded_str.encode("utf8")

— Ajith Antony

1

I tried this in Python 2.7.15 and got this message: raise IOError, 'Not a gzipped file'. What am I doing wrong?

— 金贤恩

221

>>> u'aあä'.encode('ascii', 'ignore')
b'a'

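Decode the returned str using the charset from the appropriate meta tag in the response or from the Content-Type header, then encode.

The method encode(encoding, errors) accepts custom handlers for errors. The built-in handlers, besides ignore, include: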

>>> u'aあä'.encode('ascii', 'replace')
b'a??'
>>> u'aあä'.encode('ascii', 'xmlcharrefreplace')
b'a&#12354;&#228;'
>>> u'aあä'.encode('ascii', 'backslashreplace')
b'a\\u3042\\xe4'

See https://docs.python.org/3/library/stdtypes.html#str.encode
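A custom handler can be registered with codecs.register_error and then passed by name as the errors argument. The sketch below is for Python 3; the handler name "codepoint_tag" and its output format are invented for illustration:

```python
import codecs


def codepoint_tag(err):
    """Replace each unencodable character with a tag like <U+3042>."""
    if not isinstance(err, UnicodeEncodeError):
        raise err  # this sketch only handles encoding errors
    bad = err.object[err.start:err.end]
    replacement = "".join("<U+%04X>" % ord(ch) for ch in bad)
    return replacement, err.end  # resume encoding after the bad run


codecs.register_error("codepoint_tag", codepoint_tag)

tagged = "a\u3042\u00e4".encode("ascii", "codepoint_tag")
print(tagged)  # b'a<U+3042><U+00E4>'
```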

— Ignacio Vazquez-Abrams

119

As an extension to Ignacio Vazquez-Abrams' answer

>>> u'aあä'.encode('ascii', 'ignore')
'a'

It is sometimes desirable to remove the accents from characters and print the base form. This can be accomplished with

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'aあä').encode('ascii', 'ignore')
'aa'

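You may also want to translate other characters (such as punctuation) to their nearest equivalents. For instance, the RIGHT SINGLE QUOTATION MARK Unicode character does not get converted to an ASCII APOSTROPHE when encoding.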

>>> print u'\u2019'
’
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
>>> u'\u2019'.encode('ascii', 'ignore')
''
# Note we get an empty string back
>>> u'\u2019'.replace(u'\u2019', u'\'').encode('ascii', 'ignore')
"'"

There are more efficient ways to accomplish this, though. For more details see this question: Where is Python's "best ASCII for this Unicode" database?
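The two steps (mapping punctuation, then stripping accents) can be combined. A minimal Python 3 sketch follows; the function name to_ascii and the small punctuation table are assumptions chosen for illustration and are far from complete:

```python
import unicodedata

# Hypothetical, deliberately tiny table of "nearest ASCII" punctuation.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',  # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',  # RIGHT DOUBLE QUOTATION MARK
    "\u2013": "-",  # EN DASH
})


def to_ascii(text):
    """Map known punctuation, split accents off, then drop what is left."""
    text = text.translate(PUNCTUATION_MAP)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("It\u2019s a caf\u00e9"))  # It's a cafe
```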

— Peter Gibson

4

Both helpful for the problem as posed and practical for the problem that probably underlies it. This is a model answer for this kind of question.

— shanusmagnus

96

Use unidecode. It converts weird characters to ASCII instantly, and even converts Chinese to phonetic ASCII.

$ pip install unidecode

Then:

>>> from unidecode import unidecode
>>> unidecode(u'北京')
'Bei Jing'
>>> unidecode(u'Škoda')
'Skoda'

— Nemo

3

halle-freakin-lujah, it's about time I found an answer that worked for me

10

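Upvoted for the fun of it. Note that this mangles every word in languages that use accents. Škoda is not Skoda. Skoda most likely means something to do with eels and hovercraft.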

— Sylvain

1

I had been scouring the internet for days before finding this... thank you, thank you very much

— Stephen

23

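I use this helper function in all of my projects. If it can't convert the unicode, it ignores it. It ties into a django library, but with a little research you could bypass that.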

from django.utils import encoding

def convert_unicode_to_string(x):
    """
    >>> convert_unicode_to_string(u'ni\xf1era')
    'niera'
    """
    return encoding.smart_str(x, encoding='ascii', errors='ignore')

I no longer get any unicode errors after using this.
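For readers without Django, a rough stand-in can be sketched in plain Python 3. This is an assumption-laden approximation and not a reimplementation of smart_str; the function name and the UTF-8 guess for byte input are invented for illustration:

```python
def convert_unicode_to_ascii(value):
    """Return `value` as text with every non-ASCII character dropped."""
    if isinstance(value, bytes):
        # Assumption: byte input is UTF-8; undecodable bytes are skipped.
        value = value.decode("utf-8", "ignore")
    return str(value).encode("ascii", "ignore").decode("ascii")


print(convert_unicode_to_ascii("ni\u00f1era"))  # niera
```

As with the original helper, the dropped characters are gone for good, so this only suits cases where losing them is acceptable.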

— Gattster

10

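That's suppressing the problem, not diagnosing and fixing it. It's like saying "After I cut off my feet, I no longer had a problem with corns and bunions".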

— John Machin

10

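I agree that it is suppressing the problem. That seems to be what the question is after. Look at his note: "Can I just drop whatever code bytes are causing the problem instead of getting an error?"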

— Gattster 2010

3

This is exactly the same as simply calling "some-string".encode('ascii', 'ignore')

— Joshua Burns

17

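I can't tell you how tired I am of people asking a question on SO and getting all these sermonizing responses. "My car won't start." "Why do you want to drive? You should walk." Stop it!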

— shanusmagnus

8

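@JohnMachin Nobody cares. I don't care what kind of junk is used in the RSS feed; if it's not a character in ascii, it can be chopped out. Their problem. I just want python to actually ignore it and deal with it, instead of giving me errors every time even though I specify "ignore". Who died?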

— user1244215 2013

10

For broken consoles like cmd.exe, and for HTML output, you can always use:

my_unicode_string.encode('ascii','xmlcharrefreplace')

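This preserves all the non-ASCII characters while making them printable in pure ASCII and in HTML.

Warning: if you use this in production code to avoid errors, then most likely there is something wrong in your code. The only valid use cases are printing to a non-Unicode console, or easy conversion to HTML entities in an HTML context.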
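A short Python 3 sketch of why nothing is lost: the character references can be turned back into the original text with html.unescape from the standard library (the sample string is made up):

```python
import html

text = "a\u3042\u00e4"
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")
print(ascii_bytes)  # b'a&#12354;&#228;'

# An HTML parser (or html.unescape) restores the original characters.
restored = html.unescape(ascii_bytes.decode("ascii"))
print(restored == text)  # True
```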

Finally, if you are on Windows and use cmd.exe, you can type chcp 65001 to enable utf-8 output (works with the Lucida Console font). You might need to add myUnicodeString.encode('utf8').

— ccpizza

6

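You wrote """I assume that means the HTML contains some wrongly-formed attempt at unicode somewhere."""

The HTML is NOT expected to contain any kind of "attempt at unicode", well-formed or not. It must of necessity contain Unicode characters encoded in some encoding, which is usually declared up front ... look for "charset".

You appear to be assuming that the charset is UTF-8 ... on what grounds? The "\xA0" byte shown in your error message indicates that you may have a single-byte charset, e.g. cp1252.

If you can't get any sense out of the declaration at the start of the HTML, try using chardet to find out what the likely encoding is.

Why have you tagged your question with "regex"?

Update after you replaced your whole question with a non-question: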
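Looking for "charset" in the raw bytes can be sketched with a regular expression in Python 3. This is a rough illustration, not a full HTML parser; sniff_meta_charset, the pattern and the 4096-byte window are all assumptions:

```python
import re

# Matches <meta charset="..."> and <meta ... content="...; charset=...">.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(raw_html, default=None):
    """Return the charset declared in a meta tag near the top, if any."""
    match = META_CHARSET.search(raw_html[:4096])
    return match.group(1).decode("ascii").lower() if match else default


page = b'<html><head><meta charset="windows-1252"></head><body>a\xa0b</body></html>'
declared = sniff_meta_charset(page)
print(declared)  # windows-1252
decoded_page = page.decode(declared)
```

If the function returns the default, falling back to a detector such as chardet, as suggested above, is the next step.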

html = urllib.urlopen(link).read()
# html refers to a str object. To get unicode, you need to find out
# how it is encoded, and decode it.

html.encode("utf8","ignore")
# problem 1: will fail because html is a str object;
# encode works on unicode objects so Python tries to decode it using 
# 'ascii' and fails
# problem 2: even if it worked, the result will be ignored; it doesn't 
# update html in situ, it returns a function result.
# problem 3: "ignore" with UTF-n: any valid unicode object 
# should be encodable in UTF-n; error implies end of the world,
# don't try to ignore it. Don't just whack in "ignore" willy-nilly,
# put it in only with a comment explaining your very cogent reasons for doing so.
# "ignore" with most other encodings: error implies that you are mistaken
# in your choice of encoding -- same advice as for UTF-n :-)
# "ignore" with decode latin1 aka iso-8859-1: error implies end of the world.
# Irrespective of error or not, you are probably mistaken
# (needing e.g. cp1252 or even cp850 instead) ;-)
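Problem 1 can be made visible in Python 3 by performing explicitly the ASCII decode that Python 2 attempted implicitly. This is a sketch with a one-byte sample; cp1252 is only the guess mentioned above:

```python
raw = b"\xa0"

try:
    raw.decode("ascii")  # what Python 2 did behind the scenes before encoding
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Decoding with the (assumed) real encoding first makes the encode succeed.
utf8_bytes = raw.decode("cp1252").encode("utf-8")
print(utf8_bytes)  # b'\xc2\xa0'
```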

— John Machin

4

If you have a string line, you can use the string's .encode([encoding], [errors='strict']) method to convert between encoding types.

line = 'my big string'

line.encode('ascii', 'ignore')

For more information about handling ASCII and unicode in Python, this is a really useful site: https://docs.python.org/2/howto/unicode.html

— Jama22

1

This doesn't work when the string contains a non-ASCII character such as ü.

— sajid

4

I think the answer is there, but only in bits and pieces, which makes it hard to quickly fix a problem such as

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

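Let's take an example. Suppose I have a file with data in the following form (containing ascii and non-ascii characters)

1/1/17, 21:36 - Land: Welcome ï¿½ï¿½

and we want to ignore the non-ascii characters and keep only the ascii ones.

This code will do that: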

import unicodedata
fp  = open(<FILENAME>)
for line in fp:
    rline = line.strip()
    rline = unicode(rline, "utf-8")
    rline = unicodedata.normalize('NFKD', rline).encode('ascii','ignore')
    if len(rline) != 0:
        print rline

and type(rline) will give you

>>> type(rline)
<type 'str'>
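The loop above is Python 2. A comparable Python 3 sketch is shown below; it works on an in-memory list of byte lines instead of a file so it can run as-is. The function name ascii_lines, the sample bytes and the UTF-8 input encoding are assumptions:

```python
import unicodedata


def ascii_lines(raw_lines, encoding="utf-8"):
    """Yield each non-empty line with every non-ASCII character removed."""
    for raw in raw_lines:
        text = raw.decode(encoding, errors="replace")
        decomposed = unicodedata.normalize("NFKD", text)
        cleaned = decomposed.encode("ascii", "ignore").decode("ascii").strip()
        if cleaned:
            yield cleaned


sample = [b"Welcome \xef\xbf\xbd\xef\xbf\xbd\n", b"caf\xc3\xa9\n", b"\n"]
for line in ascii_lines(sample):
    print(line)
```

With a real file, opening it in binary mode (open(path, "rb")) and passing the file object in place of the list should behave the same way.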

— 索姆

This also works for (non-standardized) "extended ascii" cases

— Oliver Zendel

1

unicodestring = '\xa0'

decoded_str = unicodestring.decode("windows-1252")
encoded_str = decoded_str.encode('ascii', 'ignore')

worked for me

— 喜马拉雅编码器

-5

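It looks like you are using python 2.x. Python 2.x defaults to ascii and doesn't know about Unicode. Hence the exception.

Just paste the line below after the shebang and it will work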

# -*- coding: utf-8 -*-

— 哈伦·拉杜杜

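The coding comment is not a magic cure-all. You need to know why the error is being generated; the comment only helps when the bad characters are in your Python source itself. That doesn't appear to be the case for this question.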

— Mark Ransom