Decode HTML entities in Python string?

266

I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string

>>> print text
&pound;682m

How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?

python html html-entities

— k

3

Related: Convert XML/HTML Entities into Unicode String in Python

— jfs, 2012

521

Python 3.4+

Use html.unescape():

import html
print(html.unescape('&pound;682m'))

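FYI, html.parser.HTMLParser.unescape is deprecated and was supposed to be removed in 3.5, although it was left in by mistake. It will be removed from the language soon (it was in fact removed in Python 3.9).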
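As a small illustration of the call above, here is a sketch showing that html.unescape() treats named, decimal and hexadecimal references the same way. It assumes Python 3.4+, and the sample strings are made up for this example.

```python
import html

# Made-up sample inputs: the same pound sign written three ways,
# plus a numeric reference that lies outside Latin-1.
samples = ["&pound;682m", "&#163;682m", "&#xA3;682m", "it&#8217;s"]

# One call per string; no entity matching of our own is needed.
decoded = [html.unescape(s) for s in samples]

for raw, text in zip(samples, decoded):
    print(raw, "->", text)
```

The first three inputs should all decode to the same string, which is a quick way to check that numeric references are being handled in a given environment.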

Python 2.6-3.3

You can use HTMLParser.unescape() from the standard library:

For Python 2.6-2.7 it's in HTMLParser
For Python 3 it's in html.parser

>>> try:
...     # Python 2.6-2.7 
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
... 
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
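The version split described in this answer could also be folded into a single import-time fallback, without six. This is only a sketch: the module-level name `unescape` is chosen here for illustration, and it prefers html.unescape() where available because HTMLParser.unescape() no longer exists on recent Python 3 releases.

```python
try:
    # Python 3.4+
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    # Bind the old method to the same name so callers do not care
    # which branch was taken.
    unescape = HTMLParser().unescape

print(unescape("&pound;682m"))
```

Callers then use `unescape(text)` everywhere, and the version check lives in one place.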

— Luc

9

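This method doesn't seem to unescape characters like "&#8217;" when running on Google App Engine, though it works locally on python2.6. It does at least still decode named entities (like &quot;).

— gfxmonk, 2010

How can an undocumented API be deprecated? Edited the answer.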

— Markus Unterwaditzer

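@MarkusUnterwaditzer there's no reason an undocumented method can't be deprecated. This one throws deprecation warnings - see my edit to the answer.

— Mark Amery, 2015

It would seem more logical that, rather than just the unescape method, the entire HTMLParser module would be deprecated in favor of html.parser.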

— Tom Russell

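Worth noting for Python 2: special characters are replaced with their Latin-1 (ISO-8859-1) encoded counterparts. For example, it may be necessary to call h.unescape(s).encode("utf-8"). Docs: "The definition provided here contains all the entities defined by XHTML 1.0 that can be handled using simple textual substitution in the Latin-1 character set (ISO-8859-1)"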

— Anonymous co
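To make the encoding step in the comment above concrete, here is a sketch written for Python 3 with html.unescape(), not the Python 2 HTMLParser the comment refers to. The point it illustrates is that unescaping yields a text string, and the byte encoding is a separate, explicit choice.

```python
import html

# Unescaping produces a text (Unicode) string, not bytes.
text = html.unescape("&pound;682m")

# The byte representation depends on the codec chosen afterwards.
utf8_bytes = text.encode("utf-8")      # two bytes for the pound sign
latin1_bytes = text.encode("latin-1")  # one byte for the pound sign

print(repr(text), utf8_bytes, latin1_bytes)
```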

65

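Beautiful Soup handles entity conversion. In Beautiful Soup 3, you'll need to specify the convertEntities argument to the BeautifulSoup constructor (see the "Entity Conversion" section of the archived docs). In Beautiful Soup 4, entities get decoded automatically.

Beautiful Soup 3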

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>", 
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>£682m</p>

Beautiful Soup 4

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>")
<html><body><p>£682m</p></body></html>

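— Ben James

+1. No idea how I missed this in the docs: thanks for the info. I'm going to accept luc's answer though, because his uses the standard library, which I specified in the question (not that it matters to me), and it's probably of more general use to other people.

— jkp, 2010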

5

BeautifulSoup4 mostly uses HTMLParser. See the source.

— scharfmn

4

How can we get the conversion in Beautiful Soup 4 without all the extraneous HTML that wasn't part of the original string? (i.e. <html> and <body>)

— Praxiteles

@Praxiteles: BeautifulSoup('&pound;682m', "html.parser") stackoverflow.com/a/14822344/4376342

— Soitje
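If only the decoded text is wanted, without any wrapper tags, the same result can be sketched with the standard library's html.parser alone. Note that this swaps in a different technique from the Beautiful Soup call in the comment above; the class name `TextExtractor` and the helper `html_to_text` are made up for this example, and Python 3.4+ is assumed.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content; tags are skipped, entities are decoded."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode character
        # references before passing text to handle_data().
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(html_to_text("<p>&pound;682m</p>"))
```

This is far less forgiving than Beautiful Soup on messy markup, so it is best seen as a fallback for simple fragments.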

13

You can use replace_entities from the w3lib.html library

In [202]: from w3lib.html import replace_entities

In [203]: replace_entities("&pound;682m")
Out[203]: u'\xa3682m'

In [204]: print replace_entities("&pound;682m")
£682m

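— Corvax

2

Beautiful Soup 4 allows you to set a formatter for your output

If you pass in formatter=None, Beautiful Soup will not modify strings at all on output. This is the fastest option, but it may lead to Beautiful Soup generating invalid HTML/XML, as in these examples: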

print(soup.prettify(formatter=None))
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>

link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>

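— LoicUV

This doesn't answer the question. (Also, I have no idea what the docs are saying is invalid about the final bit of HTML here.)

— Mark Amery, 2015

<<Sacré bleu!>> is the invalid part, as it contains unescaped < and > and will break the HTML around it. I know this is a late post from me, but in case anyone happens to find this and wonders...

— GMasucci, 2016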
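The point about unescaped characters can be illustrated with the standard library's html.escape(), which performs the escaping that formatter=None skips. This sketch uses the two strings from the answer above and assumes Python 3.4+; it does not involve Beautiful Soup itself.

```python
import html

# Text containing raw angle brackets, as in the first example above.
raw_text = "Il a dit <<Sacré bleu!>>"
escaped_text = html.escape(raw_text)
print(escaped_text)

# An attribute value containing a raw ampersand, as in the second example.
href = "http://example.com/?foo=val1&bar=val2"
escaped_href = html.escape(href)
print(escaped_href)

# Escaping is reversible: unescaping gives back the original strings.
round_trip_ok = (html.unescape(escaped_text) == raw_text
                 and html.unescape(escaped_href) == href)
print(round_trip_ok)
```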

0

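I had a similar encoding issue. I used the normalize() method. I was getting a Unicode error from the pandas .to_html() method when exporting my data frame to an .html file in another directory. I ended up doing this, and it worked...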

    import unicodedata
    import pandas as pd

The data frame object can be whatever you like; let's call it table...

    table = pd.DataFrame(data,columns=['Name','Team','OVR / POT'])
    table.index+= 1

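Encode the table data so that we can export it to our .html file in the templates folder (this can be whatever location you wish :))

    # this is where the magic happens: NFKD splits accented characters into
    # a base letter plus combining marks, and encoding to ASCII with 'ignore'
    # then drops the marks along with any other non-ASCII characters
    html_data = unicodedata.normalize('NFKD', table.to_html()).encode('ascii', 'ignore')

Export the normalized string to the html file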

    # html_data is a byte string after .encode(), so open the file in binary mode
    with open("templates/home.html", "wb") as f:
        f.write(html_data)

Reference: unicodedata documentation

— Alex

-4

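This probably isn't relevant here. But to eliminate these HTML entities from an entire document, you can do something like this: (assume document = page, and please forgive the sloppy code, but if you have ideas on how to make it better, I'm all ears - I'm new to this).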

import re
import HTMLParser

regexp = "&.+?;" 
list_of_html = re.findall(regexp, page) #finds all html entities in page
for e in list_of_html:
    h = HTMLParser.HTMLParser()
    unescaped = h.unescape(e) #finds the unescaped value of the html entity
    page = page.replace(e, unescaped) #replaces html entity with unescaped value

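— Neil Aggarwal

7

No! You don't need to match HTML entities yourself and loop over them; .unescape() does that for you. I don't understand why you and Rob have both posted these overcomplicated solutions that roll their own entity matching when the accepted answer already clearly shows that .unescape() can find entities in the string.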

— Mark Amery
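To illustrate the comment above: a single unescape call already finds and decodes every entity in a longer string, so no regular expression or loop is needed. This sketch uses html.unescape() and assumes Python 3.4+; the `page` string is made up for the example.

```python
import html

# A made-up document fragment mixing named, decimal and hex references.
page = "<p>&pound;682m &amp; &#8217;quoted&#8217; &#x20AC;5</p>"

# One call decodes every reference in the string.
decoded_page = html.unescape(page)
print(decoded_page)
```

One caveat when doing this to a whole document and not just to extracted text: references such as &lt; and &amp; are decoded too, so the result is not necessarily valid markup any more.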