Convert HTML entities to Unicode and vice versa

70

Possible duplicates:

Convert XML/HTML entities into Unicode string in Python

HTML entity codes to text

How do you convert HTML entities to Unicode and vice versa in Python?

python html html-entities

— Hekevintran
source

16

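@Jarret Hardie: Actually, that is perfectly fine here. From the first entry of the FAQ (stackoverflow.com/faq): "It's also perfectly fine to ask and answer your own programming question." That said, you are also encouraged to look for duplicates.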

— 尚西

13

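I post questions that I have already answered for myself in the past, so that other users searching for similar answers can find them.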

— hekevintran

This can also be done without an external library. See stackoverflow.com/questions/663058/html-entity-codes-to-text/...

— bobince

6

+1 for contributing to the dataset.

— Ryan Townshend

2

This question is broader in scope than the linked "duplicates": it also asks for the reverse direction, i.e. from Unicode to HTML entities.

— Vebjorn Ljosa '09

95

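As for the "vice versa" (which I needed myself, which led me to this question, which didn't help, and subsequently to another site that had the answer):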

u'some string'.encode('ascii', 'xmlcharrefreplace')

will return a plain (byte) string in which every non-ASCII character has been replaced by an XML (HTML) numeric character reference.
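The one-liner above is written for Python 2. As a minimal sketch of the same idea under Python 3 (the sample text is made up for illustration), note that `encode` returns `bytes` there, so a final `decode` is needed to get a `str` back:

```python
# Python 3: every str is Unicode, so no u'' prefix is needed.
text = 'naïve café €'

# Non-ASCII characters become decimal character references.
encoded = text.encode('ascii', 'xmlcharrefreplace')
print(encoded)                  # b'na&#239;ve caf&#233; &#8364;'

# Decode the ASCII bytes to continue working with a str.
print(encoded.decode('ascii'))  # na&#239;ve caf&#233; &#8364;
```

This only touches non-ASCII characters; `&`, `<` and `>` pass through unchanged and have to be escaped separately.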

— 以撒
source

1

I had forgotten about xmlcharrefreplace; this is very helpful. I need it whenever I have to safely store encoded or non-ASCII characters in MySQL.

— cybertoast 2012

1

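This does not work on string literals containing the Unicode character U+2019, whose HTML entity equivalent is &#8217;. Isn't that what the question is asking for (this answer only converts to the ASCII subset)? text.decode('utf-8').encode('ascii', 'xmlcharrefreplace')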

— Mike S

1

@MikeS It works fine; >>> u'\u2019'.encode('utf-8').decode('utf-8').encode('ascii', 'xmlcharrefreplace') gives '&#8217;'

— Piotr Dobrogost '16

31

You need to have BeautifulSoup.

from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode.  For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities.  For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)

print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;

— Hekevintran
source

2

The BeautifulSoup API has changed. See the latest documentation.

— scharfmn 2015

@hekevintran: Is it possible to print '&#x00A2;, &#x00A3;, &#x00A5;, &#x20AC;, &#x00A7;, &#x00A9;' instead of '¢, £, ¥, €, §, ©'? Any ideas?

— 贾加斯
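One way to get hexadecimal references like the ones asked about in the comment above is to format each non-ASCII code point by hand. This is only a sketch for Python 3; `to_hex_entities` is a hypothetical helper name, not part of any library:

```python
def to_hex_entities(text):
    """Replace every non-ASCII character with a hexadecimal character reference."""
    return ''.join(
        # Keep ASCII as-is; write others as &#xHHHH; with at least four hex digits.
        ch if ord(ch) < 128 else '&#x{:04X};'.format(ord(ch))
        for ch in text
    )

print(to_hex_entities('¢, £, ¥, €, §, ©'))
# &#x00A2;, &#x00A3;, &#x00A5;, &#x20AC;, &#x00A7;, &#x00A9;
```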

5

Desperately needs a Python 3 update.

— Routhinator

21

Update for Python 2.7 and BeautifulSoup4

Unescape: Unicode HTML to Unicode with htmlparser (Python 2.7 standard lib):

>>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Unescape: Unicode HTML to Unicode with bs4 (BeautifulSoup4):

>>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.text
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Escape: Unicode to Unicode HTML with bs4 (BeautifulSoup4):

>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
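If bs4 is not available, a similar named-entity escape can be approximated in Python 3 with the standard library's `html.entities.codepoint2name` table. This is a rough sketch rather than a drop-in replacement for `EntitySubstitution`; `escape_named` is a hypothetical helper name, and the table only covers the HTML 4 entity names:

```python
from html.entities import codepoint2name

def escape_named(text):
    """Use named entities where HTML 4 defines one, numeric references otherwise."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if name is not None:
            out.append('&' + name + ';')    # e.g. é -> &eacute;, & -> &amp;
        elif ord(ch) > 127:
            out.append('&#%d;' % ord(ch))   # no named entity: decimal reference
        else:
            out.append(ch)                  # plain ASCII stays untouched
    return ''.join(out)

print(escape_named('Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'))
# Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood
```

Because the table also contains `&`, `<`, `>` and `"`, those characters are escaped too.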

— char
source

2

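Upvoted for the standard-library solution with no dependencies.

— Hartley Brody

Revisiting this: I had only left @bobince's comment on the question pointing to that answer. Since htmlparser is now documented, and since that comment is not prominent, I have kept that part of the answer.

— scharfmn 2016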

12

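As hekevintran's answer suggests, you may use cgi.escape(s) to encode strings, but note that quote encoding is off by default in that function, so it may be a good idea to pass the quote=True keyword argument alongside your string. Even with quote=True, however, the function will not escape single quotes ("'"). (Because of these issues, the function has been deprecated since version 3.2.)

It is recommended to use html.escape(s) instead of cgi.escape(s). (New in version 3.2)

Also, html.unescape(s) was introduced in version 3.4.

So in Python 3.4 you can:

Use html.escape(text).encode('ascii', 'xmlcharrefreplace').decode() to convert special characters to HTML entities.
And html.unescape(text) to convert HTML entities back to their plain-text representation.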
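Put together, the two calls above give a round trip. The following is a small sketch for Python 3.4+; `to_entities` and the sample string are illustrative assumptions, not part of the `html` module:

```python
import html

def to_entities(text):
    """Escape markup characters, then turn non-ASCII characters into references."""
    escaped = html.escape(text)  # quote=True by default: both " and ' are escaped
    return escaped.encode('ascii', 'xmlcharrefreplace').decode('ascii')

sample = 'Café & "crème" <brûlée> it\'s'
encoded = to_entities(sample)
print(encoded)
# Caf&#233; &amp; &quot;cr&#232;me&quot; &lt;br&#251;l&#233;e&gt; it&#x27;s

# html.unescape reverses both the named and the numeric references.
print(html.unescape(encoded) == sample)  # True

# With quote=False, quotes are left alone and only &, < and > are escaped.
print(html.escape('"quoted" & \'single\'', quote=False))
# "quoted" &amp; 'single'
```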

— AXO
source

1

In Python 2.7 you can use HTMLParser.HTMLParser().unescape(text)

— 坦率

4

$ python3 -c "
> import html
> print(
>     html.unescape('&amp;&#169;&#x2014;')
> )"
&©—

$ python3 -c "
> import html
> print(
>     html.escape('&©—')
> )"
&amp;©—

$ python2 -c "
> from HTMLParser import HTMLParser
> print(
>     HTMLParser().unescape('&amp;&#169;&#x2014;')
> )"
&©—

$ python2 -c "
> import cgi
> print(
>     cgi.escape('&©—')
> )"
&amp;©—

HTML only strictly requires escaping & (ampersand) and < (left angle bracket / less-than sign). https://html.spec.whatwg.org/multipage/parsing.html#data-state
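For text content, that minimal escaping could be sketched as follows in Python 3. `escape_minimal` is a hypothetical helper; attribute values would additionally need their quote character escaped, so this is not a general-purpose replacement for html.escape:

```python
import html

def escape_minimal(text):
    """Escape only & and <, which is enough for plain text content."""
    # Replace & first, otherwise the & in a freshly written &lt; would be escaped again.
    return text.replace('&', '&amp;').replace('<', '&lt;')

minimal = escape_minimal('a < b && "c" > d')
print(minimal)
# a &lt; b &amp;&amp; "c" > d

# The result still decodes back to the original text.
print(html.unescape(minimal))
# a < b && "c" > d
```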

— Jan Kyu Peblik
source

2

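If anyone else out there is wondering, as I was, why some entity numbers (codes) like &#153; (for the trademark symbol) and &#128; (for the euro symbol) are not decoded properly: the reason is that these code points are not printable characters in ISO-8859-1. The symbols only exist at those positions in Windows-1252, which is often treated as if it were the same encoding.

Also note that the default character set is UTF-8 as of HTML5, whereas for HTML4 it was ISO-8859-1.

So we will have to work around this somehow (by finding and replacing them first).

Reference (starting point) from the Mozilla documentation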

https://developer.mozilla.org/zh-CN/docs/Web/Guide/Localizations_and_character_encodings
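The "find and replace them first" step might look like the sketch below in Python 3. It assumes the stray references fall in the range 128 to 159 and were meant as Windows-1252 bytes; `fix_cp1252_refs` is a hypothetical helper name:

```python
import html
import re

def fix_cp1252_refs(text):
    """Rewrite &#128;..&#159; to the code points of the Windows-1252 characters."""
    def repl(match):
        code = int(match.group(1))
        if not 128 <= code <= 159:
            return match.group(0)       # outside the problematic range
        try:
            char = bytes([code]).decode('cp1252')
        except UnicodeDecodeError:
            return match.group(0)       # byte is undefined in cp1252 (e.g. 129)
        return '&#{};'.format(ord(char))
    return re.sub(r'&#(\d+);', repl, text)

print(fix_cp1252_refs('&#153; &#128; &#169;'))
# &#8482; &#8364; &#169;
```

Python 3's html.unescape follows the HTML5 parsing rules and already applies this mapping itself, so `html.unescape('&#153;')` yields the trademark sign directly; the explicit rewrite is mainly useful when the text is handed to a stricter decoder.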

— 布鲁斯考希克
source

1

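I used the following function to write Unicode text ripped from an xls file into an HTML file while preserving the special characters found in the xls file: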

import html

def html_wr(f, dat):
    ''' write dat to file f as html
        . file is assumed to be opened in binary mode
        . if dat is empty it is replaced with a non-breaking space
        . non-ascii characters are translated to xml character references
    '''
    if not dat:
        dat = '&nbsp;'
    try:
        # pure ASCII data is written unchanged
        f.write(dat.encode('ascii'))
    except UnicodeEncodeError:
        # otherwise escape markup characters and replace non-ASCII characters
        f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))

Hope this is useful to somebody.

— 斯蒂芬·艾尔伍德
source

1

For Python 3, use html.unescape():

import html
s = "&amp;"
decoded = html.unescape(s)
# &

— 佩德罗·洛比托
source

0

#!/usr/bin/env python3
import fileinput
import html

for line in fileinput.input():
    print(html.unescape(line.rstrip('\n')))

— 笑脸
source