127

I have a string that is HTML-encoded:

'''&lt;img class=&quot;size-medium wp-image-113&quot;\
 style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot;\
 src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot;\
 alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;'''

I want to change it to:

<img class="size-medium wp-image-113" style="margin-left: 15px;" 
  title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" 
  alt="" width="300" height="194" />

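I want it to register as HTML so that the browser renders it as an image instead of displaying it as text.

The string is stored like that because I am using a web-scraping tool called BeautifulSoup; it "scans" a web page, pulls certain content from it, and returns the string in that format.

I've found how to do this in C# but not in Python. Can someone help me out?

Related

Convert XML/HTML entities into a Unicode string in Python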

python django html-encode

— rksprst

118

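Given the Django use case, there are two answers to this. Here is its django.utils.html.escape function, for reference:

def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;')
        .replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))

To reverse this, the Cheetah function described in Jake's answer should work, but it is missing the single quote. This version includes an updated tuple, with the order of replacement reversed to avoid symmetry problems: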

def html_decode(s):
    """
    Returns the ASCII decoded version of the given HTML string. This does
    NOT remove normal HTML tags like <p>.
    """
    htmlCodes = (
            ("'", '&#39;'),
            ('"', '&quot;'),
            ('>', '&gt;'),
            ('<', '&lt;'),
            ('&', '&amp;')
        )
    for code in htmlCodes:
        s = s.replace(code[1], code[0])
    return s

unescaped = html_decode(my_string)

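This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape. More generally, it is a good idea to stick with the standard library: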

# Python 2.x:
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.0-3.3 (HTMLParser.unescape was deprecated in 3.4 and removed in 3.9):
import html.parser
html_parser = html.parser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.4+:
from html import unescape
unescaped = unescape(my_string)
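As a quick check of the last variant, here is a minimal sketch that runs html.unescape on a shortened version of the string from the question. The variable names escaped and restored are only illustrative, and Python 3.4 or later is assumed.

```python
from html import unescape

# A shortened version of the HTML-encoded string from the question.
escaped = (
    '&lt;img class=&quot;size-medium wp-image-113&quot; '
    'alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;'
)

# unescape() turns the character references back into markup characters.
restored = unescape(escaped)
print(restored)
```

The result is an ordinary string containing the img tag; whether a browser renders it as an image still depends on how the template outputs it.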

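As a suggestion: it may make more sense to store the HTML unescaped in your database. If possible, it would be worth looking into getting unescaped results back from BeautifulSoup and avoiding this process altogether.

With Django, escaping only occurs during template rendering; so to prevent escaping you just tell the templating engine not to escape your string. To do that, use one of these options in your template: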

{{ context_var|safe }}
{% autoescape off %}
    {{ context_var }}
{% endautoescape %}

— Daniel Naab

1

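Why not use Django or Cheetah?

— 垫

4

Is there no opposite of django.utils.html.escape?

— 垫

12

I think escaping only occurs in Django during template rendering. Therefore, there's no need for an unescape; you just tell the templating engine not to escape. Either {{ context_var|safe }} or {% autoescape off %}{{ context_var }}{% endautoescape %}

— Daniel Naab

3

@Daniel: Please change your comment to an answer so that I can vote it up! |safe was exactly what I (and I'm sure others) was looking for in answer to this question.

— 韦恩·科特斯

1

html.parser.HTMLParser().unescape() is deprecated in 3.5. Use html.unescape() instead.

— pjvandehaar, 2015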

114

Use the standard library:

HTML escape

try:
    from html import escape  # python 3.x
except ImportError:
    from cgi import escape  # python 2.x

print(escape("<"))

HTML unescape

try:
    from html import unescape  # python 3.4+
except ImportError:
    try:
        from html.parser import HTMLParser  # python 3.x (<3.4)
    except ImportError:
        from HTMLParser import HTMLParser  # python 2.x
    unescape = HTMLParser().unescape

print(unescape("&gt;"))
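To see the two halves working together, here is a small round-trip sketch for Python 3.4+. The sample string and variable names are made up for illustration. Note that html.escape encodes both quote characters by default (quote=True), which the older cgi.escape did not.

```python
from html import escape, unescape

raw = '<a href="x">Tom & Jerry\'s</a>'

# quote=True is the default, so " and ' are encoded as well as &, < and >.
encoded = escape(raw)

# With quote=False only &, < and > are replaced.
encoded_no_quotes = escape(raw, quote=False)

# unescape() reverses either form.
round_tripped = unescape(encoded)

print(encoded)
print(encoded_no_quotes)
print(round_tripped == raw)
```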

— 张江歌

12

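I think this is the most straightforward, "batteries included", and correct answer. I don't know why people vote for the Django/Cheetah approach.

— Daniel Baktiar, 2012

I think so too, except that this answer does not seem to be complete. HTMLParser needs to be subclassed, told what to do with every part of whatever object it is fed, and then fed the object to parse, as seen here. Also, you'll still want to use the name2codepoint dict to convert each HTML entity to the actual character it represents.

— Marconius

You are right. The unsubclassed HTMLParser could not work the way we wanted if we fed an HTML entity into it. Maybe I should rename htmlparser to _htmlparser to hide it, and expose only the unescape method as a helper function.

— 姜格张

3

Note from 2015: HTMLParser.unescape is deprecated in py 3.4 and removed in 3.5. Use from html import unescape instead.

— Karolis Ryselis

2

Note that this does not handle special characters such as German umlauts ("Ü").

— 576i, 2015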

80

For HTML encoding, there's cgi.escape in the standard library:

>>> help(cgi.escape)
cgi.escape = escape(s, quote=None)
    Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
    is also translated.

For HTML decoding, I use the following:

import re
from htmlentitydefs import name2codepoint
# for some reason, python 2.5.2 doesn't have this one (apostrophe)
name2codepoint['#39'] = 39

def unescape(s):
    "unescape HTML code refs; c.f. http://wiki.python.org/moin/EscapingHtml"
    return re.sub('&(%s);' % '|'.join(name2codepoint),
              lambda m: unichr(name2codepoint[m.group(1)]), s)

For anything more complicated, I use BeautifulSoup.
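The decoding snippet above is Python 2 code (htmlentitydefs, unichr). A rough Python 3 sketch of the same regex idea might look like the following. The function name unescape_entities and the pattern are illustrative choices rather than part of any library, and it handles only named and decimal references, not hexadecimal ones.

```python
import re
from html.entities import name2codepoint

# Decimal references (&#39;) land in group 1, named ones (&lt;) in group 2.
# The digit count is capped so chr() never sees an out-of-range value.
_ENTITY_RE = re.compile(r"&(?:#(\d{1,6})|(\w+));")


def unescape_entities(text):
    """Replace named and decimal character references in text."""
    def replace(match):
        number, name = match.groups()
        if number:
            return chr(int(number))
        codepoint = name2codepoint.get(name)
        # Leave unknown names such as &bogus; untouched.
        return chr(codepoint) if codepoint is not None else match.group(0)

    return _ENTITY_RE.sub(replace, text)


print(unescape_entities("&lt;b&gt;caf&eacute; &#39;ok&#39; &bogus;"))
```

On Python 3.4+ the standard html.unescape covers the same cases and more, so a sketch like this is mainly useful for seeing how the approach works.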

— 用户名

20

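If the set of encoded characters is relatively restricted, use Daniel's solution. Otherwise, use one of the numerous HTML-parsing libraries.

I like BeautifulSoup because it can handle malformed XML/HTML:

http://www.crummy.com/software/BeautifulSoup/

For your question, there's an example in their documentation: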

from BeautifulSoup import BeautifulStoneSoup
BeautifulStoneSoup("Sacr&eacute; bl&#101;u!", 
                   convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
# u'Sacr\xe9 bleu!'

— 文森特

BeautifulSoup doesn't convert hex entities (&#x65;) stackoverflow.com/questions/57708/…

— jfs

1

For BeautifulSoup4, the equivalent would be: from bs4 import BeautifulSoup BeautifulSoup("Sacr&eacute; bl&#101;u!").contents[0]

— radicand, 2013

10

In Python 3.4+:

import html

html.unescape(your_string)
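Several comments in this thread ask about hexadecimal references, zero-padded numbers and umlauts. As a small sketch (the variable names are illustrative), html.unescape treats the named, decimal and hexadecimal spellings of a character alike:

```python
import html

# Three spellings of the apostrophe: decimal, zero-padded decimal, hex.
apostrophes = [html.unescape(ref) for ref in ["&#39;", "&#039;", "&#x27;"]]

# Three spellings of the non-breaking space: named, decimal, hex.
spaces = [html.unescape(ref) for ref in ["&nbsp;", "&#160;", "&#xA0;"]]

# Named references outside ASCII work as well.
umlaut = html.unescape("&Uuml;")

print(apostrophes, [hex(ord(s)) for s in spaces], umlaut)
```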

— 科林·安德森

8

See the bottom of this page on the Python wiki; there are at least two options there to "unescape" HTML.

— 兹哥达

6

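Daniel's comment as an answer:

"Escaping only occurs in Django during template rendering. Therefore, there's no need for an unescape; you just tell the templating engine not to escape. Either {{ context_var|safe }} or {% autoescape off %}{{ context_var }}{% endautoescape %}"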

— 弗兰科夫

Works, except that my version of Django does not have "safe". I use "escape" instead. I assume it's the same thing.

— willem

1

@willem: they are the opposite!

— Asherah, 2015

5

I found a fine function at: http://snippets.dzone.com/posts/show/4569

def decodeHtmlentities(string):
    import re
    entity_re = re.compile("&(#?)(\d{1,5}|\w{1,8});")

    def substitute_entity(match):
        from htmlentitydefs import name2codepoint as n2cp
        ent = match.group(2)
        if match.group(1) == "#":
            return unichr(int(ent))
        else:
            cp = n2cp.get(ent)

            if cp:
                return unichr(cp)
            else:
                return match.group()

    return entity_re.subn(substitute_entity, string)[0]

— 慢药

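The benefit of using re is that you can match both &#039; and &#39; with the same search.

— Neal Stublen

This doesn't handle &#xA0;, which should decode to the same thing as &#160; and &nbsp;.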

— Mike Samuel

3

If anyone is looking for a simple way to do this via Django templates, you can always use a filter like this:

<html>
{{ node.description|safe }}
</html>

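I had some data coming from a vendor, and everything I posted had the HTML tags actually written out on the rendered page, as if you were looking at the source. The code above helped me greatly. Hope this helps others.

Cheers!!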

— 克里斯·哈蒂

3

Even though this is a really old question, this may work.

Django 1.5.5

In [1]: from django.utils.text import unescape_entities
In [2]: unescape_entities('&lt;img class=&quot;size-medium wp-image-113&quot; style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;')
Out[2]: u'<img class="size-medium wp-image-113" style="margin-left: 15px;" title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" alt="" width="300" height="194" />'

— 詹姆士

1

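This was the only one that was able to decode surrogate pairs encoded as HTML entities, such as "&#55349;&#56996;". After one more pass through result.encode('utf-16', 'surrogatepass').decode('utf-16'), I finally had the original back.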

— rescdsk
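The encode/decode step in that comment can be tried in isolation. This sketch assumes Python 3 and assumes the decoder returned the two surrogates as separate characters; that situation is simulated here with chr() rather than by calling Django.

```python
# Simulate a decoder that produced two lone surrogates (U+D835, U+DEA4)
# from the references &#55349; and &#56996;.
broken = chr(55349) + chr(56996)

# 'surrogatepass' lets the encoder write the surrogates as UTF-16 code units;
# decoding the bytes then combines the pair into one code point.
fixed = broken.encode("utf-16", "surrogatepass").decode("utf-16")

print(len(broken), len(fixed), hex(ord(fixed)))
```

The pair combines into the single code point U+1D6A4.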

1

I found this in the Cheetah source code (here)

htmlCodes = [
    ['&', '&amp;'],
    ['<', '&lt;'],
    ['>', '&gt;'],
    ['"', '&quot;'],
]
htmlCodesReversed = htmlCodes[:]
htmlCodesReversed.reverse()
def htmlDecode(s, codes=htmlCodesReversed):
    """ Returns the ASCII decoded version of the given HTML string. This does
        NOT remove normal HTML tags like <p>. It is the inverse of htmlEncode()."""
    for code in codes:
        s = s.replace(code[1], code[0])
    return s

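I'm not sure why they reverse the list. I think it has to do with the way they encode, so in your case it may not need to be reversed. Also, if I were you, I would change htmlCodes to be a list of tuples rather than a list of lists... this is going in my library though :)

I noticed your title asked for encoding too, so here is Cheetah's encode function.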

def htmlEncode(s, codes=htmlCodes):
    """ Returns the HTML encoded version of the given string. This is useful to
        display a plain ASCII text string on a web page."""
    for code in codes:
        s = s.replace(code[0], code[1])
    return s

— Jake

2

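The list is reversed because the decode and encode replacements always have to be made symmetrically. Without the reversal you could, for example, convert '&amp;lt;' to '&lt;' and then, in the next step, incorrectly convert that to '<'.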

— bobince
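The ordering issue described in that comment can be reproduced with a short sketch. PAIRS, encode and decode are hypothetical stand-ins modelled on the Cheetah snippet above, not Cheetah's actual API.

```python
# Same (character, entity) pairs as the Cheetah table, '&' first.
PAIRS = [("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"), ('"', "&quot;")]


def encode(s):
    # '&' must be replaced first, or the '&' in '&lt;' would be re-encoded.
    for char, entity in PAIRS:
        s = s.replace(char, entity)
    return s


def decode(s, pairs):
    for char, entity in pairs:
        s = s.replace(entity, char)
    return s


literal = "&lt;"            # text that literally reads "&lt;"
encoded = encode(literal)   # "&amp;lt;"

good = decode(encoded, list(reversed(PAIRS)))  # '&amp;' handled last
bad = decode(encoded, PAIRS)                   # '&amp;' handled first

print(encoded, good, bad)
```

Decoding in reversed order recovers the literal text, while decoding in encode order decodes it twice and yields a bare '<'.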

1

You can also use django.utils.html.escape

from django.utils.html import escape

something_nice = escape(request.POST['something_naughty'])

— Seth Gottlieb

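The OP asked about unescaping, not escaping.

— 粘土化

In the title itself he also asked for encoding; I just found your answer and am grateful for it.

— 西蒙·斯坦伯格

1

It's not what the OP asked for, but I found this useful.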

— rectangletangle

0

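Below is a Python function that uses the module htmlentitydefs. It is not perfect. The version of htmlentitydefs that I have is incomplete, and the function assumes that every entity decodes to a single code point, which is wrong for entities like &NotEqualTilde;: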

http://www.w3.org/TR/html5/named-character-references.html

NotEqualTilde;     U+02242 U+00338    ≂̸

With those caveats, here is the code.

import re
import htmlentitydefs  # Python 2 module; named html.entities in Python 3

def decodeHtmlText(html):
    """
    Given a string of HTML that would parse to a single text node,
    return the text value of that node.
    """
    # Fast path for common case.
    if html.find("&") < 0: return html
    return re.sub(
        '&(?:#(?:x([0-9A-Fa-f]+)|([0-9]+))|([a-zA-Z0-9]+));',
        _decode_html_entity,
        html)

def _decode_html_entity(match):
    """
    Regex replacer that expects hex digits in group 1, or
    decimal digits in group 2, or a named entity in group 3.
    """
    hex_digits = match.group(1)  # '&#x10;' -> unichr(0x10)
    if hex_digits: return unichr(int(hex_digits, 16))
    decimal_digits = match.group(2)  # '&#10;' -> unichr(10)
    if decimal_digits: return unichr(int(decimal_digits, 10))
    name = match.group(3)  # name is 'lt' when '&lt;' was matched.
    if name:
        decoding = (htmlentitydefs.name2codepoint.get(name)
            # Treat &GT; like &gt;.
            # This is wrong for &Gt; and &Lt; which HTML5 adopted from MathML.
            # If htmlentitydefs included mappings for those entities,
            # then this code will magically work.
            or htmlentitydefs.name2codepoint.get(name.lower()))
        if decoding is not None: return unichr(decoding)
    return match.group(0)  # Treat "&noSuchEntity;" as "&noSuchEntity;"
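For comparison, and assuming Python 3.4+ is available, the standard html.unescape draws on the HTML5 entity table, so the multi-code-point case mentioned above is handled. This is only a sketch of that behaviour, not a replacement for the function above on Python 2.

```python
import html

# &NotEqualTilde; maps to two code points: U+2242 followed by U+0338.
decoded = html.unescape("&NotEqualTilde;")

# The upper-case &GT; is in the HTML5 table and decodes like &gt;.
upper_gt = html.unescape("&GT;")

print([hex(ord(ch)) for ch in decoded], upper_gt)
```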

— Mike Samuel

0

This is the easiest solution to this problem:

{% autoescape off %}
   {{ body }}
{% endautoescape %}

From this page.

— mil谐

0

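While searching for the simplest solution to this question in Django and Python, I found that you can use built-in functions to escape and unescape HTML code.

Example

I saved your HTML code in scraped_html and clean_html: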

scraped_html = (
    '&lt;img class=&quot;size-medium wp-image-113&quot; '
    'style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; '
    'src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; '
    'alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;'
)
clean_html = (
    '<img class="size-medium wp-image-113" style="margin-left: 15px;" '
    'title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" '
    'alt="" width="300" height="194" />'
)

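Django

You need Django >= 1.0

Unescape

To unescape your scraped HTML code you can use django.utils.text.unescape_entities, which:

Converts all named and numeric character references to the corresponding Unicode characters.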

>>> from django.utils.text import unescape_entities
>>> clean_html == unescape_entities(scraped_html)
True

Escape

To escape your clean HTML code you can use django.utils.html.escape, which:

Returns the given text with ampersands, quotes and angle brackets encoded for use in HTML.

>>> from django.utils.html import escape
>>> scraped_html == escape(clean_html)
True

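Python

You need Python >= 3.4

Unescape

To unescape your scraped HTML code you can use html.unescape, which:

Converts all named and numeric character references (e.g. &gt;, &#62;, &#x3e;) in the string s to the corresponding Unicode characters.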

>>> from html import unescape
>>> clean_html == unescape(scraped_html)
True

Escape

To escape your clean HTML code you can use html.escape, which:

Converts the characters &, < and > in string s to HTML-safe sequences.

>>> from html import escape
>>> scraped_html == escape(clean_html)
True

— Paolo Melchiorre

How do I perform HTML decoding/encoding using Python/Django?
