How do I perform HTML decoding/encoding using Python/Django?

Question

12 views · May 4, 2023

Anonymous · May 5, 2023

0 Comments

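I have a string that is HTML-encoded:

'''&lt;img class=&quot;size-medium wp-image-113&quot;\
 style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot;\
 src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot;\
 alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;'''

I want to change it to:

<img class="size-medium wp-image-113" style="margin-left: 15px;" title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" alt="" width="300" height="194" />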

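I want this to be recognized as HTML so that the browser renders it as an image instead of displaying it as text.

The string is stored this way because I am using a web-scraping tool called BeautifulSoup; it "scans" a web page, extracts certain content from it, and returns the string in that format.

I have found how to do this in C#, but not in Python. Can someone help me out?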

3 Answers

Anonymous · Answer 1 · 2023-05-26T18:36:58+00:00

The question is how to perform HTML encoding and decoding in Python/Django.

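For HTML encoding, older Python versions provide cgi.escape in the standard library. It replaces the special characters "&", "<" and ">" with HTML-safe sequences. If quote=True is passed, the double-quote character is escaped as well.

For HTML decoding, the following Python 2 code can be used: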

import re
from htmlentitydefs import name2codepoint
# for some reason, python 2.5.2 doesn't have this one (apostrophe)
name2codepoint['#39'] = 39
def unescape(s):
    "unescape HTML code refs; c.f. http://wiki.python.org/moin/EscapingHtml"
    return re.sub('&(%s);' % '|'.join(name2codepoint),
              lambda m: unichr(name2codepoint[m.group(1)]), s)

For more complex cases, the BeautifulSoup library can be used.

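Note what the Python documentation says about cgi.escape: "Deprecated since version 3.2: This function is unsafe because quote is false by default, and therefore deprecated. Use html.escape() instead." The function was removed in Python 3.8.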
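As a minimal sketch of the replacement the documentation points to, the following uses html.escape and html.unescape from the Python 3 standard library. The `raw` string is a shortened, illustrative version of the tag from the question, not the original data.

```python
import html

# Shortened, illustrative version of the tag from the question.
raw = '<img title="su1" alt="" width="300" />'

# html.escape replaces &, < and >. With quote=True (the default),
# it also escapes double and single quotes.
encoded = html.escape(raw)
encoded_no_quotes = html.escape(raw, quote=False)

# html.unescape turns the entities back into characters.
decoded = html.unescape(encoded)

print(encoded)
print(encoded_no_quotes)
print(decoded == raw)
```

Unlike cgi.escape, html.escape escapes quotes by default, which is the safer choice when the result ends up inside an attribute value.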

Those are the main options for encoding and decoding HTML in Python.

Anonymous · Answer 2 · 2023-07-16T01:25:15+00:00

Given the Django use case, there are two answers to this. Here is Django's django.utils.html.escape function, for reference:

def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))

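To reverse this, the Cheetah function described in Jake's answer should work, but it is missing the single quote. This version includes an updated tuple, with the order of replacement reversed to avoid symmetric problems: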

def html_decode(s):
    """
    Returns the ASCII decoded version of the given HTML string. This does
    NOT remove normal HTML tags like <p>.
    """
    htmlCodes = (
            ("'", '&#39;'),
            ('"', '&quot;'),
            ('>', '&gt;'),
            ('<', '&lt;'),
            ('&', '&amp;')
        )
    for code in htmlCodes:
        s = s.replace(code[1], code[0])
    return s
unescaped = html_decode(my_string)

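This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape. More generally, it is a good idea to stick with the standard library: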

# Python 2.x:
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(my_string)
# Python 3.0 to 3.8 (HTMLParser.unescape was removed in 3.9):
import html.parser
html_parser = html.parser.HTMLParser()
unescaped = html_parser.unescape(my_string)
# Python 3.4 and later:
from html import unescape
unescaped = unescape(my_string)
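To illustrate why the replacement order matters and where a hand-written table falls short, here is a small sketch for Python 3. `html_decode` is a compact restatement of the function above; `html_decode_wrong_order` is a hypothetical variant added only for comparison.

```python
import html

def html_decode(s):
    """Compact form of the decoder above: '&amp;' is replaced last."""
    pairs = (("'", "&#39;"), ('"', "&quot;"), (">", "&gt;"),
             ("<", "&lt;"), ("&", "&amp;"))
    for char, code in pairs:
        s = s.replace(code, char)
    return s

def html_decode_wrong_order(s):
    """Hypothetical variant that replaces '&amp;' first."""
    pairs = (("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"))
    for char, code in pairs:
        s = s.replace(code, char)
    return s

# The literal text "&lt;" is encoded as "&amp;lt;".
double_encoded = "&amp;lt;"
right = html_decode(double_encoded)              # removes one level only
wrong = html_decode_wrong_order(double_encoded)  # decodes twice
std = html.unescape(double_encoded)              # standard library result

# A hexadecimal apostrophe reference is not in the hand-written table.
hex_ref = "it&#x27;s"
by_hand = html_decode(hex_ref)
by_stdlib = html.unescape(hex_ref)

print(right, wrong, std, by_hand, by_stdlib)
```

Replacing '&amp;' last keeps an escaped entity from being decoded a second time, and html.unescape additionally understands numeric and named references that a fixed table does not list.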

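As a suggestion: it may make more sense to store the HTML unescaped in your database. It would be worth looking into getting unescaped results back from BeautifulSoup if possible, and avoiding this process altogether.

With Django, escaping only occurs during template rendering; so to prevent escaping, you just tell the templating engine not to escape your string. To do that, use one of these options in your template: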

{{ context_var|safe }}
{% autoescape off %}
    {{ context_var }}
{% endautoescape %}

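Why not use either Django or Cheetah?

Is there no opposite of django.utils.html.escape?

I think escaping only occurs in Django during template rendering. Therefore there is no need for an unescape; you just tell the templating engine not to escape. Use either {{ context_var|safe }} or {% autoescape off %}{{ context_var }}{% endautoescape %}.

Please change your comment to an answer so that I can vote it up! "safe" was exactly what I (and surely others) was looking for in answer to this question.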

It should be ''' rather than ''/'.

I found that in Django 1.3.x my single quotes were not being escaped.

html.parser.HTMLParser().unescape() has been deprecated since Python 3.4 and was removed in 3.9. Use html.unescape() instead.

This is a good option for restoring a string to its previous state. Thanks.

Anonymous · Answer 3 · 2023-07-21T07:57:36+00:00

With the standard library:

try:
    from html import escape  # python 3.x
except ImportError:
    from cgi import escape  # python 2.x
print(escape("<"))

try:
    from html import unescape  # python 3.4+
except ImportError:
    try:
        from html.parser import HTMLParser  # python 3.x (<3.4)
    except ImportError:
        from HTMLParser import HTMLParser  # python 2.x
    unescape = HTMLParser().unescape
print(unescape("&gt;"))

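I think this is the most straightforward, full-featured and correct answer. I don't know why anyone would vote for those Django/Cheetah solutions.

I think so too, although this answer seems incomplete. `HTMLParser` needs to be subclassed and told what to do with every part of the object fed to it, and then given the object to parse; see here for details. Also, you would still need the `name2codepoint` dict to convert each HTML entity to the actual character it represents.

You are right. If we feed HTML entities to an unsubclassed `HTMLParser`, it does not work the way we expect. Maybe I should rename `htmlparser` to `_htmlparser` to hide it, and expose only the `unescape` method so that it looks like a helper function.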

A note from 2015: HTMLParser.unescape was deprecated in Python 3.4, and it has since been removed in 3.9. Use `from html import unescape` instead.

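The answer has been updated. Thanks!

Note that this does not handle special characters such as German umlauts ("Ü").

Can you be more specific? It works fine for me with both Python 2 and 3.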
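For what it is worth, a quick check along these lines can show how html.unescape from the Python 3 standard library treats an umlaut. The sample strings are illustrative assumptions, not taken from the thread.

```python
import html

# "Ü" is U+00DC (decimal 220); all three spellings refer to the same character.
named = html.unescape("&Uuml;ber")        # named entity
decimal = html.unescape("&#220;ber")      # decimal character reference
hexadecimal = html.unescape("&#xDC;ber")  # hexadecimal character reference

# html.escape leaves non-ASCII characters alone and only escapes markup.
escaped = html.escape("Über <b>")

print(named, decimal, hexadecimal, escaped)
```

If umlauts come out wrong in practice, the cause may be the encoding used when reading or writing the data rather than the unescaping step.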
