Convert Unicode to ASCII without errors in Python

Question

Anonymous · January 18, 2023

My code just scrapes a web page and then converts it to Unicode.

html = urllib.urlopen(link).read()
html.encode("utf8","ignore")
self.response.out.write(html)

But I get a UnicodeDecodeError:

Traceback (most recent call last):
  File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
    handler.get(*groups)
  File "/Users/greg/clounce/main.py", line 55, in get
    html.encode("utf8","ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)

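I assume this means the HTML contains a malformed attempt at Unicode somewhere. Can I just drop whichever bytes are causing the problem instead of getting an error?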

2 Answers

Anonymous · Answer 1 · 2023-01-18T20:57:58+00:00

As an extension to Ignacio Vazquez-Abrams' answer:

>>> u'aあä'.encode('ascii', 'ignore')
'a'

It is sometimes desirable to strip the accents from characters and print the base form. This can be accomplished with:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'aあä').encode('ascii', 'ignore')
'aa'
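
The same idea can be wrapped in a small helper. This is only a sketch, assuming Python 3 (where `encode` returns `bytes`, so the result is decoded back to `str`); the function name `strip_accents_to_ascii` is made up for illustration.

```python
import unicodedata


def strip_accents_to_ascii(text):
    """Decompose characters with NFKD, then drop whatever is still non-ASCII."""
    # NFKD splits 'ä' into 'a' plus a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize("NFKD", text)
    # The combining mark and characters with no ASCII base (such as 'あ')
    # are discarded by the 'ignore' error handler.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents_to_ascii("aあä"))   # aa
print(strip_accents_to_ascii("café"))  # cafe
```

Note that characters with no decomposition are dropped silently, so the output can be shorter than the input.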

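You may also want to convert other characters, such as punctuation, to their nearest equivalents. For instance, the RIGHT SINGLE QUOTATION MARK Unicode character is not converted to an ASCII APOSTROPHE when encoding: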

>>> print u'\u2019'
’
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
>>> u'\u2019'.encode('ascii', 'ignore')
''
# Note we get an empty string back
>>> u'\u2019'.replace(u'\u2019', u'\'').encode('ascii', 'ignore')
"'"

There are more efficient ways to accomplish this, though. For more details, see the question "Where is Python's "best ASCII for this Unicode" database?"
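
One way to handle several punctuation characters in a single pass is a translation table passed to `str.translate`. The sketch below assumes Python 3; the table `PUNCTUATION_MAP` and the function `to_ascii_punctuation` are hypothetical names, and the mapping is deliberately small rather than a complete "best ASCII" database.

```python
# Hypothetical, intentionally incomplete mapping from code points
# to ASCII stand-ins; extend it for the characters you actually see.
PUNCTUATION_MAP = {
    0x2018: "'",  # LEFT SINGLE QUOTATION MARK
    0x2019: "'",  # RIGHT SINGLE QUOTATION MARK
    0x201C: '"',  # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',  # RIGHT DOUBLE QUOTATION MARK
    0x2013: "-",  # EN DASH
    0x2014: "-",  # EM DASH
    0x00A0: " ",  # NO-BREAK SPACE
}


def to_ascii_punctuation(text):
    """Map known punctuation to ASCII, then drop anything still non-ASCII."""
    translated = text.translate(PUNCTUATION_MAP)
    return translated.encode("ascii", "ignore").decode("ascii")


print(to_ascii_punctuation("It\u2019s \u201cfine\u201d"))  # It's "fine"
```

Translating before encoding matters: once `'ignore'` has discarded a character, there is nothing left to map.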

Anonymous · Answer 2 · 2023-01-18T20:57:58+00:00

>>> u'aあä'.encode('ascii', 'ignore')
'a'

Decode the string that is returned, using the charset given in the appropriate meta tag of the response or in the Content-Type header, then encode.
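
The traceback in the question comes from calling `.encode` on a Python 2 byte string, which first triggers an implicit ASCII decode. Decoding explicitly avoids that. Below is a minimal sketch of the decode-then-encode order, assuming Python 3 and assuming the charset is already known; the function name `html_bytes_to_ascii` and the sample bytes are made up for illustration.

```python
def html_bytes_to_ascii(raw, charset="utf-8"):
    """Decode raw page bytes with the declared charset, then reduce to ASCII."""
    # Step 1: bytes -> text. 'ignore' skips byte sequences invalid in `charset`.
    text = raw.decode(charset, "ignore")
    # Step 2: text -> ASCII bytes. 'ignore' drops characters outside ASCII.
    return text.encode("ascii", "ignore")


# Sample input containing byte 0xa0 (NO-BREAK SPACE in Latin-1),
# the same byte value reported in the question's traceback.
raw = "caf\u00e9\u00a0au lait".encode("latin-1")
print(html_bytes_to_ascii(raw, "latin-1"))  # b'cafau lait'
```

When fetching with `urllib.request`, the charset declared in the Content-Type header should be available through `response.headers.get_content_charset()`, which returns `None` if the header does not name one, so a fallback is still needed.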

The method encode(encoding, errors) accepts an error handler name, including custom ones. Besides ignore, the built-in handlers include:

>>> u'aあä'.encode('ascii', 'replace')
b'a??'
>>> u'aあä'.encode('ascii', 'xmlcharrefreplace')
b'a&#12354;&#228;'
>>> u'aあä'.encode('ascii', 'backslashreplace')
b'a\\u3042\\xe4'

See https://docs.python.org/3/library/stdtypes.html#str.encode
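
A custom handler can be registered with `codecs.register_error` and then passed by name as the `errors` argument. The following is a sketch only, assuming Python 3; the handler name `"underscore"` and the function `replace_with_underscore` are hypothetical.

```python
import codecs


def replace_with_underscore(error):
    """Hypothetical handler: emit one '_' per character that cannot be encoded."""
    if not isinstance(error, UnicodeEncodeError):
        # This sketch only handles encoding errors; re-raise anything else.
        raise error
    # error.start and error.end delimit the unencodable run in the input.
    replacement = "_" * (error.end - error.start)
    # Return the replacement and the position at which encoding should resume.
    return (replacement, error.end)


codecs.register_error("underscore", replace_with_underscore)

print("aあä".encode("ascii", "underscore"))  # b'a__'
```

The registration is process-wide, so a distinctive handler name helps avoid clashing with other code.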