Converting Python-escaped Unicode sequences to UTF-8

Question

24 views · May 24, 2023

Anonymous · March 13, 2023


This question already has an answer:

Saving UTF-8 text with json.dumps as UTF-8, not as \u escape sequences

I am using Beautiful Soup. It gives me the text of some HTML nodes, but those nodes contain Unicode characters that end up as escape sequences in the resulting string.

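For example, take an HTML element with the following content:

50 €

Retrieved through Beautiful Soup with soup.find("h2").text, it becomes the string 50\u20ac, which is readable only in the Python console.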

When it is written to a JSON file, however, it becomes unreadable.

Note: I am saving it to the JSON file with the following code:

with open('file.json', 'w') as fp:
    json.dump(fileToSave, fp)

How can I convert these Unicode characters back to UTF-8, or to any other format that makes them readable?


2 Answers

Anonymous · Answer 1 · 2023-03-13T20:57:58+00:00

For Python 2.7, I think you can use codecs together with json.dump(obj, fp, ensure_ascii=False). For example:

import codecs
import json
with codecs.open(filename, 'w', encoding='utf-8') as fp:
    # obj is a 'unicode' which contains "50 €"
    json.dump(obj, fp, ensure_ascii=False)
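
To see what the flag changes without touching a file, here is a minimal in-memory sketch. It assumes Python 3 (not the 2.7 setup above), and the variable names text, escaped, and readable are only illustrative:

```python
import json

# "\u20ac" in a Python 3 string literal is the euro sign, so this is "50 €".
text = "50 \u20ac"

# Default behaviour (ensure_ascii=True): non-ASCII characters are escaped.
escaped = json.dumps(text)

# With ensure_ascii=False the character is kept as-is in the output string.
readable = json.dumps(text, ensure_ascii=False)

print(escaped)   # "50 \u20ac"
print(readable)  # "50 €"

# Both forms are valid JSON and decode to the same Python string.
assert json.loads(escaped) == json.loads(readable) == text
```

The escaped form is not corrupted data; it is simply the ASCII-only spelling of the same JSON string.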

Anonymous · Answer 2 · 2023-03-13T20:57:58+00:00

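Here is a small demo using Python 3. If you do not pass ensure_ascii=False when dumping to JSON, non-ASCII characters are written to the JSON as Unicode escape codes. This does not affect the ability to load the JSON, but it makes the .json file itself less readable.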

Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> html = '<element>50\u20ac</element>'
>>> html
'<element>50€</element>'
>>> soup = BeautifulSoup(html,'html')
>>> soup.find('element').text
'50€'
>>> import json
>>> with open('out.json','w',encoding='utf8') as f:
...  json.dump(soup.find('element').text,f,ensure_ascii=False)
...
>>> ^Z

Contents of out.json (encoded in UTF-8):

"50€"
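
If a file has already been written with escape sequences, one way to make it readable is to load it and dump it again. The sketch below assumes Python 3; rewrite_json_readable and its path arguments are hypothetical names for illustration, not part of the json module:

```python
import json


def rewrite_json_readable(src_path, dst_path):
    """Load a JSON file and write it back with non-ASCII characters unescaped.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # json.load decodes \uXXXX escapes into real characters automatically.
    with open(src_path, "r", encoding="utf-8") as src:
        data = json.load(src)

    # Open the output explicitly as UTF-8 so the result does not depend on
    # the platform's default encoding, then keep non-ASCII characters as-is.
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(data, dst, ensure_ascii=False)
```

Calling rewrite_json_readable('file.json', 'file_readable.json') would leave the data unchanged and only alter how it is spelled on disk.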