Converting Python filenames to Unicode

Question

10 views · January 6, 2023

Anonymous · January 7, 2023

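I am using Python 2.6 on Windows and reading a file tree with os.walk. The filenames may contain non-7-bit characters (the German "ae", for example). These are encoded in Python's internal string representation. I process the filenames with Python library functions, and that fails because of the wrong encoding. How can I convert these filenames to proper (unicode?) Python strings? I have a file "d:\utest\ü.txt". Passing the path as unicode does not work: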

>>> list(os.walk('d:\\utest'))
[('d:\\utest', [], ['\xfc.txt'])]
>>> list(os.walk(u'd:\\utest'))
[(u'd:\\utest', [], [u'\xfc.txt'])]

3 Answers

Anonymous · Answer 1 · 2023-09-12T21:23:06+00:00

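The problem arises because the encoding of a filename may not match the encoding Python uses by default. In Python 2, byte-string filenames are in the filesystem encoding, while implicit conversions use the default encoding (ASCII). Using such a name directly can therefore raise encoding errors or produce unrecognizable characters.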
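It may help to see what the `\xfc` escape in the question's two results stands for. The sketch below is only an illustration (the variable names are made up here): in a Unicode string `\xfc` is the code point U+00FC, while in a byte string it is one raw byte whose meaning depends on the codec. Latin-1 is used as an assumed codec because it maps that byte to the same character.

```python
import unicodedata

text_name = u"\xfc.txt"   # Unicode string: \xfc is the code point U+00FC
byte_name = b"\xfc.txt"   # byte string: \xfc is a single raw byte

# Ask the Unicode database which character the first code point is.
print(unicodedata.name(text_name[0]))

# The byte string equals the text only after decoding with a suitable codec.
print(byte_name.decode("latin-1") == text_name)
```

Seen this way, the `u'\xfc.txt'` result in the question is the escaped repr of a Unicode string that already contains the ü.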

To solve this, you can convert the filenames to Unicode as follows:

1. First, determine the filesystem encoding. The following code retrieves it:

import sys
filesystem_encoding = sys.getfilesystemencoding()

2. Next, use that filesystem encoding to decode the filename into a Unicode string. For example:

unicode_name = unicode(filename, filesystem_encoding, errors="ignore")

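Here `filename` is the byte-string filename to convert and `filesystem_encoding` is the value obtained in step 1. Note that `unicode()` exists only in Python 2, and that `errors="ignore"` silently drops any bytes that cannot be decoded.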
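The two steps can be combined into a small helper. This is a minimal sketch, not the answer's own code: `to_unicode` is a hypothetical name, and `cp1252` is only an assumed encoding for a Western European Windows system. It uses `bytes.decode`, so it also runs on Python 3.

```python
import sys

def to_unicode(name, encoding=None, errors="strict"):
    """Return name as a text string, decoding byte strings if needed."""
    if encoding is None:
        # Fall back to the encoding Python uses for file names on this system.
        encoding = sys.getfilesystemencoding()
    if isinstance(name, bytes):
        return name.decode(encoding, errors)
    return name  # already a text string, nothing to do

raw = b"\xfc.txt"  # byte-string name as shown in the question

# Strict decoding with the assumed Windows code page keeps the umlaut.
print(repr(to_unicode(raw, "cp1252")))

# With errors="ignore" the undecodable byte is dropped without warning.
print(repr(to_unicode(raw, "ascii", errors="ignore")))
```

The second call returns just `.txt`, which is why strict decoding (or `errors="replace"`) is usually the safer default when the resulting name must still identify the file.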

The opposite case is converting a Unicode filename into another encoding such as UTF-8. The following code does that:

unicode_name.encode("utf-8")

Here `unicode_name` is the Unicode filename to convert and `"utf-8"` is the target encoding.
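As a quick illustration of the encoding direction, the sketch below encodes the question's filename with two codecs and decodes it again. The variable names are illustrative, and Latin-1 is included only for comparison.

```python
name = u"\xfc.txt"                     # the Unicode file name from the question
utf8_bytes = name.encode("utf-8")      # the umlaut takes two bytes in UTF-8
latin1_bytes = name.encode("latin-1")  # and a single byte in Latin-1

print(repr(utf8_bytes), repr(latin1_bytes))

# Decoding with the matching codec recovers the original string.
assert utf8_bytes.decode("utf-8") == name
assert latin1_bytes.decode("latin-1") == name
```

The same text yields different byte sequences depending on the codec, so the bytes are only meaningful together with the name of the encoding that produced them.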

With these two conversions, filenames can be turned into Unicode strings and handled correctly regardless of the encoding they arrived in.

Anonymous · Answer 2 · 2023-09-22T02:18:15+00:00

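Cause:

In Python 3.0+, if os.walk() is given a path of type bytes, the filenames it yields are bytes as well.

Solution:

1. Avoid a backslash immediately before the closing quote of a string literal; Python raises a SyntaxError for it.

2. Pass a Unicode path to os.walk() rather than a bytes path.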

Note that in Python 3.5, os.scandir() does not support bytes paths on Windows, so a Unicode path is required there. Otherwise you get the error TypeError: os.scandir() doesn't support bytes path on Windows, use Unicode instead.
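The str-in, str-out and bytes-in, bytes-out behaviour can be checked with a short Python 3 sketch. The helper `walk_names` and the scratch directory are made up for this illustration; `os.fsencode` and `os.fsdecode` are the standard-library functions for converting between the two forms.

```python
import os
import tempfile

def walk_names(top):
    """Return a flat list of the file names os.walk yields under top."""
    names = []
    for _root, _dirs, files in os.walk(top):
        names.extend(files)
    return names

# Hypothetical scratch directory holding one file with a non-ASCII name.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "\xfc.txt"), "w") as handle:
    handle.write("demo")

str_names = walk_names(demo_dir)                 # str path -> str names
bytes_names = walk_names(os.fsencode(demo_dir))  # bytes path -> bytes names

# os.fsdecode turns the bytes names back into str using the filesystem encoding.
decoded = [os.fsdecode(name) for name in bytes_names]
print(str_names, bytes_names, decoded)
```

On Windows with Python 3.5 the bytes call would fail with the TypeError quoted above, which is one more reason to stay with str paths.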

Anonymous · Answer 3 · 2023-03-26T12:05:06+00:00

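The problem arises because, in Python 2, when a Unicode string is passed to `os.walk()` and a filename cannot be decoded, you may get a byte string back instead of a Unicode string.

The solution is to get the filesystem encoding with `sys.getfilesystemencoding()` and encode the string passed to `os.walk()` with it. Given a byte-string path, `os.walk()` returns every name as a byte string, which can then be decoded explicitly.

Here is example code for this approach: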

import os
import sys

def convert_filenames_to_unicode(path):
    """Walk path (Python 2) and yield (root, dirs, files) as unicode strings."""
    encoding = sys.getfilesystemencoding()
    # Walk with a byte-string path so every name comes back as a byte string.
    if isinstance(path, unicode):
        path = path.encode(encoding)
    for root, dirs, files in os.walk(path):
        # Decode explicitly; an undecodable name raises UnicodeDecodeError here.
        unicode_root = root.decode(encoding)
        unicode_dirs = [d.decode(encoding) for d in dirs]
        unicode_files = [f.decode(encoding) for f in files]
        yield (unicode_root, unicode_dirs, unicode_files)

# Usage example
for entry in convert_filenames_to_unicode(ur'C:\example'):
    print(entry)

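By walking with a path encoded in `sys.getfilesystemencoding()`, every filename is returned as a byte string and decoded in one place, so names that cannot be decoded surface as an explicit error instead of a silent mix of string types.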
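If a single undecodable name should not abort the whole walk, the decoding step can fall back to a lossy mode. The following is a hedged sketch rather than part of the answer: `decode_name` is a hypothetical helper, and UTF-8 is an assumed encoding chosen so that the second sample name fails to decode.

```python
def decode_name(raw, encoding):
    """Decode a byte-string file name, reporting whether it decoded cleanly."""
    try:
        return raw.decode(encoding), True
    except UnicodeDecodeError:
        # Keep going, but mark the bad bytes with U+FFFD instead of raising.
        return raw.decode(encoding, "replace"), False

names = [b"readme.txt", b"\xfc.txt"]  # the second name is not valid UTF-8
for raw in names:
    print(decode_name(raw, "utf-8"))
```

The boolean flag lets the caller decide whether to log, skip, or keep the original bytes for names that did not decode cleanly, since the replaced name can no longer be used to open the file.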