How to read Arabic text from a PDF using a Python script

Question

10 views · February 23, 2023

Anonymous · February 24, 2023

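I have code written in Python that reads PDF files and converts them to text files.

The problem appears when I try to read Arabic text from a PDF file. I know the error comes from the encoding and decoding step, but I don't know how to fix it.

The script does process the Arabic PDF file, but the resulting text file is empty and the following error is shown: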

Traceback (most recent call last):
  File "C:\Users\test\Downloads\pdf-txt\text maker.py", line 68, in <module>
    f.write(content)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 50: ordinal not in range(128)

Code:

import os
from os import chdir, getcwd, listdir, path
import codecs
import pyPdf
from time import strftime
def check_path(prompt):
    ''' (str) -> str
    Verify that the supplied absolute path exists.
    '''
    abs_path = raw_input(prompt)
    while path.exists(abs_path) != True:
        print "\nThe specified path does not exist.\n"
        abs_path = raw_input(prompt)
    return abs_path    
print "\n"
folder = check_path("Provide absolute path for the folder: ")
list=[]
directory=folder
for root,dirs,files in os.walk(directory):
    for filename in files:
        if filename.endswith('.pdf'):
            t=os.path.join(directory,filename)
            list.append(t)
m=len(list)
print (m)
i=0
while i<=m-1:
    path=list[i]
    print(path)
    head,tail=os.path.split(path)
    var="\\"
    tail=tail.replace(".pdf",".txt")
    name=head+var+tail
    content = ""
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
            # Iterate pages
    for j in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(j).extractText() + "\n"
    print strftime("%H:%M:%S"), " pdf  -> txt "
    f=open(name,'w')
    content.encode('utf-8')
    f.write(content)
    f.close
    i=i+1

2 Answers

Anonymous · Answer 1 · 2023-03-12T21:52:12+00:00

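Reading a PDF that contains Arabic text is a special case that traditional PDF libraries such as pypdf or PyPDF2 may handle poorly. One way around this is to use a different library, pdfplumber.

pdfplumber is a PDF processing library that makes it easy to extract text from a PDF. For Arabic text, two additional libraries are useful: arabic_reshaper and bidi.

Here is an example Python script that uses pdfplumber, arabic_reshaper and bidi: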

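import arabic_reshaper
import pdfplumber
from bidi.algorithm import get_display

with pdfplumber.open(r'example.pdf') as pdf:
    # pages is zero-indexed, so index 10 is the eleventh page
    my_page = pdf.pages[10]
    page_text = my_page.extract_text()
    # Join the Arabic letters into their contextual (connected) forms
    reshaped_text = arabic_reshaper.reshape(page_text)
    # Reorder the characters for right-to-left display
    bidi_text = get_display(reshaped_text)
    print(bidi_text)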

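This script opens the PDF file named example.pdf and extracts the text of pdf.pages[10]. Page indexes start at 0, so this is the eleventh page of the document. It then reshapes the text with arabic_reshaper so that the Arabic letters take their correct connected forms. Finally, it passes the result through bidi so that the text is laid out in the correct right-to-left direction.

With pdfplumber, arabic_reshaper and bidi together, Arabic text can be read from a PDF and displayed correctly.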
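
The question converts whole files rather than a single page, so a rough sketch of that variant may help. It assumes Python 3 and an installed pdfplumber; the helper names txt_path_for, pages_to_text and convert_pdf are made up for this illustration. Reshaping is left out here because it is only needed for display, and whether a saved file needs it depends on the program used to view it.

```python
import os


def txt_path_for(pdf_path):
    """Return the path of a .txt file with the same base name as pdf_path."""
    root, _ext = os.path.splitext(pdf_path)
    return root + ".txt"


def pages_to_text(pages):
    """Join the text of every page, one page per line group.

    A page with no extractable text contributes an empty string,
    whether extract_text() gives back None or "".
    """
    return "\n".join((page.extract_text() or "") for page in pages)


def convert_pdf(pdf_path):
    """Extract all text from pdf_path and save it as UTF-8 next to the PDF."""
    import pdfplumber  # third-party package, imported here so the helpers above work without it

    with pdfplumber.open(pdf_path) as pdf:
        content = pages_to_text(pdf.pages)

    out_path = txt_path_for(pdf_path)
    # An explicit encoding avoids falling back to a platform default codec.
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(content)
    return out_path
```

Calling convert_pdf("example.pdf") would then write example.txt beside the PDF.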

Anonymous · Answer 2 · 2023-06-19T16:58:39+00:00

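There are a few things to watch for in your code. First, calling content.encode('utf-8') on its own line has no effect: encode returns the encoded content, and that return value is discarded unless you assign it to a variable. A better approach is to open the file with an explicit encoding and write the Unicode string to it, for example: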

import io
f = io.open(name,'w',encoding='utf8')
f.write(content)
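
To see why the bare encode call changes nothing, here is a small sketch written for Python 3 (the question's script is Python 2, where the same idea applies to unicode objects). The sample string and the names encoded and ascii_failed are only for illustration.

```python
content = "مرحبا ©"  # sample text: an Arabic word plus the copyright sign

# encode() returns a new bytes object; the original string is left unchanged,
# so the result has to be assigned to be of any use.
encoded = content.encode("utf-8")

# The ASCII codec cannot represent these characters. This is the same kind
# of failure that the traceback in the question reports.
try:
    content.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```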

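Second, if the file is not closed properly you may see no content at all, because the data was never flushed to disk. Your code has f.close rather than f.close(), so the method is never actually called. A better approach is a with statement, which guarantees the file is closed when the block exits, for example: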

import io
with io.open(name,'w',encoding='utf8') as f:
    f.write(content)

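In Python 3 you do not need to import and use io.open; the built-in open is equivalent. In Python 2 you need the io.open form.

If the text file shows unreadable characters, the program you use to view or edit it may be assuming the wrong encoding. On Windows in particular, many programs default to a localized encoding, such as Windows-1252 on US-English Windows. You can write with the utf-8-sig encoding, which adds a byte order mark (BOM) signature at the start of the file; some programs recognize it and switch to UTF-8. You mentioned that you use PDF Complete for the PDF files and Notepad++ for the text files. Try with io.open(name, 'w', encoding='utf-8-sig') and see whether Notepad++ then displays the text correctly.

With these changes, you should be able to read Arabic text from a PDF file with a Python script and save it as a text file.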
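
The effect of utf-8-sig can be checked directly. The following is a minimal sketch for Python 3; the function names write_with_bom, starts_with_bom and demo, and the file name sample.txt, are invented for this example.

```python
import codecs
import os
import tempfile


def write_with_bom(path, text):
    """Write text as UTF-8 with a leading byte order mark."""
    with open(path, "w", encoding="utf-8-sig") as f:
        f.write(text)


def starts_with_bom(path):
    """Return True if the file begins with the UTF-8 BOM bytes EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def demo():
    """Write a sample file in a temporary folder and inspect it."""
    text = "مرحبا"
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "sample.txt")
        write_with_bom(path, text)
        has_bom = starts_with_bom(path)
        # Reading with utf-8-sig strips the BOM again.
        with open(path, "r", encoding="utf-8-sig") as f:
            round_trip_ok = f.read() == text
    return has_bom and round_trip_ok
```

Reading the file back with plain utf-8 instead would leave the BOM in the text as a leading U+FEFF character, so the reader should use utf-8-sig as well.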