Python code to convert a scanned PDF to text

Question

Anonymous · 2023-05-27

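I have a scanned PDF file and I want to extract the text from it.

I tried running OCR with pypdfocr, but I got this error:

"Could not find Ghostscript in the usual place"

After searching, I found this solution: Linking Ghostscript to pypdfocr in Windows Platform. I downloaded Ghostscript and added it to my environment variables, but I still get the same error.

How can I search for text in my scanned PDF file using Python?

Thanks.

Edit: here is a sample of my code: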

import os
import sys
import re
import json
import shutil
import glob
from pypdfocr import pypdfocr_gs
from pypdfocr import pypdfocr_tesseract 
from PIL import Image
path = PATH_TO_MY_SCANNED_PDF
mainL = []
kk = {}
def new_init(self, kk):
    self.lang = 'heb'   
    self.binary = "tesseract"
    self.msgs = {
            'TS_MISSING': """ 
                Could not execute %s
                Please make sure you have Tesseract installed correctly
                """ % self.binary,
            'TS_VERSION':'Tesseract version is too old',
            'TS_img_MISSING':'Cannot find specified tiff file',
            'TS_FAILED': 'Tesseract-OCR execution failed!',
        }
pypdfocr_tesseract.PyTesseract.__init__ = new_init  
wow = pypdfocr_gs.PyGs(kk)
tt = pypdfocr_tesseract.PyTesseract(kk)
def secFile(filename,oldfilename):
    wow.make_img_from_pdf(filename)
    files = glob.glob("X:/e206333106/ocr-114/balagan/" + '*.jpg')  
    for file in files:
        im = Image.open(file)
        im.save(file + ".tiff") 
    files = glob.glob("PATH" + '*.tiff')  
    for file in files:
        tt.make_hocr_from_pnm(file)
    pdftxt = ""    
    files = glob.glob("PATH" + '*.html') 
    for file in files:
        with open(file) as myfile:
            pdftxt = pdftxt + "#" + "".join(line.rstrip() for line in myfile)
    findNum(pdftxt,oldfilename)
    folder ="PATH"
    for the_file in os.listdir(folder):
        file_path = os.path.join(folder, the_file)
        try:
            if os.path.isfile(file_path):
                os.unlink(file_path)
        except Exception as e:
            print(e)
def pdf2ocr(filename):
    pdffile = filename
    os.system('pypdfocr -l heb ' + pdffile)
def ocr2txt(filename):  
    pdffile = filename
    output1 = pdffile.replace(".pdf","_ocr.txt")
    output1 = "PATH" + os.path.basename(output1)
    input1 = pdffile.replace(".pdf","_ocr.pdf")
    os.system("pdf2txt -o " + output1 + " " + input1)
    with open(output1) as myfile:
        pdftxt="".join(line.rstrip() for line in myfile)
    findNum(pdftxt,filename)
def findNum(pdftxt,pdffile):
    l = re.findall(r'\b\d+\b', pdftxt)
    output = open('PATH' + os.path.basename(pdffile) + '.txt', 'w')
    for i in l:
        output.write(",")
        output.write(i)
    output.close()    
def is_ascii(s):
    return all(ord(c) < 128 for c in s)
i = 0     
files = glob.glob(path + '\\*.pdf') 
print(path)
print(files)
for file in files:
    if file.endswith(".pdf"):
        if is_ascii(file):
            print(file)
            pdf2ocr(file)    
            ocr2txt(file)
        else:
            newname = "PATH" + str(i) + ".pdf"
            shutil.copyfile(file, newname)
            print(newname)
            secFile(newname,file)
        i = i + 1
files = glob.glob(path + '\\' + '*_ocr.pdf')         
for file in files:
    print(file)
    shutil.copyfile(file, "PATH" + os.path.basename(file))
    os.remove(file)

3 Answers

Anonymous · Answer 1 · 2023-07-03T04:11:50+00:00

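In this code, the author is trying to convert a scanned PDF to text. He uses several Python libraries, including wand, PIL, and pytesseract. He first converts the PDF to JPEG images and then uses pytesseract to extract the text from those images. However, he ran into some problems along the way.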

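One cause was a permissions setting. The author found that he had to change ImageMagick's policy so that it could read and write PDF files. To fix this, he opened a terminal and edited /etc/ImageMagick-6/policy.xml, changing the rights on the PDF line to "read|write".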

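Another problem was a memory limit. The author mentioned that he ran into trouble when converting the images extracted from the PDF into text. To fix this, he raised the memory limit.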
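The two ImageMagick changes described above can be sketched in Python. This is only a minimal sketch: it assumes the memory limit is the `memory` resource entry in the same policy.xml (the answer does not say where the limit was raised). The function names, the sample excerpt, and the 2GiB value are hypothetical, and the policy lines follow the shape commonly shipped in /etc/ImageMagick-6/policy.xml.

```python
import re


def allow_pdf_coder(policy_xml, rights="read|write"):
    """Return policy_xml with the rights of the PDF coder policy replaced."""
    pattern = re.compile(
        r'(<policy\s+domain="coder"\s+rights=")[^"]*("\s+pattern="PDF"\s*/>)'
    )
    return pattern.sub(lambda m: m.group(1) + rights + m.group(2), policy_xml)


def set_resource_limit(policy_xml, name, value):
    """Return policy_xml with the value of one resource policy replaced."""
    pattern = re.compile(
        r'(<policy\s+domain="resource"\s+name="'
        + re.escape(name)
        + r'"\s+value=")[^"]*(")'
    )
    return pattern.sub(lambda m: m.group(1) + value + m.group(2), policy_xml)


# Hypothetical excerpt in the shape of a stock policy.xml.
SAMPLE_POLICY = """<policymap>
  <policy domain="resource" name="memory" value="256MiB"/>
  <policy domain="coder" rights="none" pattern="PDF" />
</policymap>"""

# Allow PDF read/write, then raise the (assumed) memory limit.
updated = set_resource_limit(allow_pdf_coder(SAMPLE_POLICY), "memory", "2GiB")
print(updated)
```

To apply this to the real file, one would read its text, pass it through both functions, and write it back with root privileges, keeping a backup of the original first.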

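In addition, the author mentioned a further requirement: extracting the position, font, size, and similar properties of the text so that he can create a PDF that contains the text.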
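One possible way to get at word positions and sizes is pytesseract's `image_to_data`, which returns one bounding box per recognized word. The sketch below is an assumption about how that requirement could be approached, not the author's method. The helper names are hypothetical, and the font name is not part of this output; only the box geometry and a confidence score are.

```python
def words_with_boxes(data, min_conf=0.0):
    """Turn the dict returned by image_to_data into a list of word records.

    Rows with empty text (page, block, and line entries) are skipped.
    """
    words = []
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue
        conf = float(data["conf"][i])  # may arrive as str, int, or float
        if conf < min_conf:
            continue
        box = (data["left"][i], data["top"][i],
               data["width"][i], data["height"][i])
        words.append({"text": text, "conf": conf, "box": box})
    return words


def ocr_word_boxes(image, lang="eng"):
    """Run Tesseract on a PIL image and return word records with boxes."""
    # Imported here so the pure helper above works without Tesseract installed.
    import pytesseract

    data = pytesseract.image_to_data(
        image, lang=lang, output_type=pytesseract.Output.DICT
    )
    return words_with_boxes(data)
```

If the end goal is a searchable PDF rather than raw coordinates, `pytesseract.image_to_pdf_or_hocr(image, extension='pdf')` returns the bytes of a PDF with a text layer for one page, which may already cover this requirement.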

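In short, to convert the scanned PDF to text, the author needs to edit ImageMagick's permission settings and raise the memory limit. He also still needs a way to extract the position, font, and size of the text.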

Anonymous · Answer 2 · 2023-08-22T09:15:45+00:00

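Converting a scanned PDF file to text is a common need in Python, but the conversion sometimes runs into problems. Below is the problem one user hit when trying the pypdfocr library, and how it was approached.

The user tried to convert a PDF file to text with pypdfocr, but the PDF also contains images. The library may be unable to analyze the page content stream because some scanners split a single scanned page into several images, so the text cannot be obtained through Ghostscript.

The user ran pypdfocr filename.pdf on the command line and got this error: ERROR: Could not find Ghostscript in the usual place; please specify it using your config file.

Asked which operating system he was using, the user replied that it was 64-bit Windows.

He was then asked whether he had installed ghostscript with pip, and was given the install command: pip install ghostscript.

After he tried installing the 32-bit version of Ghostscript, the problem persisted.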

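From this exchange we can conclude that the user could not get pypdfocr to find Ghostscript. Installing the 32-bit version of Ghostscript alone did not resolve it, so a possible fix is to specify the path of the installed Ghostscript executable in the config file, as the error message suggests.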
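Before editing any config file, it can help to check whether a Ghostscript executable is visible on PATH at all. The sketch below is a standalone Python 3 diagnostic, not part of pypdfocr; the function name is hypothetical, and the candidate names are the usual console executables (gswin64c and gswin32c on Windows, gs elsewhere). Note that pip install ghostscript installs Python bindings only, not the Ghostscript program itself.

```python
import shutil

# Usual names of the Ghostscript console executable.
GHOSTSCRIPT_NAMES = ("gswin64c", "gswin32c", "gs")


def find_ghostscript(names=GHOSTSCRIPT_NAMES):
    """Return the full path of the first Ghostscript executable on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    found = find_ghostscript()
    if found:
        print("Ghostscript found at: " + found)
    else:
        print("No Ghostscript executable on PATH; add its bin folder to PATH "
              "or give the full path in the config file.")
```

If this prints a path in a fresh terminal but pypdfocr still fails, the full path it reports is the value to put in the config file.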

Anonymous · Answer 3 · 2023-07-03T13:55:49+00:00

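Converting a scanned PDF to text is a common need. You can run OCR with Python's pytesseract library and export each page of the PDF to a text file. The reasons and the solution are described below.

Reasons:

- A scanned PDF is usually made of images, so its text cannot be manipulated or searched directly.

- The scanned PDF has to be converted to text before it can be processed and analyzed.

Solution:

1. Install the required packages: pytesseract, tesseract, and pdf2image. You can install them with the following commands: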

conda install -c conda-forge pytesseract

conda install -c conda-forge tesseract

pip install pdf2image

2. Import the required libraries, pytesseract and pdf2image, along with glob:

import pytesseract
from pdf2image import convert_from_path
import glob

3. Get the paths of the PDF files with glob:

pdfs = glob.glob(r"yourPath\*.pdf")

4. Loop over the PDF files and convert them to text: pdf2image converts each page of a PDF to an image, then pytesseract converts the image to text, which is saved to a text file:

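for pdf_path in pdfs:
    # Render every page of the PDF to a PIL image at 500 DPI.
    pages = convert_from_path(pdf_path, 500)
    for pageNum,imgBlob in enumerate(pages):
        # OCR the page image; change lang to match the document's language.
        text = pytesseract.image_to_string(imgBlob,lang='eng')
        # Write one UTF-8 text file per page next to the source PDF.
        with open(f'{pdf_path[:-4]}_page{pageNum}.txt', 'w', encoding='utf-8') as the_file:
            the_file.write(text)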
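The loop above could also be wrapped in a reusable function. The following is a sketch under stated assumptions: the function names are hypothetical, the default `lang="heb"` reflects the Hebrew setting in the question and requires the matching Tesseract language data to be installed, and pages are numbered from 1 rather than 0.

```python
from pathlib import Path


def page_text_path(pdf_path, page_num):
    """Build the output path, e.g. report.pdf -> report_page1.txt in the same folder."""
    pdf = Path(pdf_path)
    return pdf.with_name(f"{pdf.stem}_page{page_num}.txt")


def ocr_pdf_to_text_files(pdf_path, lang="heb", dpi=300):
    """OCR every page of a scanned PDF and write one UTF-8 text file per page."""
    # Imported here so the path helper works without the OCR stack installed.
    import pytesseract
    from pdf2image import convert_from_path

    written = []
    pages = convert_from_path(str(pdf_path), dpi=dpi)
    for page_num, image in enumerate(pages, start=1):
        text = pytesseract.image_to_string(image, lang=lang)
        out_path = page_text_path(pdf_path, page_num)
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

A lower DPI such as 300 is usually faster and uses less memory than 500; whether it is accurate enough depends on the scan quality, so treat the default as a starting point.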

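That covers why and how to convert a scanned PDF to text. With the required packages installed, pytesseract and pdf2image together make it straightforward to turn a scanned PDF into text for further processing and analysis.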