Automatically converting PDFs to images

Question

14 views · January 9, 2023

Anonymous · January 9, 2023

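So my state has released a batch of data as PDFs, but to make matters worse, most (or all?) of the PDFs appear to be documents that were typed up in an office, printed or faxed, and then scanned (government at its finest). At first I thought I was going crazy, but then I started seeing a lot of 'skewed' PDFs, as if someone hadn't placed them properly on the scanner. So I figure the best approach is to convert each page to an image rather than trying to extract the actual text.

Obviously this needs to be automated, and I'd prefer to use Python if possible. If Ruby or Perl has some can't-miss implementation, I could try that too. I tried pyPDF for text extraction, but that obviously didn't help much. I also tried swftools, but the images I got from it were almost completely unusable; the fonts seem to get mangled in the conversion. I don't even care about the image format, as long as the images are relatively lightweight and readable.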

3 Answers

Anonymous · Answer 1 · 2023-08-04T16:50:32+00:00

Ghostscript is one of the most popular tools for converting PDFs to images automatically. It is reliable and flexible, with many configurable options, and it is available under either the AGPL or a commercial license.

There are two main ways to use Ghostscript for PDF-to-image conversion: the command line or its native API. The command line lets you run Ghostscript directly from a terminal or a script, which is convenient for one-off conversions and simple automation. The native API lets you drive Ghostscript programmatically from within your own application code.
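The command-line route can be sketched from Python roughly as follows. This is a minimal sketch, not a definitive recipe: it assumes the Ghostscript executable is on the PATH under the name `gs` (on Windows it is usually `gswin64c`), and the function names, the `page-%03d.png` naming pattern, and the 150 dpi default are illustrative choices, not anything prescribed by the answer.

```python
import subprocess
from pathlib import Path
from typing import List


def build_gs_command(pdf_path, output_dir, dpi=150, device="png16m") -> List[str]:
    """Build a Ghostscript command that renders every page of a PDF to PNG.

    Assumption: the executable is called "gs". "png16m" is 24-bit colour;
    "pnggray" gives smaller greyscale files, which may suit scanned text.
    """
    # %03d is expanded by Ghostscript to the page number (001, 002, ...).
    output_pattern = Path(output_dir) / "page-%03d.png"
    return [
        "gs",
        "-dSAFER",    # restrict file access while interpreting the PDF
        "-dBATCH",    # exit after the last file instead of waiting for input
        "-dNOPAUSE",  # do not pause between pages
        f"-sDEVICE={device}",
        f"-r{dpi}",   # rendering resolution in dots per inch
        f"-sOutputFile={output_pattern}",
        str(pdf_path),
    ]


def pdf_to_png_with_gs(pdf_path, output_dir, dpi=150) -> List[Path]:
    """Run Ghostscript and return the generated PNG paths in page order."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(build_gs_command(pdf_path, out, dpi), check=True)
    return sorted(out.glob("page-*.png"))
```

Calling `pdf_to_png_with_gs("scan.pdf", "pages")` would then leave one PNG per page in a `pages` directory; both names here are placeholders.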

The following resources are helpful for getting started with Ghostscript:

- Ghostscript Main Website: general information about the tool, its features, and the latest releases.

- Ghostscript docs on Command line usage: the available command-line options and how to use them.

- Stack Overflow thread: practical examples of invoking Ghostscript's command-line interface from Python, useful if you want to integrate Ghostscript into a Python application.

- Ghostscript API Documentation: how to use Ghostscript's native API to convert PDFs to images programmatically, with examples, function descriptions, and usage guidelines.

Either route lets you automate the conversion, whether it is a one-time job or a recurring task.

Anonymous · Answer 2 · 2023-09-19T15:41:12+00:00

Converting PDFs to images automatically is a common requirement, but it gets harder on a shared server that does not allow you to install tools like ImageMagick or Ghostscript.

One possible solution is the command-line tool "pdftoppm". It is available for various operating systems and can be called from the command line or through Python's subprocess module. It converts each page of the PDF to a PPM (Portable Pixmap) file.
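Calling pdftoppm through the subprocess module might look like the sketch below. It assumes a `pdftoppm` binary is already on the PATH; the function names, the `page` output prefix, and the 150 dpi default are illustrative. The exact zero-padding of the page number in the output file names differs between pdftoppm builds, so the sketch simply globs for whatever was written.

```python
import subprocess
from pathlib import Path
from typing import List


def build_pdftoppm_command(pdf_path, output_prefix, dpi=150) -> List[str]:
    """Build a pdftoppm command: one PPM file per page, named <prefix>-<page>.ppm."""
    # -r sets the rendering resolution in dots per inch.
    return ["pdftoppm", "-r", str(dpi), str(pdf_path), str(output_prefix)]


def pdf_to_ppm(pdf_path, output_dir, dpi=150) -> List[Path]:
    """Run pdftoppm and return the PPM files it produced, sorted by name."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    prefix = out / "page"  # hypothetical prefix chosen for this sketch
    # check=True raises CalledProcessError if pdftoppm exits with an error.
    subprocess.run(build_pdftoppm_command(pdf_path, prefix, dpi), check=True)
    return sorted(out.glob("page-*.ppm"))
```

Higher resolutions give more legible scans at the cost of much larger PPM files, since PPM is uncompressed.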

The resulting PPM files still have to be converted to the desired format, such as PNG or JPG. ImageMagick would normally do that, but since it is not available here, another method is needed.

After hours of research and experimentation, the best approach turned out to be sticking with pdftoppm. It is fast and reliable, and although it produces PPM files by default, they are easy to convert to other formats in Python.

To convert PPM files to PNG or JPG, use a Python library such as Pillow or OpenCV. Both can read PPM files and save them in other formats, including PNG and JPG, so the conversion does not depend on external tools like ImageMagick.
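With Pillow, the conversion step could be sketched like this. The function name and the default `.png` suffix are assumptions made for illustration; the sketch relies on Pillow inferring the output format from the file extension.

```python
from pathlib import Path

from PIL import Image


def convert_ppm(ppm_path, suffix=".png") -> Path:
    """Save a PPM file next to the original in the format implied by `suffix`.

    Pass suffix=".jpg" for smaller, lossy files; ".png" is lossless.
    """
    ppm_path = Path(ppm_path)
    out_path = ppm_path.with_suffix(suffix)
    with Image.open(ppm_path) as img:
        # Pillow picks the encoder (PNG, JPEG, ...) from the extension.
        img.save(out_path)
    return out_path
```

Once the converted file has been written, the original PPM can be deleted to save disk space.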

Overall, when ImageMagick and Ghostscript are off limits, you can still convert PDFs to images reliably by using the pdftoppm command-line tool to render the pages and a Python library like Pillow or OpenCV to handle the final format conversion.

Anonymous · Answer 3 · 2023-05-20T09:45:40+00:00

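The problem arises because the data in each PDF is really just one giant image, wrapped in PDF cruft so that it is readable in Acrobat. Converting the PDF to an image is therefore not the best solution; you should extract the image from the PDF instead.

The fix is to find the images inside the PDF and copy their bytes out. You can try the provided code to extract JPG images. For various reasons it may not work on every PDF, but when it does, it is a quick and painless way to get the image data out of a PDF file.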
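The code that answer refers to is not reproduced on this page. The sketch below is therefore only an illustration of the byte-copying idea, not the answer author's code: it scans for PDF stream objects whose data begins with the JPEG start-of-image marker and copies everything up to the last end-of-image marker before `endstream`. It assumes the images are stored as raw JPEG data in unencrypted streams, and the function name and the 20-byte search window are arbitrary choices.

```python
from typing import List

JPEG_START = b"\xff\xd8"  # JPEG start-of-image (SOI) marker
JPEG_END = b"\xff\xd9"    # JPEG end-of-image (EOI) marker


def extract_jpegs(pdf_bytes: bytes) -> List[bytes]:
    """Return the raw bytes of every JPEG found directly inside a PDF stream."""
    images = []
    pos = 0
    while True:
        stream_at = pdf_bytes.find(b"stream", pos)
        if stream_at < 0:
            break
        # Stream data starts just after the keyword and a line break, so a
        # JPEG must begin within a few bytes of it (20 is a loose window).
        start = pdf_bytes.find(JPEG_START, stream_at, stream_at + 20)
        if start < 0:
            pos = stream_at + len(b"stream")  # not a JPEG stream; move on
            continue
        end_kw = pdf_bytes.find(b"endstream", start)
        if end_kw < 0:
            break  # truncated file: no closing keyword
        # Take the last EOI marker that appears before "endstream".
        end = pdf_bytes.rfind(JPEG_END, start, end_kw)
        if end >= 0:
            images.append(pdf_bytes[start:end + len(JPEG_END)])
        pos = end_kw + len(b"endstream")
    return images
```

Each returned item can be written to disk as a `.jpg` file as-is. Images stored in any other encoding are silently skipped, because their streams do not start with the JPEG marker.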

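However, someone who ran the code on their own PDF files ran into problems: the images in their PDFs were not JPGs, so the code could not find where an image starts.

Someone else tried several of the suggested solutions, but none of them worked for PDFs scanned by a Konica Bizhub copier. That copier slices each page into many small images (possibly TIFF), perhaps for OCR purposes, so extracting images from the PDF as described above does not work in that case.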

Finally, someone asked whether there is a way to get the image of a specific page from a PDF.
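One way to approach that last question is to render just the page you want rather than extract embedded images. The sketch below assumes the Poppler build of pdftoppm, whose `-f`/`-l` options select the first and last page, `-singlefile` suppresses the page-number suffix, and `-png` writes PNG directly; other builds may not accept all of these flags. The function names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def build_single_page_command(pdf_path, page, output_stem, dpi=150) -> List[str]:
    """Build a pdftoppm command that renders one page to <output_stem>.png."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return [
        "pdftoppm",
        "-f", str(page),  # first page to render
        "-l", str(page),  # last page to render (same page, so only one)
        "-singlefile",    # write <output_stem>.png with no page suffix
        "-png",
        "-r", str(dpi),
        str(pdf_path),
        str(output_stem),
    ]


def render_page(pdf_path, page, output_stem, dpi=150) -> Path:
    """Render a single page and return the path of the PNG that was written."""
    subprocess.run(build_single_page_command(pdf_path, page, output_stem, dpi), check=True)
    return Path(f"{output_stem}.png")
```

Because this renders the whole page, it also sidesteps the case where a scanner has split a page into many small embedded images.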