Example: Processing PDFs with Python and Generating Multi-Layer PDFs

Updated: 2021-06-29    Source: python


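Python offers many libraries for working with PDFs. This article, written for Python 3, tries out two of them for generating PDFs. PyPDF2 handles reading PDFs well, but I found no way to generate a multi-layer PDF with it. ReportLab looks more mature, and its Canvas makes it easy to generate a multi-layer PDF, so that content scanned in as an image can also be searched as text.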

ReportLab

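Generating a two-layer PDF

A two-layer PDF relies on the Canvas concept: draw the text first, then draw the image on top of it. The result is a PDF with two layers.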

  

from reportlab.lib.pagesizes import letter
from reportlab.lib.units import inch
from reportlab.pdfgen import canvas

image_file = "./42.png"

# Use Canvas to generate the PDF
c = canvas.Canvas('reportlab_canvas.pdf', pagesize=letter)
width, height = letter

c.setFillColorRGB(0, 0.77, 0.77)
# Draw the text layer first
c.drawString(3 * inch, 3 * inch, "Hello World")
# Then draw the image on top of the text as the second layer
c.drawImage(image_file, 0, 0)
c.showPage()
c.save()
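The example above draws the image at (0, 0) at its natural size. For a scanned page, the image usually needs to be scaled to the page so that it covers the text layer. The following is a minimal sketch of that calculation, not part of the original article: `fit_to_page` is a hypothetical helper name, and the page size is hard-coded as US Letter in PDF points (1 inch = 72 points) so that the block runs without ReportLab installed.

```python
# Page geometry in PDF points (1 inch = 72 points); US Letter is 8.5 x 11 inches.
LETTER_WIDTH, LETTER_HEIGHT = 8.5 * 72, 11.0 * 72  # 612.0 x 792.0


def fit_to_page(img_w, img_h, page_w=LETTER_WIDTH, page_h=LETTER_HEIGHT):
    """Return (x, y, width, height) that scales an image to fit the page.

    Hypothetical helper for illustration: it keeps the aspect ratio and
    centres the scaled image on the page.
    """
    if img_w <= 0 or img_h <= 0:
        raise ValueError("image dimensions must be positive")
    # Use the smaller ratio so the image fits in both directions.
    scale = min(page_w / img_w, page_h / img_h)
    width, height = img_w * scale, img_h * scale
    # Split the leftover space evenly to centre the image.
    x = (page_w - width) / 2
    y = (page_h - height) / 2
    return x, y, width, height


# Example: a 2550 x 3300 pixel scan (US Letter at 300 dpi).
x, y, w, h = fit_to_page(2550, 3300)
print(x, y, w, h)
```

The result can then be passed to the canvas as `c.drawImage(image_file, x, y, width=w, height=h)`, since `drawImage` accepts optional `width` and `height` arguments.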

PyPDF2

Reading a PDF

# PdfFileReader and PdfFileWriter are the PyPDF2 1.x class names
from PyPDF2 import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input1 = PdfFileReader(open("jquery.pdf", "rb"))

# print document info
print(input1.getDocumentInfo())

# print how many pages input1 has
print("jquery.pdf has %d pages." % input1.getNumPages())

# print the text content of the first page
page_content = input1.getPage(0).extractText()
print(page_content)

# add page 1 from input1 to the output document, unchanged
output.addPage(input1.getPage(0))

# add page 2 from input1, rotated clockwise by 90 degrees
output.addPage(input1.getPage(1).rotateClockwise(90))

# finally, write "output" to PyPDF2-output.pdf
with open("PyPDF2-output.pdf", "wb") as outputStream:
    output.write(outputStream)

extractText(self)

    Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

    Stability: Added in v1.7, will exist for all future v1.x releases. May be overhauled to provide more ordered text in the future.

    Returns: a unicode string object
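Because extractText works well for some files and poorly for others, it can help to check which pages return no text at all. Such pages may be scanned images without a text layer, although an empty result alone does not prove that. The sketch below is an illustration, not part of the original article: `pages_without_text`, `FakePage` and `FakeReader` are hypothetical names, and the stand-in classes only mimic the PyPDF2 1.x methods used above so that the block runs without a PDF file.

```python
def pages_without_text(reader):
    """Return the 0-based indices of pages whose extracted text is empty.

    `reader` is any object offering the PyPDF2 1.x reader methods used in
    this article: getNumPages() and getPage(i).extractText().
    """
    empty = []
    for index in range(reader.getNumPages()):
        text = reader.getPage(index).extractText()
        if not text.strip():
            empty.append(index)
    return empty


# Stand-in objects (hypothetical, for illustration only) so the sketch runs
# without a PDF file; with PyPDF2 1.x installed, pass a PdfFileReader instead.
class FakePage:
    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


class FakeReader:
    def __init__(self, texts):
        self._pages = [FakePage(t) for t in texts]

    def getNumPages(self):
        return len(self._pages)

    def getPage(self, index):
        return self._pages[index]


reader = FakeReader(["Hello World", "", "  \n"])
print(pages_without_text(reader))  # [1, 2]
```

With a real file, the same function could be called as `pages_without_text(input1)` on the reader opened in the example above.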

That is all for this article. I hope it helps with your studies.

Source: http://www.bbyears.com/jiaocheng/126695.html
