Table of contents

Introduction

Getting the details of a PDF

Extracting text from PDF

Rotating the Pages

Merging PDF files in Python

Splitting the pages of the PDF

Adding a Watermark

Frequently Asked Questions

1. Can Python work with PDF files?

2. What is the PyPDF2 package?

3. What is PDFMiner in Python?

4. What is PDF scraping?

Conclusion

Last Updated: Aug 22, 2025

Handling PDF files in Python

Author Taneesh Kaushik

Introduction

There are many use cases in which we need to create a PDF file programmatically or read the contents of an existing one. In such cases, we can use the Python module PyPDF2. This article walks through several tasks that can be done with this module.

It can be installed with pip using this command:

pip install PyPDF2

Getting the details of a PDF

You may want to know some details about a PDF, such as its author, title, or page count. Details like these are called metadata, and we can read them with the PyPDF2 module.

Have a look at this example:

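# PyPDF2 3.0 removed the older PdfFileReader name; PdfReader replaces it
from PyPDF2 import PdfReader

pdf_path = 'DSA outline.pdf'
with open(pdf_path, 'rb') as f:
    pdf = PdfReader(f)
    information = pdf.metadata        # document information (title, producer, ...)
    number_of_pages = len(pdf.pages)  # total number of pages
    print(information)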

Output:

{'/Title': 'DSA_Outline', '/Producer': 'Skia/PDF m95 Google Docs Renderer'}
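
The printed result behaves like a dictionary whose keys are PDF names such as /Title. The following is a minimal sketch of reading single fields with a fallback for missing ones. It uses a plain dict copied from the output above as a stand-in for the object returned by the reader, and `describe` is a hypothetical helper, not part of PyPDF2.

```python
# Plain dict standing in for the metadata printed above (an assumption for
# illustration; a real run would get this object from the PDF reader).
information = {
    '/Title': 'DSA_Outline',
    '/Producer': 'Skia/PDF m95 Google Docs Renderer',
}


def describe(info):
    """Return a readable summary, using 'Unknown' for absent fields."""
    title = info.get('/Title', 'Unknown')
    author = info.get('/Author', 'Unknown')
    return 'Title: {} | Author: {}'.format(title, author)


print(describe(information))  # Title: DSA_Outline | Author: Unknown
```

Using `.get` with a default avoids a KeyError for PDFs that carry no author entry, as is the case in the output shown above.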

Extracting text from PDF

We often need to read the text in a PDF file and use it elsewhere. In other words, we need to extract text from the PDF, and this module makes that straightforward.

We create a PDF reader object from the file, and then get the text from each page.

Here is an example:

# PyPDF2 3.0 removed the older PdfFileReader name; PdfReader replaces it
from PyPDF2 import PdfReader

pdf_path = 'DSA outline.pdf'
# open the file in binary mode; the with-block closes it when we are done
with open(pdf_path, 'rb') as pdfFileObject:
    # creating a pdf reader object
    pdfReader = PdfReader(pdfFileObject)
    for pageObj in pdfReader.pages:
        # extracting text from the page
        pageContent = pageObj.extract_text()
        print(pageContent)

Here we go through each page, extract its text, and print it. We can also feed this text to another algorithm for further processing.
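
As one possible example of such further processing, here is a minimal sketch that counts the most frequent words in a piece of extracted text. The `top_words` helper and the sample string are hypothetical; in practice the string would come from the extraction loop above.

```python
import re
from collections import Counter


def top_words(text, n=3):
    """Return the n most common lowercase words in text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


# Stand-in for text returned by the extraction loop (hypothetical sample)
page_content = "Arrays and linked lists. Trees and graphs. Sorting arrays and lists."
print(top_words(page_content, 1))  # [('and', 3)]
```

The function only needs a string, so it works the same whether the text comes from one page or from all pages joined together.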

Rotating the Pages

We can rotate pages of a PDF file. The example below rotates the first page by 90 degrees clockwise and writes only that page to a new file.

# PyPDF2 3.0 removed the older PdfFileReader/PdfFileWriter names
from PyPDF2 import PdfReader, PdfWriter

pdf_path = 'DSA outline.pdf'
pdf_read = PdfReader(pdf_path)
pdf_write = PdfWriter()
# Rotate the first page 90 degrees to the right (clockwise)
page1 = pdf_read.pages[0].rotate(90)
pdf_write.add_page(page1)
# Write the rotated page to a new file
with open('rotate_pages.pdf', 'wb') as fh:
    pdf_write.write(fh)
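
Page rotation in a PDF is stored in steps of 90 degrees, so it can help to validate an angle before passing it to the library. The following is a minimal sketch of such a check; `normalize_angle` is a hypothetical helper, not a PyPDF2 function.

```python
def normalize_angle(angle):
    """Map a rotation angle to 0, 90, 180 or 270 degrees clockwise."""
    if angle % 90 != 0:
        raise ValueError('rotation angle must be a multiple of 90')
    # Python's % always returns a non-negative result for a positive modulus,
    # so -90 (a left turn) becomes 270 degrees clockwise.
    return angle % 360


print(normalize_angle(-90))  # 270
print(normalize_angle(450))  # 90
```

The returned value can then be used as the rotation angle in the example above.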

Merging PDF files in Python

We can merge two or more PDF files into one.

# PyPDF2 3.0 removed the older PdfFileMerger name; PdfMerger replaces it
from PyPDF2 import PdfMerger

# Create the merger object
mergedObject = PdfMerger()
# Files to merge, in the order they should appear in the output
mergeFilesPath = ['pdf1.pdf', 'pdf2.pdf', 'pdf3.pdf']
# For many numbered files (for example fileMerger_1.pdf to fileMerger_116.pdf),
# build the list instead:
# mergeFilesPath = ['fileMerger_' + str(n) + '.pdf' for n in range(1, 117)]
# Loop through all of them and append their pages
for path in mergeFilesPath:
    mergedObject.append(path)

# Write all the pages into a single output file
mergedObject.write("mergedfilesoutput.pdf")
mergedObject.close()
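
When many numbered files such as fileMerger_1.pdf to fileMerger_116.pdf are collected from a folder, a plain alphabetical sort would place fileMerger_10.pdf before fileMerger_2.pdf. Here is a minimal sketch of sorting such names numerically before merging; `natural_key` is a hypothetical helper and the short list of names is only sample data.

```python
import re


def natural_key(filename):
    """Split a name into text and integer parts so numbers sort numerically."""
    parts = re.split(r'(\d+)', filename)
    return [int(p) if p.isdigit() else p.lower() for p in parts]


names = ['fileMerger_10.pdf', 'fileMerger_2.pdf', 'fileMerger_1.pdf']
ordered = sorted(names, key=natural_key)
print(ordered)  # ['fileMerger_1.pdf', 'fileMerger_2.pdf', 'fileMerger_10.pdf']
```

The sorted list can then be looped over in the merge step so that pages appear in the intended order.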

Splitting the pages of the PDF

We can split a PDF into its pages and save each page as a separate PDF file.

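import os
# PyPDF2 3.0 removed the older PdfFileReader/PdfFileWriter names
from PyPDF2 import PdfReader, PdfWriter

pdf_path = 'DSA outline.pdf'
pdf = PdfReader(pdf_path)
# base name of the input file without its extension
fname = os.path.splitext(os.path.basename(pdf_path))[0]
for page in range(len(pdf.pages)):
    # a fresh writer per page, so each output file holds exactly one page
    pdfwrite = PdfWriter()
    pdfwrite.add_page(pdf.pages[page])
    outputfilename = '{}_page_{}.pdf'.format(fname, page + 1)
    with open(outputfilename, 'wb') as out:
        pdfwrite.write(out)
    print('Created: {}'.format(outputfilename))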
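
The same idea can be extended to write several pages per output file. The following is a minimal sketch of the page-range arithmetic only; `page_ranges` is a hypothetical helper, and each (start, stop) pair would select the pages added to one writer.

```python
def page_ranges(total_pages, pages_per_file):
    """Return (start, stop) index pairs covering total_pages in chunks."""
    if pages_per_file < 1:
        raise ValueError('pages_per_file must be at least 1')
    return [(start, min(start + pages_per_file, total_pages))
            for start in range(0, total_pages, pages_per_file)]


print(page_ranges(5, 2))  # [(0, 2), (2, 4), (4, 5)]
```

The stop index is exclusive, so each pair can be passed straight to `range(start, stop)`; the last chunk is shorter when the page count is not an exact multiple.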

Adding a Watermark

We can add a watermark to every page by merging the first page of a watermark PDF onto each page of the input PDF.

# PyPDF2 3.0 removed the older PdfFileReader/PdfFileWriter names
from PyPDF2 import PdfWriter, PdfReader


def watermark(inputpdf, outputpdf, watermarkpdf):
    # the first page of the watermark file is used as the overlay
    watermarkpage = PdfReader(watermarkpdf).pages[0]
    pdf = PdfReader(inputpdf)
    pdfwrite = PdfWriter()
    for pdfpage in pdf.pages:
        # draw the watermark page on top of the current page
        pdfpage.merge_page(watermarkpage)
        pdfwrite.add_page(pdfpage)
    with open(outputpdf, 'wb') as fh:
        pdfwrite.write(fh)


if __name__ == '__main__':
    watermark(inputpdf='document-1.pdf',
              outputpdf='watermarked_w9.pdf',
              watermarkpdf='watermark.pdf')

Frequently Asked Questions

1. Can Python work with PDF files?

Yes. The PyPDF2 package lets you work with existing PDFs in Python and supports a variety of PDF operations.

2. What is the PyPDF2 package?

The PyPDF2 package is a pure-Python PDF library that you can use for splitting, merging, cropping, and transforming the pages of your PDFs. According to the PyPDF2 website, you can also use it to add data, viewing options, and passwords to PDFs.

3. What is PDFMiner in Python?

PDFMiner is a tool for extracting text from PDF files.

4. What is PDF scraping?

PDF scraping is the automated extraction of data from PDF documents and its conversion into machine-readable structured data. It scales to large volumes of documents, and the scraped data can be handled easily in automated workflows.

Conclusion

In a nutshell, the PyPDF2 module gives us many ways to manipulate PDFs and produce new ones in Python. We can read metadata, extract text, rotate, merge, and split pages, and add watermarks, all of which are useful when handling PDFs in Python.