Handling PDF files in Python using PyMuPDF
In this blog, we will learn how to handle PDF files in Python using PyMuPDF, a library that provides a Pythonic interface to the MuPDF library.
The MuPDF library is a lightweight, high-quality PDF renderer that is written in portable C code. It is designed to be fast and memory-efficient, making it well-suited for use in applications that need to work with large PDF files or process a large number of PDFs in a short amount of time.
With PyMuPDF, we can open and read PDF files, extract text and images, add text and images to PDF files, and perform various other operations on PDF files. In addition to its core functionality, PyMuPDF also provides several convenience features that make it easier to work with PDFs in Python. For example, it includes support for bookmarking, annotations, and form filling, as well as support for password-protected PDFs.
To get started with PyMuPDF, you will need to install the library and its dependencies. This can be done using pip:
pip install pymupdf
Once PyMuPDF is installed, you can begin using it in your Python code.
Here are some of the most common operations.
Open a PDF file
To open a PDF file using PyMuPDF, you can use the open function of the fitz module. This function takes the path of the PDF file as an argument and returns a Document object representing the PDF file.
Here is an example of how to open a PDF file using PyMuPDF:
import fitz
# Open the PDF document
doc = fitz.open("document.pdf")
This will open the document.pdf file and return a Document object representing it. You can then use the methods and properties of the Document object to access and manipulate the contents of the PDF file.
For example, you can use the page_count property of the Document object to get the number of pages in the PDF file, and you can use the indexing operator (e.g., doc[i]) to get a specific page from the file. Page indices start at 0.
You can also use the metadata property of the Document object to get metadata about the PDF file, such as the title, author, and subject.
Extract text from a PDF
To extract text from a PDF file into a list using PyMuPDF, you can use the get_text method of the Page object and append the extracted text to a list.
Here is an example of how to extract all the text from a PDF file and store it in a list:
import fitz
# Open the PDF document
doc = fitz.open("document.pdf")
# Create an empty list to store the text
text_list = []
# Iterate over all the pages in the document
for page in doc:
    # Extract the text from the page
    text = page.get_text()
    # Append the text to the list
    text_list.append(text)
# Print the list
print(text_list)
This will extract all the text from the document.pdf file and store it in the text_list variable. The text from each page will be stored as a separate element in the list.
You can also extract text from a specific page by using the indexing operator to get the desired page and then calling the get_text method on that page (e.g., doc[i].get_text()).
Keep in mind that the get_text method may not always produce perfect results, especially for complex or poorly formatted PDF files. It may miss some text or include extra characters, and scanned pages that contain only images yield no text at all. You may need to do additional processing to clean up the extracted text.
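What "additional processing" means depends on the document, but here is one possible sketch that uses only the standard library. The helper name clean_extracted_text and the sample string are hypothetical, and the rules (rejoin hyphenated line breaks, collapse whitespace, drop empty lines) are assumptions about what a typical cleanup might need rather than anything PyMuPDF prescribes.

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Tidy text returned by an extractor such as Page.get_text()."""
    # Rejoin words hyphenated across a line break: "docu-\nment" -> "document".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Strip each line and drop the empty ones.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


# Hypothetical sample standing in for the output of page.get_text().
sample = "Intro-\nduction   to  PDFs \n\n\n  Second\tline  "
cleaned = clean_extracted_text(sample)
print(cleaned)
```

Note that the hyphen rule will also join words that are genuinely hyphenated at a line end, so you may want to drop it for text where that matters.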
Add text to a PDF file
To add text to a PDF file using PyMuPDF, you can use the insert_text method of the Page object. This method takes the insertion point and the text to be added, plus optional formatting arguments such as the font size, and writes the text onto the page starting at that point. The point marks the bottom-left corner of the first character, and coordinates are measured in points from the top-left corner of the page.
Here is an example of how to add some text to the first page of a PDF document:
import fitz
# Open the PDF document
doc = fitz.open("document.pdf")
# Get the first page
page = doc[0]
# Set the font size
font_size = 20
# Set the position of the text on the page
x = 50
y = 50
# Set the text to be added
text = "This is some text"
# Add the text to the page
page.insert_text((x, y), text, fontsize=font_size)
# Save the changes to the PDF
doc.save("modified_document.pdf")
This will add the text “This is some text” to the first page of the document.pdf file, starting at position (50, 50), with a font size of 20. The modified document will be saved to a new file called modified_document.pdf.
You can customize the position, font size, and other formatting options as needed. You can also add text to multiple pages by repeating the above steps for each page.
Rotate pages in a PDF document
To rotate pages in a PDF document using PyMuPDF, you can use the set_rotation method of the Page object. This method takes an angle as an argument and sets the page's rotation to that angle. You can read the rotation property of the page to get its current rotation value.
Here is an example of how to set the rotation of all the pages in a PDF document to 90 degrees:
import fitz
# Open the PDF document
doc = fitz.open("document.pdf")
# Iterate over all the pages in the document
for page in doc:
    # Set the page rotation to 90 degrees
    page.set_rotation(90)
# Save the changes to the PDF
doc.save("rotated_document.pdf")
This will set the rotation of all the pages in the document.pdf file to 90 degrees and save the result to a new file called rotated_document.pdf. The angle must be a multiple of 90 degrees (0, 90, 180, or 270), because PDF page rotation only supports quarter turns.
Extract images from a PDF file
To extract images from a PDF file using PyMuPDF, you can use the get_images method of the Page object to list the images that a page references, and then create a Pixmap object for each one. A Pixmap represents a raster image, and you can save it to a file using its save method. Note that the get_pixmap method of the Page object does something different: it renders the whole page as an image.
Here is an example of how to extract all the images from a PDF file and save them to image files:
import fitz
# Open the PDF document
doc = fitz.open("document.pdf")
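# Iterate over all the pages in the document
for i in range(doc.page_count):
    # Get the current page
    page = doc[i]
    # Iterate over all the images referenced by the page
    for j, img in enumerate(page.get_images()):
        # The first item of each entry is the image's cross-reference number
        xref = img[0]
        # Create a Pixmap from the image
        pix = fitz.Pixmap(doc, xref)
        # Convert CMYK and other non-RGB images so they can be written as PNG
        if pix.n - pix.alpha > 3:
            pix = fitz.Pixmap(fitz.csRGB, pix)
        # Save the image, using the page and image index to keep names unique
        pix.save("image{}_{}.png".format(i, j))
        # Free the memory used by the Pixmap object
        pix = None
This will extract all the images from the document.pdf file and save them as PNG files with names like image0_0.png, image0_1.png, etc., where the first number is the page index and the second is the position of the image on that page. The images are written as PNG regardless of the format in which they are stored in the PDF file.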
Merge two PDFs
To merge two PDFs using the insert_pdf method of PyMuPDF, you can use the following code:
import fitz
# Open the first PDF document
doc1 = fitz.open('document1.pdf')
# Open the second PDF document
doc2 = fitz.open('document2.pdf')
# Insert the second document into the first document
doc1.insert_pdf(doc2)
# Save the merged PDF
doc1.save('merged.pdf')
This will create a new PDF file called merged.pdf that contains the pages from both document1.pdf and document2.pdf. The pages from document2.pdf will be appended after the pages of document1.pdf.
Delete a page from a PDF
To delete a page from a PDF using PyMuPDF, you can use the delete_page method of the Document object. Here is an example of how to delete the second page of a PDF:
import fitz
# Open the PDF document
doc = fitz.open('document.pdf')
# Delete the second page
doc.delete_page(1)
# Save the modified PDF
doc.save('modified.pdf')
This will create a new PDF file called modified.pdf with the second page removed. Note that the page numbering is zero-based, so to delete the second page you need to specify the index 1.
If you want to delete multiple pages at once, you can pass a list of page indices to the delete_pages method.
As you can see, PyMuPDF is a powerful and easy-to-use library for working with PDF files in Python. Whether you need to extract text and images from PDFs or add and modify content in existing PDFs, PyMuPDF has you covered.
The official PyMuPDF documentation is a great resource if you want more detail, and it includes many more examples to explore.