How to Extract Text, Links, and Images from PDF Files Using Python

PyPDF2 can add custom data, viewing options, and passwords to PDF files.

Importantly, though, PyPDF2 can retrieve text from PDF files.

To install pip, click on get pip to download its installation script.

Install PyPDF2 by executing the following command in the terminal: pip install PyPDF2

To start working with a PDF file, you first need to open the file.

Since our PDF file has 5 pages, we can access each page available in the PDF.

However, counting starts from 0, just like Python's indexing convention.

Therefore, the first page in the PDF file will be page number 0.

You thus have access to the text on the first page of the PDF file through the variable textPage1.
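The PyPDF2 steps above can be sketched as follows. This is a minimal sketch, assuming PyPDF2 2.x or later (which provides PdfReader) and a hypothetical file named sample.pdf; the helper names read_page_text and main are illustrative and do not come from the original article.

```python
def read_page_text(reader, page_index=0):
    """Return the text of one page. page_index is zero-based."""
    page_count = len(reader.pages)
    if not 0 <= page_index < page_count:
        raise IndexError(f"page_index must be between 0 and {page_count - 1}")
    return reader.pages[page_index].extract_text()


def main(pdf_path="sample.pdf"):  # hypothetical file name
    # Imported here so the helper above can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader  # install with: pip install PyPDF2

    reader = PdfReader(pdf_path)
    print("Number of pages:", len(reader.pages))

    # Page 0 is the first page of the document.
    textPage1 = read_page_text(reader, 0)
    print(textPage1)
```

Calling main() with the path to your own PDF prints the page count followed by the text of the first page.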

To use PyMuPDF, you should have Python 3.8 or later.

To get started:

Install PyMuPDF by executing the following line in the terminal: pip install PyMuPDF

Import PyMuPDF into your Python file using the following statement: import fitz

Before you can extract links from a PDF, you first need to open it.

To open it, enter the following line:

Page counting starts from zero, just like indexing in Python lists and arrays.
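Opening a document and loading a page by its zero-based index might look like the sketch below. It assumes PyMuPDF is importable as fitz; the helper names to_zero_based and load_page, and the idea of converting from a 1-based page number, are illustrative additions rather than part of the original article.

```python
def to_zero_based(page_number, page_count):
    """Convert a 1-based page number into the 0-based index PyMuPDF expects."""
    if not 1 <= page_number <= page_count:
        raise ValueError(f"page_number must be between 1 and {page_count}")
    return page_number - 1


def load_page(pdf_path, page_number):
    """Open the PDF and return (document, page) for a 1-based page number."""
    # Imported here so to_zero_based can be used without PyMuPDF installed.
    import fitz  # PyMuPDF; install with: pip install PyMuPDF

    doc = fitz.open(pdf_path)
    index = to_zero_based(page_number, doc.page_count)
    # The page is only valid while the document stays open,
    # so the document is returned for the caller to close.
    return doc, doc.load_page(index)
```

For a five-page document, to_zero_based(1, 5) gives index 0 (the first page) and to_zero_based(5, 5) gives index 4 (the last page).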

The entire code that does this is shown below:
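A minimal sketch of link extraction with PyMuPDF, assuming the library is importable as fitz and that each entry returned by page.get_links() is a dictionary in which external links carry a "uri" key. The function names uri_links and extract_links are hypothetical.

```python
def uri_links(links):
    """Keep only the target URLs from PyMuPDF-style link dictionaries."""
    # Internal links (for example, jumps to another page) have no "uri" key.
    return [link["uri"] for link in links if link.get("uri")]


def extract_links(pdf_path, page_index=0):
    """Return the external URLs found on one page (zero-based index)."""
    # Imported here so uri_links can be used without PyMuPDF installed.
    import fitz  # PyMuPDF; install with: pip install PyMuPDF

    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_index)
        return uri_links(page.get_links())
```

Filtering on the "uri" key separates the actual web addresses from the rest of the link metadata, such as the rectangle on the page that the link occupies.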

To extract images from a PDF file:

Import PyMuPDF, io, and PIL.

Open the PDF file you want to extract images from:

Load the page you want to extract images from:

Every image in a PDF file has a unique xref (cross-reference number).

Since we only have one image on the first page, there is only one tuple.

The first element in the tuple represents the xref of the image on the page.

Therefore, the xref of the image on the first page is 7.

To extract the xref value for the image from the list of tuples, we use the code below:
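One way to pull the xref out of the list of tuples is sketched here. It assumes the list comes from PyMuPDF's page.get_images(), where the first element of each tuple is the xref. The helper names are hypothetical, and every value in sample_image_list other than the xref of 7 is a made-up placeholder.

```python
def first_image_xref(image_list):
    """Return the xref of the first image in a page's image list."""
    if not image_list:
        raise ValueError("no images found on this page")
    # Each tuple describes one image; element 0 is its xref.
    return image_list[0][0]


def page_image_xref(pdf_path, page_index=0):
    """Open the PDF and return the xref of the first image on the page."""
    # Imported here so first_image_xref can be used without PyMuPDF installed.
    import fitz  # PyMuPDF; install with: pip install PyMuPDF

    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_index)
        return first_image_xref(page.get_images())


# Placeholder data shaped like one image tuple; only the xref (7) is from the text.
sample_image_list = [(7, 0, 640, 480, 8, "DeviceRGB", "", "Im1", "DCTDecode")]
xref_value = first_image_xref(sample_image_list)
print(xref_value)  # prints 7
```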

From the dictionary returned by the extract_image() function, check the file extension of the extracted image.

The file extension is stored under the key ext:

Extract the image binaries from the dictionary stored in img_dictionary.

The image binaries are stored under the key image:
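Reading the two keys from the dictionary can be sketched as follows, assuming img_dictionary comes from PyMuPDF's Document.extract_image(xref) and therefore contains the keys "ext" and "image". The helper names unpack_image and extract_image_data are hypothetical.

```python
def unpack_image(img_dictionary):
    """Return (file extension, raw image bytes) from an extract_image() result."""
    ext = img_dictionary["ext"]            # for example "png" or "jpeg"
    image_bytes = img_dictionary["image"]  # the binary image data
    return ext, image_bytes


def extract_image_data(pdf_path, xref):
    """Open the PDF and return (extension, bytes) for the image with this xref."""
    # Imported here so unpack_image can be used without PyMuPDF installed.
    import fitz  # PyMuPDF; install with: pip install PyMuPDF

    with fitz.open(pdf_path) as doc:
        img_dictionary = doc.extract_image(xref)
        return unpack_image(img_dictionary)
```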

Create a BytesIO object and initialize it with the binary image data that represents the image.

Open and parse the image data stored in the BytesIO object named image_io using the PIL library.

Specify the path where you want to save the image.

The image will be saved as image_1.png.

The PNG extension matters because the saved file should match the original extension of the image.

Save the image and close the BytesIO object.
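The final steps (wrap the bytes in BytesIO, open them with PIL, save, and close) might look like this sketch. It assumes Pillow is installed and that image_bytes and ext were taken from the "image" and "ext" keys of the extracted dictionary. The helper names build_output_path and save_image, and the default output directory, are illustrative choices.

```python
import io
from pathlib import Path


def build_output_path(directory, index, ext):
    """Build a path such as image_1.png, reusing the image's own extension."""
    return Path(directory) / f"image_{index}.{ext}"


def save_image(image_bytes, ext, directory=".", index=1):
    """Write the extracted image to disk and return the path it was saved to."""
    # Imported here so build_output_path can be used without Pillow installed.
    from PIL import Image  # install with: pip install Pillow

    image_io = io.BytesIO(image_bytes)
    try:
        image = Image.open(image_io)  # parse the in-memory image data
        path = build_output_path(directory, index, ext)
        image.save(path)              # output format is inferred from the extension
    finally:
        image_io.close()              # release the in-memory buffer
    return path
```

With ext equal to "png" and the defaults above, the file is written as image_1.png in the current directory.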

You may also explore some of the best PDF APIs for every business need.
