Extracting text from PDF files with Python: A comprehensive guide #Imaginations Hub



Extracting Text from PDF Files with Python: A Comprehensive Guide

A complete process to extract textual information from tables, images, and plain text from a PDF file

Photo by Giorgio Trovato on Unsplash

Introduction

In the age of Large Language Models (LLMs) and their wide-ranging applications, from simple text summarisation and translation to predicting stock performance based on sentiment and financial report topics, the importance of text data has never been greater.

There are many types of documents that share this kind of unstructured information, from web articles and blog posts to handwritten letters and poems. However, a significant portion of this text data is stored and transferred in PDF format. More specifically, it has been found that over 2 billion PDFs are opened in Outlook each year, while 73 million new PDF files are saved in Google Drive and email every day (2).

Developing, therefore, a more systematic way to process these documents and extract information from them would give us the ability to have an automated flow and better understand and utilise this vast volume of textual data. And for this task, of course, our best friend could be none other than Python.

However, before we start our process, we need to specify the different types of PDFs that are around these days, and more specifically, the three most frequently appearing:

  1. Programmatically generated PDFs: These PDFs are created on a computer using either W3C technologies such as HTML, CSS, and JavaScript or another software like Adobe Acrobat. This type of file can contain various components, such as images, text, and links, which are all searchable and easy to edit.
  2. Traditional scanned documents: These PDFs are created from non-electronic mediums through a scanner machine or a mobile app. These files are nothing more than a collection of images stored together in a PDF file. That said, the elements appearing in these images, like the text or links, can’t be selected or searched. Essentially, the PDF serves as a container for these images.
  3. Scanned documents with OCR: In this case, Optical Character Recognition (OCR) software is employed after scanning the document to identify the text within each image in the file, converting it into searchable and editable text. Then the software adds a layer with the actual text to the image, and that way you can select it as a separate component when browsing the file. (3)

Even though nowadays more and more machines have OCR systems installed in them that identify the text from scanned documents, there are still documents that contain full pages in an image format. You’ve probably seen that when you read a great article and try to select a sentence, but instead you select the whole page. This can be a result of a limitation in the specific OCR machine or its complete absence. So, in order not to leave this information undetected, in this article I tried to create a process that also considers these cases and takes the most out of our precious and information-rich PDFs.

The Theoretical Approach

With all these different types of PDF files in mind and the various items that compose them, it’s important to perform an initial analysis of the layout of the PDF to identify the proper tool needed for each component. More specifically, based on the findings of this analysis, we will apply the appropriate method for extracting text from the PDF, whether it’s text rendered in a corpus block with its metadata, text within images, or structured text within tables. In the scanned document without OCR, the approach that identifies and extracts text from images will perform all the heavy lifting. The output of this process will be a Python dictionary containing information extracted for each page of the PDF file. Each key in this dictionary will present the page number of the document, and its corresponding value will be a list with the following 5 nested lists containing:

  1. The text extracted per text block of the corpus
  2. The format of the text in each text block in terms of font family and size
  3. The text extracted from the images on the page
  4. The text extracted from tables in a structured format
  5. The complete text content of the page

Image by the author
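
As a rough illustration of the structure listed above, the sketch below builds the dictionary for a single page by hand. The sample strings and font names are made up for illustration; only the key naming and the order of the five lists follow the description in this article.

```python
# Hypothetical sample values; only the shape mirrors the description above.
text_per_page = {
    "Page_0": [
        # 1. Text per text block
        ["Quarterly report\n", "Revenue grew this quarter.\n"],
        # 2. Format (font names and sizes) per text block
        [["Arial-Bold", 18.0], ["Arial", 11.0]],
        # 3. Text found in images
        ["ACME Corp"],
        # 4. Tables rendered as strings
        ["|Item|Amount|\n|Total|100|"],
        # 5. Full page content, in reading order
        ["Quarterly report\n", "ACME Corp", "Revenue grew this quarter.\n"],
    ]
}

# Unpack the five lists for one page
page_text, line_format, text_from_images, text_from_tables, page_content = text_per_page["Page_0"]
print("".join(page_content))
```

Keeping the five lists in a fixed order means a consumer can pick only the part it needs, for example the image text when looking for a company name in a logo.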

That way, we can achieve a more logical separation of the extracted text per source component, and it can sometimes help us to more easily retrieve information that usually appears in the specific component (e.g., the company name in a logo image). In addition, the metadata extracted from the text, like the font family and size, can be used to easily identify text headers or highlighted text of greater importance that will help us further separate or post-process the text in multiple different chunks. Finally, retaining the structured table information in a way that an LLM can understand will enhance significantly the quality of inferences made about relationships within the extracted data. Then these results can be composed into an output containing all the textual information that appeared on each page.

You can see a flowchart of this approach in the images below.

Image by the author

Installation of all the necessary libraries

Before we start this project, though, we should install the necessary libraries. We assume that you have Python 3.10 or above installed on your machine. Otherwise, you can install it from here. Then let’s install the following libraries:

PyPDF2: To read the PDF file from the repository path.

pip install PyPDF2

Pdfminer: To perform the layout analysis and extract text and format from the PDF. (the .six version of the library is the one that supports Python 3)

pip install pdfminer.six

Pdfplumber: To identify tables in a PDF page and extract the information from them.

pip install pdfplumber

Pdf2image: To convert the cropped PDF image to a PNG image.

pip install pdf2image

PIL: To read the PNG image.

pip install Pillow

Pytesseract: To extract the text from the images using OCR technology

This is a little trickier to install because first, you need to install Google Tesseract OCR, which is an OCR engine based on an LSTM model to identify line recognition and character patterns.

If you are a Mac user, you can install it through Brew from your terminal, and you are good to go.

brew install tesseract

For Windows users, you can follow these steps to install it. Then, when you download and install the software, you need to add its executable path to the Environment Variables on your computer. Alternatively, you can include the path directly in the Python script using the following code:

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

Then you can install the Python library

pip install pytesseract
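
Before wiring the path into the script, it can help to check that the Tesseract executable is actually reachable. The sketch below is one possible way to do that with the standard library only; the helper name resolve_tesseract_cmd and the candidate path are assumptions for illustration, not part of pytesseract.

```python
import os
import shutil


def resolve_tesseract_cmd(candidates, search_path=True):
    """Return the first usable Tesseract executable path, or None."""
    if search_path:
        # Look for a "tesseract" executable on the PATH first
        found = shutil.which("tesseract")
        if found:
            return found
    # Fall back to explicit candidate locations
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None


# Hypothetical candidate location; adjust it to your own installation.
candidates = [r"C:\Program Files\Tesseract-OCR\tesseract.exe"]
tesseract_cmd = resolve_tesseract_cmd(candidates)
if tesseract_cmd is None:
    print("Tesseract not found; install it or add it to PATH.")
else:
    print("Using Tesseract at", tesseract_cmd)
    # With pytesseract installed, the path could then be assigned:
    # pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
```

Failing early with a clear message is usually easier to debug than an error raised in the middle of the OCR step.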

Finally, we will import all the libraries at the beginning of our script.

# To read the PDF
import PyPDF2
# To analyze the PDF layout and extract text
from pdfminer.high_level import extract_pages, extract_text
from pdfminer.layout import LTTextContainer, LTChar, LTRect, LTFigure
# To extract text from tables in PDF
import pdfplumber
# To extract the images from the PDFs
from PIL import Image
from pdf2image import convert_from_path
# To perform OCR to extract text from images
import pytesseract
# To remove the additional created files
import os

So now we are all set. Let’s move to the fun part.

Document’s Layout Analysis with Python

Image by the author

For the preliminary analysis, we used the PDFMiner Python library to separate the text from a document object into multiple page objects and then break down and examine the layout of each page. PDF files inherently lack structured information, such as paragraphs, sentences, or words as seen by the human eye. Instead, they understand only the individual characters of the text along with their position on the page. That way, PDFMiner tries to reconstruct the content of the page into its individual characters along with their position in the file. Then, by comparing the distances of these characters from others, it composes the appropriate words, sentences, lines, and paragraphs of text. (4) To achieve that, the library:

Separates the individual pages from the PDF file using the high-level function extract_pages() and converts them into LTPage objects.

Then for each LTPage object, it iterates over each element from top to bottom and tries to identify the appropriate component as either:

  • LTFigure, which represents the area of the PDF that can present figures or images that have been embedded as another PDF document in the page.
  • LTTextContainer, which represents a group of text lines in a rectangular area and is then analysed further into a list of LTTextLine objects. Each one of them represents a list of LTChar objects, which store the single characters of text along with their metadata. (5)
  • LTRect, which represents a 2-dimensional rectangle that can be used to frame images and figures or create tables in an LTPage object.

Therefore, based on this reconstruction of the page and the classification of its elements either into LTFigure, which contains the images or figures of the page, LTTextContainer, which represents the textual information of the page, or LTRect, which will be a strong indication of the presence of a table, we can apply the appropriate function to better extract the information.

for pagenum, page in enumerate(extract_pages(pdf_path)):

    # Iterate the elements that composed a page
    for element in page:

        # Check if the element is a text element
        if isinstance(element, LTTextContainer):
            # Function to extract text from the text block
            pass
            # Function to extract text format
            pass

        # Check the elements for images
        if isinstance(element, LTFigure):
            # Function to convert PDF to Image
            pass
            # Function to extract text with OCR
            pass

        # Check the elements for tables
        if isinstance(element, LTRect):
            # Function to extract table
            pass
            # Function to convert table content into a string
            pass
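
The idea mentioned earlier in this section, composing words by comparing the distances between characters, can be illustrated with a deliberately simplified sketch. It is not PDFMiner's actual algorithm (pdfminer.six controls its layout analysis through parameters such as char_margin and word_margin in LAParams, and does considerably more); the function name, the tuple layout, and the max_gap threshold below are assumptions for illustration.

```python
from typing import List, Tuple

# Each character is (x0, x1, text): left edge, right edge, glyph.
Char = Tuple[float, float, str]


def group_chars_into_words(chars: List[Char], max_gap: float = 1.5) -> List[str]:
    """Join characters whose horizontal gap is at most max_gap into words."""
    words: List[str] = []
    current = ""
    previous_right = None
    # Process the characters from left to right
    for x0, x1, text in sorted(chars, key=lambda c: c[0]):
        # A large gap to the previous character starts a new word
        if previous_right is not None and x0 - previous_right > max_gap:
            words.append(current)
            current = ""
        current += text
        previous_right = x1
    if current:
        words.append(current)
    return words


# Made-up character boxes on a single line
line = [(0, 5, "P"), (5.2, 10, "D"), (10.1, 15, "F"), (20, 25, "o"), (25.3, 30, "k")]
print(group_chars_into_words(line))  # ['PDF', 'ok']
```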

So now that we understand the analysis part of the process, let’s create the functions needed to extract the text from each component.

Define the function to extract text from PDF

From here on, extracting text from a text container is really straightforward.

# Create a function to extract text

def text_extraction(element):
    # Extracting the text from the in-line text element
    line_text = element.get_text()

    # Find the formats of the text
    # Initialize the list with all the formats that appeared in the line of text
    line_formats = []
    for text_line in element:
        if isinstance(text_line, LTTextContainer):
            # Iterating through each character in the line of text
            for character in text_line:
                if isinstance(character, LTChar):
                    # Append the font name of the character
                    line_formats.append(character.fontname)
                    # Append the font size of the character
                    line_formats.append(character.size)
    # Find the unique font sizes and names in the line
    format_per_line = list(set(line_formats))

    # Return a tuple with the text in each line along with its format
    return (line_text, format_per_line)

So to extract text from a text container, we simply use the get_text() method of the LTTextContainer element. This method returns, as a single string, all the characters that make up the words within the specific corpus box. We later append that string to a list of text data, so that each element in this list represents the raw textual information contained in one container.

Now, to identify this text’s format, we iterate through the LTTextContainer object to access each text line of this corpus individually. In each iteration, a new LTTextLine object is created, representing a line of text in this chunk of corpus. We then examine whether the nested line element contains text. If it does, we access each individual character element as LTChar, which contains all the metadata for that character. From this metadata, we extract two types of formats and store them in a separate list, positioned correspondingly to the examined text:

  • The font family of the characters, including whether the character is in bold or italic format
  • The font size for the character

Generally, characters within a specific chunk of text tend to have consistent formatting unless some are highlighted in bold. To facilitate further analysis, we capture the unique values of text formatting for all characters within the text and store them in the appropriate list.

Image by the author
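
Because the format list mixes font names (strings) and font sizes (numbers), a small post-processing step can make it easier to use. The following is a minimal sketch under stated assumptions: the helper name summarise_formats is hypothetical, and the bold check relies on the font name containing "Bold", which is common but not guaranteed.

```python
def summarise_formats(format_per_line):
    """Split a mixed list of font names and sizes into a small summary."""
    font_names = sorted(f for f in format_per_line if isinstance(f, str))
    font_sizes = sorted(f for f in format_per_line if isinstance(f, (int, float)))
    return {
        "fonts": font_names,
        "max_size": max(font_sizes) if font_sizes else None,
        # Assumption: bold fonts carry "Bold" somewhere in their name
        "has_bold": any("bold" in name.lower() for name in font_names),
    }


# Made-up format list in the shape produced for one text block
sample = ["Helvetica-Bold", 14.0, "Helvetica", 10.5]
print(summarise_formats(sample))
```

A summary like this could later feed a heuristic for spotting headers, since a block with a larger or bold font is often a title.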

Define the function to extract text from Images

Here I believe it is a more tricky part.

How to handle text in images found in PDF?

Firstly, we need to establish here that image elements stored in PDFs are not in a different format from the file, such as JPEG or PNG. That way, in order to apply OCR software on them, we first need to separate them from the file and then convert them into an image format.

# Create a function to crop the image elements from PDFs
def crop_image(element, pageObj):
    # Get the coordinates to crop the image from the PDF
    # (in pdfminer, y0 is the lower edge and y1 the upper edge of the element)
    [image_left, image_top, image_right, image_bottom] = [element.x0,element.y0,element.x1,element.y1]
    # Crop the page using two diagonally opposite corners of the image box
    pageObj.mediabox.lower_left = (image_left, image_bottom)
    pageObj.mediabox.upper_right = (image_right, image_top)
    # Save the cropped page to a new PDF
    cropped_pdf_writer = PyPDF2.PdfWriter()
    cropped_pdf_writer.add_page(pageObj)
    # Save the cropped PDF to a new file
    with open('cropped_image.pdf', 'wb') as cropped_pdf_file:
        cropped_pdf_writer.write(cropped_pdf_file)

# Create a function to convert the PDF to images
def convert_to_images(input_file):
    images = convert_from_path(input_file)
    image = images[0]
    output_file = "PDF_image.png"
    image.save(output_file, "PNG")

# Create a function to read text from images
def image_to_text(image_path):
    # Read the image
    img = Image.open(image_path)
    # Extract the text from the image
    text = pytesseract.image_to_string(img)
    return text

To achieve this, we follow the following process:

  1. We use the metadata from the LTFigure object detected from PDFMiner to crop the image box, utilising its coordinates in the page layout. We then save it as a new PDF in our directory using the PyPDF2 library.
  2. Then we employ the convert_from_path() function from the pdf2image library to convert the cropped PDF file into a list of images, and save the first one in PNG format.
  3. Finally, now that we have our image file, we read it in our script using the Image package of the PIL module and call the image_to_string() function of pytesseract to extract text from the image using the Tesseract OCR engine.

As a result, this process returns the text from the images, which we then save in a third list within the output dictionary. This list contains the textual information extracted from the images on the examined page.
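
The cropping step depends on getting the corners of the image box right. PDF layout coordinates start at the bottom-left of the page, so for a pdfminer element y0 is the lower edge and y1 the upper edge. The sketch below shows one way to derive normalised corners from such a box; Box is a stand-in for a layout element and crop_corners is a hypothetical helper, not part of PyPDF2 or pdfminer.

```python
from collections import namedtuple

# Stand-in for a pdfminer layout element, which exposes x0, y0, x1, y1.
Box = namedtuple("Box", ["x0", "y0", "x1", "y1"])


def crop_corners(element):
    """Return (lower_left, upper_right) corners for a bounding box."""
    # Sorting makes the result independent of the order the edges are given in
    left, right = sorted((element.x0, element.x1))
    bottom, top = sorted((element.y0, element.y1))
    return (left, bottom), (right, top)


# Made-up figure position on the page, in points
figure = Box(x0=72.0, y0=500.0, x1=300.0, y1=650.0)
lower_left, upper_right = crop_corners(figure)
print(lower_left, upper_right)  # (72.0, 500.0) (300.0, 650.0)
```

The two tuples are in the form that could be assigned to the lower_left and upper_right attributes of a page's mediabox.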

Define the function to extract text from Tables

In this section, we will extract a more logically structured text from tables on a PDF page. This is a slightly more complex task than extracting text from a corpus because we need to take into account the granularity of the information and the relationships formed between data points presented in a table.

Although there are several libraries used to extract table data from PDFs, with Tabula-py being one of the most well-known, we have identified certain limitations in their functionality.

The most glaring one in our opinion comes from the way that the library identifies the different rows of the table using the line-break special character \n in the table’s text. This works pretty well in most cases but it fails to capture the rows correctly when the text in a cell is wrapped into 2 or more rows, leading to the addition of unnecessary empty rows and losing the context of the extracted cell.

You can see the example below when we tried to extract the data from a table using tabula-py:

Image by the author
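
The failure mode described above can be reproduced without any PDF library. The sketch below is only an illustration of the idea, using made-up cell contents; it does not reproduce tabula-py's actual implementation.

```python
# A made-up table whose second cell wraps onto two lines.
cells = [["Item", "Description"], ["A1", "Long text that\nwraps in the cell"]]

# Naive approach: flatten to text and treat every "\n" as a row separator.
flat_text = "\n".join(" | ".join(row) for row in cells)
naive_rows = flat_text.split("\n")
print(len(naive_rows))  # 3 rows, although the table only has 2

# Cell-aware approach: clean line breaks inside each cell first.
clean_rows = [" | ".join(cell.replace("\n", " ") for cell in row) for row in cells]
print(len(clean_rows))  # 2
```

Working cell by cell keeps the wrapped text attached to the row it belongs to, which is the property we want from the table extractor.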

Then, the extracted information is outputted in a pandas DataFrame instead of a string. In most cases, this can be a desirable format, but in the case of transformers that take text as input, these results need to be transformed before being fed into a model.

For this reason, to tackle this task we used the pdfplumber library for various reasons. Firstly, it is built on pdfminer.six, which we used for our preliminary analysis, meaning that it contains similar objects. In addition, its approach to table detection is based on line elements along with their intersections that construct the cell that contains the text and then the table itself. That way, after we identify a cell of a table, we can extract just the content inside the cell without caring how many rows needed to be rendered. Then, when we have the contents of a table, we will format it in a table-like string and store it in the appropriate list.

# Extracting tables from the page

def extract_table(pdf_path, page_num, table_num):
    # Open the pdf file
    pdf = pdfplumber.open(pdf_path)
    # Find the examined page
    table_page = pdf.pages[page_num]
    # Extract the appropriate table
    table = table_page.extract_tables()[table_num]
    return table

# Convert table into the appropriate format
def table_converter(table):
    table_string = ''
    # Iterate through each row of the table
    for row_num in range(len(table)):
        row = table[row_num]
        # Remove the line breaker from the wrapped texts
        cleaned_row = [item.replace('\n', ' ') if item is not None and '\n' in item else 'None' if item is None else item for item in row]
        # Convert the table into a string
        table_string+=('|'+'|'.join(cleaned_row)+'|'+'\n')
    # Removing the last line break
    table_string = table_string[:-1]
    return table_string

To achieve that, we created two functions, extract_table() to extract the contents of the table into a list of lists, and table_converter() to join the contents of those lists in a table-like string.

In the extract_table() function:

  1. We open the PDF file.
  2. We navigate to the examined page of the PDF file.
  3. From the list of tables found on the page by pdfplumber, we select the desired one.
  4. We extract the content of the table and output it in a list of nested lists representing each row of the table.

In the table_converter() function:

  1. We iterate in each nested list and clean its content from any unwanted line breaks coming from any wrapped text.
  2. We join each element of the row by separating them using the | symbol to create the structure of a table’s cell.
  3. Finally, we add a line break at the end to move to the next row.

This will result in a string of text that will present the content of the table without losing the granularity of the data presented in it.
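
To see the kind of string this produces, the conversion step can be tried on a small hand-written table. The version of table_converter() below is a compact restatement of the same idea so that the example runs on its own, and the sample rows are made up; they only mimic the list-of-lists shape, with None for empty cells, that pdfplumber returns.

```python
def table_converter(table):
    """Render a list of rows as a pipe-delimited string, one row per line."""
    lines = []
    for row in table:
        # Replace empty cells with 'None' and flatten wrapped text onto one line
        cleaned = ["None" if item is None else item.replace("\n", " ") for item in row]
        lines.append("|" + "|".join(cleaned) + "|")
    return "\n".join(lines)


# Made-up rows: a header, a wrapped cell, and an empty cell
sample_table = [["Product", "Price"], ["Blue\nwidget", "10"], ["Red widget", None]]
print(table_converter(sample_table))
# |Product|Price|
# |Blue widget|10|
# |Red widget|None|
```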

Adding all together

Now that we have all the components of the code ready, let’s add them all up to a fully functional code. You can copy the code from here or you can find it along with the example PDF in my Github repo here.

# Find the PDF path
pdf_path = 'OFFER 3.pdf'

# create a PDF file object
pdfFileObj = open(pdf_path, 'rb')
# create a PDF reader object
pdfReaded = PyPDF2.PdfReader(pdfFileObj)

# Create the dictionary to extract text from each image
text_per_page = {}
# We extract the pages from the PDF
for pagenum, page in enumerate(extract_pages(pdf_path)):

    # Initialize the variables needed for the text extraction from the page
    pageObj = pdfReaded.pages[pagenum]
    page_text = []
    line_format = []
    text_from_images = []
    text_from_tables = []
    page_content = []
    # Initialize the number of the examined tables
    table_num = 0
    first_element= True
    table_extraction_flag= False
    # Vertical bounds of the table being skipped (empty until a table is found)
    lower_side, upper_side = float('inf'), float('-inf')
    # Open the pdf file
    pdf = pdfplumber.open(pdf_path)
    # Find the examined page
    page_tables = pdf.pages[pagenum]
    # Find the number of tables on the page
    tables = page_tables.find_tables()


    # Find all the elements
    page_elements = [(element.y1, element) for element in page._objs]
    # Sort all the elements as they appear in the page
    page_elements.sort(key=lambda a: a[0], reverse=True)

    # Find the elements that composed a page
    for i,component in enumerate(page_elements):
        # Extract the position of the top side of the element in the PDF
        pos= component[0]
        # Extract the element of the page layout
        element = component[1]

        # Check if the element is a text element
        if isinstance(element, LTTextContainer):
            # Check if the text appeared in a table
            if table_extraction_flag == False:
                # Use the function to extract the text and format for each text element
                (line_text, format_per_line) = text_extraction(element)
                # Append the text of each line to the page text
                page_text.append(line_text)
                # Append the format for each line containing text
                line_format.append(format_per_line)
                page_content.append(line_text)
            else:
                # Omit the text that appeared in a table
                pass

        # Check the elements for images
        if isinstance(element, LTFigure):
            # Crop the image from the PDF
            crop_image(element, pageObj)
            # Convert the cropped pdf to an image
            convert_to_images('cropped_image.pdf')
            # Extract the text from the image
            image_text = image_to_text('PDF_image.png')
            text_from_images.append(image_text)
            page_content.append(image_text)
            # Add a placeholder in the text and format lists
            page_text.append('image')
            line_format.append('image')

        # Check the elements for tables
        if isinstance(element, LTRect):
            # If the first rectangular element
            if first_element == True and (table_num+1) <= len(tables):
                # Find the bounding box of the table
                lower_side = page.bbox[3] - tables[table_num].bbox[3]
                upper_side = element.y1
                # Extract the information from the table
                table = extract_table(pdf_path, pagenum, table_num)
                # Convert the table information in structured string format
                table_string = table_converter(table)
                # Append the table string into a list
                text_from_tables.append(table_string)
                page_content.append(table_string)
                # Set the flag as True to avoid the content again
                table_extraction_flag = True
                # Make it another element
                first_element = False
                # Add a placeholder in the text and format lists
                page_text.append('table')
                line_format.append('table')

            # Check if we already extracted the tables from the page
            if element.y0 >= lower_side and element.y1 <= upper_side:
                pass
            # Guard against looking past the last element of the page
            elif i+1 >= len(page_elements) or not isinstance(page_elements[i+1][1], LTRect):
                table_extraction_flag = False
                first_element = True
                table_num+=1


    # Create the key of the dictionary
    dctkey = 'Page_'+str(pagenum)
    # Add the list of list as the value of the page key
    text_per_page[dctkey]= [page_text, line_format, text_from_images,text_from_tables, page_content]

# Closing the pdf file object
pdfFileObj.close()

# Deleting the additional files created (they only exist if an image was processed)
for temp_file in ('cropped_image.pdf', 'PDF_image.png'):
    if os.path.exists(temp_file):
        os.remove(temp_file)

# Display the content of the page
result = ''.join(text_per_page['Page_0'][4])
print(result)

The script above will:

Import the necessary libraries.

Open the PDF file using the PyPDF2 library.

Extract each page of the PDF and iterate the following steps.

Examine if there are any tables on the page and create a list of them using pdfplumber.

Find all the elements nested in the page and sort them as they appeared in its layout.

Then for each element:

Examine if it is a text container, and does not appear in a table element. Then use the text_extraction() function to extract the text along with its format, else pass this text.

Examine if it is an image, and use the crop_image() function to crop the image component from the PDF, convert it into an image file using the convert_to_images(), and extract text from it using OCR with the image_to_text() function.

Examine if it is a rectangular element. In this case, we examine if the first rect is part of a page’s table and if yes, we move to the following steps:

  1. Find the bounding box of the table in order not to extract its text again with the text_extraction() function.
  2. Extract the content of the table and convert it into a string.
  3. Then add a boolean parameter to clarify that we extract text from Table.
  4. This process will finish after the last LTRect that falls into the bounding box of the table and the next element in the layout is not a rectangular object. (All the other objects that compose the table will be passed)
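
The bounding-box step mixes two coordinate conventions: pdfplumber reports a table box as (x0, top, x1, bottom) measured from the top of the page, while pdfminer measures y from the bottom. That is why the script subtracts the table's bottom value from the page height. The sketch below illustrates the conversion with made-up numbers; both helper names are hypothetical, and unlike the script, which takes the upper side from the first rectangle, it converts both edges from the table box.

```python
def table_vertical_bounds(page_height, table_bbox):
    """Convert a top-origin (x0, top, x1, bottom) box to bottom-origin (lower, upper) y values."""
    _, top, _, bottom = table_bbox
    return page_height - bottom, page_height - top


def inside_table(y0, y1, lower, upper):
    """True when an element's vertical extent lies within the table bounds."""
    return y0 >= lower and y1 <= upper


# Made-up numbers: a page 842 points tall, with a table between 100 and 300 points from the top.
lower, upper = table_vertical_bounds(842, (50, 100, 500, 300))
print(lower, upper)                          # 542 742
print(inside_table(600, 620, lower, upper))  # True
print(inside_table(100, 120, lower, upper))  # False
```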

The outputs of the process will be stored in 5 lists per iteration, named:

  1. page_text: contains the text coming from text containers in the PDF (a placeholder will be placed when the text was extracted from another element)
  2. line_format: contains the formats of the texts extracted above (a placeholder will be placed when the text was extracted from another element)
  3. text_from_images: contains the texts extracted from images on the page
  4. text_from_tables: contains the table-like string with the contents of tables
  5. page_content: contains all the text rendered on the page in a list of elements

All the lists will be stored under the key in a dictionary that will represent the number of the page examined each time.

Afterwards, we will close the PDF file.

Then we will delete all the additional files created during the process.

Finally, we can display the content of the page by joining the elements of the page_content list.

Conclusion

This was one approach that I believe uses the best characteristics of many libraries and makes the process resilient to various types of PDFs and elements that we can encounter, with PDFMiner however doing most of the heavy lifting. Also, the information regarding the format of the text can help us with the identification of potential titles that can separate the text into distinct logical sections rather than just content per page and can help us to identify the text of greater importance.
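
As a rough sketch of how the format information could be used to separate a page into sections, the example below treats any text block whose largest font size reaches a threshold as a title. The function name, the 14-point threshold, and the sample data are all assumptions for illustration; a real document would need its own rule. Placeholder entries such as 'image' or 'table' in line_format contain no numeric sizes, so they simply end up in the body of the current section.

```python
def split_into_sections(page_text, line_format, heading_size=14.0):
    """Group text blocks under the most recent block whose font size reaches heading_size."""
    sections = []
    current = {"title": None, "body": []}
    for text, formats in zip(page_text, line_format):
        # Keep only the numeric entries, which are the font sizes
        sizes = [f for f in formats if isinstance(f, (int, float))]
        if sizes and max(sizes) >= heading_size:
            # Close the previous section before starting a new one
            if current["title"] is not None or current["body"]:
                sections.append(current)
            current = {"title": text.strip(), "body": []}
        else:
            current["body"].append(text.strip())
    if current["title"] is not None or current["body"]:
        sections.append(current)
    return sections


# Made-up page content in the shape of the page_text and line_format lists
page_text = ["Introduction\n", "PDFs are everywhere.\n", "Method\n", "We parse the layout.\n"]
line_format = [["Arial-Bold", 16.0], ["Arial", 11.0], ["Arial-Bold", 16.0], ["Arial", 11.0]]
for section in split_into_sections(page_text, line_format):
    print(section["title"], "->", section["body"])
```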

However, there will always be more efficient ways to do this task and even though I believe that this approach is more inclusive, I am really looking forward to discussing with you new and better ways of tackling this problem.

📖 References:

  1. https://www.techopedia.com/12-practical-large-language-model-llm-applications
  2. https://www.pdfa.org/wp-content/uploads/2018/06/1330_Johnson.pdf
  3. https://pdfpro.com/blog/guides/pdf-ocr-guide/#:~:text=OCR technology reads text from, a searchable and editable PDF.
  4. https://pdfminersix.readthedocs.io/en/latest/topic/converting_pdf_to_text.html#id1
  5. https://github.com/pdfminer/pdfminer.six


Extracting text from PDF files with Python: A comprehensive guide was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

