23/11/21 19:09 Scrape Data from PDF Files Using Python | by Aaron Zhu | Towards Data Science

Scrape Data from PDF Files Using Python


You want to make friends with tabula-py and Pandas

Aaron Zhu Jul 12 · 5 min read

Background
Data science professionals deal with data in all shapes and forms. Data could be
stored in popular SQL databases, such as PostgreSQL or MySQL, or in an old-fashioned Excel
spreadsheet. Sometimes, data might also be saved in an unconventional format, such as
PDF. In this article, I am going to talk about how to scrape data from PDF files using Python
libraries.

Required Libraries
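tabula-py: to scrape tabular data from PDF files (it wraps tabula-java, so a Java runtime must be installed)

re: to extract data using regular expressions (part of the Python standard library)

pandas: to construct and manipulate our panel data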

Install Libraries

pip install tabula-py

pip install pandas

Import Libraries

import tabula as tb

import pandas as pd

import re

Scrape PDF Data in Structured Form


First, let’s talk about scraping PDF data in a structured format. In the following example,
we want to scrape the table in the bottom-left corner. It is nicely structured tabular
data, in which rows and columns are well defined.

Sample Structured Data (Source)

Scraping PDF data in structured form is actually straightforward using tabula-py. We
just need to input the location of the tabular data in the PDF page by specifying the top,
left, bottom and right coordinates of the area. If the PDF page only includes the target
table, then we don’t even need to specify the area. tabula-py should be able to detect
the rows and columns automatically.

file = 'state_population.pdf'

data = tb.read_pdf(file, area = (300, 0, 600, 800), pages = '1')
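The area tuple is ordered (top, left, bottom, right). To my knowledge, tabula measures these values in PDF points (1/72 of an inch) from the top-left corner of the page. If you measured the table in inches, a small helper can build the tuple for you. This is only a sketch: inches_to_area is a hypothetical name of my own, not part of tabula-py, and the measurements are made up.

```python
POINTS_PER_INCH = 72  # a PDF point is 1/72 of an inch


def inches_to_area(top, left, bottom, right):
    """Hypothetical helper (not part of tabula-py): convert a box measured
    in inches into a (top, left, bottom, right) tuple in points."""
    if bottom <= top or right <= left:
        raise ValueError("bottom/right must be greater than top/left")
    return tuple(round(v * POINTS_PER_INCH, 2) for v in (top, left, bottom, right))


# Made-up measurements: a table that starts 4 inches down a letter-size page
area = inches_to_area(4.0, 0.0, 8.5, 8.5)
print(area)  # (288.0, 0.0, 612.0, 612.0)

# Assumed usage: tb.read_pdf(file, area=area, pages='1')
```

Keep in mind that tb.read_pdf returns a list of DataFrames, one per detected table, so the table itself is data[0].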


Scrape PDF Data in Unstructured Form


Next, we will explore something more interesting: PDF data in an unstructured
format.

To perform statistical analysis, data visualization and machine learning, we
need the data in tabular form (panel data). However, much data is only available in an
unstructured format. For example, HR staff are likely to keep historical payroll data, which
might not be created in tabular form. In the following picture, we have an example of
payroll data, which has mixed data structures. The left section has data in long
format, including employee name, net amount, pay date and pay period. The right
section has pay category, pay rate, hours and pay amount.

(Created by Author)

There are a few steps we need to take to transform the data into panel format.

Step 1: Import PDF data as a DataFrame

As with data in a structured format, we use tb.read_pdf to import the unstructured
data. This time, we need to specify extra options to import the data properly.


file = 'payroll_sample.pdf'

df = tb.read_pdf(file, pages = '1', area = (0, 0, 1000, 1000),
                 columns = [200, 265, 300, 320],
                 pandas_options = {'header': None}, stream = True)[0]

Area and Columns: I’ve talked about area above. Here we also need to use columns
to identify the locations of all relevant columns (one column in the left section and four
columns in the right section).

Stream and Lattice: if there are grid lines separating each cell, we can use lattice =
True to identify each cell automatically. If not, we can use stream = True and columns to
specify each cell manually.

lattice (bool, optional): Force PDF to be extracted using lattice-mode extraction (if
there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet)

stream (bool, optional): Force PDF to be extracted using stream-mode extraction
(if there are no ruling lines separating each cell)
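The choice between the two modes can be written down as a small decision helper. This is only a sketch: build_read_options is a hypothetical function of my own, not part of tabula-py. It simply assembles the keyword arguments (lattice, stream, columns) described above.

```python
def build_read_options(has_ruling_lines, columns=None):
    """Hypothetical helper: choose keyword arguments for tb.read_pdf."""
    if has_ruling_lines:
        # Grid lines are present, so lattice mode can find the cells itself
        return {"lattice": True}
    options = {"stream": True}
    if columns:
        # X coordinates of the column boundaries, from left to right
        options["columns"] = sorted(columns)
    return options


# The payroll sample has no grid lines, so we pass the boundaries ourselves
payroll_options = build_read_options(False, columns=[200, 265, 300, 320])
print(payroll_options)

# Assumed usage: tb.read_pdf(file, pages='1', **payroll_options)
```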

(Created by Author)

Step 2: Create a Row Identifier


Now that we have some data to work with, we will use the Python library Pandas to manipulate
the dataframe.

First, we need to create a new column that identifies unique rows. We notice that
the employee names (Superman and Batman) seem useful for identifying the border between
different records. Each employee name follows a distinctive pattern: it starts with a
capital letter and ends with a lower-case letter. We can use the regular expression
'^[A-Z].*[a-z]$' to identify employee names, then use the Pandas function cumsum
(cumulative sum) to create a row identifier.

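# flag the rows whose first column looks like an employee name
df['border'] = df.apply(lambda x: 1 if re.findall('^[A-Z].*[a-z]$', str(x[0])) else 0,
                        axis = 1)

# running total of the flags: every row of a record gets the same id
df['row'] = df['border'].transform('cumsum')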
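To see why this works, here is a minimal sketch on a toy column. The values are made up and only stand in for the first column of the imported payroll data.

```python
import re

import pandas as pd

# Toy stand-in for the first column of the payroll data (made-up values)
lines = pd.Series([
    "Superman", "Net $1,000.00", "Pay Date: 01/15/2021",
    "Batman", "Net $2,000.00", "Pay Date: 01/15/2021",
])

name_pattern = re.compile(r"^[A-Z].*[a-z]$")

# 1 where the line looks like an employee name, 0 otherwise
border = lines.apply(lambda s: 1 if name_pattern.search(s) else 0)

# The running total goes up by one at every name,
# so all lines of one record share the same id
row = border.cumsum()

print(border.tolist())  # [1, 0, 0, 1, 0, 0]
print(row.tolist())     # [1, 1, 1, 2, 2, 2]
```

Note that the pattern would also match any other text that starts with a capital letter and ends with a lower-case letter (for example, a pay category like Regular), which is why it is applied to the first column only.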

(Created by Author)

Step 3: Reshape the data (convert data from long form to wide form)

Next, we will reshape the data in both the left section and the right section. For the left section,
we create a new dataframe, employee, that includes employee_name, net_amount,
pay_date and pay_period. For the right section, we create another dataframe, payment,
that includes OT_Rate, Regular_Rate, OT_Hours, Regular_Hours, OT_Amt and
Regular_Amt. To convert the data to wide form, we can use the Pandas function pivot.


# reshape left section

# keep the non-empty lines of column 0 along with their record id
employee = df.loc[df[0].notnull(), [0, 'row']].copy()

# position of each line within its record: 1 = name, 2 = net amount, ...
employee['index'] = employee.groupby('row').cumcount() + 1

employee = employee.pivot(index = 'row', columns = 'index', values = 0).reset_index()

employee = employee.rename(columns = {1: 'employee_name', 2: 'net_amount',
                                      3: 'pay_date', 4: 'pay_period'})

# strip the 'Net' label so that only the amount remains
employee['net_amount'] = employee['net_amount'].str.replace('Net', '', regex = False).str.strip()

# reshape right section

# keep the lines that have a pay category and belong to a record
payment = df.loc[df[1].notnull() & (df['row'] != 0), [1, 2, 3, 4, 'row']]

payment = payment.pivot(index = 'row', columns = 1, values = [2, 3, 4]).reset_index()

# flatten the two-level column labels, e.g. (2, 'OT') becomes '2OT'
payment.columns = [str(col[0]) + str(col[1]) for col in payment.columns.values]

for i in ['Regular', 'OT']:
    payment = payment.rename(columns = {f'2{i}': f'{i}_Rate',
                                        f'3{i}': f'{i}_Hours',
                                        f'4{i}': f'{i}_Amt'})
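If the long-to-wide step feels abstract, the following sketch replays it on a toy dataframe. The values and the column names value and position are made up for illustration; only the groupby, cumcount and pivot calls mirror the code above.

```python
import pandas as pd

# Toy long-form data (made-up values): one line per attribute,
# with the record id that was created in Step 2
long_df = pd.DataFrame({
    "row": [1, 1, 1, 2, 2, 2],
    "value": ["Superman", "Net $1,000.00", "01/15/2021",
              "Batman", "Net $2,000.00", "01/15/2021"],
})

# Number the lines within each record: 1, 2, 3
long_df["position"] = long_df.groupby("row").cumcount() + 1

# Each position becomes a column and each record becomes a single row
wide_df = long_df.pivot(index="row", columns="position", values="value").reset_index()
wide_df = wide_df.rename(columns={1: "employee_name", 2: "net_amount", 3: "pay_date"})
wide_df.columns.name = None  # drop the leftover 'position' label

print(wide_df)
```

The result has one row per employee, which is the shape we need before joining the two sections.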

(Created by Author)

(Created by Author)

Step 4: Join the data in the left section with the data in right section

Lastly, we use the merge function to join the employee and payment dataframes on
the row identifier.


df_clean = employee.merge(payment, on = ['row'], how = 'inner')
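An inner join silently drops any record that is missing from one side, so it may be worth checking the result. Here is a minimal sketch on toy dataframes (made-up values). The validate option is an extra safeguard that I am adding here; it is not used in the code above.

```python
import pandas as pd

# Toy versions of the two reshaped dataframes (made-up values)
employee = pd.DataFrame({"row": [1, 2], "employee_name": ["Superman", "Batman"]})
payment = pd.DataFrame({"row": [1, 2], "Regular_Hours": [80, 75]})

# validate='one_to_one' raises an error if a row id is duplicated on either side
merged = employee.merge(payment, on="row", how="inner", validate="one_to_one")

# No employee should be lost by the inner join
assert len(merged) == len(employee)

print(merged)
```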

(Created by Author)

Final Note
As of today, many companies still process PDF data manually. With the help of Python
libraries, we can save time and money by automating the process of scraping data from
PDF files and converting unstructured data into panel data.

Please keep in mind that when scraping data from PDF files, you should always carefully
read the terms and conditions posted by the author and make sure you have permission
to do so.

Stay tuned! In future articles, I will talk about other tools for working with
PDF data, such as PDFQuery and PyPDF2.

Thank you for reading !!!
