Scrape Data From PDF Files Using Python Towards Data Science

23/11/21 19:09 Scrape Data from PDF Files Using Python | by Aaron Zhu | Towards Data Science
Get started Open in app
Follow 600K Followers
You have 1 free member-only story left this month. Sign up for Medium and get an extra one
Scrape Data from PDF Files Using Python

You want to make friends with tabula-py and Pandas
Aaron Zhu Jul 12 · 5 min read
Background
Data science professionals are dealing with data in all shapes and forms. Data could be
stored in popular SQL databases, such as PostgreSQL, MySQL, or am old-fashioned excel
spreadsheet. Sometimes, data might also be saved in an unconventional format, such as
PDF. In this article, I am going to talk about how to scrape data from PDF using Python
libraries.
Required Libraries
tabula-py: to scrape text from PDF files
re: to extract data using regular expression
pandas — to construct and manipulate our panel data
Install Libraries
pip install tabula-py
pip install pandas
Import Libraries
https://towardsdatascience.com/scrape-data-from-pdf-files-using-python-fe2dc96b1e68 1/8
import tabula as tb
import pandas as pd
import re
Scrape PDF Data in Structured Form

First, let’s talk about scraping PDF data in a structured format. In the following example,
we want to scrape the table on the bottom left corner. It is nicely-structured tabular
data, in which rows and columns are well defined.
Sample Structured Data (Source)
Scraping PDF data in structured form is actually straightforward using tabula-py . We
just need to input the location of the tabular data in the PDF page by specifying the top,
left, bottom and right coordinates of the area . If the PDF page only includes the target
table, then we don’t even need to specify the area. tabula-py should be able to detect
the rows and columns automatically.
file = 'state_population.pdf'
data = tb.read_pdf(file, area = (300, 0, 600, 800), pages = '1')
Scrape PDF Data in Unstructured Form

Next, we will explore something more interesting — PFD data in an unstructured
format.
To implement statistical analysis, data visualization and machine learning model, we

need the data in tabular form (panel data). However, many data are only available in an
unstructured format. For example, HR staff likely to keep historical payroll data, which
might not be created in tabular form. In the following picture, we have an example of
payroll data, which has mixed data structures. On the left section, it has data in long
format, including employee name, net amount, pay date and pay period. On the right
section, it has pay category, pay rate, hours and pay amount.
(Created by Author)
There are few steps we need to take to transform the data into panel format.
Step 1: Import PDF data as a DataFrame
Like data in a structured format, we also use tb.read_pdf to import the unstructured
data. This time, we need to specify extra options to properly import the data.
file = 'payroll_sample.pdf'
df= tb.read_pdf(file, pages = '1', area = (0, 0, 1000, 1000), columns

= [200, 265, 300, 320], pandas_options={'header': None}, stream=True)
[0]
Area and Columns: I’ve talked about area above. Here we will also need to use columns
to identify the locations all relevant columns (one column on the left section and four
columns on the right section.)
Stream and Lattice: if there are grid lines to separate each cell, we can use lattice =
True to automatically identify each cell, If not, we can use stream = True and columns to
manually specify each cell.
lattice (bool, optional) — Force PDF to be extracted using lattice-mode extraction (if
there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
stream (bool, optional) — Force PDF to be extracted using stream-mode extraction

(if there are no ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
(Created by Author)
Step 2: Create a Row Identifier
Now we have some data to work with, we will use Python library Pandas to manipulate
the dataframe.
First, we will need to create a new column that can identify unique rows. We notice that
employee name (Superman and Batman) seem to be useful to identify border between
different records. Each employee name contains a unique pattern, which starts with a
capital letter and ends with a lower-case letter. We can use regular expression '^[A-Z].*
[a-z]$' to identify employee name, then use Pandas function cumsum (cumulative sum)
to create a row identifier.
df['border'] = df.apply(lambda x: 1 if re.findall('^[A-Z].*[a-z]$',

str(x[0])) else 0, axis = 1)
df['row'] = df['border'].transform('cumsum')
(Created by Author)
Step 3: Reshape the data (convert data from long form to wide form)
Next, we will reshape data on both the left section and right section. For the left section,
we create a new dataframe, employee that includes employee_name, net_amount,
pay_date and pay_period. For the right section, we create another dataframe, payment
that includes OT_Rate, Regular_Rate, OT_Hours, Regular_Hours, OT_Amt and
Regular_Amt. To convert the data in a wide form, we can use Pandas function, pivot .
# reshape left section
employee = df[[0, 'row']]
employee = employee[employee[0].notnull()]
employee['index'] = employee.groupby('row').cumcount()+1
employee = employee.pivot(index = ['row'], columns = ['index'],

values = 0).reset_index()
employee = employee.rename(columns = {1: 'employee_name', 2:

'net_amount', 3: 'pay_date', 4: 'pay_period'})
employee['net_amount'] = employee.apply(lambda x:
x['net_amount'].replace('Net', '').strip(), axis = 1)
# reshape right section
payment = df[[1, 2, 3, 4, 'row']]
payment = payment[payment[1].notnull()]
payment = payment[payment['row']!=0]
payment = payment.pivot(index = ['row'], columns = 1, values = [2, 3,

4]).reset_index()
payment.columns = [str(col[0])+col[1] for col in

payment.columns.values]
for i in ['Regular', 'OT']:
payment = payment.rename(columns = {f'2{i}': f'{i}_Rate',

f'3{i}': f'{i}_Hours', f'4{i}': f'{i}_Amt'})
(Created by Author)
(Created by Author)
Step 4: Join the data in the left section with the data in right section
Lastly, we use the function, merge to join both employee and payment dataframes based
on the row identifier.
df_clean = employee.merge(payment, on = ['row'], how = 'inner')
(Created by Author)
Final Note
As of today, companies still manually process PDF data. With the help of python
libraries, we can save time and money by automating this process of scraping data from
PDF files and converting unstructured data into panel data.
Please keep in mind that when scraping data from PDF files, you should always carefully
read the terms and conditions posted by the author and make sure you have permission
to do so.
Stay Tuned! I will talk about other tools, such as PDFQuery and PyPDF2 to work with
PDF data in future articles.
Thank you for reading !!!
Sign up for The Variable

By Towards Data Science
Every Thursday, the Variable delivers the very best of Towards Data Science: from hands-on tutorials
and cutting-edge research to original features you don't want to miss. Take a look.
Get this newsletter
Python Pdf Scraping Data Science Tabula Pandas
About Write Help Legal
Get the Medium app

Scrape Data From PDF Files Using Python Towards Data Science

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Scrape Data From PDF Files Using Python Towards Data Science

Uploaded by

Copyright:

Available Formats

23/11/21 19:09 Scrape Data from PDF Files Using Python | by Aaron Zhu | Towards Data Science

Get started Open in app

Follow 600K Followers

Scrape Data from PDF Files Using Python

Aaron Zhu Jul 12 · 5 min read

re: to extract data using regular expression

pandas — to construct and manipulate our panel data

pip install tabula-py

pip install pandas

Scrape PDF Data in Structured Form

Sample Structured Data (Source)

Scraping PDF data in structured form is actually straightforward using tabula-py . We

data = tb.read_pdf(file, area = (300, 0, 600, 800), pages = '1')

Scrape PDF Data in Unstructured Form

To implement statistical analysis, data visualization and machine learning model, we

Step 1: Import PDF data as a DataFrame

df= tb.read_pdf(file, pages = '1', area = (0, 0, 1000, 1000), columns

stream (bool, optional) — Force PDF to be extracted using stream-mode extraction

Step 2: Create a Row Identifier

df['border'] = df.apply(lambda x: 1 if re.findall('^[A-Z].*[a-z]$',

# reshape left section

employee = df[[0, 'row']]

employee = employee.pivot(index = ['row'], columns = ['index'],

employee = employee.rename(columns = {1: 'employee_name', 2:

# reshape right section

payment = df[[1, 2, 3, 4, 'row']]

payment = payment.pivot(index = ['row'], columns = 1, values = [2, 3,

payment.columns = [str(col[0])+col[1] for col in

for i in ['Regular', 'OT']:

payment = payment.rename(columns = {f'2{i}': f'{i}_Rate',

df_clean = employee.merge(payment, on = ['row'], how = 'inner')

Thank you for reading !!!

Sign up for The Variable

Get this newsletter

Python Pdf Scraping Data Science Tabula Pandas

About Write Help Legal

Get the Medium app

You might also like