Download as pdf or txt
Download as pdf or txt
You are on page 1of 80

Python Project for Data Engineering

by IBM

Extract Transform Load (ETL) Lab


Objectives
After completing this lab you will be able to:

• Read CSV and JSON file types.


• Extract data from the above file types.
• Transform data.
• Save the transformed data in a ready-to-load format which data engineers can
use to load into an RDBMS.

Import the required modules and functions

Extract
CSV Extract Function
def extract_from_csv(file_to_process):
dataframe = pd.read_csv(file_to_process)
return dataframe

JSON Extract Function

def extract_from_json(file_to_process):
dataframe = pd.read_json(file_to_process,lines=True)
return dataframe

XML Extract Function

def extract_from_xml(file_to_process):
dataframe = pd.DataFrame(columns=["name", "height", "weight"])
tree = ET.parse(file_to_process)
root = tree.getroot()
for person in root:
name = person.find("name").text
height = float(person.find("height").text)
weight = float(person.find("weight").text)
dataframe = dataframe.append({"name":name, "height":height, "weight":weight},
ignore_index=True)
return dataframe

Extract Function
Transform
The transform function does the following tasks.

1. Convert height which is in inches to millimeter


2. Convert weight which is in pounds to kilograms

def transform(data):
#Convert height which is in inches to millimeter
#Convert the datatype of the column into float
#data.height = data.height.astype(float)
#Convert inches to meters and round off to two decimals(one inch is 0.0254 meters)
data['height'] = round(data.height * 0.0254,2)

#Convert weight which is in pounds to kilograms


#Convert the datatype of the column into float
#data.weight = data.weight.astype(float)
#Convert pounds to kilograms and round off to two decimals(one pound is 0.45359237
kilograms)
data['weight'] = round(data.weight * 0.45359237,2)
return data
Loading
def load(targetfile,data_to_load):
data_to_load.to_csv(targetfile)

Logging
def log(message):
timestamp_format = '%Y-%h-%d-%H:%M:%S' # Year-Monthname-Day-Hour-Minute-Second
now = datetime.now() # get current timestamp
timestamp = now.strftime(timestamp_format)
with open("logfile.txt","a") as f:
f.write(timestamp + ',' + message + '\n')

Running ETL Process


Peer Review Assignment - Data Engineer
- Webscraping
Objectives
In this part you will:

• Use webscraping to get bank information

For this lab, we are going to be using Python and several Python libraries. Some of these
libraries might be installed in your lab environment or in SN Labs. Others may need to
be installed by you. The cells below will install these libraries when executed.
#!mamba install pandas==1.3.3 -y
#!mamba install requests==2.26.0 -y
!mamba install bs4==4.10.0 -y
!mamba install html5lib==1.1 -y

Imports
Import any additional libraries you may need here.
from bs4 import BeautifulSoup
import html5lib
import requests
import pandas as pd

Extract Data Using Web Scraping


The wikipedia webpage https://en.wikipedia.org/wiki/List_of_largest_banks provides
information about largest banks in the world by various parameters. Scrape the data from the
table 'By market capitalization' and store it in a JSON file.

Webpage Contents
Gather the contents of the webpage in text format using the requests library and assign it to
the variable html_data
url =
"https://en.wikipedia.org/wiki/List_of_largest_banks?utm_medium=Exinfluencer&utm_source
=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-
Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0221ENSkillsNetwork23455645-
2022-01-01"

data = requests.get(url).text

Question 1 Print out the output of the following line, and remember it as it will be a
quiz question:
data[101:124]

Scraping the Data


Question 2 Using the contents and beautiful soup load the data from the By market
capitalization table into a pandas dataframe. The dataframe should have the
country Name and Market Cap (US$ Billion) as column names. Display the first five rows
using head.

Using BeautifulSoup parse the contents of the webpage.


Question 3 Display the first five rows using the head function.
Loading the Data
Usually you will Load the pandas dataframe created above into a JSON
named bank_market_cap.json using the to_json() function, but this time the data will be sent
to another team who will split the data file into two files and inspect it. If you save the
data it will interfere with the next part of the assignment.
Peer Review Assignment - Data Engineer
- Extract API Data
Objectives
In this part you will:

• Collect exchange rate data using an API


• Store the data as a CSV
For this lab, we are going to be using Python and several Python libraries. Some of these
libraries might be installed in your lab environment or in SN Labs. Others may need to be
installed by you. The cells below will install these libraries when executed.

Extract Data Using an API


Using ExchangeRate-API we will extract currency exchange rate data. Use the below steps
to get the access key and to get the data.

1. Open the url : https://exchangeratesapi.io/ and click on Get Free API Key.
2. Subscribe for Free plan and Sign-in with the Google Account.
3. Once the account is created you will be redirected to https://apilayer.com website.
4. Now, click on the user icon and click Account as shown below:
Call the API
Question 1 Using the requests library call the endpoint given above and save the text,
remember the first few characters of the output:

# Write your code here


url =
"https://api.apilayer.com/exchangerates_data/latest?base=EUR&apikey=pTuujt8jRPtpOIJ2jN0
hMUWFOuELBhel" #Make sure to chang
Save as DataFrame
Question 2 Using the data gathered turn it into a pandas dataframe. The dataframe
should have the Currency as the index and Rate as their columns. Make sure to drop
unnecessary columns.

# Turn the data into a dataframe

response = requests.get(url)

response.json()

Load the Data


Using the dataframe save it as a CSV names exchange_rates_1.csv.
# Save the Dataframe

dataframe = pd.DataFrame(response.json())

dataframe
dataframe.to_csv('exchange_rates_1.csv')
Peer Review Assignment - Data Engineer -
ETL
Estimated time needed: 20 minutes

Objectives
In this final part you will:

• Run the ETL process


• Extract bank and market cap data from the JSON file bank_market_cap.json
• Transform the market cap currency using the exchange rate data
• Load the transformed data into a seperate CSV
For this lab, we are going to be using Python and several Python libraries. Some of these
libraries might be installed in your lab environment or in SN Labs. Others may need to
be installed by you. The cells below will install these libraries when executed.

#!mamba install pandas==1.3.3 -y

#!mamba install requests==2.26.0 -y

Imports
Import any additional libraries you may need here.
import glob
import pandas as pd
from datetime import datetime

As the exchange rate fluctuates, we will download the same dataset to make marking
simpler. This will be in the same format as the dataset you used in the last section
Extract
JSON Extract Function
This function will extract JSON files.
def extract_from_json(file_to_process):
dataframe = pd.read_json(file_to_process)
return dataframe
Extract Function
Define the extract function that finds JSON file bank_market_cap_1.json and calls the
function created above to extract data from them. Store the data in a pandas dataframe.
Use the following list for the columns.
columns=['Name','Market Cap (US$ Billion)']
In [14]:
def extract():
# Write your code here
extracted_data = pd.DataFrame(columns=['Name','Market Cap
(US$ Billion)'])
# Process json file
for jsonfile in glob.glob("*.json"):
extracted_data =
extracted_data.append(extract_from_json(jsonfile),
ignore_index=True)

return extracted_data

Question 1 Load the file exchange_rates.csv as a dataframe and find the exchange rate for
British pounds with the symbol GBP, store it in the variable exchange_rate, you will be
asked for the number. Hint: set the parameter index_col to 0.
In [28]:
# Write your code here
#log("Extract phase Started")
extracted_data = extract()
#log("Extract phase Ended")
extracted_data

Transform
Using exchange_rate and the exchange_rates.csv file find the exchange rate of USD to
GBP. Write a transform function that

1. Changes the Market Cap (US$ Billion) column from USD to GBP
2. Rounds the Market Cap (US$ Billion)` column to 3 decimal places
3. Rename Market Cap (US$ Billion) to Market Cap (GBP$ Billion)

def transform(data):
# Write your code here
data['Market Cap (US$ Billion)'] = round(0.75 * data['Market
Cap (US$ Billion)'], 3)
data.rename(columns={'Market Cap (US$ Billion)': 'Market Cap
(GBP$ Billion)'}, inplace=True)
return data

transformed_data = transform(extracted_data)
transformed_data
Load
Create a function that takes a dataframe and load it to a csv
named bank_market_cap_gbp.csv. Make sure to set index to False.
def load(target_file, data_to_load):
# Write your code here
data_to_load.to_csv(target_file, index=False)

Logging Function
Write the logging function log to log your data:
def log(message):
# Write your code here
timestamp_format = '%Y-%h-%d-%H:%M:%S' # Year-Monthname-Day-
Hour-Minute-Second
now = datetime.now() # get current timestamp
timestamp = now.strftime(timestamp_format)
with open("logfile.txt","a") as f:
f.write(timestamp + ',' + message + '\n')
Running the ETL Process
Log the process accordingly using the following "ETL Job Started" and "Extract phase
Started"
In [34]:
# Write your code here
log("ETL Job Started")
log("Extract phase Started")

Extract
Question 2 Use the function extract, and print the first 5 rows, take a screen shot:
In [38]:
# Call the function here
extracted_data = extract()
# Print the rows here
extracted_data.head()
Application Programming Interface
Estimated time needed: 15 minutes
Objectives
After completing this lab you will be able to:

• Create and Use APIs in Python

Introduction
An API lets two pieces of software talk to each other. Just like a function, you don’t have
to know how the API works only its inputs and outputs. An essential type of API is a
REST API that allows you to access resources via the internet. In this lab, we will review
the Pandas Library in the context of an API, we will also review a basic REST API.

Table of Contents
• Pandas is an API

• REST APIs Basics

• Quiz on Tuples

!pip install pycoingecko

!pip install plotly

!pip install mplfinance

Pandas is an API
Pandas is actually set of software components , much of which is not even written in Python.
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.offline import plot
import matplotlib.pyplot as plt
import datetime
from pycoingecko import CoinGeckoAPI
from mplfinance.original_flavor import candlestick2_ohlc
You create a dictionary, this is just data.
dict_={'a':[11,21,31],'b':[12,22,32]}

When you create a Pandas object with the Dataframe constructor in API lingo, this is an
"instance". The data in the dictionary is passed along to the pandas API. You then use
the dataframe to communicate with the API.
df=pd.DataFrame(dict_)
type(df)
REST APIs
Rest API’s function by sending a request, the request is communicated via HTTP message.
The HTTP message usually contains a JSON file. This contains instructions for what
operation we would like the service or resource to perform. In a similar manner, API
returns a response, via an HTTP message, this response is usually contained within a JSON.

In cryptocurrency a popular method to display the movements of the price of a currency.


In this lab, we will be using the <a
href=https://www.coingecko.com/en/api?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_
content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-
SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2022-01-
01>CoinGecko API to create one of these candlestick graphs for Bitcoin. We will use the API to get the
price data for 30 days with 24 observation per day, 1 per hour. We will find the max, min, open, and
close price per day meaning we will have 30 candlesticks and use that to generate the candlestick graph.
Although we are using the CoinGecko API we will use a Python client/wrapper for the API called <a
href=https://github.com/man-
c/pycoingecko?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_te
rm=10006555&utm_id=NA-SkillsNetwork-Channel-
SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2022-01-
01>PyCoinGecko. PyCoinGecko will make performing the requests easy and it will deal with the enpoint
targeting.

Lets start off by getting the data we need. Using the get_coin_market_chart_by_id(id,
vs_currency, days). id is the name of the coin you want, vs_currency is the currency you
want the price in, and days is how many days back from today you want.

cg = CoinGeckoAPI()
bitcoin_data = cg.get_coin_market_chart_by_id(id='bitcoin', vs_currency='usd', days=30)
Now that we have the DataFrame we will convert the timestamp to datetime and save it
as a column called Date. We will map our unix_to_datetime to each timestamp and
convert it to a readable datetime.
2D Numpy in Python
Estimated time needed: 20 minutes
Objectives
After completing this lab you will be able to:

• Operate comfortably with numpy


• Perform complex operations with numpy
Table of Contents
• Create a 2D Numpy Array
• Accessing different elements of a Numpy Array
• Basic Operations
Create a 2D Numpy Array
# Import the libraries

import numpy as np

import matplotlib.pyplot as plt

We can use the attribute ndim to obtain the number of axes or dimensions, referred to as the
rank.
Accessing different elements of a Numpy Array
We can use rectangular brackets to access the different elements of the array. The
correspondence between the rectangular brackets and the list and the rectangular
representation is shown in the following figure for a 3x3 array:
We can also use the following notation to obtain the elements:
We can also use slicing in numpy arrays. Consider the following figure. We would like to obtain
the first two columns in the first row

Similarly, we can obtain the first two rows of the 3rd column as follows:
Basic Operations
We can also add arrays. The process is identical to matrix addition. Matrix addition
of X and Y is shown in the following figure:

Multiplying a numpy array by a scaler is identical to multiplying a matrix by a scaler. If


we multiply the matrix Y by the scaler 2, we simply multiply every element in the matrix
by 2, as shown in the figure.
We can also perform matrix multiplication with the numpy arrays A and B as follows:
First, we define matrix A and B:
Write and Save Files in Python
Estimated time needed: 25 minutes
Objectives
After completing this lab you will be able to:

• Write to files using Python libraries


Table of Contents
• Writing Files
• Appending Files
• Additional File modes
• Copy a File
Writing Files
We can open a file object using the method write() to save the text file to a list. To
write to a file, the mode argument must be set to w. Let’s write a
file Example2.txt with the line: “This is line A”

# Write line to file


exmp2 = '/resources/data/Example2.txt'
with open(exmp2, 'w') as writefile:
writefile.write("This is line A")

We can read the file to see if it worked:


# Read file

with open(exmp2, 'r') as testwritefile:


print(testwritefile.read())

We can write multiple lines:


# Write lines to file

with open(exmp2, 'w') as writefile:


writefile.write("This is line A\n")
writefile.write("This is line B\n")

The method .write() works similar to the method .readline(), except instead of reading a
new line it writes a new line. The process is illustrated in the figure. The different colour
coding of the grid represents a new line added to the file after each method call.
Appending Files
We can write to files without losing any of the existing data as follows by setting the
mode argument to append: a. you can append a new line as follows:

Additional modes
It's fairly ineffecient to open the file in a or w and then reopening it in r to read any
lines. Luckily we can access the file in the following modes:
• r+ : Reading and writing. Cannot truncate the file.
• w+ : Writing and reading. Truncates the file.
• a+ : Appending and Reading. Creates a new file, if none exists.
You dont have to dwell on the specifics of each mode for this lab.
Let's try out the a+ mode:

with open('Example2.txt', 'a+') as testwritefile:


testwritefile.write("This is line E\n")
print(testwritefile.read())

There were no errors but read() also did not output anything. This is because of our
location in the file.
Most of the file methods we've looked at work in a certain location in the
file. .write() writes at a certain location in the file. .read() reads at a certain location
in the file and so on. You can think of this as moving your pointer around in the
notepad to make changes at specific location.
Opening the file in w is akin to opening the .txt file, moving your cursor to the
beginning of the text file, writing new text and deleting everything that follows.
Whereas opening the file in a is similiar to opening the .txt file, moving your cursor
to the very end and then adding the new pieces of text.
It is often very useful to know where the 'cursor' is in a file and be able to control it.
The following methods allow us to do precisely this -
• .tell() - returns the current position in bytes
• .seek(offset,from) - changes the position by 'offset' bytes with respect to
'from'. From can take the value of 0,1,2 corresponding to beginning, relative
to current position and end
Now lets revisit a+

Finally, a note on the difference between w+ and r+. Both of these modes allow
access to read and write methods, however, opening a file in w+ overwrites it
and deletes all pre-existing data.
To work with a file on existing data, use r+ and a+. While using r+, it can be
useful to add a .truncate() method at the end of your data. This will reduce the
file to your data and delete everything that follows.
In the following code block, Run the code as it is first and then run it with
the .truncate().

Copy a File
Let's copy the file Example2.txt to the file Example3.txt:
Reading Files Python
Estimated time needed: 40 minutes
Objectives
After completing this lab you will be able to:

• Read text files using Python libraries


Table of Contents
• Download Data
• Reading Text Files
• A Better Way to Open a File

Reading Text Files


One way to read or write a file in Python is to use the built-in open function.
The open function provides a File object that contains the methods and attributes
you need in order to read, save, and manipulate the file. In this notebook, we will
only cover .txt files. The first parameter you need is the file path and the file name.
An example is shown as follow:
We can read the file and assign it to a variable :
# Read the file
FileContent = file1.read()
FileContent

A Better Way to Open a File


Using the with statement is better practice, it automatically closes the file even if the
code encounters an exception. The code will run everything in the indent block then
close the file object.
We don’t have to read the entire file, for example, we can read the first 4
characters by entering three as a parameter to the method .read():
Here is an example using the same file, but instead we read 16, 5, and then 9
characters at a time:
Print the second line
FileasList[1]
# Print the third line
FileasList[2]

Classes and Objects in Python


Estimated time needed: 40 minutes
Objectives
After completing this lab you will be able to:

• Work with classes and objects


• Identify and define attributes and methods
Table of Contents
• Introduction to Classes and Objects
▪ Creating a class
▪ Instances of a Class: Objects and Attributes
▪ Methods
• Creating a class
• Creating an instance of a class Circle
• The Rectangle Class
Introduction to Classes and Objects
Creating a Class
The first step in creating a class is giving it a name. In this notebook, we will create
two classes: Circle and Rectangle. We need to determine all the data that make up
that class, which we call attributes. Think about this step as creating a blue print that
we will use to create objects. In figure 1 we see two classes, Circle and Rectangle.
Each has their attributes, which are variables. The class Circle has the attribute
radius and color, while the Rectangle class has the attribute height and width. Let’s
use the visual examples of these shapes before we get to the code, as this will help
you get accustomed to the vocabulary.
Instances of a Class: Objects and Attributes
An instance of an object is the realisation of a class, and in Figure 2 we see three instances
of the class circle. We give each object a name: red circle, yellow circle, and green circle.
Each object has different attributes, so let's focus on the color attribute for each object.

Methods
Methods give you a way to change or interact with the object; they are functions that
interact with objects. For example, let’s say we would like to increase the radius of a
circle by a specified amount. We can create a method called add_radius(r) that
increases the radius by r. This is shown in figure 3, where after applying the method
to the "orange circle object", the radius of the object increases accordingly. The “dot”
notation means to apply the method to the object, which is essentially applying a
function to the information in the object.
Creating a Class
Now we are going to create a class Circle, but first, we are going to import a library to
draw the objects:

The next step is a special method called a constructor __init__, which is used to
initialize the object. The inputs are data attributes. The term self contains all the
attributes in the set. For example the self.color gives the value of the attribute
color and self.radius will give you the radius of the object. We also have the
method add_radius() with the parameter r, the method adds the value of r to the
attribute radius. To access the radius we use the syntax self.radius. The labeled
syntax is summarized in Figure 5:

Creating an instance of a class Circle


Let’s create the object RedCircle of type Circle to do the following:
We can increase the radius of the circle by applying the method add_radius().
Let's increases the radius by 2 and then by 5:
Compare the x and y axis of the figure to the figure for RedCircle; they are different.
Let’s create the object SkinnyBlueRectangle of type Rectangle. Its width will be 2
and height will be 3, and the color will be blue:
Loops in Python
Estimated time needed: 20 minutes
Objectives
After completing this lab you will be able to:

• work with the loop statements in Python, including for-loop and while-loop.

Loops in Python
Welcome! This notebook will teach you about the loops in the Python Programming
Language. By the end of this lab, you'll know how to use the loop statements in
Python, including for loop, and while loop.
Table of Contents
• Loops
▪ Range
▪ What is for loop?
▪ What is while loop?
• Quiz on Loops

Loops
Range
Sometimes, you might want to repeat a given operation many times. Repeated
executions like this are performed by loops. We will look at two types of
loops, for loops and while loops.
Before we discuss loops lets discuss the range object. It is helpful to think of the
range object as an ordered list. For now, let's look at the simplest case. If we would
like to generate an object that contains elements ordered from 0 to 2 we simply use
the following command:
What is for loop?
The for loop enables you to execute a code block multiple times. For example, you
would use this if you would like to print out every element in a list.
Let's try to use a for loop to print all the years presented in the list dates:
This can be done as follows:
What is while loop?
As you can see, the for loop is used for a controlled flow of repetition. However, what
if we don't know when we want to stop the loop? What if we want to keep executing a
code block until a certain condition is met? The while loop exists as a tool for
repeated execution based on a condition. The code block will keep being executed
until the given logical condition returns a False boolean value.
Let’s say we would like to iterate through list dates and stop at the year 1973, then
print out the number of iterations. This can be done with the following block of code:
Web Scraping Lab
Estimated time needed: 30 minutes
Objectives
After completing this lab you will be able to:
Table of Contents
• Beautiful Soup Object
▪ Tag
▪ Children, Parents, and Siblings
▪ HTML Attributes
▪ Navigable String
• Filter
▪ find All
▪ find
▪ HTML Attributes
▪ Navigable String
• Downloading And Scraping The Contents Of A Web
Estimated time needed: 25 min

For this lab, we are going to be using Python and several Python libraries. Some of
these libraries might be installed in your lab environment or in SN Labs. Others may
need to be installed by you. The cells below will install these libraries when
executed.
Beautiful Soup Objects
Beautiful Soup is a Python library for pulling data out of HTML and XML files, we
will focus on HTML files. This is accomplished by representing the HTML as a set of
objects with methods used to parse the HTML. We can navigate the HTML as a tree
and/or filter out what we are looking for.

Consider the following HTML:


We can use the method prettify() to display the HTML in the nested structure:
Tags
Let's say we want the title of the page and the name of the top paid player we can
use the Tag. The Tag object corresponds to an HTML tag in the original document,
for example, the tag title.
Children, Parents, and Siblings
As stated above the Tag object is a tree of objects we can access the child of the tag
or navigate down the branch as follows:

HTML Attributes
If the tag has attributes, the tag id="boldest" has an attribute id whose value
is boldest. You can access a tag’s attributes by treating the tag like a dictionary:
Navigable String
A string corresponds to a bit of text or content within a tag. Beautiful Soup uses
the NavigableString class to contain this text. In our HTML we can obtain the name
of the first player by extracting the sting of the Tag object tag_child as follows:

Filter
Filters allow you to find complex patterns, the simplest filter is a string. In this section we
will pass a string to a different filter method and Beautiful Soup will perform a match
against that exact string. Consider the following HTML of rocket launchs:
find All
The find_all() method looks through a tag’s descendants and retrieves all
descendants that match your filters.
The Method signature for find_all(name, attrs, recursive, string, limit, **kwargs)
Name
When we set the name parameter to a tag name, the method will extract all the tags
with that name and its children.
Attributes
If the argument is not recognized it will be turned into a filter on the tag’s attributes.
For example the id argument, Beautiful Soup will filter against each
tag’s id attribute. For example, the first td elements have a value of id of flight,
therefore we can filter based on that id value.
find
The find_all() method scans the entire document looking for results, it’s if you are
looking for one element you can use the find() method to find the first element in
the document. Consider the following two table:
Assume that we are looking for the 10 most densly populated countries table, we can look
through the tables list and find the right one we are look for based on the data in each table or
we can search for the table name if it is in the table but this option might not always work.
Scrape data from HTML tables into a DataFrame using
BeautifulSoup and read_html
Using the same url, data, soup, and tables object as in the last section we can use
the read_html function to create a DataFrame.
Remember the table we need is located in tables[table_index]
We can now use the pandas function read_html and give it the string version of the
table as well as the flavor which is the parsing engine bs4.

The function read_html always returns a list of DataFrames so we must pick the
one we want out of the list.

import pandas as pd
population_data_read_html = pd.read_html(str(tables[5]), flavor='bs4')[0]
population_data_read_html

You might also like