Edx Course Lab Programs
EXERCISE: 1
We can use Python to count how many times a word is used within a sentence or document.
This can be useful for text analysis projects and other types of reporting.
PROGRAM:
sentence = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
# Let's handle the capital letters by making everything lowercase
# (strings are immutable, so each call returns a new string; sentence itself is unchanged)
sentence.lower()
# Remove the question mark
sentence.replace("?", "")
# We can use "method chaining" to do both in one go!
sentence.lower().replace("?", "")
# You can use the split() method to extract the words from the sentence
sentence.split()
# Let's put it all together and assign to a variable ready to use in the next step
words = sentence.lower().replace("?", "").split()
words
# Create a dictionary ready to hold the words (as keys) and their counts (as values)
word_counts = {}
# Loop through the words
for word in words:
    # Is the word in the dictionary yet?
    if word in word_counts:
        # If it is, add 1 to the current count
        word_counts[word] += 1
    else:
        # If it isn't, put it in with a count of 1
        word_counts[word] = 1
print(word_counts)
OUTPUT:
{'how': 1, 'much': 1, 'wood': 2, 'would': 1, 'a': 2, 'woodchuck': 2, 'chuck': 2, 'if': 1, 'could': 1}
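The same counts can come from the standard library's `collections.Counter`, which folds the loop above into a single call; a minimal alternative sketch:

```python
from collections import Counter

sentence = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
words = sentence.lower().replace("?", "").split()

# Counter builds the word -> count mapping in one step
word_counts = Counter(words)
print(word_counts["wood"])       # 2
print(word_counts["woodchuck"])  # 2
```

`Counter` is a dict subclass, so it supports the same `word_counts[word]` lookups as the hand-rolled dictionary.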
Exercise: 2
1. Adapt your `count_words()` function from the previous exercise so that it strips out
punctuation using the code tip provided above.
2. Try out your adapted function with a long-form piece of text of your choice, using triple
quotes (`"""`) to store this information in a single multi-line string variable.
3. Find the top 20 words in your chosen text using the `sort_dictionary()` function provided
above. (This returns a list, so you can use list slicing to reduce it down to just the top 20.)
Program:
def sort_dictionary(input_dictionary, reverse=True):
    return sorted(input_dictionary.items(), key=lambda x: x[1], reverse=reverse)

def count_words(sentence):
    words = (
        sentence.lower()
        .replace("?", "")  # remove ?
        .replace(".", "")  # remove .
        .replace(",", "")  # remove ,
        .split()
    )
    word_counts = {}
    for word in words:
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts
sentence = """
HELLO WORLD WELCOME TO MACHINE LEARNING LAB"""
sort_dictionary(count_words(sentence))
OUTPUT:
[('hello', 1),
('world', 1),
('welcome', 1),
('to', 1),
('machine', 1),
('learning', 1),
('lab', 1)]
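The three `replace()` calls above only cover `?`, `.`, and `,`; `str.translate` with `string.punctuation` strips every ASCII punctuation mark in one pass, and list slicing then gives the top 20 as step 3 asks. A self-contained sketch (the `text` and `top_n` names here are mine, not from the lab):

```python
import string

def count_words(sentence):
    # Strip all punctuation in one pass, then lowercase and split
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    word_counts = {}
    for word in cleaned.split():
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts

def sort_dictionary(input_dictionary, reverse=True):
    return sorted(input_dictionary.items(), key=lambda x: x[1], reverse=reverse)

text = """How much wood would a woodchuck chuck,
if a woodchuck could chuck wood?"""

# sort_dictionary returns a list, so slicing keeps just the top entries
top_n = sort_dictionary(count_words(text))[:20]
print(top_n[0])  # ('wood', 2)
```

Because `sorted` is stable, ties keep their first-seen order, so 'wood' (the first word to reach a count of 2) heads the list.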
Exercise: 3
So far we've been working with only 2019 data. Let's now read in the full Dream Destination dataset containing 66k orders across a ten-year period (2010-2019).
> 1. Using `pd.read_excel()` read in the `"Order Database"` sheet from the file `"Hotel Industry - Order and Finance Database.xlsx"`.
> 2. As before, assign it to a variable called `orders`.
PROGRAM:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
# Read in the Dream Destination hotel data
orders = pd.read_excel("Hotel Industry - Order and Finance Database.xlsx", sheet_name="Order Database")
orders.head(3)
orders.Year.hist();
# NOTE: `finance` is assumed to have been read from the finance sheet of the
# same workbook in an earlier step (its sheet name isn't shown in this lab)
df = pd.merge(left=orders, right=finance, on='Booking ID', how='left')
len(df)
df.columns
df.head()
df[['Total Booking Amount', 'Discount Amount', 'Net Sales']].head(3)
sns.barplot(x='Total Booking Amount', y='Origin Country', estimator=sum, data=df);
output:
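The `pd.merge(..., how='left')` step keeps every row of `orders` and attaches the matching finance columns by `Booking ID`, so `len(df)` should equal `len(orders)`; a minimal sketch with made-up toy frames:

```python
import pandas as pd

orders = pd.DataFrame({"Booking ID": [1, 2, 3], "Year": [2018, 2019, 2019]})
finance = pd.DataFrame({"Booking ID": [1, 2], "Net Sales": [100.0, 250.0]})

# Left join: every order survives; unmatched rows get NaN in the finance columns
df = pd.merge(left=orders, right=finance, on="Booking ID", how="left")
print(len(df))                        # 3 -- same row count as orders
print(df["Net Sales"].isna().sum())   # 1 -- booking 3 had no finance record
```

If `len(df)` had grown instead, that would signal duplicate `Booking ID` values on the right-hand side.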
Exercise: 4
> Create a Seaborn barplot showing Total Booking Amount (x-axis) by Origin Country (y-axis).
Program:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
# `df` is the merged orders/finance dataframe from the previous exercise
sns.barplot(x='Total Booking Amount', y='Origin Country', estimator=sum, data=df)
# A second, larger chart: `top_states` is assumed to be a pre-built dataframe
# of per-state revenue (its construction isn't shown in this lab)
plt.figure(figsize=(15, 10))
sns.barplot(
    x='Revenue',
    y='State',
    hue='Origin Country',
    dodge=False,
    orient='h',
    data=top_states
);
Output:
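`estimator=sum` tells Seaborn to total all rows in each category before drawing the bar; the same numbers fall out of a plain `groupby`, which is a quick way to sanity-check the chart. A sketch with toy data (not the hotel dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "Origin Country": ["UK", "UK", "France"],
    "Total Booking Amount": [100.0, 50.0, 80.0],
})

# Bar lengths under estimator=sum equal the per-country totals
totals = df.groupby("Origin Country")["Total Booking Amount"].sum()
print(totals["UK"])      # 150.0
print(totals["France"])  # 80.0
```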
APACHE SPARK FOR DATA ENGINEERING AND MACHINE LEARNING
Exercise: 1
Building and Training a Prediction Model using Linear Regression
a. Load a dataset (the diamonds dataset)
b. Identify the target column and the feature columns
c. Build and train a new linear regression model
d. Evaluate the model
e. Predict the price of a diamond
DATASET:
Diamonds dataset. Available at https://www.openml.org/search?type=data&sort=runs&id=42225&status=active
PROGRAM:
import pandas as pd
from sklearn.linear_model import LinearRegression
data = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv"
df = pd.read_csv(data)
target = df["price"]
features = df[["carat","depth"]]
lr = LinearRegression()
lr.fit(features,target)
print("Model Score:", lr.score(features,target))
print("Predicted price of the diamond:",lr.predict([[0.3, 60]]))
OUTPUT:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
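Before trusting the diamond model, it's handy to confirm the same fit/score/predict pattern on data with a known answer; a minimal sketch on synthetic, exactly linear data (not part of the lab):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# y = 3*x1 + 2*x2 exactly, so the model should recover it perfectly
X = np.array([[1, 1], [2, 0], [0, 3], [4, 2]], dtype=float)
y = 3 * X[:, 0] + 2 * X[:, 1]

lr = LinearRegression()
lr.fit(X, y)
print(round(lr.score(X, y), 6))           # 1.0 (perfect R-squared)
print(lr.predict([[1.0, 2.0]]).round(6))  # [7.]
```

The diamond model's `score` is an R² on the training data, so values well below 1.0 mean carat and depth alone leave much of the price variance unexplained.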
EXERCISE: 2
Building and Training a Classification Model using Logistic Regression
Dataset: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
PROGRAM:
import pandas as pd
from sklearn.linear_model import LogisticRegression
data = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/cancer.csv"
df = pd.read_csv(data)
target = df["diagnosis"]
features = df[['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
'compactness_mean', 'concavity_mean', 'symmetry_mean']]
classifier = LogisticRegression()
classifier.fit(features,target)
classifier.score(features,target)
classifier.predict([[13.45,86.6,555.1,0.1022,0.08165,0.03974,0.1638]])
OUTPUT:
LogisticRegression()
0.8963093145869947
array(['Benign'], dtype=object)
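`predict` returns class labels, which is why the output above is `array(['Benign'], dtype=object)` rather than a number; a minimal sketch on made-up, well-separated data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two well-separated 1-D clusters labelled with strings, mimicking the
# Benign/Malignant diagnosis column
X = np.array([[1.0], [2.0], [8.0], [9.0]])
y = np.array(["Benign", "Benign", "Malignant", "Malignant"])

clf = LogisticRegression()
clf.fit(X, y)
print(clf.score(X, y))       # 1.0 -- accuracy, not R-squared
print(clf.predict([[1.5]]))  # ['Benign']
```

Note that for classifiers `score` reports accuracy (fraction of correct labels), so the lab's 0.896 means roughly 90% of training samples were classified correctly.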
Exercise: 3
Connecting to the Spark Cluster using SN Labs
Create a Spark Session
A. Load the dataset into a dataframe
B. Explore the data
C. Print the top 5 rows of the dataframe
D. Stop the spark session
Dataset:
Download the dataset from: "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv"
Program:
# Import and initialise findspark (lets plain Python locate the Spark installation)
import findspark
findspark.init()
#import SparkSession
from pyspark.sql import SparkSession
#Create SparkSession
spark = SparkSession.builder.appName("Getting Started with Spark").getOrCreate()
#Download the data file
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv
#Load diamond dataset into a dataframe named diamond_data
diamond_data = spark.read.csv("diamonds.csv", header=True, inferSchema=True)
#Print the schema of the dataframe
diamond_data.printSchema()
#Print the top 5 rows of the dataframe
diamond_data.head(5)
#Stop the spark session
spark.stop()
OUTPUT:
root
|-- s: integer (nullable = true)
|-- carat: double (nullable = true)
|-- cut: string (nullable = true)
|-- color: string (nullable = true)
|-- clarity: string (nullable = true)
|-- depth: double (nullable = true)
|-- table: double (nullable = true)
|-- price: integer (nullable = true)
|-- x: double (nullable = true)
|-- y: double (nullable = true)
|-- z: double (nullable = true)
Output:
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| s|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| 1| 0.23| Ideal| E| SI2| 61.5| 55.0| 326|3.95|3.98|2.43|
| 2| 0.21|Premium| E| SI1| 59.8| 61.0| 326|3.89|3.84|2.31|
| 3| 0.23| Good| E| VS1| 56.9| 65.0| 327|4.05|4.07|2.31|
| 4| 0.29|Premium| I| VS2| 62.4| 58.0| 334| 4.2|4.23|2.63|
| 5| 0.31| Good| J| SI2| 63.3| 58.0| 335|4.34|4.35|2.75|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
Exercise: 4
Building and evaluating a linear regression model with SparkML (features assembled from carat, depth, and table)
Output:
+----------------+-----+
| features|price|
+----------------+-----+
|[0.23,61.5,55.0]| 326|
|[0.21,59.8,61.0]| 326|
|[0.23,56.9,65.0]| 327|
|[0.29,62.4,58.0]| 334|
|[0.31,63.3,58.0]| 335|
|[0.24,62.8,57.0]| 336|
|[0.24,62.3,57.0]| 336|
|[0.26,61.9,55.0]| 337|
|[0.22,65.1,61.0]| 337|
|[0.23,59.4,61.0]| 338|
| [0.3,64.0,55.0]| 339|
|[0.23,62.8,56.0]| 340|
|[0.22,60.4,61.0]| 342|
|[0.31,62.2,54.0]| 344|
| [0.2,60.2,62.0]| 345|
|[0.32,60.9,58.0]| 345|
| [0.3,62.0,54.0]| 348|
| [0.3,63.4,54.0]| 351|
| [0.3,63.8,56.0]| 351|
| [0.3,62.7,59.0]| 351|
+----------------+-----+
R Squared = 0.8521786458835734
MAE = 993.2267089991868
RMSE = 1513.0984941194174
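The three metrics above follow directly from their definitions; a sketch computing R², MAE, and RMSE for a small made-up prediction vector (illustrative numbers, not the diamond results):

```python
import math

y_true = [326.0, 326.0, 327.0, 334.0]
y_pred = [327.0, 325.0, 328.0, 333.0]

n = len(y_true)
mean_y = sum(y_true) / n

# Residual and total sums of squares
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_y) ** 2 for t in y_true)

r2 = 1 - ss_res / ss_tot                              # R-squared
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n  # mean absolute error
rmse = math.sqrt(ss_res / n)                          # root mean squared error

print(r2, mae, rmse)  # mae = 1.0, rmse = 1.0 here (every error is +/-1)
```

RMSE penalises large errors more than MAE, which is why the lab's RMSE (1513) exceeds its MAE (993).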
Exercise: 5
ETL using Apache Spark
A. Extract
B. Transform
C. Load
Dataset: student_transformed.csv (part-00000-0f447bfa-2692-481c-ab2b-c707160dec5c-c000.csv, downloaded from the SN Labs workspace)
Program:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

# Create SparkSession
spark = SparkSession.builder.appName("Exercises - ETL using Spark").getOrCreate()

# Extract: load data from student_transformed.csv into a dataframe
df = spark.read.csv("student_transformed.csv", header=True, inferSchema=True)
df.show()

# Transform: convert cm to meters by dividing height_cm by 100
df = df.withColumn("height_meters", expr("height_cm / 100"))
df.show()

# Transform: compute BMI = weight_kg / height_meters^2
df = df.withColumn("bmi", expr("weight_kg / (height_meters * height_meters)"))
df.show()

# Keep just the student and bmi columns, then round bmi to 0 decimal places
df = df.select("student", "bmi")
df.show()
df = df.withColumn("bmi_rounded", expr("round(bmi, 0)"))
df.show()

# Load: write the transformed data back out (output path assumed), then stop the session
df.write.mode("overwrite").csv("student_bmi", header=True)
spark.stop()
Output:
+--------+---------+---------+
| student|height_cm|weight_kg|
+--------+---------+---------+
|student6| 157.48| 38.55532|
|student3| 175.26| 43.09124|
|student2| 149.86| 45.3592|
|student7| 165.1| 36.28736|
|student1| 162.56| 40.82328|
|student5| 152.4| 36.28736|
+--------+---------+---------+
+--------+---------+---------+------------------+
| student|height_cm|weight_kg| height_meters|
+--------+---------+---------+------------------+
|student6| 157.48| 38.55532| 1.5748|
|student3| 175.26| 43.09124| 1.7526|
|student2| 149.86| 45.3592|1.4986000000000002|
|student7| 165.1| 36.28736| 1.651|
|student1| 162.56| 40.82328| 1.6256|
|student5| 152.4| 36.28736| 1.524|
+--------+---------+---------+------------------+
+--------+---------+---------+------------------+------------------+
| student|height_cm|weight_kg| height_meters| bmi|
+--------+---------+---------+------------------+------------------+
|student6| 157.48| 38.55532| 1.5748|15.546531093062187|
|student3| 175.26| 43.09124| 1.7526|14.028892161964118|
|student2| 149.86| 45.3592|1.4986000000000002|20.197328530250278|
|student7| 165.1| 36.28736| 1.651|13.312549228648752|
|student1| 162.56| 40.82328| 1.6256|15.448293591899683|
|student5| 152.4| 36.28736| 1.524|15.623755691955827|
+--------+---------+---------+------------------+------------------+
+--------+------------------+
| student| bmi|
+--------+------------------+
|student6|15.546531093062187|
|student3|14.028892161964118|
|student2|20.197328530250278|
|student7|13.312549228648752|
|student1|15.448293591899683|
|student5|15.623755691955827|
+--------+------------------+
+--------+------------------+-----------+
| student| bmi|bmi_rounded|
+--------+------------------+-----------+
|student6|15.546531093062187| 16.0|
|student3|14.028892161964118| 14.0|
|student2|20.197328530250278| 20.0|
|student7|13.312549228648752| 13.0|
|student1|15.448293591899683| 15.0|
|student5|15.623755691955827| 16.0|
+--------+------------------+-----------+
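As a sanity check against the tables above, `bmi` is weight_kg divided by height in meters squared, and `bmi_rounded` is that value rounded to zero decimal places:

```python
# student6 row from the tables above
height_m = 157.48 / 100                 # 1.5748
bmi = 38.55532 / (height_m * height_m)

print(bmi)            # matches the bmi column: 15.546531...
print(round(bmi, 0))  # 16.0, matching bmi_rounded
```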