
MACHINE LEARNING WITH PYTHON FOR FINANCE PROFESSIONALS

EXERCISE: 1
We can use Python to count how many times a word is used within a sentence or document.
This can be useful for text analysis projects and other types of reporting.

PROGRAM:
sentence = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
# Let's handle the capital letters by making everything lowercase
sentence.lower()
# Remove the question mark
sentence.replace("?", "")
# We can use "method chaining" to do both in one go!
sentence.lower().replace("?", "")
# You can use the split function to extract the words from the sentence
sentence.split()
# Let's put it all together and assign to a variable ready to use in the next step
words = sentence.lower().replace("?", "").split()
words
# Create a dictionary ready to hold the words (as keys) and their counts (as values)
word_counts = {}
# Loop through the words
for word in words:
    # Is the word in the dictionary yet?
    if word in word_counts:
        # If it is, add +1 to the current count
        word_counts[word] += 1
    else:
        # If it isn't, put it in with a count of 1
        word_counts[word] = 1
print(word_counts)
output:
{'how': 1, 'much': 1, 'wood': 2, 'would': 1, 'a': 2, 'woodchuck': 2, 'chuck': 2, 'if': 1, 'could': 1}
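The same count can also be produced with `collections.Counter` from the standard library. The sketch below is an alternative to the dictionary loop, not part of the original exercise; it reuses the same sentence and the same clean-up steps.

```python
from collections import Counter

sentence = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"

# Same clean-up as in the program: lowercase, drop the question mark, split on whitespace
words = sentence.lower().replace("?", "").split()

# Counter builds the word -> count mapping in one step
word_counts = Counter(words)

print(dict(word_counts))
# most_common(n) returns the n highest counts as (word, count) pairs
print(word_counts.most_common(3))
```

A `Counter` behaves like a dictionary, so `word_counts["wood"]` still returns the count for a single word.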
Exercise: 2
1. Adapt your `count_words()` function from the previous exercise so that it strips out
punctuation using the code tip provided above.
2. Try out your adapted function with a long-form piece of text of your choice, using triple
quotes (`"""`) to store this information in a single multi-line string variable.
3. Find the top 20 words in your chosen text using the `sort_dictionary()` function provided
above. (This returns a list, so you can use list slicing to reduce it down to just the top 20.)

Program:
def sort_dictionary(input_dictionary, reverse=True):
    return sorted(input_dictionary.items(), key=lambda x: x[1], reverse=reverse)

def count_words(sentence):
    words = (
        sentence.lower()
        .replace("?", "")  # remove ?
        .replace(".", "")  # remove .
        .replace(",", "")  # remove ,
        .split()
    )
    word_counts = {}
    for word in words:
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts
sentence="""
HELLO WORLD WELCOME TO MACHINE LEARNING LAB"""
sort_dictionary(count_words(sentence))
OUTPUT:
[('hello', 1),
('world', 1),
('welcome', 1),
('to', 1),
('machine', 1),
('learning', 1),
('lab', 1)]
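The program above removes only three punctuation characters and does not slice the result. The sketch below shows one possible way to strip all ASCII punctuation and keep only the top entries. The use of `str.translate` with `string.punctuation` is an assumption (the code tip referred to in the exercise is not reproduced here), and the sample text is hypothetical.

```python
import string

def count_words(text):
    """Count words after lowercasing and removing all ASCII punctuation."""
    # str.maketrans with a third argument deletes every character listed in it
    remove_punctuation = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(remove_punctuation).split()
    word_counts = {}
    for word in words:
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts

def sort_dictionary(input_dictionary, reverse=True):
    return sorted(input_dictionary.items(), key=lambda x: x[1], reverse=reverse)

# Hypothetical sample text, used only for illustration
text = """
The cat sat on the mat. The dog, however, sat on the cat!
"""

# sort_dictionary returns a list, so slicing keeps the first N entries
top_words = sort_dictionary(count_words(text))[:3]
print(top_words)
```

For the top 20 words, the slice becomes `[:20]`. Note that `string.punctuation` also contains the apostrophe, so a word such as "don't" would be counted as "dont".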

Exercise: 3

So far we've been working with only 2019 data. Let's now read in the full Dream Destination
dataset, which contains 66k orders across the ten-year period from 2010 to 2019.
> 1. Using `pd.read_excel()`, read in the `"Order Database"` sheet from the file `"Hotel Industry - Order and Finance Database.xlsx"`.
> 2. As before, assign it to a variable called `orders`.

PROGRAM:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
# Read in the Dream Destination hotel data
orders = pd.read_excel("abc.xlsx",sheet_name="Order Database")
orders.head(3)
orders.Year.hist();
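# `finance` is not defined in this listing. It is assumed here to be the finance sheet of the
# same workbook; the sheet name "Finance Database" is an assumption inferred from the file name.
finance = pd.read_excel("abc.xlsx", sheet_name="Finance Database")
df = pd.merge(left=orders, right=finance, on='Booking ID', how='left')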
len(df)
df.columns
df.head()
df[['Total Booking Amount', 'Discount Amount', 'Net Sales']].head(3)
sns.barplot(x='Total Booking Amount', y='Origin Country', estimator=sum, data=df);
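The left merge used in this program can be tried on small in-memory tables first. The sketch below uses two hypothetical miniature frames in place of the real sheets; only the column names `Booking ID`, `Origin Country` and `Total Booking Amount` come from the program, and the `validate` argument is an optional safeguard that the original call does not use.

```python
import pandas as pd

# Hypothetical miniature tables standing in for the two sheets
orders = pd.DataFrame({
    "Booking ID": [101, 102, 103],
    "Origin Country": ["India", "France", "India"],
})
finance = pd.DataFrame({
    "Booking ID": [101, 102],
    "Total Booking Amount": [250.0, 400.0],
})

# how="left" keeps every row of `orders`; bookings without a finance row get NaN
# validate="one_to_one" raises an error if a Booking ID is duplicated on either side
df = pd.merge(left=orders, right=finance, on="Booking ID", how="left", validate="one_to_one")

print(len(df))
print(df)
```

With a left merge, `len(df)` equals `len(orders)` as long as `Booking ID` is unique in `finance`, which is a quick check that the join did not duplicate rows.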

output:
Exercise: 4
> Create a Seaborn barplot showing Total Booking Amount (x-axis) by Origin Country (y-axis).

Program:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
sns.barplot(x='Total Booking Amount', y='Origin Country', estimator=sum, data=df)
plt.figure(figsize=(15,10))
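# `top_states` is not defined in this listing; it is assumed to be a DataFrame
# with 'Revenue', 'State' and 'Origin Country' columns prepared in an earlier step.
sns.barplot(
    x='Revenue',
    y='State',
    hue='Origin Country',
    dodge=False,
    orient='h',
    data=top_states
);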

Output:
APACHE SPARK FOR DATA ENGINEERING AND MACHINE LEARNING

Exercise: 1
Building and Training a Prediction Model using Linear Regression
a. Load a dataset(diamond dataset)
b. Identify the target column and data column
c. Build and Train a new linear regression model
d. Evaluate the model
e. Predict the Price of the diamond

AIM: To build and train a prediction model using linear regression.


a. Loading a dataset(diamond dataset)
b. Identifying the target column and data column
c. Building and Training a new linear regression model
d. Evaluating the model
e. Predicting the Price of the diamond

DATASET:
Diamonds dataset. Available at https://www.openml.org/search?type=data&sort=runs&id=42225&status=active
PROGRAM:
import pandas as pd
from sklearn.linear_model import LinearRegression
data = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv"
df = pd.read_csv(data)
target = df["price"]
features = df[["carat","depth"]]
lr = LinearRegression()
lr.fit(features,target)
print("Model Score:", lr.score(features,target))
print("Predicted price of the diamond:",lr.predict([[0.3, 60]]))
OUTPUT:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,normalize=False)

Model Score: 0.8506754571636563

Predicted price of the diamond: [244.95605225]
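The score reported above is measured on the same rows the model was trained on. A minimal sketch of scoring on held-out rows instead is given below; it uses synthetic data rather than the diamonds file, and the split ratio and random seeds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (not the diamonds dataset): y depends linearly on two features plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 0.1, size=500)

# Hold out 30% of the rows so the score is measured on data the model has not seen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

lr = LinearRegression()
lr.fit(X_train, y_train)

train_score = lr.score(X_train, y_train)  # R^2 on the training rows
test_score = lr.score(X_test, y_test)     # R^2 on the held-out rows
print("Train R^2:", train_score)
print("Test R^2:", test_score)
```

For `LinearRegression`, `score` returns the coefficient of determination (R²), so the two printed values can be compared directly.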


Exercise: 2
Build a Classifier Model using Logistic Regression
a. Load a Dataset
b. Identify the target column and data column
c. Build and Train a new classifier
d. Evaluate the model
e. Find out if a tumor is cancerous

AIM: Building a Classifier Model using Logistic Regression


a. Loading a Dataset
b. Identifying the target column and data column
c. Building and Training a new classifier
d. Evaluating the model
e. Finding out if a tumor is cancerous

Dataset: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

PROGRAM:
import pandas as pd
from sklearn.linear_model import LogisticRegression
data = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/cancer.csv"
df = pd.read_csv(data)
target = df["diagnosis"]
features = df[['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
'compactness_mean', 'concavity_mean', 'symmetry_mean']]
classifier = LogisticRegression()
classifier.fit(features,target)
classifier.score(features,target)
classifier.predict([[13.45,86.6,555.1,0.1022,0.08165,0.03974,0.1638]])
OUTPUT:
LogisticRegression()
0.8963093145869947
array(['Benign'], dtype=object)
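A logistic regression classifier turns a weighted sum of the features into a probability with the sigmoid function and then applies a threshold. The sketch below illustrates that step in plain Python. The weights, the bias, the 0.5 threshold, the label name "Malignant" and the choice of which class counts as positive are all hypothetical; they are not taken from the fitted model above.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one sample under a logistic model."""
    # Linear score: weighted sum of the features plus the bias term
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    label = "Malignant" if probability >= threshold else "Benign"
    return probability, label

# Hypothetical weights and bias, chosen only for illustration (not fitted to cancer.csv)
weights = [0.8, -0.5]
bias = -1.0

probability, label = predict_label([1.0, 2.0], weights, bias)
print(probability, label)
```

In scikit-learn, the corresponding probabilities for a fitted classifier are available through `classifier.predict_proba(...)`.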
Exercise: 3
Connecting to Spark Cluster using SN Labs
A. Create a Spark Session
B. Load the dataset into a dataframe
C. Explore the data
D. Print the top 5 rows of the dataframe
E. Stop the spark session

Aim: Connecting to Spark Cluster using SN Labs (External resource)


A. Creating a Spark Session
B. Loading the dataset into a dataframe
C. Exploring the data
D. Printing the top 5 rows of the dataframe
E. Stopping the spark session

Dataset:
Download the data set from: "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv"
Program:
#import findspark
import findspark
findspark.init()
#import SparkSession
from pyspark.sql import SparkSession
#Create SparkSession
spark = SparkSession.builder.appName("Getting Started with Spark").getOrCreate()
#Download the data file
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv
#Load diamond dataset into a dataframe named diamond_data
diamond_data = spark.read.csv("diamonds.csv", header=True, inferSchema=True)
#Print the schema of the dataframe
diamond_data.printSchema()
#Print the top 5 rows of the dataframe
diamond_data.head(5)
#Stop the spark session
spark.stop()
OUTPUT:
root
|-- s: integer (nullable = true)
|-- carat: double (nullable = true)
|-- cut: string (nullable = true)
|-- color: string (nullable = true)
|-- clarity: string (nullable = true)
|-- depth: double (nullable = true)
|-- table: double (nullable = true)
|-- price: integer (nullable = true)
|-- x: double (nullable = true)
|-- y: double (nullable = true)
|-- z: double (nullable = true)

[Row(s=1, carat=0.23, cut='Ideal', color='E', clarity='SI2', depth=61.5, table=55.0, price=326,
x=3.95, y=3.98, z=2.43),
Row(s=2, carat=0.21, cut='Premium', color='E', clarity='SI1', depth=59.8, table=61.0,
price=326, x=3.89, y=3.84, z=2.31),
Row(s=3, carat=0.23, cut='Good', color='E', clarity='VS1', depth=56.9, table=65.0,
price=327, x=4.05, y=4.07, z=2.31),
Row(s=4, carat=0.29, cut='Premium', color='I', clarity='VS2', depth=62.4, table=58.0,
price=334, x=4.2, y=4.23, z=2.63),
Row(s=5, carat=0.31, cut='Good', color='J', clarity='SI2', depth=63.3, table=58.0, price=335,
x=4.34, y=4.35, z=2.75)]
Exercise: 4
Regression using SparkML
A. Create a spark session
B. Load the data in a csv file into a dataframe
C. Identify the label column and the input columns
D. Split the data
E. Build and Train a Linear Regression Model
F. Evaluate the model
AIM: Regression using SparkML
A. Create a spark session
B. Load the data in a csv file into a dataframe
C. Identify the label column and the input columns
D. Split the data
E. Build and Train a Linear Regression Model
F. Evaluate the model

Dataset: Download the data set: https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv
Program:
#install libraries
!pip install pyspark==3.1.2 -q
!pip install findspark -q
#Ignore Warnings
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')
#Importing Required Libraries
import findspark
findspark.init()
from pyspark.sql import SparkSession
#import functions/Classes for sparkml
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
# import functions/Classes for metrics
from pyspark.ml.evaluation import RegressionEvaluator
#Create a spark session with appname "Diamond Price Prediction"
spark = SparkSession.builder.appName("Diamond Price Prediction").getOrCreate()
#Download the data set
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv
#Load the dataset into a spark dataframe
diamond_data = spark.read.csv("diamonds.csv", header=True, inferSchema=True)
#Display sample data from dataset
diamond_data.show(5)
#Identify the label column and the input columns
#use the price column as label column
#use the columns carat,depth and table as features
#Assemble the columns carat, depth and table into a single column named "features"
assembler = VectorAssembler(inputCols=["carat", "depth", "table"], outputCol="features")
diamond_transformed_data = assembler.transform(diamond_data)
#Print the vectorized features and label columns
diamond_transformed_data.select("features","price").show()
#Split the dataset into training and testing sets in the ratio of 70:30.
(training_data, testing_data) = diamond_transformed_data.randomSplit([0.7, 0.3])
#Build a linear regression and train it
lr = LinearRegression(featuresCol="features", labelCol="price")
model = lr.fit(training_data)
#Predict the values using the test data
predictions = model.transform(testing_data)
#Print the metrics :
#R squared
#mean absolute error
#root mean squared error
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction",
metricName="r2")
r2 = evaluator.evaluate(predictions)
print("R Squared =", r2)
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction",
metricName="mae")
mae = evaluator.evaluate(predictions)
print("MAE =", mae)
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction",
metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("RMSE =", rmse)

Output:

+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| s|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| 1| 0.23| Ideal| E| SI2| 61.5| 55.0| 326|3.95|3.98|2.43|
| 2| 0.21|Premium| E| SI1| 59.8| 61.0| 326|3.89|3.84|2.31|
| 3| 0.23| Good| E| VS1| 56.9| 65.0| 327|4.05|4.07|2.31|
| 4| 0.29|Premium| I| VS2| 62.4| 58.0| 334| 4.2|4.23|2.63|
| 5| 0.31| Good| J| SI2| 63.3| 58.0| 335|4.34|4.35|2.75|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
+----------------+-----+
| features|price|
+----------------+-----+
|[0.23,61.5,55.0]| 326|
|[0.21,59.8,61.0]| 326|
|[0.23,56.9,65.0]| 327|
|[0.29,62.4,58.0]| 334|
|[0.31,63.3,58.0]| 335|
|[0.24,62.8,57.0]| 336|
|[0.24,62.3,57.0]| 336|
|[0.26,61.9,55.0]| 337|
|[0.22,65.1,61.0]| 337|
|[0.23,59.4,61.0]| 338|
| [0.3,64.0,55.0]| 339|
|[0.23,62.8,56.0]| 340|
|[0.22,60.4,61.0]| 342|
|[0.31,62.2,54.0]| 344|
| [0.2,60.2,62.0]| 345|
|[0.32,60.9,58.0]| 345|
| [0.3,62.0,54.0]| 348|
| [0.3,63.4,54.0]| 351|
| [0.3,63.8,56.0]| 351|
| [0.3,62.7,59.0]| 351|
+----------------+-----+

R Squared = 0.8521786458835734
MAE = 993.2267089991868
RMSE = 1513.0984941194174
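The three metrics reported above can also be written out by hand, which makes their definitions explicit. The sketch below uses the standard textbook formulas on a tiny made-up set of values; it is an illustration in plain Python, not the Spark evaluator itself, and the function name `regression_metrics` is hypothetical.

```python
import math

def regression_metrics(actual, predicted):
    """Return (r2, mae, rmse) for two equal-length sequences of numbers."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    # Mean absolute error: average size of the errors
    mae = sum(abs(e) for e in errors) / n
    # Root mean squared error: square root of the average squared error
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # R squared: 1 minus (residual sum of squares / total sum of squares)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    return r2, mae, rmse

# Made-up values, used only for illustration
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
r2, mae, rmse = regression_metrics(actual, predicted)
print("R Squared =", r2)  # 0.375
print("MAE =", mae)       # 1.0
print("RMSE =", rmse)
```

MAE and RMSE are in the units of the label (here, price), while R squared is unitless, which is why the values in the output above differ so much in scale.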
Exercise: 5
ETL using Apache Spark
A. Extract
B. Transform
C. Load

Aim: ETL using Apache Spark


A. Extract
B. Transform
C. Load

Dataset: https://jupyterlab-0-labs-prod-jupyterlab-us-east-0.labs.cognitiveclass.ai/user/u21781a0533/files/labs/authoride/IBMSkillsNetwork%2BBD0231EN/labs/student_transformed.csv/part-00000-0f447bfa-2692-481c-ab2b-c707160dec5c-c000.csv?_xsrf=MnwxOjB8MTA6MTcxNTE2Nzg2MXw1Ol94c3JmfDEzMjpOVFkxTURWaU1qRTJPR001TkdNME9UaGhaakk0T0RJM1kyTTVPR0psT0RBNlpXSm1OR0kwT1RJeVlqWTJNbU00TlRBeVpEY3dZbVJqWWpOaU1UUm1ZVGxqWlRjM09EazBPRFl6TVRjMk1EY3dObU0wTWpFeFlqYzNZek13TkdVMU5nPT18NjM4YmNlMzc3NWM4NjQ1ZjJjY2RhYTU0NDExY2ZiNGMwMTdlZjdmMjJmOTRlZWQ3OTAwY2Y2ZmI2MDliMWY0Mw

Program:

#Installing Required Libraries


!pip install pyspark==3.1.2 -q
!pip install findspark -q

#Importing Required Libraries


#Ignore Warnings
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.filterwarnings('ignore')

# FindSpark simplifies the process of using Apache Spark with Python

import findspark
findspark.init()

from pyspark.sql import SparkSession

#Create SparkSession
spark = SparkSession.builder.appName("Exercises - ETL using Spark").getOrCreate()

# Extract
#Load data from student_transformed.csv into a dataframe
# Load student dataset
df = spark.read.csv("student_transformed.csv", header=True, inferSchema=True)
# display dataframe
df.show()

#Transform
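#Convert cm to meters
# expr is needed to evaluate SQL expressions on columns
from pyspark.sql.functions import expr
# Divide the column height_cm by 100 to create a new column height_meters
df = df.withColumn("height_meters", expr("height_cm / 100"))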
# display dataframe
df.show()

#Create a column named bmi


# compute bmi using the below formula
# BMI = weight/(height * height)
# weight must be in kgs
# height must be in meters
df = df.withColumn("bmi", expr("weight_kg/(height_meters*height_meters)"))
# display dataframe
df.show()
#Drop the columns height_cm, weight_kg and height_meters
# only the student and bmi columns remain after this step
df = df.drop("height_cm","weight_kg","height_meters")
# display dataframe
df.show()

from pyspark.sql.functions import col, round


df = df.withColumn("bmi_rounded", round(col("bmi")))
df.show()
#Load
#Save the dataframe into a parquet file
#Write the data to a Parquet file, set the mode to overwrite
df.write.mode("overwrite").parquet("student_transformed.parquet")
#Stop Spark Session
spark.stop()

Output:

+--------+---------+---------+
| student|height_cm|weight_kg|
+--------+---------+---------+
|student6| 157.48| 38.55532|
|student3| 175.26| 43.09124|
|student2| 149.86| 45.3592|
|student7| 165.1| 36.28736|
|student1| 162.56| 40.82328|
|student5| 152.4| 36.28736|
+--------+---------+---------+
+--------+---------+---------+------------------+
| student|height_cm|weight_kg| height_meters|
+--------+---------+---------+------------------+
|student6| 157.48| 38.55532| 1.5748|
|student3| 175.26| 43.09124| 1.7526|
|student2| 149.86| 45.3592|1.4986000000000002|
|student7| 165.1| 36.28736| 1.651|
|student1| 162.56| 40.82328| 1.6256|
|student5| 152.4| 36.28736| 1.524|
+--------+---------+---------+------------------+

+--------+---------+---------+------------------+------------------+
| student|height_cm|weight_kg| height_meters| bmi|
+--------+---------+---------+------------------+------------------+
|student6| 157.48| 38.55532| 1.5748|15.546531093062187|
|student3| 175.26| 43.09124| 1.7526|14.028892161964118|
|student2| 149.86| 45.3592|1.4986000000000002|20.197328530250278|
|student7| 165.1| 36.28736| 1.651|13.312549228648752|
|student1| 162.56| 40.82328| 1.6256|15.448293591899683|
|student5| 152.4| 36.28736| 1.524|15.623755691955827|
+--------+---------+---------+------------------+------------------+

+--------+------------------+
| student| bmi|
+--------+------------------+
|student6|15.546531093062187|
|student3|14.028892161964118|
|student2|20.197328530250278|
|student7|13.312549228648752|
|student1|15.448293591899683|
|student5|15.623755691955827|
+--------+------------------+

+--------+------------------+-----------+
| student| bmi|bmi_rounded|
+--------+------------------+-----------+
|student6|15.546531093062187| 16.0|
|student3|14.028892161964118| 14.0|
|student2|20.197328530250278| 20.0|
|student7|13.312549228648752| 13.0|
|student1|15.448293591899683| 15.0|
|student5|15.623755691955827| 16.0|
+--------+------------------+-----------+
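The transform step can be checked by hand for a single row. The sketch below recomputes the BMI for student6 in plain Python from the values in the first output table; the helper name `bmi_from_cm` is hypothetical and the code is a cross-check, not part of the Spark job.

```python
def bmi_from_cm(height_cm, weight_kg):
    """Compute BMI from height in centimetres and weight in kilograms."""
    height_meters = height_cm / 100
    # BMI = weight / (height * height), with height in meters
    return weight_kg / (height_meters * height_meters)

# Values for student6 taken from the first output table
bmi = bmi_from_cm(157.48, 38.55532)
print(bmi)         # approximately 15.5465
print(round(bmi))  # 16
```

The result agrees with the `bmi` and `bmi_rounded` values shown for student6 in the tables above.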
