Edx Course Lab Programs
EXERCISE: 1
We can use Python to count how many times a word is used within a sentence or document.
This can be useful for text analysis projects and other types of reporting.
PROGRAM:
sentence = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
# Let's handle the capital letters by making everything lowercase
# (strings are immutable, so each call returns a new string; sentence itself is unchanged)
sentence.lower()
# Remove the question mark
sentence.replace("?", "")
# We can use "method chaining" to do both in one go!
sentence.lower().replace("?", "")
# You can use the split() method to extract the words from the sentence
sentence.split()
# Let's put it all together and assign to a variable ready to use in the next step
words = sentence.lower().replace("?", "").split()
words
# Create a dictionary ready to hold the words (as keys) and their counts (as values)
word_counts = {}
# Loop through the words
for word in words:
    # Is the word in the dictionary yet?
    if word in word_counts:
        # If it is, add 1 to the current count
        word_counts[word] += 1
    else:
        # If it isn't, put it in with a count of 1
        word_counts[word] = 1
print(word_counts)
OUTPUT:
{'how': 1, 'much': 1, 'wood': 2, 'would': 1, 'a': 2, 'woodchuck': 2, 'chuck': 2, 'if': 1, 'could': 1}
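The same counts can come from the standard library's `collections.Counter`, which folds the loop above into a single call; a minimal alternative sketch:

```python
from collections import Counter

sentence = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
words = sentence.lower().replace("?", "").split()

# Counter builds the word -> count mapping in one step
word_counts = Counter(words)
print(word_counts["wood"])       # 2
print(word_counts["woodchuck"])  # 2
```

`Counter` is a dict subclass, so it supports the same `word_counts[word]` lookups as the hand-rolled dictionary.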
Exercise: 2
1. Adapt your `count_words()` function from the previous exercise so that it strips out
punctuation using the code tip provided above.
2. Try out your adapted function with a long-form piece of text of your choice, using triple
quotes (`"""`) to store this information in a single multi-line string variable.
3. Find the top 20 words in your chosen text using the `sort_dictionary()` function provided
above. (This returns a list, so you can use list slicing to reduce it down to just the top 20.)
Program:
def sort_dictionary(input_dictionary, reverse=True):
    return sorted(input_dictionary.items(), key=lambda x: x[1], reverse=reverse)

def count_words(sentence):
    words = (
        sentence.lower()
        .replace("?", "")  # remove ?
        .replace(".", "")  # remove .
        .replace(",", "")  # remove ,
        .split()
    )
    word_counts = {}
    for word in words:
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts
sentence = """
HELLO WORLD WELCOME TO MACHINE LEARNING LAB"""
sort_dictionary(count_words(sentence))
OUTPUT:
[('hello', 1),
('world', 1),
('welcome', 1),
('to', 1),
('machine', 1),
('learning', 1),
('lab', 1)]
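The three `replace()` calls above only cover `?`, `.`, and `,`; `str.translate` with `string.punctuation` strips every ASCII punctuation mark in one pass, and list slicing then gives the top 20 as step 3 asks. A self-contained sketch (the `text` and `top_n` names here are mine, not from the lab):

```python
import string

def count_words(sentence):
    # Strip all punctuation in one pass, then lowercase and split
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    word_counts = {}
    for word in cleaned.split():
        word_counts[word] = word_counts.get(word, 0) + 1
    return word_counts

def sort_dictionary(input_dictionary, reverse=True):
    return sorted(input_dictionary.items(), key=lambda x: x[1], reverse=reverse)

text = """How much wood would a woodchuck chuck,
if a woodchuck could chuck wood?"""

# sort_dictionary returns a list, so slicing keeps just the top entries
top_n = sort_dictionary(count_words(text))[:20]
print(top_n[0])  # ('wood', 2)
```

Because `sorted` is stable, ties keep their first-seen order, so 'wood' (the first word to reach a count of 2) heads the list.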
Exercise: 3
So far we've been working with only 2019 data. Let's now read in the full Dream Destination dataset containing 66k orders across a ten-year period (2010-2019).
> 1. Using `pd.read_excel()` read in the `"Order Database"` sheet from the file `"Hotel Industry - Order and Finance Database.xlsx"`.
> 2. As before, assign it to a variable called `orders`.
PROGRAM:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
# Read in the Dream Destination hotel data
orders = pd.read_excel("Hotel Industry - Order and Finance Database.xlsx", sheet_name="Order Database")
orders.head(3)
orders.Year.hist();
# NOTE: `finance` is assumed to have been read from the finance sheet of the
# same workbook in an earlier step (its sheet name isn't shown in this lab)
df = pd.merge(left=orders, right=finance, on='Booking ID', how='left')
len(df)
df.columns
df.head()
df[['Total Booking Amount', 'Discount Amount', 'Net Sales']].head(3)
sns.barplot(x='Total Booking Amount', y='Origin Country', estimator=sum, data=df);
output:
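The `pd.merge(..., how='left')` step keeps every row of `orders` and attaches the matching finance columns by `Booking ID`, so `len(df)` should equal `len(orders)`; a minimal sketch with made-up toy frames:

```python
import pandas as pd

orders = pd.DataFrame({"Booking ID": [1, 2, 3], "Year": [2018, 2019, 2019]})
finance = pd.DataFrame({"Booking ID": [1, 2], "Net Sales": [100.0, 250.0]})

# Left join: every order survives; unmatched rows get NaN in the finance columns
df = pd.merge(left=orders, right=finance, on="Booking ID", how="left")
print(len(df))                        # 3 -- same row count as orders
print(df["Net Sales"].isna().sum())   # 1 -- booking 3 had no finance record
```

If `len(df)` had grown instead, that would signal duplicate `Booking ID` values on the right-hand side.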
Exercise: 4
> Create a Seaborn barplot showing Total Booking Amount (x-axis) by Origin Country (y-axis).
Program:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
# `df` is the merged orders/finance dataframe from the previous exercise
sns.barplot(x='Total Booking Amount', y='Origin Country', estimator=sum, data=df)
# A second, larger chart: `top_states` is assumed to be a pre-built dataframe
# of per-state revenue (its construction isn't shown in this lab)
plt.figure(figsize=(15, 10))
sns.barplot(
    x='Revenue',
    y='State',
    hue='Origin Country',
    dodge=False,
    orient='h',
    data=top_states
);
Output:
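`estimator=sum` tells Seaborn to total all rows in each category before drawing the bar; the same numbers fall out of a plain `groupby`, which is a quick way to sanity-check the chart. A sketch with toy data (not the hotel dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "Origin Country": ["UK", "UK", "France"],
    "Total Booking Amount": [100.0, 50.0, 80.0],
})

# Bar lengths under estimator=sum equal the per-country totals
totals = df.groupby("Origin Country")["Total Booking Amount"].sum()
print(totals["UK"])      # 150.0
print(totals["France"])  # 80.0
```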
APACHE SPARK FOR DATA ENGINEERING AND MACHINE LEARNING
Exercise: 1
Building and Training a Prediction Model using Linear Regression
a. Load a dataset (the diamonds dataset)
b. Identify the target column and the feature columns
c. Build and train a new linear regression model
d. Evaluate the model
e. Predict the price of a diamond
DATASET:
Diamonds dataset. Available at https://www.openml.org/search?type=data&sort=runs&id=42225&status=active
PROGRAM:
import pandas as pd
from sklearn.linear_model import LinearRegression
data = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv"
df = pd.read_csv(data)
target = df["price"]
features = df[["carat","depth"]]
lr = LinearRegression()
lr.fit(features,target)
print("Model Score:", lr.score(features,target))
print("Predicted price of the diamond:",lr.predict([[0.3, 60]]))
OUTPUT:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
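Before trusting the diamond model, it's handy to confirm the same fit/score/predict pattern on data with a known answer; a minimal sketch on synthetic, exactly linear data (not part of the lab):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# y = 3*x1 + 2*x2 exactly, so the model should recover it perfectly
X = np.array([[1, 1], [2, 0], [0, 3], [4, 2]], dtype=float)
y = 3 * X[:, 0] + 2 * X[:, 1]

lr = LinearRegression()
lr.fit(X, y)
print(round(lr.score(X, y), 6))           # 1.0 (perfect R-squared)
print(lr.predict([[1.0, 2.0]]).round(6))  # [7.]
```

The diamond model's `score` is an R² on the training data, so values well below 1.0 mean carat and depth alone leave much of the price variance unexplained.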
EXERCISE: 2
Building and Training a Classification Model using Logistic Regression
Dataset: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
PROGRAM:
import pandas as pd
from sklearn.linear_model import LogisticRegression
data = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/cancer.csv"
df = pd.read_csv(data)
target = df["diagnosis"]
features = df[['radius_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean',
'compactness_mean', 'concavity_mean', 'symmetry_mean']]
classifier = LogisticRegression()
classifier.fit(features,target)
classifier.score(features,target)
classifier.predict([[13.45,86.6,555.1,0.1022,0.08165,0.03974,0.1638]])
OUTPUT:
LogisticRegression()
0.8963093145869947
array(['Benign'], dtype=object)
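`predict` returns class labels, which is why the output above is `array(['Benign'], dtype=object)` rather than a number; a minimal sketch on made-up, well-separated data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two well-separated 1-D clusters labelled with strings, mimicking the
# Benign/Malignant diagnosis column
X = np.array([[1.0], [2.0], [8.0], [9.0]])
y = np.array(["Benign", "Benign", "Malignant", "Malignant"])

clf = LogisticRegression()
clf.fit(X, y)
print(clf.score(X, y))       # 1.0 -- accuracy, not R-squared
print(clf.predict([[1.5]]))  # ['Benign']
```

Note that for classifiers `score` reports accuracy (fraction of correct labels), so the lab's 0.896 means roughly 90% of training samples were classified correctly.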
Exercise: 3
Connecting to the Spark Cluster using SN Labs
Create a Spark Session
A. Load the dataset into a dataframe
B. Explore the data
C. Print the top 5 rows of the dataframe
D. Stop the spark session
Dataset:
Download the dataset from: "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv"
Program:
# Import and initialise findspark (lets plain Python locate the Spark installation)
import findspark
findspark.init()
#import SparkSession
from pyspark.sql import SparkSession
#Create SparkSession
spark = SparkSession.builder.appName("Getting Started with Spark").getOrCreate()
#Download the data file
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/diamonds.csv
#Load diamond dataset into a dataframe named diamond_data
diamond_data = spark.read.csv("diamonds.csv", header=True, inferSchema=True)
#Print the schema of the dataframe
diamond_data.printSchema()
#Print the top 5 rows of the dataframe
diamond_data.head(5)
#Stop the spark session
spark.stop()
OUTPUT:
root
|-- s: integer (nullable = true)
|-- carat: double (nullable = true)
|-- cut: string (nullable = true)
|-- color: string (nullable = true)
|-- clarity: string (nullable = true)
|-- depth: double (nullable = true)
|-- table: double (nullable = true)
|-- price: integer (nullable = true)
|-- x: double (nullable = true)
|-- y: double (nullable = true)
|-- z: double (nullable = true)
Output:
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| s|carat| cut|color|clarity|depth|table|price| x| y| z|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
| 1| 0.23| Ideal| E| SI2| 61.5| 55.0| 326|3.95|3.98|2.43|
| 2| 0.21|Premium| E| SI1| 59.8| 61.0| 326|3.89|3.84|2.31|
| 3| 0.23| Good| E| VS1| 56.9| 65.0| 327|4.05|4.07|2.31|
| 4| 0.29|Premium| I| VS2| 62.4| 58.0| 334| 4.2|4.23|2.63|
| 5| 0.31| Good| J| SI2| 63.3| 58.0| 335|4.34|4.35|2.75|
+---+-----+-------+-----+-------+-----+-----+-----+----+----+----+
Exercise: 4
Building and evaluating a linear regression model with SparkML (features assembled from carat, depth, and table)
Output:
+----------------+-----+
| features|price|
+----------------+-----+
|[0.23,61.5,55.0]| 326|
|[0.21,59.8,61.0]| 326|
|[0.23,56.9,65.0]| 327|
|[0.29,62.4,58.0]| 334|
|[0.31,63.3,58.0]| 335|
|[0.24,62.8,57.0]| 336|
|[0.24,62.3,57.0]| 336|
|[0.26,61.9,55.0]| 337|
|[0.22,65.1,61.0]| 337|
|[0.23,59.4,61.0]| 338|
| [0.3,64.0,55.0]| 339|
|[0.23,62.8,56.0]| 340|
|[0.22,60.4,61.0]| 342|
|[0.31,62.2,54.0]| 344|
| [0.2,60.2,62.0]| 345|
|[0.32,60.9,58.0]| 345|
| [0.3,62.0,54.0]| 348|
| [0.3,63.4,54.0]| 351|
| [0.3,63.8,56.0]| 351|
| [0.3,62.7,59.0]| 351|
+----------------+-----+
R Squared = 0.8521786458835734
MAE = 993.2267089991868
RMSE = 1513.0984941194174
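The three metrics above follow directly from their definitions; a sketch computing R², MAE, and RMSE for a small made-up prediction vector (illustrative numbers, not the diamond results):

```python
import math

y_true = [326.0, 326.0, 327.0, 334.0]
y_pred = [327.0, 325.0, 328.0, 333.0]

n = len(y_true)
mean_y = sum(y_true) / n

# Residual and total sums of squares
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_y) ** 2 for t in y_true)

r2 = 1 - ss_res / ss_tot                              # R-squared
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n  # mean absolute error
rmse = math.sqrt(ss_res / n)                          # root mean squared error

print(r2, mae, rmse)  # mae = 1.0, rmse = 1.0 here (every error is +/-1)
```

RMSE penalises large errors more than MAE, which is why the lab's RMSE (1513) exceeds its MAE (993).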
Exercise: 5
ETL using Apache Spark
A. Extract
B. Transform
C. Load
Dataset: student_transformed.csv (part-00000-0f447bfa-2692-481c-ab2b-c707160dec5c-c000.csv, downloaded from the SN Labs workspace)
Program:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

# Create SparkSession
spark = SparkSession.builder.appName("Exercises - ETL using Spark").getOrCreate()

# Extract: load data from student_transformed.csv into a dataframe
df = spark.read.csv("student_transformed.csv", header=True, inferSchema=True)
df.show()

# Transform: convert cm to meters by dividing height_cm by 100
df = df.withColumn("height_meters", expr("height_cm / 100"))
df.show()

# Transform: compute BMI = weight_kg / height_meters^2
df = df.withColumn("bmi", expr("weight_kg / (height_meters * height_meters)"))
df.show()

# Keep just the student and bmi columns, then round bmi to 0 decimal places
df = df.select("student", "bmi")
df.show()
df = df.withColumn("bmi_rounded", expr("round(bmi, 0)"))
df.show()

# Load: write the transformed data back out (output path assumed), then stop the session
df.write.mode("overwrite").csv("student_bmi", header=True)
spark.stop()
Output:
+--------+---------+---------+
| student|height_cm|weight_kg|
+--------+---------+---------+
|student6| 157.48| 38.55532|
|student3| 175.26| 43.09124|
|student2| 149.86| 45.3592|
|student7| 165.1| 36.28736|
|student1| 162.56| 40.82328|
|student5| 152.4| 36.28736|
+--------+---------+---------+
+--------+---------+---------+------------------+
| student|height_cm|weight_kg| height_meters|
+--------+---------+---------+------------------+
|student6| 157.48| 38.55532| 1.5748|
|student3| 175.26| 43.09124| 1.7526|
|student2| 149.86| 45.3592|1.4986000000000002|
|student7| 165.1| 36.28736| 1.651|
|student1| 162.56| 40.82328| 1.6256|
|student5| 152.4| 36.28736| 1.524|
+--------+---------+---------+------------------+
+--------+---------+---------+------------------+------------------+
| student|height_cm|weight_kg| height_meters| bmi|
+--------+---------+---------+------------------+------------------+
|student6| 157.48| 38.55532| 1.5748|15.546531093062187|
|student3| 175.26| 43.09124| 1.7526|14.028892161964118|
|student2| 149.86| 45.3592|1.4986000000000002|20.197328530250278|
|student7| 165.1| 36.28736| 1.651|13.312549228648752|
|student1| 162.56| 40.82328| 1.6256|15.448293591899683|
|student5| 152.4| 36.28736| 1.524|15.623755691955827|
+--------+---------+---------+------------------+------------------+
+--------+------------------+
| student| bmi|
+--------+------------------+
|student6|15.546531093062187|
|student3|14.028892161964118|
|student2|20.197328530250278|
|student7|13.312549228648752|
|student1|15.448293591899683|
|student5|15.623755691955827|
+--------+------------------+
+--------+------------------+-----------+
| student| bmi|bmi_rounded|
+--------+------------------+-----------+
|student6|15.546531093062187| 16.0|
|student3|14.028892161964118| 14.0|
|student2|20.197328530250278| 20.0|
|student7|13.312549228648752| 13.0|
|student1|15.448293591899683| 15.0|
|student5|15.623755691955827| 16.0|
+--------+------------------+-----------+
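As a sanity check against the tables above, `bmi` is weight_kg divided by height in meters squared, and `bmi_rounded` is that value rounded to zero decimal places:

```python
# student6 row from the tables above
height_m = 157.48 / 100                 # 1.5748
bmi = 38.55532 / (height_m * height_m)

print(bmi)            # matches the bmi column: 15.546531...
print(round(bmi, 0))  # 16.0, matching bmi_rounded
```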