Microsoft AI Automate
UNIT 1/9:
Introduction
Unsurprisingly, the role of a data scientist primarily involves exploring and analyzing
data. The results of an analysis might form the basis of a report or a machine learning
model, but it all begins with data, with Python being the most popular programming
language for data scientists.
For example, suppose a university professor collects data from their students, including
the number of lectures attended, the hours spent studying, and the final grade achieved
on the end of term exam. The professor could analyze the data to determine if there is a
relationship between the amount of studying a student undertakes and the final grade
they achieve. The professor might use the data to test a hypothesis that only students
who study for a minimum number of hours can expect to achieve a passing grade.
Prerequisites
Knowledge of basic mathematics
Some experience programming in Python
Learning objectives
In this module, you will:
Data scientists can use various tools and techniques to explore, visualize, and
manipulate data. One of the most common ways in which data scientists work with data
is to use the Python language and some specific packages for data processing.
What is NumPy
NumPy is a Python library that gives functionality comparable to mathematical tools
such as MATLAB and R. While NumPy significantly simplifies the user experience, it also
offers comprehensive mathematical functions.
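For instance, here are a couple of those mathematical functions in action. This is a minimal sketch using made-up numbers, not the module's dataset:

```python
import numpy as np

# A small, made-up array of numeric values
values = np.array([50, 47, 97, 49, 3])

# NumPy provides vectorized mathematical operations on whole arrays
print(values.mean())    # arithmetic mean
print(values.max())     # largest value
print(np.sqrt(values))  # element-wise square root
```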
What is Pandas
Pandas is an extremely popular Python library for data analysis and manipulation.
Pandas is like Excel for Python, providing easy-to-use functionality for data tables.
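As a minimal illustration of that idea (the names and grades here are placeholders; the module builds the real student dataset later):

```python
import pandas as pd

# A DataFrame is a data table: named columns and labeled rows,
# much like a spreadsheet
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro'],
                   'Grade': [50, 50, 47]})

print(df)
print(df['Grade'].mean())  # column-wise aggregation, like a spreadsheet formula
```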
Explore data in a Jupyter notebook
Jupyter notebooks are a popular way of running basic scripts using your web browser.
Typically, these notebooks are a single webpage, broken up into text sections and code
sections that are executed on the server rather than your local machine. This means you
can get started quickly without needing to install Python or other tools.
Testing hypotheses
Data exploration and analysis is typically an iterative process, in which the data scientist
takes a sample of data and performs the following kinds of task to analyze it and test
hypotheses:
In this notebook, we'll explore some of these packages, and apply basic techniques to
analyze data. This is not intended to be a comprehensive Python programming exercise;
or even a deep dive into data analysis. Rather, it's intended as a crash course in some of
the common ways in which data scientists can use Python to work with data.
Note: If you've never used the Jupyter Notebooks environment before, there are a few
things you should be aware of:
Notebooks are made up of cells. Some cells (like this one) contain markdown text, while
others (like the one beneath this one) contain code.
You can run each code cell by using the ► Run button, which appears when you hover
over the cell.
The output from each code cell will be displayed immediately below the cell.
Even though the code cells can be run individually, some variables used in the code are
global to the notebook. That means that you should run all of the code cells in order.
There may be dependencies between code cells, so if you skip a cell, subsequent cells
might not run correctly.
Suppose a college takes a sample of student grades for a data science class.
Run the code in the cell below by clicking the ► Run button to see the data.
data = [50,50,47,97,49,3,53,42,26,74,82,62,37,15,70,27,36,35,48,52,63,64]
print(data)
import numpy as np
grades = np.array(data)
print(grades)
Just in case you're wondering about the differences between a list and a NumPy array,
let's compare how these data types behave when we use them in an expression that
multiplies them by 2.
print(type(data), 'x 2:', data * 2)
print('---')
print(type(grades), 'x 2:', grades * 2)
Note that multiplying a list by 2 creates a new list of twice the length, with the original
sequence of list elements repeated. Multiplying a NumPy array, on the other hand,
performs an element-wise calculation in which the array behaves like a vector, so we end
up with an array of the same size in which each element has been multiplied by 2.
The key takeaway from this is that NumPy arrays are specifically designed to support
mathematical operations on numeric data - which makes them more useful for data
analysis than a generic list.
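Element-wise behavior extends beyond multiplication. A small sketch with illustrative values:

```python
import numpy as np

grades = np.array([50, 50, 47, 97, 49, 3])

print(grades + 5)            # element-wise addition
print(grades >= 50)          # element-wise comparison -> boolean array
print(grades[grades >= 50])  # a boolean mask selects the matching elements
```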
You might have spotted that the class type for the NumPy array above is
a numpy.ndarray. The nd indicates that this is a structure that can consist of
multiple dimensions (it can have n dimensions). Our specific instance has a single
dimension of student grades.
grades.shape
The shape confirms that this array has only one dimension, which contains 22 elements
(there are 22 grades in the original list). You can access the individual elements in the
array by their zero-based ordinal position. Let's get the first element (the one in position
0).
grades[0]
Alright, now you know your way around a NumPy array, it's time to perform some
analysis of the grades data.
You can apply aggregations across the elements in the array, so let's find the simple
average grade (in other words, the mean grade value).
grades.mean()
So the mean grade is just around 50 - more or less in the middle of the possible range
from 0 to 100.
Let's add a second set of data for the same students, this time recording the typical
number of hours per week they devoted to studying.
# Define an array of study hours
study_hours = [10.0,11.5,9.0,16.0,9.25,1.0,11.5,9.0,8.5,14.5,15.5,
13.75,9.0,8.0,15.5,8.0,9.0,6.0,10.0,12.0,12.5,12.0]
# Create a 2D array (an array of arrays)
student_data = np.array([study_hours, grades])
# display the array
student_data
# Show shape of 2D array
student_data.shape
To navigate this structure, you need to specify the position of each element in the
hierarchy. So to find the first value in the first array (which contains the study hours
data), you can use the following code.
# Show the first element of the first element
student_data[0][0]
Now you have a multidimensional array containing both the students' study time and
grade information, which you can use to compare data. For example, how does the
mean study time compare to the mean grade?
# Get the mean value of each sub-array
avg_study = student_data[0].mean()
avg_grade = student_data[1].mean()
print('Average study hours: {:.2f}\nAverage grade: {:.2f}'.format(avg_study, avg_grade))
Run the following cell to import the Pandas library and create a DataFrame with three
columns. The first column is a list of student names, and the second and third columns
are the NumPy arrays containing the study time and grade data.
import pandas as pd
df_students = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic', 'Jimmie',
                                     'Rhonda', 'Giovanni', 'Francesca', 'Rajab', 'Naiyana', 'Kian', 'Jenny',
                                     'Jakeem', 'Helena', 'Ismat', 'Anila', 'Skye', 'Daniel', 'Aisha'],
                            'StudyHours': student_data[0],
                            'Grade': student_data[1]})
df_students
You can use the DataFrame's loc method to retrieve data for a specific index value, like
this.
# Get the data for index value 5
df_students.loc[5]
You can also get the data at a range of index values, like this:
# Get the rows with index values from 0 to 5
df_students.loc[0:5]
In addition to being able to use the loc method to find rows based on the index, you
can use the iloc method to find rows based on their ordinal position in the DataFrame
(regardless of the index):
# Get data in the first five rows
df_students.iloc[0:5]
The loc method returned rows with index labels in the list of values from 0 to 5 - which
includes 0, 1, 2, 3, 4, and 5 (six rows). However, the iloc method returns the rows in
the positions included in the range 0 to 5; since integer ranges don't include the
upper-bound value, this includes positions 0, 1, 2, 3, and 4 (five rows).
# Get the values in the second and third columns (positions 1 and 2) of the first row
df_students.iloc[0,[1,2]]
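The inclusive-versus-exclusive slicing difference can be demonstrated with a small, illustrative DataFrame (not the student data):

```python
import pandas as pd

# An illustrative DataFrame with a default integer index
df = pd.DataFrame({'Grade': [50, 50, 47, 97, 49, 3, 53]})

print(len(df.loc[0:5]))   # label-based slice: labels 0 through 5 inclusive -> 6 rows
print(len(df.iloc[0:5]))  # position-based slice: positions 0 through 4 -> 5 rows
```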
Let's return to the loc method, and see how it works with columns. Remember
that loc is used to locate data items based on index values rather than positions. In the
absence of an explicit index column, the rows in our dataframe are indexed as integer
values, but the columns are identified by name:
df_students.loc[0,'Grade']
Here's another useful trick. You can use the loc method to find indexed rows based on a
filtering expression that references named columns other than the index, like this:
df_students.loc[df_students['Name']=='Aisha']
Actually, you don't need to explicitly use the loc method to do this - you can simply
apply a DataFrame filtering expression, like this:
df_students[df_students['Name']=='Aisha']
And for good measure, you can achieve the same results by using the
DataFrame's query method, like this:
df_students.query('Name=="Aisha"')
The three previous examples underline an occasionally confusing truth about working
with Pandas: often, there are multiple ways to achieve the same results. Another
example of this is the way you refer to a DataFrame column name. You can specify the
column name as a named index value (as in the df_students['Name'] examples we've
seen so far), or you can use the column as a property of the DataFrame, like this:
df_students[df_students.Name == 'Aisha']
We constructed the DataFrame from some existing arrays. However, in many real-world
scenarios, data is loaded from sources such as files. Let's replace the student grades
DataFrame with the contents of a text file.
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/grades.csv
df_students = pd.read_csv('grades.csv',delimiter=',',header='infer')
df_students.head()
The DataFrame's read_csv method is used to load data from text files. As you can see in
the example code, you can specify options such as the column delimiter and which row
(if any) contains column headers (in this case, the delimiter is a comma and the first row
contains the column names - these are the default settings, so the parameters could
have been omitted).
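To illustrate that the parameters could indeed have been omitted, here's a small sketch using an in-memory CSV string (with placeholder values) rather than the downloaded file:

```python
import io
import pandas as pd

# An in-memory stand-in for a file like grades.csv (illustrative values)
csv_text = "Name,StudyHours,Grade\nDan,10.0,50\nJoann,11.5,50\n"

# These two calls are equivalent: a comma delimiter and an inferred
# header row are read_csv's default settings
df1 = pd.read_csv(io.StringIO(csv_text), delimiter=',', header='infer')
df2 = pd.read_csv(io.StringIO(csv_text))

print(df1.equals(df2))  # True
```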
Handling missing values
One of the most common issues data scientists need to deal with is incomplete or
missing data. So how would we know that the DataFrame contains missing values? You
can use the isnull method to identify which individual values are null, like this:
df_students.isnull()
Of course, with a larger DataFrame, it would be inefficient to review all of the rows and
columns individually; so we can get the sum of missing values for each column, like this:
df_students.isnull().sum()
To see them in context, we can filter the dataframe to include only rows where any of
the columns (axis 1 of the DataFrame) are null.
df_students[df_students.isnull().any(axis=1)]
When the DataFrame is retrieved, the missing numeric values show up as NaN (not a
number).
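A quirk worth knowing: NaN is defined so that it never compares equal to anything, including itself, which is why methods like isnull are needed to find it. A quick sketch:

```python
import numpy as np
import pandas as pd

# NaN never compares equal to anything, including itself,
# so equality checks can't be used to find missing values
print(np.nan == np.nan)  # False

# Use isnull() (or its alias isna()) instead
s = pd.Series([1.0, np.nan, 3.0])
print(s.isnull().tolist())  # [False, True, False]
```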
So now that we've found the null values, what can we do about them?
# One common approach is to impute a replacement value; here, the column mean
df_students.StudyHours = df_students.StudyHours.fillna(df_students.StudyHours.mean())
df_students
Alternatively, it might be important to ensure that you only use data you know to be
absolutely correct; so you can drop rows or columns that contain null values by using
the dropna method. In this case, we'll remove rows (axis 0 of the DataFrame) where any
of the columns contain null values.
df_students = df_students.dropna(axis=0, how='any')
df_students
Now that we've cleaned up the missing values, we're ready to explore the data in the
DataFrame. Let's start by comparing the mean study hours and grades.
# Get the mean study hours using the column name as an index
mean_study = df_students['StudyHours'].mean()
# Get the mean grade using the column name as a property (just to make the point!)
mean_grade = df_students.Grade.mean()
# Print the mean study hours and mean grade
print('Average weekly study hours: {:.2f}\nAverage grade: {:.2f}'.format(mean_study, mean_grade))
OK, let's filter the DataFrame to find only the students who studied for more than the
average amount of time.
# Get students who studied for more than the mean hours
df_students[df_students.StudyHours > mean_study]
Note that the filtered result is itself a DataFrame, so you can work with its columns just
like any other DataFrame.
For example, let's find the average grade for students who undertook more than the
average amount of study time.
# What was their mean grade?
df_students[df_students.StudyHours > mean_study].Grade.mean()
Let's assume that the passing grade for the course is 60.
We can use that information to add a new column to the DataFrame, indicating whether
or not each student passed.
First, we'll create a Pandas Series containing the pass/fail indicator (True or False), and
then we'll concatenate that series as a new column (axis 1) in the DataFrame.
passes = pd.Series(df_students['Grade'] >= 60)
df_students = pd.concat([df_students, passes.rename("Pass")], axis=1)
df_students
DataFrames are designed for tabular data, and you can use them to perform many of
the kinds of data analytics operations you can do in a relational database, such as
grouping and aggregating tables of data.
For example, you can use the groupby method to group the student data into groups
based on the Pass column you added previously, and count the number of names in
each group - in other words, you can determine how many students passed and failed.
print(df_students.groupby(df_students.Pass).Name.count())
You can aggregate multiple fields in a group using any available aggregation function.
For example, you can find the mean study time and grade for the groups of students
who passed and failed the course.
print(df_students.groupby(df_students.Pass)[['StudyHours', 'Grade']].mean())
DataFrames are amazingly versatile, and make it easy to manipulate data. Many
DataFrame operations return a new copy of the DataFrame; so if you want to modify a
DataFrame but keep the existing variable, you need to assign the result of the operation
to the existing variable. For example, the following code sorts the student data into
descending order of Grade, and assigns the resulting sorted DataFrame to the
original df_students variable.
# Create a DataFrame with the data sorted by Grade (descending)
df_students = df_students.sort_values('Grade', ascending=False)
# Show the DataFrame
df_students
NumPy and DataFrames are the workhorses of data science in Python. They provide
ways to load, explore, and analyze tabular data. As we will see in subsequent modules,
even advanced analysis methods typically rely on NumPy and Pandas for these
important roles.
In our next workbook, we'll take a look at how to create graphs and explore your data in
more interesting ways.
UNIT 4/9:
Visualize data
Data scientists visualize data to understand it better. This can mean looking at the raw
data, summary measures such as averages, or graphing the data. Graphs are a powerful
means of viewing data, as we can discern moderately complex patterns quickly without
needing to define mathematical summary measures.
While sometimes we know ahead of time what kind of graph will be most useful, other
times we use graphs in an exploratory way. To understand the power of data
visualization, consider the data below: the location (x,y) of a self-driving car. In its raw
form, it's hard to see any real patterns. The mean, or average, tells us that its path was
centred around x=0.2 and y=0.3, and the range of numbers appears to be between
about -2 and 2.
If we now plot Location-X over time, we can see that we appear to have some missing
values between times 7 and 12.
If we graph X vs Y, we end up with a map of where the car has driven. It’s instantly
obvious that the car has been driving in a circle, but at some point drove to the center
of that circle.
Graphs aren't limited to 2D scatter plots like those above. They can be used to explore
other kinds of data, such as proportions (shown through pie charts and stacked bar
graphs), how data are spread (with histograms and box-and-whisker plots), and how two
data sets differ. Often, when we're trying to understand raw data or results, we may
experiment with different types of graphs until we come across one that explains the
data in a visually intuitive way.
UNIT 5/9:
import pandas as pd
# Load data from a text file
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/grades.csv
df_students = pd.read_csv('grades.csv',delimiter=',',header='infer')
# Remove any rows with missing data
df_students = df_students.dropna(axis=0, how='any')
# Calculate who passed, assuming '60' is the grade needed to pass
passes = pd.Series(df_students['Grade'] >= 60)
# Save who passed to the Pandas dataframe
df_students = pd.concat([df_students, passes.rename("Pass")], axis=1)
# Print the result out into this notebook
df_students
Let's start with a simple bar chart that shows the grade of each student.
# Ensure plots are displayed inline in the notebook
%matplotlib inline
from matplotlib import pyplot as plt
# Create a bar plot of name vs grade
plt.bar(x=df_students.Name, height=df_students.Grade)
# Display the plot
plt.show()
Well, that worked; but the chart could use some improvements to make it clearer what
we're looking at.
Note that you used the pyplot module from Matplotlib to plot the chart. This module
provides a whole bunch of ways to improve the visual elements of the plot. For example,
the following code:
Specifies the color of the bar chart.
Adds a title to the chart (so we know what it represents)
Adds labels to the X and Y (so we know which axis shows which data)
Adds a grid (to make it easier to determine the values for the bars)
Rotates the X markers (so we can read them)
# Create a bar plot of name vs grade
plt.bar(x=df_students.Name, height=df_students.Grade, color='orange')
# Customize the chart
plt.title('Student Grades')
plt.xlabel('Student')
plt.ylabel('Grade')
plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.xticks(rotation=90)
# Display the plot
plt.show()
A plot is technically contained within a Figure. In the previous examples, the figure was
created implicitly for you; but you can create it explicitly. For example, the following
code creates a figure with a specific size.
# Create a Figure
fig = plt.figure(figsize=(8,3))
# Create a bar plot of name vs grade
plt.bar(x=df_students.Name, height=df_students.Grade, color='orange')
# Customize the chart
plt.title('Student Grades')
plt.xlabel('Student')
plt.ylabel('Grade')
plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.xticks(rotation=90)
# Show the figure
plt.show()
For example, the following code creates a figure with two subplots - one is a bar chart
showing student grades, and the other is a pie chart comparing the number of passing
grades to non-passing grades.
# Create a figure for 2 subplots (1 row, 2 columns)
fig, ax = plt.subplots(1, 2, figsize = (10,4))
# Create a bar plot of name vs grade on the first axis
ax[0].bar(x=df_students.Name, height=df_students.Grade, color='orange')
ax[0].set_title('Grades')
ax[0].set_xticklabels(df_students.Name, rotation=90)
# Create a pie chart of pass counts on the second axis
pass_counts = df_students['Pass'].value_counts()
ax[1].pie(pass_counts, labels=pass_counts)
ax[1].set_title('Passing Grades')
ax[1].legend(pass_counts.keys().tolist())
# Add a title to the Figure
fig.suptitle('Student Data')
# Show the figure
fig.show()
Until now, you've used methods of the Matplotlib.pyplot object to plot charts. However,
Matplotlib is so foundational to graphics in Python that many packages, including
Pandas, provide methods that abstract the underlying Matplotlib functions and simplify
plotting. For example, the DataFrame provides its own methods for plotting data, as
shown in the following example to plot a bar chart of study hours.
df_students.plot.bar(x='Name', y='StudyHours', color='teal', figsize=(6,4))
A lot of data science is rooted in statistics, so we'll explore some basic statistical
techniques.
Note: This is not intended to teach you statistics - that's much too big a topic for this
notebook. It will, however, introduce you to some statistical concepts and techniques that
data scientists use as they explore data in preparation for machine learning modeling.
When examining a variable (for example, a sample of student grades), data scientists are
particularly interested in its distribution (in other words, how all the different grade
values are spread across the sample). The starting point for this exploration is often to
visualize the data as a histogram, and see how frequently each value for the variable
occurs.
# Get the variable to examine
var_data = df_students['Grade']
# Create a Figure
fig = plt.figure(figsize=(10,4))
# Plot a histogram
plt.hist(var_data)
# Add titles and labels
plt.title('Data Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')
# Show the figure
fig.show()
The histogram for grades is a symmetric shape, where the most frequently occurring
grades tend to be in the middle of the range (around 50), with fewer grades at the
extreme ends of the scale.
Let's calculate the mean, median, and mode, along with the minimum and maximum
values for comparison, and show them on the histogram.
*Of course, in some sample sets, there may be a tie for the most common value - in
which case the dataset is described as bimodal or even multimodal.
# Get the variable to examine
var = df_students['Grade']

# Get statistics
min_val = var.min()
max_val = var.max()
mean_val = var.mean()
med_val = var.median()
mod_val = var.mode()[0]

print('Minimum:{:.2f}\nMean:{:.2f}\nMedian:{:.2f}\nMode:{:.2f}\nMaximum:{:.2f}\n'
      .format(min_val, mean_val, med_val, mod_val, max_val))
# Create a Figure
fig = plt.figure(figsize=(10,4))
# Plot a histogram
plt.hist(var)
# Add lines for the statistics
plt.axvline(x=min_val, color = 'gray', linestyle='dashed', linewidth = 2)
plt.axvline(x=mean_val, color = 'cyan', linestyle='dashed', linewidth = 2)
plt.axvline(x=med_val, color = 'red', linestyle='dashed', linewidth = 2)
plt.axvline(x=mod_val, color = 'yellow', linestyle='dashed', linewidth = 2)
plt.axvline(x=max_val, color = 'gray', linestyle='dashed', linewidth = 2)
# Add titles and labels
plt.title('Data Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')
# Show the figure
fig.show()
For the grade data, the mean, median, and mode all seem to be more or less in the
middle of the minimum and maximum, at around 50.
# Get the variable to examine
var = df_students['Grade']
# Create a Figure
fig = plt.figure(figsize=(10,4))
# Plot a boxplot
plt.boxplot(var)
# Add titles and labels
plt.title('Data Distribution')
# Show the figure
fig.show()
For learning, it can be useful to combine histograms and box plots, with the box plot's
orientation changed to align it with the histogram (in some ways, it can be helpful to
think of the histogram as a "front elevation" view of the distribution, and the box plot as
a "plan" view of the distribution from above.)
# Create a function that we can re-use
def show_distribution(var_data):
    from matplotlib import pyplot as plt

    # Get statistics
    min_val = var_data.min()
    max_val = var_data.max()
    mean_val = var_data.mean()
    med_val = var_data.median()
    mod_val = var_data.mode()[0]

    print('Minimum:{:.2f}\nMean:{:.2f}\nMedian:{:.2f}\nMode:{:.2f}\nMaximum:{:.2f}\n'
          .format(min_val, mean_val, med_val, mod_val, max_val))

    # Create a figure for 2 subplots (2 rows, 1 column)
    fig, ax = plt.subplots(2, 1, figsize=(10,4))

    # Plot the histogram
    ax[0].hist(var_data)
    ax[0].set_ylabel('Frequency')

    # Add lines for the statistics
    ax[0].axvline(x=min_val, color='gray', linestyle='dashed', linewidth=2)
    ax[0].axvline(x=mean_val, color='cyan', linestyle='dashed', linewidth=2)
    ax[0].axvline(x=med_val, color='red', linestyle='dashed', linewidth=2)
    ax[0].axvline(x=mod_val, color='yellow', linestyle='dashed', linewidth=2)
    ax[0].axvline(x=max_val, color='gray', linestyle='dashed', linewidth=2)

    # Plot the boxplot
    ax[1].boxplot(var_data, vert=False)
    ax[1].set_xlabel('Value')

    # Add a title to the Figure
    fig.suptitle('Data Distribution')

    # Show the figure
    fig.show()

# Get the variable to examine
col = df_students['Grade']
# Call the function
show_distribution(col)
All of the measurements of central tendency are right in the middle of the data
distribution, which is symmetric with values becoming progressively lower in both
directions from the middle.
To explore this distribution in more detail, you need to understand that statistics is
fundamentally about taking samples of data and using probability functions to
extrapolate information about the full population of data.
What does this mean? Samples refer to the data we have on hand - such as information
about these 22 students' study habits and grades. The population refers to all possible
data we could collect - such as every student's grades and study habits across every
educational institution throughout the history of time. Usually we're interested in the
population, but it's simply not practical to collect all of that data. Instead, we need to try
to estimate what the population is like from the small amount of data (samples) that we
have.
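One concrete place this sample-versus-population distinction shows up is in how standard deviation is computed: Pandas defaults to the sample estimate (dividing by n-1), while NumPy defaults to the population formula (dividing by n). A small sketch with illustrative grades:

```python
import numpy as np
import pandas as pd

# Illustrative sample of grades (not the full dataset)
data = pd.Series([50, 50, 47, 97, 49])

# Pandas defaults to the *sample* standard deviation (divides by n-1),
# treating the data as a sample drawn from a larger population
print(data.std())

# NumPy defaults to the *population* standard deviation (divides by n)
print(np.std(data.values))

# They agree once the "delta degrees of freedom" (ddof) matches
print(np.isclose(data.std(), np.std(data.values, ddof=1)))  # True
```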
The pyplot module from Matplotlib provides a helpful plot function to show this density.
def show_density(var_data):
    from matplotlib import pyplot as plt

    fig = plt.figure(figsize=(10,4))

    # Plot density
    var_data.plot.density()

    # Add titles and labels
    plt.title('Data Density')

    # Show the mean, median, and mode
    plt.axvline(x=var_data.mean(), color='cyan', linestyle='dashed', linewidth=2)
    plt.axvline(x=var_data.median(), color='red', linestyle='dashed', linewidth=2)
    plt.axvline(x=var_data.mode()[0], color='yellow', linestyle='dashed', linewidth=2)

    # Show the figure
    plt.show()

# Get the density of Grade
col = df_students['Grade']
show_density(col)
Summary
Well done! There were a number of new concepts in here, so let's summarise.
Here we have:
1. Made graphs with matplotlib
2. Seen how to customise these graphs
3. Calculated basic statistics, such as medians
4. Looked at the spread of data using box plots and histograms
5. Learned about samples vs populations
6. Estimated what the population of grades might look like from a sample of grades.
In our next notebook we will look at spotting unusual data, and finding relationships
between data.
Further Reading
To learn more about the Python packages you explored in this notebook, see the
following documentation:
NumPy
Pandas
Matplotlib
Because of the complexity of ‘real world’ data, raw data has to be inspected for issues
before being used. Best practice is therefore to inspect the raw data and process it
before use, which reduces errors or issues, typically by removing erroneous data points
or modifying the data into a more useful form.
It's important to realize that most real-world data are influenced by factors that weren't
recorded at the time. For example, we might have a table of race-car track times
alongside engine sizes, but various other factors that weren't written down—such as the
weather—probably also played a role. If problematic, the influence of these factors can
often be reduced by increasing the size of the dataset.
In other situations, data points that are clearly outside of what is expected—also known
as ‘outliers’—can sometimes be safely removed from analyses, though care must be
taken to not remove data points that provide real insights.
Another common issue in real-world data is bias. Bias refers to a tendency to select
certain types of values more frequently than others, in a way that misrepresents the
underlying population, or ‘real world’. Bias can sometimes be identified by exploring
data while keeping in mind basic knowledge about where the data came from.
Real-world data will always have issues, but this is often a surmountable
problem. Remember to:
Check for missing values and badly recorded data
Consider removal of obvious outliers
Consider what real-world factors might affect your analysis and consider if your
dataset size is large enough to handle this
Check for biased raw data and consider your options to fix this, if found
UNIT 7/9:
Last time, we looked at grades for our student data, and estimated from this sample
what the full population of grades might look like. Just to refresh, let's take a look at this
data again.
Run the code below to print out the data and make a histogram + boxplot that show
the grades for our sample of students.
import pandas as pd
from matplotlib import pyplot as plt

# Load data from a text file
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/grades.csv
df_students = pd.read_csv('grades.csv',delimiter=',',header='infer')

# Remove any rows with missing data
df_students = df_students.dropna(axis=0, how='any')

# Calculate who passed, assuming '60' is the grade needed to pass
passes = pd.Series(df_students['Grade'] >= 60)

# Save who passed to the Pandas dataframe
df_students = pd.concat([df_students, passes.rename("Pass")], axis=1)

# Print the result out into this notebook
print(df_students)

# Create a function that we can re-use
def show_distribution(var_data):
    '''
    This function will make a distribution (graph) and display it
    '''
    # Get statistics
    min_val = var_data.min()
    max_val = var_data.max()
    mean_val = var_data.mean()
    med_val = var_data.median()
    mod_val = var_data.mode()[0]

    print('Minimum:{:.2f}\nMean:{:.2f}\nMedian:{:.2f}\nMode:{:.2f}\nMaximum:{:.2f}\n'
          .format(min_val, mean_val, med_val, mod_val, max_val))

    # Create a figure for 2 subplots (2 rows, 1 column)
    fig, ax = plt.subplots(2, 1, figsize=(10,4))

    # Plot the histogram
    ax[0].hist(var_data)
    ax[0].set_ylabel('Frequency')

    # Add lines for the statistics
    ax[0].axvline(x=min_val, color='gray', linestyle='dashed', linewidth=2)
    ax[0].axvline(x=mean_val, color='cyan', linestyle='dashed', linewidth=2)
    ax[0].axvline(x=med_val, color='red', linestyle='dashed', linewidth=2)
    ax[0].axvline(x=mod_val, color='yellow', linestyle='dashed', linewidth=2)
    ax[0].axvline(x=max_val, color='gray', linestyle='dashed', linewidth=2)

    # Plot the boxplot
    ax[1].boxplot(var_data, vert=False)
    ax[1].set_xlabel('Value')

    # Add a title to the Figure
    fig.suptitle('Data Distribution')

    # Show the figure
    fig.show()

show_distribution(df_students['Grade'])
As you might recall, our data had the mean and mode at the center, with data spread
symmetrically from there.
Now let's take a look at the distribution of the study hours data.
# Get the variable to examine
col = df_students['StudyHours']
# Call the function
show_distribution(col)
The distribution of the study time data is significantly different from that of the grades.
Note that the whiskers of the box plot only begin at around 6.0, indicating that the vast
majority of the first quarter of the data is above this value. The minimum is marked with
an o, indicating that it is statistically an outlier - a value that lies significantly outside the
range of the rest of the distribution.
Outliers can occur for many reasons. Maybe a student meant to record "10" hours of
study time, but entered "1" and missed the "0". Or maybe the student was abnormally
lazy when it comes to studying! Either way, it's a statistical anomaly that doesn't
represent a typical student. Let's see what the distribution looks like without it.
# Get the variable to examine
# We will only get students who have studied more than one hour
col = df_students[df_students.StudyHours>1]['StudyHours']
# Call the function
show_distribution(col)
For learning purposes, we've just treated the value 1 as a true outlier here and
excluded it. In the real world, though, it would be unusual to exclude data at the
extremes without more justification when our sample size is so small. This is because the
smaller our sample size, the more likely it is that our sample is a bad representation of
the whole population (here, the population means study hours for all students, not just our
22). For example, if we sampled study time for another 1000 students, we might find
that it's actually quite common to not study much!
When we have more data available, our sample becomes more reliable. This makes it
easier to consider outliers as being values that fall below or above percentiles within
which most of the data lie. For example, the following code uses the
Pandas quantile function to exclude observations below the 0.01 quantile (the value
above which 99% of the data reside).
# Calculate the 0.01 quantile
q01 = df_students.StudyHours.quantile(0.01)

# Get the variable to examine
col = df_students[df_students.StudyHours > q01]['StudyHours']

# Call the function
show_distribution(col)
Tip: You can also eliminate outliers at the upper end of the distribution by defining a
threshold at a high percentile value - for example, you could use the quantile function
to find the 0.99 quantile (the value below which 99% of the data reside).
With the outliers removed, the box plot shows all data within the four quartiles. Note
that the distribution is not symmetric like it is for the grade data, though: there are
some students with very high study times of around 16 hours, but the bulk of the data is
between 7 and 13 hours. The few extremely high values pull the mean toward the
higher end of the scale.
def show_density(var_data):
    fig = plt.figure(figsize=(10,4))

    # Plot density
    var_data.plot.density()

    # Add titles and labels
    plt.title('Data Density')

    # Show the mean, median, and mode
    plt.axvline(x=var_data.mean(), color='cyan', linestyle='dashed', linewidth=2)
    plt.axvline(x=var_data.median(), color='red', linestyle='dashed', linewidth=2)
    plt.axvline(x=var_data.mode()[0], color='yellow', linestyle='dashed', linewidth=2)

    # Show the figure
    plt.show()

# Get the density of StudyHours
show_density(col)
This kind of distribution is called right skewed. The mass of the data is on the left side of
the distribution, creating a long tail to the right because of the values at the extreme
high end, which pull the mean to the right.
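Pandas can also quantify this asymmetry directly with the skew method: a positive value indicates a right-skewed distribution, a negative value a left-skewed one. A quick sketch using a small made-up series (not the actual study-hours data):

```python
import pandas as pd

# Small illustrative series with a long right tail (not the real data)
sample = pd.Series([7, 8, 8, 9, 9, 10, 10, 11, 12, 16])

# Positive skewness => mass on the left, long tail to the right
print(sample.skew() > 0)  # True for a right-skewed sample
```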
Measures of variance
So now we have a good idea where the middle of the grade and study hours data
distributions are. However, there's another aspect of the distributions we should
examine: how much variability is there in the data?
for col_name in ['Grade','StudyHours']:
    col = df_students[col_name]
    rng = col.max() - col.min()
    var = col.var()
    std = col.std()
    print('\n{}:\n - Range: {:.2f}\n - Variance: {:.2f}\n - Std.Dev: {:.2f}'.format(col_name, rng, var, std))
When working with a normal distribution, the standard deviation works with the
particular characteristics of a normal distribution to provide even greater insight. Run
the cell below to see the relationship between standard deviations and the data in the
normal distribution.
import scipy.stats as stats
# Get the Grade column
col = df_students['Grade']
# get the density
density = stats.gaussian_kde(col)
# Plot the density
col.plot.density()
# Get the mean and standard deviation
s = col.std()
m = col.mean()
# Annotate 1 stdev
x1 = [m-s, m+s]
y1 = density(x1)
plt.plot(x1,y1, color='magenta')
plt.annotate('1 std (68.26%)', (x1[1],y1[1]))
# Annotate 2 stdevs
x2 = [m-(s*2), m+(s*2)]
y2 = density(x2)
plt.plot(x2,y2, color='green')
plt.annotate('2 std (95.45%)', (x2[1],y2[1]))
# Annotate 3 stdevs
x3 = [m-(s*3), m+(s*3)]
y3 = density(x3)
plt.plot(x3,y3, color='orange')
plt.annotate('3 std (99.73%)', (x3[1],y3[1]))
# Show the location of the mean
plt.axvline(col.mean(), color='cyan', linestyle='dashed', linewidth=1)
plt.axis('off')
plt.show()
The horizontal lines show the percentage of data within 1, 2, and 3 standard deviations
of the mean (plus or minus).
So, since we know that the mean grade is 49.18, the standard deviation is 21.74, and the
distribution of grades is approximately normal, we can calculate that 68.26% of students
should achieve a grade between 27.44 and 70.92.
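The 68.26% figure comes from the standard normal distribution, and the interval is simply the mean plus or minus one standard deviation. Assuming scipy is available, both can be checked like this:

```python
from scipy import stats

# Fraction of a normal distribution within 1 standard deviation of the mean
within_1_std = stats.norm.cdf(1) - stats.norm.cdf(-1)
print(round(within_1_std * 100, 2))  # 68.27 (often quoted as 68.26%)

# Interval for the grade data, using the mean and std from the text
mean, std = 49.18, 21.74
print(round(mean - std, 2), round(mean + std, 2))  # 27.44 70.92
```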
The descriptive statistics we've used to understand the distribution of the student data
variables are the basis of statistical analysis; and because they're such an important part
of exploring your data, there's a built-in describe method of the DataFrame object that
returns the main descriptive statistics for all numeric columns.
df_students.describe()
Comparing data
Now that you know something about the statistical distribution of the data in your
dataset, you're ready to examine your data to identify any apparent relationships
between variables.
First of all, let's get rid of any rows that contain outliers so that we have a sample that is
representative of a typical class of students. We identified that the StudyHours column
contains some outliers with extremely low values, so we'll remove those rows.
df_sample = df_students[df_students['StudyHours']>1]
df_sample
To compare study hours between students who passed and those who didn't, let's create
box plots showing the distribution of StudyHours for each possible Pass value (true and false).
df_sample.boxplot(column='StudyHours', by='Pass', figsize=(8,5))
Now let's compare two numeric variables. We'll start by creating a bar chart that shows
both grade and study hours.
# Create a bar plot of name vs grade and study hours
df_sample.plot(x='Name', y=['Grade','StudyHours'], kind='bar', figsize=(8,5))
The chart shows bars for both grade and study hours for each student; but it's not easy
to compare because the values are on different scales. Grades are measured in grade
points, and range from 3 to 97; while study time is measured in hours and ranges from 1
to 16.
from sklearn.preprocessing import MinMaxScaler

# Get a scaler object
scaler = MinMaxScaler()

# Create a new dataframe for the scaled values
df_normalized = df_sample[['Name', 'Grade', 'StudyHours']].copy()

# Normalize the numeric columns
df_normalized[['Grade','StudyHours']] = scaler.fit_transform(df_normalized[['Grade','StudyHours']])

# Plot the normalized values
df_normalized.plot(x='Name', y=['Grade','StudyHours'], kind='bar', figsize=(8,5))
With the data normalized, it's easier to see an apparent relationship between grade and
study time. It's not an exact match, but it definitely seems like students with higher
grades tend to have studied more.
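Under the hood, MinMaxScaler applies the simple min-max formula (x - min) / (max - min) to each column, rescaling every value into the range 0 to 1. A minimal sketch with toy values:

```python
def min_max_scale(values):
    # Rescale values to the range 0-1: (x - min) / (max - min)
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Toy grades: the minimum maps to 0, the maximum to 1
print(min_max_scale([3, 50, 97]))  # [0.0, 0.5, 1.0]
```

This is why the normalized bar chart makes the two columns comparable: both now span the same 0-1 scale regardless of their original units.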
So there seems to be a correlation between study time and grade; and in fact, there's a
statistical correlation measurement we can use to quantify the relationship between
these columns.
df_normalized.Grade.corr(df_normalized.StudyHours)
The correlation statistic is a value between -1 and 1 that indicates the strength of a
relationship. Values above 0 indicate a positive correlation (high values of one variable
tend to coincide with high values of the other), while values below 0 indicate
a negative correlation (high values of one variable tend to coincide with low values of
the other). In this case, the correlation value is close to 1; showing a strongly positive
correlation between study time and grade.
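The Pearson correlation coefficient that the corr method computes is the covariance of the two columns divided by the product of their standard deviations. A sketch of that calculation with toy data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])  # perfectly linear in x

# Pearson r = cov(x, y) / (std(x) * std(y)), using sample statistics (ddof=1)
r = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(round(r, 4))  # 1.0 for a perfect positive linear relationship
```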
Note: Data scientists often quote the maxim "correlation is not causation". In other
words, as tempting as it might be, you shouldn't interpret the statistical correlation as
explaining why one of the values is high. In the case of the student data, the statistics
demonstrate that students with high grades tend to also have high amounts of study
time; but this is not the same as proving that they achieved high grades because they
studied a lot. The statistic could equally be used as evidence to support the nonsensical
conclusion that the students studied a lot because their grades were going to be high.
Another way to visualise the apparent correlation between two numeric columns is to
use a scatter plot.
# Create a scatter plot
df_sample.plot.scatter(title='Study Time vs Grade', x='StudyHours', y='Grade')
Again, it looks like there's a discernible pattern in which the students who studied the
most hours are also the students who got the highest grades.
We can see this more clearly by adding a regression line (or a line of best fit) to the plot
that shows the general trend in the data. To do this, we'll use a statistical technique
called least squares regression.
Warning - Math Ahead!
Cast your mind back to when you were learning how to solve linear equations in school,
and recall that the slope-intercept form of a linear equation looks like this:
y = mx + b
In this equation, y and x are the coordinate variables, m is the slope of the line, and b is
the y-intercept (where the line goes through the Y-axis).
In the case of our scatter plot for our student data, we already have our values
for x (StudyHours) and y (Grade), so we just need to calculate the intercept and slope of
the straight line that lies closest to those points. Then we can form a linear equation that
calculates a new y value on that line for each of our x (StudyHours) values - to avoid
confusion, we'll call this new y value f(x) (because it's the output from a linear
equation function based on x). The difference between the original y (Grade) value and
the f(x) value is the error between our regression line and the actual Grade achieved by
the student. Our goal is to calculate the slope and intercept for a line with the lowest
overall error.
Specifically, we define the overall error by taking the error for each point, squaring it,
and adding all the squared errors together. The line of best fit is the line that gives us
the lowest value for the sum of the squared errors - hence the name least squares
regression.
from scipy import stats

df_regression = df_sample[['Grade', 'StudyHours']].copy()

# Get the regression slope and intercept
m, b, r, p, se = stats.linregress(df_regression['StudyHours'], df_regression['Grade'])
print('slope: {:.4f}\ny-intercept: {:.4f}'.format(m,b))
print('so...\n f(x) = {:.4f}x + {:.4f}'.format(m,b))

# Use the function (mx + b) to calculate f(x) for each x (StudyHours) value
df_regression['fx'] = (m * df_regression['StudyHours']) + b

# Calculate the error between f(x) and the actual y (Grade) value
df_regression['error'] = df_regression['fx'] - df_regression['Grade']

# Create a scatter plot of Grade vs StudyHours
df_regression.plot.scatter(x='StudyHours', y='Grade')

# Plot the regression line
plt.plot(df_regression['StudyHours'], df_regression['fx'], color='cyan')

# Display the plot
plt.show()
The slope and intercept coefficients calculated for the regression line are shown above
the plot.
Some of the errors, particularly at the extreme ends, are quite large (up to over 17.5
grade points); but in general, the line is pretty close to the actual grades.
# Show the original x,y values, the f(x) value, and the error
df_regression[['StudyHours', 'Grade', 'fx', 'error']]
Now that you have the regression coefficients for the study time and grade relationship,
you can use them in a function to estimate the expected grade for a given amount of
study.
# Define a function based on our regression coefficients
def f(x):
    m = 6.3134
    b = -17.9164
    return m*x + b

study_time = 14

# Get f(x) for study time
prediction = f(study_time)

# Grade can't be less than 0 or more than 100
expected_grade = max(0, min(100, prediction))

# Print the estimated grade
print('Studying for {} hours per week may result in a grade of {:.0f}'.format(study_time, expected_grade))
This technique is in fact the basic premise of machine learning. You can take a set of
sample data that includes one or more features (in this case, the number of hours
studied) and a known label value (in this case, the grade achieved) and use the sample
data to derive a function that calculates predicted label values for any given set of
features.
Summary
Here we've looked at:
1. What outliers are and how to remove them
2. How data can be skewed
3. How to look at the spread of data
4. Basic ways to compare variables, such as grades and study time
Further Reading
To learn more about the Python packages you explored in this notebook, see the
following documentation:
NumPy
Pandas
Matplotlib
Knowledge check
200 XP
3 minutes
1.
You have a NumPy array with the shape (2,20). What does this tell you about the
elements in the array?
The array is two dimensional, consisting of two arrays each with 20 elements
ANSWER: 1
2.
You have a Pandas DataFrame named df_sales containing daily sales data. The
DataFrame contains the following columns: year, month, day_of_month, sales_total. You
want to find the average sales_total value. Which code should you use?
df_sales['sales_total'].avg()
df_sales['sales_total'].mean()
mean(df_sales['sales_total'])
ANSWER: 2
3. You have a DataFrame containing data about daily ice cream sales. You use the corr
method to compare the avg_temp and units_sold columns, and get a result of 0.97.
What does this result indicate?
On the day with the maximum units_sold value, the avg_temp value was 0.97
Days with high avg_temp values tend to coincide with days that have high units_sold
values
ANSWER: 2
UNIT 9/9:
Summary
1 minute
In this module, you learned how to use Python to explore, visualize, and manipulate
data. Data exploration is at the core of data science, and is a key element in data
analysis and machine learning.
Machine learning is a subset of data science that deals with predictive modeling. In
other words, machine learning uses data to create predictive models, in order to
predict unknown values. You might use machine learning to predict how much food a
supermarket needs to order, or to identify plants in photographs.
Machine learning works by identifying relationships between data values that describe
characteristics of something—its features, such as the height and color of a plant—and
the value we want to predict—the label, such as the species of plant. These relationships
are built into a model through a training process.
Note
The time to complete this optional challenge is not included in the estimated time for
this module - you can spend as little or as much time on it as you like!
MODULE 2:
UNIT 1/9:
Introduction
2 minutes
In machine learning, the goal of regression is to create a model that can predict a
numeric, quantifiable value, such as a price, amount, size, or other scalar number.
In real world situations, particularly when little data are available, regression models are
very useful for making predictions. For example, if a company that rents bicycles wants
to predict the expected number of rentals on a given day in the future, a regression
model can predict this number. A model could be created using existing data such as
the number of bicycles that were rented on days where the season, day of the week, and
so on, were also recorded.
Prerequisites
Knowledge of basic mathematics
Some experience programming in Python
Learning objectives
In this module, you will:
What is regression?
8 minutes
To train the model, we start with a data sample containing the features, as well as
known values for the label - so in this case we need historical data that includes dates,
weather conditions, and the number of bicycle rentals.
The use of historic data with known label values to train a model makes regression an
example of supervised machine learning.
A simple example
Let's take a simple example to see how the training and evaluation process works in
principle. Suppose we simplify the scenario so that we use a single feature—average
daily temperature—to predict the bicycle rentals label.
We start with some data that includes known values for the average daily temperature
feature and the bicycle rentals label.
Temperature   Rentals
56            115
61            126
67            137
72            140
76            152
82            156
54            114
62            129
Now we'll randomly select five of these observations and use them to train a regression
model. When we're talking about ‘training a model’, what we mean is finding a function
(a mathematical equation; let’s call it f) that can use the temperature feature (which we’ll
call x) to calculate the number of rentals (which we’ll call y). In other words, we need to
define the following function: f(x) = y.
x    y
56   115
61   126
67   137
72   140
76   152
The line represents a linear function that can be used with any value of x to apply
the slope of the line and its intercept (where the line crosses the y axis when x is 0) to
calculate y. In this case, if we extended the line to the left we'd find that when x is 0, y is
around 20, and the slope of the line is such that for each unit of x you move along to
the right, y increases by around 1.7. Our f function therefore can be calculated as 20 +
1.7x.
Now that we've defined our predictive function, we can use it to predict labels for the
validation data we held back and compare the predicted values (which we typically
indicate with the symbol ŷ, or "y-hat") with the actual known y values.
x    y     ŷ
82   156   159.4
54   114   111.8
62   129   125.4
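The ŷ column can be reproduced directly by applying the f(x) = 20 + 1.7x function derived above to each held-back temperature:

```python
def f(x):
    # Linear function fitted to the training sample: intercept 20, slope 1.7
    return 20 + 1.7 * x

# Held-back validation temperatures
for x in [82, 54, 62]:
    print(x, round(f(x), 1))  # 159.4, 111.8, 125.4
```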
The plotted points that are on the function line are the predicted ŷ values calculated by
the function, and the other plotted points are the actual y values.
There are various ways we can measure the variance between the predicted and actual
values, and we can use these metrics to evaluate how well the model predicts.
Note
Machine learning is based in statistics and math, and it's important to be aware of
specific terms that statisticians and mathematicians (and therefore data scientists) use.
You can think of the difference between a predicted label value and the actual label
value as a measure of error. However, in practice, the "actual" values are based on
sample observations (which themselves may be subject to some random variance). To
make it clear that we're comparing a predicted value (ŷ) with an observed value (y) we
refer to the difference between them as the residuals. We can summarize the residuals
for all of the validation data predictions to calculate the overall loss in the model as a
measure of its predictive performance.
One of the most common ways to measure the loss is to square the individual residuals,
sum the squares, and calculate the mean. Squaring the residuals has the effect of basing
the calculation on absolute values (ignoring whether the difference is negative or
positive) and giving more weight to larger differences. This metric is called the Mean
Squared Error.
y      ŷ       y - ŷ    (y - ŷ)²
156    159.4   -3.4     11.56
114    111.8    2.2      4.84
129    125.4    3.6     12.96
Sum ∑                   29.36
Mean x̄                   9.79
So the loss for our model based on the MSE metric is 9.79.
So is that any good? It's difficult to tell, because the MSE value isn't expressed in a
meaningful unit of measurement. We do know that the lower the value is, the less loss
there is in the model; and therefore, the better it is predicting. This makes it a useful
metric to compare two models and find the one that performs best.
Sometimes, it's more useful to express the loss in the same unit of measurement as the
predicted label value itself - in this case, the number of rentals. It's possible to do this by
calculating the square root of the MSE, which produces a metric known, unsurprisingly,
as the Root Mean Squared Error (RMSE).
√9.79 = 3.13
So our model's RMSE indicates that the loss is just over 3, which you can interpret
loosely as meaning that on average, incorrect predictions are wrong by around 3 rentals.
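Both metrics can be checked directly from the three validation rows in the table above:

```python
import math

y = [156, 114, 129]            # actual rentals
y_hat = [159.4, 111.8, 125.4]  # predicted rentals

# Mean Squared Error: mean of the squared residuals
mse = sum((a - p) ** 2 for a, p in zip(y, y_hat)) / len(y)
print(round(mse, 2))  # 9.79

# Root Mean Squared Error: back in the same units as the label (rentals)
print(round(math.sqrt(mse), 2))  # 3.13
```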
There are many other metrics that can be used to measure loss in a regression. For
example, R2 (R-Squared) (sometimes known as coefficient of determination) is the
correlation between x and y squared. This produces a value between 0 and 1 that
measures the amount of variance that can be explained by the model. Generally, the
closer this value is to 1, the better the model predicts.
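R² can also be written as 1 minus the ratio of the residual sum of squares to the total sum of squares. Applying that form to the same three validation points (purely as an illustration; in practice R² is computed over far more data):

```python
y = [156, 114, 129]            # actual rentals
y_hat = [159.4, 111.8, 125.4]  # predicted rentals

mean_y = sum(y) / len(y)

# Residual sum of squares vs total sum of squares
ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))
ss_tot = sum((a - mean_y) ** 2 for a in y)

# R-squared: proportion of variance explained by the model
r2 = 1 - ss_res / ss_tot
print(round(r2, 4))  # 0.9676
```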
UNIT 3/9:
8 minutes
Regression
Supervised machine learning techniques involve training a model to operate on a set
of features and predict a label using a dataset that includes some already-known label
values. The training process fits the features to the known labels to define a general
function that can be applied to new features for which the labels are unknown, and
predict them. You can think of this function like this, in which y represents the label we
want to predict and x represents the features the model uses to predict it.
y = f(x)
The goal of training the model is to find a function that performs some kind of
calculation to the x values that produces the result y. We do this by applying a machine
learning algorithm that tries to fit the x values to a calculation that
produces y reasonably accurately for all of the cases in the training dataset.
There are lots of machine learning algorithms for supervised learning, and they can be
broadly divided into two types:
Regression algorithms: Algorithms that predict a y value that is a numeric value, such as the
price of a house or the number of sales transactions.
Classification algorithms: Algorithms that predict to which category, or class, an observation
belongs. The y value in a classification model is a vector of probability values between 0 and 1,
one for each class, indicating the probability of the observation belonging to each class.
In this notebook, we'll focus on regression, using an example based on a real study in
which data for a bicycle sharing scheme was collected and used to predict the number
of rentals based on seasonality and weather conditions. We'll use a simplified version of
the dataset from that study.
Citation: The data used in this exercise is derived from Capital Bikeshare and is used in
accordance with the published license agreement.
import pandas as pd

# load the training dataset
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/daily-bike-share.csv
bike_data = pd.read_csv('daily-bike-share.csv')
bike_data.head()

# Add a 'day of month' column derived from the dteday column
bike_data['day'] = pd.DatetimeIndex(bike_data['dteday']).day
bike_data.head(32)
OK, let's start our analysis of the data by examining a few key descriptive statistics. We
can use the dataframe's describe method to generate these for the numeric features as
well as the rentals label column.
numeric_features = ['temp', 'atemp', 'hum', 'windspeed']
bike_data[numeric_features + ['rentals']].describe()
The statistics reveal some information about the distribution of the data in each of the
numeric fields, including the number of observations (there are 731 records), the mean,
standard deviation, minimum and maximum values, and the quartile values (the
threshold values for 25%, 50% - which is also the median, and 75% of the data). From
this, we can see that the mean number of daily rentals is around 848; but there's a
comparatively large standard deviation, indicating a lot of variance in the number of
rentals per day.
We might get a clearer idea of the distribution of rentals values by visualizing the data.
Common plot types for visualizing numeric data distributions are histograms and box
plots, so let's use Python's matplotlib library to create one of each of these for
the rentals column.
import pandas as pd
import matplotlib.pyplot as plt
# This ensures plots are displayed inline in the Jupyter notebook
%matplotlib inline
# Get the label column
label = bike_data['rentals']
# Create a figure for 2 subplots (2 rows, 1 column)
fig, ax = plt.subplots(2, 1, figsize = (9,12))
# Plot the histogram
ax[0].hist(label, bins=100)
ax[0].set_ylabel('Frequency')
# Add lines for the mean, median, and mode
ax[0].axvline(label.mean(), color='magenta', linestyle='dashed', linewidth=2)
ax[0].axvline(label.median(), color='cyan', linestyle='dashed', linewidth=2)
# Plot the boxplot
ax[1].boxplot(label, vert=False)
ax[1].set_xlabel('Rentals')
# Add a title to the Figure
fig.suptitle('Rental Distribution')
# Show the figure
fig.show()
The plots show that the number of daily rentals ranges from 0 to just over 3,400.
However, the mean (and median) number of daily rentals is closer to the low end of that
range, with most of the data between 0 and around 2,200 rentals. The few values above
this are shown in the box plot as small circles, indicating that they are outliers - in other
words, unusually high or low values beyond the typical range of most of the data.
We can do the same kind of visual exploration of the numeric features. Let's create a
histogram for each of these.
# Plot a histogram for each numeric feature
for col in numeric_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = bike_data[col]
    feature.hist(bins=100, ax=ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
    plt.show()
The numeric features seem to be more normally distributed, with the mean and median
nearer the middle of the range of values, coinciding with where the most commonly
occurring values are.
Note: The distributions are not truly normal in the statistical sense, which would result in
a smooth, symmetric "bell-curve" histogram with the mean and mode (the most
common value) in the center; but they do generally indicate that most of the
observations have a value somewhere near the middle.
We've explored the distribution of the numeric values in the dataset, but what about the
categorical features? These aren't continuous numbers on a scale, so we can't use
histograms; but we can plot a bar chart showing the count of each discrete value for
each category.
import numpy as np

# plot a bar plot for each categorical feature count
categorical_features = ['season','mnth','holiday','weekday','workingday','weathersit','day']
for col in categorical_features:
    counts = bike_data[col].value_counts().sort_index()
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    counts.plot.bar(ax=ax, color='steelblue')
    ax.set_title(col + ' counts')
    ax.set_xlabel(col)
    ax.set_ylabel("Frequency")
    plt.show()
Now that we know something about the distribution of the data in our columns, we can
start to look for relationships between the features and the rentals label we want to be
able to predict.
For the numeric features, we can create scatter plots that show the intersection of
feature and label values. We can also calculate the correlation statistic to quantify the
apparent relationship.
for col in numeric_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = bike_data[col]
    label = bike_data['rentals']
    correlation = feature.corr(label)
    plt.scatter(x=feature, y=label)
    plt.xlabel(col)
    plt.ylabel('Bike Rentals')
    ax.set_title('rentals vs ' + col + ' - correlation: ' + str(correlation))
    plt.show()
The results aren't conclusive, but if you look closely at the scatter plots
for temp and atemp, you can see a vague diagonal trend showing that higher rental
counts tend to coincide with higher temperatures; and a correlation value of just over
0.5 for both of these features supports this observation. Conversely, the plots
for hum and windspeed show a slightly negative correlation, indicating that there are
fewer rentals on days with high humidity or windspeed.
Now let's compare the categorical features to the label. We'll do this by creating box
plots that show the distribution of rental counts for each category.
# plot a boxplot for the label by each categorical feature
for col in categorical_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    bike_data.boxplot(column='rentals', by=col, ax=ax)
    ax.set_title('Label by ' + col)
    ax.set_ylabel("Bike Rentals")
    plt.show()
The plots show some variance in the relationship between some category values and
rentals. For example, there's a clear difference in the distribution of rentals on weekends
(weekday 0 or 6) and those during the working week (weekday 1 to 5). Similarly, there
are notable differences for holiday and workingday categories. There's a noticeable
trend that shows different rental distributions in spring and summer months compared
to winter and fall months. The weathersit category also seems to make a difference in
rental distribution. The day feature we created for the day of the month shows little
variation, indicating that it's probably not predictive of the number of rentals.
Train a Regression Model
Now that we've explored the data, it's time to use it to train a regression model that
uses the features we've identified as potentially predictive to predict the rentals label.
The first thing we need to do is to separate the features we want to use to train the
model from the label we want it to predict.
# Separate features and labels
X, y = bike_data[['season','mnth','holiday','weekday','workingday','weathersit','temp','atemp','hum','windspeed']].values, bike_data['rentals'].values
print('Features:', X[:10], '\nLabels:', y[:10], sep='\n')
After separating the dataset, we now have numpy arrays named X containing the
features, and y containing the labels.
We could train a model using all of the data; but it's common practice in supervised
learning to split the data into two subsets; a (typically larger) set with which to train the
model, and a smaller "hold-back" set with which to validate the trained model. This
enables us to evaluate how well the model performs when used with the validation
dataset by comparing the predicted labels to the known labels. It's important to split the
data randomly (rather than say, taking the first 70% of the data for training and keeping
the rest for validation). This helps ensure that the two subsets of data are statistically
comparable (so we validate the model with data that has a similar statistical distribution
to the data on which it was trained).
from sklearn.model_selection import train_test_split

# Split data 70%-30% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
print('Training Set: %d rows\nTest Set: %d rows' % (X_train.shape[0], X_test.shape[0]))
Now we're ready to train a model by fitting a suitable regression algorithm to the
training data. We'll use a linear regression algorithm, a common starting point for
regression that works by trying to find a linear relationship between the X values and
the y label. The resulting model is a function that conceptually defines a line where
every possible X and y value combination intersect.
In Scikit-Learn, training algorithms are encapsulated in estimators, and in this case we'll
use the LinearRegression estimator to train a linear regression model.
# Train the model
from sklearn.linear_model import LinearRegression
# Fit a linear regression model on the training set
model = LinearRegression().fit(X_train, y_train)
print (model)
Now that we've trained the model, we can use it to predict rental counts for the features
we held back in our validation dataset. Then we can compare these predictions to the
actual label values to evaluate how well (or not!) the model is working.
[ ]
import numpy as np
predictions = model.predict(X_test)
np.set_printoptions(suppress=True)
print('Predicted labels: ', np.round(predictions)[:10])
print('Actual labels : ' ,y_test[:10])
Comparing each prediction with its corresponding "ground truth" actual value isn't a
very efficient way to determine how well the model is predicting. Let's see if we can get
a better indication by visualizing a scatter plot that compares the predictions to the
actual labels. We'll also overlay a trend line to get a general sense for how well the
predicted labels align with the true labels.
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
There's a definite diagonal trend, and the intersections of the predicted and actual
values are generally following the path of the trend line; but there's a fair amount of
difference between the ideal function represented by the line and the results. This
variance represents the residuals of the model - in other words, the difference between
the label the model predicts (by applying the coefficients it learned during training to
the validation features) and the actual value of the validation label. Evaluated on the
validation data, these residuals indicate the expected level of error when the model
is used with new data for which the label is unknown.
You can quantify the residuals by calculating a number of commonly used evaluation
metrics. We'll focus on the following three:
Mean Square Error (MSE): The mean of the squared differences between predicted and actual
values. This yields a relative metric in which the smaller the value, the better the fit of the model.
Root Mean Square Error (RMSE): The square root of the MSE. This yields an absolute metric in
the same unit as the label (in this case, numbers of rentals). The smaller the value, the better the
model (in a simplistic sense, it represents the average number of rentals by which the
predictions are wrong!)
Coefficient of Determination (usually known as R-squared or R2): A relative metric in which
the higher the value, the better the fit of the model. In essence, this metric represents how much
of the variance between predicted and actual label values the model is able to explain.
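To make these definitions concrete, here's a minimal sketch that computes all three metrics by hand with NumPy, using a few made-up actual and predicted values (not the bike data):

```python
import numpy as np

# Made-up actual and predicted label values, purely for illustration
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 190.0, 270.0])

# MSE: mean of the squared differences between predicted and actual values
mse = np.mean((y_true - y_pred) ** 2)

# RMSE: square root of the MSE, in the same units as the label
rmse = np.sqrt(mse)

# R2: proportion of the variance in the actual values explained by the model
ss_res = np.sum((y_true - y_pred) ** 2)        # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

print('MSE:', mse)   # 175.0
print('RMSE:', rmse)
print('R2:', r2)
```

The Scikit-Learn functions used in the next cell compute the same quantities.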
Note: You can find out more about these and other metrics for evaluating regression
models in the Scikit-Learn documentation
Let's use Scikit-Learn to calculate these metrics for our model, based on the predictions
it generated for the validation data.
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
So now we've quantified the ability of our model to predict the number of rentals. It
definitely has some predictive power, but we can probably do better!
Summary
Here we've explored our data and fit a basic regression model. In the next notebook, we'll
try a number of other regression algorithms to improve performance.
Further Reading
To learn more about Scikit-Learn, see the Scikit-Learn documentation.
UNIT 4/9:
In Unit 2, we looked at fitting a straight line to data points. However, regression can fit
many kinds of relationships, including those with multiple factors, and those where the
importance of one factor depends on another.
Linear regression is the simplest form of regression, with no limit to the number of
features used. Linear regression comes in many forms - often named by the number of
features used and the shape of the curve that fits.
Ensemble algorithms construct not just one decision tree, but a large number of trees -
allowing better predictions on more complex data. Ensemble algorithms, such as
Random Forest, are widely used in machine learning and science due to their strong
prediction abilities.
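To see why combining many trees helps, here's a toy simulation (not a real forest): each "tree" is just a noisy estimate of the same true value, and averaging 100 of them gives a much more stable prediction than any single one.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 500.0  # the quantity every "tree" tries to predict

# Simulate 100 trees, each predicting with independent random error
tree_predictions = true_value + rng.normal(0, 50, size=100)

# The ensemble prediction is the average over all trees
ensemble_prediction = tree_predictions.mean()

print('Typical single-tree error:', np.abs(tree_predictions - true_value).mean())
print('Ensemble error:', abs(ensemble_prediction - true_value))
```

Real ensembles such as Random Forest train each tree on a different random sample of the data, but the variance-reduction effect sketched here is the same.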
Data scientists often experiment with using different models. In the following exercise,
we'll experiment with different types of models to compare how they perform on the
same data.
UNIT 5/9:
Let's start by loading the bicycle sharing data as a Pandas DataFrame and viewing the first few
rows. We'll also split our data into training and test datasets.
# Import modules we'll need for this notebook
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# load the training dataset
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/daily-bike-share.csv
bike_data = pd.read_csv('daily-bike-share.csv')
bike_data['day'] = pd.DatetimeIndex(bike_data['dteday']).day
numeric_features = ['temp', 'atemp', 'hum', 'windspeed']
categorical_features = ['season','mnth','holiday','weekday','workingday','weathersit', 'day']
bike_data[numeric_features + ['rentals']].describe()
print(bike_data.head())
# Separate features and labels
# After separating the dataset, we now have numpy arrays named **X** containing the features, and **y** containing the labels.
X, y = bike_data[['season','mnth', 'holiday','weekday','workingday','weathersit','temp', 'atemp', 'hum', 'windspeed']].values, bike_data['rentals'].values
# Split data 70%-30% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_stat
e=0)
print ('Training Set: %d rows\nTest Set: %d rows' % (X_train.shape[0], X_test.shape
[0]))
Now we're ready to train a model by fitting a suitable regression algorithm to the training data.
Let's try training our regression model by using a Lasso algorithm. We can do this by just
changing the estimator in the training code.
from sklearn.linear_model import Lasso
# Fit a lasso model on the training set
model = Lasso().fit(X_train, y_train)
print (model, "\n")
# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
As an alternative to a linear model, there's a category of algorithms for machine learning that
uses a tree-based approach in which the features in the dataset are examined in a series of
evaluations, each of which results in a branch in a decision tree based on the feature value. At
the end of each series of branches are leaf-nodes with the predicted label value based on the
feature values.
It's easiest to see how this works with an example. Let's train a Decision Tree regression model
using the bike rental data. After training the model, the code below will print the model
definition and a text representation of the tree it uses to predict label values.
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_text
# Train the model
model = DecisionTreeRegressor().fit(X_train, y_train)
print (model, "\n")
# Visualize the model tree
tree = export_text(model)
print(tree)
So now we have a tree-based model; but is it any good? Let's evaluate it with the test data.
# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
The tree-based model doesn't seem to have improved over the linear model, so what else could
we try?
Ensemble algorithms work by combining multiple base estimators to produce an optimal model,
either by applying an aggregate function to a collection of base models (sometimes referred to
as bagging) or by building a sequence of models that build on one another to improve predictive
performance (referred to as boosting).
For example, let's try a Random Forest model, which applies an averaging function to multiple
Decision Tree models for a better overall model.
from sklearn.ensemble import RandomForestRegressor
# Train the model
model = RandomForestRegressor().fit(X_train, y_train)
print (model, "\n")
# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
For good measure, let's also try a boosting ensemble algorithm. We'll use a Gradient Boosting
estimator, which like a Random Forest algorithm builds multiple trees, but instead of building
them all independently and taking the average result, each tree is built on the outputs of the
previous one in an attempt to incrementally reduce the loss (error) in the model.
# Train the model
from sklearn.ensemble import GradientBoostingRegressor
# Fit a gradient boosting model on the training set
model = GradientBoostingRegressor().fit(X_train, y_train)
print (model, "\n")
# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
Summary
Here we've tried a number of new regression algorithms to improve performance. In the next
notebook, we'll look at 'tuning' these algorithms to improve performance further.
Further Reading
To learn more about Scikit-Learn, see the Scikit-Learn documentation.
5 minutes
Simple models with small datasets can often be fit in a single step, while larger datasets
and more complex models must be fit by repeatedly using the model with training data
and comparing the output with the expected label. If the prediction is accurate enough,
we consider the model trained. If not, we adjust the model slightly and loop again.
Hyperparameters are values that change the way that the model is fit during these
loops. Learning rate, for example, is a hyperparameter that sets how much a model is
adjusted during each training cycle. A high learning rate means a model can be trained
faster, but if it's too high, the adjustments can be so large that the model is never 'finely
tuned' and so never becomes optimal.
Preprocessing data
Preprocessing refers to changes you make to your data before it is passed to the model.
We have previously read that preprocessing can involve cleaning your dataset. While
this is important, preprocessing can also include changing the format of your data, so
it's easier for the model to use. For example, data described as ‘red’, ‘orange’, ‘yellow’,
‘lime’, and ‘green’, may work better if converted into a format more native to computers,
such as numbers stating the amount of red and the amount of green.
Scaling features
The most common preprocessing step is to scale features so they fall between zero and
one. For example, the weight of a bike and the distance a person travels on a bike may
be two very different numbers, but scaling both to between zero and one allows models
to learn more effectively from the data.
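A minimal sketch of such scaling, using hypothetical weight and distance values:

```python
import numpy as np

# Hypothetical bike weights (kg) and trip distances (km) - very different ranges
weights = np.array([10.0, 12.0, 14.0, 20.0])
distances = np.array([1.0, 5.0, 30.0, 100.0])

def min_max_scale(values):
    # Map the smallest value to 0 and the largest to 1
    return (values - values.min()) / (values.max() - values.min())

print(min_max_scale(weights))    # [0.  0.2 0.4 1. ]
print(min_max_scale(distances))
```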
In machine learning, you can also use categorical features such as 'bicycle', 'skateboard’
or 'car'. These features are represented by 0 or 1 values in one-hot vectors - vectors
that have a 0 or 1 for each possible value. For example, bicycle, skateboard, and car
might respectively be (1,0,0), (0,1,0), and (0,0,1).
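For example, a minimal one-hot encoder for these three categories:

```python
categories = ['bicycle', 'skateboard', 'car']

def one_hot(value):
    # One 0/1 entry per possible category, with a 1 in the matching position
    return [1 if value == c else 0 for c in categories]

print(one_hot('bicycle'))     # [1, 0, 0]
print(one_hot('skateboard'))  # [0, 1, 0]
print(one_hot('car'))         # [0, 0, 1]
```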
UNIT 7/9:
Let's start by loading the bicycle sharing data as a Pandas DataFrame and viewing the
first few rows. As usual, we'll also split our data into training and test datasets.
# Import modules we'll need for this notebook
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# load the training dataset
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/daily-bike-share.csv
bike_data = pd.read_csv('daily-bike-share.csv')
bike_data['day'] = pd.DatetimeIndex(bike_data['dteday']).day
numeric_features = ['temp', 'atemp', 'hum', 'windspeed']
categorical_features = ['season','mnth','holiday','weekday','workingday','weathersit', 'day']
bike_data[numeric_features + ['rentals']].describe()
print(bike_data.head())
# Separate features and labels
# After separating the dataset, we now have numpy arrays named **X** containing the features, and **y** containing the labels.
X, y = bike_data[['season','mnth', 'holiday','weekday','workingday','weathersit','temp', 'atemp', 'hum', 'windspeed']].values, bike_data['rentals'].values
# Split data 70%-30% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
print ('Training Set: %d rows\nTest Set: %d rows' % (X_train.shape[0], X_test.shape[0]))
Now we're ready to train a model by fitting a boosting ensemble algorithm, as in our last
notebook. Recall that a Gradient Boosting estimator is like a Random Forest algorithm,
but instead of building all the trees independently and taking the average result, each
tree is built on the outputs of the previous one in an attempt to incrementally reduce
the loss (error) in the model.
# Train the model
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
# Fit a gradient boosting model on the training set
model = GradientBoostingRegressor().fit(X_train, y_train)
print (model, "\n")
# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
Optimize Hyperparameters
Take a look at the GradientBoostingRegressor estimator definition in the output
above, and note that it, like the other estimators we tried previously, includes a large
number of parameters that control the way the model is trained. In machine learning,
the term parameters refers to values that can be determined from data; values that you
specify to affect the behavior of a training algorithm are more correctly referred to
as hyperparameters.
The specific hyperparameters for an estimator vary based on the algorithm that the
estimator encapsulates. In the case of the GradientBoostingRegressor estimator, the
algorithm is an ensemble that combines multiple decision trees to create an overall
predictive model. You can learn about the hyperparameters for this estimator in
the Scikit-Learn documentation.
We won't go into the details of each hyperparameter here, but they work together to
affect the way the algorithm trains a model. In many cases, the default values provided
by Scikit-Learn will work well; but there may be some advantage in modifying
hyperparameters to get better predictive performance or reduce training time.
So how do you know what hyperparameter values you should use? Well, in the absence
of a deep understanding of how the underlying algorithm works, you'll need to
experiment. Fortunately, Scikit-Learn provides a way to tune hyperparameters by trying
multiple combinations and finding the best result for a given performance metric.
Let's try using a grid search approach to try combinations from a grid of possible values
for the learning_rate and n_estimators hyperparameters of
the GradientBoostingRegressor estimator.
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, r2_score
# Use a Gradient Boosting algorithm
alg = GradientBoostingRegressor()
# Try these hyperparameter values
params = {
'learning_rate': [0.1, 0.5, 1.0],
'n_estimators' : [50, 100, 150]
}
# Find the best hyperparameter combination to optimize the R2 metric
score = make_scorer(r2_score)
gridsearch = GridSearchCV(alg, params, scoring=score, cv=3, return_train_score=True)
gridsearch.fit(X_train, y_train)
print("Best parameter combination:", gridsearch.best_params_, "\n")
# Get the best model
model=gridsearch.best_estimator_
print(model, "\n")
# Evaluate the model using the test data
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
# overlay the regression line
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
Note: The use of random values in the Gradient Boosting algorithm results in slightly
different metrics each time. In this case, the best model produced by hyperparameter
tuning is unlikely to be significantly better than one trained with the default
hyperparameter values; but it's still useful to know about the hyperparameter tuning
technique!
In practice, it's common to perform some preprocessing of the data to make it easier for
the algorithm to fit a model to it. There's a huge range of preprocessing transformations
you can perform to get your data ready for modeling, but we'll limit ourselves to a few
common techniques:
Normalizing numeric features so they're on the same scale prevents features with large
values from producing coefficients that disproportionately affect the predictions. For
example, suppose your data includes the following numeric features:
A     B     C
3     480   65
Normalizing these features to the same scale may result in the following values
(assuming A contains values from 0 to 10, B contains values from 0 to 1000, and C
contains values from 0 to 100):
A     B     C
0.3   0.48  0.65
There are multiple ways you can scale numeric data, such as calculating the minimum
and maximum values for each column and assigning a proportional value between 0
and 1, or by using the mean and standard deviation of a normally distributed variable to
maintain the same spread of values on a different scale.
Machine learning models work best with numeric features rather than text values, so
you generally need to convert categorical features into numeric representations. For
example, suppose your data includes the following categorical feature.
Size
S
M
L
You can apply ordinal encoding to substitute a unique integer value for each category,
like this:
Size
0
1
2
Alternatively, you can apply one-hot encoding, which creates a separate 0/1 column for
each possible category value, like this:
Size_S  Size_M  Size_L
1       0       0
0       1       0
0       0       1
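A minimal sketch of both encodings in plain Python (Scikit-Learn's OrdinalEncoder and OneHotEncoder do the same thing at scale):

```python
sizes = ['S', 'M', 'L', 'M']  # a small categorical column

# Ordinal encoding: one integer per category
ordinal_map = {'S': 0, 'M': 1, 'L': 2}
ordinal = [ordinal_map[s] for s in sizes]
print(ordinal)  # [0, 1, 2, 1]

# One-hot encoding: one 0/1 column per category (Size_S, Size_M, Size_L)
categories = ['S', 'M', 'L']
one_hot = [[1 if s == c else 0 for c in categories] for s in sizes]
print(one_hot)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
```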
To apply these preprocessing transformations to the bike rental data, we'll make use of a
Scikit-Learn feature named pipelines. These enable us to define a set of preprocessing
steps that end with an algorithm. You can then fit the entire pipeline to the data, so that
the model encapsulates all of the preprocessing steps as well as the regression
algorithm. This is useful, because when we want to use the model to predict values from
new data, we need to apply the same transformations (based on the same statistical
distributions and category encodings used with the training data).
Note: The term pipeline is used extensively in machine learning, often to mean very
different things! In this context, we're using it to refer to pipeline objects in Scikit-Learn,
but you may see it used elsewhere to mean something else.
# Train the model
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
import numpy as np
# Define preprocessing for numeric columns (scale them)
numeric_features = [6,7,8,9]
numeric_transformer = Pipeline(steps=[
('scaler', StandardScaler())])
# Define preprocessing for categorical features (encode them)
categorical_features = [0,1,2,3,4,5]
categorical_transformer = Pipeline(steps=[
('onehot', OneHotEncoder(handle_unknown='ignore'))])
# Combine preprocessing steps
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)])
# Create preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', GradientBoostingRegressor())])
# fit the pipeline to train a gradient boosting model on the training set
model = pipeline.fit(X_train, y_train)
print (model)
OK, the model is trained, including the preprocessing steps. Let's see how it performs
with the validation data.
# Get predictions
predictions = model.predict(X_test)
# Display metrics
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
The pipeline is composed of the transformations and the algorithm used to train the
model. To try an alternative algorithm you can just change that step to a different kind
of estimator.
# Use a different estimator in the pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', RandomForestRegressor())])
# fit the pipeline to train a random forest model on the training set
model = pipeline.fit(X_train, y_train)
print (model, "\n")
# Get predictions
predictions = model.predict(X_test)
# Display metrics
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions - Preprocessed')
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
We've now seen a number of common techniques used to train predictive models for
regression. In a real project, you'd likely try a few more algorithms, hyperparameters,
and preprocessing transformations; but by now you should have got the general idea.
Let's explore how you can use the trained model with new data.
import joblib
# Save the model as a pickle file
filename = './bike-share.pkl'
joblib.dump(model, filename)
Now, we can load it whenever we need it, and use it to predict labels for new data. This
is often called scoring or inferencing.
# Load the model from the file
loaded_model = joblib.load(filename)
# Create a numpy array containing a new observation (for example tomorrow's seasonal and weather forecast information)
X_new = np.array([[1,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869]]).astype('float64')
print ('New sample: {}'.format(list(X_new[0])))
# Use the model to predict tomorrow's rentals
result = loaded_model.predict(X_new)
print('Prediction: {:.0f} rentals'.format(np.round(result[0])))
# An array of features based on five-day weather forecast
X_new = np.array([[0,1,1,0,0,1,0.344167,0.363625,0.805833,0.160446],
[0,1,0,1,0,1,0.363478,0.353739,0.696087,0.248539],
[0,1,0,2,0,1,0.196364,0.189405,0.437273,0.248309],
[0,1,0,3,0,1,0.2,0.212122,0.590435,0.160296],
[0,1,0,4,0,1,0.226957,0.22927,0.436957,0.1869]])
# Use the model to predict rentals
results = loaded_model.predict(X_new)
print('5-day rental predictions:')
for prediction in results:
print(np.round(prediction))
Summary
That concludes the notebooks for this module on regression. In this notebook we ran a
complex regression, tuned it, saved the model, and used it to predict outcomes for the
future.
Further Reading
To learn more about Scikit-Learn, see the Scikit-Learn documentation.
UNIT 8/9:
Knowledge check
200 XP
3 minutes
1. You are using scikit-learn to train a regression model from a dataset of sales data. You
want to be able to evaluate the model to ensure it will predict accurately with new data.
What should you do?
- Use all of the data to train the model. Then use all of the data to evaluate it.
- Train the model using only the feature columns, and then evaluate it using only the label column.
- Split the data randomly into two subsets. Use one subset to train the model, and the other to evaluate it.
2. You have created a model object using the scikit-learn LinearRegression class. What
should you do to train the model?
- Call the predict() method of the model object, specifying the training feature and label arrays.
- Call the fit() method of the model object, specifying the training feature and label arrays.
- Call the score() method of the model object, specifying the training feature and test feature arrays.
3. You train a regression model using scikit-learn. When you evaluate it with test data, you
determine that the model achieves an R-squared metric of 0.95. What does this metric
tell you about the model?
ANSWER: The model explains most of the variance between predicted and actual values.
(The R-squared metric is a measure of how much of the variance between predicted and
actual values can be explained by the model.)
UNIT 9/9:
Summary
Completed100 XP
1 minute
In this module, you learned how regression can be used to create a machine learning
model that predicts numeric values. You then used the scikit-learn framework in Python
to train and evaluate a regression model.
While scikit-learn is a popular framework for writing code to train regression models,
you can also create machine learning solutions for regression using the graphical tools
in Microsoft Azure Machine Learning. You can learn more about no-code development
of regression models using Azure Machine Learning in the Create a Regression Model
with Azure Machine Learning designer module.
Note
The time to complete this optional challenge is not included in the estimated time for
this module - you can spend as little or as much time on it as you like!
MODULE 3:
Introduction
Completed100 XP
2 minutes
Classification is a form of machine learning in which you train a model to predict which
category an item belongs to. For example, a health clinic might use diagnostic data such
as a patient's height, weight, blood pressure, and blood-glucose level to predict whether
or not the patient is diabetic.
Categorical data has distinct 'classes' rather than numeric values. Some kinds of data
can be either numeric or categorical: the time to run a race could be a time in seconds,
or we could split times into the categories 'fast', 'medium', and 'slow'. Other kinds of
data can only be categorical, such as a type of shape - 'circle', 'triangle', or 'square'.
Prerequisites
Knowledge of basic mathematics
Some experience programming in Python
Learning objectives
In this module, you will:
What is classification?
Completed100 XP
5 minutes
A simple example
Let's explore a simple example to help explain the key principles. Suppose we have the
following patient data, which consists of a single feature (blood-glucose level) and a
class label 0 for non-diabetic, 1 for diabetic.
Blood-Glucose Diabetic
82 0
92 0
112 1
102 0
115 1
107 1
87 0
120 1
83 0
119 1
104 1
105 0
86 0
109 1
We'll use the first eight observations to train a classification model, and we'll start by
plotting the blood-glucose feature (which we'll call x) and the diabetic label (which
we'll call y).
What we need is a function that calculates a probability value for y based on x (in other
words, we need the function f(x) = y). You can see from the chart that patients with a
low blood-glucose level are all non-diabetic, while patients with a higher blood-glucose
level are diabetic. It seems like the higher the blood-glucose level, the more probable it
is that a patient is diabetic, with the inflexion point being somewhere between 100 and
110. We need to fit a function that calculates a value between 0 and 1 for y to these
values.
One such function is a logistic function, which forms a sigmoidal (S-shaped) curve, like
this:
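In code, such a function might look like the sketch below; the midpoint and steepness values here are illustrative guesses, not parameters fitted to the patient data.

```python
import numpy as np

def logistic(x, midpoint=105.0, steepness=0.5):
    # Sigmoid curve: output is always a probability between 0 and 1
    return 1 / (1 + np.exp(-steepness * (x - midpoint)))

for glucose in [85, 100, 105, 110, 120]:
    p = logistic(glucose)
    label = 1 if p >= 0.5 else 0  # 0.5 threshold for the class prediction
    print(glucose, round(p, 3), label)
```

Low blood-glucose values map to probabilities near 0 (non-diabetic), high values map to probabilities near 1 (diabetic), and the 0.5 cut-off falls at the midpoint.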
Now we can use the function to calculate a probability value that y is positive, meaning
the patient is diabetic, from any value of x by finding the point on the function line
for x. We can set a threshold value of 0.5 as the cut-off point for the class label
prediction.
Now we can compare the label predictions based on the logistic function encapsulated
in the model (which we'll call ŷ, or "y-hat") to the actual class labels (y).
x y ŷ
83 0 0
119 1 1
104 1 0
105 0 1
86 0 0
109 1 1
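From the six pairs in this table we can compute a simple accuracy score: the proportion of predictions that match the actual label.

```python
# The (y, y-hat) pairs from the table above
y_actual    = [0, 1, 1, 0, 0, 1]
y_predicted = [0, 1, 0, 1, 0, 1]

correct = sum(1 for y, y_hat in zip(y_actual, y_predicted) if y == y_hat)
accuracy = correct / len(y_actual)
print(accuracy)  # 4 of 6 correct, about 0.667
```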
UNIT 3/9:
Classification is a form of supervised machine learning in which you train a model to use
the features (the x values in our function) to calculate the probability of the observed
case belonging to each of a number of possible classes, and to predict an appropriate
label (y). The simplest form of classification is binary classification, in which the
label is 0 or 1, representing one of two classes; for example, "True" or "False";
"Internal" or "External"; "Profitable" or "Non-Profitable"; and so on.
Binary Classification
In this notebook, we will focus on an example of binary classification, where the model
must predict a label that belongs to one of two classes. In this exercise, we'll train a
binary classifier to predict whether or not a patient should be tested for diabetes based
on some medical data.
Run the following cell to load a CSV file of patient data into a Pandas dataframe:
Citation: The diabetes dataset used in this exercise is based on data originally collected
by the National Institute of Diabetes and Digestive and Kidney Diseases.
import pandas as pd
# load the training dataset
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/diabetes.csv
diabetes = pd.read_csv('diabetes.csv')
diabetes.head()
This data consists of diagnostic information about some patients who have been tested
for diabetes. Scroll to the right if necessary, and note that the final column in the dataset
(Diabetic) contains the value 0 for patients who tested negative for diabetes, and 1 for
patients who tested positive. This is the label that we will train our model to predict;
most of the other columns (Pregnancies,PlasmaGlucose,DiastolicBloodPressure, and
so on) are the features we will use to predict the Diabetic label.
Let's separate the features from the labels - we'll call the features X and the label y:
# Separate features and labels
features = ['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThi
ckness','SerumInsulin','BMI','DiabetesPedigree','Age']
label = 'Diabetic'
X, y = diabetes[features].values, diabetes[label].values
for n in range(0,4):
print("Patient", str(n+1), "\n Features:",list(X[n]), "\n Label:", y[n])
Now let's compare the feature distributions for each label value.
from matplotlib import pyplot as plt
%matplotlib inline
features = ['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']
for col in features:
    diabetes.boxplot(column=col, by='Diabetic', figsize=(6,6))
    plt.title(col)
    plt.show()
For some of the features, there's a noticeable difference in the distribution for each label
value. In particular, Pregnancies and Age show markedly different distributions for
diabetic patients than for non-diabetic patients. These features may help predict
whether or not a patient is diabetic.
Our dataset includes known values for the label, so we can use this to train a classifier so
that it finds a statistical relationship between the features and the label value; but how
will we know if our model is any good? How do we know it will predict correctly when
we use it with new data that it wasn't trained with? Well, we can take advantage of the
fact we have a large dataset with known label values, use only some of it to train the
model, and hold back some to test the trained model - enabling us to compare the
predicted labels with the already known labels in the test set.
from sklearn.model_selection import train_test_split
# Split data 70%-30% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
print ('Training cases: %d\nTest cases: %d' % (X_train.shape[0], X_test.shape[0]))
OK, now we're ready to train our model by fitting the training features (X_train) to the
training labels (y_train). There are various algorithms we can use to train the model. In
this example, we'll use Logistic Regression, which (despite its name) is a well-established
algorithm for classification. In addition to the training features and labels, we'll need to
set a regularization parameter. This is used to counteract any bias in the sample, and
helps the model generalize well by avoiding overfitting to the training data.
Note: Parameters for machine learning algorithms are generally referred to
as hyperparameters (to a data scientist, parameters are values in the data itself
- hyperparameters are defined externally from the data!)
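As a quick illustration of what the regularization rate does, the sketch below fits two logistic regression models on synthetic, made-up data (not the diabetes dataset). In scikit-learn, C is the inverse of the regularization strength, so the smaller-C model should end up with smaller coefficients.

```python
# Sketch: smaller C means stronger regularization, which shrinks coefficients.
# The data here is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_toy = rng.randn(200, 4)
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

# C = 1/reg, so these correspond to reg = 100 and reg = 0.01 respectively
model_strong = LogisticRegression(C=0.01, solver="liblinear").fit(X_toy, y_toy)
model_weak = LogisticRegression(C=100, solver="liblinear").fit(X_toy, y_toy)

print("strong regularization |coef| sum:", np.abs(model_strong.coef_).sum())
print("weak regularization   |coef| sum:", np.abs(model_weak.coef_).sum())
```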
# Train the model
from sklearn.linear_model import LogisticRegression
# Set regularization rate
reg = 0.01
# train a logistic regression model on the training set
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)
print (model)
Now that we've trained the model using the training data, we can use the test data we held
back to evaluate how well it predicts. Again, scikit-learn can help us do this. Let's start
by using the model to predict labels for our test set, and compare the predicted labels
to the known labels:
predictions = model.predict(X_test)
print('Predicted labels: ', predictions)
print('Actual labels: ' ,y_test)
The most obvious thing you might want to do is to check the accuracy of the
predictions - in simple terms, what proportion of the labels did the model predict
correctly?
from sklearn.metrics import accuracy_score
print('Accuracy: ', accuracy_score(y_test, predictions))
The accuracy is returned as a decimal value - a value of 1.0 would mean that the model
got 100% of the predictions right; while an accuracy of 0.0 is, well, pretty useless!
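Accuracy is simple enough to compute by hand: it's just the fraction of predictions that match the labels. The toy arrays below (mirroring the small x/y/ŷ table earlier in the module) are illustrative; on them, manual counting and accuracy_score agree.

```python
# Sketch: accuracy is the fraction of predictions that match the labels.
# These toy labels mirror the small table earlier in the module.
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 1, 0, 1])

manual = (y_true == y_pred).mean()      # 4 of 6 predictions are correct
print(manual)
print(accuracy_score(y_true, y_pred))   # same value, computed by scikit-learn
```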
Summary
Here we prepared our data by splitting it into test and train datasets, and applied
logistic regression - an algorithm for predicting binary labels. Our model was able to
predict whether patients had diabetes with what appears to be reasonable accuracy. But is
this good enough? In the next notebook, we will look at alternatives to accuracy that can
be much more useful in machine learning.
Summary
In this notebook, we looked at the basics of binary classification. We will move onto
more complex classification problems in the following notebook.
UNIT 4/9:
The training accuracy of a classification model is much less important than how well that
model will work when given new, unseen data. After all, we train models so that they can
be used on new data we find in the real world. So, after we have trained a classification
model, we should evaluate how it performs on a set of new, unseen data.
In the previous units, we created a model that would predict whether a patient had
diabetes or not based on their blood glucose level. Now, when applied to some data
that wasn't part of the training set we get the following predictions:
x y ŷ
83 0 0
119 1 1
104 1 0
105 0 1
86 0 0
109 1 1
Recall that x refers to blood glucose level, y refers to whether they're actually diabetic,
and ŷ refers to the model’s prediction as to whether they're diabetic or not.
Simply calculating how many predictions were correct is sometimes misleading, or too
simplistic for us to understand the kinds of errors a model will make in the real world. To get
more detailed information, we can tabulate the results in a structure called a confusion
matrix, like this:
The model predicted 0 and the actual label is 0 (true negatives; top left)
The model predicted 1 and the actual label is 1 (true positives; bottom right)
The model predicted 0 and the actual label is 1 (false negatives; bottom left)
The model predicted 1 and the actual label is 0 (false positives; top right)
The cells in the confusion matrix are often shaded so that higher values have a deeper
shade. This makes it easier to see a strong diagonal trend from top-left to bottom-right,
highlighting the cells where the predicted value and actual value are the same.
From these core values, you can calculate a range of other metrics that can help you
evaluate the performance of the model.
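For example, accuracy, precision, recall, and the F1 score can all be written directly in terms of the four counts. The counts below are made-up illustrative numbers, not output from the diabetes model:

```python
# Sketch: common metrics expressed in terms of the four confusion-matrix counts.
# These counts are illustrative values, not results from the diabetes model.
tp, fp, fn, tn = 60, 15, 10, 115

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted positives, how many were right?
recall    = tp / (tp + fn)   # of actual positives, how many were found?
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"Accuracy: {accuracy:.3f}  Precision: {precision:.3f}  "
      f"Recall: {recall:.3f}  F1: {f1:.3f}")
```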
To get started, run the next cell to load our data and train our model like last time.
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load the training dataset
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/diabetes.csv
diabetes = pd.read_csv('diabetes.csv')
# Separate features and labels
features = ['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']
label = 'Diabetic'
X, y = diabetes[features].values, diabetes[label].values
# Split data 70%-30% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
print ('Training cases: %d\nTest cases: %d' % (X_train.shape[0], X_test.shape[0]))
# Train the model
from sklearn.linear_model import LogisticRegression
# Set regularization rate
reg = 0.01
# train a logistic regression model on the training set
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)
predictions = model.predict(X_test)
print('Predicted labels: ', predictions)
print('Actual labels: ' ,y_test)
print('Accuracy: ', accuracy_score(y_test, predictions))
One of the simplest places to start is a classification report. Run the next cell to see a
range of alternative ways to assess our model.
from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))
The classification report includes the following metrics for each class (0 and 1).
Note that the header row may not line up with the values!
Precision: Of the predictions the model made for this class, what proportion were correct?
Recall: Out of all of the instances of this class in the test dataset, how many did the model
identify?
F1-Score: An average metric that takes both precision and recall into account.
Support: How many instances of this class are there in the test dataset?
The classification report also includes averages for these metrics, including a weighted
average that allows for the imbalance in the number of cases of each class.
Because this is a binary classification problem, the 1 class is considered positive and its
precision and recall are particularly interesting - these in effect answer the questions:
Of all the patients the model predicted are diabetic, how many are actually diabetic?
Of all the patients that are actually diabetic, how many did the model identify?
from sklearn.metrics import precision_score, recall_score
print("Overall Precision:",precision_score(y_test, predictions))
print("Overall Recall:",recall_score(y_test, predictions))
The precision and recall metrics are derived from four possible prediction outcomes:
True Positives: The predicted label and the actual label are both 1.
False Positives: The predicted label is 1, but the actual label is 0.
False Negatives: The predicted label is 0, but the actual label is 1.
True Negatives: The predicted label and the actual label are both 0.
These metrics are generally tabulated for the test set and shown together as a confusion
matrix, which takes the following form (rows are actual labels, columns are predicted
labels):
TN FP
FN TP
Note that the correct (true) predictions form a diagonal line from top left to bottom
right - these figures should be significantly higher than the false predictions if the model
is any good.
In Python, you can use the sklearn.metrics.confusion_matrix function to find these
values for a trained classifier:
from sklearn.metrics import confusion_matrix
# Print the confusion matrix
cm = confusion_matrix(y_test, predictions)
print (cm)
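For binary labels, scikit-learn's confusion_matrix returns the counts as [[TN, FP], [FN, TP]], so you can unpack them and reproduce precision and recall by hand. The toy labels below are illustrative, not the diabetes test set:

```python
# Sketch: unpack the confusion-matrix counts and reproduce precision and
# recall by hand. The toy labels here are illustrative.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 1, 1, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)
print("precision:", tp / (tp + fp), "==", precision_score(y_true, y_pred))
print("recall:   ", tp / (tp + fn), "==", recall_score(y_true, y_pred))
```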
Until now, we've considered the predictions from the model as being either 1 or 0 class
labels. Actually, things are a little more complex than that. Statistical machine learning
algorithms, like logistic regression, are based on probability; so what actually gets
predicted by a binary classifier is the probability that the label is true (P(y)) and the
probability that the label is false (1 - P(y)). A threshold value of 0.5 is used to decide
whether the predicted label is a 1 (P(y) > 0.5) or a 0 (P(y) <= 0.5). You can use
the predict_proba method to see the probability pairs for each case:
y_scores = model.predict_proba(X_test)
print(y_scores)
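Because the raw probabilities are available, nothing forces you to stick with the default 0.5 cutoff; you can apply your own threshold to the class-1 column. The probability values below are illustrative stand-ins for y_scores[:, 1]:

```python
# Sketch: applying a custom threshold to predicted probabilities.
# These probabilities are illustrative stand-ins for y_scores[:, 1].
import numpy as np

proba_class1 = np.array([0.10, 0.35, 0.55, 0.80])

default_labels = (proba_class1 > 0.5).astype(int)   # the default 0.5 cutoff
lenient_labels = (proba_class1 > 0.3).astype(int)   # lower threshold: more 1s

print(default_labels)   # [0 0 1 1]
print(lenient_labels)   # [0 1 1 1]
```

Lowering the threshold trades precision for recall: more cases are flagged as positive, catching more true positives at the cost of more false positives.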
from sklearn.metrics import roc_curve
from sklearn.metrics import confusion_matrix
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
# calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
# plot ROC curve
fig = plt.figure(figsize=(6, 6))
# Plot the diagonal 50% line
plt.plot([0, 1], [0, 1], 'k--')
# Plot the FPR and TPR achieved by our model
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
The ROC chart shows the curve of the true and false positive rates for different threshold
values between 0 and 1. A perfect classifier would have a curve that goes straight up the
left side and straight across the top. The diagonal line across the chart represents the
probability of predicting correctly with a 50/50 random prediction; so you obviously
want the curve to be higher than that (or your model is no better than simply guessing!).
The area under the curve (AUC) is a value between 0 and 1 that quantifies the overall
performance of the model. The closer to 1 this value is, the better the model. Once
again, scikit-Learn includes a function to calculate this metric.
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
In this case, the ROC curve and its AUC indicate that the model performs better than a
random guess, which is not bad considering we performed very little preprocessing of
the data.
In practice, it's common to perform some preprocessing of the data to make it easier for
the algorithm to fit a model to it. There's a huge range of preprocessing transformations
you can perform to get your data ready for modeling, but we'll limit ourselves to a few
common techniques:
Scaling numeric features so they're on the same scale. This prevents features with large values
from producing coefficients that disproportionately affect the predictions.
Encoding categorical variables. For example, by using a one hot encoding technique you can
create individual binary (true/false) features for each possible category value.
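Before wiring these steps into a pipeline, here is a minimal sketch of what each transformation does on tiny, made-up inputs: StandardScaler rescales a numeric column to zero mean and unit variance, and OneHotEncoder expands a categorical column into one binary column per category value.

```python
# Sketch of the two preprocessing steps on toy inputs (illustrative values only).
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric = np.array([[1.0], [2.0], [3.0]])
scaled = StandardScaler().fit_transform(numeric).ravel()
print(scaled)                     # rescaled to zero mean, unit variance

categories = np.array([['red'], ['green'], ['red']])
encoded = OneHotEncoder().fit_transform(categories).toarray()
print(encoded)                    # one binary column per category ('green', 'red')
```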
# Train the model
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
import numpy as np
# Define preprocessing for numeric columns (normalize them so they're on the same scale)
numeric_features = [0,1,2,3,4,5,6]
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])
# Define preprocessing for categorical features (encode the Age column)
categorical_features = [7]
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
# Create preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('logregressor', LogisticRegression(C=1/reg, solver="liblinear"))])
# fit the pipeline to train a logistic regression model on the training set
model = pipeline.fit(X_train, y_train)
print (model)
Let's use the model trained by this pipeline to predict labels for our test set, and
compare the performance metrics with the basic model we created previously.
# Get predictions from test data
predictions = model.predict(X_test)
y_scores = model.predict_proba(X_test)
# Get evaluation metrics
cm = confusion_matrix(y_test, predictions)
print ('Confusion Matrix:\n',cm, '\n')
print('Accuracy:', accuracy_score(y_test, predictions))
print("Overall Precision:",precision_score(y_test, predictions))
print("Overall Recall:",recall_score(y_test, predictions))
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
# calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
# plot ROC curve
fig = plt.figure(figsize=(6, 6))
# Plot the diagonal 50% line
plt.plot([0, 1], [0, 1], 'k--')
# Plot the FPR and TPR achieved by our model
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
The results look a little better, so clearly preprocessing the data has made a difference.
Now let's try a different algorithm. Previously we used a logistic regression algorithm,
which is a linear algorithm. There are many kinds of classification algorithms we could try,
including:
Support Vector Machine algorithms: Algorithms that define a hyperplane that separates
classes.
Tree-based algorithms: Algorithms that build a decision tree to reach a prediction.
Ensemble algorithms: Algorithms that combine the outputs of multiple base algorithms to
improve generalizability.
This time, we'll use the same preprocessing steps as before, but we'll train the model
using an ensemble algorithm named Random Forest that combines the outputs of
multiple random decision trees (for more details, see the Scikit-Learn documentation).
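The combining intuition behind a random forest can be sketched with made-up tree outputs. Note that scikit-learn's RandomForestClassifier actually averages per-tree probabilities (and randomizes the data each tree sees), but majority voting captures the idea:

```python
# Sketch: several decision trees each predict a label, and the ensemble
# takes the majority. These per-tree predictions are made up for illustration;
# RandomForestClassifier averages probabilities rather than hard votes.
import numpy as np

tree_votes = np.array([[1, 0, 1],    # predictions from tree 1 for three cases
                       [1, 1, 0],    # predictions from tree 2
                       [1, 0, 0]])   # predictions from tree 3

# A case is labeled 1 when more than half the trees vote 1
majority = (tree_votes.mean(axis=0) > 0.5).astype(int)
print(majority)   # [1 0 0]
```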
from sklearn.ensemble import RandomForestClassifier
# Create preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('logregressor', RandomForestClassifier(n_estimators=100))])
# fit the pipeline to train a random forest model on the training set
model = pipeline.fit(X_train, y_train)
print (model)
predictions = model.predict(X_test)
y_scores = model.predict_proba(X_test)
cm = confusion_matrix(y_test, predictions)
print ('Confusion Matrix:\n',cm, '\n')
print('Accuracy:', accuracy_score(y_test, predictions))
print("Overall Precision:",precision_score(y_test, predictions))
print("Overall Recall:",recall_score(y_test, predictions))
auc = roc_auc_score(y_test,y_scores[:,1])
print('\nAUC: ' + str(auc))
# calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
# plot ROC curve
fig = plt.figure(figsize=(6, 6))
# Plot the diagonal 50% line
plt.plot([0, 1], [0, 1], 'k--')
# Plot the FPR and TPR achieved by our model
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
Now that we have a reasonably useful trained model, we can save it for use later to
predict labels for new data:
import joblib
# Save the model as a pickle file
filename = './diabetes_model.pkl'
joblib.dump(model, filename)
When we have some new observations for which the label is unknown, we can load the
model and use it to predict values for the unknown label:
# Load the model from the file
model = joblib.load(filename)
# predict on a new sample
# The model accepts an array of feature arrays (so you can predict the classes of multiple patients in a single call)
# We'll create an array with a single array of features, representing one patient
X_new = np.array([[2,180,74,24,21,23.9091702,1.488172308,22]])
print ('New sample: {}'.format(list(X_new[0])))
# Get a prediction
pred = model.predict(X_new)
# The model returns an array of predictions - one for each set of features submitted
# In our case, we only submitted one patient, so our prediction is the first one in the resulting array.
print('Predicted class is {}'.format(pred[0]))
Summary
In this notebook, we looked at a range of metrics for binary classification and tried a few
algorithms beyond logistic regression. We will move onto more complex classification
problems in the following notebook.
UNIT 6/9:
It's also possible to create multiclass classification models, in which there are more than
two possible classes. For example, the health clinic might expand the diabetes model to
classify patients as:
Non-diabetic
Type-1 diabetic
Type-2 diabetic
The individual class probability values would still add up to a total of 1 as the patient is
definitely in only one of the three classes, and the most probable class would be
predicted by the model.
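A small sketch of that idea, with arbitrary made-up scores: a softmax turns raw per-class scores into probabilities that sum to 1, and the predicted label is the most probable class.

```python
# Sketch: per-class probabilities sum to 1, and argmax picks the label.
# The raw scores below are arbitrary illustrative values, not model output.
import numpy as np

classes = ['Non-diabetic', 'Type-1 diabetic', 'Type-2 diabetic']
scores = np.array([2.0, 0.5, 1.0])

# Softmax converts raw scores into a valid probability distribution
probs = np.exp(scores) / np.exp(scores).sum()
print(dict(zip(classes, probs.round(3))))
print('Predicted:', classes[int(np.argmax(probs))])
```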
One vs Rest (OVR), in which a classifier is created for each possible class value,
with a positive outcome for cases where the prediction is this class, and negative
predictions for cases where the prediction is any other class. For example, a
classification problem with four possible shape classes (square, circle, triangle,
hexagon) would require four classifiers that predict:
o square or not
o circle or not
o triangle or not
o hexagon or not
One vs One (OVO), in which a classifier for each possible pair of classes is created.
The classification problem with four shape classes would require the following
binary classifiers:
o square or circle
o square or triangle
o square or hexagon
o circle or triangle
o circle or hexagon
o triangle or hexagon
In both approaches, the overall model combines its individual binary classifiers'
outputs into a vector of predictions, and uses the probabilities generated by those
classifiers to determine which single class the item belongs to.
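In scikit-learn, the OVR strategy is available as a wrapper class. This sketch fits one on synthetic three-class data (make_blobs here is illustrative, not the course's dataset) and confirms that one binary estimator is trained per class; OneVsOneClassifier in the same module implements the OVO strategy.

```python
# Sketch: One vs Rest in scikit-learn - one binary classifier per class.
# The blob data is synthetic and purely illustrative.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = make_blobs(n_samples=150, centers=3, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)
print(len(ovr.estimators_))   # one underlying binary classifier per class
print(ovr.predict(X[:5]))
```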
Let's start by examining a dataset that contains observations of multiple classes. We'll
use a dataset that contains observations of three different species of penguin.
Citation: The penguins dataset used in this exercise is a subset of data collected and
made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a
member of the Long Term Ecological Research Network.
import pandas as pd
# load the training dataset
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/penguins.csv
penguins = pd.read_csv('penguins.csv')
# Display a random sample of 10 observations
sample = penguins.sample(10)
sample
penguin_classes = ['Adelie', 'Gentoo', 'Chinstrap']
print(sample.columns[0:5].values, 'SpeciesName')
for index, row in penguins.sample(10).iterrows():
    print('[',row[0], row[1], row[2], row[3], int(row[4]),']',penguin_classes[int(row[4])])
Now that we know what the features and labels in the data represent, let's explore the
dataset. First, let's see if there are any missing (null) values.
# Count the number of null values for each column
penguins.isnull().sum()
It looks like there are some missing feature values, but no missing labels. Let's dig a little
deeper and see the rows that contain nulls.
# Show rows containing nulls
penguins[penguins.isnull().any(axis=1)]
There are two rows that contain no feature values at all (NaN stands for "not a
number"), so these won't be useful in training a model. Let's discard them from the
dataset.
# Drop rows containing NaN values
penguins=penguins.dropna()
#Confirm there are now no nulls
penguins.isnull().sum()
Now that we've dealt with the missing values, let's explore how the features relate to the
label by creating some box charts.
from matplotlib import pyplot as plt
%matplotlib inline
penguin_features = ['CulmenLength','CulmenDepth','FlipperLength','BodyMass']
penguin_label = 'Species'
for col in penguin_features:
    penguins.boxplot(column=col, by=penguin_label, figsize=(6,6))
    plt.title(col)
    plt.show()
From the box plots, it looks like species 0 and 2 (Adelie and Chinstrap) have similar data
profiles for culmen depth, flipper length, and body mass, but Chinstraps tend to have
longer culmens. Species 1 (Gentoo) tends to have fairly clearly differentiated features
from the others; which should help us train a good classification model.
Prepare the data
Just as for binary classification, before training the model, we need to separate the
features and label, and then split the data into subsets for training and validation. We'll
also apply a stratification technique when splitting the data to maintain the proportion
of each label value in the training and validation datasets.
[ ]
from sklearn.model_selection import train_test_split
# Separate features and labels
penguins_X, penguins_y = penguins[penguin_features].values, penguins[penguin_label].values
# Split data 70%-30% into training set and test set
x_penguin_train, x_penguin_test, y_penguin_train, y_penguin_test = train_test_split(penguins_X, penguins_y,
                                                                                    test_size=0.30,
                                                                                    random_state=0,
                                                                                    stratify=penguins_y)
print ('Training Set: %d, Test Set: %d \n' % (x_penguin_train.shape[0], x_penguin_test.shape[0]))
Now that we have a set of training features and corresponding training labels, we can fit
a multiclass classification algorithm to the data to create a model. Most scikit-learn
classification algorithms inherently support multiclass classification. We'll try a logistic
regression algorithm.
[ ]
from sklearn.linear_model import LogisticRegression
# Set regularization rate
reg = 0.1
# train a logistic regression model on the training set
multi_model = LogisticRegression(C=1/reg, solver='lbfgs', multi_class='auto', max_iter=10000).fit(x_penguin_train, y_penguin_train)
print (multi_model)
Now we can use the trained model to predict the labels for the test features, and
compare the predicted labels to the actual labels:
[ ]
penguin_predictions = multi_model.predict(x_penguin_test)
print('Predicted labels: ', penguin_predictions[:15])
print('Actual labels : ' ,y_penguin_test[:15])
from sklearn.metrics import classification_report
print(classification_report(y_penguin_test, penguin_predictions))
You can get the overall metrics separately from the report using the scikit-learn metrics
score classes, but with multiclass results you must specify which average metric you
want to use for precision and recall.
[ ]
from sklearn.metrics import accuracy_score, precision_score, recall_score
print("Overall Accuracy:",accuracy_score(y_penguin_test, penguin_predictions))
print("Overall Precision:",precision_score(y_penguin_test, penguin_predictions, average='macro'))
print("Overall Recall:",recall_score(y_penguin_test, penguin_predictions, average='macro'))
from sklearn.metrics import confusion_matrix
# Print the confusion matrix
mcm = confusion_matrix(y_penguin_test, penguin_predictions)
print(mcm)
When dealing with multiple classes, it's generally more intuitive to visualize this as a
heat map, like this:
[ ]
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(mcm, interpolation="nearest", cmap=plt.cm.Blues)
plt.colorbar()
tick_marks = np.arange(len(penguin_classes))
plt.xticks(tick_marks, penguin_classes, rotation=45)
plt.yticks(tick_marks, penguin_classes)
plt.xlabel("Predicted Species")
plt.ylabel("Actual Species")
plt.show()
The darker squares in the confusion matrix plot indicate high numbers of cases, and you
can hopefully see a diagonal line of darker squares indicating cases where the predicted
and actual label are the same.
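The diagonal also gives you per-class metrics directly. As a quick sketch (using a made-up 3x3 matrix rather than the one computed above, with rows as actual classes and columns as predicted classes), per-class recall is each diagonal value divided by its row total:

```python
import numpy as np

# A hypothetical 3x3 confusion matrix (rows = actual, columns = predicted)
mcm_example = np.array([[30,  1,  2],
                        [ 0, 28,  1],
                        [ 3,  0, 25]])

# Per-class recall: correct predictions (the diagonal) divided by
# the number of actual cases in each class (the row sums)
per_class_recall = np.diag(mcm_example) / mcm_example.sum(axis=1)
print(per_class_recall)
```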
In the case of a multiclass classification model, a single ROC curve showing true positive
rate vs false positive rate is not possible. However, you can use the rates for each class in
a One vs Rest (OVR) comparison to create a ROC chart for each class.
[ ]
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
# Get class probability scores
penguin_prob = multi_model.predict_proba(x_penguin_test)
# Get ROC metrics for each class
fpr = {}
tpr = {}
thresh ={}
for i in range(len(penguin_classes)):
    fpr[i], tpr[i], thresh[i] = roc_curve(y_penguin_test, penguin_prob[:,i], pos_label=i)
# Plot the ROC chart
plt.plot(fpr[0], tpr[0], linestyle='--', color='orange', label=penguin_classes[0] + ' vs Rest')
plt.plot(fpr[1], tpr[1], linestyle='--', color='green', label=penguin_classes[1] + ' vs Rest')
plt.plot(fpr[2], tpr[2], linestyle='--', color='blue', label=penguin_classes[2] + ' vs Rest')
plt.title('Multiclass ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='best')
plt.show()
To quantify the ROC performance, you can calculate an aggregate area under the curve
score that is averaged across all of the OVR curves.
[ ]
auc = roc_auc_score(y_penguin_test,penguin_prob, multi_class='ovr')
print('Average AUC:', auc)
Again, just like with binary classification, you can use a pipeline to apply preprocessing
steps to the data before fitting it to an algorithm to train a model. Let's see if we can
improve the penguin predictor by scaling the numeric features in a transformation step
before training. We'll also try a different algorithm (a support vector machine), just to
show that we can!
[ ]
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
# Define preprocessing for numeric columns (scale them)
feature_columns = [0,1,2,3]
feature_transformer = Pipeline(steps=[
('scaler', StandardScaler())
])
# Create preprocessing steps
preprocessor = ColumnTransformer(
transformers=[
('preprocess', feature_transformer, feature_columns)])
# Create training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('regressor', SVC(probability=True))])
# Fit the pipeline to train an SVM classification model on the training set
multi_model = pipeline.fit(x_penguin_train, y_penguin_train)
print (multi_model)
[ ]
# Get predictions from test data
penguin_predictions = multi_model.predict(x_penguin_test)
penguin_prob = multi_model.predict_proba(x_penguin_test)
# Overall metrics
print("Overall Accuracy:",accuracy_score(y_penguin_test, penguin_predictions))
print("Overall Precision:",precision_score(y_penguin_test, penguin_predictions, average='macro'))
print("Overall Recall:",recall_score(y_penguin_test, penguin_predictions, average='macro'))
print('Average AUC:', roc_auc_score(y_penguin_test,penguin_prob, multi_class='ovr'))
# Confusion matrix (recomputed for the new model's predictions)
mcm = confusion_matrix(y_penguin_test, penguin_predictions)
plt.imshow(mcm, interpolation="nearest", cmap=plt.cm.Blues)
plt.colorbar()
tick_marks = np.arange(len(penguin_classes))
plt.xticks(tick_marks, penguin_classes, rotation=45)
plt.yticks(tick_marks, penguin_classes)
plt.xlabel("Predicted Species")
plt.ylabel("Actual Species")
plt.show()
Now let's save our trained model so we can use it again later.
[ ]
import joblib
# Save the model as a pickle file
filename = './penguin_model.pkl'
joblib.dump(multi_model, filename)
OK, so now we have a trained model. Let's use it to predict the class of a new penguin
observation:
[ ]
# Load the model from the file
multi_model = joblib.load(filename)
# The model accepts an array of feature arrays (so you can predict the classes of multiple penguin observations in a single call)
# We'll create an array with a single array of features, representing one penguin
x_new = np.array([[50.4,15.3,224,5550]])
print ('New sample: {}'.format(x_new[0]))
# The model returns an array of predictions - one for each set of features submitted
# In our case, we only submitted one penguin, so our prediction is the first one in the resulting array.
penguin_pred = multi_model.predict(x_new)[0]
print('Predicted class is', penguin_classes[penguin_pred])
You can also submit a batch of penguin observations to the model, and get back a
prediction for each one.
[ ]
# This time our input is an array of two feature arrays
x_new = np.array([[49.5,18.4,195,3600],
                  [38.2,20.1,190,3900]])
print ('New samples:\n{}'.format(x_new))
# Call the model's predict method with the input data
predictions = multi_model.predict(x_new)
# Get the predicted classes.
for prediction in predictions:
    print(prediction, '(' + penguin_classes[prediction] + ')')
Summary
Classification is one of the most common forms of machine learning, and by following
the basic principles we've discussed in this notebook you should be able to train and
evaluate classification models with scikit-learn. It's worth spending some time
investigating classification algorithms in more depth, and a good starting point is
the Scikit-Learn documentation.
UNIT 8/9:
Knowledge check
200 XP
3 minutes
1.
You plan to use scikit-learn to train a model that predicts credit default risk. The model
must predict a value of 0 for loan applications that should be automatically approved,
and 1 for applications where there is a risk of default that requires human consideration.
What kind of model is required?
ANSWER: 1 (A binary classification model predicts probability for two classes.)
2.
You have trained a classification model using the scikit-learn LogisticRegression class.
You want to use the model to return labels for new data in the array x_new. Which code
should you use?
model.predict(x_new)
model.fit(x_new)
model.score(x_new, y_new)
ANSWER: 1 (Use the predict method for inferencing labels for new data.)
3.
You train a binary classification model using scikit-learn. When you evaluate it with test
data, you determine that the model achieves an overall Recall metric of 0.81. What does
this metric indicate?
ANSWER: The model correctly identified 81% of the actual positive cases (recall is the
proportion of actual positive cases that the model predicted correctly).
UNIT 9/9:
Summary
Completed100 XP
1 minute
In this module, you learned how classification can be used to create a machine learning
model that predicts categories, or classes. You then used the scikit-learn framework in
Python to train and evaluate a classification model.
While scikit-learn is a popular framework for writing code to train classification models,
you can also create machine learning solutions for classification using the graphical
tools in Microsoft Azure Machine Learning. You can learn more about no-code
development of classification models using Azure Machine Learning in the Create a
classification model with Azure Machine Learning designer module.
Note
The time to complete this optional challenge is not included in the estimated time for
this module - you can spend as little or as much time on it as you like!
MODULE:
Train and evaluate clustering models
UNIT 1/7:
Introduction
Completed100 XP
2 minutes
Clustering is the process of grouping similar objects together. For example, in the
image below we have a collection of 2D coordinates that have been clustered into three
categories - top left (yellow), bottom (red), and top right (blue).
A major difference between clustering and classification models is that clustering is an
‘unsupervised’ method, where ‘training’ is done without labels. Instead, models identify
examples that have a similar collection of features. In the image above, examples that
are in a similar location are grouped together.
Clustering is common and useful for exploring new data where patterns between data
points, such as high-level categories, are not yet known. It's used in many fields that
need to automatically label complex data, including analysis of social networks, brain
connectivity, spam filtering, and so on.
UNIT 2/7:
What is clustering?
Completed100 XP
5 minutes
For example, suppose a botanist observes a sample of flowers and records the number
of petals and leaves on each flower.
It may be useful to group these flowers into clusters based on similarities between their
features.
There are many ways this could be done. For example, if most flowers have the same
number of leaves, they could be grouped into those with many vs few petals.
Alternatively, if both petal and leaf counts vary considerably there may be a pattern to
discover, such as those with many leaves also having many petals. The goal of the
clustering algorithm is to find the optimal way to split the dataset into groups. What
‘optimal’ means depends on both the algorithm used and the dataset that is provided.
Although this flower example may be simple for a human to achieve with only a few
samples, as the dataset grows to thousands of samples or to more than two features,
clustering algorithms become very useful to quickly dissect a dataset into groups.
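To make the flower example concrete, here's a minimal sketch (the petal and leaf counts are invented for illustration) that uses scikit-learn's KMeans to split the observations into two groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical flower observations: [number of petals, number of leaves]
flowers = np.array([[5, 2], [6, 2], [5, 3],
                    [12, 2], [13, 3], [11, 2]])

# Ask the algorithm for the optimal split into two groups
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(flowers)
print(labels)  # one cluster assignment per flower
```

Here the algorithm groups the few-petal flowers separately from the many-petal ones, because that split minimizes each flower's distance to its group's center.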
UNIT 3/7:
Clustering - Introduction
In contrast to supervised machine learning, unsupervised learning is used when there is no
"ground truth" from which to train and validate label predictions. The most common form of
unsupervised learning is clustering, which is similar conceptually to classification, except that
the training data does not include known values for the class label to be predicted. Clustering
works by separating the training cases based on similarities that can be determined from their
feature values. Think of it this way: the numeric features of a given entity can be thought of as
vector coordinates that define the entity's position in n-dimensional space. What a clustering
model seeks to do is to identify groups, or clusters, of entities that are close to one another while
being separated from other clusters.
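As a small illustration of that idea (the feature values below are invented), treating each entity's features as a vector lets you measure closeness as Euclidean distance:

```python
import numpy as np

# Three entities described by the same three numeric features,
# treated as points in 3-dimensional space
entity_a = np.array([1.0, 2.0, 3.0])
entity_b = np.array([1.5, 2.5, 3.5])
entity_c = np.array([9.0, 9.0, 9.0])

# Euclidean distance between feature vectors measures "closeness"
dist_ab = np.linalg.norm(entity_a - entity_b)
dist_ac = np.linalg.norm(entity_a - entity_c)
print(dist_ab, dist_ac)  # entity a is much closer to b than to c
```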
For example, let's take a look at a dataset that contains measurements of different species of
wheat seed.
Citation: The seeds dataset used in this exercise was originally published by the Institute of
Agrophysics of the Polish Academy of Sciences in Lublin, and can be downloaded from the UCI
dataset repository (Dua, D. and Graff, C. (2019). UCI Machine Learning
Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of
Information and Computer Science).
[ ]
import pandas as pd
# load the training dataset
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/seeds.csv
data = pd.read_csv('seeds.csv')
# Display a random sample of 10 observations (just the features)
features = data[data.columns[0:6]]
features.sample(10)
As you can see, the dataset contains six data points (or features) for each instance (observation)
of a seed. So you could interpret these as coordinates that describe each instance's location in
six-dimensional space.
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
# Normalize the numeric features so they're on the same scale
scaled_features = MinMaxScaler().fit_transform(features[data.columns[0:6]])
# Get two principal components
pca = PCA(n_components=2).fit(scaled_features)
features_2d = pca.transform(scaled_features)
features_2d[0:10]
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(features_2d[:,0],features_2d[:,1])
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.title('Data')
plt.show()
Hopefully you can see at least two, arguably three, reasonably distinct groups of data points; but
here lies one of the fundamental problems with clustering - without known class labels, how do
you know how many clusters to separate your data into?
One way we can try to find out is to use a data sample to create a series of clustering models with
an incrementing number of clusters, and measure how tightly the data points are grouped within
each cluster. A metric often used to measure this tightness is the within cluster sum of
squares (WCSS), with lower values meaning that the data points are closer. You can then plot
the WCSS for each model.
[ ]
#importing the libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%matplotlib inline
# Create 10 models with 1 to 10 clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i)
    # Fit the data points
    kmeans.fit(features.values)
    # Get the WCSS (inertia) value
    wcss.append(kmeans.inertia_)
#Plot the WCSS values onto a line graph
plt.plot(range(1, 11), wcss)
plt.title('WCSS by Clusters')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
The plot shows a large reduction in WCSS (so greater tightness) as the number of
clusters increases from one to two, and a further noticeable reduction from two to three
clusters. After that, the reduction is less pronounced, resulting in an "elbow" in the chart
at around three clusters. This is a good indication that there are two to three reasonably
well-separated clusters of data points.
Summary
Here we looked at what clustering means, and how to determine whether clustering
might be appropriate for your data. In the next notebook, we will look at two ways of
labelling the data automatically.
UNIT 4/7:
5 minutes
1. The feature values are vectorized to define n-dimensional coordinates (where n is the
number of features). In the flower example, we have two features (number of petals and
number of leaves), so the feature vector has two coordinates that we can use to
conceptually plot the data points in two-dimensional space.
2. You decide how many clusters you want to use to group the flowers, and call this value k.
For example, to create three clusters, you would use a k value of 3. Then k points are
plotted at random coordinates. These points will ultimately be the center points for each
cluster, so they're referred to as centroids.
3. Each data point (in this case flower) is assigned to its nearest centroid.
4. Each centroid is moved to the center of the data points assigned to it based on the mean
distance between the points.
5. After moving the centroid, the data points may now be closer to a different centroid, so
the data points are reassigned to clusters based on the new closest centroid.
6. The centroid movement and cluster reallocation steps are repeated until the clusters
become stable or a pre-determined maximum number of iterations is reached.
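The steps above can be sketched in a few lines of NumPy (with synthetic two-dimensional data; real implementations such as scikit-learn's KMeans add refinements like smarter centroid initialization):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two well-separated groups of 2-D points
points = np.vstack([rng.normal([2, 2], 0.3, size=(20, 2)),
                    rng.normal([8, 8], 0.3, size=(20, 2))])

k = 2
# Step 2: pick k starting centroids (here, random data points)
centroids = points[rng.choice(len(points), k, replace=False)]

for _ in range(10):  # step 6: repeat until stable (capped at 10 iterations)
    # Step 3: assign each point to its nearest centroid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)
    # Step 4: move each centroid to the mean of its assigned points
    new_centroids = np.array([points[assignments == i].mean(axis=0)
                              for i in range(k)])
    # Step 5/6: stop when the centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(assignments)  # cluster index (0 or 1) for each point
```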
For example, if we apply clustering to the meanings of words, we may get a group
containing adjectives specific to emotions (‘angry’, ‘happy’, and so on), which itself
belongs to a group containing all human-related adjectives (‘happy’, ‘handsome’,
‘young’), and this belongs to an even higher group containing all adjectives (‘happy’,
‘green’, ‘handsome’, ‘hard’ etc.).
Hierarchical clustering is useful for not only breaking data into groups, but
understanding the relationships between these groups. A major advantage of
hierarchical clustering is that it does not require the number of clusters to be defined in
advance, and can sometimes provide more interpretable results than non-hierarchical
approaches. The major drawback is that these approaches can take much longer to
compute than simpler approaches and sometimes are not suitable for large datasets.
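As a brief sketch of this (using SciPy's agglomerative linkage functions on invented two-dimensional points), you can build the merge hierarchy once and then cut it at whatever level gives the number of groups you want:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Invented points forming two tight, well-separated groups
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Build the full bottom-up merge hierarchy (Ward linkage)
merges = linkage(points, method='ward')

# Cut the hierarchy to get two groups; no cluster count was needed up front
two_groups = fcluster(merges, t=2, criterion='maxclust')
print(two_groups)
```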
UNIT 5/7:
Exercise - Train and evaluate advanced clustering
models
[ ]
import pandas as pd
# load the training dataset
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/seeds.csv
data = pd.read_csv('seeds.csv')
# Display a random sample of 10 observations (just the features)
features = data[data.columns[0:6]]
features.sample(10)
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
# Normalize the numeric features so they're on the same scale
scaled_features = MinMaxScaler().fit_transform(features[data.columns[0:6]])
# Get two principal components
pca = PCA(n_components=2).fit(scaled_features)
features_2d = pca.transform(scaled_features)
features_2d[0:10]
K-Means Clustering
The algorithm we used to create our test clusters is K-Means. This is a commonly used
clustering algorithm that separates a dataset into K clusters of equal variance. The
number of clusters, K, is user defined. The basic algorithm has the following steps:
1. A set of K centroids are randomly chosen.
2. Clusters are formed by assigning the data points to their closest centroid.
3. The mean of each cluster is computed and the centroid is moved to the mean.
4. Steps 2 and 3 are repeated until a stopping criterion is met. Typically, the algorithm
terminates when each new iteration results in negligible movement of centroids and the clusters
become static.
5. When the clusters stop changing, the algorithm has converged, defining the locations of the
clusters - note that the random starting point for the centroids means that re-running the
algorithm could result in slightly different clusters, so training usually involves multiple
iterations, reinitializing the centroids each time, and the model with the best WCSS is selected.
from sklearn.cluster import KMeans
# Create a model based on 3 centroids
model = KMeans(n_clusters=3, init='k-means++', n_init=100, max_iter=1000)
# Fit to the data and predict the cluster assignments for each data point
km_clusters = model.fit_predict(features.values)
# View the cluster assignments
km_clusters
Let's see those cluster assignments with the two-dimensional data points.
[ ]
import matplotlib.pyplot as plt
%matplotlib inline
def plot_clusters(samples, clusters):
    col_dic = {0:'blue',1:'green',2:'orange'}
    mrk_dic = {0:'*',1:'x',2:'+'}
    colors = [col_dic[x] for x in clusters]
    markers = [mrk_dic[x] for x in clusters]
    for sample in range(len(clusters)):
        plt.scatter(samples[sample][0], samples[sample][1], color=colors[sample], marker=markers[sample], s=100)
    plt.xlabel('Dimension 1')
    plt.ylabel('Dimension 2')
    plt.title('Assignments')
    plt.show()
plot_clusters(features_2d, km_clusters)
Hopefully, the data has been separated into three distinct clusters.
So what's the practical use of clustering? In some cases, you may have data that you
need to group into distinct clusters without knowing how many clusters there are or what
they indicate. For example, a marketing organization might want to separate customers
into distinct segments, and then investigate how those segments exhibit different
purchasing behaviors.
In the case of the seeds data, the different species of seed are already known and
encoded as 0 (Kama), 1 (Rosa), or 2 (Canadian), so we can use these identifiers to
compare the species classifications to the clusters identified by our unsupervised
algorithm:
[ ]
seed_species = data[data.columns[7]]
plot_clusters(features_2d, seed_species.values)
There may be some differences between the cluster assignments and class labels, but
the K-Means model should have done a reasonable job of clustering the observations so
that seeds of the same species are generally in the same cluster.
Hierarchical Clustering
Hierarchical clustering methods make fewer distributional assumptions when compared
to K-means methods. However, K-means methods are generally more scalable,
sometimes very much so.
Agglomerative Clustering
Let's see an example of clustering the seeds data using an agglomerative clustering
algorithm.
[ ]
from sklearn.cluster import AgglomerativeClustering
agg_model = AgglomerativeClustering(n_clusters=3)
agg_clusters = agg_model.fit_predict(features.values)
agg_clusters
import matplotlib.pyplot as plt
%matplotlib inline
def plot_clusters(samples, clusters):
    col_dic = {0:'blue',1:'green',2:'orange'}
    mrk_dic = {0:'*',1:'x',2:'+'}
    colors = [col_dic[x] for x in clusters]
    markers = [mrk_dic[x] for x in clusters]
    for sample in range(len(clusters)):
        plt.scatter(samples[sample][0], samples[sample][1], color=colors[sample], marker=markers[sample], s=100)
    plt.xlabel('Dimension 1')
    plt.ylabel('Dimension 2')
    plt.title('Assignments')
    plt.show()
plot_clusters(features_2d, agg_clusters)
Summary
Here we practiced using K-means and hierarchical clustering. These unsupervised
learning techniques take unlabelled data and identify which observations are similar to
one another.
Further Reading
To learn more about clustering with scikit-learn, see the scikit-learn documentation.
UNIT 6/7:
Knowledge check
200 XP
3 minutes
1.
You are using scikit-learn to train a K-Means clustering model that groups observations
into three clusters. How should you create the KMeans object to accomplish this goal?
model = KMeans(n_clusters=3)
model = KMeans(n_init=3)
model = KMeans(max_iter=3)
ANSWER: 1 (The n_clusters parameter specifies the number of clusters to create.)
UNIT 7/7:
Summary
1 minute
In this module, you learned how clustering can be used to create unsupervised machine
learning models that group data observations into clusters. You then used the scikit-
learn framework in Python to train a clustering model.
While scikit-learn is a popular framework for writing code to train clustering models, you
can also create machine learning solutions for clustering using the graphical tools in
Microsoft Azure Machine Learning. You can learn more about no-code development of
clustering models using Azure Machine Learning in the Create a clustering model with
Azure Machine Learning designer module.
Note
The time to complete this optional challenge is not included in the estimated time for
this module - you can spend as little or as much time on it as you like!
MODULE 5: TRAIN AND EVALUATE DEEP
LEARNING MODELS
UNIT 1/9:
Introduction
Completed100 XP
5 minutes
Deep learning is an advanced form of machine learning that tries to emulate the way the
human brain learns.
In your brain, you have nerve cells called neurons, which are connected to one another
by nerve extensions that pass electrochemical signals through the network.
When the first neuron in the network is stimulated, the input signal is processed, and if
it exceeds a particular threshold, the neuron is activated and passes the signal on to the
neurons to which it is connected. These neurons in turn may be activated and pass the
signal on through the rest of the network. Over time, the connections between the
neurons are strengthened by frequent use as you learn how to respond effectively. For
example, if someone throws a ball towards you, your neuron connections enable you to
process the visual information and coordinate your movements to catch the ball. If you
perform this action repeatedly, the network of neurons involved in catching a ball will
grow stronger as you learn how to be better at catching a ball.
Deep learning emulates this biological process using artificial neural networks that
process numeric inputs rather than electrochemical stimuli.
The incoming nerve connections are replaced by numeric inputs that are typically
identified as x. When there's more than one input value, x is considered a vector with
elements named x1, x2, and so on.
The neuron itself encapsulates a function that calculates a weighted sum of the inputs x,
using weights w and a bias b.
This function is in turn enclosed in an activation function that constrains the result (often
to a value between 0 and 1) to determine whether or not the neuron passes an output
onto the next layer of neurons in the network.
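A minimal sketch of a single artificial neuron (the input, weight, and bias values are arbitrary, and the sigmoid is just one common choice of activation function):

```python
import math

def neuron(x, w, b):
    # Weighted sum of the inputs, plus the bias
    weighted_sum = sum(xi * wi for xi, wi in zip(x, w)) + b
    # Sigmoid activation squashes the result to a value between 0 and 1
    return 1 / (1 + math.exp(-weighted_sum))

x = [0.5, 1.0, -0.2]   # numeric inputs (x1, x2, x3)
w = [0.4, -0.1, 0.6]   # one weight per input
b = 0.05               # bias

print(neuron(x, w, b))  # an output between 0 and 1
```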
UNIT 2/9:
10 minutes
Before exploring how to train a deep neural network (DNN) machine learning model,
let's consider what we're trying to achieve. Machine learning is concerned with
predicting a label based on some features of a particular observation. In simple terms, a
machine learning model is a function that calculates y (the label) from x (the
features): f(x)=y.
In this case, the features (x) are a vector of four values, or mathematically, x=[x1,x2,x3,x4].
Let's suppose that the label we're trying to predict (y) is the species of the penguin, and
that there are three possible species it could be:
1. Adelie
2. Gentoo
3. Chinstrap
You train the machine learning model by using observations for which you already know
the true label. For example, you may have the following feature measurements for
an Adelie specimen:
You already know that this is an example of an Adelie (class 0), so a perfect classification
function should result in a label that indicates a 100% probability for class 0, and a 0%
probability for classes 1 and 2:
y=[1, 0, 0]
Because of the layered architecture of the network, this kind of model is sometimes
referred to as a multilayer perceptron. Additionally, notice that all neurons in the input
and hidden layers are connected to all neurons in the subsequent layers - this is an
example of a fully connected network.
When you create a model like this, you must define an input layer that supports the
number of features your model will process, and an output layer that reflects the
number of outputs you expect it to produce. You can decide how many hidden layers
you want to include and how many neurons are in each of them; but you have no
control over the input and output values for these layers - these are determined by the
model training process.
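As an illustrative sketch (random, untrained weights and made-up input values), a fully connected network with four inputs, one hidden layer of five neurons, and three outputs reduces to a couple of matrix operations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fully connected network: 4 input features -> 5 hidden neurons -> 3 outputs
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 3)), np.zeros(3)

def forward(x):
    hidden = np.maximum(0, x @ W1 + b1)   # ReLU activation in the hidden layer
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max())   # softmax turns outputs into probabilities
    return exp / exp.sum()

x = np.array([37.3, 16.8, 19.2, 30.0])    # made-up feature values
probs = forward(x)
print(probs)  # three probabilities, one per class, summing to 1
```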
1. Features for data observations with known label values are submitted to the input layer.
Generally, these observations are grouped into batches (often referred to as mini-batches).
2. The neurons then apply their function, and if activated, pass the result onto the next layer
until the output layer produces a prediction.
3. The prediction is compared to the actual known value, and the amount of variance
between the predicted and true values (which we call the loss) is calculated.
4. Based on the results, revised values for the weights and bias values are calculated to
reduce the loss, and these adjustments are backpropagated to the neurons in the network
layers.
5. The next epoch repeats the batch training forward pass with the revised weight and bias
values, hopefully improving the accuracy of the model (by reducing the loss).
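The loop above can be sketched with NumPy. This toy version trains a softmax classifier with no hidden layer (a drastically simplified "network", with synthetic data and hand-derived gradients), but it follows the same forward pass / loss / backpropagation / update cycle:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic mini-batch: 6 observations, 4 features, known labels in {0, 1, 2}
X = rng.normal(size=(6, 4))
y = np.array([0, 1, 2, 0, 1, 2])
Y = np.eye(3)[y]                 # one-hot encode the known labels

W = np.zeros((4, 3))
b = np.zeros(3)
learning_rate = 0.5

for epoch in range(200):
    # 1-2. Forward pass: compute a prediction for each observation in the batch
    logits = X @ W + b
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # 3. Loss: cross-entropy between predictions and the true labels
    loss = -np.mean(np.sum(Y * np.log(probs + 1e-12), axis=1))
    # 4. Backpropagation: gradient of the loss w.r.t. weights and biases
    grad_logits = (probs - Y) / len(X)
    W -= learning_rate * (X.T @ grad_logits)
    b -= learning_rate * grad_logits.sum(axis=0)
    # 5. The next epoch repeats the forward pass with the revised values

print(loss)  # should be well below the initial loss of ln(3), about 1.10
```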
Note
Processing the training features as a batch improves the efficiency of the training
process by processing multiple observations simultaneously as a matrix of features with
vectors of weights and biases. Linear algebraic functions that operate with matrices and
vectors also feature in 3D graphics processing, which is why computers with graphic
processing units (GPUs) provide significantly better performance for deep learning
model training than central processing unit (CPU) only computers.
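A quick sketch of why batches help (the shapes and values here are arbitrary): one matrix multiplication processes the whole batch and produces the same result as looping over the observations one at a time, but as a single operation that the hardware can parallelize:

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(32, 4))   # a mini-batch of 32 observations, 4 features each
W = rng.normal(size=(4, 3))    # weights for a layer of 3 neurons
b = rng.normal(size=3)         # one bias per neuron

# One matrix multiplication handles the whole batch at once...
batched = X @ W + b

# ...and matches processing the observations one by one
looped = np.array([x @ W + b for x in X])
print(np.allclose(batched, looped))  # True
```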
Calculating loss
Suppose one of the samples passed through the training process contains features of
an Adelie specimen (class 0). The correct output from the network would be [1, 0, 0].
Now suppose that the output produced by the network is [0.4, 0.3, 0.3]. Comparing
these, we can calculate an absolute variance for each element (in other words, how far is
each predicted value away from what it should be) as [0.6, 0.3, 0.3].
In reality, since we're actually dealing with multiple observations, we typically aggregate
the variance - for example by squaring the individual variance values and calculating the
mean, so we end up with a single, average loss value, like 0.18.
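That calculation can be checked in a couple of lines:

```python
predicted = [0.4, 0.3, 0.3]   # the network's output for the Adelie sample
actual    = [1.0, 0.0, 0.0]   # the correct (one-hot) label

# Absolute variance for each element
variance = [abs(a - p) for a, p in zip(actual, predicted)]
print(variance)               # [0.6, 0.3, 0.3]

# Square the individual variances and take the mean -> one average loss value
loss = sum(v ** 2 for v in variance) / len(variance)
print(round(loss, 2))         # 0.18
```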
Optimizers
Now, here's the clever bit. The loss is calculated using a function, which operates on the
results from the final layer of the network, which is also a function. The final layer of the
network operates on the outputs from the previous layers, which are also functions. So
in effect, the entire model from the input layer right through to the loss calculation is
just one big nested function. Functions have a few really useful characteristics, including:
You can conceptualize a function as a plotted line comparing its output with each of its
variables.
You can use differential calculus to calculate the derivative of the function at any point
with respect to its variables.
Let's take the first of these capabilities. We can plot the line of the function to show how
an individual weight value compares to loss, and mark on that line the point where the
current weight value matches the current loss value.
Now let's apply the second characteristic of a function. The derivative of a function for a
given point indicates whether the slope (or gradient) of the function output (in this case,
loss) is increasing or decreasing with respect to a function variable (in this case, the
weight value). A positive derivative indicates that the function is increasing, and a
negative derivative indicates that it is decreasing. In this case, at the plotted point for
the current weight value, the function has a downward gradient. In other words,
increasing the weight will have the effect of decreasing the loss.
We use an optimizer to apply this same trick for all of the weight and bias variables in
the model and determine in which direction we need to adjust them (up or down) to
reduce the overall amount of loss in the model. There are multiple commonly used
optimization algorithms, including stochastic gradient descent (SGD), Adaptive Learning
Rate (ADADELTA), Adaptive Momentum Estimation (Adam), and others; all of which are
designed to figure out how to adjust the weights and biases to minimize loss.
Learning rate
Now, the obvious next question is, by how much should the optimizer adjust the
weights and bias values? If you look at the plot for our weight value, you can see that
increasing the weight by a small amount will follow the function line down (reducing the
loss), but if we increase it by too much, the function line starts to go up again, so we
might actually increase the loss; and after the next epoch, we might find we need to
reduce the weight.
The size of the adjustment is controlled by a parameter that you set for training called
the learning rate. A low learning rate results in small adjustments (so it can take more
epochs to minimize the loss), while a high learning rate results in large adjustments (so
you might miss the minimum altogether).
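The effect of the learning rate can be seen with the same toy loss function (the specific values here are illustrative; real training uses many weights and a data-driven loss):

```python
def loss_gradient(w):
    # Exact derivative of the toy loss (w - 3)^2, which has its minimum at w = 3
    return 2 * (w - 3)

def train(learning_rate, epochs=20):
    w = 10.0
    for _ in range(epochs):
        w -= learning_rate * loss_gradient(w)   # step downhill each epoch
    return w

low = train(0.05)    # small steps: steadily approaches the minimum at w = 3
high = train(1.05)   # steps too large: overshoots the minimum and diverges
```

With the low learning rate, `low` ends close to 3; with the high learning rate, each step jumps past the minimum and lands further away, so `high` moves away from 3 entirely.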
UNIT 3/9:
25 minutes
So far in this module, you've learned a lot about the theory and principles of deep
learning with neural networks. The best way to learn how to apply this theory is to
actually build a deep learning model, and that's what you'll do in this exercise.
There are many frameworks available for training deep neural networks, and in this
exercise you can choose to explore either (or both) of two of the most popular deep
learning frameworks for Python: PyTorch and TensorFlow.
A Microsoft Azure subscription. If you don't already have one, you can sign up for a free
trial at https://azure.microsoft.com/free.
An Azure Machine Learning workspace with a compute instance and the ml-
basics repository cloned.
Note
This module makes use of an Azure Machine Learning workspace. If you are completing
this module in preparation for the Azure Data Scientist certification, consider creating
the workspace once, and reusing it in other modules. After completing the exercise, be
sure to follow the Clean Up instructions to stop compute resources, and retain the
workspace if you plan to reuse it.
If you don't already have an Azure Machine Learning workspace in your Azure
subscription, follow these steps to create one:
1. Sign into the Azure portal using the Microsoft account associated with your Azure
subscription.
2. On the Azure Home page, under Azure services, select Create a resource.
The Create a resource pane appears.
Setting           Value
Project details
Subscription      Select the Azure subscription you'd like to use for this exercise.
Resource group    Select the Create new link, and name the new resource group with a unique name.
Workspace details
Workspace name    Enter a unique name for your workspace. For example, you could use <yourname>machinelearn.
Wait for your workspace resource to be created as it can take a few minutes.
10. In Azure Machine Learning studio, toggle the ☰ icon at the top left to
expand/collapse its menu pane. You can use these options to manage the
resources in your workspace.
Create a compute instance
To run the notebook used in this exercise, you will need a compute instance in your
Azure Machine Learning workspace.
5. Wait for the compute instance to start as this may take a couple of minutes. Under
the State column, your Compute instance will change to Running.
Clone the ml-basics repository
The files used in this module, and other related modules, are published in
the MicrosoftDocs/ml-basics GitHub repository. If you haven't already done so, use the
following steps to clone the repository to your Azure Machine Learning workspace:
5. After the command has completed and the checkout of the files is done, close the
terminal tab and view the home page in your Jupyter notebook file explorer.
Note
We highly recommend using Jupyter in an Azure Machine Learning workspace for this
exercise. This setup ensures the correct version of Python and the various packages you
will need are installed; and after creating the workspace once, you can reuse it in other
modules. If you prefer to complete the exercise in a Python environment on your own
computer, you can do so. You'll find details for configuring a local development
environment that uses Visual Studio Code at Running the labs on your own
computer. Be aware that if you choose to do this, the instructions in the exercise may
not match your notebook's user interface.
When you've finished working through the notebook, return to this module and move
on to the next unit to learn more.
UNIT 4/9:
Convolutional neural networks
Completed100 XP
10 minutes
While you can use deep learning models for any kind of machine learning, they're
particularly useful for dealing with data that consists of large arrays of numeric values -
such as images. Machine learning models that work with images are the foundation for
an area of artificial intelligence called computer vision, and deep learning techniques
have been responsible for driving amazing advances in this area over recent years.
At the heart of deep learning's success in this area is a kind of model called
a convolutional neural network, or CNN. A CNN typically works by extracting features
from images, and then feeding those features into a fully connected neural network to
generate a prediction. The feature extraction layers in the network have the effect of
reducing the number of features from the potentially huge array of individual pixel
values to a smaller feature set that supports label prediction.
Layers in a CNN
CNNs consist of multiple layers, each performing a specific task in extracting features or
predicting labels.
Convolution layers
One of the principal layer types is a convolutional layer that extracts important features
in images. A convolutional layer works by applying a filter to images. The filter is defined
by a kernel that consists of a matrix of weight values.
1 -1 1
-1 0 -1
1 -1 1
An image is also just a matrix of pixel values. To apply the filter, you "overlay" it on an
image and calculate a weighted sum of the corresponding image pixel values under the
filter kernel. The result is then assigned to the center cell of an equivalent 3x3 patch in a
new matrix of values that is the same size as the image. For example, suppose a 6 x 6
image has the following pixel values:
255 255 255 255 255 255
255 255 100 255 255 255
255 100 100 100 255 255
100 100 100 100 100 255
255 255 255 255 255 255
255 255 255 255 255 255
Applying the filter to the top-left 3x3 patch of the image would work like this:
255 255 255 1 -1 1 (255 x 1)+(255 x -1)+(255 x 1) +
255 255 100 x -1 0 -1 = (255 x -1)+(255 x 0)+(100 x -1) + = 155
255 100 100 1 -1 1 (255 x 1)+(100 x -1)+(100 x 1)
The result is assigned to the corresponding pixel value in the new matrix like this:
? ? ? ? ? ?
? 155 ? ? ? ?
? ? ? ? ? ?
? ? ? ? ? ?
? ? ? ? ? ?
? ? ? ? ? ?
Now the filter is moved along (convolved), typically using a step size of 1 (so moving
along one pixel to the right), and the value for the next pixel is calculated:
255 255 255 1 -1 1 (255 x 1)+(255 x -1)+(255 x 1) +
255 100 255 x -1 0 -1 = (255 x -1)+(100 x 0)+(255 x -1) + = -155
100 100 100 1 -1 1 (100 x 1)+(100 x -1)+(100 x 1)
? ? ? ? ? ?
? 155 -155 ? ? ?
? ? ? ? ? ?
? ? ? ? ? ?
? ? ? ? ? ?
? ? ? ? ? ?
The process repeats until we've applied the filter across all of the 3x3 patches of the
image to produce a new matrix of values like this:
? ? ? ? ? ?
? 155 -155 155 -155 ?
? -155 310 -155 155 ?
? 310 155 310 0 ?
? -155 -155 -155 0 ?
? ? ? ? ? ?
Because of the size of the filter kernel, we can't calculate values for the pixels at the
edge; so we typically just apply a padding value (often 0):
0 0 0 0 0 0
0 155 -155 155 -155 0
0 -155 310 -155 155 0
0 310 155 310 0 0
0 -155 -155 -155 0 0
0 0 0 0 0 0
The output of the convolution is typically passed to an activation function, which is often
a Rectified Linear Unit (ReLU) function that ensures negative values are set to 0:
0 0 0 0 0 0
0 155 0 155 0 0
0 0 310 0 155 0
0 310 155 310 0 0
0 0 0 0 0 0
0 0 0 0 0 0
The resulting matrix is a feature map of feature values that can be used to train a
machine learning model.
Note: The values in the feature map can be greater than the maximum value for a pixel
(255), so if you wanted to visualize the feature map as an image you would need
to normalize the feature values between 0 and 255.
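The worked example above can be reproduced in a few lines of NumPy (strictly speaking this computes a cross-correlation, which is what deep learning frameworks implement as "convolution"):

```python
import numpy as np

kernel = np.array([[ 1, -1,  1],
                   [-1,  0, -1],
                   [ 1, -1,  1]])

image = np.array([[255, 255, 255, 255, 255, 255],
                  [255, 255, 100, 255, 255, 255],
                  [255, 100, 100, 100, 255, 255],
                  [100, 100, 100, 100, 100, 255],
                  [255, 255, 255, 255, 255, 255],
                  [255, 255, 255, 255, 255, 255]])

# Slide the 3x3 kernel over the image with a step (stride) of 1; edge
# pixels the kernel can't be centered on keep the padding value 0
feature_map = np.zeros_like(image)
for row in range(1, image.shape[0] - 1):
    for col in range(1, image.shape[1] - 1):
        patch = image[row - 1:row + 2, col - 1:col + 2]
        feature_map[row, col] = np.sum(patch * kernel)

# ReLU activation: negative feature values are set to 0
activated = np.maximum(feature_map, 0)

print(feature_map[1, 1], feature_map[1, 2])  # 155 -155, as in the example
```

Running this produces exactly the padded feature map and ReLU output shown above.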
Typically, a convolutional layer applies multiple filter kernels. Each filter produces a
different feature map, and all of the feature maps are passed onto the next layer of the
network.
Pooling layers
After extracting feature values from images, pooling (or downsampling) layers are used
to reduce the number of feature values while retaining the key differentiating features
that have been extracted.
One of the most common kinds of pooling is max pooling, in which a filter is applied to
the image, and only the maximum pixel value within the filter area is retained. So, for
example, applying a 2x2 pooling kernel to the following patch of an image would
produce the result 155.
0 0
0 155
Note that the effect of the 2x2 pooling filter is to reduce the number of values from 4 to
1.
As with convolutional layers, pooling layers work by applying the filter across the whole
feature map. The animation below shows an example of max pooling for an image map.
1. The feature map extracted by a filter in a convolutional layer contains an array of feature
values.
2. A pooling kernel is used to reduce the number of feature values. In this case, the kernel
size is 2x2, so it will produce an array with a quarter of the number of feature values.
3. The pooling kernel is convolved across the feature map, retaining only the highest pixel
value in each position.
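The steps above can be sketched in NumPy; here the pooling kernel is applied to the 4x4 interior of the ReLU-activated feature map from the convolution example:

```python
import numpy as np

# Interior of the ReLU-activated feature map from the convolution example
feature_map = np.array([[155,   0, 155,   0],
                        [  0, 310,   0, 155],
                        [310, 155, 310,   0],
                        [  0,   0,   0,   0]])

# Group the values into non-overlapping 2x2 blocks and keep only the
# maximum of each block; the 2x2 kernel reduces the number of feature
# values by a factor of 4 (4x4 -> 2x2)
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
```

The reshape trick is equivalent to convolving the 2x2 pooling kernel across the map with a stride of 2, which is how frameworks like PyTorch and TensorFlow implement max pooling.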
Dropping layers
One of the most difficult challenges in a CNN is the avoidance of overfitting, where the
resulting model performs well with the training data but doesn't generalize well to new
data on which it wasn't trained. One technique you can use to mitigate overfitting is to
include layers in which the training process randomly eliminates (or "drops") feature
maps. This may seem counterintuitive, but it's an effective way to ensure that the model
doesn't learn to be over-dependent on the training images.
Other techniques you can use to mitigate overfitting include randomly flipping,
mirroring, or skewing the training images to generate data that varies between training
epochs.
Flattening layers
After using convolutional and pooling layers to extract the salient features in the
images, the resulting feature maps are multidimensional arrays of pixel values. A
flattening layer is used to flatten the feature maps into a vector of values that can be
used as input to a fully connected layer.
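The flattening step itself is simple; for example, with NumPy (the shapes and values here are illustrative):

```python
import numpy as np

# Two pooled 2x2 feature maps (values are illustrative)
feature_maps = np.array([[[310, 155],
                          [310, 310]],
                         [[  0, 155],
                          [155,   0]]])

# Flatten the 2 x 2 x 2 array into a single vector of 8 values,
# which becomes the input layer of the fully connected network
vector = feature_maps.flatten()
```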
Usually, a CNN ends with a fully connected network in which the feature values are
passed into an input layer, through one or more hidden layers, and generate predicted
values in an output layer.
1. Images are fed into a convolutional layer. In this case, there are two filters, so each image
produces two feature maps.
2. The feature maps are passed to a pooling layer, where a 2x2 pooling kernel reduces the
size of the feature maps.
3. A dropping layer randomly drops some of the feature maps to help prevent overfitting.
4. A flattening layer takes the remaining feature map arrays and flattens them into a vector.
5. The vector elements are fed into a fully connected network, which generates the
predictions. In this case, the network is a classification model that predicts probabilities for
three possible image classes (triangle, square, and circle).
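The five steps above map directly onto a layer sequence; here is a sketch in PyTorch (the channel counts, the 32x32 grayscale input size, and the dropout rate are illustrative assumptions, not values from the module's exercise):

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=2, kernel_size=3, padding=1),  # 1. convolution: two filters
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),   # 2. pooling: halves each spatial dimension
    nn.Dropout(p=0.2),             # 3. dropping: randomly zeroes values during training
    nn.Flatten(),                  # 4. flattening: 2 x 16 x 16 -> vector of 512
    nn.Linear(2 * 16 * 16, 3),     # 5. fully connected: three class outputs
)

x = torch.randn(8, 1, 32, 32)      # a batch of 8 fake 32x32 grayscale images
logits = model(x)                  # shape: (8, 3), one score per class
```

A softmax over the three outputs would turn the scores into probabilities for the triangle, square, and circle classes.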
45 minutes
PyTorch and TensorFlow both offer comprehensive support for building convolutional
neural networks as classification models for images.
In this exercise, you'll use your preferred framework to create a simple CNN-based
image classifier for images of simple geometric shapes. The same principles can be
applied to images of any kind.
When you've finished working through the notebook, return to this module and move
on to the next unit to learn more.
UNIT 6/9:
Transfer learning
Completed100 XP
5 minutes
In life, it’s often easier to learn a new skill if you already have expertise in a similar,
transferrable skill. For example, it’s probably easier to teach someone how to drive a bus
if they have already learned how to drive a car. The driver can build on the driving skills
they've already learned in a car, and apply them to driving a bus.
The same principle can be applied to training deep learning models through a
technique called transfer learning.
The feature extraction layers apply convolutional filters and pooling to emphasize
edges, corners, and other patterns in the images that can be used to differentiate them,
and in theory should work for any set of images with the same dimensions as the input
layer of the network. The prediction layer maps the features to a set of outputs that
represent probabilities for each class label you want to use to classify the images.
By separating the network into these types of layers, we can take the feature extraction
layers from a model that has already been trained and append one or more layers to
use the extracted features for prediction of the appropriate class labels for your images.
This approach enables you to keep the pre-trained weights for the feature extraction
layers, which means you only need to train the prediction layers you have added.
There are many established convolutional neural network architectures for image
classification that you can use as the base model for transfer learning, so you can build
on the work someone else has already done to easily create an effective image
classification model.
UNIT 7/9:
30 minutes
PyTorch and TensorFlow both support a library of existing models that you can use as
the basis for transfer learning.
In this exercise, you'll use your preferred framework to train a convolutional neural
network model by using transfer learning.
Clean-up
If you used a compute instance in an Azure Machine Learning workspace to complete
the exercises, use these steps to clean up.
If you don't intend to complete other modules that require the Azure Machine Learning
workspace, you can delete the resource group you created for it from your Azure
subscription.
UNIT 8/9:
Knowledge check
200 XP
3 minutes
1.
You are creating a deep neural network to train a classification model that predicts to
which of three classes an observation belongs based on 10 numeric features. Which of
the following statements is true of the network architecture?
ANSWER: 3 (The output layer should contain a node for each possible class value.)
2.
You are training a deep neural network. You configure the training process to use 50
epochs. What effect does this configuration have?
The training data is split into 50 subsets, and each subset is passed through the network
The first 50 rows of data are used to train the model, and the remaining rows are used
to validate it
ANSWER: 1 (The number of epochs determines the number of training passes for
the full dataset.)
3.
You are creating a deep neural network. You increase the Learning Rate parameter.
What effect does this setting have?
More records are included in each batch passed through the network
4.
You are creating a convolutional neural network. You want to reduce the size of the
feature maps that are generated by a convolutional layer. What should you do?
Reduce the size of the filter kernel used in the convolutional layer
1 minute
In this module you learned about the fundamental principles of deep learning, and how
to create deep neural network models using PyTorch or TensorFlow. You also explored
the use of convolutional neural networks to create image classification models.
Deep learning techniques are at the cutting edge of machine learning and artificial
intelligence, and are used to implement enterprise solutions. If this module has inspired
you to build machine learning solutions, you should consider learning how Azure
Machine Learning can help you train, deploy, and manage models at scale. You can
learn how to use Azure Machine Learning to manage machine learning operations in
the Build AI solutions with Azure Machine Learning service learning path.
UNIT 1/8:
Introduction
Completed100 XP
1 minute
Machine Learning is the foundation for most artificial intelligence solutions. Creating an
intelligent solution often begins with the use of machine learning to train predictive
models using historic data that you have collected.
Azure Machine Learning is a cloud service that you can use to train and manage machine
learning models.
To complete this module, you'll need a Microsoft Azure subscription. If you don't
already have one, you can sign up for a free trial at https://azure.microsoft.com.
UNIT 2/8:
What is machine learning?
Completed100 XP
5 minutes
Machine learning is a technique that uses mathematics and statistics to create a model
that can predict unknown values.
For example, suppose Adventure Works Cycles is a business that rents cycles in a city.
The business could use historic data to train a model that predicts daily rental demand
in order to make sure sufficient staff and cycles are available.
To do this, Adventure Works could create a machine learning model that takes
information about a specific day (the day of week, the anticipated weather conditions,
and so on) as an input, and predicts the expected number of rentals as an output.
Mathematically, you can think of machine learning as a way of defining a function (let's
call it f) that operates on one or more features of something (which we'll call x) to
calculate a predicted label (y) - like this:
f(x) = y
In this bicycle rental example, the details about a given day (day of the week, weather,
and so on) are the features (x), the number of rentals for that day is the label (y), and the
function (f) that calculates the number of rentals based on the information about the
day is encapsulated in a machine learning model.
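As a minimal illustration of learning such a function f (a toy sketch, not the Adventure Works model; the data points are made up), you could fit a straight line relating one feature, temperature, to the rental count:

```python
import numpy as np

x = np.array([10.0, 15.0, 20.0, 25.0, 30.0])    # feature: forecast temperature
y = np.array([40.0, 60.0, 80.0, 100.0, 120.0])  # label: rentals on that day

# Ordinary least squares fit of y = w*x + b defines the function f
w, b = np.polyfit(x, y, deg=1)

def f(temperature):
    return w * temperature + b

prediction = f(22.0)   # predicted rentals for a 22-degree day
```

A real model would use many features and a more flexible algorithm, but the principle is the same: training discovers the function that maps x to y.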
Regression: used to predict a continuous value; like a price, a sales total, or some other
measure.
Classification: used to determine a binary class label; like whether a patient has diabetes
or not.
Clustering: used to determine labels by grouping similar information into label groups;
like grouping measurements from birds into species.
The following video discusses the various kinds of machine learning model you can
create, and the process generally followed to train and use them.
5 minutes
Training and deploying an effective machine learning model involves a lot of work,
much of it time-consuming and resource-intensive. Azure Machine Learning is a cloud-
based service that helps simplify some of the tasks it takes to prepare data, train a
model, and deploy a predictive service.
Most importantly, Azure Machine Learning helps data scientists increase their efficiency
by automating many of the time-consuming tasks associated with training models; and
it enables them to use cloud-based compute resources that scale effectively to handle
large volumes of data while incurring costs only when actually used.
After you have created an Azure Machine Learning workspace, you can develop
solutions with the Azure machine learning service either with developer tools or the
Azure Machine Learning studio web portal.
Compute targets are cloud-based resources on which you can run model training and
data exploration processes.
In Azure Machine Learning studio, you can manage the compute targets for your data
science activities. There are four kinds of compute resource you can create:
Compute Instances: Development workstations that data scientists can use to work with
data and models.
Compute Clusters: Scalable clusters of virtual machines for on-demand processing of
experiment code.
Inference Clusters: Deployment targets for predictive services that use your trained
models.
Attached Compute: Links to existing Azure compute resources, such as Virtual Machines
or Azure Databricks clusters.
UNIT 4/8:
3 minutes
Automated machine learning allows you to train models without extensive data science
or programming knowledge. For people with a data science and programming
background, it provides a way to save time and resources by automating algorithm
selection and hyperparameter tuning.
You can create an automated machine learning job in Azure Machine Learning studio.
In Azure Machine Learning, operations that you run are called jobs. You can configure
multiple settings for your job before starting an automated machine learning run. The
run configuration provides the information needed to specify your training script,
compute target, and Azure ML environment in your run configuration and run a training
job.
UNIT 5/8:
Understand the AutoML process
Completed100 XP
5 minutes
1. Prepare data: Identify the features and label in a dataset. Pre-process, or clean and
transform, the data as needed.
2. Train model: Split the data into two groups, a training and a validation set. Train a
machine learning model using the training data set. Test the machine learning model for
performance using the validation data set.
3. Evaluate performance: Compare how close the model's predictions are to the known
labels.
4. Deploy a predictive service: After you train a machine learning model, you can deploy
the model as an application on a server or device so that others can use it.
These are the same steps in the automated machine learning process with Azure
Machine Learning.
Prepare data
Machine learning models must be trained with existing data. Data scientists expend a lot
of effort exploring and pre-processing data, and trying various types of model-training
algorithms to produce accurate models, which is time consuming, and often makes
inefficient use of expensive compute hardware.
In Azure Machine Learning, data for model training and other operations is usually
encapsulated in an object called a dataset. You can create your own dataset in Azure
Machine Learning studio.
Train model
The automated machine learning capability in Azure Machine Learning
supports supervised machine learning models - in other words, models for which the
training data includes known label values. You can use automated machine learning to
train models for:
The best model is identified based on the evaluation metric you specified, Normalized
root mean squared error.
The difference between the predicted and actual value, known as the residuals, indicates
the amount of error in the model. The performance metric root mean squared
error (RMSE), is calculated by squaring the errors across all of the test cases, finding the
mean of these squares, and then taking the square root. The smaller this value is, the
more accurate the model's predictions are. The normalized root
mean squared error (NRMSE) standardizes the RMSE metric so it can be used for
comparison between models which have variables on different scales.
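These metrics are straightforward to compute. The following sketch uses made-up numbers, and normalizes by the range of the actual values (one common convention for NRMSE; other normalizations, such as dividing by the mean, are also used):

```python
import numpy as np

actual = np.array([100.0, 150.0, 200.0, 250.0])     # known label values
predicted = np.array([110.0, 140.0, 195.0, 260.0])  # model predictions

residuals = predicted - actual               # per-case error
rmse = np.sqrt(np.mean(residuals ** 2))      # root mean squared error

# Dividing by the range of the actual values puts RMSE on a
# comparable scale across models with differently scaled labels
nrmse = rmse / (actual.max() - actual.min())
```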
After you've used automated machine learning to train some models, you can deploy
the best performing model as a service for client applications to use.
50 minutes
In this exercise, you will use a dataset of historical bicycle rental details to train a model
that predicts the number of bicycle rentals that should be expected on a given day,
based on seasonal and meteorological features.
Note
To complete this lab, you will need an Azure subscription in which you have
administrative access.
4 minutes
1.
An automobile dealership wants to use historic car sales data to train a machine learning
model. The model should predict the price of a pre-owned car based on its make,
model, engine size, and mileage. What kind of machine learning model should the
dealership use automated machine learning to create?
Classification
Regression
2.
A bank wants to use historic loan repayment records to categorize loan applications as
low-risk or high-risk based on characteristics like the loan amount, the income of the
borrower, and the loan period. What kind of machine learning model should the bank
use automated machine learning to create?
Classification
Regression
You want to use automated machine learning to train a regression model with the best
possible R2 score. How should you configure the automated machine learning
experiment?
Enable featurization
1 minute
2 minutes
You can use Microsoft Azure Machine Learning designer to create regression models by
using a drag and drop visual interface, without needing to write any code.
To complete this module, you'll need a Microsoft Azure subscription. If you don't
already have one, you can sign up for a free trial at https://azure.microsoft.com.
UNIT 2/8:
Identify regression machine learning
scenarios
Completed100 XP
3 minutes
5 minutes
Training and deploying an effective machine learning model involves a lot of work,
much of it time-consuming and resource-intensive. Azure Machine Learning is a cloud-
based service that helps simplify some of the tasks it takes to prepare data, train a
model, and deploy a predictive service. Regression machine learning models can be
built using Azure Machine Learning.
Most importantly, Azure Machine Learning helps data scientists increase their efficiency
by automating many of the time-consuming tasks associated with training models. It
enables them to use cloud-based compute resources that scale effectively to handle
large volumes of data while incurring costs only when actually used.
After you have created an Azure Machine Learning workspace, you can develop
solutions with the Azure machine learning service either with developer tools or the
Azure Machine Learning studio web portal.
In Azure Machine Learning studio, you can manage the compute targets for your data
science activities. There are four kinds of compute resource you can create:
Compute Instances: Development workstations that data scientists can use to work with
data and models.
Compute Clusters: Scalable clusters of virtual machines for on-demand processing of
experiment code.
Inference Clusters: Deployment targets for predictive services that use your trained
models.
Attached Compute: Links to existing Azure compute resources, such as Virtual Machines
or Azure Databricks clusters.
UNIT 4/8:
What is Azure Machine Learning
designer?
100 XP
4 minutes
In Azure Machine Learning studio, there are several ways to author regression machine
learning models. One way is to use a visual interface called designer that you can use to
train, test, and deploy machine learning models. The drag-and-drop interface makes use
of clearly defined inputs and outputs that can be shared, reused, and version controlled.
Each designer project, known as a pipeline, has a left panel for navigation and a canvas
on the right. To use designer, identify the building blocks, or components,
needed for your model, place and connect them on the canvas, and run a machine
learning job.
Pipelines
Pipelines let you organize, manage, and reuse complex machine learning workflows
across projects and users. A pipeline starts with the dataset from which you want to train
the model. Each time you run a pipeline, the configuration of the pipeline and its results
are stored in your workspace as a pipeline job.
Components
An Azure Machine Learning component encapsulates one step in a machine learning
pipeline. You can think of a component as a programming function and as a building
block for Azure Machine Learning pipelines. In a pipeline project, you can access data
assets and components from the left panel's Asset Library tab.
Datasets
You can create data assets on the Data page from local files, a datastore, web files, and
Open Datasets. These data assets will appear along with standard sample datasets
in designer's Asset
Library.
In your designer project, you can access the status of a pipeline job using
the Submitted jobs tab on the left
pane.
You can find all the jobs you have run in a workspace on the Jobs page.
UNIT 5/8:
Understand steps for regression
Completed100 XP
6 minutes
You can think of the steps to train and evaluate a regression machine learning model as:
1. Prepare data: Identify the features and label in a dataset. Pre-process, or clean and
transform, the data as needed.
2. Train model: Split the data into two groups, a training and a validation set. Train a
machine learning model using the training data set. Test the machine learning model for
performance using the validation data set.
3. Evaluate performance: Compare how close the model's predictions are to the known
labels.
4. Deploy a predictive service: After you train a machine learning model, you need to
convert the training pipeline into a real-time inference pipeline. Then you can deploy the
model as an application on a server or device so that others can use it.
Prepare data
Azure machine learning designer has several pre-built components that can be used to
prepare data for training. These components enable you to clean data, normalize
features, join tables, and
more.
Train model
To train a regression model, you need a dataset that includes historical features,
characteristics of the entity for which you want to make a prediction, and
known label values. The label is the quantity you want to train a model to predict.
It's common practice to train the model using a subset of the data, while holding back
some data with which to test the trained model. This enables you to compare the labels
that the model predicts with the actual known labels in the original dataset.
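A hold-out split like the one described can be sketched with NumPy (the 70/30 ratio and the row count are illustrative choices; designer provides a Split Data component that does this for you):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

rows = np.arange(100)          # stand-in for 100 labeled rows of data
rng.shuffle(rows)              # shuffle so the split is random

split = int(len(rows) * 0.7)   # hold back 30% to test the trained model
train_rows, validation_rows = rows[:split], rows[split:]
```

The model is trained only on `train_rows`; predictions on `validation_rows` can then be compared with labels the model never saw.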
Inference pipeline
To deploy your pipeline, you must first convert the training pipeline into a real-time
inference pipeline. This process removes training components and adds web service
inputs and outputs to handle requests.
The inference pipeline performs the same data transformations as the first pipeline
for new data. Then it uses the trained model to infer, or predict, label values based on its
features. This model will form the basis for a predictive service that you can publish for
applications to use.
You can create an inference pipeline by selecting the menu above a completed
job.
Deployment
After creating the inference pipeline, you can deploy it as an endpoint. In the endpoints
page, you can view deployment details, test your pipeline service with sample data, and
find credentials to connect your pipeline service to a client application.
It will take a while for your endpoint to be deployed. The Deployment state on
the Details tab will indicate Healthy when deployment is successful.
On the Test tab, you can test your deployed service with sample data in a JSON format.
The Test tab is a tool you can use to quickly check whether your model is behaving as
expected. Typically it's helpful to test the service before connecting it to an application.
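As a sketch of the shape that sample data takes, a request body for a designer real-time endpoint is JSON with an Inputs section (the column names below are hypothetical placeholders, not the schema from this module's exercise):

```python
import json

# Hypothetical request body; the Inputs/input1 structure mirrors the
# general shape of designer endpoint requests, and the column names
# here are illustrative placeholders
payload = {
    "Inputs": {
        "input1": [
            {"make": "contoso", "engine-size": 130, "mileage": 12000}
        ]
    }
}

body = json.dumps(payload)   # the string you would paste into the Test tab
```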
You can find credentials for your service on the Consume tab. These credentials are
used to connect your trained machine learning model as a service to a client application.
UNIT 6/8:
Exercise - Explore regression with Azure
Machine Learning designer
Completed100 XP
55 minutes
In this exercise, you will train a regression model that predicts the price of an
automobile based on its characteristics.
Note
To complete this lab, you will need an Azure subscription in which you have
administrative access.
In Azure Machine Learning studio, what can you use to author regression machine
learning pipelines using a drag-and-drop interface?
Notebooks
Designer
ANSWER: 3 (You can use Designer to author regression pipelines with a drag-and-drop
interface. )
2.
You are creating a training pipeline for a regression model. You use a dataset that has
multiple numeric columns in which the values are on different scales. You want to
transform the numeric columns so that the values are all on a similar scale. You also
want the transformation to scale relative to the minimum and maximum values in each
column. Which module should you add to the pipeline?
Normalize Data
ANSWER: 3 (When you transform numeric data to be on a similar scale, use a Normalize
Data module. )
3.
Data is split into two sets in order to create two models, one model with the training set
and a different model with the validation set.
Splitting data into two sets enables you to compare the labels that the model predicts
with the actual known labels in the original dataset.
Only split data when you use the Azure Machine Learning Designer, not in other
machine learning scenarios.
ANSWER: 2 ( You want to test the model created with training data on validation data to
see how well the model performs with data it was not trained on. )
UNIT 8/8:
Summary
Completed100 XP
2 minutes