Training vs. Testing Sets - Solution
Activity Overview
This activity is designed to consolidate your knowledge of the differences between the training and testing sets
and to teach you how to define them in Python using the sklearn library.
In this activity, you'll use one of the toy datasets made available in sklearn . We chose the wine
dataset.
This assignment is designed to help you apply the machine learning algorithms you have learned using
packages in Python . Python concepts, instructions, and starter code are embedded within this Jupyter
Notebook to help guide you as you progress through the activity. Remember to run each code
cell prior to submitting the assignment. Upon completing the activity, we encourage you to compare your
work against the solution file to perform a self-assessment.
NumPy is a library for the Python programming language that adds support for large, multi-
dimensional arrays and matrices, along with a large collection of high-level mathematical functions to
operate on these arrays. The code within the NumPy library is divided into submodules to facilitate its
use. For example, in the code cell below, we import the random module, which is used to generate and
work with random numbers.
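As a minimal sketch of what importing and using NumPy's random submodule looks like (the seed value 0 below is an arbitrary choice for reproducibility, not part of the activity):

```python
import numpy as np

# Create a seeded random generator so the draws are reproducible,
# then sample five random floats from the uniform distribution on [0, 1).
rng = np.random.default_rng(0)
sample = rng.random(5)

print(sample.shape)  # a one-dimensional array of five values
```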
pandas is a software library written for the Python programming language for data manipulation
and analysis. In particular, it offers data structures and operations for manipulating dataframes,
numerical tables, and time series.
load_wine is one of the toy datasets readily available from the sklearn library. Scikit-learn
(also known as sklearn ) is a free software machine learning library for the Python programming
language. It features various classification, regression, and clustering algorithms.
In the code cell below, we start by importing the necessary libraries and modules. We then load the
dataset from sklearn and assign it to the variable wine . Finally, we use a combination of NumPy
and pandas functions to create the dataframe df .
Note: This is not the standard way of importing data during a regular project. The code in the cell below is
only appropriate to use when importing toy datasets from sklearn .
We begin by visualizing the first several rows of the DataFrame df using the method .head() . By
default, .head() displays the first five rows of a DataFrame; this can be changed by passing the desired
number of rows to .head() as an integer.
In [ ]: df.head(10)
Next, we retrieve some more information about our DataFrame by using the properties .shape and
.columns .
In [ ]: df.shape
In [ ]: df.columns
The training set is used to fit the model; the testing set is used to evaluate how well the fitted model generalizes to unseen data.
As you have seen in Video 3 for this week, it is important to split the data into training and testing sets.
To split the data into training and testing sets, you can use the function train_test_split
(https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from
sklearn . This function performs a random split of data arrays or matrices into train and test subsets and
returns a list containing the train-test split of the inputs.
In our case, the function train_test_split takes four arguments:
X : Input dataframe
y : Output dataframe
test_size : Should be between 0.0 and 1.0 and should represent the proportion of the dataset to
include in the test split
random_state : Controls the shuffling applied to the data before applying the split. Ensures the
reproducibility of the results across multiple function calls
In the code cell below, fill in the ellipsis to set the argument test_size equal to 0.3 and
random_state equal to 123 .
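A completed version of that call might look as follows. Note that separating the DataFrame into inputs X and output y is covered later in the course, so the column split below (dropping the 'target' column) is only an assumption made to keep this sketch runnable:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Rebuild the DataFrame (a sketch; in the notebook, df already exists).
wine = load_wine()
df = pd.DataFrame(np.c_[wine.data, wine.target],
                  columns=list(wine.feature_names) + ['target'])

# Assumed input/output split for illustration: all feature columns as
# inputs, the 'target' column as the output.
X = df.drop(columns=['target'])
y = df['target']

# Solution: test_size=0.3 reserves 30% of the rows for the test subset,
# and random_state=123 makes the shuffle reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

print(X_train.shape, X_test.shape)
```

With 178 rows in the wine dataset, a 0.3 test fraction leaves 124 rows for training and 54 for testing.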
You can check the sizes of the resulting train and test subsets using .shape :
In [ ]: X_train.shape
In [ ]: X_test.shape
We will learn how to separate the data into inputs and outputs and how to implement algorithms in the next
segments of this week of the course.
Stay tuned!