All Types of Cross-Validation
1. Hold Out Cross-Validation
In the hold-out approach, the dataset is split once into a train set and a test set; the model is trained on the train set and evaluated on the held-out test set.
Disadvantages
If the dataset itself is small, setting aside portions for testing would reduce the
robustness of the model. This is because the training sample may not be
representative of the entire dataset.
The evaluation metrics may vary due to the randomness of the split between the
train and test set.
Although an 80-20 train-test split is widely followed, there is no rule of thumb for the
split, and hence the results can vary based on how the train-test split is done.
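The split-dependence described above is easy to demonstrate. The sketch below uses a synthetic dataset (not the Parkinson's data used later) and re-runs the same hold-out evaluation with five different `random_state` values; the resulting accuracies are estimates of the same quantity but generally differ from split to split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data purely for illustration
X, y = make_classification(n_samples=200, random_state=0)

scores = []
for seed in range(5):
    # A different random_state gives a different 80-20 split
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.20, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print(scores)  # five hold-out accuracies, typically not all equal
```

The model itself is fixed; only the randomness of the split changes, which is exactly the variance the hold-out approach suffers from.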
2. Leave One Out Cross-Validation (LOOCV)
In LOOCV, each iteration holds out exactly one observation for testing and trains the model on the remaining n-1 observations, repeating this n times.
Advantages
Since every data point participates in both training and testing, the overall accuracy
estimate is more reliable.
It is very useful when the dataset is small.
Disadvantage
LOOCV is not practical to use when the number of data observations n is huge.
E.g. imagine a dataset with 500,000 records; then 500,000 models need to be
trained, which is not really feasible.
There is a huge computational and time cost associated with the LOOCV approach.
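The one-model-per-observation cost is visible directly from sklearn's `LeaveOneOut` splitter. A minimal sketch, using a tiny toy array of 10 observations rather than a real dataset:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(20).reshape(10, 2)  # 10 toy observations

loo = LeaveOneOut()
print(loo.get_n_splits(X))  # 10: one split (and one model fit) per observation

for train_idx, test_idx in loo.split(X):
    # Exactly one sample is held out in every iteration
    assert len(test_idx) == 1
```

With 10 rows this is cheap; with 500,000 rows the same loop would fit 500,000 models, which is the computational cost mentioned above.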
3. K-Fold Cross-Validation
In the K-Fold Cross-Validation approach, the dataset is split into K folds. Now in 1st
iteration, the first fold is reserved for testing and the model is trained on the data of the
remaining k-1 folds.
In the next iteration, the second fold is reserved for testing and the remaining folds are
used for training. This is continued till the K-th iteration. The accuracy obtained in each
iteration is used to derive the overall average accuracy for the model.
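The fold rotation described above can be sketched with sklearn's `KFold` on a toy array of 10 samples, assuming K=5 for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)  # 10 toy samples
kf = KFold(n_splits=5)

for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # In iteration i, the i-th fold is the test set and the rest are training
    print(f"iteration {i}: test fold = {test_idx}, train folds = {train_idx}")
```

Across the five iterations every sample appears in the test set exactly once, which is why averaging the per-iteration accuracies uses all the data.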
Advantages
K-Fold cross-validation is useful when the dataset is small and splitting it into a
train-test set (hold-out approach) is not possible without losing useful data
for training.
It helps to create a robust model with low variance and low bias, as every
observation is used for both training and testing.
Disadvantages
The major disadvantage of K-Fold Cross-Validation is that the training needs to be
done K times, and hence it consumes more time and resources.
Not recommended to be used with sequential time series data.
When the dataset is imbalanced, K-fold cross-validation may not give good results.
This is because some folds may have just a few or no records for the
minority class.
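The time-series caveat above arises because plain K-Fold ignores temporal order, so a model can be trained on future samples and tested on past ones. For sequential data, sklearn provides `TimeSeriesSplit`, which always trains on earlier samples and tests on later ones; a sketch with a toy 12-sample sequence:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12)  # a toy sequence ordered in time
tscv = TimeSeriesSplit(n_splits=3)

for train_idx, test_idx in tscv.split(X):
    # Every training index precedes every test index
    print(train_idx, test_idx)
```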
4. Stratified K-Fold Cross-Validation
Stratified K-Fold works like K-Fold, but each fold is created so that it preserves the class proportions of the full dataset.
Advantages
Stratified K-fold cross-validation is recommended when the dataset is imbalanced, because every fold is guaranteed to contain samples of the minority class.
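The benefit on imbalanced data can be sketched with synthetic labels: with a 90:10 class imbalance and 5 folds, `StratifiedKFold` keeps the same 90:10 ratio in every test fold (the features here are dummy zeros, since only the labels matter for stratification):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 90 + [1] * 10)  # 90:10 imbalanced labels
X = np.zeros((100, 1))             # dummy features, irrelevant here

skf = StratifiedKFold(n_splits=5)
for _, test_idx in skf.split(X, y):
    # Every 20-sample test fold keeps 18 majority and 2 minority samples
    print(np.bincount(y[test_idx]))
```

With plain `KFold` on the same sorted labels, some folds would contain no minority samples at all, which is the failure mode described in the K-Fold disadvantages.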
About Dataset
We will be using the Parkinson’s disease dataset for all examples of cross-validation in the
Sklearn library. The goal is to predict whether or not a particular patient has Parkinson’s
disease. We will be using the decision tree algorithm in all the examples.
The dataset has 21 attributes and 195 rows. The various fields of the Parkinson’s Disease
dataset are as follows –
MDVP:Fo(Hz) – Average vocal fundamental frequency
MDVP:Fhi(Hz) – Maximum vocal fundamental frequency
MDVP:Flo(Hz) – Minimum vocal fundamental frequency
MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP – Several
measures of variation in fundamental frequency
MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA – Several measures of variation in amplitude
NHR,HNR – Two measures of ratio of noise to tonal components in the voice
status – Health status of the subject (one) – Parkinson’s, (zero) – healthy
RPDE,D2 – Two nonlinear dynamical complexity measures
DFA – Signal fractal scaling exponent
spread1,spread2,PPE – Three nonlinear measures of fundamental frequency
variation
import pandas as pd
df=pd.read_csv("Parkinsson disease.csv")
df.head()
Out[52]:
             name  MDVP:Fo(Hz)  MDVP:Fhi(Hz)  MDVP:Flo(Hz)  MDVP:Jitter(%)  MDVP:Jitter(Abs)  ...
0  phon_R01_S01_1      119.992       157.302        74.997         0.00784           0.00007  ...
1  phon_R01_S01_2      122.400       148.650       113.819         0.00968           0.00008  ...
2  phon_R01_S01_3      116.682       131.111       111.555         0.01050           0.00009  ...
3  phon_R01_S01_4      116.676       137.871       111.366         0.00997           0.00009  ...
4  phon_R01_S01_5      116.014       141.781       110.655         0.01284           0.00011  ...

5 rows × 24 columns
Data Preprocessing
The “name” column is not going to add any value in training the model and can be
discarded, so we are dropping it below.
df=df.drop('name', axis=1)
X=df.drop('status', axis=1)
y=df['status']
Hold Out Approach (Train-Test Split)
In the below example, we have split the dataset to create the test data with a size of 30% and train data with a size of 70%. The random_state number ensures the split is deterministic in every run.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# any fixed random_state value makes the split reproducible
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.30,random_state=1)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
model.score(X_test, y_test)
0.7796610169491526
K-Fold Cross-Validation
K-Fold Cross-Validation in Sklearn can be applied by using the cross_val_score function
of sklearn.model_selection.
In the below example, 10 folds are used, producing 10 accuracy scores from which we
calculate the mean score.
In [40]:
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
model=DecisionTreeClassifier()
kfold_validation=KFold(10)
results=cross_val_score(model,X,y,cv=kfold_validation)
print(results)
print(np.mean(results))
Out[40]:
0.758421052631579
Stratified K-Fold Cross-Validation
In the below example, the dataset is divided into 5 stratified splits or folds. It returns 5 accuracy
scores, from which we calculate the final mean score.
In [41]:
from sklearn.model_selection import StratifiedKFold
skfold=StratifiedKFold(n_splits=5)
model=DecisionTreeClassifier()
scores=cross_val_score(model,X,y,cv=skfold)
print(scores)
print(np.mean(scores))
Out[41]:
0.717948717948718
Leave One Out Cross-Validation (LOOCV)
In Sklearn, LOOCV can be applied by using the LeaveOneOut class of sklearn.model_selection.
from sklearn.model_selection import LeaveOneOut
model=DecisionTreeClassifier()
leave_validation=LeaveOneOut()
results=cross_val_score(model,X,y,cv=leave_validation)
results
Out[22]:
array([1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1.,
       ...])
(one accuracy score per left-out observation; the full 195-element array is truncated here)
print(np.mean(results))
Out[44]:
0.8358974358974359
Repeated Random Test-Train Splits
In Sklearn, repeated random test-train splits can be applied by using the ShuffleSplit class of
sklearn.model_selection.
In [45]:
from sklearn.model_selection import ShuffleSplit
model=DecisionTreeClassifier()
ssplit=ShuffleSplit(n_splits=10,test_size=0.30)
results=cross_val_score(model,X,y,cv=ssplit)
print(results)
print(np.mean(results))
Out[45]:
0.8016949152542372
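Unlike K-Fold, ShuffleSplit draws each train-test split independently, so the test sets of different iterations may overlap and a sample can appear in several test sets (or none). A small sketch with a toy 20-sample array:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20)  # 20 toy samples
ss = ShuffleSplit(n_splits=3, test_size=0.30, random_state=0)

# Collect the test indices of each of the 3 independent splits
test_sets = [set(test_idx) for _, test_idx in ss.split(X)]
print(test_sets)  # three 6-sample test sets, drawn independently
```

This independence is the defining difference from K-Fold, where the K test folds partition the data and are mutually disjoint.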