Data Science Algorithms

For supervised learning, the data has labels, which we can call the
outcome.
It is this label that the machine tries to predict from the information available in the training
data.
#cylinders bore horsepower peak-rpm price
four 3.47 111 5000 13495
four 3.47 111 5000 16500
six 2.68 154 5000 16500
four 3.19 102 5500 13950
five 3.19 115 5500 17450
five 3.19 110 5500 15250

ALGORITHMS WHICH CAN BE BUILT WITH
SUPERVISED LEARNING
• Regression model
• Classification model
Regression model
We predict a numeric value for the label based on the available data.
The predicted value is called the outcome.

Classification model
We predict the category of the label based on the available data.
The predicted value is called the outcome.
Plasma glucose BP BMI Age Diabetic
148 72 33.6 50 yes
85 66 26.6 31 no
183 64 23.3 32 yes
89 66 28.1 21 no
137 40 43.1 33 yes
116 74 25.6 30 no
78 50 31 26 yes
115 0 35.3 29 no
197 70 30.5 53 yes
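To make the classification idea concrete, here is a minimal sketch that predicts the Diabetic label for a new patient from the table above. The slides do not name a classification algorithm, so the 1-nearest-neighbour rule, the function name `predict_diabetic`, and the two example patients are assumptions chosen only for illustration. The features are used unscaled, exactly as they appear in the table.

```python
import math

# Training rows copied from the table above:
# (plasma glucose, BP, BMI, age) -> diabetic label
TRAINING = [
    ((148, 72, 33.6, 50), "yes"),
    ((85, 66, 26.6, 31), "no"),
    ((183, 64, 23.3, 32), "yes"),
    ((89, 66, 28.1, 21), "no"),
    ((137, 40, 43.1, 33), "yes"),
    ((116, 74, 25.6, 30), "no"),
    ((78, 50, 31, 26), "yes"),
    ((115, 0, 35.3, 29), "no"),   # BP of 0 kept exactly as in the slide
    ((197, 70, 30.5, 53), "yes"),
]


def predict_diabetic(features):
    """Return the label of the closest training row (1-nearest neighbour).

    This is an illustrative choice, not the method used in the slides.
    """
    nearest_features, nearest_label = min(
        TRAINING, key=lambda row: math.dist(features, row[0])
    )
    return nearest_label


# Two hypothetical patients (values invented for the example).
print(predict_diabetic((150, 70, 32, 45)))
print(predict_diabetic((88, 66, 27, 25)))
```

The predicted label is a category ("yes" or "no"), which is what separates a classification model from a regression model.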
ABOUT 90% OF THE AVAILABLE DATA IS SUITABLE FOR
UNSUPERVISED LEARNING ONLY.
UNSUPERVISED LEARNING HAS A
LIMITED NUMBER OF ALGORITHMS AND
LIMITED OPTIONS FOR TESTING.
ALGORITHMS WHICH CAN BE BUILT WITH
UNSUPERVISED LEARNING
• CLUSTERING MODEL
• ANOMALY DETECTION
CLUSTERING MODEL
wheel-base length width height curb-weight
88.6 168.8 64.1 48.8 2548
88.6 168.8 64.1 48.8 2548
94.5 171.2 65.5 52.4 2823
99.8 278 66.2 54.3 1465
99.4 176.6 66.4 54.3 2824
99.8 177.3 66.3 53.1 2507
105.8 192.7 71.4 55.7 2844
105.8 192.7 71.4 55.7 2954
105.8 192.7 71.4 55.9 3086
99.5 178.2 67.9 52 3053
ANOMALY DETECTION MODEL
wheel-base length width height curb-weight
88.6 168.8 64.1 48.8 2548
88.6 168.8 64.1 48.8 2548
94.5 171.2 65.5 52.4 2823
89 278 93 46.6 1465
99.4 176.6 66.4 54.3 2824
99.8 177.3 66.3 53.1 2507
105.8 192.7 71.4 55.7 2844
105.8 192.7 71.4 55.7 2954
105.8 192.7 71.4 55.9 3086
99.5 178.2 67.9 52 3053
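One simple way to see which row in the table above stands out is a z-score check, sketched below. The slides do not say which anomaly detection method they have in mind, so the z-score rule, the threshold of 2.5, and the function name `zscore_anomalies` are assumptions made for illustration. With only ten rows, a single value cannot have a z-score above 3, which is why the threshold is set lower than the common choice of 3.

```python
import statistics

# Rows copied from the table above:
# wheel-base, length, width, height, curb-weight
ROWS = [
    (88.6, 168.8, 64.1, 48.8, 2548),
    (88.6, 168.8, 64.1, 48.8, 2548),
    (94.5, 171.2, 65.5, 52.4, 2823),
    (89, 278, 93, 46.6, 1465),
    (99.4, 176.6, 66.4, 54.3, 2824),
    (99.8, 177.3, 66.3, 53.1, 2507),
    (105.8, 192.7, 71.4, 55.7, 2844),
    (105.8, 192.7, 71.4, 55.7, 2954),
    (105.8, 192.7, 71.4, 55.9, 3086),
    (99.5, 178.2, 67.9, 52, 3053),
]


def zscore_anomalies(rows, threshold=2.5):
    """Return indices of rows where any feature lies more than
    `threshold` population standard deviations from its column mean."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    flagged = []
    for index, row in enumerate(rows):
        z_scores = [(v - m) / s for v, m, s in zip(row, means, stds)]
        if any(abs(z) > threshold for z in z_scores):
            flagged.append(index)
    return flagged


print(zscore_anomalies(ROWS))  # zero-based indices of the flagged rows
```

On this table the check flags only the fourth row (length 278, width 93, curb-weight 1465), whose length, width, and curb-weight are all far from the other cars.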
K MEANS CLUSTERING

88.6 168.8 64.1 48.8 2548
88.6 168.8 64.1 48.8 2548
94.5 171.2 65.5 52.4 2823
89 278 93 46.6 1465
99.4 176.6 66.4 54.3 2824
99.8 177.3 66.3 53.1 2507
105.8 192.7 71.4 55.7 2844
105.8 192.7 71.4 55.7 2954
105.8 192.7 71.4 55.9 3086
99.5 178.2 67.9 52 3053
[Figure: K MEANS CLUSTERING, two scatter plots of curb weight (vertical axis, 0 to 3500) against wheel base (horizontal axis)]
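The grouping shown in the wheel base versus curb weight plots can be sketched with a small k-means loop. Everything beyond the data is an assumption for illustration: the choice of k = 2, the starting centroids (the first and last rows), and the function name `kmeans`. The features are left unscaled, so curb weight dominates the distance; in practice the columns would usually be standardised first.

```python
import math

# (wheel-base, curb-weight) pairs taken from the K MEANS CLUSTERING table
POINTS = [
    (88.6, 2548), (88.6, 2548), (94.5, 2823), (89, 1465), (99.4, 2824),
    (99.8, 2507), (105.8, 2844), (105.8, 2954), (105.8, 3086), (99.5, 3053),
]


def kmeans(points, init_centroids, max_iter=100):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its members, until assignments stop changing."""
    centroids = [tuple(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Assumed starting centroids: the first and last rows of the table.
labels, centroids = kmeans(POINTS, [POINTS[0], POINTS[-1]])
print(labels)
print(centroids)
```

With these starting points the loop settles after two passes, separating the lighter cars (including the 1465 outlier) from the heavier ones. Different starting centroids or a different k can give a different grouping.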
LINEAR REGRESSION
Algorithm to predict the value of a
dependent variable from the values of
one or more independent variables.
Applicable to supervised learning only.
Single independent variable
[Figure: scatter plot of grades (Y) against average hours of study (X)]
[Figure: the same scatter plot with a fitted straight line and its intercept C]
Y = mX + C
[Figure: grades against average hours of study, fitted with a quadratic curve]
Y = -0.0498X^2 + 1.074X + 1.817

Multiple independent variables
#cylinders (X1) Bore (X2) Horsepower (X3) peak-rpm (X4) Price (Y)
four 3.47 111 5000 13495
four 3.47 111 5000 16500
six 2.68 154 5000 16500
four 3.19 102 5500 13950
five 3.19 115 5500 17450
five 3.19 110 5500 15250
R squared of linear regression
[Figure: two scatter plots of the dependent variable (Y, 0 to 70) against the independent variable (0 to 7)]
X (hrs of study)  Y (grades)  YM (mean of Y)  Y-YM    (Y-YM)^2  YP (Y predicted)  YP-YM  (YP-YM)^2
1                 3           5.375           -2.375  5.64      3.39              -1.99  3.95
2                 4           5.375           -1.375  1.89      3.92              -1.46  2.13
3                 3           5.375           -2.375  5.64      4.45              -0.93  0.86
3                 5           5.375           -0.375  0.14      4.45              -0.93  0.86
5                 7           5.375           1.625   2.64      5.50              0.13   0.02
6                 7           5.375           1.625   2.64      6.03              0.66   0.43
8                 6           5.375           0.625   0.39      7.09              1.72   2.94
10                8           5.375           2.625   6.89      8.15              2.77   7.70
Mean / sum                    5.375                   25.88                              18.89
R squared = Σ(YP − YM)^2 / Σ(Y − YM)^2 = 18.89 / 25.88 ≈ 0.73 (about 73%)
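The R squared calculation can be reproduced in a few lines. The sketch below fits Y = mX + C to the hours-of-study and grades columns by ordinary least squares and then takes the ratio of the two sums of squares. The function names `fit_line` and `r_squared` are illustrative, and the slides do not state how their line was fitted, so least squares is an assumption (it does agree with the predicted values shown, to rounding).

```python
# Hours of study (X) and grades (Y) from the table
hours = [1, 2, 3, 3, 5, 6, 8, 10]
grades = [3, 4, 3, 5, 7, 7, 6, 8]


def fit_line(xs, ys):
    """Ordinary least squares for a single independent variable.
    Returns slope m and intercept C of Y = mX + C."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    c = y_mean - m * x_mean
    return m, c


def r_squared(xs, ys, m, c):
    """Sum of (YP - YM)^2 divided by sum of (Y - YM)^2."""
    y_mean = sum(ys) / len(ys)
    ss_regression = sum((m * x + c - y_mean) ** 2 for x in xs)
    ss_total = sum((y - y_mean) ** 2 for y in ys)
    return ss_regression / ss_total


m, c = fit_line(hours, grades)
r2 = r_squared(hours, grades, m, c)
print(f"Y = {m:.3f}X + {c:.3f}, R squared = {r2:.2f}")
```

Run on the table's data, this gives a slope of roughly 0.53, an intercept of roughly 2.86, and an R squared of about 0.73, in line with the 18.89 / 25.88 figure.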
What does a low value of R squared mean?
• The dependent variable depends only weakly on the independent variable.
• The fitted regression line does not describe the data well; a straight line may not be the right model.
• Predictions will not be very precise.
DECISION TREES
PRINCIPAL COMPONENT ANALYSIS
Scree plot
NEURAL NETWORK – DEEP LEARNING
