
PS1:

 head() shows the first few rows of a DataFrame (the top 5 by default). It gives a quick glimpse of the data and its structure.
 info() provides a concise summary of a DataFrame: the data type of each column, the number of non-null values, and the memory usage. It is useful for understanding the data types and identifying missing values.
 describe() generates descriptive statistics for the numerical columns of a DataFrame (mean, standard deviation, minimum, maximum, quartiles, etc.), giving a high-level overview of the distribution of the data.
 Are there features with a strongly skewed distribution? If so, what transformation could help? Yes.
o A log transformation helps make their distribution more symmetric (see the sketch after this list).
 It is always good to convert qualitative variables into quantitative ones (e.g., via one-hot encoding) before fitting machine learning models.
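A minimal sketch of these inspection and preprocessing steps; the file name and the columns price (skewed, numeric) and neighborhood (categorical) are illustrative, not taken from the problem set:

import numpy as np
import pandas as pd

df = pd.read_csv("data.csv")   # hypothetical file

print(df.head())      # first 5 rows by default
df.info()             # dtypes, non-null counts, memory usage
print(df.describe())  # mean, std, min, max, quartiles of the numeric columns

# log-transform a strongly right-skewed feature (log1p also handles zeros)
df["log_price"] = np.log1p(df["price"])

# convert a qualitative variable into quantitative dummy columns
df = pd.get_dummies(df, columns=["neighborhood"], drop_first=True)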

5. Split data into train and test

If we fit and evaluate our model on the same data we obtain overly optimistic results. For this
reason, we have to split the dataset into a train and test part.
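A minimal sketch with scikit-learn's train_test_split; the feature matrix X and target y are assumed to have been built above, and the test fraction is illustrative:

from sklearn.model_selection import train_test_split

# hold out 20% of the data for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)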
PS2

6. Train linear regression with stochastic gradient descent

The gradient descent scheme that we coded above is known as batch gradient descent. This is because at every step it uses all the training data to update the gradient vector and the parameter vector β. This makes it very slow when the training dataset contains many observations.

At the other end of the spectrum, we have stochastic gradient descent. Here, at every step, we randomly choose one observation (row) from the dataset to update the gradient vector and the parameter vector β. In other words, the updates at every step depend only on a single observation.

This makes each iteration much faster to compute than in the batch version. On the other hand, since the updates depend on a single observation, this algorithm is more "erratic" than the batch version, and it will never settle at the minimum point unless we "stop it".

This is why, in stochastic gradient descent, it is important to have a learning rate α that slowly decays to zero as the number of iterations increases. One common choice is to define the learning rate as follows.
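The specific schedule from the problem set is not reproduced here; the sketch below assumes the common decaying schedule α_t = t0 / (t + t1), with illustrative constants t0 and t1, and assumes X already contains a column of ones for the intercept:

import numpy as np

def learning_schedule(t, t0=5.0, t1=50.0):
    # learning rate that slowly decays to zero as the iteration count t grows
    return t0 / (t + t1)

def sgd_linear_regression(X, y, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    for epoch in range(n_epochs):
        for i in range(n):
            j = rng.integers(n)                    # one random observation
            xj, yj = X[j], y[j]
            gradient = 2 * xj * (xj @ beta - yj)   # gradient of the squared error for row j
            beta -= learning_schedule(epoch * n + i) * gradient
    return beta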
PS3

Cross-validation takes the train/test split one step further: it performs the split systematically, using a number of different train/test splits (e.g., 10 folds) and averaging the resulting error estimates.

In kNN, the larger the number of neighbors, the more parsimonious, i.e. the less flexible, the model is.

Adding features that are actually just random noise can hurt performance. This doesn't mean that we shouldn't do the rescaling; it means we should either use better estimators or apply some form of variable selection or penalization.

The optimal value depends on the specific data we try to model, and there cannot be a "best" hyper-parameter value overall (this is often called the no free lunch theorem).

As kNN relies on Euclidean distance to select the nearest points, it is always better to work with standardized data (each variable rescaled to be centered at 0 and have unit variance). That way, the same distance along each feature axis carries comparable weight for every variable.

When you leave some validation/test data out to evaluate your model, it is best practice not to use it for any type of estimation, to avoid a biased error estimate (otherwise, you have already "taken" some information from the validation data). This also holds for estimating the mean and standard deviation of the variables.

If you now think about estimating the error using cross-validation, you'll quickly realize that not
overfitting with the scaler gets a bit more complicated, as we need to perform the above procedure
separately for each fold, before fitting the model. That is when Pipelines come into play.
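A minimal sketch, assuming a kNN classifier and 10-fold cross-validation; inside the Pipeline the scaler is re-fitted on the training part of each fold, so the held-out fold never leaks into the mean/variance estimates:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("scaler", StandardScaler()),                  # centered at 0, unit variance
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
scores = cross_val_score(pipe, X, y, cv=10)        # X, y assumed defined above
print(scores.mean())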
PS4: Lasso and Ridge

Ridge regression penalizes the values of beta, forcing them to be smaller.

One important difference between ridge regression and the lasso: the lasso actually shrinks coefficients exactly to 0, while ridge only gets them closer to 0. So if you want to rule out coefficients, with ridge you still have to go in yourself and decide that they are probably useless, whereas the lasso will tell you "it is exactly 0, you can forget it".

One of the main advantages of the lasso is that it actually sets parameters to 0.

We use cross-validation and the grid search that we did before to find the optimal lasso tuning parameter.

By using grid search and cross-validation we can find the best tuning parameter alpha for the problem at hand (see the sketch below).
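A minimal sketch of such a grid search for the lasso; the grid of alpha values is illustrative:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

param_grid = {"alpha": np.logspace(-4, 2, 20)}
grid = GridSearchCV(Lasso(max_iter=10_000), param_grid, cv=5,
                    scoring="neg_root_mean_squared_error")
grid.fit(X_train, y_train)             # X_train, y_train assumed defined above
print(grid.best_params_["alpha"])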

Notice that we choose a very large fraction of test data because we want to see whether the Ridge
and Lasso regression can handle the high-dimensional setting.

Since ridge regression and lasso penalize the value of the coefficients 𝛽𝑖, it is important that all
features are on a similar scale

Why is the intercept included in the model?

The fit_intercept option in machine learning, especially in linear regression, determines whether an intercept term is included in the model (True) or not (False). The intercept is the point where the regression line crosses the y-axis. Setting it to True helps when the data does not pass through the origin, while setting it to False assumes the line should go through the origin.

The RMSE values give you an idea of the average difference between the predicted values and the
actual values, measured in the same units as your target variable. Lower RMSE values indicate better
predictive performance because they mean that, on average, the model's predictions are closer to
the actual values.

RMSE on training data: 67636.08948270889

RMSE on test data: 69641.95294583413

The model appears to have a slightly better performance on the training data (lower RMSE)
compared to the test data, suggesting some degree of overfitting. Further analysis is needed to
assess if adjustments to the model or additional data are required for better generalization to new,
unseen data.
PS5 LDA QDA

The goal is to transform this data set into the shape we are used to: a matrix in which each column is a feature and each row is an observation (a different digit).

It has a training accuracy of 0.94 and, not too surprisingly, a slightly lower test accuracy of 0.92, as usual. So it already does a pretty good job on this digits data set.

LDA computes the conditional mean within each class. To classify, it then computes the decision boundary based on these class means and the (shared) covariance.
PS6: Logistic regression and decision Tree

Normally, in regression, we check the mean squared error. With classification you don't use the mean squared error but, for example, accuracy. Of course, that is also included in scikit-learn.

If you set the lasso penalty alpha to zero, you don't penalize at all; since C is one over alpha, that corresponds to setting C to infinity or something very large. Vice versa, if you set alpha to infinity, or in this case C to zero, you get so much penalization that you basically get the empty model. Somewhere between those is where the interesting things happen.

Remember: a small C means a lot of penalization (the mean test scores are small, i.e. not good). A quick grid-search sketch follows.
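A minimal sketch of tuning C by grid search; the L1 penalty and the grid values are assumptions for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# small C = strong penalization, large C = weak penalization
param_grid = {"C": np.logspace(-4, 4, 9)}
grid = GridSearchCV(LogisticRegression(penalty="l1", solver="liblinear"),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)             # X_train, y_train assumed defined above
print(grid.best_params_, grid.best_score_)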

Cost Complexity pruning:

If something is pruned for a low value of alpha, it is also going to be pruned for all larger values of alpha (see the sketch below).
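A minimal sketch of cost complexity pruning with scikit-learn, with the data assumed as above:

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=0)
path = tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas          # increasing alphas give increasingly pruned trees

# one tree per alpha: anything pruned at a small alpha stays pruned at every larger alpha
pruned_trees = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
    for a in ccp_alphas
]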
PS7 Bagging and RF ( use Random Forest Classifier)

In principle, the idea of bagging and of random forests is to use base learners with the highest possible variance and the lowest possible bias, so that the bagging aggregation reduces the variance without increasing the bias too much. So in principle you should take trees with the largest possible variance, i.e. grow them very deep. But in practice, the random forest can still be sensitive, and the variance can explode if you grow the trees fully, down to a single observation in each leaf. So from a practical point of view, if you don't have an infinite number of trees, it also makes sense to tune the minimal-samples-per-leaf hyperparameter.

The only difference between bagging of tree classifiers and a random forest is that in bagging we don't take a random subset of features at each split; we consider all the features, as we would with a classical tree.

In general, when you have a classifier, it is good to check the different model diagnostics that we will get into. Here, what we ask you to do is to consider the best random forest model according to your grid search.

A random forest works as a kind of voting algorithm: each tree predicts a certain class, and the random forest looks at all the predicted classes and takes the majority class. But for each class you are trying to model, you can also look at the proportion of trees that predict this class. This is not a proper probability, but it gives you a kind of certainty of the model for that class. A lot of classifiers have this property, in scikit-learn and in general: logistic regression also gives a measure of certainty, a predicted value between zero and one, with the Bayes rule that if it is above 0.5 you classify as a one and if it is below 0.5 you classify as a zero. The same thing happens in the random forest in the binary case: if more than 50% of the trees predict one, the ratio is above 0.5 and you classify as one; if the ratio is below 0.5, you classify as zero. So you have the same kind of Bayes decision going on, and we can access it as sketched below.
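A minimal sketch of accessing these per-class scores; note that scikit-learn averages the trees' predicted class probabilities rather than counting hard votes, but the interpretation as a certainty score is the same:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)              # X_train, y_train assumed defined above

proba = rf.predict_proba(X_test)      # one certainty score per class, rows sum to 1
pred = rf.predict(X_test)             # in the binary case: class 1 whenever proba[:, 1] > 0.5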

In random forests you bootstrap your data, so you have different resampled data sets, and on each new data set you construct a classification tree as you did last time. The trick in random forests, in order to reduce the covariance between these trees, is that, as you have seen in class, each split only considers a random subset of the features, and only this subset is used to decide which variable to split on at each node. The number of features selected randomly at each split can be specified to the random forest using the max_features option. The default for classification, also in scikit-learn, is to take the square root of the number of features ("sqrt").

PS9: SVM (kernels used in SVM) and Boosting (RBF: radial basis function; kernel trick; learning rate λ)

 SVMs are sensitive to feature scaling, so it is good practice to scale the features before fitting an SVM.
 If the best linear decision boundary does not fit the data very well, we need more flexibility. We can obtain it with a transformation of the features, e.g. polynomial features.
 One approach to handling nonlinear datasets is to add more features, such as polynomial
features; in some cases this can result in a linearly separable dataset.
 C has two interpretations:
o C is the inverse of the regularization strength: high C, low regularization -> flexible model; low C, high regularization -> simple model
o C is the inverse of the budget: high C, low budget -> we accept only a few misclassified points; low C, high budget -> we accept many misclassified points
 The effects of the gamma and C tuning parameters: both control how flexible the classifier is.
o Feature transformations make a linear classifier much more flexible.
 So far, we tried to solve the problem of a nonlinear decision boundary by performing explicit feature transformations (i.e., adding polynomial features or radial basis functions). Another way to do this is by using the kernel trick (see the sketch after this list).
o The kernel trick makes it possible to get the same result as if you performed the feature transformations, without actually having to add new features.
o Remark: using the kernel trick is advisable only for small to medium-sized datasets.
o It scales well with the number of features, especially with sparse features (i.e., when each instance has few nonzero features).
o When the number of observations gets large (e.g., hundreds of thousands of instances), using the kernel trick is very slow. In such a case, one can still use the approach above for a more reasonable fitting time, i.e., first transforming the features and then applying a linear SVM.
 RBF (radial basis function): the RBFSampler has a tuning parameter gamma.
o gamma controls how "local" the predictions are; you can think of gamma as the inverse of k in kNN.
o gamma is a hyperparameter that controls the flexibility of the feature mapping, i.e. how local the predictions will be when you classify on the new feature set: a large value of gamma gives a more local influence of your initial features, while a low gamma value gives a wider influence of your initial features.
o A low value of gamma (e.g., 0.001) gives a model that is not very flexible.
 If we decrease C we increase the regularization, making the fit less flexible; this holds both for the RBF and for the polynomial kernel.
o For the RBF kernel, the only additional parameter is gamma.
o For the polynomial kernel, we have degree and coef0.
 The best model is the one with the largest test accuracy.
 Boosting refers to any Ensemble method that can combine several weak learners into a
strong learner. The general idea of most boosting methods is to train predictors sequentially,
each trying to correct its predecessor.
 Gradient Boosting works by sequentially adding predictors to an ensemble, each one
correcting its predecessor. This method tries to fit the new predictor to the residual errors
made by the previous predictor
 The learning_rate λ hyperparameter scales the contribution of each tree. If you set it to a
low value, such as 0.01 or 0.001, you will need more trees in the ensemble to fit the training
set, but the predictions will usually generalize better. This is a regularization technique called
shrinkage.
 A better test score could be achieved using a cross-validated grid search to find the best
values for the hyperparameters max_depth, learning_rate and n_estimators, as
usual.
 Kernel: any valid kernel implicitly represents a dot product between two transformed vectors. In other words, for any kernel k, one can find a transformation of the variables Φ(x) (i.e., a feature transformation) such that k(x1, x2) = Φ(x1)'Φ(x2).
o This result helps a lot because one can represent a very high-dimensional feature transformation without having to compute Φ(x) explicitly (more about this in the solutions).
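A minimal sketch of the two approaches mentioned in the list above: the kernel trick (SVC) versus an explicit, approximate RBF feature transformation followed by a linear SVM; the C and gamma values are illustrative:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.kernel_approximation import RBFSampler

# kernel trick: fine for small to medium-sized datasets
svm_rbf = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="rbf", C=1.0, gamma=0.1)),
])

# explicit (approximate) RBF features + linear SVM: faster when n is very large
svm_rbf_approx = Pipeline([
    ("scaler", StandardScaler()),
    ("rbf_features", RBFSampler(gamma=0.1, random_state=0)),
    ("linear_svc", LinearSVC(C=1.0)),
])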

The idea of the support vector classifier is to look at your feature space, look at which class each point belongs to, and find a class-separating hyperplane.

Logistic regression has linear decision boundaries; if you create polynomial features, then in the original feature space the decision boundary of logistic regression will generally not be linear anymore.

The motivation is the opposite of bagging. In bagging we have a lot of estimators with a large variance but a small bias, and we average them together to reduce the variance without increasing the bias too much. Boosting is the opposite idea: we have what are called weak classifiers, classifiers that do not learn much, so they have a large bias but a low variance, and the goal is to reduce the bias of these weak learners step by step without increasing the variance of the ensemble too much.
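A minimal sketch of a cross-validated grid search for gradient boosting, as suggested in the list above; the grid values are illustrative:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [1, 2, 3],          # shallow trees = weak learners
    "learning_rate": [0.01, 0.1],    # shrinkage
    "n_estimators": [100, 500],
}
grid = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=5)
grid.fit(X_train, y_train)           # X_train, y_train assumed defined above
print(grid.best_params_, grid.score(X_test, y_test))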
PS10: NN; dense NN, fully connected feed-forward NN, or multi-layer perceptron; Keras
Basics of DNNs
 The pixel intensities are represented as integers (from 0 to 255). Since we are going to train the NN using gradient descent, we scale the input features so that they lie in the [0, 1] interval. Fill in the ?? (see the sketch after this list).
 Since the dataset has no validation set, let us create one. Split the training set so that you
have 5,000 observations in the validation set
 The stochastic gradient descent algorithm handles one mini-batch of observations at a time
(e.g., 32 observations), and it goes through the whole training set. Each pass is called an
epoch. Fill in the ??.
 Since training neural networks is computationally intensive and time consuming, it is
important to be able to save and restore trained models. Keras makes this very easy.
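A minimal sketch of these preparation steps, assuming a Fashion-MNIST-style dataset of 28x28 images with pixel values in [0, 255]; the file name is illustrative:

from tensorflow import keras

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()

# scale pixel intensities from [0, 255] to [0, 1]
X_train_full, X_test = X_train_full / 255.0, X_test / 255.0

# keep 5,000 observations aside as a validation set
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

# after training: save and restore the model
# model.save("my_keras_model.keras")
# model = keras.models.load_model("my_keras_model.keras")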

Controlling dense neural networks


 What if training lasts several hours? This is quite common, especially when training on large
datasets. In this case, you should not only save your model at the end of training, but also
save checkpoints at regular intervals during training, to avoid losing everything if your
computer crashes. You can do so by using the callbacks argument in the fit() function.
This argument accepts a list of callbacks. A callback is an object that can perform actions
at various stages of training (e.g. at the start or end of an epoch, before or after a single
batch, etc) (https://keras.io/api/callbacks). Fill in the ?? to save the checkpoints of your
model at the end of each epoch.
 Cross-entropy loss: as the number of epochs increases, both the training and validation loss decrease; the first epochs reduce the cross-entropy loss quite significantly.
 Accuracy: as the number of epochs increases, both training and validation accuracy increase.
 With very many parameters (an over-parameterized model), the data can be overfit.
 Avoiding overfitting: a useful strategy to avoid overfitting with neural networks is early stopping, i.e., interrupting the training when there is no improvement on the validation set for a number of epochs. When this happens, early stopping allows you to roll back to the best model. Notice that early stopping is also implemented as a Keras callback (see the sketch after this list).
o The number of epochs can be set to a large value since training will stop automatically when there is no more progress. In this case, there is no need to restore the best saved model because the callback will keep track of the best weights and restore them for you at the end of training.
 In neural networks there is high risk of overfitting the training data. Therefore, it is
important to apply some form of regularization. A very popular one is the dropout which
consists of dropping randomly a given percentage of nodes in some layers (during training
only). To implement dropout using Keras, you can use the keras.layers.Dropout layer
 Learning rate and optimization : Finding a good learning rate is very important. If you set it
much too high, training may diverge (as we discussed in “Gradient Descent”). If you set it too
low, training will eventually converge to an optimum, but it will take a very long time. One of
the easiest options is to use a constant learning rate. Fill in the ?? by setting the learning rate
to 0.01.
o Another issue with training neural networks is the presence of local loss minima, as
the optimization problem with so many weights is non-convex. The gradient descent
procedure can then get "stuck" in these local minima, although better solutions
might exist. On top of tuning the learning rate, adding momentum during the
gradient descent steps can help avoiding those local minima. The simplest form of
momentum can be achieved by setting a positive momentum value in
keras.optimizers.SGD. However, more complex momentum methods for
stochastic gradient descent have since then been developed, like the now very
popular Adam moment-based stochastic gradient descent algorithm.
 The optimal learning rate λ depends on the data.
 If the model is too complex, the training loss keeps decreasing while the validation loss starts increasing; the two curves diverge, so we stop the training.
 With restore_best_weights=False, the current (last-epoch) network is kept; with restore_best_weights=True, the weights of the best epoch are restored.
 Overfitting occurs when the model is too flexible.
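A minimal sketch pulling these pieces together; the architecture, dropout rate, patience and file name are illustrative, and X_train, y_train, X_valid, y_valid are assumed to be prepared as earlier:

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.Dropout(0.2),                     # dropout regularization (training only)
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])

# constant learning rate of 0.01, plus momentum to help escape local minima
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
              metrics=["accuracy"])

callbacks = [
    keras.callbacks.ModelCheckpoint("checkpoint.keras", save_best_only=True),  # checkpoint at each epoch end
    keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True),     # early stopping
]
history = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_valid, y_valid), callbacks=callbacks)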
PS 11: Continuation of NN and CNN
Sigmoid Activation Function
PS10

1.2 Multi-layer perceptron (dense neural network)


We are now ready to build the architecture for our neural network. We want two hidden layers with 300 and 100 nodes respectively and the "relu" activation function. For the final layer we use the "softmax" activation function. Why do we need to "flatten" the inputs? Remember that we have matrices as inputs, while a fully connected neural network (multi-layer perceptron) expects one vector of covariates as input. So this Flatten layer will simply convert our 28 by 28 image into a single vector, as we did, for example, when we worked with LDA on the digits data set. The Flatten layer takes just one argument, the input shape.
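A minimal sketch of this architecture with the Keras Sequential API, assuming 28x28 inputs and 10 classes:

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),    # 28x28 matrix -> vector of 784 values
    keras.layers.Dense(300, activation="relu"),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),  # one probability per class
])
model.summary()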

Use the ReLU activation in the hidden layers and softmax in the output layer.

Use a sequential model:

Weights are not initialized to zero (otherwise we would get zeros everywhere and the network would not learn).




Learning rate and optimization



Learning rate: the default value is 0.01.

With a high learning rate (0.8), the step is too big and the loss completely explodes: the loss is huge because each update takes too large a step and overshoots the region where it is supposed to optimize. It is not a good choice of learning rate.

With a very small learning rate (0.0001), the loss still decreases, but extremely slowly. Again, the optimal value depends on the type of data and on the complexity of your neural network: the more weights the network has, the lower you should choose the learning rate. So it really depends.

CNN
Convolutional neural networks allow each layer to be translation invariant and to take the neighbouring dependencies between pixels into account. We expect this to be important for images, because the information carried by a pixel depends on its neighbours.

The only difference between a grayscale image and a color image is that the color image has three channels: each pixel has three values (red, green and blue), and by combining them the computer can represent any color. So instead of a 28 by 28 image, a color image of the same size will be 28 by 28 by 3, with one value per pixel for red, green and blue.

Kernel: the kernel size is a hyperparameter; you can choose the width and the height of the kernel. But the kernel necessarily has the same depth as the input image, because you want to convolve it over all the channels.

Each kernel necessarily has the same number of channels as the input image, but you can change the width and the height of each of these kernels in the convolution.

Example: using 64 different kernels of size 5 × 5 gives an output that has 64 channels.

The number of output channels is just the number of filters that we use (see the sketch below).
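A minimal sketch of a convolutional layer matching this example: a 28x28x3 color input and 64 kernels of size 5x5, where each kernel automatically has depth 3 to match the input channels; the pooling and output layers are illustrative:

from tensorflow import keras

cnn = keras.Sequential([
    keras.layers.Conv2D(filters=64, kernel_size=5, activation="relu",
                        padding="same", input_shape=[28, 28, 3]),   # output has 64 channels
    keras.layers.MaxPooling2D(pool_size=2),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])
cnn.summary()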
