
Materials Informatics

Practical Concepts in ML
Rohit Batra
rbatra@iitm.ac.in

LECTURE 2 (Afternoon, June 7, 2024)


Machine learning and Artificial Intelligence in Materials Science

Use the following concepts to improve your ML model performance


• Overfitting
• Regularization
• Normalization
• Feature selection (dimensionality reduction, curse of dimensionality)
• Ensemble models (e.g., random forest)
• Generally, simple methods generalize better on smaller datasets (fewer datapoints or fewer features)

Other important concepts in ML

• Error metrics (mean square error, R² score)


• Train, Validation, Test error
• Cross-validation
• Overfitting
• Regularization
• Normalization
• Feature selection (dimensionality reduction, curse of dimensionality)



Polynomial Regression
We assume that a linear model can explain the data

1-Dimensional example (quadratic model):

$\hat{y}_i = f(x_i) = w_0 + w_1 x_i + w_2 x_i^2 = \sum_{j=0}^{2} w_j x_i^j$

Written as a linear model: $\hat{y}_i = \boldsymbol{w}^{\top}\boldsymbol{x}_i$, with $\boldsymbol{x}_i = (1, x_i, x_i^2)$ and $\boldsymbol{w} = (w_0, w_1, w_2)$

[Figure: scatter plot of y vs. x with a fitted quadratic curve]

How to find the parameters ($\hat{\boldsymbol{w}}$) of the "line" that best fits the data?
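A minimal sketch (not from the slides) of fitting the quadratic model above by least squares with NumPy; the synthetic data and coefficients are made up for illustration.

```python
import numpy as np

# Synthetic 1-D data (made up for illustration): y = 1 + 2x - 0.5x^2 + noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.3, size=x.size)

# Build the feature vector x_i = (1, x_i, x_i^2) for every datapoint
X = np.column_stack([np.ones_like(x), x, x**2])

# Least-squares estimate of w = (w0, w1, w2)
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ w_hat
print("Fitted coefficients:", w_hat)
```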



Error Metrics

1-Dimensional example

Error: $e_i = y_i - \hat{y}_i$

Mean square error: $\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} (e_i)^2$

Coefficient of determination: $R^2 = 1 - \frac{\sum_{i=1}^{N} (e_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$, where $\bar{y}$ is the mean of $\boldsymbol{y}$

[Figure: scatter plot of y vs. x showing the error between a datapoint and the fitted line]
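A small sketch computing both error metrics directly from the definitions above with NumPy; y_true and y_pred are placeholder arrays for illustration.

```python
import numpy as np

# Placeholder true and predicted values (for illustration only)
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.3])

e = y_true - y_pred                      # errors e_i = y_i - y_hat_i
mse = np.mean(e**2)                      # mean square error
r2 = 1.0 - np.sum(e**2) / np.sum((y_true - y_true.mean())**2)  # R^2 score

print(f"MSE = {mse:.4f}, R^2 = {r2:.4f}")
# scikit-learn provides the same metrics as
# sklearn.metrics.mean_squared_error and sklearn.metrics.r2_score
```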



Cross-Validation (CV)

Example: four-fold CV

All training data is split into four folds. In each CV iteration, one fold is held out for validation and the remaining folds are used for CV training:

Iteration 1: Validation  | CV training | CV training | CV training
Iteration 2: CV training | Validation  | CV training | CV training
Iteration 3: CV training | CV training | Validation  | CV training
Iteration 4: CV training | CV training | CV training | Validation

Use validation errors for hyper-parameter estimation

Final model training using hyper-parameters from CV
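A rough sketch of the four-fold CV workflow above using scikit-learn; Ridge regression and the candidate λ values are stand-ins chosen only to make the example concrete.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Toy dataset standing in for "all training data"
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

cv = KFold(n_splits=4, shuffle=True, random_state=0)   # four-fold CV

# Use validation errors to estimate the hyper-parameter (here: lambda/alpha)
candidate_lambdas = [1e-3, 1e-2, 1e-1, 1.0, 10.0]
cv_mse = {
    lam: -cross_val_score(Ridge(alpha=lam), X, y, cv=cv,
                          scoring="neg_mean_squared_error").mean()
    for lam in candidate_lambdas
}
best_lambda = min(cv_mse, key=cv_mse.get)

# Final model training on all training data using the CV-selected hyper-parameter
final_model = Ridge(alpha=best_lambda).fit(X, y)
print("Selected lambda:", best_lambda)
```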



Overfitting

How to resolve overfitting?


Use regularization and cross-validation



Regularization
Change the cost (or loss) function from

$\frac{1}{N}\sum_{i=1}^{N} (e_i)^2$

to

$\frac{1}{N}\sum_{i=1}^{N} (e_i)^2 + \lambda \lVert \boldsymbol{w} \rVert^2$

Solution that minimizes the new cost function: $(X^{\top}X + \lambda I)\,\hat{\boldsymbol{w}} = X^{\top}\boldsymbol{y}$

Caution: other variations are possible depending on what is regularized!

We impose constraints on $\boldsymbol{w}$ to avoid over-fitting

How to decide $\lambda$?
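A minimal sketch of the regularized least-squares (ridge) solution above, solved in closed form with NumPy; the toy data and the choice λ = 0.1 are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))               # 50 samples, 3 features (toy data)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

lam = 0.1                                  # regularization strength lambda
d = X.shape[1]

# Solve (X^T X + lambda * I) w_hat = X^T y
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print("Regularized weights:", w_hat)
# lambda itself is typically chosen by cross-validation (see the CV slide)
```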



Curse of dimensionality
https://medium.com/analytics-vidhya/the-curse-of-dimensionality-and-its-cure-f9891ab72e5c

More features → higher data sparsity

More features → more training samples needed
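A tiny numeric sketch (not from the slide) of why sparsity grows with dimension: to cover a fixed 10% of a unit hypercube's volume, the edge length of the required neighbourhood approaches 1 as the number of features grows.

```python
# Edge length of a hypercube neighbourhood that covers 10% of the
# unit hypercube's volume in d dimensions: edge = 0.1 ** (1/d)
for d in [1, 2, 10, 100]:
    edge = 0.1 ** (1.0 / d)
    print(f"d = {d:3d}: neighbourhood edge length = {edge:.3f}")
# In high dimensions a "local" neighbourhood spans almost the whole feature
# range, i.e. the data become sparse and many more samples are needed.
```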



Feature Selection
Example: iterative feature selection

Iteration 1: feature set {E1, E2, E3, E4} → model training → selected feature set
Iteration 2: feature set {E1, E2, E3} → model training → selected feature set

Continue iterating as long as the error decreases…

Will this result in an optimal set of features?
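A hedged sketch of an iterative, greedy feature-elimination loop similar in spirit to the example above, using scikit-learn; the dataset, base model, and stopping rule are illustrative choices, not the slide's exact procedure. Because the search is greedy, it is not guaranteed to return the globally optimal feature subset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=4, n_informative=2,
                       noise=5.0, random_state=0)
features = list(range(X.shape[1]))        # start from the full feature set

def cv_error(cols):
    """Cross-validated MSE of a linear model using only the given columns."""
    scores = cross_val_score(LinearRegression(), X[:, cols], y,
                             cv=4, scoring="neg_mean_squared_error")
    return -scores.mean()

best_error = cv_error(features)
improved = True
while improved and len(features) > 1:
    improved = False
    # Try dropping each remaining feature; keep the drop that helps most
    trial = [(cv_error([f for f in features if f != drop]), drop)
             for drop in features]
    err, drop = min(trial)
    if err < best_error:                  # continue only while error decreases
        best_error, improved = err, True
        features.remove(drop)

print("Selected features:", features, "CV MSE:", round(best_error, 2))
```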
Support Vector Regression

• A linear model can explain the data
• Cost function includes regularization term and slack variables, subject to constraints
• Kernel trick is used to learn non-linear models

Good resources:
https://in.mathworks.com/help/stats/understanding-support-vector-machine-regression.html
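A brief sketch of kernel SVR with scikit-learn (along the lines of the MathWorks reference above, but in Python); the RBF kernel and the hyper-parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Normalization matters for SVR; epsilon sets the slack "tube" width,
# C weights the slack penalty, and the RBF kernel gives a non-linear model.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.5))
model.fit(X_train, y_train)
print("Test R^2:", round(model.score(X_test, y_test), 3))
```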



Ensemble Models

• Multiple weak learners (models) together result in a more accurate prediction
• Methods to build ensemble models
• Bootstrap
• Boosting
• Combining models of different nature

Why do ensemble methods work?

Source: https://towardsdatascience.com/what-are-ensemble-methods-in-machine-learning-cac1d17ed349
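A small sketch (not from the slides) of one intuition behind ensembles: averaging several bootstrap-trained weak learners usually reduces variance compared with a single learner. BaggingRegressor and the decision-tree base model are illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=8, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

single = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)

# Bootstrap: each weak learner sees a resampled copy of the training data,
# and the ensemble averages their predictions.
bagged = BaggingRegressor(DecisionTreeRegressor(max_depth=4),
                          n_estimators=50, random_state=0).fit(X_train, y_train)

print("Single tree  R^2:", round(single.score(X_test, y_test), 3))
print("Bagged trees R^2:", round(bagged.score(X_test, y_test), 3))
```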



Random Forest

• Ensemble of decision trees


• Bootstrapping for better accuracy
• Split strategy based on reduction in MSE
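A minimal sketch of a random forest regressor with scikit-learn; the dataset and hyper-parameters are placeholders. Splits are chosen by the reduction in MSE via the "squared_error" criterion (scikit-learn's default for regression trees).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ensemble of decision trees, each trained on a bootstrap sample;
# splits are chosen to maximize the reduction in MSE (squared_error criterion).
forest = RandomForestRegressor(n_estimators=200, criterion="squared_error",
                               random_state=0)
forest.fit(X_train, y_train)
print("Test R^2:", round(forest.score(X_test, y_test), 3))
```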
