
Materials Informatics

Practical Concepts in ML
Rohit Batra

LECTURE 2 (Afternoon, June 7, 2024)

Machine learning and Artificial Intelligence in Materials Science

Use the following concepts to improve your ML model performance

• Overfitting
• Regularization
• Normalization
• Feature selection (dimensionality reduction, curse of dimensionality)
• Ensemble models (e.g., random forest)
• Generally, simple methods generalize better on smaller datasets (fewer data points or fewer features)
Other important concepts in ML

• Error metrics (mean square error, R2 score)

• Train, Validation, Test error
• Cross-validation
• Overfitting
• Regularization
• Normalization
• Feature selection (dimensionality reduction, curse of dimensionality)

Rohit Batra, Materials Informatics Lab, MME, IIT Madras 2

Polynomial Regression
We assume that a model that is linear in the parameters 𝒘 can explain the data
1-Dimensional example
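Quadratic model: $\hat{y} = f(\boldsymbol{x}) = w_0 + w_1 x_1 + w_2 x_1^2 = \sum_{j=0}^{2} w_j x_1^j$

Linear model: $\hat{y} = \boldsymbol{w}^T \boldsymbol{x}$

$\boldsymbol{x} = (x_0, x_1, x_1^2)$, with $x_0 = 1$

$\boldsymbol{w} = (w_0, w_1, w_2)$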

+ of the “line”
How to find the parameters (𝒘) that best fit the data?
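One common answer is ordinary least squares. The sketch below is only an illustration, not the lecture's own code: it assumes NumPy, a noiseless synthetic dataset, and made-up coefficients (1, 2, -3). It builds the feature vector (1, x1, x1^2) for each sample and solves for 𝒘.

```python
import numpy as np

# Synthetic 1-D data from a known quadratic (coefficients are illustrative).
x1 = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x1 - 3.0 * x1**2

# Each row is the feature vector (x_0, x_1, x_1^2) with x_0 = 1.
X = np.column_stack([np.ones_like(x1), x1, x1**2])

# Least-squares estimate of w: minimizes the sum of squared errors.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predictions of the fitted model, y_hat = X w_hat.
y_hat = X @ w_hat
```

The model is quadratic in x1 but linear in 𝒘, which is why a linear solver is enough.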


Error Metrics

1-Dimensional example
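Error: $e_i = y_i - \hat{y}_i$

Mean square error: $\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (e_i)^2$

Coefficient of determination: $R^2 = 1 - \frac{\sum_{i=1}^{N} (e_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$, where $\bar{y}$ is the mean of $\boldsymbol{y}$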
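As a small worked example of these two metrics, here is a sketch assuming NumPy. The helper names `mean_square_error` and `coefficient_of_determination` and the four data points are invented for illustration.

```python
import numpy as np

def mean_square_error(y, y_hat):
    """Average of the squared errors e_i = y_i - y_hat_i."""
    e = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return float(np.mean(e**2))

def coefficient_of_determination(y, y_hat):
    """R^2 = 1 - (sum of squared errors) / (sum of squared deviations from the mean)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Illustrative true values and predictions.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

mse = mean_square_error(y, y_hat)
r2 = coefficient_of_determination(y, y_hat)
```

A model that always predicts the mean of 𝒚 gets R2 = 0, and a perfect model gets R2 = 1.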


Cross-Validation (CV)
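Four-fold CV

All training data

CV iteration 1: Validation | CV training | CV training | CV training
CV iteration 2: CV training | Validation | CV training | CV training
CV iteration 3: CV training | CV training | Validation | CV training
CV iteration 4: CV training | CV training | CV training | Validation

Use validation errors for hyper-parameter estimation

Final model training using hyper-parameters from CV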
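The four-fold procedure can be sketched as follows. This is a hedged illustration, not the lecture's implementation: it assumes NumPy, treats the polynomial degree as the hyper-parameter, and uses synthetic data. The helpers `kfold_indices` and `cv_error` are hypothetical names.

```python
import numpy as np

def kfold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and split them into n_folds disjoint folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_folds)

def cv_error(x, y, degree, folds):
    """Mean validation MSE over the folds for a polynomial of the given degree."""
    errors = []
    for k, val_idx in enumerate(folds):
        # Every fold except fold k is used for CV training.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        y_pred = np.polyval(coeffs, x[val_idx])
        errors.append(np.mean((y[val_idx] - y_pred) ** 2))
    return float(np.mean(errors))

# Synthetic noisy quadratic data (illustrative only).
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 40)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.normal(size=x.size)

folds = kfold_indices(len(x), n_folds=4)
cv_errors = {d: cv_error(x, y, d, folds) for d in range(1, 7)}

# Pick the hyper-parameter with the lowest validation error,
# then train the final model on all of the training data.
best_degree = min(cv_errors, key=cv_errors.get)
final_coeffs = np.polyfit(x, y, best_degree)
```

The test set, if there is one, is never touched during this loop. It is used only to report the error of the final model.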



How to resolve overfitting?

Use regularization and cross-validation


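Change cost (or loss) function from $\sum_{i=1}^{N} (e_i)^2$

to $\sum_{i=1}^{N} (e_i)^2 + \lambda \lVert \boldsymbol{w} \rVert^2$

Solution that minimizes the new cost function: $(X^T X + \lambda I)\hat{\boldsymbol{w}} = X^T \boldsymbol{y}$

Caution: other variations are possible depending on what is regularized!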

We impose constraints on 𝒘 to avoid over-fitting

How to decide 𝜆?
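To make the effect of 𝜆 concrete, the sketch below solves the regularized normal equations for several values of 𝜆. It assumes NumPy, synthetic data, and a deliberately over-flexible degree-8 polynomial. It also penalizes every component of 𝒘, including the intercept, which is only one of the possible variations. `ridge_fit` is a hypothetical helper name.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for w."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Small noisy dataset from a quadratic (illustrative only).
rng = np.random.default_rng(0)
x1 = np.linspace(-1.0, 1.0, 15)
y = 1.0 + 2.0 * x1 - 3.0 * x1**2 + 0.2 * rng.normal(size=x1.size)

# Degree-8 polynomial features: columns are x1^0, x1^1, ..., x1^8.
X = np.vander(x1, N=9, increasing=True)

lambdas = (1e-6, 1e-3, 1e-1, 10.0)
weights = {lam: ridge_fit(X, y, lam) for lam in lambdas}

# Larger lambda shrinks the weight vector toward zero.
norms = {lam: float(np.linalg.norm(w)) for lam, w in weights.items()}
```

In practice 𝜆 is a hyper-parameter, so one reasonable way to choose it is to compare validation errors across candidate values using cross-validation.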


Curse of dimensionality

More features → Higher data sparsity

More features → More training samples needed
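A quick numerical illustration of the sparsity claim, assuming NumPy and uniformly random points in a unit hypercube. The helper `occupied_fraction` and the choice of 4 bins per feature are arbitrary. The sketch divides each feature axis into bins and measures what fraction of the resulting cells contain at least one of 1000 samples.

```python
import numpy as np

def occupied_fraction(n_samples, n_features, bins=4, seed=0):
    """Fraction of the bins**n_features grid cells that contain at least one sample."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_samples, n_features))      # uniform points in [0, 1)^d
    cells = np.floor(X * bins).astype(int)       # grid-cell index along each feature
    n_occupied = len({tuple(c) for c in cells})  # number of distinct occupied cells
    return n_occupied / bins**n_features

# Same number of samples, increasing number of features.
fractions = {d: occupied_fraction(1000, d) for d in (1, 2, 5, 10)}
```

With a fixed number of samples, the number of cells grows as bins^d, so the covered fraction of feature space drops sharply as features are added.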


Feature Selection
Iteration 1: Feature set → Model training (error E1) → Selected feature set

Iteration 2: Feature set → Model training (error E2) → Selected feature set

Continue iterating as long as the error keeps decreasing…

Will this result in the optimal set of features?
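One way to realize this loop is greedy forward selection, sketched below. The assumptions are NumPy, a least-squares model, a single held-out validation split, and synthetic data in which only features 1 and 4 matter. The function names are hypothetical. Because the search is greedy, it evaluates only a small part of all feature subsets and is not guaranteed to find the optimal one.

```python
import numpy as np

def validation_mse(X_tr, y_tr, X_va, y_va, features):
    """Fit least squares (with intercept) on the chosen columns; return validation MSE."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, features]])
    A_va = np.column_stack([np.ones(len(X_va)), X_va[:, features]])
    w, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((y_va - A_va @ w) ** 2))

def forward_selection(X_tr, y_tr, X_va, y_va):
    """Add one feature per iteration while the validation error keeps decreasing."""
    selected = []
    remaining = list(range(X_tr.shape[1]))
    # Baseline error: predict the training mean for every validation sample.
    best_error = float(np.mean((y_va - y_tr.mean()) ** 2))
    while remaining:
        errors = {j: validation_mse(X_tr, y_tr, X_va, y_va, selected + [j])
                  for j in remaining}
        j_best = min(errors, key=errors.get)
        if errors[j_best] >= best_error:
            break  # no candidate lowers the error, so stop
        selected.append(j_best)
        remaining.remove(j_best)
        best_error = errors[j_best]
    return selected, best_error

# Synthetic data: 6 features, only features 1 and 4 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=200)

selected, error = forward_selection(X[:150], y[:150], X[150:], y[150:])
```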
Support Vector Regression

• A linear model can explain the data

• Cost function includes a regularization term and slack variables, subject to constraints
• Kernel trick is used to learn non-linear models
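For reference, the bullets above correspond to the standard textbook ε-insensitive formulation of linear SVR, written out below. The symbols C (penalty weight), ε (tube half-width), b (intercept) and the slack variables ξ_i, ξ_i^* do not appear on the slide and follow the usual convention. The lecture may use a different variant.

```latex
\begin{aligned}
\min_{\boldsymbol{w},\, b,\, \xi,\, \xi^*} \quad
  & \tfrac{1}{2} \lVert \boldsymbol{w} \rVert^2
    + C \sum_{i=1}^{N} \left( \xi_i + \xi_i^* \right) \\
\text{subject to} \quad
  & y_i - \boldsymbol{w}^T \boldsymbol{x}_i - b \le \varepsilon + \xi_i, \\
  & \boldsymbol{w}^T \boldsymbol{x}_i + b - y_i \le \varepsilon + \xi_i^*, \\
  & \xi_i \ge 0, \quad \xi_i^* \ge 0, \qquad i = 1, \dots, N
\end{aligned}
```

The first term is the regularization term and the sum penalizes points that fall outside the ε-tube. In the dual problem the data enter only through inner products of pairs of samples, so replacing those inner products with a kernel function gives a non-linear model.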

Good resources:


Ensemble Models

• Multiple weak learners (models) together result in a more accurate prediction
• Methods to build ensemble models
  • Bootstrap
  • Boosting
  • Combining models of different nature

Why do ensemble methods work?
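One part of the answer is variance reduction through averaging. The sketch below is an idealized illustration rather than a real ensemble. It assumes NumPy and models each weak learner as the true signal plus independent noise. That is a strong assumption, since the errors of real models are usually correlated, which reduces the gain.

```python
import numpy as np

rng = np.random.default_rng(0)

# True signal that every weak learner tries to predict (illustrative).
y_true = np.sin(np.linspace(0.0, 2.0 * np.pi, 200))

# 25 weak learners, each modeled as truth + independent Gaussian noise (assumption).
n_models = 25
predictions = y_true + rng.normal(scale=0.5, size=(n_models, y_true.size))

# Error of a single learner versus error of the averaged ensemble prediction.
single_mse = float(np.mean((predictions[0] - y_true) ** 2))
ensemble_mse = float(np.mean((predictions.mean(axis=0) - y_true) ** 2))
```

Under the independence assumption the noise variance of the average falls roughly as 1/n_models, so the ensemble error is much lower than that of any single learner.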



Random Forest

• Ensemble of decision trees

• Bootstrapping for better accuracy
• Split strategy based on reduction in MSE
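The two ingredients named above can be sketched for a single feature as follows. This assumes NumPy and a toy dataset, and `best_split` is a hypothetical helper. A real random forest repeats the split search recursively over many features and trees.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, mse_reduction) for the best split on one feature."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    parent_mse = np.var(y_sorted)  # MSE of predicting the mean in the parent node
    best_threshold, best_reduction = None, 0.0
    for i in range(1, len(y_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # cannot split between identical feature values
        left, right = y_sorted[:i], y_sorted[i:]
        # Size-weighted MSE of the two child nodes.
        child_mse = (len(left) * np.var(left) + len(right) * np.var(right)) / len(y_sorted)
        reduction = parent_mse - child_mse
        if reduction > best_reduction:
            best_threshold = 0.5 * (x_sorted[i - 1] + x_sorted[i])
            best_reduction = reduction
    return best_threshold, float(best_reduction)

# Toy data with an obvious jump between x = 3 and x = 10.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])
threshold, reduction = best_split(x, y)

# Bootstrap sample: draw len(x) indices with replacement; each tree gets its own.
rng = np.random.default_rng(0)
boot_idx = rng.integers(0, len(x), size=len(x))
x_boot, y_boot = x[boot_idx], y[boot_idx]
```

Each tree is trained on its own bootstrap sample, and the forest prediction is the average over the trees.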

