AI Lec-03


AI Foundations and Applications

3. Feature Engineering

Thien Huynh-The
HCM City Univ. Technology and Education
Jan, 2023
Machine Learning

• AI: A program that can sense, reason, act, and adapt.
• ML: Algorithms whose performance improves as they are exposed to more data over time.
• DL: Subset of ML in which multilayer neural networks learn from vast amounts of data.
• Data Science: Math, statistics, and visualization.
HCMUTE AI Foundations and Applications 03/18/2024 2
Machine Learning

• A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P if its performance at tasks in T, as measured by P,
improves with experience E.



Task T

• A task in ML is usually described in terms of how a data point should be processed.
• In the image classification task, a data point is an image.
• In the client classification task, a data point is a client.
• In the physical activity classification task, a data point is a vector of sensor data.
• A data point comprises various features, and each feature is usually represented by
a numerical value.
• In general, a set of features can be structured as a vector or as higher-dimensional
structured data such as a matrix.
• In an image, the value of each pixel can be considered a feature. A vector that contains
all pixel values is a feature vector.
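
The pixel-to-feature-vector idea can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the 3x3 image, its pixel values, and the helper name `flatten` are all made up for the example.

```python
# A tiny 3x3 grayscale "image": each entry is one pixel intensity (0-255).
# The values are invented purely for illustration.
image = [
    [0, 128, 255],
    [64, 192, 32],
    [16, 8, 240],
]


def flatten(img):
    """Concatenate the rows of a 2-D image into one feature vector."""
    return [pixel for row in img for pixel in row]


feature_vector = flatten(image)
print(len(feature_vector))  # one feature per pixel
print(feature_vector)
```

Row-by-row flattening is only one possible ordering; any fixed ordering works as long as it is applied consistently to every image.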



Classification vs Regression

Classification
• Classification predictive modeling is the task of approximating a mapping function (f)
from input variables (x) to discrete output variables (y).
• The output variables are often called labels or categories. The mapping function
predicts the class or category for a given observation.
• For example, an email can be classified as belonging to one of two classes: "spam"
and "not spam".

Regression
• Regression predictive modeling is the task of approximating a mapping function (f)
from input variables (x) to a continuous output variable (y).
• A continuous output variable is a real value, such as an integer or floating-point
value. These are often quantities, such as amounts and sizes.
• For example, a house may be predicted to sell for a specific dollar value, perhaps in
the range 100,000 to 200,000.



Classification vs Regression

Classification
• A classification problem requires that examples be classified into one of two or more
classes.
• A classification problem can have real-valued or discrete input variables.
• A problem with two classes is often called a two-class or binary classification problem.
• A problem with more than two classes is often called a multi-class classification
problem.
• A problem where an example is assigned multiple classes is called a multi-label
classification problem.

Regression
• A regression problem requires the prediction of a quantity.
• A regression problem can have real-valued or discrete input variables.
• A problem with multiple input variables is often called a multivariate regression
problem.
• A regression problem where input variables are ordered by time is called a time series
forecasting problem.

Task: Students study clustering on their own.
Ref: https://www.geeksforgeeks.org/clustering-in-machine-learning/



Performance P

• Normally, when working with ML algorithms, the whole dataset is divided into a
training set and a test set.
• The training set is used to find the parameters of the ML model.
• The test set is used to evaluate the performance of the ML model.
• Note: performance can be evaluated on both sets.
• A good model should first fit the training set well, but what ultimately matters is
how well it performs on the unseen test set.
• Reading task:
• Distinguish online learning/training from offline learning/training
• What is generalization in ML? (ref:
https://deepai.space/what-is-generalization-in-machine-learning/)
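
Dividing a dataset into a training set and a test set can be sketched as below. This is a minimal illustration under assumptions that are not in the slides: the function name `train_test_split`, the 70/30 ratio, and the fixed random seed are all hypothetical choices.

```python
import random


def train_test_split(data, test_ratio=0.2, seed=0):
    """Shuffle a copy of `data` and split it into (train, test) lists."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")
    shuffled = list(data)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


samples = list(range(10))  # stand-in for ten data points
train_set, test_set = train_test_split(samples, test_ratio=0.3, seed=42)
print(len(train_set), len(test_set))
```

Shuffling before splitting avoids a test set made only of the last recorded samples, which matters when the data was collected in some systematic order.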



Experience E

• Training an ML model can be viewed as letting the model gain experience from the
training set.
• Different datasets give different experiences.
• The quality of the dataset also affects the model performance.

Supervised Learning
• Supervised learning is a machine learning approach that's defined by its use of
labeled datasets.
• These datasets are designed to train or "supervise" algorithms into classifying data
or predicting outcomes accurately.
• Using labeled inputs and outputs, the model can measure its accuracy and learn over
time.

Unsupervised Learning
• Unsupervised learning uses machine learning algorithms to analyze and cluster
unlabeled data sets.
• These algorithms discover hidden patterns in data without the need for human
intervention (hence, they are "unsupervised").



Unsupervised learning

• Unsupervised learning models are used for three main tasks: clustering, association
and dimensionality reduction:
• Clustering is a data mining technique for grouping unlabeled data based on their similarities or
differences. This technique is helpful for market segmentation, image compression, etc.
• Association is another type of unsupervised learning method that uses different rules to find
relationships between variables in a given dataset.
• Dimensionality reduction is a learning technique used when the number of features (or
dimensions) in a given dataset is too high. It reduces the number of data inputs to a
manageable size while also preserving the data integrity.
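
Clustering can be illustrated with k-means. The slides name no specific algorithm, so k-means is an assumed choice here, and the one-dimensional data, initial centers, and function name `kmeans_1d` are hypothetical. The sketch groups unlabeled numbers purely by similarity (distance).

```python
def kmeans_1d(values, centers, n_iter=10):
    """Lloyd's algorithm on scalar data, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(n_iter):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]  # no labels are given
centers, clusters = kmeans_1d(data, centers=[0.0, 5.0])
print(centers)
print(clusters)
```

No labels are used anywhere: the two groups emerge only from the distances between the values, which is what makes the method unsupervised.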



More information

• Read more at
https://www.ibm.com/cloud/blog/supervised-vs-unsupervised-learning
• Dive into reinforcement learning at
https://developer.ibm.com/learningpaths/get-started-automated-ai-for-decision-making-api/what-is-automated-ai-for-decision-making/
• Loss function in Vietnamese at
https://khanh-personal.gitbook.io/ml-book-vn/chapter1/ham-mat-mat



ML framework



Feature Extraction

• Feature extraction is usually used when the raw data is in a form that cannot be used
directly for ML modeling. In this case, we should transform the raw data into the
desired form.
• Feature extraction is a method for creating a new and smaller set of features
that captures most of the useful information in the raw data.
• Some popular types of raw data from which new features can be extracted:
• Text / Images / Geospatial data / Date and time / Web data / Sensor data



Feature Extraction

• Feature extraction is a technique used to reduce a large input data set into
relevant features. This is done with dimensionality reduction to transform large
input data into smaller, meaningful groups for processing.
Benefits
• Feature extraction can prove helpful when training a machine learning model.
• It leads to:
• A boost in training speed
• An improvement in model accuracy
• A reduction in risk of overfitting
• A rise in model explainability
• Better data visualization



Feature Extraction - Example

• Consider the following example
• Data collected from sensors
• This is time-series sensor data with an amplitude and a sampling frequency
(number of samples recorded each second)
• In the figure
• It is too hard to discriminate between the raw signals recorded in two conditions,
i.e., healthy operation vs faulty operation
• After computing condition indicators from the raw data, the signals are more
discriminative



Feature Extraction - Example

• Condition indicators can be derived from the data using time-domain,
frequency-domain, and time-frequency-domain features.
• Some time-domain features
• Mean
• Standard deviation
• Skewness
• Root-mean square
• Kurtosis
• ...
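
The time-domain features listed above can be computed directly from a recorded signal. The sketch below is a minimal illustration, assuming population (divide-by-N) moments; the function name `time_domain_features` and the two toy signals are hypothetical and are not the data shown in the lecture figure.

```python
import math


def time_domain_features(signal):
    """Return common time-domain statistics of a 1-D signal (list of floats).

    Assumes a non-constant signal; a constant one has zero variance and
    would cause a division by zero in skewness and kurtosis.
    """
    n = len(signal)
    mean = sum(signal) / n
    # Central moments of order 2, 3 and 4 (population form: divide by n).
    m2 = sum((x - mean) ** 2 for x in signal) / n
    m3 = sum((x - mean) ** 3 for x in signal) / n
    m4 = sum((x - mean) ** 4 for x in signal) / n
    std = math.sqrt(m2)
    return {
        "mean": mean,
        "std": std,
        "rms": math.sqrt(sum(x * x for x in signal) / n),
        "skewness": m3 / std ** 3,
        "kurtosis": m4 / m2 ** 2,
    }


# Hypothetical signals: a regular oscillation and one with a single sharp spike.
smooth = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
spiky = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.0]

for name, sig in (("smooth", smooth), ("spiky", spiky)):
    feats = time_domain_features(sig)
    print(name, {k: round(v, 3) for k, v in feats.items()})
```

In this toy case the spiky signal has a much larger kurtosis than the regular one, which hints at why impulsive faults can become easier to separate once such indicators are computed.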



Feature Extraction - Example

• Frequency-domain features
• Power bandwidth
• Mean frequency
• Peak values
• Peak frequencies
• Harmonics
• ...
• Time-frequency domain features
• Spectral entropy
• Spectral kurtosis
• ...
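
Two of the frequency-domain features above, peak frequency and mean frequency, can be sketched with a direct discrete Fourier transform. This is an illustration under stated assumptions: the helper names are hypothetical, the mean frequency is taken as the power-weighted average of the frequency bins (one common definition), and the naive O(N^2) DFT stands in for the FFT routine a real project would use.

```python
import cmath
import math


def magnitude_spectrum(signal, fs):
    """Return (frequencies, magnitudes) of the one-sided DFT of `signal`.

    `fs` is the sampling frequency in samples per second.
    """
    n = len(signal)
    freqs, mags = [], []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        freqs.append(k * fs / n)
        mags.append(abs(coeff))
    return freqs, mags


def peak_frequency(freqs, mags):
    """Frequency of the largest spectral peak, ignoring the DC bin."""
    k = max(range(1, len(mags)), key=lambda i: mags[i])
    return freqs[k]


def mean_frequency(freqs, mags):
    """Power-weighted average frequency of the spectrum."""
    power = [m * m for m in mags]
    return sum(f * p for f, p in zip(freqs, power)) / sum(power)


fs = 64  # assumed sampling frequency (Hz)
signal = [math.sin(2 * math.pi * 8 * t / fs) for t in range(fs)]  # 8 Hz tone
freqs, mags = magnitude_spectrum(signal, fs)
print(peak_frequency(freqs, mags))
print(round(mean_frequency(freqs, mags), 3))
```

With one second of data sampled at 64 Hz the bins are 1 Hz apart, so the 8 Hz tone falls exactly on a bin and both features land at 8 Hz.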



Feature Extraction

• Time-domain features

Features Formulation
Mean

Standard deviation

Variance

Skewness

Kurtosis
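
For reference, these five features are commonly written as follows for a signal $x_1, \dots, x_N$. This is a sketch assuming the population form (division by $N$); the symbols $\mu$, $\sigma$ and $N$ are notation chosen here, and some texts divide by $N-1$ or subtract 3 from the kurtosis, so the lecture's own formulation may differ.

```latex
\begin{align*}
\text{Mean:} \quad & \mu = \frac{1}{N} \sum_{i=1}^{N} x_i \\
\text{Standard deviation:} \quad & \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} \\
\text{Variance:} \quad & \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \\
\text{Skewness:} \quad & \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \mu}{\sigma} \right)^{3} \\
\text{Kurtosis:} \quad & \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \mu}{\sigma} \right)^{4}
\end{align*}
```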



Feature Extraction

• Depending on the data type (time series or high-dimensional data, vector or matrix,
discrete or continuous), we should select an appropriate set of features to maximize
the discrimination between input samples.
• In addition, there are some advanced feature extraction techniques:
• Explicit Semantic Analysis (ESA)
• Non-Negative Matrix Factorization (NMF)
• Singular Value Decomposition (SVD)
Ref: https://www.geeksforgeeks.org/singular-value-decomposition-svd/
• Principal Component Analysis (PCA)
Ref: https://viblo.asia/p/gioi-thieu-principal-component-analysis-07LKXpq2KV4
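
As a rough illustration of PCA, the sketch below finds the first principal component of 2-D points and projects them onto it, reducing two features to one. It is not a full PCA implementation: power iteration is swapped in for the usual eigendecomposition or SVD, and the function name and sample points are hypothetical.

```python
import math


def first_principal_component(points, n_iter=100):
    """Estimate the leading PCA direction of 2-D points by power iteration.

    Assumes the points are not all identical and that the starting
    direction is not orthogonal to the leading direction.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Entries of the 2x2 covariance matrix.
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    vx, vy = 1.0, 0.0  # arbitrary starting direction
    for _ in range(n_iter):
        # Multiply by the covariance matrix, then renormalize.
        wx = cxx * vx + cxy * vy
        wy = cxy * vx + cyy * vy
        norm = math.hypot(wx, wy)
        vx, vy = wx / norm, wy / norm
    return (vx, vy), centered


points = [(0, 0), (1, 2), (2, 4), (3, 6)]  # all points lie on y = 2x
direction, centered = first_principal_component(points)
# Each 2-D point becomes a single coordinate along the principal direction.
projected = [x * direction[0] + y * direction[1] for x, y in centered]
print(direction)
print(projected)
```

Because the sample points lie exactly on a line, the single projected coordinate keeps all of their variance; with real data the first component keeps only the largest share.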



Feature Normalization

• Normalization is another important concept: it brings all features to the same scale.
This allows faster convergence during learning and a more uniform influence across
all weights.
• Rescaling
• changes all features to lie between 0 and 1
• Standardization
• standardizes features by removing the mean and scaling to unit variance
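
Both normalization schemes can be sketched for a single feature column. Rescaling maps x to (x - min) / (max - min), and standardization maps x to (x - mean) / std. The function names, the population standard deviation, and the sample heights are assumptions made for this example.

```python
import math


def rescale(values):
    """Min-max rescaling: map the values linearly onto [0, 1].

    Assumes the values are not all equal (otherwise max - min is zero).
    """
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Remove the mean and scale to unit variance (population std)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical feature column
print(rescale(heights_cm))
print([round(z, 3) for z in standardize(heights_cm)])
```

In practice the minimum, maximum, mean, and standard deviation are estimated on the training set only and then reused on the test set, so that no information from the test data leaks into training.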



Reading

In-class assignment
• Explore evaluation protocols
• Learn the performance metrics in ML
• Describe the following feature selection methods
• Filter method
• Wrapper method
• Embedded method



Coding Time

• Python
• Classification using different classification algorithms with raw data
• Classification using different classification algorithms with feature extraction
• Select the best classification algorithm and feature descriptor to obtain the highest accuracy
