AI Lec-03
3. Feature Engineering
Thien Huynh-The
HCM City Univ. Technology and Education
Jan, 2023
Machine Learning
AI ML DL Data Science
• A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Classification
• Classification predictive modeling is the task of approximating a mapping function (f) from input variables (x) to discrete output variables (y).
• The output variables are often called labels or categories. The mapping function predicts the class or category for a given observation.
• For example, an email of text can be classified as belonging to one of two classes: “spam” and “not spam”.
Regression
• Regression predictive modeling is the task of approximating a mapping function (f) from input variables (x) to a continuous output variable (y).
• A continuous output variable is a real value, such as an integer or floating-point value. These are often quantities, such as amounts and sizes.
• For example, a house may be predicted to sell for a specific dollar value, perhaps in the range 100,000 to 200,000.
Classification
• A classification problem requires that examples be classified into one of two or more classes.
• A classification can have real-valued or discrete input variables.
• A problem with two classes is often called a two-class or binary classification problem.
• A problem with more than two classes is often called a multi-class classification problem.
• A problem where an example is assigned multiple classes is called a multi-label classification problem.
Regression
• A regression problem requires the prediction of a quantity.
• A regression can have real-valued or discrete input variables.
• A problem with multiple input variables is often called a multivariate regression problem.
• A regression problem where input variables are ordered by time is called a time series forecasting problem.
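To make the contrast concrete, here is a minimal scikit-learn sketch (library, dataset generators, and model choices are illustrative assumptions, not part of the lecture): a classifier predicts discrete labels, a regressor predicts continuous values.

```python
# Sketch: classification predicts discrete labels, regression predicts real values.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: toy data with discrete labels y in {0, 1}
Xc, yc = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LogisticRegression().fit(Xc, yc)
print(clf.predict(Xc[:3]))   # discrete class labels

# Regression: toy data with a continuous target y
Xr, yr = make_regression(n_samples=100, n_features=4, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print(reg.predict(Xr[:3]))   # real-valued predictions
```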
• Normally, when working with ML algorithms, the whole dataset is divided into a training set and a test set.
• The training set is used to find the parameters of the ML model.
• The test set is used to evaluate the performance of the ML model.
• Note: performance evaluation can be applied to both sets.
• A good model should, first of all, work well on the training set.
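A minimal sketch of this train/test workflow, assuming scikit-learn is available (the iris dataset, 30% split, and k-NN classifier are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Hold out 30% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit on the training set only; evaluate on both sets
model = KNeighborsClassifier().fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))
```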
• Reading task:
• Distinguish between online learning/training and offline learning/training
• What is generalization in ML? (ref: https://deepai.space/what-is-generalization-in-machine-learning/)
• Unsupervised learning models are used for three main tasks: clustering, association
and dimensionality reduction:
• Clustering is a data mining technique for grouping unlabeled data based on their similarities or
differences. This technique is helpful for market segmentation, image compression, etc.
• Association is another type of unsupervised learning method that uses different rules to find
relationships between variables in a given dataset.
• Dimensionality reduction is a learning technique used when the number of features (or
dimensions) in a given dataset is too high. It reduces the number of data inputs to a
manageable size while also preserving the data integrity.
• Read more at
https://www.ibm.com/cloud/blog/supervised-vs-unsupervised-learning
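As an illustration of the clustering task described above, here is a hedged sketch using k-means from scikit-learn; the two synthetic blobs of unlabeled points are an assumption for demonstration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of unlabeled 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(5.0, 0.5, (50, 2))])

# Group the points by similarity, without any labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5], km.labels_[-5:])   # cluster assignments
```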
• Dive into reinforcement learning at
https://developer.ibm.com/learningpaths/get-started-automated-ai-for-decision-making-api/what-is-automated-ai-for-decision-making/
• Feature engineering is usually used when the original raw data is too heterogeneous to be used directly for ML modeling. In this case, we should transform the raw data into the desired form.
• Feature extraction is the method for creating a new and smaller set of features
that capture most of the useful information of raw data
• Some of the popular types of raw data from which features (new feature creation)
can be extracted
• Text / Images / Geospatial data / Date and time / Web data / Sensory data
• Feature extraction is a technique used to reduce a large input data set into
relevant features. This is done with dimensionality reduction to transform large
input data into smaller, meaningful groups for processing.
Benefits
• Feature extraction can prove helpful when training a machine learning model.
• It leads to:
• A boost in training speed
• An improvement in model accuracy
• A reduction in risk of overfitting
• A rise in model explainability
• Better data visualization
• In the figure:
• It is too hard to discriminate the signals recorded in the two conditions, i.e., healthy operation vs. faulty operation.
• After computing condition indicators from the raw data, the signals become more discriminative.
• Frequency-domain features
• Power bandwidth
• Mean frequency
• Peak values
• Peak frequencies
• Harmonics
• ...
• Time-frequency domain features
• Spectral entropy
• Spectral kurtosis
• ...
• Time-domain features
• Mean
• Standard deviation
• Variance
• Skewness
• Kurtosis
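These time-domain features can be computed in a few lines with NumPy and SciPy; the synthetic Gaussian signal is an assumption for illustration (note that scipy.stats.kurtosis returns excess kurtosis by default, which is 0 for a normal distribution):

```python
import numpy as np
from scipy.stats import kurtosis, skew

def time_domain_features(x):
    """Common statistical time-domain features of a 1-D signal."""
    return {
        "mean": np.mean(x),
        "std": np.std(x),
        "variance": np.var(x),
        "skewness": skew(x),
        "kurtosis": kurtosis(x),   # excess kurtosis (0 for Gaussian data)
    }

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)   # synthetic signal
feats = time_domain_features(x)
print({k: round(float(v), 2) for k, v in feats.items()})
```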
• Depending on the data type, such as time series or high-dimensional data, vector or matrix, discrete or continuous, we should select an appropriate set of features to maximize the discrimination of input samples.
• Besides, there are some advanced feature extraction techniques:
• Explicit Semantic Analysis (ESA).
• Non-Negative Matrix Factorization (NMF).
• Singular Value Decomposition (SVD)
Ref: https://www.geeksforgeeks.org/singular-value-decomposition-svd/
• and Principal Component Analysis (PCA).
Ref: https://viblo.asia/p/gioi-thieu-principal-component-analysis-07LKXpq2KV4
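A minimal PCA sketch with scikit-learn (the random data matrix is an assumption for illustration); under the hood, scikit-learn computes PCA via an SVD of the centered data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # 200 samples, 10 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)          # project onto top-2 principal components
print(X_reduced.shape)                    # dimensionality reduced from 10 to 2
print(pca.explained_variance_ratio_)      # variance captured by each component
```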
• Standardization
• Standardize features by removing the mean and scaling to unit variance.
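A sketch of standardization with scikit-learn's StandardScaler (the toy matrix is an assumption): each column ends up with zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

scaler = StandardScaler()
Xs = scaler.fit_transform(X)   # remove the mean, scale to unit variance
print(Xs.mean(axis=0))         # per-feature mean is now ~0
print(Xs.std(axis=0))          # per-feature standard deviation is now 1
```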
In-class assignment
• Discover evaluation protocols
• Learn the performance metrics in ML
• Describe the following feature selection methods
• Filter method
• Wrapper method
• Embedded method
• Python
• Classification using different classification algorithms with raw data
• Classification using different classification algorithms with feature extraction
• Select the best classification algorithm and feature descriptor to obtain the highest accuracy
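As a hedged starting point for the Python tasks above, here is a sketch comparing classifiers on raw data vs. transformed features; PCA stands in for the feature-extraction step, and the digits dataset, classifier choices, and component count are all assumptions for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, clf in [("kNN", KNeighborsClassifier()), ("SVM", SVC())]:
    # Accuracy on raw pixel data
    acc_raw = clf.fit(X_tr, y_tr).score(X_te, y_te)
    # Accuracy after reducing to 20 PCA components (fit on training data only)
    pca = PCA(n_components=20).fit(X_tr)
    acc_pca = clf.fit(pca.transform(X_tr), y_tr).score(pca.transform(X_te), y_te)
    print(f"{name}: raw={acc_raw:.3f}, pca={acc_pca:.3f}")
```

Note that the transform is fit on the training set only and then applied to the test set, so no test information leaks into the features.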