
Data Exploration
Data exploration, preprocessing, and visualization are crucial steps in
machine learning that help in understanding the data, preparing it for
modeling, and gaining insights.
Data Exploration
Understanding the Data:
● Check basic statistics: Mean, median, standard deviation, etc., to understand the
distribution of features.
● Investigate data types: Categorical, numerical, text, etc.
● Explore missing values: Identify and handle missing or null values appropriately.
Visualizing Data:
● Histograms, box plots, or density plots help understand feature distributions.
● Scatter plots show relationships between variables.
● Heatmaps or correlation matrices reveal feature correlations.
● Categorical data: Bar charts, pie charts, or count plots to analyze distributions.
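
A minimal sketch of these checks, assuming a recent pandas and a hypothetical data.csv with a numeric "age" column:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")           # hypothetical dataset

print(df.describe())                   # mean, std, quartiles of numeric features
print(df.dtypes)                       # numerical vs. categorical columns
print(df.isnull().sum())               # missing values per column

sns.histplot(df["age"], kde=True)      # distribution of one numeric feature
plt.show()

sns.heatmap(df.corr(numeric_only=True), annot=True)   # feature correlations
plt.show()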
Data Preprocessing
Handling Missing Values:
● Impute missing values using techniques like mean, median, mode, or advanced imputation methods.
● Delete or interpolate missing values based on the dataset and the impact on analysis.
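
A short sketch of these options on a made-up DataFrame with a numeric "income" column and a categorical "city" column:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"income": [52_000, np.nan, 61_000, 58_000],
                   "city": ["Pune", "Delhi", np.nan, "Delhi"]})

df["income"] = df["income"].fillna(df["income"].median())   # numeric: median
df["city"] = df["city"].fillna(df["city"].mode()[0])        # categorical: mode
# df = df.dropna()                     # alternatively, drop rows with missing values

# A scikit-learn imputer does the same and is convenient inside pipelines
imputer = SimpleImputer(strategy="median")
df[["income"]] = imputer.fit_transform(df[["income"]])
print(df)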
Encoding Categorical Variables:
● Convert categorical variables into numerical format using techniques like one-hot encoding or label
encoding.
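
A sketch of both techniques on a small made-up "vehicle" column:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"vehicle": ["sedan", "SUV", "truck", "SUV", "sedan"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["vehicle"], prefix="vehicle")

# Label encoding: each category mapped to an integer code
df["vehicle_label"] = LabelEncoder().fit_transform(df["vehicle"])

print(one_hot)
print(df)

One-hot encoding is the safer default for nominal categories; label encoding implies an ordering, so it suits ordinal variables or tree-based models.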
Feature Scaling:
● Normalize or standardize numerical features to bring them to a similar scale, preventing dominance by
larger values.
Feature Engineering:
● Create new features based on existing ones that might be more informative for the model.
● Transform variables (log, square root) to make their distributions more Gaussian.
Data Visualization
● Univariate Visualizations:
○ Histograms, box plots, and kernel density plots for single variables to understand
their distributions and outliers.
● Bivariate and Multivariate Visualizations:
○ Scatter plots, pair plots, heatmaps, or correlation matrices to explore
relationships between variables.
○ Facet grids or conditional plots for comparisons across multiple variables.
● Interactive Visualization:
○ Tools like Plotly or Bokeh for interactive plots that allow zooming, hovering,
and filtering.
Tools for Data Exploration and Visualization
● Python Libraries: Pandas, Matplotlib, Seaborn, Plotly, Bokeh, Altair.
● R Programming: ggplot2, dplyr, tidyr.
● BI Tools: Tableau, Power BI for interactive and dashboard-style
visualizations.
● Jupyter Notebooks: Ideal for combining code, visualization, and explanatory
text in one document.
Importance
● Understanding the data: It helps in selecting appropriate models, identifying potential
issues, and making informed decisions during model building.
● Quality Improvement: Proper preprocessing and visualization lead to cleaner data,
which often results in more accurate models.
● Insights and Interpretability: Visualization aids in communicating findings and
insights from the data to stakeholders.

Effective data exploration, preprocessing, and visualization are integral parts of the
machine learning pipeline, contributing significantly to the success and interpretability of
the models built on the data.
Data Types
Numeric Data: Data that can be expressed as a number. Examples of numeric data
include height, weight, and temperature.
Categorical Data: Data that can be categorized but lacks an inherent hierarchy or
order is known as categorical data. In other words, there is no mathematical
connection between the categories. A person's gender (male/female), eye color
(blue, green, brown, etc.), type of vehicle they drive (sedan, SUV, truck, etc.), or
the kind of fruit they consume (apple, banana, orange, etc.) are examples of
categorical data.
Outliers
● Outliers are the values that look different from the other values in the data.

● Outliers are data points that significantly deviate from the majority of the
data. They can be caused by errors, anomalies, or simply rare events.
Reasons for outliers in data
1. Errors during data entry or a faulty measuring device (e.g., a faulty sensor may result in extreme readings).
2. Natural occurrence (e.g., salaries of junior-level employees vs. C-level employees).
Problems caused by outliers
1. Outliers in the data may cause problems during model fitting (especially for linear models).
2. Outliers can inflate error metrics that give higher weight to large errors (e.g., mean squared error, RMSE).


● Overfitting: Models can focus on fitting the outliers rather than the
underlying patterns in the majority of the data.
● Reduced accuracy: Outliers can pull the model’s predictions towards
themselves, leading to inaccurate predictions for other data points.
● Unstable models: The presence of outliers can make the model’s
predictions sensitive to small changes in the data.
Methods for detecting Outliers
● Distance-based measures: These measures, like Z-score and interquartile
range (IQR), calculate the distance of a data point from the center of the
data distribution.
● Visualization techniques: Techniques like box plots and scatter plots can
visually identify data points that lie far away from the majority of the data.
● Clustering algorithms: Clustering algorithms can automatically group
similar data points, isolating outliers as separate clusters.
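
A sketch of the two distance-based rules on a small made-up sample; with this data both rules flag the extreme value 150:

import numpy as np

data = np.array([10, 11, 12, 12, 13, 11, 10, 12, 13, 11, 12, 10, 13, 11, 150])

# Z-score rule: flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 3])

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)])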
How can we handle outliers?
● Removing outliers: This is a simple approach but can lead to information loss.
● Clipping: Outliers are capped to a certain value instead of being removed
completely.
● Transformation: Data can be transformed to reduce the impact of outliers, such as
using log transformations for skewed data.
● Robust models: Certain models are less sensitive to outliers, such as decision trees
and support vector machines.
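
A short sketch of the clipping approach on made-up salary-like values, capping at the 5th and 95th percentiles:

import pandas as pd

s = pd.Series([30_000, 35_000, 40_000, 42_000, 45_000, 500_000])

# Clipping (winsorizing): cap extreme values at percentile-based limits
lower, upper = s.quantile(0.05), s.quantile(0.95)
print(s.clip(lower=lower, upper=upper))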
A survey was given to a random sample of 20 sophomore college students. They were asked, "How
many textbooks do you own?" Their responses were: 0, 0, 2, 5, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 12,
14, 15, 20, and 25.
A teacher wants to examine students’ test scores. Their scores are: 74, 88, 78, 90, 94, 90, 84, 90, 98,
and 80.
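
As a worked example, applying the 1.5 × IQR rule to the textbook-count data gives Q1 = 8 and Q3 = 12, so the fences are 2 and 18 and the values 0, 0, 20, and 25 are flagged as outliers; the test scores all fall inside their fences, so that dataset has no IQR outliers. A quick check in NumPy:

import numpy as np

books = np.array([0, 0, 2, 5, 8, 8, 8, 9, 9, 10, 10, 10,
                  11, 12, 12, 12, 14, 15, 20, 25])

q1, q3 = np.percentile(books, [25, 75])          # Q1 = 8, Q3 = 12
iqr = q3 - q1                                    # IQR = 4
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # fences at 2 and 18
print(books[(books < lower) | (books > upper)])  # [ 0  0 20 25]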
Feature Engineering
● Feature engineering is the process of creating new features or modifying
existing ones to improve the performance of machine learning models.
● It involves extracting valuable information, reducing noise, and making the
data more suitable for modeling.
Variable transformation
● Variable transformation refers to the process of modifying or changing the
scale or distribution of variables in a dataset.
● It's often employed to meet certain assumptions of statistical techniques or to
improve the performance of machine learning models.
Log transformation
● Use: Mitigates skewed distributions.
● When: Useful when data is highly skewed, especially towards larger values.
● Application: Transform positively skewed data to be more normally
distributed.
● Example: Logarithmically transforming skewed variables like income,
population, or stock prices.
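
A small sketch on made-up, income-like values; the skewness of the log-transformed data comes out much smaller than that of the raw data:

import numpy as np
from scipy.stats import skew

income = np.array([20_000, 25_000, 30_000, 32_000, 40_000,
                   55_000, 70_000, 120_000, 500_000, 1_200_000])

log_income = np.log(income)      # values must be strictly positive
# np.log1p(income) is a common alternative when zeros can occur

print(round(skew(income), 2), round(skew(log_income), 2))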
Feature Scaling
● Feature scaling is a preprocessing technique used to standardize the range of
independent variables or features in a dataset.
● It's essential when features have different scales or units, ensuring that all
features contribute equally to the analysis and preventing certain features
from dominating due to their larger scale.
Why Scale Features
● Equalizes Importance: Ensures that all features contribute proportionately to
the analysis, preventing bias toward features with larger scales.
● Improves Convergence: Helps algorithms converge faster in
optimization-based algorithms like gradient descent.
● Better Performance: Many machine learning algorithms (e.g., SVMs,
k-means clustering) perform better with scaled features.
Min-Max scaling
● Use: Scales variables to a specific range, often [0, 1].
● When: Useful for algorithms sensitive to varying scales.
● Application: Scales variables to a common range without distorting the
original distribution.
● Example: Preprocessing for algorithms like neural networks, SVMs, or
K-means clustering.
Data: [10, 20, 30, 40, 50]
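
Applying x' = (x − min) / (max − min) with min = 10 and max = 50 gives [0, 0.25, 0.5, 0.75, 1.0]. The same result with scikit-learn's MinMaxScaler:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10], [20], [30], [40], [50]])   # one feature, five samples
print(MinMaxScaler().fit_transform(X).ravel())
# [0.   0.25 0.5  0.75 1.  ]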
Z-score Standardization
● Use: Centers data around the mean and scales by the standard deviation.
● When: Effective when data is normally distributed.
● Application: Transforms data to have a mean of 0 and a standard deviation of
1.
● Example: Preprocessing for algorithms like PCA or when normality is
assumed.
Example
Data: [10, 20, 30, 40, 50]
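
Here the mean is 30 and the population standard deviation is √200 ≈ 14.14, so z = (x − 30) / 14.14 ≈ [−1.41, −0.71, 0, 0.71, 1.41]. The same with scikit-learn's StandardScaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10], [20], [30], [40], [50]])
print(StandardScaler().fit_transform(X).ravel().round(3))
# [-1.414 -0.707  0.     0.707  1.414]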
Max absolute scaler
● MaxAbsScaler is a method used for transforming numerical features by
scaling each feature to a [-1, 1] range based on the maximum absolute value
of each feature.
● It's useful when the distribution of the data is not Gaussian or when the
standard deviation is very small and not suitable for standard scaling
methods.
● Robustness: MaxAbsScaler is robust to very small standard deviations and
can work better than other scalers in such cases.
● Range Preservation: It maintains the relationships among the original values
but squashes the range to [-1, 1].
● Use Case: Useful when features have a mix of positive and negative values
with varying scales.
● Impact: It may not handle outliers as effectively as methods like
RobustScaler or StandardScaler.
Example
Feature 1: [10, -5, 8, 12, -15]
Feature 2: [0, 3, -6, 9, -12]
Feature 3: [-8, 4, -10, 7, 14]
Calculate Maximum Absolute Value for Each Feature:
● Feature 1: max(|x|) = 15
● Feature 2: max(|x|) = 12
● Feature 3: max(|x|) = 14
Scale Each Feature by its Maximum Absolute Value:
● Scaled Feature 1 = [10/15, -5/15, 8/15, 12/15, -15/15] = [0.6667, -0.3333,
0.5333, 0.8000, -1.0000]
● Scaled Feature 2 = [0/12, 3/12, -6/12, 9/12, -12/12] = [0.0000, 0.2500,
-0.5000, 0.7500, -1.0000]
● Scaled Feature 3 = [-8/14, 4/14, -10/14, 7/14, 14/14] = [-0.5714, 0.2857,
-0.7143, 0.5000, 1.0000]
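
The same computation with scikit-learn's MaxAbsScaler, arranging each feature as a column:

import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[ 10,   0,  -8],     # rows = samples, columns = Features 1-3
              [ -5,   3,   4],
              [  8,  -6, -10],
              [ 12,   9,   7],
              [-15, -12,  14]])

# Each column is divided by its maximum absolute value (15, 12, 14)
print(MaxAbsScaler().fit_transform(X).round(4))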
Robust scaler
Robust Scaler is a technique used for scaling numerical features by removing the
median and scaling data according to the interquartile range (IQR).
It's less sensitive to outliers compared to Min-Max scaling or Z-score
normalization.
Example
[10, 20, 30, 40, 50, 1000]
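
For this data the median is 35 and, using linear-interpolation quantiles, Q1 = 22.5 and Q3 = 47.5, so IQR = 25. Scaling by (x − 35) / 25 keeps the ordinary points within [−1, 1] while the outlier 1000 maps to 38.6, so the outlier no longer compresses the rest of the data the way min-max or z-score scaling would. A sketch with scikit-learn:

import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[10], [20], [30], [40], [50], [1000]])
print(RobustScaler().fit_transform(X).ravel())   # centers on the median, scales by the IQR
# [-1.  -0.6 -0.2  0.2  0.6 38.6]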
Feature Selection
● Feature selection is the process of choosing the most relevant features
(variables, attributes) to include in your machine learning model.
● It's crucial for improving model performance, reducing overfitting, and
speeding up training time by eliminating irrelevant or redundant features.
Benefits
● It helps in avoiding the curse of dimensionality.
● It simplifies the model, making it easier for researchers to interpret.
● It reduces the training time.
● It reduces overfitting and hence enhances generalization.
Forward selection
Forward selection is a method for feature selection where you start with an empty
set of features and iteratively add the most significant feature at each step, based
on some criterion.
Greedy method
● The greedy method for feature selection is an iterative approach where
features are added or removed based on a defined criterion.
● It's called "greedy" because it makes locally optimal choices at each step
with the hope of finding a global optimum.
Backward selection
● This is also an iterative approach: we start with all features and, at each
iteration, remove the least significant feature.
● The process stops when no improvement in model performance is observed after
removing a feature.
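
A sketch of both directions with scikit-learn's SequentialFeatureSelector, a greedy wrapper method; the estimator and dataset here are only placeholders:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
model = KNeighborsClassifier()

# Forward selection: start empty, greedily add the best feature each step
forward = SequentialFeatureSelector(model, n_features_to_select=5,
                                    direction="forward", cv=5)
print(forward.fit(X, y).get_support(indices=True))

# Backward selection: start with all features, greedily drop the weakest
backward = SequentialFeatureSelector(model, n_features_to_select=5,
                                     direction="backward", cv=5)
print(backward.fit(X, y).get_support(indices=True))

Both are greedy searches: each step keeps the locally best choice, which is fast but not guaranteed to find the globally optimal feature subset.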
