
WHAT IS A DATASET?

A dataset is a collection of data within a database.

Typically, datasets take on a tabular format consisting of rows and columns. Each
column represents a specific variable, while each row corresponds to a single record
or observation. Some datasets consist of unstructured data and are non-tabular,
meaning they don't fit the traditional row-column format.

WHAT IS DATA ANALYSIS?


Data analysis refers to the process of manipulating raw data to uncover useful insights
and draw conclusions. During this process, a data analyst or data scientist will
organize, transform, and model a dataset.

Organizations use data to solve business problems, make informed decisions, and
effectively plan for the future. Data analysis ensures that this data is optimized and
ready to use.

Some specific types of data analysis include:

 Descriptive analysis
 Diagnostic analysis
 Predictive analysis
 Prescriptive analysis
Regardless of your reason for analyzing data, there are six simple steps that you can
follow to make the data analysis process more efficient.

6 STEPS TO ANALYZE A DATASET


1. Clean Up Your Data

Data wrangling, also called data cleaning, is the process of uncovering and
correcting, or eliminating, inaccurate or duplicate records in your dataset. During the
data wrangling process, you'll transform the raw data into a more useful format,
preparing it for analysis.

It’s imperative to clean your data before beginning analysis. This is particularly
important if you’ll be presenting your findings to business teams who may use the data
for decision-making purposes. Teams need to have confidence that they’re acting on a
reliable source of information.
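To make this concrete, here is a minimal pandas sketch of a typical cleanup pass; the file name (sales.csv), the column names, and the specific operations are illustrative assumptions rather than part of the original article:

import pandas as pd

# Hypothetical raw export; the file and column names are illustrative
df = pd.read_csv("sales.csv")

# Remove exact duplicate records
df = df.drop_duplicates()

# Normalize text fields and parse dates, coercing bad entries to NaT
df["region"] = df["region"].str.strip().str.title()
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Drop rows where a critical field could not be parsed
df = df.dropna(subset=["order_date"])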
2. Identify the Right Questions

Once you’ve completed the cleaning process, you may have a lot of questions about
your final dataset. There’s so much potential that can be uncovered through analysis.

Identify the most important questions you hope to answer through your analysis.
These questions should be easily measurable and closely related to a specific
business problem. If the request for analysis is coming from a business team, ask
them to provide explicit details about what they’re hoping to learn, what they expect to
learn, and how they’ll use the information. You can use their input to determine which
questions take priority in your analysis.

3. Break Down the Data Into Segments

It’s often helpful to break down your dataset into smaller, defined groups. Segmenting
your data will not only make your analysis more manageable, but also keep it on track.

For example, if you're attempting to answer questions about a specific department's
performance, you'll want to segment your data by department. From there, you'll be
able to glean insights about the group you're concerned with and identify any
relationships that might exist among the groups.
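Continuing the hypothetical pandas sketch from the cleaning step, a simple group-by is one way to segment the data; the department and revenue columns are assumed purely for illustration:

# Summarize a numeric measure per department (column names are assumptions)
by_dept = df.groupby("department")["revenue"].agg(["count", "mean", "sum"])
print(by_dept.sort_values("sum", ascending=False))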

4. Visualize the Data

One of the most important parts of data analysis is data visualization, which refers to
the process of creating graphical representations of data. Visualizing the data will help
you to easily identify any trends or patterns and obvious outliers.

By creating engaging visuals that represent the data, you’re also able to effectively
communicate your findings to key stakeholders who can quickly draw conclusions from
the visualizations.

There’s a variety of data visualization tools you can use to automatically generate
visual representations of a dataset, such as Microsoft Excel, Tableau, and Google
Charts.
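A few lines of Python can produce the same kind of chart as the tools above; this sketch assumes matplotlib is available and reuses the hypothetical by_dept table from the segmentation sketch:

import matplotlib.pyplot as plt

# Bar chart of total revenue per department (columns are assumptions)
by_dept["sum"].plot(kind="bar", title="Revenue by department")
plt.ylabel("Total revenue")
plt.tight_layout()
plt.show()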

5. Use the Data to Answer Your Questions

After cleaning, organizing, transforming, and visualizing your data, revisit the questions
you outlined at the beginning of the data analysis process. Interpret your results and
determine whether the data helps you answer your original questions.
If the results are inconclusive, try revisiting a previous step in the analysis process.
Maybe your dataset was too large and should have been segmented further, or
perhaps there’s a different type of visualization better suited to your data.

6. Supplement with Qualitative Data

Finally, as you near the conclusion of your analysis, remember that this dataset is only
one piece of the puzzle.

It’s critical to pair your quantitative findings with qualitative information, which you may
capture using questionnaires, interviews, or testimonials. While the dataset has the
ability to tell you what’s happening, qualitative information can often help you
understand why it’s happening.

THE IMPORTANCE OF DATA ANALYSIS


Virtually all business decisions made by organizations are informed by some type of
data. Because of this, it's crucial that businesses are able to leverage the data that's
available to them.

Businesses rely on the insights gained from data analysis to guide a myriad of
activities, ranging from budgeting to strategy execution. The importance of data
analysis for today's organizations can't be overstated.

How to Handle Missing Values of Categorical Variables?
Real-world data collection has its own set of problems: it is often messy, with missing
data, outliers, and unstructured records. Before looking for any insights in the data,
we first have to perform preprocessing, which is what allows us to use the data for
further observation and to train a machine learning model. Missing values are a very
common phenomenon in real datasets. In this blog, you will see how to handle
missing values for categorical variables during data preprocessing. Missing value
correction is required to reduce bias and to produce robust, well-suited models. Most
algorithms can't handle missing data, so you need to deal with it to keep your code
from crashing. So, let's begin with the methods to solve the problem.
What are Missing Values in a Dataset?
In a dataset, missing values refer to the absence of data for one or more variables
or observations. This can occur for various reasons, such as data entry errors,
equipment malfunction, or participants failing to provide information. Missing values
can affect statistical analyses, leading to biased or incorrect results. Thus, it is
important to handle missing values appropriately by removing them or using
imputation methods to estimate their values.

Types of Missing Data

1. Missing Completely at Random (MCAR): When data is MCAR, the missingness
occurs randomly, and there is no relationship between the missing values and the
observed data. In other words, the probability of a value being missing is the same
for all observations, regardless of their values.
2. Missing at Random (MAR): When data is MAR, the missingness is not
random, but other observed variables can explain it. In other words, the
probability of a missing value depends on the observed data, not the missing
data itself.
3. Missing Not at Random (MNAR): When data is MNAR, the missingness is not
random and cannot be explained by other observed variables. In other words,
the probability of a value being missing depends on the missing data, which
can lead to biased or incorrect results if not handled properly.


Methods for Dealing with Missing Values in a Dataset

Let's take a dummy dataset in which there are three independent features (predictors)
and one dependent feature (response).

Feature-1   Feature-2   Feature-3   Output
Male        23          24          Yes
––––        24          25          No
Female      25          26          Yes
Male        26          27          Yes
Here, we have a missing value in row 2 for Feature-1.

The popular methods used by the machine learning community to handle missing
values for categorical variables are as follows:

Method 1: Delete the Observations


If there is a large number of observations in the dataset, and all the classes to be
predicted are sufficiently represented in the training data, then try deleting the
observations with missing values; doing so will not significantly change what you feed
to your model.

For example, to implement this method on the given dataset, we can delete the entire
row that contains the missing value (delete row 2).
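A minimal sketch of this method on the toy dataset, using pandas (the DataFrame construction below simply mirrors the table above):

import numpy as np
import pandas as pd

# The dummy dataset above, with the missing value in row 2 of Feature-1
df = pd.DataFrame({
    "Feature-1": ["Male", np.nan, "Female", "Male"],
    "Feature-2": [23, 24, 25, 26],
    "Feature-3": [24, 25, 26, 27],
    "Output":    ["Yes", "No", "Yes", "Yes"],
})

# Method 1: drop any observation whose Feature-1 is missing (removes row 2)
df_dropped = df.dropna(subset=["Feature-1"])
print(df_dropped)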

Method 2: Replace Missing Values with the Most Frequent Value


You can always impute categorical variables based on the mode; just make sure you
don't have highly skewed class distributions.

NOTE: In some cases this strategy can make the data imbalanced with respect to the
classes if there is a huge number of missing values in the dataset.

Generally, replacing missing values with the mean/median/mode is a crude way of
treating them. Depending on the context, for example if the variation is low or if the
variable has little leverage over the response, such a rough approximation is
acceptable and can give satisfactory results. Since the variable here is categorical,
the mean and median are not defined, so only mode imputation applies.

For example, to implement this method we replace the missing value with the most
frequent value in that particular column; here we replace it with Male, since the count
of Male is higher than that of Female (Male = 2, Female = 1).
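A minimal sketch of mode imputation on the same toy data, continuing with the df built in the Method 1 sketch:

# Method 2: fill the gap with the most frequent category of the column
mode_value = df["Feature-1"].mode()[0]          # "Male" in this example
df_mode = df.fillna({"Feature-1": mode_value})
print(df_mode)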

Method 3: Develop a Model to Predict Missing Values


One smart way of doing this is to train a classifier that treats the column with missing
values as the dependent variable and the other features of the dataset as predictors,
and then impute the missing entries using the trained classifier.

Algorithms for Missing Values of Categorical Variables

 Divide the data into two parts. One part contains the rows where the column's values
are present (along with the original output column); the other part contains the rows
with missing values.
 Split the first part (present values) into training and cross-validation sets for model
selection.
 Train your models and evaluate their metrics on the cross-validation data. You can
also perform a grid search or randomized search for the best results.
 Finally, use the best model to predict the unknown values that are missing in our
problem.

NOTE: Since you are imputing the missing values with a model selected this way, the
imputations are less biased and you get the best predictions available from the best
model.

For example, to implement this strategy we take the Feature-2, Feature-3, and Output
columns as the independent features for the new classifier, with Feature-1 as the
target. Note that only the rows without missing values are used as training data, while
the observations with missing values become the test data. We then predict Feature-1
for the test rows with the trained model, and after these predictions the dataset no
longer has any missing values.
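Here is a minimal sketch of this idea on the toy dataset. It continues with the df from the Method 1 sketch and uses scikit-learn's LogisticRegression as a stand-in classifier; the source does not prescribe a particular model:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Method 3: predict the missing category from the other columns
X = pd.get_dummies(df[["Feature-2", "Feature-3", "Output"]])
known = df["Feature-1"].notna()

clf = LogisticRegression()
clf.fit(X[known], df.loc[known, "Feature-1"])

df_model = df.copy()
df_model.loc[~known, "Feature-1"] = clf.predict(X[~known])
print(df_model)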

Method 4: Delete the Variable


If there is an exceptionally large number of missing values, try excluding the variable
itself from further modeling, but make sure first that it is not very significant for
predicting the target variable, i.e., the correlation between the dropped variable and
the target variable is very low, or the variable is redundant.

For example, to implement this strategy we drop the entire column that contains
missing values, so for the given dataset we drop Feature-1 completely and use only
the remaining features to predict the target variable.
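A one-line sketch of this method, continuing with the df from the Method 1 sketch:

# Method 4: drop the variable that carries the missing values entirely
df_no_f1 = df.drop(columns=["Feature-1"])
print(df_no_f1)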

Method 5: Apply Unsupervised Machine Learning Techniques


In this approach, we use unsupervised techniques like K-Means, hierarchical
clustering, etc. The idea is to skip the columns that have missing values, take all the
other columns except the target column, and create as many clusters as there are
categories of the variable being imputed (here, two: Male and Female); finally, we find
the cluster, and hence the category, into which the row with the missing value falls.

For example, to implement this strategy we drop the Feature-1 column and use
Feature-2 and Feature-3 as the inputs for clustering. After the clusters are formed, we
observe which cluster the record with the missing value falls into, assign the
corresponding category, and then the final dataset is ready for further analysis.
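A minimal sketch of this approach, continuing with the df from the Method 1 sketch; using two clusters is an illustrative choice matching the two categories of Feature-1:

from sklearn.cluster import KMeans

# Method 5: cluster on the remaining features and see where the incomplete row lands
X = df[["Feature-2", "Feature-3"]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

df_clustered = df.assign(cluster=kmeans.labels_)
print(df_clustered.loc[df["Feature-1"].isna(), "cluster"])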

What is Categorical Data?

Since we are going to be working with categorical variables in this article, here is a
quick refresher with a couple of examples. Categorical variables are usually
represented as ‘strings’ or ‘categories’ and are finite in number. Here are a
few examples:

1. The city where a person lives: Delhi, Mumbai, Ahmedabad, Bangalore, etc.
2. The department a person works in: Finance, Human resources, IT,
Production.
3. The highest degree a person has: High school, Diploma, Bachelors, Masters,
PhD.
4. The grades of a student: A+, A, B+, B, B- etc.

In the above examples, the variables only have a definite set of possible values.
Further, we can see there are two kinds of categorical data:

 Ordinal Data: The categories have an inherent order

 Nominal Data: The categories do not have an inherent order

When encoding ordinal data, one should retain the information about the order in
which the categories are provided. In the example above, the highest degree a
person possesses gives vital information about their qualification, and the degree is
an important feature for deciding whether a person is suitable for a post.

When encoding nominal data, we only have to consider the presence or absence of a
feature; no notion of order is present. For example, for the city a person lives in, it is
important to retain where the person lives, but there is no order or sequence: living in
Delhi is treated the same as living in Bangalore.

For encoding categorical data, we have the Python package category_encoders. The
following command installs it easily.
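pip install category_encoders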

Categorical Data Encoding Techniques

What is Categorical Data?
When we collect data, we often encounter different types of variables. One
such type is categorical variables. Categorical variables are usually
represented as ‘strings’ or ‘categories’ and are finite in number.

There are two types of categorical data -

• Ordinal Data

• Nominal Data
Here are a few examples of categorical variables:
• Places: Delhi, Mumbai, Ahmedabad, Bangalore, etc.

• Departments: Finance, Human resources, IT, Production.

• Grades: A, A-, B+, B, B- etc.

Ordinal Data:

The categories of ordinal data have an Inherent Order. This means that
the categories can be Ranked or ordered from highest to lowest or vice
versa.

For example, the variable “highest degree a person has” is an ordinal variable. The
categories (High school, Diploma, Bachelors, Masters, PhD) can be ranked in order of
the level of education attained.

Nominal Data:

The categories of nominal data do not have an Inherent Order. This means that the
categories cannot be ranked or ordered.

For example, the variable “city where a person lives” is a nominal variable.
The categories (Delhi, Mumbai, Ahmedabad, Bangalore, etc.) cannot be
ranked or ordered.

What is Data Encoding?


Data Encoding is an important pre-processing step in Machine Learning.
It refers to the process of converting categorical or textual data into
numerical format, so that it can be used as input for algorithms to process.
The reason for encoding is that most machine learning algorithms work
with numbers and not with text or categorical variables.

Why is it Important?
• Most machine learning algorithms work only with numerical data, so
categorical variables (such as text labels) must be transformed into
numerical values.

• This allows the model to identify patterns in the data and make
predictions based on those patterns.

• Encoding also helps to prevent bias in the model by ensuring that all
features are equally weighted.

• The choice of encoding method can have a significant impact on model
performance, so it is important to choose an appropriate encoding technique based
on the nature of the data and the specific requirements of the model.

There are several methods for encoding categorical variables, including:

1. One-Hot Encoding

2. Dummy Encoding

3. Ordinal Encoding

4. Binary Encoding

5. Count Encoding

6. Target Encoding

Let’s take a closer look at each of these methods.

One-Hot Encoding:
• One-Hot Encoding is the Most Common method for
encoding Categorical variables.

• A Binary Column is created for each Unique Category in the variable.

• If a category is present in a sample, the corresponding column is set to 1,
and all other columns are set to 0.

• For example, if a variable has three categories ‘A’, ‘B’ and ‘C’, three
columns will be created and a sample with category ‘B’ will have the value
[0,1,0].
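As a minimal sketch (the grade column and its values are illustrative), one-hot encoding can be produced with pandas:

import pandas as pd

# Toy column with three categories
data = pd.DataFrame({"grade": ["A", "B", "C", "B"]})

# One binary column per category; the category present in a row gets a 1
one_hot = pd.get_dummies(data["grade"], prefix="grade")
print(one_hot)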
Dummy Encoding
• The dummy coding scheme is similar to one-hot encoding.

• This categorical data encoding method transforms the categorical variable
into a set of binary variables [0/1].

• In the case of one-hot encoding, for N categories in a variable, it uses N
binary variables.

• Dummy encoding is a small improvement over one-hot encoding: it uses
N-1 features to represent N labels/categories.
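A minimal sketch of dummy encoding, reusing the toy data frame from the one-hot sketch; drop_first=True is what removes the redundant column:

# Dummy encoding: N-1 columns for N categories; the dropped level is the baseline
dummy = pd.get_dummies(data["grade"], prefix="grade", drop_first=True)
print(dummy)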

Label Encoding:

• Each unique category is assigned a Unique Integer value.

• This is a simpler encoding method, but it has a Drawback in that the
assigned integers may be misinterpreted by the machine learning
algorithm as having an Ordered Relationship when in fact they do not.
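A minimal sketch using scikit-learn's LabelEncoder (which is documented for target labels but is often shown for this purpose), again reusing the toy data frame from the one-hot sketch:

from sklearn.preprocessing import LabelEncoder

# Each unique category is mapped to an integer (A=0, B=1, C=2 here)
le = LabelEncoder()
data["grade_label"] = le.fit_transform(data["grade"])
print(data)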
Ordinal Encoding:
• Ordinal Encoding is used when the categories in a variable have
a Natural Ordering.

• In this method, the categories are assigned a numerical value based
on their order, such as 1, 2, 3, etc.

• For example, if a variable has categories ‘Low’, ‘Medium’ and ‘High’,
they can be assigned the values 1, 2, and 3, respectively.
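A minimal sketch using an explicit mapping so the encoding respects the natural order; the sizes frame and its column are illustrative:

import pandas as pd

# Explicit mapping that preserves the order Low < Medium < High
sizes = pd.DataFrame({"level": ["Low", "High", "Medium", "Low"]})
order = {"Low": 1, "Medium": 2, "High": 3}
sizes["level_ordinal"] = sizes["level"].map(order)
print(sizes)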

Binary Encoding:
• Binary Encoding is similar to One-Hot Encoding, but instead of
creating a separate column for each category, the categories are
represented as binary digits.

• For example, if a variable has four categories ‘A’, ‘B’, ‘C’ and ‘D’, they are first
assigned integer codes (1, 2, 3, 4) and then represented by their binary digits
001, 010, 011 and 100, which requires fewer columns than one-hot encoding.
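A minimal sketch using the category_encoders package mentioned earlier; the letter column is illustrative:

import category_encoders as ce
import pandas as pd

# Four categories are first given ordinal codes, then split into binary digit columns
df_bin = pd.DataFrame({"letter": ["A", "B", "C", "D"]})
encoder = ce.BinaryEncoder(cols=["letter"])
print(encoder.fit_transform(df_bin))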

Feature Selection
Importance of Feature Selection in Machine Learning
Feature selection is a process that chooses a subset of features from the original
features so that the feature space is optimally reduced according to a certain
criterion.
Feature selection is a critical step in the feature construction process. In text
categorization problems, some words simply do not appear very often. Perhaps the
word “groovy” appears in exactly one training document, which is positive. Is it really
worth keeping this word around as a feature? It's a dangerous endeavor because it's
hard to tell with just one training example whether the word is really correlated with
the positive class or is just noise. You could hope that your learning algorithm is smart
enough to figure it out, or you could just remove it.
By using feature selection and feature importance methods, you can create new views of your
data to explore with modelling algorithms and gain insights into the most important factors
driving your model's predictions.

The benefits of feature selection and feature importance methods are numerous:

1. Firstly, they can improve model performance by focusing on the most relevant features, which
can make the model more accurate in predicting new, unseen data.
2. Secondly, they can reduce the computational complexity of a model by reducing the number
of features the model needs to process. This can lead to faster training and inference times, which
can be particularly important in real-time or large-scale applications.
3. Finally, they can improve the interpretability of a model by focusing on the most important
features and removing noise or irrelevant information. This can make it easier to understand how
the model is making predictions and to identify which features are most important in driving
those predictions.

Feature Selection Techniques in Machine Learning
There are three general classes of feature selection algorithms: Filter methods,
wrapper methods and embedded methods.
The role of feature selection in machine learning is,
1. To reduce the dimensionality of feature space.
2. To speed up a learning algorithm.
3. To improve the predictive accuracy of a classification algorithm.
4. To improve the comprehensibility of the learning results.
Feature selection algorithms are as follows:
1. Instance based approaches: There is no explicit procedure for feature subset
generation. Many small data samples are sampled from the data. Features are
weighted according to their roles in differentiating instances of different classes for a
data sample. Features with higher weights can be selected.
2. Nondeterministic approaches: Genetic algorithms and simulated annealing are
also used in feature selection.
3. Exhaustive complete approaches: Branch and Bound evaluates estimated
accuracy, and ABB checks an inconsistency measure that is monotonic. Both start
with the full feature set and remove features until the preset bound can no longer be
maintained.
While building a machine learning model for a real-life dataset, we come across many
features, and not all of them are important every time. Adding unnecessary features
while training the model reduces the overall accuracy, increases the complexity,
decreases the generalization capability, and can make the model biased. The saying
“sometimes less is better” applies to machine learning models as well.
Hence, feature selection is one of the important steps while building a machine
learning model. Its goal is to find the best possible set of features for building a
machine learning model.
Some popular techniques of feature selection in machine learning are:
 Filter methods
 Wrapper methods
 Embedded methods

Filter Methods

These methods are generally used during the pre-processing step. They select
features from the dataset irrespective of any machine learning algorithm.
Computationally, they are very fast and inexpensive, and they are very good at
removing duplicated, correlated, and redundant features, but they do not remove
multicollinearity. Each feature is evaluated individually, which can help when features
act in isolation (have no dependency on other features), but the approach lags when a
combination of features is what increases the overall performance of the model.

Some techniques used are:


 Information Gain – It is defined as the amount of information provided by the
feature for identifying the target value and measures reduction in the entropy
values. Information gain of each attribute is calculated considering the target
values for feature selection.
 Chi-square test – The chi-square method (X²) is generally used to test the
relationship between categorical variables. It compares the observed values of
different attributes of the dataset to their expected values, using the statistic
X² = Σ (Observed - Expected)² / Expected, summed over all categories.

 Fisher’s Score – Fisher’s score evaluates each feature independently according to
its score under the Fisher criterion, which can lead to a suboptimal set of features.
The larger the Fisher’s score, the better the selected feature.
 Correlation Coefficient – Pearson’s correlation coefficient quantifies the association
between two continuous variables and the direction of the relationship, with values
ranging from -1 to 1.
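As one concrete filter-method sketch, scikit-learn's SelectKBest can rank features by the chi-square statistic; the built-in iris data is used purely for illustration:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the two features with the highest chi-square score against the target;
# chi2 requires non-negative feature values, which holds for this dataset
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask of the retained features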

Wrapper methods:

Wrapper methods, also referred to as greedy algorithms, train the algorithm iteratively
using a subset of features. Based on the conclusions drawn from the previous model,
features are added or removed. The stopping criteria for selecting the best subset are
usually pre-defined by the person training the model, such as when the performance
of the model starts to decrease or a specific number of features has been reached.
The main advantage of wrapper methods over filter methods is that they provide an
optimal set of features for training the model, thus resulting in better accuracy than
filter methods, but they are computationally more expensive.

Some techniques used are:


 Forward selection – This method is an iterative approach where we initially start
with an empty set of features and, after each iteration, add the feature that best
improves the model. The process stops when adding a new variable no longer
improves the performance of the model.
 Backward elimination – This method is also an iterative approach where we initially
start with all features and, after each iteration, remove the least significant feature.
The process stops when no further improvement is observed after removing a
feature.
 Bi-directional elimination – This method uses the forward selection and backward
elimination techniques simultaneously to reach one unique solution.
 Exhaustive selection – This technique is considered the brute-force approach for
evaluating feature subsets. It creates all possible subsets, builds a learning
algorithm for each, and selects the subset whose model performs best.
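As a concrete wrapper-method sketch, scikit-learn's SequentialFeatureSelector (available from version 0.24 onward) implements greedy forward selection; the iris data and the choice of logistic regression are illustrative:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Greedy forward selection: start from an empty set and repeatedly add the
# feature that most improves the cross-validated score
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",
)
sfs.fit(X, y)
print(sfs.get_support())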

Embedded methods:
In embedded methods, the feature selection algorithm is blended into the learning
algorithm itself, so it has its own built-in feature selection. Embedded methods
overcome the drawbacks of filter and wrapper methods and merge their advantages:
they are fast like filter methods, more accurate than filter methods, and take
combinations of features into consideration as well.


Some techniques used are:


 Regularization – This method adds a penalty to the parameters of the machine
learning model to avoid over-fitting. This approach to feature selection uses Lasso
(L1 regularization) and elastic nets (L1 and L2 regularization). The penalty is applied
to the coefficients, driving some of them to zero; features with zero coefficients can
be removed from the dataset.
 Tree-based methods – Methods such as Random Forest and Gradient Boosting
provide feature importances, which can also be used to select features. Feature
importance tells us which features have more impact on the target feature.
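As a concrete embedded-method sketch using scikit-learn's SelectFromModel; since Lasso itself is a regression estimator, an L1-penalized logistic regression plays the equivalent role here for a classification dataset, and the iris data and parameter values are illustrative:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# L1-penalized model: features whose coefficients are driven to zero are dropped
l1_selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
)
X_l1 = l1_selector.fit_transform(X, y)

# Tree-based importances: keep features whose importance exceeds the median
rf_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median",
)
X_rf = rf_selector.fit_transform(X, y)
print(X_l1.shape, X_rf.shape)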
