Unit 4 Notes
Typically, datasets take on a tabular format consisting of rows and columns. Each column represents a specific variable, while each row corresponds to a single record or observation. Some datasets consisting of unstructured data are non-tabular, meaning they don’t fit the traditional row-column format.
Organizations use data to solve business problems, make informed decisions, and
effectively plan for the future. Data analysis ensures that this data is optimized and
ready to use.
There are four primary types of data analysis:
• Descriptive analysis
• Diagnostic analysis
• Predictive analysis
• Prescriptive analysis
Regardless of your reason for analyzing data, there are six simple steps that you can
follow to make the data analysis process more efficient.
1. Clean the Data
It’s imperative to clean your data before beginning analysis. This is particularly important if you’ll be presenting your findings to business teams who may use the data for decision-making purposes. Teams need to have confidence that they’re acting on a reliable source of information.
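As a sketch of this cleaning step, the snippet below uses pandas on a small hypothetical dataset (the column names and values are invented for illustration) to remove duplicates, normalise inconsistent labels, and handle missing entries:

```python
import pandas as pd
import numpy as np

# Hypothetical raw dataset with common quality issues
df = pd.DataFrame({
    "region": ["North", "North", "south", None, "East"],
    "sales": [100.0, 100.0, np.nan, 250.0, 300.0],
})

df = df.drop_duplicates()                    # remove exact duplicate rows
df["region"] = df["region"].str.title()      # normalise inconsistent casing
df = df.dropna(subset=["region"])            # drop rows missing a key field
df["sales"] = df["sales"].fillna(df["sales"].median())  # impute numeric gaps
```

The exact rules (which columns are "key", how to impute) depend on the business context; the point is that they are applied before any analysis begins.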
2. Identify the Right Questions
Once you’ve completed the cleaning process, you may have a lot of questions about
your final dataset. There’s so much potential that can be uncovered through analysis.
Identify the most important questions you hope to answer through your analysis.
These questions should be easily measurable and closely related to a specific
business problem. If the request for analysis is coming from a business team, ask
them to provide explicit details about what they’re hoping to learn, what they expect to
learn, and how they’ll use the information. You can use their input to determine which
questions take priority in your analysis.
3. Segment the Data
It’s often helpful to break down your dataset into smaller, defined groups. Segmenting your data will not only make your analysis more manageable, but also keep it on track.
4. Visualize the Data
One of the most important parts of data analysis is data visualization, which refers to the process of creating graphical representations of data. Visualizing the data will help you easily identify trends, patterns, and obvious outliers.
By creating engaging visuals that represent the data, you’re also able to effectively
communicate your findings to key stakeholders who can quickly draw conclusions from
the visualizations.
There’s a variety of data visualization tools you can use to automatically generate
visual representations of a dataset, such as Microsoft Excel, Tableau, and Google
Charts.
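Besides those tools, a chart like the following can be generated programmatically with pandas and Matplotlib (the sales figures here are made up for illustration):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures
sales = pd.Series([120, 150, 90, 180], index=["Jan", "Feb", "Mar", "Apr"])

ax = sales.plot(kind="bar", title="Monthly Sales")
ax.set_ylabel("Units sold")
plt.tight_layout()
plt.savefig("monthly_sales.png")  # export the chart for stakeholders
```

A quick bar chart like this often reveals outliers (the March dip here) faster than scanning the raw numbers.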
5. Interpret the Results
After cleaning, organizing, transforming, and visualizing your data, revisit the questions you outlined at the beginning of the data analysis process. Interpret your results and determine whether the data helps you answer your original questions.
If the results are inconclusive, try revisiting a previous step in the analysis process.
Maybe your dataset was too large and should have been segmented further, or
perhaps there’s a different type of visualization better suited to your data.
6. Supplement with Qualitative Data
Finally, as you near the conclusion of your analysis, remember that this dataset is only one piece of the puzzle.
It’s critical to pair your quantitative findings with qualitative information, which you may
capture using questionnaires, interviews, or testimonials. While the dataset has the
ability to tell you what’s happening, qualitative information can often help you
understand why it’s happening.
Businesses rely on the insights gained from data analysis to guide a myriad of activities, ranging from budgeting to strategy execution. The importance of data analysis for today’s organizations can't be overstated.
Real-world data collection has its own set of problems: the data is often very messy, with missing values, outliers, unstructured records, and so on. Before looking for any insights, we first have to perform preprocessing tasks; only then can we use the data for further observation and to train a machine learning model. Missing values are a very common phenomenon in real datasets. In this section, you will see how to handle missing values for categorical variables during data preprocessing. Missing value correction is required to reduce bias and to produce powerful, suitable models. Most algorithms can’t handle missing data, so you need to act in some way to keep your code from crashing. So, let’s begin with the methods to solve the problem.
Example 1: a sample dataset with a missing value in Feature-1.

Feature-1 | Feature-2 | Feature-3 | Output
Male      | 23        | 24        | Yes
––––      | 24        | 25        | No
Female    | 25        | 26        | Yes
Male      | 26        | 27        | Yes
The popular methods which are used by the machine learning community to
handle the missing value for categorical variables in the dataset are as
follows:
1. Replace with the most frequent value
For Example 1, to implement this method, we replace the missing value with the most frequent value for that particular column; here we replace the missing value with Male, since the count of Male is greater than Female (Male=2 and Female=1).
NOTE: In some cases, this strategy can make the data imbalanced w.r.t. classes if there is a huge number of missing values present in our dataset.
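A minimal sketch of this strategy in pandas, using the Example 1 dataset (column names as in the table above):

```python
import pandas as pd
import numpy as np

# The example dataset: Feature-1 (gender) has one missing value
df = pd.DataFrame({
    "Feature-1": ["Male", np.nan, "Female", "Male"],
    "Feature-2": [23, 24, 25, 26],
    "Feature-3": [24, 25, 26, 27],
    "Output":    ["Yes", "No", "Yes", "Yes"],
})

# Replace missing categories with the most frequent value (the mode)
most_frequent = df["Feature-1"].mode()[0]        # "Male" (count 2 vs 1)
df["Feature-1"] = df["Feature-1"].fillna(most_frequent)
```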
2. Predict the missing values with a classification model
Divide the data into two parts. One part will have the rows where the column's values are present (including the original output column); the other part will have the rows with the missing values.
Split the first part (present values) into cross-validation sets for model selection. Train your models and test their metrics against the cross-validated data. You can also perform a grid search or randomized search for the best results.
Finally, use the best model to predict the unknown values that are missing in our problem.
NOTE: Imputing missing values this way tends to introduce less bias than a blanket replacement, and you get the best predictions out of the best model.
For Example 1, to implement the given strategy, we first take Feature-2, Feature-3, and the Output column as the independent features for our new classifier, with Feature-1 as the target outcome. Note that only the non-missing rows serve as our training data, and the observations with missing values become our test data. We then predict Feature-1 for the test data with our model, and after the predictions we have a dataset with no missing values.
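A rough sketch of this approach with scikit-learn, using a decision tree as a stand-in classifier (in practice, any model chosen through cross-validation could be substituted):

```python
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "Feature-1": ["Male", np.nan, "Female", "Male"],
    "Feature-2": [23, 24, 25, 26],
    "Feature-3": [24, 25, 26, 27],
    "Output":    ["Yes", "No", "Yes", "Yes"],
})

features = ["Feature-2", "Feature-3", "Output"]
X = pd.get_dummies(df[features])          # encode the Output column numerically

known = df["Feature-1"].notna()           # rows with Feature-1 present: train set
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X[known], df.loc[known, "Feature-1"])

# Predict the missing categories (the test set) and fill them in
df.loc[~known, "Feature-1"] = clf.predict(X[~known])
```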
3. Drop the column
For Example 1, to implement this strategy, we drop the entire column that contains missing values; for the given dataset we drop Feature-1 completely and use only the remaining features to predict our target variable.
4. Use clustering
For Example 1, to implement this strategy, we drop the Feature-1 column and use Feature-2 and Feature-3 as the features for a clustering model. After the clusters are formed, we observe which cluster each missing record falls into and assign it the most common Feature-1 value in that cluster, and we are ready with our final dataset for further analysis.
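A minimal sketch of the clustering strategy using scikit-learn's KMeans (the two-cluster choice is an assumption made for this tiny example):

```python
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "Feature-1": ["Male", np.nan, "Female", "Male"],
    "Feature-2": [23, 24, 25, 26],
    "Feature-3": [24, 25, 26, 27],
})

# Cluster on the remaining features only (Feature-1 is left out)
km = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = km.fit_predict(df[["Feature-2", "Feature-3"]])

# Fill each missing value with the most common category in its cluster
for c in df["cluster"].unique():
    in_cluster = df["cluster"] == c
    mode = df.loc[in_cluster, "Feature-1"].mode()
    if not mode.empty:
        df.loc[in_cluster, "Feature-1"] = (
            df.loc[in_cluster, "Feature-1"].fillna(mode[0])
        )
```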
Here are a few examples of categorical variables:
1. The city where a person lives: Delhi, Mumbai, Ahmedabad, Bangalore, etc.
2. The department a person works in: Finance, Human resources, IT,
Production.
3. The highest degree a person has: High school, Diploma, Bachelors, Masters,
PhD.
4. The grades of a student: A+, A, B+, B, B- etc.
In the above examples, the variables only have a definite set of possible values. Further, we can see there are two kinds of categorical data:
• Ordinal Data
• Nominal Data
Ordinal Data:
The categories of ordinal data have an Inherent Order. This means that
the categories can be Ranked or ordered from highest to lowest or vice
versa.
Nominal Data:
The categories of nominal data have no inherent order, so they cannot be ranked. For example, the variable “city where a person lives” is a nominal variable. The categories (Delhi, Mumbai, Ahmedabad, Bangalore, etc.) cannot be ranked or ordered.
Why is Encoding Important?
• Most machine learning algorithms work only with numerical data, so
categorical variables (such as text labels) must be transformed into
numerical values.
• This allows the model to identify patterns in the data and make
predictions based on those patterns.
• Encoding also helps to prevent bias in the model by ensuring that all
features are equally weighted.
1. One-Hot Encoding
2. Dummy Encoding
3. Ordinal Encoding
4. Binary Encoding
5. Count Encoding
6. Target Encoding
One-Hot Encoding:
• One-Hot Encoding is the most common method for encoding categorical variables. It creates one new binary column per category; a sample gets a 1 in the column for its category and 0 everywhere else.
• For example, if a variable has three categories ‘A’, ‘B’ and ‘C’, three columns will be created and a sample with category ‘B’ will have the value [0,1,0].
Dummy Encoding
• The dummy coding scheme is similar to one-hot encoding, but it uses one fewer column: a variable with k categories is encoded with k−1 binary columns, and the dropped category is represented by all zeros. This avoids the redundancy of full one-hot encoding.
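Both schemes can be sketched with pandas' get_dummies (the grade column here is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"grade": ["A", "B", "C", "B"]})

# One-hot encoding: one binary column per category (3 categories -> 3 columns)
one_hot = pd.get_dummies(df["grade"], prefix="grade", dtype=int)

# Dummy encoding: drop the first category (3 categories -> 2 columns);
# category 'A' is represented by all zeros
dummy = pd.get_dummies(df["grade"], prefix="grade", drop_first=True, dtype=int)
```

Here the sample at index 1 (category ‘B’) becomes [0, 1, 0] under one-hot encoding and [1, 0] under dummy encoding.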
Label Encoding:
• Label Encoding assigns each category a unique integer (e.g., ‘A’→0, ‘B’→1, ‘C’→2). Note that this imposes an artificial order on the categories, so it is best suited to ordinal data or to tree-based models.
Binary Encoding:
• Binary Encoding is similar to One-Hot Encoding, but instead of creating a separate column for each category, the categories are first given integer codes and those codes are written out in binary digits.
For example, if a variable has four categories ‘A’, ‘B’, ‘C’ and ‘D’, they can be represented in just two binary columns as 00, 01, 10 and 11, respectively (the four-column pattern 0001, 0010, 0100, 1000 would be one-hot encoding).
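A hand-rolled sketch of binary encoding in pandas (libraries such as category_encoders also offer a ready-made BinaryEncoder; the dept column is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"dept": ["A", "B", "C", "D"]})

# Assign each category an integer code, then write the code in binary.
# Four categories need only 2 binary columns (ceil(log2(4))) instead of 4.
codes = df["dept"].astype("category").cat.codes   # A->0, B->1, C->2, D->3
df["dept_bin_1"] = (codes // 2) % 2               # high bit
df["dept_bin_0"] = codes % 2                      # low bit
```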
Feature Selection
Importance of Feature Selection in Machine Learning
Feature selection is a process that chooses a subset of features from the original
features so that the feature space is optimally reduced according to a certain
criterion.
Feature selection is a critical step in the feature construction process. In text
categorization problems, some words simply do not appear very often. Perhaps the
word “groovy” appears in exactly one training document, which is positive. Is it really
worth keeping this word around as a feature? It’s a dangerous endeavor because it’s hard to tell, with just one training example, whether it is really correlated with the positive class or is just noise. You could hope that your learning algorithm is smart enough to figure it out. Or you could just remove it.
By using feature selection and feature importance methods, you can create new views of your
data to explore with modelling algorithms and gain insights into the most important factors
driving your model's predictions.
The benefits of feature selection and feature importance methods are numerous:
1. Firstly, they can improve model performance by focusing on the most relevant features, which
can make the model more accurate in predicting new, unseen data.
2. Secondly, they can reduce the computational complexity of a model by reducing the number
of features the model needs to process. This can lead to faster training and inference times, which
can be particularly important in real-time or large-scale applications.
3. Finally, they can improve the interpretability of a model by focusing on the most important
features and removing noise or irrelevant information. This can make it easier to understand how
the model is making predictions and to identify which features are most important in driving
those predictions.
Filter Methods
These methods are generally used during the pre-processing step. They select features from the dataset independently of any machine learning algorithm. In terms of computation, they are very fast and inexpensive, and are very good at removing duplicated, correlated, and redundant features, but they do not remove multicollinearity. Each feature is evaluated individually, which can help when features act in isolation (have no dependency on other features) but will lag when a combination of features leads to an increase in the overall performance of the model.
Chi-square Formula:
χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ
where Oᵢ is the observed frequency and Eᵢ is the expected frequency for each category.
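A filter-method sketch using scikit-learn's chi-square test on the built-in Iris data (here k=2 is an arbitrary choice of how many features to keep):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)          # 4 features, all non-negative

# Keep the 2 features with the highest chi-square score against the target
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (150, 2)
```

Note that chi2 requires non-negative feature values, since it compares observed and expected frequencies.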
Wrapper methods:
Wrapper methods, also referred to as greedy algorithms, train the algorithm using a subset of features in an iterative manner. Based on the conclusions drawn from the previous round of training, features are added or removed. The stopping criteria for selecting the best subset are usually pre-defined by the person training the model, such as when the performance of the model starts to decrease or a specific number of features has been reached. The main advantage of wrapper methods over filter methods is that they provide an optimal set of features for training the model, thus resulting in better accuracy than filter methods, but they are computationally more expensive.
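A wrapper-method sketch using scikit-learn's recursive feature elimination (RFE), with logistic regression as an example estimator:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Iteratively refit the model, dropping the weakest feature each round,
# until only the requested number of features remains
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
```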
Embedded methods:
In embedded methods, feature selection is blended into the learning algorithm itself, which thus has its own built-in feature selection. Embedded methods overcome the drawbacks of filter and wrapper methods and merge their advantages: they are fast like filter methods, more accurate than filter methods, and take combinations of features into consideration as well.
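An embedded-method sketch using a random forest's built-in feature importances via scikit-learn's SelectFromModel (the median threshold is an arbitrary choice for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

# The forest computes feature importances while it learns;
# SelectFromModel keeps features at or above the median importance
model = SelectFromModel(RandomForestClassifier(random_state=0),
                        threshold="median")
X_selected = model.fit_transform(X, y)
print(X_selected.shape)
```

Penalised models such as Lasso work the same way: the fitting procedure itself drives coefficients of unhelpful features toward zero.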