Data Transformation & Reduction


UNIT-3: Data Preparation

● Data Validation
● Data Transformation:
○ Standardization or Normalization
○ Feature Extraction
● Data Reduction
○ Sampling
○ Feature Selection
○ Principal Component Analysis (PCA)
○ Data Discretization

Data Transformation: Data transformation is one of the fundamental steps of data preprocessing.

Data Standardization

Data standardization is the process of converting data to a common format to enable users to
process and analyse it.

● Most organizations utilize data from several sources; this can include data warehouses, data lakes, cloud storage, and databases.
● It involves converting that data into a uniform format, with logical and consistent definitions.
● These definitions will form your metadata — the labels that identify the what, how, why, who, when, and where of your data.
● That is the basis of your data standardization process.
● From an accuracy perspective, standardizing the way you label data will improve access to the most relevant and current information.
● This will help make your analytics and reporting easier.

Data standardization means your data is internally consistent — each of your data sources has
the same format and labels. When your data is neatly organized with logical descriptions and
labels, everyone in your organization can understand it and put it to use.
Importance of Data Standardization

A lack of standardization results in low-quality data. Companies lose 12% of their potential
revenue, on average, due to bad data.

This causes many issues, including:

● Harmed reporting and forecasting, because inconsistent data is difficult to search and filter. Standardization makes data queries and filtering easier and more predictable.
● Embarrassing marketing automation errors that harm your reputation. Ever sent an email to “jane” instead of “Jane”? Or, worse yet, addressed someone as “{First Name}”? These small errors break the veil of one-to-one personalization and harm your company’s reputation.
● Inflated marketing budgets. Inconsistent data can cause you to send bad emails, send direct mail to the wrong address, and ultimately waste a lot of your marketing budget.
● Broken integrations with important software. For example, inconsistent phone numbers could cause problems with a sales team’s auto-dialer software, leading to missed opportunities.
● Slower data cleansing processes. When data has inconsistencies, many different types of issues can crop up. These can be difficult to identify in the CRM or in Excel with VLOOKUP, and they slow down your teams because they require by-hand editing to rectify.
Steps of Data Standardization in CRM

1. Audit and Take Stock of Your Data

Like any human-entered data, raw CRM data usually contains errors. In general, you want to make sure that your CRM data is:

● Correct
● Clean
● Complete
● Properly formatted (standardized)
● Verified
2. Remove Clutter from Your Database
⮚ Duplicate data is a big problem because it breaks the single customer view that your
marketing, sales, support, and success teams rely on to evaluate their engagements
with prospects and customers. It splits the context of those interactions between two
or more records. It also leads to more of the embarrassing mistakes that harm a
company’s reputation — such as emailing or mailing marketing messages twice to the
same customer or prospect.
⮚ Irrelevant data. Unnecessary data that takes up vital storage space within your
CRM.
⮚ Redundant data. Data contained in two fields (or across multiple records) that are
trying to convey the same thing. For instance, “Location” and “City” may convey the
same data in two separate fields, taking up space and leading to confusion when your
team goes to use the data.
⮚ Inaccurate data. Standardizing your data does not do too much good if you are
standardizing inaccurate data. Make sure that you have data verification or
enrichment plans in place to ensure that the data that you are collecting is accurate.
⮚ Low quality data. Data that is non-personalized or generally low quality. This can
include organizational emails like info@domain.com, sales@domain.com, or free
email accounts for B2B companies. Another example would be emails that have
bounced when previously mailed.

3. Know & Evaluate Your CRM Data Collection Methods


● Consider your business requirements when collecting data.
● What data do you absolutely need to collect?
● What data is less important for your business processes?

Some common considerations when it comes to data collection methods include:

⮚ Customer-Facing Data Input Forms


▪ When you rely on your customers to supply their information (as most
companies do) you are going to have unavoidable data issues. Consider what
information you are asking your customers to provide and what data it may be
best to collect in another way, such as through a data enrichment service.
▪ Make sure form fields have proper validation in place so that you are sending
clean, standardized data into your CRM.
⮚ Employee-Facing Data Input Forms
▪ Just like customers, your employees can make mistakes when inputting data into forms. As with customer-facing forms, you need to evaluate what you are asking your teams to input and whether the forms have proper validation in place for each field.
▪ Additionally, employee training on the importance and best practices of data
quality and the impact of bad data on your business can go a long way toward
reducing internally-caused customer data issues.
⮚ Third-Party Customer List Imports
▪ Importing data from another platform can often lead to data quality and standardization issues. You may import a lot of redundant data because the other platform uses different field titles to refer to the same data.
▪ Say you were importing a third-party list into Pipedrive. If you don’t
have Pipedrive deduplication processes in place, you are likely to create
duplicate records. Knowing the issues that are common with imports from
each third-party platform is important for reducing data issues.
▪ Or maybe you have data from a recent event that you appeared at. There can
be all sorts of issues with outside data — data in all caps, titles that don’t
match conventions, picklist values that are unverified, etc.

Your CRM data collection methods are the gears that are responsible for
feeding new data into your CRM. Most data errors and issues will start there.
Taking steps to rectify those issues before they take place can limit a lot of
standardization problems.
⮚ Integrations
▪ Of course, integrations are all around great for businesses. They ensure that
our critical software solutions are talking to each other and sharing data where
available. However, integrations are another common source of data issues.
▪ One piece of software may categorize data differently or use different fields to describe the same information. For example, integrating Salesforce and HubSpot is a common need for companies looking to connect and sync their sales and marketing operations. However, it can lead to several common problems, including the creation of duplicate contacts and companies, complicated account hierarchies that lead to errors, and issues with sync timing.
▪ Integrations simply add too much value to avoid, but you can be certain that
nearly any integration between two separate systems will cause some
unintended data consequences that you will have to figure out how to deal
with.
4. Define Your CRM Data Standards

To standardize your data, you must have your standards defined. These standards are
defined by a set of rules for each field.

For instance, some of the different standards that are commonly set for CRM field data
include:

● First names should be capitalized and contain no spaces, numbers, or extra characters. You may also want to remove middle names and titles like Mr. and Ms. so that you can use the first name in campaigns without worrying about awkward outcomes.
● Phone number formatting should be specific and consistent, such as 123-456-7890 rather than (123)-456-7890, 123.456.7890, or 1-(123)456-7890. It is best to use an international standard like E.164, which is uniform and supported by most software and hardware products.
● States should be expressed using a consistent convention, either verbose, such as “Washington”, or abbreviated, such as “WA”. The key is to ensure consistency across all the systems and apps that you use.
● Emails should follow the standard format of “name@domain.com”. You may consider limiting free email services like Gmail and Hotmail, or at least have a way to identify and filter free email domains so that you can treat them as a segment. Limiting data sources can help to improve quality and make data standardization easier.
● Website URLs should include the full “http://www.” prefix and not just “sitedomain.com”, with a separate field used to store the bare domain name “sitedomain.com”.
● Job titles should be standardized to popular acronyms, such as “CEO” instead of “Chief Executive Officer.” You can choose either option; the key is to maintain consistency.
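The sketch below illustrates how a few of these field standards might be enforced in code before records reach the CRM. It is a minimal example, assuming hypothetical field names (first_name, phone, email) and US 10-digit phone numbers for the E.164 formatting; it is not tied to any specific CRM API.

```python
import re

def standardize_contact(record: dict) -> dict:
    """Apply simple field standards to a raw contact record (illustrative only)."""
    clean = dict(record)

    # First name: strip titles, keep only the first remaining word, capitalize it.
    name = re.sub(r"\b(mr|mrs|ms|dr)\b\.?", "", clean.get("first_name", ""), flags=re.I)
    clean["first_name"] = name.strip().split()[0].capitalize() if name.strip() else ""

    # Phone: keep digits only and format as E.164, assuming a US 10-digit number.
    digits = re.sub(r"\D", "", clean.get("phone", ""))
    clean["phone"] = "+1" + digits[-10:] if len(digits) >= 10 else ""

    # Email: lower-case it and flag free mail domains so they can be segmented later.
    email = clean.get("email", "").strip().lower()
    clean["email"] = email
    clean["is_free_email"] = email.split("@")[-1] in {"gmail.com", "hotmail.com", "yahoo.com"}

    return clean

print(standardize_contact({
    "first_name": "Mr. JANE ann",
    "phone": "(123) 456-7890",
    "email": "Jane@Gmail.com",
}))
# {'first_name': 'Jane', 'phone': '+11234567890', 'email': 'jane@gmail.com', 'is_free_email': True}
```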

5. Standardize Data
▪ With standards in place, you can begin the process of fixing standardization issues
throughout your customer database.
▪ For smaller companies, using Excel functions and VLOOKUP might be powerful enough
to fix most of your standardization issues. Larger companies with more CRM records
may need to invest in a third-party solution to manage data at scale on a continuous basis.
▪ Implement proper form validation and data cleaning processes to ensure that data is
standardized and clean when it hits your CRM. The best way to keep a customer database
clean is to make sure that bad data never makes its way into it in the first place!
▪ Try to keep your expectations in check. In large databases, it is impossible to avoid
quality and standardization issues at some level. There will always be errors that you’ll
need to fix manually. But by making standardization and proper data collection a priority,
you’ll greatly reduce the amount of time that you have to spend dealing with these issues.

Standardization Techniques:

a) 0-1 scaling: each variable in the data set is re-calculated as

(V - min V) / (max V - min V),

where V represents the value of the variable in the original data set.

∙ This method allows variables to have differing means and standard deviations but
equal ranges.
∙ In this case, there is at least one observed value at the 0 and 1 endpoints.
b) Dividing each value by the range: re-calculates each variable as
V /(max V - min V).

In this case, the means, variances, and ranges of the variables are still different, but at
least the ranges are likely to be more similar.

c) Z-score scaling: variables recalculated as

(V - mean of V)/s, where "s" is the standard deviation.

As a result, all variables in the data set have equal means (0) and standard deviations (1)
but different ranges.

d) Dividing each value by the standard deviation. This method produces a set of
transformed variables with variances of 1, but different means and ranges.
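A short sketch of techniques (a) and (c) is given below; the small example array is an assumption made purely for illustration.

```python
import numpy as np

V = np.array([[2.0, 100.0],
              [4.0, 300.0],
              [6.0, 900.0]])   # two variables with very different scales

# a) 0-1 scaling: (V - min V) / (max V - min V); every column then spans exactly [0, 1].
scaled_01 = (V - V.min(axis=0)) / (V.max(axis=0) - V.min(axis=0))

# c) z-score scaling: (V - mean of V) / s; every column then has mean 0 and std 1.
z_scored = (V - V.mean(axis=0)) / V.std(axis=0)

print(scaled_01)
print(z_scored.mean(axis=0), z_scored.std(axis=0))   # ~[0, 0]  [1, 1]
```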
Feature Extraction
● Feature Extraction aims to reduce the number of features in a dataset by creating new features from the existing ones (and then discarding the original features).
● This new, reduced set of features should be able to summarize most of the information contained in the original set of features.
● In this way, a summarised version of the original features can be created from a combination of the original set.
● Feature Extraction is basically a process of dimensionality reduction in which the raw data is separated into related, manageable groups.
● A distinctive feature of large datasets is that they contain a large number of variables, and these variables require a lot of computing resources to process.
● Hence Feature Extraction can be useful in this case for selecting particular variables and combining related variables, which reduces the amount of data.

Feature Generation

∙ Feature Generation is the process of inventing new features from already existing features.
∙ As dataset sizes vary a lot, the larger ones can become impossible to manage.
∙ Thus the process of feature generation can play a vital role in easing this task.
∙ To avoid generating meaningless features, we make use of mathematical formulae and statistical models to enhance clarity and accuracy.
∙ This process usually adds more information to the model to make it more accurate.
∙ So model accuracy can be enhanced through this process.
∙ In this way, meaningless interactions are ignored while meaningful interactions are detected.
Feature Evaluation

∙ It is important to prioritize the features at the outset so that the work proceeds in a well-organized manner, and feature evaluation is a tool for this.
∙ Each feature is evaluated and scored objectively, and is then utilized based on current needs.
∙ The unimportant ones can be ignored.
∙ Feature evaluation is therefore an important task for obtaining a proper final model output, since it reduces bias and inconsistency in the data.

Linear and Non-Linear Feature Extraction

∙ Feature Extraction can be divided into two broad categories: linear and non-linear.
∙ One example of linear feature extraction is PCA (Principal Component Analysis).
∙ A principal component is a normalized linear combination of the original features in a dataset.
∙ PCA is basically a method to obtain the required (most important) variables from a large set of variables available in a data set.
∙ PCA uses an orthogonal transformation to map the data into a lower-dimensional space in which the variance of the projected data is maximized.
∙ PCA can also be used for anomaly and outlier detection, since anomalies and outliers appear as noise or irrelevant data in the dataset.

Steps followed in building PCA from scratch are:

∙ Firstly, standardize the data.
∙ Thereafter, calculate the covariance matrix.
∙ Then, calculate the eigenvectors and eigenvalues of the covariance matrix.
∙ Arrange the eigenvalues (and their corresponding eigenvectors) in decreasing order.
∙ Select the eigenvectors corresponding to the largest eigenvalues.
∙ Horizontally stack the selected eigenvectors to form the projection matrix, then project the standardized data onto it.
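A minimal NumPy sketch of these steps is shown below (an illustration added here, not part of the original text); the random example matrix and the choice of k = 2 components are assumptions.

```python
import numpy as np

def pca_from_scratch(X: np.ndarray, k: int = 2) -> np.ndarray:
    """Project X (rows = observations, columns = features) onto its top-k principal components."""
    # 1. Standardize the data (zero mean, unit standard deviation per column).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2. Calculate the covariance matrix of the standardized features.
    cov = np.cov(Z, rowvar=False)

    # 3. Eigenvectors and eigenvalues of the covariance matrix (eigh: cov is symmetric).
    eigvals, eigvecs = np.linalg.eigh(cov)

    # 4. Arrange eigenvalues (and their eigenvectors) in decreasing order.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # 5./6. Stack the top-k eigenvectors into a projection matrix and project the data.
    W = eigvecs[:, :k]
    print("explained variance ratio:", eigvals[:k] / eigvals.sum())
    return Z @ W

X = np.random.default_rng(0).normal(size=(100, 5))
X[:, 3] = 2 * X[:, 0] + 0.1 * X[:, 3]     # introduce correlation so PCA has structure to find
print(pca_from_scratch(X, k=2).shape)     # (100, 2)
```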

PCA fails when the data is non-linear, which can be considered one of the biggest disadvantages of PCA.

This is where Kernel-PCA plays its role.

Kernel-PCA is similar to SVM in that both implement the kernel trick to map non-linear data into a higher-dimensional space in which the data becomes separable. Non-linear approaches can be used, for example, in face recognition to extract features from large datasets.
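As an added illustration (not from the original text), scikit-learn's KernelPCA can be compared with linear PCA on data that is not linearly separable; the RBF kernel and its gamma value here are assumptions made for the sketch.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: a classic non-linear structure that linear PCA cannot unfold.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear = PCA(n_components=2).fit_transform(X)
kernel = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# In the kernel projection the two circles become (almost) linearly separable along
# the first component, which linear PCA cannot achieve on this data.
print(linear.shape, kernel.shape)   # (400, 2) (400, 2)
```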

Applications of Feature Extraction

∙ Bag of Words: This is the most widely used technique in Natural Language Processing. Here, sentences are first tokenized and stop words are removed. After that, each word is counted according to its frequency of use.
∙ Image Processing: Image processing is one of the most explored domains in which feature extraction is widely used. Since digital images represent different features or attributes such as shapes, hues, and motion, processing them so that only the specified features are extracted is of utmost importance. Image processing also makes use of many algorithms in addition to feature extraction.
∙ Auto-encoders:
❖ Autoencoders are a family of Machine Learning algorithms which can be used as a dimensionality reduction technique.
❖ The main difference between Autoencoders and other dimensionality reduction techniques is that Autoencoders use non-linear transformations to project data from a high dimension to a lower one.

The basic architecture of an Autoencoder can be broken down into 2 main components:

1. Encoder: takes the input data and compresses it, so as to remove all possible noise and unhelpful information. The output of the Encoder stage is usually called the bottleneck or latent space.
2. Decoder: takes the encoded latent space as input and tries to reproduce the original Autoencoder input using just its compressed form (the encoded latent space).

If we did not use non-linear activation functions, the Autoencoder would try to reduce the input data using a linear transformation, therefore giving a result similar to what we would have obtained with PCA.
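A minimal sketch of this encoder/decoder architecture is given below; the use of PyTorch, the layer sizes, and the 2-dimensional bottleneck are illustrative assumptions, not part of the original text.

```python
import torch
from torch import nn

class Autoencoder(nn.Module):
    def __init__(self, n_features: int = 20, bottleneck: int = 2):
        super().__init__()
        # Encoder: compresses the input down to the bottleneck (latent space).
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 8), nn.ReLU(),  # non-linear activation: without it, this reduces to a PCA-like linear map
            nn.Linear(8, bottleneck),
        )
        # Decoder: tries to reproduce the original input from the latent space alone.
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 8), nn.ReLU(),
            nn.Linear(8, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
X = torch.randn(64, 20)                               # a dummy batch of 64 observations
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):                                  # minimize reconstruction error
    optim.zero_grad()
    loss = nn.functional.mse_loss(model(X), X)
    loss.backward()
    optim.step()
codes = model.encoder(X)                              # the 2-dimensional extracted features
print(codes.shape)                                    # torch.Size([64, 2])
```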

∙ Autoencoders are mainly used when we want to learn a compressed representation of raw data; the procedure is basically unsupervised in nature.
∙ Effective Feature Extraction also plays a major role in solving underfitting- and overfitting-related problems in Machine Learning projects.
∙ Feature Extraction also gives us a clearer and improved visualization of the data present in the dataset, as only the important and required data has been extracted.
∙ Feature Extraction helps in training the model in a more efficient manner, which in turn speeds up the whole process.

Feature Extraction v/s Feature Selection?

▪ Feature Selection aims to rank the importance of the features already existing in the dataset and, in turn, remove the less important ones.
▪ Feature Extraction, however, is concerned with reducing the dimensions of the dataset by deriving new features, making the dataset more crisp and clear.

Significance of Feature Extraction

∙ Feature Extraction has diverse uses in most domains.
∙ At the beginning of any project which makes use of a large dataset, the whole feature extraction procedure must be executed and evaluated carefully to get an optimized result with greater accuracy.
∙ This in turn will provide better insight into the relationships between the variables present in the dataset.

Data Reduction
∙ When dealing with a small dataset, the transformations described above are usually
adequate to prepare input data for a data mining analysis.
∙ However, when facing a large dataset it is also appropriate to reduce its size, in order to
make learning algorithms more efficient, without sacrificing the quality of the results
obtained.
There are three main criteria to determine whether a data reduction technique should be used:
efficiency, accuracy and simplicity of the models generated.

Efficiency

● The application of learning algorithms to a dataset smaller than the original one usually
means a shorter computation time.
● If the complexity of the algorithm is a superlinear function of the dataset size, as is the case for most known methods, the improvement in efficiency resulting from a reduction in the dataset size may be dramatic.
● Within the data mining process it is customary to run several alternative learning
algorithms in order to identify the most accurate model.
● Therefore, a reduction in processing times allows the analyses to be carried out more
quickly.

Accuracy
In most applications, the accuracy of the models generated represents a critical success factor,
and it is therefore the main criterion followed in order to select one class of learning methods
over another.

Data reduction techniques should not significantly compromise the accuracy of the model
generated.

It may also be the case that some data reduction techniques, based on attribute selection, will
lead to models with a higher generalization capability on future records.

Simplicity
In some data mining applications, concerned more with interpretation than with prediction, it is
important that the models generated be easily translated into simple rules that can be understood
by experts in the application domain.
As a trade-off for achieving simpler rules, decision makers are sometimes willing to allow a
slight decrease in accuracy.
Data reduction often represents an effective technique for deriving models that are more easily
interpretable.
Since it is difficult to develop a data reduction technique that represents the optimal solution for
all the criteria described, the analyst will aim for a suitable trade-off among all the requirements
outlined.

Data reduction can be pursued in three distinct directions, described below:


● a reduction in the number of observations through sampling,
● a reduction in the number of attributes through selection and projection, and
● a reduction in the number of values through discretization and aggregation.

Sampling

● A further reduction in the size of the original dataset can be achieved by extracting a
sample of observations that is significant from a statistical standpoint.
● This type of reduction is based on classical inferential reasoning.
● It is therefore necessary to determine the size of the sample that guarantees the level of
accuracy required by the subsequent learning algorithms and to define an adequate
sampling procedure.
● Sampling may be simple or stratified depending on whether one wishes to preserve in the
sample the percentages of the original dataset with respect to a categorical attribute that is
considered critical.
● Generally speaking, a sample comprising a few thousand observations is adequate to train
most learning models.
● It is also useful to set up several independent samples, each of a predetermined size, to
which learning algorithms should be applied.
● In this way, computation times increase linearly with the number of samples determined,
and it is possible to compare the different models generated, in order to assess the
robustness of each model and the quality of the knowledge extracted from data against
the random fluctuations existing in the sample.
● It is obvious that the conclusions obtained can be regarded as robust when the models and
the rules generated remain relatively stable as the sample set used for training varies.

Feature selection
▪ The purpose of feature selection, also called feature reduction, is to eliminate from the
dataset a subset of variables which are not deemed relevant for the purpose of the data
mining activities.
▪ One of the most critical aspects of a learning process is choosing the combination of predictive variables best suited to accurately explain the investigated phenomenon.
▪ Feature reduction has several potential advantages.
▪ Due to the presence of fewer columns, learning algorithms can be run more quickly
on the reduced dataset than on the original one.
▪ Moreover, the models generated after the elimination from the dataset of un-influential
attributes are often more accurate and easier to understand.

Feature selection methods can be classified into three main categories: filter
methods, wrapper methods and embedded methods.
Filter methods
● Filter methods select the relevant attributes before moving on to the subsequent learning
phase, and are therefore independent of the specific algorithm being used.
● The attributes deemed most significant are selected for learning, while the rest are
excluded.
● Several alternative statistical metrics have been proposed to assess the predictive
capability and relevance of a group of attributes.
● Generally, these are monotone metrics, whose value increases or decreases with the number of attributes considered.
● The simplest filter method for supervised learning involves assessing each single attribute based on its level of correlation with the target.
● Consequently, this leads to the selection of the attributes that appear most correlated with the target.
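As a minimal illustration of this simplest filter method (an added sketch, not from the original text), the correlation of each attribute with the target can be computed and the top-ranked attributes kept; the dataset and the choice of keeping the three most correlated attributes are assumptions.

```python
from sklearn.datasets import load_diabetes

# A small regression dataset: 10 numeric attributes and a numeric target.
data = load_diabetes(as_frame=True)
X, y = data.data, data.target

# Rank each attribute by the absolute value of its correlation with the target.
relevance = X.corrwith(y).abs().sort_values(ascending=False)
selected = relevance.head(3).index.tolist()   # keep the 3 most correlated attributes

print(relevance)
print("selected attributes:", selected)
X_reduced = X[selected]                        # reduced dataset passed on to learning
```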

Wrapper methods

● If the purpose of the data mining investigation is classification or regression, and consequently performance is assessed mainly in terms of accuracy, the selection of predictive variables should be based not only on the relevance of each single attribute but also on the specific learning algorithm being utilized.
● Wrapper methods are able to meet this need, since they assess a group of variables using the same classification or regression algorithm used to predict the value of the target variable.
● Each time, the algorithm uses a different subset of attributes for learning, identified by a search procedure that works over the set of all possible combinations of variables, and it selects the set of attributes that guarantees the best result in terms of accuracy.
● Wrapper methods are usually burdensome from a computational standpoint, since the assessment of every possible combination identified by the search procedure requires one to deal with the entire training phase of the learning algorithm.
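A hedged sketch of a wrapper approach is shown below, using scikit-learn's SequentialFeatureSelector, which wraps a chosen estimator and evaluates candidate attribute subsets by cross-validated accuracy; the logistic-regression estimator, the breast-cancer dataset, and the target of 5 attributes are assumptions made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# The same classifier that will be used for prediction also drives the attribute search.
estimator = LogisticRegression(max_iter=5000)
selector = SequentialFeatureSelector(
    estimator,
    n_features_to_select=5,   # stop once 5 attributes have been added
    direction="forward",      # greedy forward search (see the search schemes below)
    cv=5, scoring="accuracy",
)
selector.fit(X, y)
print("selected attribute indices:", selector.get_support(indices=True))
```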

Embedded methods
▪ For embedded methods, the attribute selection process lies inside the learning algorithm, so that the selection of the optimal set of attributes is made directly during the phase of model generation. Classification trees are an example of embedded methods (a brief sketch follows this list).
▪ At each tree node, they use an evaluation function that estimates the predictive value of a
single attribute or a linear combination of variables.
▪ In this way, the relevant attributes are automatically selected and they determine the rule
for splitting the records in the corresponding node.
▪ Filter methods are the best choice when dealing with very large datasets, whose
observations are described by a large number of attributes.
▪ In these cases, the application of wrapper methods is inappropriate due to very long
computation times.
▪ Moreover, filter methods are flexible and in principle can be associated with any learning
algorithm.
▪ However, when the size of the problem at hand is moderate, it is preferable to turn to
wrapper or embedded methods which afford in most cases accuracy levels that are higher
compared to filter methods.
▪ As described above, wrapper methods select the attributes according to a search scheme that inspects several subsets of attributes in sequence and applies the learning algorithm to each subset in order to assess the accuracy of the corresponding model. If a dataset contains n attributes, there are 2^n possible subsets, so an exhaustive search procedure would require excessive computation times even for moderate values of n. As a consequence, the procedure for selecting the attributes in wrapper methods is usually of a heuristic nature, based in most cases on a greedy logic which evaluates an adequately defined relevance indicator for each attribute and then selects attributes based on their level of relevance.
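Below is the minimal sketch of an embedded method referred to above: a classification tree selects and weights attributes as part of model generation, and the resulting importances can be read off the fitted model. The dataset and the tree depth are assumptions made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Attribute selection happens inside model generation: at each node the tree picks
# the single attribute (and split point) with the best evaluation score.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

names = load_breast_cancer().feature_names
for idx in tree.feature_importances_.argsort()[::-1][:5]:
    print(f"{names[idx]:<25s} importance = {tree.feature_importances_[idx]:.3f}")
```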

In particular, three distinct myopic search schemes can be followed: forward, backward and
forward–backward search.
Forward. According to the forward search scheme, also referred to as bottom-up
search, the exploration starts with an empty set of attributes and subsequently
introduces the attributes one at a time based on the ranking induced by the
relevance indicator. The algorithm stops when the relevance index of all the
attributes still excluded is lower than a prefixed threshold.
Backward. The backward search scheme, also referred to as top-down search,
begins the exploration by selecting all the attributes and then eliminates them
one at a time based on the preferred relevance indicator. The algorithm stops
when the relevance index of all the attributes still included in the model is
higher than a prefixed threshold.
Forward–backward. The forward–backward method represents a trade-off
between the previous schemes, in the sense that at each step the best attribute
among those excluded is introduced and the worst attribute among those
included is eliminated. Also in this case, threshold values for the included
and excluded attributes determine the stopping criterion.
The various wrapper methods differ in the choice of the relevance measure as well as the preset threshold values for the stopping rule of the algorithm.
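The sketch below illustrates the forward (bottom-up) scheme with a threshold-based stopping rule, using single-attribute cross-validated accuracy as a hypothetical relevance indicator; the estimator, dataset, and threshold value are all assumptions made for the illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
estimator = LogisticRegression(max_iter=5000)

# Relevance indicator: cross-validated accuracy of a model trained on a single attribute.
relevance = np.array([
    cross_val_score(estimator, X[:, [j]], y, cv=5).mean() for j in range(X.shape[1])
])

threshold = 0.90            # prefixed threshold for the stopping rule
selected = []
for j in relevance.argsort()[::-1]:      # attributes ranked by relevance
    if relevance[j] < threshold:         # stop: all still-excluded attributes fall below the threshold
        break
    selected.append(j)

print("attributes introduced by forward search:", selected)
```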

Principal Component Analysis (PCA)
