Data Transformation & Reduction
● Data Validation
● Data Transformation:
○ Standardization or Normalization
○ Feature Extraction
● Data Reduction
○ Sampling
○ Feature Selection
○ Principal Component Analysis (PCA)
○ Data Discretization
Data Transformation: Data transformation is one of the fundamental steps of data preprocessing.
Data Standardization
Data standardization is the process of converting data to a common format to enable users to
process and analyse it.
Most organizations utilize data from several sources; this can include data warehouses,
data lakes, cloud storage, and databases.
It involves converting that data into a uniform format, with logical and consistent
definitions.
These definitions will form your metadata — the labels that identify the what, how,
why, who, when, and where of your data.
That is the basis of your data standardization process.
From an accuracy perspective, standardizing the way you label data will improve access
to the most relevant and current information.
This will help make your analytics and reporting easier.
Data standardization means your data is internally consistent — each of your data sources has
the same format and labels. When your data is neatly organized with logical descriptions and
labels, everyone in your organization can understand it and put it to use.
Importance of Data Standardization
A lack of standardization results in low-quality data. Companies lose 12% of their potential
revenue, on average, due to bad data. Non-standardized data can:
● Harm reporting and forecasting by making your data difficult to search and filter.
Standardization makes data queries and filtering easier and more predictable.
● Cause embarrassing marketing automation errors that harm your reputation. Ever sent an
email to “jane” instead of “Jane”? Or worse yet, referred to someone as “{First Name}”?
These small errors break the veil of one-to-one personalization and harm your company’s
reputation.
● Inflate your marketing costs. Inconsistent data can cause you to send bad emails,
send direct mail materials to the wrong address, and ultimately cause a lot of waste in
your marketing budget.
● Break integrations with important software. For example, inconsistent phone numbers
could cause problems with a sales team’s auto-dialer software, leading to missed
opportunities.
● Slow down data cleansing processes. Inconsistent data gives rise to many different types
of issues. These can be difficult to identify in the CRM or in Excel with VLOOKUP, and
they slow down your teams because they require by-hand editing to rectify.
Steps of Data Standardization in CRM
Like any human-input data, raw data usually contains errors. In general, you want to
make sure that your CRM data is:
⮚ Correct
⮚ Clean
⮚ Complete
⮚ Properly formatted (standardized)
⮚ Verified
2. Remove Clutter from Your Database
⮚ Duplicate data is a big problem because it breaks the single customer view that your
marketing, sales, support, and success teams rely on to evaluate their engagements
with prospects and customers. It splits the context of those interactions between two
or more records. It also leads to more of the embarrassing mistakes that harm a
company’s reputation — such as emailing or mailing marketing messages twice to the
same customer or prospect.
⮚ Irrelevant data. Unnecessary data that takes up vital storage space within your
CRM.
⮚ Redundant data. Data contained in two fields (or across multiple records) that are
trying to convey the same thing. For instance, “Location” and “City” may convey the
same data in two separate fields, taking up space and leading to confusion when your
team goes to use the data.
⮚ Inaccurate data. Standardizing your data does not do much good if you are
standardizing inaccurate data. Make sure that you have data verification or
enrichment plans in place to ensure that the data you are collecting is accurate.
⮚ Low quality data. Data that is non-personalized or generally low quality. This can
include organizational emails like info@domain.com, sales@domain.com, or free
email accounts for B2B companies. Another example would be emails that have
bounced when previously mailed.
Your CRM data collection methods are the gears that are responsible for
feeding new data into your CRM. Most data errors and issues will start there.
Taking steps to rectify those issues before they take place can limit a lot of
standardization problems.
⮚ Integrations
▪ Of course, integrations are all around great for businesses. They ensure that
our critical software solutions are talking to each other and sharing data where
available. However, integrations are another common source of data issues.
▪ One piece of software may categorize data differently or use different fields to describe
the same information. A Salesforce and HubSpot integration is a common need for
companies that are looking to connect and sync their sales and marketing
operations. However, it can lead to several common problems including the
creation of duplicate contacts and companies, complicated account hierarchies
that lead to errors, and issues with sync timing.
▪ Integrations simply add too much value to avoid, but you can be certain that
nearly any integration between two separate systems will cause some
unintended data consequences that you will have to figure out how to deal
with.
4. Define Your CRM Data Standards
To standardize your data, you must have your standards defined. These standards are
defined by a set of rules for each field.
For instance, some of the different standards that are commonly set for CRM field data
include:
⮚ First names should be capitalized and contain no spaces, numbers, or extra characters.
You may also want to remove middle names and titles like Mr. and Ms. so that you
can use the first name in campaigns without worrying about awkward outcomes.
⮚ Phone number formatting should be specific and consistent, like 123-456-7890, as
opposed to (123)-456-7890, 123.456.7890, or 1-(123)456-7890. It is best to use an
international standard like E.164 that is uniform and supported by most software and
hardware products.
⮚ States should be expressed using a consistent convention, either verbose, such as
“Washington”, or abbreviated, such as “WA”. The key is to ensure consistency across
all the systems and apps that you use.
⮚ Emails should follow the standard email format of “name@domain.com.” You may
consider limiting free email services like Gmail and Hotmail, or at least have a way to
identify and filter for free email domains so that you can treat them as a segment.
Limiting data sources can help to improve quality and make data standardization
easier.
⮚ Website URLs should include the full “http://www.” and not just “sitedomain.com,”
with a separate field used to store just the domain name “sitedomain.com”.
⮚ Job titles should be standardized to popular acronyms, such as “CEO” instead of “Chief
Executive Officer.” You can choose either option; the key is to maintain consistency.
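The field standards above can be sketched in code. This is a minimal, illustrative sketch: the helper names, the title-stripping rule, and the small state-abbreviation table are all assumptions for the example, not a complete implementation of E.164 or of any CRM's validation rules.

```python
import re

def standardize_first_name(name):
    """Capitalize the first name; strip titles, middle names, stray characters."""
    # Remove a leading title such as Mr./Ms./Mrs./Dr. (illustrative rule).
    name = re.sub(r"^(mr|ms|mrs|dr)\.?\s+", "", name.strip(), flags=re.IGNORECASE)
    parts = name.split()
    first = re.sub(r"[^A-Za-z\-']", "", parts[0]) if parts else ""
    return first.capitalize()

def standardize_phone(raw, country_code="1"):
    """Normalize a US-style number toward E.164, e.g. +11234567890."""
    digits = re.sub(r"\D", "", raw)        # keep digits only
    if len(digits) == 10:                  # missing country code
        digits = country_code + digits
    return "+" + digits

def standardize_state(value, abbreviations={"washington": "WA", "texas": "TX"}):
    """Map verbose state names to a consistent two-letter convention."""
    return abbreviations.get(value.strip().lower(), value.strip().upper())

print(standardize_first_name("mr. JANE ann"))   # -> Jane
print(standardize_phone("(123) 456-7890"))      # -> +11234567890
print(standardize_state("Washington"))          # -> WA
```

In practice these rules would run as form validation before a record enters the CRM, which is cheaper than fixing records afterwards.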
5. Standardize Data
▪ With standards in place, you can begin the process of fixing standardization issues
throughout your customer database.
▪ For smaller companies, using Excel functions and VLOOKUP might be powerful enough
to fix most of your standardization issues. Larger companies with more CRM records
may need to invest in a third-party solution to manage data at scale on a continuous basis.
▪ Implement proper form validation and data cleaning processes to ensure that data is
standardized and clean when it hits your CRM. The best way to keep a customer database
clean is to make sure that bad data never makes its way into it in the first place!
▪ Try to keep your expectations in check. In large databases, it is impossible to avoid
quality and standardization issues at some level. There will always be errors that you’ll
need to fix manually. But by making standardization and proper data collection a priority,
you’ll greatly reduce the amount of time that you have to spend dealing with these issues.
Standardization Techniques:
a) Min-max normalization: re-calculates each variable as
V' = (V - min V) / (max V - min V),
where V represents the value of the variable in the original data set.
∙ This method allows variables to have differing means and standard deviations but
equal ranges.
∙ In this case, there is at least one observed value at the 0 and 1 endpoints.
b) Dividing each value by the range: re-calculates each variable as
V' = V / (max V - min V).
∙ In this case, the means, variances, and ranges of the variables are still different, but at
least the ranges are likely to be more similar.
c) z-score standardization: re-calculates each variable as
V' = (V - mean(V)) / std(V).
∙ As a result, all variables in the data set have equal means (0) and standard deviations (1)
but different ranges.
d) Dividing each value by the standard deviation: re-calculates each variable as
V' = V / std(V).
∙ This method produces a set of transformed variables with variances of 1, but different
means and ranges.
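The four rescaling techniques above can be sketched column-wise with NumPy; the small example matrix is made up for illustration.

```python
import numpy as np

# Two variables with very different scales, one per column.
V = np.array([[2.0, 200.0],
              [4.0, 400.0],
              [6.0, 800.0]])

# (a) Min-max normalization: maps each variable onto [0, 1].
minmax = (V - V.min(axis=0)) / (V.max(axis=0) - V.min(axis=0))

# (b) Dividing by the range: ranges become similar; means and variances still differ.
by_range = V / (V.max(axis=0) - V.min(axis=0))

# (c) z-score: every variable gets mean 0 and standard deviation 1.
zscore = (V - V.mean(axis=0)) / V.std(axis=0)

# (d) Dividing by the standard deviation: variance 1, but different means and ranges.
by_std = V / V.std(axis=0)

print(minmax[:, 0])   # [0.  0.5 1. ]
```

Note that (a) and (c) shift the data as well as rescale it, while (b) and (d) only rescale, which is exactly why the means differ in the latter two.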
Feature Extraction
Feature Extraction aims to reduce the number of features in a dataset by creating new
features from the existing ones (and then discarding the original features).
This new, reduced set of features should then be able to summarize most of the
information contained in the original set of features.
In this way, a summarised version of the original features can be created from a
combination of the original set.
Feature Extraction is basically a process of dimensionality reduction where the raw
data obtained is separated into related manageable groups.
A distinctive feature of large datasets is that they contain a large number of
variables, and processing these variables requires a lot of computing resources.
Hence Feature Extraction can be useful in this case in selecting particular variables
and also combining some of the related variables which in a way would reduce the
amount of data.
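A minimal sketch of feature extraction via PCA, using a plain SVD on centred data: the new features are linear combinations of the original variables, and the original columns are then discarded. The data here is randomly generated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # 100 observations, 5 original features

# Centre each variable, then take the singular value decomposition.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Keep only the top-2 principal directions: 2 new features that are
# combinations of the 5 original ones.
k = 2
X_reduced = Xc @ Vt[:k].T

print(X_reduced.shape)   # (100, 2)
```

The rows of `Vt` are the principal directions ordered by explained variance, so truncating to the first `k` keeps the combinations that summarize the most information.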
Feature Generation
∙ PCA fails when the data is non-linear, which can be considered one of the
biggest disadvantages of PCA.
∙ Bag of Words: This is the most widely used technique in the field of
Natural Language Processing. Here, firstly sentences are tokenized
and stop words are removed. After that, the words are individually
classified into the frequency of use.
∙ Image Processing: Image processing is one of the most explorative
domains where feature extraction is widely used. Since images
represent different features or attributes such as shapes, hues, motion
in the case of digital images thus processing them is of utmost
importance so that only specified features are extracted. The image
processing also makes use of many algorithms in addition to feature
extraction.
∙ Auto-encoders:
❖ Autoencoders are a family of Machine Learning algorithms that can be
used as a dimensionality reduction technique.
❖ The main difference between Autoencoders and other dimensionality
reduction techniques such as PCA is that PCA tries to reduce the input
data using a linear transformation (therefore giving us a result limited to
linear combinations of the original features), whereas autoencoders can
also learn non-linear transformations of the data.
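The Bag of Words technique described above can be sketched in a few lines: tokenize, remove stop words, then count word frequencies. The stop-word list here is a tiny illustrative subset, not a complete one.

```python
from collections import Counter

# Illustrative stop-word subset; real NLP pipelines use much larger lists.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to"}

def bag_of_words(sentence):
    """Tokenize a sentence, drop stop words, and count word frequencies."""
    tokens = sentence.lower().split()
    tokens = [t.strip(".,!?") for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)

print(bag_of_words("The cat and the dog chased the cat"))
# Counter({'cat': 2, 'dog': 1, 'chased': 1})
```

The resulting counts form the feature vector for the sentence; word order is discarded, which is exactly what "bag" of words means.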
Data Reduction
∙ When dealing with a small dataset, the transformations described above are usually
adequate to prepare input data for a data mining analysis.
∙ However, when facing a large dataset it is also appropriate to reduce its size, in order to
make learning algorithms more efficient, without sacrificing the quality of the results
obtained.
There are three main criteria to determine whether a data reduction technique should be used:
efficiency, accuracy and simplicity of the models generated.
Efficiency
● The application of learning algorithms to a dataset smaller than the original one usually
means a shorter computation time.
● If the complexity of the algorithm grows superlinearly with the size of the dataset, as is
the case for most known methods, the improvement in efficiency resulting from a
reduction in the dataset size may be dramatic.
● Within the data mining process it is customary to run several alternative learning
algorithms in order to identify the most accurate model.
● Therefore, a reduction in processing times allows the analyses to be carried out more
quickly.
Accuracy
In most applications, the accuracy of the models generated represents a critical success factor,
and it is therefore the main criterion followed in order to select one class of learning methods
over another.
Data reduction techniques should not significantly compromise the accuracy of the model
generated.
It may also be the case that some data reduction techniques, based on attribute selection, will
lead to models with a higher generalization capability on future records.
Simplicity
In some data mining applications, concerned more with interpretation than with prediction, it is
important that the models generated be easily translated into simple rules that can be understood
by experts in the application domain.
As a trade-off for achieving simpler rules, decision makers are sometimes willing to allow a
slight decrease in accuracy.
Data reduction often represents an effective technique for deriving models that are more easily
interpretable.
Since it is difficult to develop a data reduction technique that represents the optimal solution for
all the criteria described, the analyst will aim for a suitable trade-off among all the requirements
outlined.
Sampling
● A further reduction in the size of the original dataset can be achieved by extracting a
sample of observations that is significant from a statistical standpoint.
● This type of reduction is based on classical inferential reasoning.
● It is therefore necessary to determine the size of the sample that guarantees the level of
accuracy required by the subsequent learning algorithms and to define an adequate
sampling procedure.
● Sampling may be simple or stratified depending on whether one wishes to preserve in the
sample the percentages of the original dataset with respect to a categorical attribute that is
considered critical.
● Generally speaking, a sample comprising a few thousand observations is adequate to train
most learning models.
● It is also useful to set up several independent samples, each of a predetermined size, to
which learning algorithms should be applied.
● In this way, computation times increase linearly with the number of samples determined,
and it is possible to compare the different models generated, in order to assess the
robustness of each model and the quality of the knowledge extracted from data against
the random fluctuations existing in the sample.
● It is obvious that the conclusions obtained can be regarded as robust when the models and
the rules generated remain relatively stable as the sample set used for training varies.
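The stratified variant described above can be sketched as follows: the sample preserves the percentages of the original dataset with respect to a critical categorical attribute. The record layout and function name are illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=42):
    """Sample `fraction` of the records from each stratum of `key`."""
    random.seed(seed)
    strata = defaultdict(list)
    for r in records:                       # group records by the attribute
        strata[r[key]].append(r)
    sample = []
    for group in strata.values():           # sample each stratum proportionally
        n = max(1, round(len(group) * fraction))
        sample.extend(random.sample(group, n))
    return sample

# 80 records in segment A, 20 in segment B.
data = [{"id": i, "segment": "A" if i < 80 else "B"} for i in range(100)]
sample = stratified_sample(data, key="segment", fraction=0.1)
print(len(sample))   # 10 (8 from segment A, 2 from segment B)
```

Simple sampling would draw the 10 records from the whole dataset at once, so the 80/20 split would be preserved only in expectation, not exactly.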
Feature selection
▪ The purpose of feature selection, also called feature reduction, is to eliminate from the
dataset a subset of variables which are not deemed relevant for the purpose of the data
mining activities.
▪ One of the most critical aspects of a learning process is the choice of the combination of
predictive variables best suited to accurately explain the investigated phenomenon.
▪ Feature reduction has several potential advantages.
▪ Due to the presence of fewer columns, learning algorithms can be run more quickly
on the reduced dataset than on the original one.
▪ Moreover, the models generated after the elimination from the dataset of un-influential
attributes are often more accurate and easier to understand.
Feature selection methods can be classified into three main categories: filter
methods, wrapper methods and embedded methods.
Filter methods
● Filter methods select the relevant attributes before moving on to the subsequent learning
phase, and are therefore independent of the specific algorithm being used.
● The attributes deemed most significant are selected for learning, while the rest are
excluded.
● Several alternative statistical metrics have been proposed to assess the predictive
capability and relevance of a group of attributes.
● Generally, these are monotone metrics, whose value increases or decreases with the
number of attributes considered.
● The simplest filter method to apply for supervised learning involves the assessment of
each single attribute based on its level of correlation with the target.
● Consequently, this leads to the selection of the attributes that appear mostly correlated
with the target.
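The simple correlation-based filter described above can be sketched as follows: each attribute is scored by the absolute value of its correlation with the target, independently of any learning algorithm, and the top-scoring attributes are kept. The synthetic data and function name are assumptions for the example.

```python
import numpy as np

def correlation_filter(X, y, k):
    """Rank attributes by |correlation with the target| and keep the top k."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    ranking = np.argsort(scores)[::-1]      # most correlated first
    return ranking[:k]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
# Make the target depend almost entirely on attribute 2.
y = 3 * X[:, 2] + rng.normal(scale=0.1, size=200)

print(correlation_filter(X, y, k=1))   # [2]
```

Because the scoring never invokes a learning algorithm, the same selection can be reused with any model, which is the defining property of filter methods.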
Wrapper methods
▪ Wrapper methods select the attributes by means of the learning algorithm itself: a search
scheme inspects several subsets of attributes in sequence and applies the learning
algorithm to each subset in order to assess the accuracy of the corresponding model.
Embedded methods
▪ For the embedded methods, the attribute selection process lies inside the learning
algorithm, so that the selection of the optimal set of attributes is directly made during the
phase of model generation. Classification trees are an example of embedded methods.
▪ At each tree node, they use an evaluation function that estimates the predictive value of a
single attribute or a linear combination of variables.
▪ In this way, the relevant attributes are automatically selected and they determine the rule
for splitting the records in the corresponding node.
▪ Filter methods are the best choice when dealing with very large datasets, whose
observations are described by a large number of attributes.
▪ In these cases, the application of wrapper methods is inappropriate due to very long
computation times.
▪ Moreover, filter methods are flexible and in principle can be associated with any learning
algorithm.
▪ However, when the size of the problem at hand is moderate, it is preferable to turn to
wrapper or embedded methods which afford in most cases accuracy levels that are higher
compared to filter methods.
▪ As described above, wrapper methods select the attributes according to a search scheme
that inspects in sequence several subsets of attributes and applies the learning algorithm
to each subset in order to assess the resulting accuracy of the corresponding model.
▪ If a dataset contains n attributes, there are 2^n possible subsets, and therefore an
exhaustive search procedure would require excessive computation times even for
moderate values of n.
▪ As a consequence, the procedure for selecting the attributes for wrapper methods is
usually of a heuristic nature, based in most cases on a greedy logic which evaluates a
suitably defined relevance indicator for each attribute and then selects the attributes
based on their level of relevance.
In particular, three distinct myopic search schemes can be followed: forward, backward and
forward–backward search.
Forward. According to the forward search scheme, also referred to as bottom-up
search, the exploration starts with an empty set of attributes and subsequently
introduces the attributes one at a time based on the ranking induced by the
relevance indicator. The algorithm stops when the relevance index of all the
attributes still excluded is lower than a prefixed threshold.
Backward. The backward search scheme, also referred to as top-down search,
begins the exploration by selecting all the attributes and then eliminates them
one at a time based on the preferred relevance indicator. The algorithm stops
when the relevance index of all the attributes still included in the model is
higher than a prefixed threshold.
Forward–backward. The forward–backward method represents a trade-off
between the previous schemes, in the sense that at each step the best attribute
among those excluded is introduced and the worst attribute among those
included is eliminated. Also in this case, threshold values for the included
and excluded attributes determine the stopping criterion.
The various wrapper methods differ in the choice of the relevance measure
as well as in the threshold values preset for the stopping rule of the
algorithm.
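The forward (bottom-up) scheme above can be sketched with a greedy loop. As a simplifying assumption, the relevance indicator here is the absolute correlation with the target; a full wrapper method would instead retrain the model on each candidate subset.

```python
import numpy as np

def forward_search(X, y, threshold):
    """Greedy forward search: add attributes by decreasing relevance,
    stopping when every remaining attribute scores below the threshold."""
    relevance = {j: abs(np.corrcoef(X[:, j], y)[0, 1])
                 for j in range(X.shape[1])}
    selected, remaining = [], set(relevance)
    while remaining:
        best = max(remaining, key=lambda j: relevance[j])
        if relevance[best] < threshold:     # prefixed stopping criterion
            break
        selected.append(best)               # introduce the best attribute
        remaining.remove(best)
    return selected

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
# Target driven by attributes 0 and 3 only.
y = 2 * X[:, 0] - X[:, 3] + rng.normal(scale=0.1, size=300)

print(forward_search(X, y, threshold=0.3))   # [0, 3]
```

The backward scheme is the mirror image (start with all attributes and eliminate the worst one at each step), and the forward-backward scheme interleaves the two moves.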