
CRISP-DM >>>--------------------------------------------------------->>> CRISP-ML(Q)

CRISP-DM (Cross-Industry Standard Process for Data Mining) is the earlier methodology; currently we work with CRISP-ML(Q) (CRISP for Machine Learning with Quality assurance).

PAST – DATA ANALYST (works in Tableau & BI)

PRESENT & FUTURE – DATA SCIENTIST (should also have the skills of a Data Analyst)

FUTURE – ALGORITHM / DATA ENGINEERS (who suggest which algorithm to use)

Stages of Analytics –
Descriptive – “What happened?” Past <<-------- Data Analyst
Diagnostic – “Why did it happen?” Past <<-------- Data Analyst
Predictive – “What will happen?” Future <<-------- Data Scientist
Prescriptive – “How can we make it happen?” Future <<-------- Data Scientist

PROJECT MANAGEMENT METHODOLOGY

CRISP-ML(Q) – Cross-Industry Standard Process for Machine Learning with Quality assurance.
The CRISP-ML(Q) process model describes 6 phases –

 Business & Data Understanding
 Data Preparation (Data Engineering)
 Model Building & Tuning
 Model Evaluation (Testing & Evaluation)
 Deployment
 Monitoring & Maintenance

1a. Understand the Business Problem & Create a Project Charter

What are the optimization terms?
Objective (Maximize or Minimize) & Constraints.
Each objective & constraint should be stated briefly – in at most 5-6 words, and in as few as 2-3.

1b. Data Understanding

 Continuous vs Discrete data –


Key characteristics of discrete data
Discrete data is often used in simple statistical analysis because it is easy to summarize and compute. Let's look at some of the other key characteristics of discrete data.

- Discrete data consists of discrete variables that are finite, numeric, countable, non-negative integers (5, 10, 15, and so on).
- Discrete data can be easily visualized and demonstrated using simple statistical methods such as bar charts, line charts, or pie charts.
- Discrete data can also be categorical – containing a finite number of data values, such as the gender of a person.
- Discrete data is distributed discretely in terms of time and space. Discrete distributions make analyzing discrete values more practical.

Key characteristics of continuous data

Unlike discrete data, continuous data can be either numeric or distributed over date and time. This data type uses advanced statistical analysis methods that take into account the infinite number of possible values. Key characteristics of continuous data are:

- Continuous data changes over time and can have different values at different time intervals.
- Continuous data is made up of random variables, which may or may not be whole numbers.
- Continuous data is analyzed using methods such as line graphs, skew measures, and so on.
- Regression analysis is one of the most common types of continuous data analysis.

Continuous vs Discrete

Continuous:
1. Continuous data falls on a continuous sequence. Any numeric data for which decimal values make sense is continuous data.
2. It is measurable.
3. It can take any value in some interval.
4. Its tabulation is known as a grouped frequency distribution.
5. A graph of a continuous function shows the points connected with an unbroken line.
6. It includes any value within a preferred range.
7. Graphical representation – histogram or line graph.
8. e.g., the market price of a product, the weight of newborn babies, daily wind speeds, the temperature of a freezer (83.6–99.9 degrees).

Discrete:
1. Discrete data has clear spaces between its values. Numeric data for which decimal values make no sense is discrete data.
2. It is countable.
3. It can take only specific (distinct or separate) values.
4. Its tabulation is known as an ungrouped frequency distribution.
5. A graph of a discrete function shows distinct points that remain unconnected.
6. It contains distinct or separate values.
7. Graphical representation – bar graph.
8. e.g., days of the week, the number of customers who bought different items, the number of computers in each department, the number of items you buy at the grocery store each week.
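The contrast above can be sketched in plain Python (the sample values are made up, echoing the grocery-store and newborn-weight examples from the notes):

```python
# Discrete: countable, whole-number values with clear gaps between them.
items_bought_per_week = [3, 5, 2, 7, 5, 3]       # counts at a grocery store
computers_per_department = [12, 8, 20, 15]       # counts per department

# Continuous: measurable, can take any value within an interval.
baby_weights_kg = [3.21, 2.87, 3.405, 2.995]     # weights of newborns
freezer_temps = [83.6, 91.25, 99.9, 87.04]       # readings within a range

# Discrete values are whole numbers; continuous values have meaningful
# fractional parts ("2.5 computers" makes no sense, 2.5 kg does).
assert all(isinstance(v, int) for v in items_bought_per_week)
assert any(v != int(v) for v in baby_weights_kg)
```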
Examples of discrete data
Discrete data can also be qualitative. The nationality you select on a form is a piece of discrete data. The
nationalities of everyone in your workplace, when grouped, can be valuable information in evaluating your
hiring practices.

The national census consists of discrete data, both qualitative and quantitative. Counting and collecting
this identifying information deepens our understanding of the population. It helps us make predictions
while documenting history. This is a great example of discrete data's power.

Examples of continuous data


When you think of experiments or studies involving constant measurements, they're likely to be
continuous variables to some extent. If you have a number like “2.86290” anywhere on a spreadsheet, it's
not a number you could have quickly arrived at yourself — think measurement devices like stopwatches,
scales, thermometers, and the like.

A task involving these tools probably applies to continuous data. For example, if we’re clocking every runner in the Olympics, the times will fall along a continuous line on a graph. Although our athletes get faster and stronger over the years, there should never be an outlier that skews the rest of the data. Even Usain Bolt is only a few seconds faster than the historical field when it comes down to it.

There are infinite possibilities along this line (for example, 5.77 seconds, 5.772 seconds, 5.7699 seconds,
etc.), but every new measurement is always somewhere within the range.

Not every example of continuous data falls neatly into a straight line. Still, over time a range becomes more
apparent, and you can bet on new data points sticking inside those parameters.

What is Nominal, Ordinal, Interval and Ratio Scales?


(Data type – scale of measurement)
Nominal, Ordinal, Interval, and Ratio are defined as the four fundamental levels of measurement scales
that are used to capture data in the form of surveys and questionnaires, each being a multiple-choice
question.

Discrete Data
Nominal variable (categorical) (least preferred)
- Data can be put into categories.
- These are variables with no numeric value.
- They cannot be assigned any order.
- They cannot be quantified, i.e., you cannot perform arithmetic operations on them (like addition or subtraction), or logical operations (like equal to or greater than).

Ordinal scale
- It classifies according to rank.
- It has all its variables in a specific order, beyond just naming them.

A major disadvantage of using the ordinal scale over other scales is that the distance between measurements is not always equal. If you have a list of numbers like 1, 2 and 3, you know that the distance between the numbers is exactly 1. But if you had “very satisfied”, “satisfied” and “neutral”, there’s nothing to say whether the difference between the three ordinal variables is equal. In a ranked list of five movies, for example, there may be a small difference in my preference for Jaws or Children of Men, but a huge difference between Children of Men (which I enjoyed…twice!) and The Sound of Music (which I do not like at all). This inability to tell how much lies between each variable is one reason why other scales of measurement are usually preferred in statistics.

Continuous Data
Interval scale
- It has values at equal intervals that mean something (e.g., a thermometer might have intervals of 10 degrees).
- It offers labels and order, as well as a specific interval between each of its variable options.

Ratio scale – (most preferred data type)

- It is exactly the same as the interval scale,
- except that zero on the scale means the attribute doesn’t exist (a true zero point).

Examples by scale:

Nominal – Gender; Color; Country; Type of house / accommodation; Genotype (AA, Aa or aa); Religious preference, etc.

Ordinal – High school class ranking (1st, 9th, 87th…); Socioeconomic status (poor, middle class, rich); the Likert scale (strongly disagree, disagree, neutral, agree, strongly agree); Level of agreement (yes, maybe, no); Time of day (dawn, morning, noon, afternoon, evening, night); Political orientation (left, center, right); Military rank; Clothing size (small, medium, large), etc.

Interval – Temperature; IQ rankings; SAT scores; Time on a clock with hands.

Ratio – Age; Weight; Height; Sales figures; Ruler measurements; Income earned in a week; Years of education; Number of children.
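The practical difference between the scales can be sketched in plain Python (the values are illustrative):

```python
# Nominal: labels support only equality checks, no ordering.
genotypes = {"AA", "Aa", "aa"}
assert "Aa" in genotypes

# Ordinal: labels carry an order, so comparisons are meaningful,
# but the gaps between ranks are not guaranteed to be equal.
likert = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
rank = {label: i for i, label in enumerate(likert)}
assert rank["agree"] > rank["neutral"]

# Ratio: a true zero exists, so ratios are meaningful.
height_cm = [150.0, 300.0]
assert height_cm[1] / height_cm[0] == 2.0   # "twice as tall" is a valid statement

# Interval (e.g. Celsius temperature) has no true zero: the 10-degree gap
# is meaningful, but 20 degrees C is NOT "twice as hot" as 10 degrees C.
```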

 Qualitative vs Quantitative data

Qualitative Data:
1. This type of data analysis is based on human understanding – how people think & feel.
2. Qualitative data is text-based.
3. It is collected using interviews & observation.
4. It is analyzed by grouping the data into meaningful themes & categories.
5. Qualitative data is subjective & dynamic.
6. e.g., My best friend has curly brown hair. They have green eyes. They have a friendly face & a contagious laugh. They can also be quite impatient & impulsive at times.

Quantitative Data:
1. This type of data analysis is based on numerical information & facts, using mathematical logic & techniques.
2. Quantitative data is number-based.
3. It is collected using surveys, measuring & counting.
4. It is analyzed using statistical analysis.
5. Quantitative data is fixed & universal.
6. e.g., My best friend is 5 feet 7 inches tall. They wear size 6 shoes. My best friend has one older sibling & two younger siblings. They go swimming 4 times a week.

 Structured vs Semi-structured vs Unstructured Data

Structured Data:
1. Data with a high degree of predefined organization.
2. Data in a spreadsheet (Excel) or tabular format.
3. e.g., formats – Excel sheets, comma-separated values files (.csv), etc.

Semi-Structured Data:
1. Data with some degree of predefined organization & structure.
2. Data in a text file that has some structure (headers, paragraphs, etc.).
3. e.g., formats – HTML, XML, etc.

Unstructured Data:
1. Data with no predefined organizational form & no specific format.
2. Data that is neither structured nor semi-structured.
3. e.g., formats – images, videos, Word files, PDF files.
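A small stdlib sketch of the difference (the records are made up; JSON is used here as the semi-structured example alongside the HTML/XML formats named above):

```python
import csv
import io
import json

# Structured: tabular CSV with a fixed schema - every row has the same columns.
csv_text = "name,age\nAsha,34\nRavi,29\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
assert rows[0]["name"] == "Asha" and rows[1]["age"] == "29"

# Semi-structured: tagged/keyed, but no rigid tabular schema -
# different records may carry different fields.
json_text = '[{"name": "Asha", "age": 34}, {"name": "Ravi", "city": "Pune"}]'
records = json.loads(json_text)
assert "city" in records[1] and "city" not in records[0]
```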

 Big Data vs Non-Big Data

Big Data – any kind of data that poses two problems – a computational burden & a storage burden – is called big data.
To deal with the storage problem we use Hadoop.
To deal with the computational problem we use Spark.

Non-Big Data – data which is not big data, i.e., data that does not pose computational & storage burdens.

 Cross-sectional vs Time Series Data

Cross-sectional Data:
1. Observations coming from different individuals or groups at a single point in time.
2. Focuses on several variables at the same point in time.
3. e.g., the maximum temperature of several cities on a single day; the closing prices of a group of 20 different stocks on December 15, 1986.
4. Day, time & sequence do not matter.

Time Series Data:
1. A set of observations collected at (usually equally spaced) time intervals.
2. Focuses on the same variable over a period of time.
3. e.g., the profit of an organization over a period of time; the daily closing price of a certain stock recorded over the last 6 weeks.
4. Day, time & sequence matter.

Longitudinal data = Cross-sectional data + Time series data
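The two shapes of data can be sketched in plain Python (cities, dates, and numbers are made up):

```python
# Cross-sectional: several subjects observed at a single point in time.
max_temp_single_day = {"Mumbai": 33.1, "Delhi": 39.4, "Chennai": 35.8}

# Time series: the same variable observed at equally spaced time intervals.
daily_closing_price = [
    ("2024-01-01", 101.2),
    ("2024-01-02", 100.7),
    ("2024-01-03", 102.9),
]

# Sequence matters for the time series but not for the cross-section:
# the dict has no meaningful order, while the date order must be preserved.
dates = [d for d, _ in daily_closing_price]
assert dates == sorted(dates)
assert set(max_temp_single_day) == {"Mumbai", "Delhi", "Chennai"}
```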

 Online vs Offline Data

DATA COLLECTION –
e.g., Jio wants to launch a 5G tariff for villages in India.

1st approach ------------------- Take existing (online) data from another operator such as Vodafone. You can't use it as-is; you need to develop a data set according to your own needs.
For villages, we need to look at pricing & whether the people living there are employable, and whether they are high or low earners.
Use Google Maps to see how many people are farming & create a new data set (researching what pricing villagers can afford).
To know whether farmers are high or low earners, approach a drone company for data (data you buy is called syndicated data) to check which type of crops they are growing (you can evaluate which crop's pricing gives profit to whom).
The data collected from Vodafone & the drone company may not have exactly the information we need, but collecting it is neither time-consuming nor costly.

2nd approach ----------------- Hire 20 people with feedback forms in a village with a population of 10,000. This gets exactly the information we need, but the whole process is costly & time-consuming.

Time of 1st transaction – 10:00 am – X2
Place of 2nd transaction – Bengaluru – X3
Time of 2nd transaction – 10:05 am – X4
X is the input (I/P) ------------------------------------ Y is the output (O/P)

The simpler the model, the better it is (if we have a complex equation, make it simpler). Two principles express this:
1. The Principle of Parsimony
2. Occam's Razor
3. DATA MINING / MACHINE LEARNING
Supervised vs Unsupervised Learning

Supervised Learning (Predictive learning):
1. Algorithms are trained using labelled data.
2. An SL model predicts an output.
3. In SL, input data is provided along with the output.
4. SL is not close to true AI; we have to first train the model on each data set, and only then can it predict the output.
5. Algorithms include – Linear Regression, Logistic Regression, Multiclass Classification, Decision Trees, etc.

Unsupervised Learning (Descriptive learning):
1. Algorithms are trained using unlabelled data.
2. A USL model finds hidden patterns in the data.
3. In USL, only input data is provided to the model.
4. USL is closer to AI, as it learns in the way a child learns daily-routine things from experience.
5. Algorithms include – Clustering (e.g., K-means), the Apriori algorithm, etc.
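The contrast can be sketched with NumPy (the data and the tiny hand-rolled 1-D 2-means loop are illustrative, not from the notes):

```python
import numpy as np

# Supervised: labelled pairs (X, y); the model learns to predict y.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * X + 1.0                       # labels provided along with the inputs
slope, intercept = np.polyfit(X, y, deg=1)
assert abs(slope - 2.0) < 1e-6 and abs(intercept - 1.0) < 1e-6

# Unsupervised: only inputs; the algorithm finds hidden structure
# (here, a minimal 1-D 2-means loop separating two obvious groups).
data = np.array([1.0, 1.2, 0.9, 10.0, 10.3, 9.8])
centroids = np.array([data.min(), data.max()])   # simple initialisation
for _ in range(10):
    labels = np.abs(data[:, None] - centroids[None, :]).argmin(axis=1)
    centroids = np.array([data[labels == k].mean() for k in (0, 1)])

assert set(int(l) for l in labels[:3]) == {0}    # small values form one cluster
assert set(int(l) for l in labels[3:]) == {1}    # large values form the other
```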

Supervised Learning – Split the Data

Compare the training error & testing error –

- If both the training & testing errors are low & close to each other, it is called a right fit.
- If the training error is low & the testing error is high, it is called overfitting (high variance). To fix this problem, each algorithm has its own set of techniques, called regularisation techniques.
- If the training error is high (and the testing error is also high), it is called underfitting (high bias). To fix this problem, transform the data or perform better feature engineering to get more observations or more features.
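A minimal pure-Python sketch of comparing training vs testing error (the numbers and the nearest-neighbour "memorising" model are made up for illustration):

```python
# Noisy training data from roughly y = 2x, plus clean unseen test points.
x_tr, y_tr = [0, 1, 2, 3], [0.5, 1.8, 4.4, 5.9]
x_te, y_te = [0.4, 1.6, 2.6], [0.8, 3.2, 5.2]

def mse(pred, actual):
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)

# "Right fit": a least-squares line through the origin smooths out the noise.
slope = sum(x * y for x, y in zip(x_tr, y_tr)) / sum(x * x for x in x_tr)
right_train = mse([slope * x for x in x_tr], y_tr)
right_test = mse([slope * x for x in x_te], y_te)

# Overfit: memorise the training set (nearest-neighbour lookup). Training
# error is zero, but the memorised noise hurts on unseen data.
def memorised(x):
    i = min(range(len(x_tr)), key=lambda j: abs(x_tr[j] - x))
    return y_tr[i]

over_train = mse([memorised(x) for x in x_tr], y_tr)
over_test = mse([memorised(x) for x in x_te], y_te)

assert over_train == 0.0 and over_train < right_train
assert over_test > right_test   # low train + high test error = overfitting
```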

UNSUPERVISED LEARNING
CLUSTERING / Segmentation – Types – Hierarchical Clustering & K-means Clustering
STP framework – Segmentation --- Targeting --- Positioning

-- Cluster analysis, also called data segmentation, is an exploratory method.
-- It identifies homogeneous groups of records.
-- Similar items should be grouped together into homogeneous groups (cohesive within the cluster; the distance between data points within a cluster should be small).
-- Dissimilar items should fall into heterogeneous groups (distinctive between clusters; the distance between clusters should be large).

Distance between clusters:

-- Single linkage, also called nearest neighbour (minimum distance between members of the 2 clusters).
-- Complete linkage, also called farthest neighbour (maximum distance between members of the 2 clusters).
-- Average linkage – average of all distances between members of the 2 clusters.
-- Centroid linkage – distance between the centroids of the 2 clusters.
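The four linkage definitions can be sketched for two small 1-D clusters (values are made up):

```python
# Two clusters of 1-D points.
A, B = [1.0, 2.0, 3.0], [8.0, 9.0]

# All pairwise distances between members of the two clusters.
pair_dists = [abs(a - b) for a in A for b in B]

single = min(pair_dists)                       # nearest neighbours: 8 - 3 = 5
complete = max(pair_dists)                     # farthest neighbours: 9 - 1 = 8
average = sum(pair_dists) / len(pair_dists)    # mean of all 6 pair distances
centroid = abs(sum(A) / len(A) - sum(B) / len(B))   # |2 - 8.5| = 6.5

assert (single, complete, centroid) == (5.0, 8.0, 6.5)
assert average == 6.5
```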

Hierarchical Clustering Algorithms – types:

- Agglomerative
- Divisive
- BIRCH (Balanced Iterative Reducing & Clustering using Hierarchies)
- CHAMELEON

The two main types we work with are Agglomerative Clustering & Divisive Clustering.

Agglomerative Clustering, also called AGNES (Agglomerative Nesting):

It is a bottom-up, node-to-root approach (it merges single records into larger groups).
At each step, the two clusters that are most similar are combined into a new, bigger cluster (node), and this continues until we get a single big cluster (root) as the final result.
The result looks like a tree, which can be plotted as a "Dendrogram".
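A minimal sketch of the agglomerative (bottom-up) merge loop on 1-D points with single linkage; each merge step corresponds to one level of the dendrogram (the points are made up):

```python
points = [1.0, 1.5, 5.0, 5.2, 11.0]
clusters = [[p] for p in points]          # start: every record is its own cluster

merge_order = []
while len(clusters) > 1:
    # Find the pair of clusters with the smallest single-linkage distance.
    i, j = min(
        ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
        key=lambda ab: min(abs(x - y) for x in clusters[ab[0]] for y in clusters[ab[1]]),
    )
    merged = clusters[i] + clusters[j]
    merge_order.append(sorted(merged))
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

assert merge_order[0] == [5.0, 5.2]       # closest pair merges first (gap 0.2)
assert merge_order[1] == [1.0, 1.5]       # next closest pair (gap 0.5)
assert merge_order[-1] == sorted(points)  # final merge = one big cluster (root)
```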

Divisive Clustering, also called DIANA (Divisive Analysis):

It is a top-down, root-to-node approach (it splits a large group into smaller groups).
In this, we start with one big cluster, and splits are performed recursively as we move down the hierarchy, i.e., partitioning each cluster into its least similar sub-clusters.

Distance Properties:
Dij = distance between records i & j

Distance requirements – Non-negativity: Dij >= 0, and Dii = 0.
Symmetry: Dij = Dji. Triangle inequality: Dij + Djk >= Dik.

Distance between records:
EUCLIDEAN DISTANCE – if columns have values on different scales, we need to normalize & standardize the numerical values across all columns prior to calculating the Euclidean distance. Otherwise, columns with larger values will dominate the distance measure over columns with smaller values.
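A small sketch of why scaling matters (the age/salary rows are made up): before z-scoring, the large-valued salary column swamps a 30-year age gap; after z-scoring, the age column contributes meaningfully again.

```python
import math

# (age, salary) rows; A and B have a huge age gap but a tiny salary gap.
A, B = (25, 50000.0), (55, 50100.0)
rows = [A, B, (40, 65000.0), (30, 42000.0)]

# Raw squared contributions to the A-B Euclidean distance:
raw_age, raw_salary = (A[0] - B[0]) ** 2, (A[1] - B[1]) ** 2
assert raw_salary > raw_age          # 100^2 = 10000 dominates 30^2 = 900

def zscore(col):
    """Standardize a column: (value - mean) / standard deviation."""
    m = sum(col) / len(col)
    s = math.sqrt(sum((v - m) ** 2 for v in col) / len(col))
    return [(v - m) / s for v in col]

ages = zscore([r[0] for r in rows])
salaries = zscore([r[1] for r in rows])
scaled_age = (ages[0] - ages[1]) ** 2
scaled_salary = (salaries[0] - salaries[1]) ** 2
assert scaled_age > scaled_salary    # after scaling, the age gap matters again
```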

MANHATTAN DISTANCE (taxicab or city-block distance)

STATISTICAL (MAHALANOBIS) DISTANCE
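The two distance measures named above can be sketched for a pair of 2-D points (the points and the identity covariance are illustrative; in practice the Mahalanobis distance uses the inverse covariance matrix estimated from the data):

```python
import math

p, q = (1.0, 2.0), (4.0, 6.0)

# Manhattan (city-block): sum of absolute coordinate differences.
manhattan = sum(abs(a - b) for a, b in zip(p, q))
assert manhattan == 7.0                      # |1-4| + |2-6|

# Euclidean, for comparison (3-4-5 triangle); never exceeds Manhattan.
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
assert euclidean == 5.0 and euclidean <= manhattan

# Mahalanobis: sqrt(d^T * S_inv * d), where S_inv is the inverse covariance.
# With an identity covariance it reduces to the Euclidean distance.
cov_inv = [[1.0, 0.0], [0.0, 1.0]]           # identity, for illustration only
d = [a - b for a, b in zip(p, q)]
mahalanobis = math.sqrt(
    sum(d[i] * cov_inv[i][j] * d[j] for i in range(2) for j in range(2))
)
assert mahalanobis == euclidean
```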
