
What is data science?

• Data science enables businesses to process huge amounts of structured and unstructured big data to detect patterns.
• This in turn allows companies to increase efficiencies, manage costs, identify new market opportunities, and boost their market advantage.
• Asking a personal assistant like Alexa or Siri for a recommendation demands data science.
• So does operating a self-driving car, using a search engine that provides useful results, or talking to a chatbot for customer service.
• "Torture the data, and it will confess to anything." – Ronald Coase
• Data science is a process of extracting knowledge and insights from data using scientific methods.
• Understand and analyze the actual phenomenon with the data.
• It employs techniques and theories drawn from mathematics, statistics, information science and computer science.
Data Science Stages
• Data Acquisition
• Data Preparation
• Data Mining and Modelling
• Visualization and Action
• Model Re-computation

Data Acquisition or Obtain
• Extracting data from multiple sources
• Integrating and transforming data into a homogeneous format
• Loading the transformed data into a data warehouse
• To perform the tasks above, the user will need certain technical skills. For example, for database management, you will need to know how to use MySQL, PostgreSQL or MongoDB (if you are using a non-structured set of data).
Skills Required
• To perform the data preparation tasks, you will need certain technical skills. For example, for database management, you will need to know how to use MySQL, PostgreSQL or MongoDB (if you are using a non-structured set of data).

Data Preparation or Scrub
• Data cleaning
  • Handling missing and NULL values
• Data transformation
  • Normalization, standardization, categorical-to-numerical conversion
• Handling outliers
  • Can be used for fraud detection
• Data integrity
  • Accuracy and reliability of data
• Data integration
  • Removing duplicate rows/columns
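Below is a minimal pandas sketch of the scrub steps listed above (missing values, duplicates, scaling, categorical encoding); the file name and column names ("age", "income", "city") are hypothetical placeholders.

```python
import pandas as pd

df = pd.read_csv("raw_data.csv")                    # hypothetical input file

df = df.drop_duplicates()                           # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())    # handle missing/NULL values
df = df.dropna(subset=["income"])                   # or drop rows missing a key field

# normalization / standardization of a numerical column
df["income_std"] = (df["income"] - df["income"].mean()) / df["income"].std()

# categorical -> numerical transformation
df = pd.get_dummies(df, columns=["city"])
```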
Skills Required
• Need scripting tools like Python or R to help you scrub the data.
• Handling bigger data sets requires skills in Hadoop, MapReduce or Spark. These tools can help you scrub the data by scripting.

Data Mining
• The process of semi-automatically analyzing large databases to find patterns that are:
  • Valid: hold on new data with some certainty
  • Novel: non-obvious to the system
  • Useful: it should be possible to act on the item
  • Understandable: humans should be able to interpret the pattern
Web Mining
• Web content mining
• Web usage mining
• Web structure mining

Artificial Intelligence
• Refers to the ability of machines to perform cognitive tasks like thinking, perceiving, learning, problem solving and decision making.
• Programs that behave externally like humans?
• Programs that operate internally as humans do?
• Computational systems that behave intelligently?
• Rational behaviour?
• Weak AI: simulated thinking; strong AI: actual thinking
• Narrow AI: single task; general AI: multiple tasks
• Superintelligence: general and strong AI
Machine Learning
• Machine learning is an application that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
• The process of learning begins with observations of data, such as examples, direct experience or instruction, in order to look for patterns in the data.
• The primary goal of ML is to allow computers to learn automatically without human help.
• Types: supervised learning, unsupervised learning, and reinforcement learning.

Skills Required
• Need to know how to use NumPy, Matplotlib, Pandas or SciPy; if you are using R, you will need to use ggplot2 or the data exploration swiss knife dplyr. On top of that, you need to have knowledge and skills in inferential statistics and data visualization.
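As a quick illustration of the first two learning types (this example is not from the slides), the sketch below fits a supervised classifier and an unsupervised clustering model on scikit-learn's built-in Iris data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised learning: the labels y guide the training.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", clf.score(X, y))

# Unsupervised learning: only X is used; structure (clusters) is discovered.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((km.labels_ == c).sum()) for c in range(3)])
```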
Visualization and Action
• The results of modelling are presented in a meaningful manner.
• They are interpreted using various visualization tools for decision making.
• The decisions arrived at are put into action for their applicability.

Model Recomputation
• The data, or the characteristics of the data, change over time, and so does the accuracy of the model.
• The model needs to be recomputed or updated periodically to cater to new data produced since the present model was computed.
• Model recomputation results in better accuracy.
Typology of Problems
• Regression Problem
  • Once the model has been trained to learn the relationship between the features and the response using labeled data,
  • use it to make predictions for houses where you don't know the price, based on the information contained in the features.
  • The goal of predictive modeling in this case is to make a prediction that is close to the true value of the house. Since we are predicting a numerical value on a continuous scale, this is called a regression problem.
• Classification Problem
  • If a user is trying to make a qualitative prediction about the house, to answer a yes-or-no question such as "will this house go on sale within the next five years?" or "will the owner default on the mortgage?", we would be solving what is known as a classification problem.
  • Here, the goal is to answer the yes-or-no question correctly.
• Classification and regression tasks are called supervised learning, which is a class of problems that relies on labeled data.
• These problems can be thought of as needing "supervision" by the known values of the target variable.
• By contrast, there is also unsupervised learning, which relates to more open-ended questions of trying to find some sort of structure in a dataset that does not necessarily have labels.
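The sketch below makes the regression/classification distinction concrete on synthetic "house" data; the features, coefficients and the sale threshold are invented for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                  # e.g. [size, number of rooms]
price = 50 + 30 * X[:, 0] + 10 * X[:, 1] + rng.normal(scale=5, size=200)
will_sell = (price > 60).astype(int)           # a yes/no target derived from price

reg = LinearRegression().fit(X, price)         # regression: continuous target
clf = LogisticRegression().fit(X, will_sell)   # classification: yes/no target

print("predicted price:", reg.predict(X[:1]))
print("predicted sale (0/1):", clf.predict(X[:1]))
```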
Importance of Linear Algebra
• Linear algebra is a very fundamental part of data science.
• We identify the most important concepts from linear algebra that are useful in data science.
• Linear algebra can be treated very theoretically and very formally; however, this short module covers the linear algebra that has relevance to data science.
• In data science, data representation becomes an important aspect, and data is usually represented in matrix form.
• The second thing one is interested in from a data science perspective is: if this data contains several variables of interest, how many of these variables are really important, are there relationships between these variables, and if there are such relationships, how does one uncover them?
• This is another interesting and important question that needs to be answered from the viewpoint of understanding data. Linear algebraic tools allow us to understand this.
• The third block basically says that the ideas from linear algebra become very important in all kinds of machine learning algorithms.

Statistics in Data Science
• Statistics is a set of mathematical methods and tools that enable us to answer important questions about data. It is divided into two categories:
  • Descriptive Statistics - offers methods to summarise data by transforming raw observations into meaningful information that is easy to interpret and share.
  • Inferential Statistics - offers methods to study experiments done on small samples of data and extend the inferences to the entire population (entire domain).
General Statistics Skills
• How to define statistically answerable questions for effective decision making.
• Calculating and interpreting common statistics and how to use standard data visualization techniques to communicate findings.
• Understanding of how mathematical statistics is applied to the field; concepts such as the central limit theorem and the law of large numbers.
• Making inferences from estimates of location and variability (ANOVA).
• How to identify the relationship between target variables and independent variables.
• How to design statistical hypothesis testing experiments, A/B testing, and so on.
• How to calculate and interpret performance metrics like p-value, alpha, type 1 and type 2 errors, and so on.
Important Statistics Concepts
• Getting Started — understanding types of data (rectangular and non-rectangular), estimates of location, estimates of variability, data distributions, binary and categorical data, correlation, relationships between different types of variables.
• Distribution of a Statistic — random numbers, the law of large numbers, the Central Limit Theorem, standard error, and so on.
• Data Sampling and Distributions — random sampling, sampling bias, selection bias, sampling distribution, bootstrapping, confidence intervals, normal distribution, t-distribution, binomial distribution, chi-square distribution, F-distribution, Poisson and exponential distributions.

Optimization for Data Science
• From a mathematical foundation viewpoint, the three pillars of data science are Linear Algebra, Statistics and Optimization, which is used in pretty much all data science algorithms.
• To understand the optimization concepts one needs a good fundamental understanding of linear algebra.
• Optimization is a problem where you maximize or minimize a real function by systematically choosing input values from an allowed set and computing the value of the function.
• That means when we talk about optimization we are always interested in finding the best solution. So, let us say that one has some functional form; the general problem is
  minimize f₀(x)
  s.t. fᵢ(x) ≤ 0, i = 1, …, k
       hⱼ(x) ≤ 0, j = 1, …, l
• A basic understanding of optimization will help to:
  • understand more deeply the working of machine learning algorithms;
  • rationalize the working of an algorithm: if you get a result and you want to interpret it, a deep understanding of optimization lets you see why you got that result;
  • and at an even higher level, you might be able to develop new algorithms yourself.
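As a small illustration of this formulation (not from the slides), the sketch below minimizes an arbitrary quadratic objective under a bound and a single inequality constraint using scipy.optimize.

```python
from scipy.optimize import minimize

f0 = lambda x: (x[0] - 1) ** 2 + (x[1] - 2.5) ** 2            # objective f0(x)
cons = [{"type": "ineq", "fun": lambda x: 6 - x[0] - x[1]}]   # x0 + x1 <= 6
bnds = [(0, None), (0, None)]                                 # x >= 0

res = minimize(f0, x0=[2.0, 0.0], bounds=bnds, constraints=cons)
print("optimal x:", res.x, "objective value:", res.fun)
```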
Types of Optimization Problems
• Generally, an optimization problem has three components:
  minimize f(x), w.r.t. x, subject to a ≤ x ≤ b
  • The objective function f(x): the function we are trying to either maximize or minimize. In general, we talk about minimization problems, simply because a maximization problem with f(x) can be converted to a minimization problem with -f(x). So, without loss of generality, we can look at minimization problems.
  • Decision variables (x): the variables we can choose in order to minimize the function. So, we write this as min f(x).
  • Constraints (a ≤ x ≤ b): the constraint that restricts x to some set.
• Depending on the types of constraints only:
  • Constrained optimization problems: in cases where constraints are given and the solution has to satisfy these constraints, we call them constrained optimization problems.
  • Unconstrained optimization problems: in cases where the constraint is missing, we call them unconstrained optimization problems.
• Depending on the types of objective functions, decision variables and constraints:
  • If the decision variable x is a continuous variable: a variable x is said to be continuous if it takes an infinite number of values; in this case, x can take an infinite number of values between -2 and 2.
    min f(x), x ∈ (−2, 2)
  • If the decision variable x is an integer variable: all numbers whose fractional part is 0 (zero), like -3, -2, 1, 0, 10, 100, are integers.
    min f(x), x ∈ {0, 1, 2, 3}
  • If the decision variable x is a mixed variable: if we combine both a continuous variable and an integer variable, then this decision variable is known as a mixed variable.
    min f(x₁, x₂), x₁ ∈ {0, 1, 2, 3} and x₂ ∈ (−2, 2)

Structured Thinking
• Structured thinking is a framework for solving unstructured problems — which covers almost all data science problems.
• Using a structured approach to solve problems doesn't only help with solving the problem faster but also identifies the parts of the problem that may need some extra attention.
• Think of structured thinking like the map of a new city you're visiting.
• Without a map, you probably will find it difficult to reach your destination.
• Even if you did eventually reach your destination, it would probably have taken you double the time you might have needed if you did have a map.
• Structured thinking is a framework and not a fixed mindset; it can be modified to match any problem you need to solve.

Six-Step Problem Solving Model
• This technique uses an analytical approach to solve any given problem. As the name suggests, it uses 6 steps to solve a problem:
1. Have a clear and concise problem definition.
2. Study the roots of the problem.
3. Brainstorm possible solutions to the problem.
4. Examine the possible solutions and choose the best one.
5. Implement the solution effectively.
6. Evaluate the results.
• This model follows the mindset of continuous development and improvement. So, at step 6, if your results didn't turn out the way you wanted, you can go back to step 4 and choose another solution, or to step 1 and try to define the problem differently.
The Drill-Down Technique
• This technique is more suitable for complex and larger problems that multiple people will be working on.
• The whole purpose of using this technique is to break a problem down to its roots to ease up finding solutions for it.
• To use the drill-down technique, you first need to start by creating a table.
  • The first column of the table will contain the outlined definition of the problem,
  • followed by a second column containing the factors causing this problem.
  • Finally, the third column will contain the causes of the second column's contents.
  • Continue to drill down on each column until you reach the root of the problem.
• Once you reach the root causes of the problem, you can then use them to develop solutions for the bigger problem.

Eight Disciplines of Problem Solving
• This technique offers a practical plan to solve a problem using an eight-step process, referred to as the eight disciplines D1–D8.
• D1: Put together your team. Having a team with the set of skills needed to solve the project can make moving forward much easier.
• D2: Define the problem. Describe the problem in quantifiable terms: the who, what, where, when, why, and how.
• D3: Develop a working plan.
• D4: Determine and identify root causes. Identify the root causes of the problem using cause-and-effect diagrams to map causes against their effects.
• D5: Choose and verify permanent corrections. Based on the root causes, assess the work plan you developed earlier and edit it if needed.
• D6: Implement the corrected action plan.
• D7: Assess your results.
• D8: Congratulate your team. After the end of a project, it's essential to take a step back and appreciate the work that has been done before jumping into a new project.

The Cynefin Framework
• The Cynefin framework technique, like the rest of the techniques, works by breaking down a problem into its root causes to reach an efficient solution.
• The Cynefin framework works by approaching the problem from one of 5 different perspectives.
• It can be considered a higher-level approach; it requires you to place your problem into one of the 5 contexts.
• 1. Obvious Contexts
  • The options are clear, and the cause-and-effect relationships are apparent and easy to point out.
• 2. Complicated Contexts
  • The problem might have several correct solutions.
  • A clear relationship between cause and effect may exist, but it's not apparent to everyone.
• 3. Complex Contexts
  • The problem has no direct answer; complex contexts are problems that have unpredictable answers.
  • The best approach here is to follow a trial-and-error approach to solve it.
• 4. Chaotic Contexts
  • There is no relationship between cause and effect.
  • The main goal will be to establish a correlation between the causes and effects to solve the problem.
• 5. Disorder
  • The most difficult of the contexts to categorize.
  • The only way to do it is to eliminate the other contexts and gather further information.

The 5-Whys Technique
• Also called the curious child approach.
• This technique follows a simple approach of asking why 5 times.
• First, start with the main problem and ask why it occurred;
• then keep asking why until you reach the root cause of the said problem.
• You may need to ask more or fewer than 5 whys to reach your answer.
Structured Data
• Structured data, categorized as quantitative data, is highly organized and easily decipherable by machine learning algorithms.
• Developed by IBM in 1974, Structured Query Language (SQL) is the programming language used to manage structured data.
• By using a relational (SQL) database, business users can quickly input, search and manipulate structured data.
• Examples of structured data include dates, names, addresses, credit card numbers, etc. Their benefits are tied to ease of use and access, while liabilities revolve around data inflexibility.

Pros of structured data
• Easily used by machine learning (ML) algorithms: the specific and organized architecture of structured data eases manipulation and querying of ML data.
• Easily used by business users: structured data does not require an in-depth understanding of different types of data and how they function. With a basic understanding of the topic relative to the data, users can easily access and interpret the data.
• Accessible by more tools: since structured data predates unstructured data, there are more tools available for using and analyzing it.
Cons of structured data
• Limited usage: data with a predefined structure can only be used for its intended purpose, which limits its flexibility and usability.
• Limited storage options: structured data is generally stored in data storage systems with rigid schemas (e.g., "data warehouses"). Therefore, changes in data requirements necessitate an update of all structured data, which leads to a massive expenditure of time and resources.

Structured data tools
• OLAP: performs high-speed, multidimensional data analysis from unified, centralized data stores.
• SQLite: implements a self-contained, serverless, zero-configuration, transactional relational database engine.
• MySQL: embeds data into mass-deployed software, particularly mission-critical, heavy-load production systems.
• PostgreSQL: supports SQL and JSON querying as well as high-tier programming languages (C/C++, Java, Python, etc.).
Use cases for structured data
• Customer relationship management (CRM): CRM software runs structured data through analytical tools to create datasets that reveal customer behavior patterns and trends.
• Online booking: hotel and ticket reservation data (e.g., dates, prices, destinations, etc.) fits the "rows and columns" format indicative of the pre-defined data model.
• Accounting: accounting firms or departments use structured data to process and record financial transactions.

Unstructured data
• Unstructured data, typically categorized as qualitative data, cannot be processed and analyzed via conventional data tools and methods.
• Since unstructured data does not have a predefined data model, it is best managed in non-relational (NoSQL) databases. Another way to manage unstructured data is to use data lakes to preserve it in raw form.
Pros of unstructured data
• Native format: unstructured data, stored in its native format, remains undefined until needed. Its adaptability increases the file formats in the database, which widens the data pool and enables data scientists to prepare and analyze only the data they need.
• Fast accumulation rates: since there is no need to predefine the data, it can be collected quickly and easily.
• Data lake storage: allows for massive storage and pay-as-you-use pricing, which cuts costs and eases scalability.

Cons of unstructured data
• Requires expertise: due to its undefined/non-formatted nature, data science expertise is required to prepare and analyze unstructured data. This is beneficial to data analysts but alienates unspecialized business users who may not fully understand specialized data topics or how to utilize their data.
• Specialized tools: specialized tools are required to manipulate unstructured data, which limits product choices for data managers.

(Courtesy of IBM)
Unstructured data tools
• MongoDB: uses flexible documents to process data for cross-platform applications and services.
• DynamoDB: delivers single-digit millisecond performance at any scale via built-in security, in-memory caching and backup and restore.
• Hadoop: provides distributed processing of large data sets using simple programming models and no formatting requirements.
• Azure: enables agile cloud computing for creating and managing apps through Microsoft's data centers.

Use cases for unstructured data
• Data mining: enables businesses to use unstructured data to identify consumer behavior, product sentiment, and purchasing patterns to better accommodate their customer base.
• Predictive data analytics: alerts businesses of important activity ahead of time so they can properly plan and accordingly adjust to significant market shifts.
• Chatbots: perform text analysis to route customer questions to the appropriate answer sources.
Structured vs Semi-structured vs Unstructured data
• Technology
  • Structured: based on relational database tables.
  • Semi-structured: based on XML/RDF (Resource Description Framework).
  • Unstructured: based on character and binary data.
• Transaction management
  • Structured: matured transactions and various concurrency techniques.
  • Semi-structured: transactions adapted from the DBMS, not matured.
  • Unstructured: no transaction management and no concurrency.
• Version management
  • Structured: versioning over tuples, rows, tables.
  • Semi-structured: versioning over tuples or graphs is possible.
  • Unstructured: versioned as a whole.
• Flexibility
  • Structured: schema dependent and less flexible.
  • Semi-structured: more flexible than structured data but less flexible than unstructured data.
  • Unstructured: more flexible, and there is an absence of schema.
• Scalability
  • Structured: it is very difficult to scale the DB schema.
  • Semi-structured: its scaling is simpler than structured data.
  • Unstructured: it is more scalable.
• Robustness
  • Structured: very robust.
  • Semi-structured: new technology, not very widespread.
• Query performance
  • Structured: structured queries allow complex joining.
  • Semi-structured: queries over anonymous nodes are possible.
  • Unstructured: only textual queries are possible.
Overview

STATISTICAL FOUNDATIONS
• Descriptive Statistics
  • Data Representation
    • Graphical Representation
    • Tabular Representation
  • Summary Statistics
  • Probability Distribution & Random Variables
• Inferential Statistics
  • Confidence Intervals
  • Hypothesis Testing
  • Estimating Parameters to establish Relations
Statistics - Introduction
• What is Statistics?
  • It is defined as the science which deals with the collection, analysis and interpretation of data.
• Scope of Statistics: it finds applications in almost all possible areas like planning, economics, business, biology, astronomy, medical science, psychology, education and even in war.
• Limitations of Statistics: the following are some of its important limitations:
  • Statistics does not study individuals.
  • Statistical laws are not exact.
  • Statistics is liable to be misused.

Use of Data Analytics
• Man vs Machine
  • IBM Deep Blue beats Grand Master Garry Kasparov (1997).
  • Deep Blue trained itself using historical chess game data.
• Machine vs Machine
  • DeepMind's AlphaZero beats the best chess engine of the time, Stockfish (2017).
  • AlphaZero uses reinforcement learning to train itself using just the set of rules of chess.
• Analytics in Games
  • Oakland Athletics baseball team manager Billy Beane developed sabermetrics.
  • Sabermetricians collect in-game activity data to take key strategic decisions during the game.
Descriptive Statistics
• Describing, presenting, summarizing, and organizing your data, either through numerical calculations or graphs or tables.
• Some of the common measurements in descriptive statistics are the central tendency and the variability of the dataset.
• It helps us to understand our data and is a very important part of machine learning.
• Doing a descriptive statistical analysis of our dataset is absolutely crucial.
• What is Data?
  • Data are individual pieces of factual information recorded and used for the purpose of analysis.
• Data is broadly classified into:
  • Quantitative data: numerical values
    • Continuous
    • Discrete
  • Qualitative data: categorical; data is grouped into discrete groups
    • Nominal: order does not exist. Ex: marital status: married/unmarried
    • Ordinal: order does exist. Ex: player contract: A Class (very high skill player), B Class, C Class, D Class (significantly low skill player)
Data Representation
• Data can be represented in the following ways:
• Graphical Representation
  • Categorical variables
    • Bar Chart
    • Pie Chart
  • Quantitative variables
    • Box & Whisker Plot
    • Histogram Plot
    • Scatter Plot
• Tabular Representation
  • Contingency Table

Summary Statistics
• Measure of Central Tendency: Mean, Median, Mode, Quartile
• Measure of Statistical Dispersion: Mean Absolute Deviation (MAD), Standard Deviation (SD), Variance, Inter Quartile Range (IQR), Range
• Measure of the Shape of the Distribution: Skewness, Kurtosis
• Measure of Statistical Dependence: Covariance, Correlation Coefficient
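The summary measures listed above can be computed directly with pandas; the sketch below uses a small made-up sample (the same numbers reused in the median example later).

```python
import pandas as pd

x = pd.Series([3, 4, 3, 1, 2, 3, 9, 5, 6, 7, 4, 8])

print("mean:", x.mean(), "median:", x.median(), "mode:", x.mode().tolist())
q1, q3 = x.quantile(0.25), x.quantile(0.75)
print("Q1:", q1, "Q3:", q3, "IQR:", q3 - q1, "range:", x.max() - x.min())
print("MAD:", (x - x.mean()).abs().mean())
print("variance:", x.var(), "std dev:", x.std())   # pandas uses the n-1 form by default
print("skewness:", x.skew(), "kurtosis:", x.kurt())
```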
Measure of Central Tendency
• It describes a whole set of data with a single value that represents the centre of its distribution. There are three main measures of central tendency: the mean, the median and the mode.

Summary Statistics: Measure of Central Tendency - Mean
• Mean: for a data set, the mean is a central value of a finite set of numbers.
• The arithmetic mean of a set of numbers x₁, x₂, …, xₙ is typically denoted by x̄ and is given by
  x̄ = (x₁ + x₂ + … + xₙ) / n
• If the data set were based on a series of observations obtained by sampling from a statistical population, the arithmetic mean is the sample mean.
Summary Statistics: Measure of Central Tendency - Median
• Median: the value separating the higher half from the lower half of a data set.
• For a data set, it may be thought of as "the middle" value.
• In order to find the median we first need to sort the data points in increasing order and then pick the middle element as the median.
• If the number of data points is even, then the median is the average of the middle pair.
• Example: X = [3, 4, 3, 1, 2, 3, 9, 5, 6, 7, 4, 8]
  • Step 1: sorting the data points gives [1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 9]; here the number of data points is even, n = 12.
  • Step 2: since n is even, the median is the average of the middle pair.
  • Median(X) is 4.
Summary Statistics: Measure of Central Tendency - Mode
• Mode: the value that appears most frequently in the dataset, i.e. the value in the dataset whose frequency is maximum.
• If the dataset consists of discrete values, finding the mode is an easy task.
• If the dataset has continuous or real values, finding the mode is not an easy task.
• In the case of a real-valued dataset the mode is obtained with the help of a histogram.
• In practice the midpoint of the bin with the highest frequency is considered as the mode.
• The issue with this approach is that as you change the bin size (or the number of bins) the mode will change.
• This is the reason the mode is not very popular in data analytics.

Summary Statistics: Measure of Central Tendency - Quartile
• Quartiles: the points (values) which divide the data into quarters.
• Quartile Q1 is the value below which 25% of the data points of the dataset lie.
• Quartile Q2 is the value below which 50% of the data points lie; it is also the median.
• Quartile Q3 is the value below which 75% of the data points lie.
• Sometimes the minimum value (after excluding outliers on the lower side) is considered as quartile Q0 and the maximum value (after excluding outliers on the higher side) as quartile Q4.
• Outliers are all the data points which lie outside the [Q1 − 1.5×IQR, Q3 + 1.5×IQR] range, where IQR is the Inter Quartile Range.
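A quick NumPy sketch of these quartile and IQR definitions, with one artificially large value appended so the 1.5×IQR rule flags it:

```python
import numpy as np

x = np.array([1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 9, 40])     # 40 is an injected outlier
q1, q2, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]
print("Q1:", q1, "median:", q2, "Q3:", q3, "IQR:", iqr, "outliers:", outliers)
```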
Box & Whisker Plot
• A method for graphically depicting groups of numerical data through their quartiles.
• Box & whisker plots display variation in samples of a population without making any assumptions about the underlying distribution.
• A method of presenting the dataset based on a five-point summary:
  • Minimum: the smallest value without outliers (if present in the dataset).
  • Maximum: the largest value without outliers (if present in the dataset).
  • Median (Q2): the middle value of the dataset.
  • First quartile (Q1): the median of the first half of the dataset.
  • Third quartile (Q3): the median of the second half of the dataset.
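A minimal matplotlib sketch of such a plot on a random sample (the sample itself is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=10, size=200)

plt.boxplot(sample)            # box = Q1..Q3, line = median, whiskers + outlier points
plt.title("Box & Whisker Plot")
plt.show()
```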
Summary Statistics: Measure of Statistical Dispersion
• Inter Quartile Range (IQR): the range in which the middle 50% of the data points of the data set lie, IQR = Q3 − Q1.
• Mean Absolute Deviation (MAD), proposed by Gauss (1821): the mean of the absolute deviations |xᵢ − x̄| from the mean.
• Mean Square Deviation, also known as Variance, proposed by Sir Ronald Aylmer Fisher in 1920: the mean of the squared deviations (xᵢ − x̄)².
• Standard Deviation (SD): the square root of the variance; the term was introduced by Karl Pearson in 1893.
• Range: the difference between the largest and smallest values.
Summary Statistics: Measure of the Shape of the Distribution - Skewness
• Skewness: a measure of the asymmetry of the distribution of a random variable about its mean.
• The skewness value can be positive, zero, or negative.
• Skewness of a distribution with a tail on the right side is positive, and for a left-tailed distribution it is negative. Skewness of a symmetric distribution is zero.
• For a sample of n values, an estimator of the population skewness is the standardized third moment
  g₁ = [(1/n) Σ (xᵢ − x̄)³] / [(1/n) Σ (xᵢ − x̄)²]^(3/2)
Summary Statistics: Measure of the Shape of the Distribution - Kurtosis
• Kurtosis: a measure of the "tailedness" of the probability distribution.
• It is a scaled version of the fourth moment of the distribution.
• The kurtosis of the Normal distribution is three.
• Distributions with kurtosis less than three are referred to as platykurtic; such distributions have less extreme outliers than the Normal distribution. Example: the Uniform distribution.
• Distributions with kurtosis greater than three are referred to as leptokurtic; such distributions have more extreme outliers than the Normal distribution. Example: the Laplace distribution.
• Kurtosis(x) ≥ (Skewness(x))² + 1
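A short scipy.stats sketch estimating skewness and kurtosis for a symmetric and a right-skewed sample (the samples are synthetic):

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
normal = rng.normal(size=10_000)
right_skewed = rng.exponential(size=10_000)

# fisher=False reports "raw" kurtosis, which is about 3 for the normal distribution
print("normal:      skew=%.2f  kurtosis=%.2f" % (skew(normal), kurtosis(normal, fisher=False)))
print("exponential: skew=%.2f  kurtosis=%.2f" % (skew(right_skewed), kurtosis(right_skewed, fisher=False)))
```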
Summary Statistics: Measure of Statistical Dependence - Covariance
• Covariance: a measure of the joint variability of two random variables.
• If higher values of the first variable correspond to higher values of the second variable, and lower values of the first variable correspond to lower values of the second variable, then the covariance between the pair of variables is positive; the reverse gives a negative covariance value.
• If the variables are independent then the covariance value is zero.
• Sample covariance: cov(X, Y) = (1/(n − 1)) Σ (xᵢ − x̄)(yᵢ − ȳ)
• Properties of covariance:
  • Bilinear property: for a and b constants and random variables X, Y and Z,
    cov(aX + bY, Z) = a·cov(X, Z) + b·cov(Y, Z)
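The sketch below estimates the sample covariance and the correlation coefficient (introduced on the next slide) with NumPy on two made-up, positively related variables:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2 * x + rng.normal(scale=0.5, size=500)

print("sample covariance matrix:\n", np.cov(x, y))      # uses the 1/(n-1) form
print("correlation coefficient:", np.corrcoef(x, y)[0, 1])
```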
Summary Statistics: Measure of Statistical Dependence - Correlation
• Correlation coefficient: between a pair of random variables X and Y with expected values µX and µY and standard deviations σX and σY, it is defined as
  ρ(X, Y) = cov(X, Y) / (σX·σY)
• The value of the correlation coefficient is bounded between −1 and 1.

Outlier Analysis
• Outlier: a data object that is grossly different from or inconsistent with the remaining set of data.
• Causes
  • Measurement / execution errors
  • Inherent data variability
• Outliers may be valuable patterns
  • Fraud detection
  • Customized marketing
  • Medical analysis
Outlier Mining Approaches
• Given n data points and k (the expected number of outliers), find the top k dissimilar objects.
• Define inconsistent data
  • Residuals in regression
  • Difficulties: multi-dimensional data, non-numeric data
• Mine the outliers
  • Visualization based methods
    • Not applicable to cyclic plots, high-dimensional data and categorical data
  • Statistical approach
  • Distance-based approach
  • Density-based outlier approach
  • Deviation-based approach
Applications
• Credit card fraud detection
• Telecommunication fraud detection
• Network intrusion detection
• Fault detection

Variants of Outlier Detection Problems
• Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t.
• Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x).
• Given a database D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D.
• Challenges
  • How many outliers are there in the data?
  • The method is unsupervised
    • Validation can be quite challenging (just like for clustering)
  • Finding a needle in a haystack

• General Steps
  • Build a profile of the "normal" behavior
    • The profile can be patterns or summary statistics for the overall population
  • Use the "normal" profile to detect anomalies
    • Anomalies are observations whose characteristics differ significantly from the normal profile
• Working assumption:
  • There are considerably more "normal" observations than "abnormal" observations (outliers/anomalies) in the data
Graphical Approaches
• Boxplot (1-D), Scatter plot (2-D), Spin plot (3-D)
• Limitations
  • Time consuming
  • Subjective

Convex Hull Method
• Extreme points are assumed to be outliers
• Use the convex hull method to detect extreme values
• What if the outlier occurs in the middle of the data?
Statistical Approaches
• Assume a parametric model describing the distribution of the data (e.g., normal distribution)
• Apply a statistical test that depends on
  • the data distribution
  • the parameters of the distribution (e.g., mean, variance)
  • the number of expected outliers (confidence limit)

Grubbs' Test
• Detects outliers in univariate data
• Assumes the data come from a normal distribution
• Detects one outlier at a time; remove the outlier and repeat
• H0: there is no outlier in the data
• HA: there is at least one outlier
• Grubbs' test statistic:
  G = max |Xᵢ − X̄| / s
• Reject H0 if:
  G > ((N − 1) / √N) · √( t²(α/(2N), N−2) / (N − 2 + t²(α/(2N), N−2)) )
  where t(α/(2N), N−2) is the upper critical value of the t-distribution with N−2 degrees of freedom.
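A sketch of this test in Python (assuming SciPy is available), applying the statistic and critical value above to a small made-up sample:

```python
import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean, s = x.mean(), x.std(ddof=1)
    idx = np.argmax(np.abs(x - mean))
    g = abs(x[idx] - mean) / s                        # Grubbs' statistic
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)       # t critical value
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return g > g_crit, x[idx], g, g_crit

data = [5.1, 4.9, 5.0, 5.2, 4.8, 9.7]                 # 9.7 looks suspicious
is_outlier, value, g, g_crit = grubbs_test(data)
print(is_outlier, value, round(g, 3), round(g_crit, 3))
```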
Statistical-based – Likelihood Approach
• Assume the data set D contains samples from a mixture of two probability distributions:
  • M (the majority distribution)
  • A (the anomalous distribution)
• General approach:
  • Initially, assume all the data points belong to M
  • Let Lt(D) be the log likelihood of D at time t
  • For each point xt that belongs to M, move it to A
    • Let Lt+1(D) be the new log likelihood
    • Compute the difference Δ = Lt(D) − Lt+1(D)
    • If Δ > c (some threshold), then xt is declared an anomaly and moved permanently from M to A

Limitations
• Most of the tests are for a single attribute
• In many cases, the data distribution may not be known
• For multi-dimensional data, it may be difficult to estimate the true distribution
Distance-based Approaches
• Data is represented as a vector of features
• Three major approaches
  • Nearest-neighbor based
  • Density based
  • Clustering based

Nearest-Neighbor Based Approach
• Approach:
  • Compute the distance between every pair of data points
  • There are various ways to define outliers:
    • Data points for which there are fewer than p neighboring points within a distance D
    • The top n data points whose distance to the k-th nearest neighbor is greatest
    • The top n data points whose average distance to the k nearest neighbors is greatest
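The sketch below scores points by their distance to the k-th nearest neighbor (the second definition above) using scikit-learn, on synthetic 2-D data with one far-away point injected:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])   # last point is far away

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)            # +1 because each point is its own neighbor
dist, _ = nn.kneighbors(X)
score = dist[:, k]                                         # distance to the k-th nearest neighbor
top_n = np.argsort(score)[-3:]                             # top-3 outlier candidates
print("candidate outlier indices:", top_n, "scores:", score[top_n].round(2))
```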
Density-based: LOF Approach
• For each point, compute the density of its local neighborhood
• Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors
• Outliers are points with the largest LOF value

Clustering-Based
• Basic idea:
  • Cluster the data into groups of different density
  • Choose points in small clusters as candidate outliers
  • Compute the distance between candidate points and non-candidate clusters
  • If candidate points are far from all other non-candidate points, they are outliers
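scikit-learn ships an implementation of this idea; a minimal sketch on synthetic data (the injected point and the parameters are arbitrary):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), [[6.0, 6.0]]])   # one injected outlier

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                 # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_      # larger score = more outlying
print("flagged as outliers:", np.where(labels == -1)[0])
print("largest LOF score:", round(scores.max(), 2))
```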
Outliers in Lower Dimensional Projections
• In high-dimensional space, data is sparse and the notion of proximity becomes meaningless
  • Every point is an almost equally good outlier from the perspective of proximity-based definitions
• Lower-dimensional projection methods
  • A point is an outlier if, in some lower dimensional projection, it is present in a local region of abnormally low density
• Divide each attribute into φ equal-depth intervals
  • Each interval contains a fraction f = 1/φ of the records
• Consider a k-dimensional cube created by picking grid ranges from k different dimensions
  • If attributes are independent, we expect the region to contain a fraction fᵏ of the records
• If there are N points, we can measure the sparsity of a cube D as
  S(D) = (n(D) − N·fᵏ) / √(N·fᵏ·(1 − fᵏ))
  where n(D) is the number of points falling in D
• Negative sparsity indicates the cube contains a smaller number of points than expected
• To detect the sparse cells, we have to consider all cells, which is exponential in d; heuristics can be used to find them
Example
• N = 100, φ = 5, f = 1/5 = 0.2, N·f² = 4

How to treat outliers?
• Trimming: excludes the outlier values from our analysis. By applying this technique our data becomes thin when there are more outliers present in the dataset. Its main advantage is that it is the fastest approach.
• Capping: in this technique, we cap our outlier data and set a limit, i.e., above a particular value or below a particular value, all values are considered outliers, and the number of outliers in the dataset gives that capping number.
• For example, if you're working on the income feature, you might find that people above a certain income level behave in the same way as those with a lower income. In this case, you can cap the income value at a level that keeps that behaviour intact and accordingly treat the outliers.
• Treat outliers as missing values: by assuming outliers are missing observations, treat them accordingly, i.e., the same as missing values.
• Discretization: in this technique, by making groups, we include the outliers in a particular group and force them to behave in the same manner as the other points in that group. This technique is also known as Binning.

How to detect outliers?
• For Normal distributions: use the empirical relations of the Normal distribution.
  • The data points which fall below mean − 3×sigma or above mean + 3×sigma are outliers,
  • where mean and sigma are the average value and standard deviation of a particular column.
For Skewed distributions
• Use the Inter-Quartile Range (IQR) proximity rule.
  • The data points which fall below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are outliers,
  • where Q1 and Q3 are the 25th and 75th percentiles of the dataset respectively, and IQR represents the inter-quartile range, given by Q3 − Q1.
• For Other distributions: use a percentile-based approach.
  • For example, data points that lie above the 99th percentile or below the 1st percentile are considered outliers.
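The sketch below applies both rules with NumPy to a synthetic skewed sample with one injected extreme value:

```python
import numpy as np

x = np.append(np.random.default_rng(0).exponential(size=1000), [50.0])   # skewed data + outlier

# IQR proximity rule
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

# percentile rule (1st / 99th percentile)
lo, hi = np.percentile(x, [1, 99])
pct_outliers = x[(x < lo) | (x > hi)]

print("IQR rule flags:", len(iqr_outliers), "points")
print("1%/99% percentile rule flags:", len(pct_outliers), "points")
```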
Techniques for outlier detection and removal
• Z-score treatment:
  • Assumption: the features are normally or approximately normally distributed.
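Under that assumption, a minimal z-score sketch (equivalent to the mean ± 3·sigma rule above; the data are synthetic):

```python
import numpy as np

x = np.append(np.random.default_rng(1).normal(50, 10, size=1000), [120.0])

z = (x - x.mean()) / x.std()
outliers = x[np.abs(z) > 3]                                          # detect
capped = np.clip(x, x.mean() - 3 * x.std(), x.mean() + 3 * x.std())  # or treat by capping
print("outliers found:", outliers)
```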
Deviation based Outlier Detection
• Identifies outliers by examining the main characteristics of objects in a group
• Objects that "deviate" from this description are considered outliers
• Sequential exception technique
  • Simulates the way in which humans can distinguish unusual objects from among a series of supposedly similar objects

Distribution and Plots
• A sample of data will form a distribution, and by far the most well-known distribution is the Gaussian distribution, often called the Normal distribution.
• The distribution provides a parameterized mathematical function that can be used to calculate the probability for any individual observation from the sample space.
• This distribution describes the grouping or the density of the observations, called the probability density function.
• We can also calculate the likelihood of an observation having a value equal to or less than a given value; a summary of these relationships between observations is called a cumulative density function.
Distributions
• A distribution is simply a collection of data, or scores, on a variable.
• Usually, these scores are arranged in order from smallest to largest and then they can be presented graphically.
• Many data conform to well-known and well-understood mathematical functions, such as the Gaussian distribution.
• A function can fit the data with a modification of the parameters of the function, such as the mean and standard deviation in the case of the Gaussian.

Density Functions
• Distributions are often described in terms of their density or density functions.
• Density functions describe how the proportion of data, or the likelihood of the proportion of observations, changes over the range of the distribution.
• Two types of density functions are probability density functions and cumulative density functions.
  • Probability density function: calculates the probability of observing a given value.
  • Cumulative density function: calculates the probability of an observation being equal to or less than a value.
Probability Density Function
• A probability density function, or PDF, can be used to calculate the likelihood of a given observation in a distribution.
• It can also be used to summarize the likelihood of observations across the distribution's sample space.
• Plots of the PDF show the familiar shape of a distribution, such as the bell curve for the Gaussian distribution.

Cumulative Density Function
• A cumulative density function, or CDF, is a different way of thinking about the likelihood of observed values.
• Rather than calculating the likelihood of a given observation as with the PDF, the CDF calculates the cumulative likelihood for the observation and all prior observations in the sample space.
• It allows you to quickly understand and comment on how much of the distribution lies before and after a given value.
• A CDF is often plotted as a curve from 0 to 1 for the distribution.
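A small scipy.stats sketch evaluating both functions for a Gaussian (the mean and standard deviation chosen here are arbitrary):

```python
from scipy.stats import norm

mu, sigma = 50, 5
print("pdf at 50:", norm.pdf(50, loc=mu, scale=sigma))   # density at the mean
print("cdf at 55:", norm.cdf(55, loc=mu, scale=sigma))   # P(X <= 55), about 0.84
print("cdf at 60:", norm.cdf(60, loc=mu, scale=sigma))   # P(X <= 60), about 0.977
```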
Gaussian Distribution
• Both PDFs and CDFs are continuous functions. The equivalent of a PDF for a discrete distribution is called a probability mass function, or PMF.
• A Gaussian distribution can be described using two parameters:
  • mean: denoted with the Greek lowercase letter mu, it is the expected value of the distribution.
  • variance: denoted with the Greek lowercase letter sigma raised to the second power (because the units of the variable are squared), it describes the spread of observations from the mean.
• It is common to use a normalized calculation of the variance called the standard deviation.
  • standard deviation: denoted with the Greek lowercase letter sigma, it describes the normalized spread of observations from the mean.
Student's t-Distribution
• It is a distribution that arises when attempting to estimate the mean of a normal distribution with different sized samples.
• It is a helpful shortcut when describing uncertainty or error related to estimating population statistics for data drawn from Gaussian distributions when the size of the sample must be taken into account.
• Number of degrees of freedom: describes the number of pieces of information used to describe a population quantity.

Chi-Squared Distribution
• The chi-squared distribution is also used in statistical methods on data drawn from a Gaussian distribution to quantify the uncertainty.
• The chi-squared distribution has one parameter: the degrees of freedom, denoted k.
• An observation in a chi-squared distribution is calculated as the sum of k squared observations drawn from a Gaussian distribution.
Dimensionality Reduction
• Dimensionality reduction refers to techniques for reducing the number of input variables in training data.
• When dealing with high dimensional data, it is often useful to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the "essence" of the data. This is called dimensionality reduction.
• Dimensionality reduction is a data preparation technique performed on data prior to modeling. It might be performed after data cleaning and data scaling and before training a predictive model.
Techniques for Dimensionality Reduction
• Feature Selection Methods
• Perhaps the most common are so-called feature selection techniques that use scoring or statistical methods to select which features to keep and which features to delete.
• We perform feature selection to remove "irrelevant" features that do not help much with the classification problem.
• Two main classes of feature selection techniques are wrapper methods and filter methods.
• Wrapper methods, as the name suggests, wrap a machine learning model, fitting and evaluating the model with different subsets of input features and selecting the subset that results in the best model performance. RFE is an example of a wrapper feature selection method.
• Filter methods use scoring methods, like the correlation between the feature and the target variable, to select a subset of input features that are most predictive. Examples include Pearson's correlation and the Chi-Squared test.
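A sketch contrasting the two classes with scikit-learn on its built-in breast cancer data (the choice of estimator and k = 5 features is arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Wrapper method: repeatedly fit a model and drop the weakest features (RFE).
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5).fit(X, y)
print("RFE keeps feature indices:", list(rfe.get_support(indices=True)))

# Filter method: score each feature against the target, keep the best k.
kbest = SelectKBest(score_func=chi2, k=5).fit(X, y)
print("Chi-squared keeps feature indices:", list(kbest.get_support(indices=True)))
```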
• Matrix Factorization
• Techniques from linear algebra can be used for dimensionality reduction.
• Specifically, matrix factorization methods can be used to reduce a dataset matrix into its constituent parts.
• Examples include the eigendecomposition and the singular value decomposition.
• The parts can then be ranked, and a subset of those parts can be selected that best captures the salient structure of the matrix and can be used to represent the dataset.
• The most common method for ranking the components is principal component analysis, or PCA for short.

• Manifold Learning
• Techniques from high-dimensionality statistics can also be used for dimensionality reduction.
• In mathematics, a projection is a kind of function or mapping that transforms data in some way.
• These techniques are sometimes referred to as "manifold learning" and are used to create a low-dimensional projection of high-dimensional data, often for the purposes of data visualization.
• The projection is designed to create a low-dimensional representation of the dataset whilst best preserving the salient structure or relationships in the data.
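A minimal PCA sketch with scikit-learn (the Iris data and the choice of two components are just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)     # scale features before PCA

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)               # project 4-D data to 2-D
print("explained variance ratio:", pca.explained_variance_ratio_)
```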
• Autoencoder Methods
• Deep learning neural networks can be constructed to perform dimensionality reduction.
• A popular approach is called autoencoders. This involves framing a self-supervised learning problem where a model must reproduce the input correctly.
• An auto-encoder is a kind of unsupervised neural network that is used for dimensionality reduction and feature discovery. More precisely, an auto-encoder is a feedforward neural network that is trained to predict the input itself.
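A minimal Keras sketch of this idea, assuming TensorFlow is available; the layer sizes, data, and training settings are arbitrary placeholders:

```python
import numpy as np
from tensorflow.keras import layers, Model, Input

X = np.random.default_rng(0).normal(size=(1000, 20)).astype("float32")

inputs = Input(shape=(20,))
encoded = layers.Dense(3, activation="relu")(inputs)        # bottleneck: 20 -> 3
decoded = layers.Dense(20, activation="linear")(encoded)    # reconstruct: 3 -> 20

autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)                             # reusable encoder part

autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)    # target = the input itself

X_low_dim = encoder.predict(X, verbose=0)
print("encoded shape:", X_low_dim.shape)                     # (1000, 3)
```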
Sampling
