What Is Data Science?
Data science is a process of extracting knowledge and insights from data using scientific methods.
It employs techniques and theories drawn from mathematics, statistics, information science and computer science.
Data Science Stages
Data Acquisition (Obtain)
Data Preparation (Scrub)
Data Mining and Modelling
Visualization and Action
Model Re-computation

Data Acquisition or Obtain
Extracting data from multiple sources
Integrating and transforming data into a homogeneous format
Loading the transformed data into a data warehouse
To perform the tasks above, you will need certain technical skills. For example, for database management you will need to know how to use MySQL, PostgreSQL or MongoDB (if you are using a non-structured set of data).
Skills Required
To perform the data preparation tasks, you will need certain technical skills. For example, for database management, you will need to know how to use MySQL, PostgreSQL or MongoDB (if you are using a non-structured set of data).

Data Preparation or Scrub
Data cleaning: handling missing and NULL values
Data transformation: normalization, standardization, categorical-to-numerical conversion
Handling outliers: can be used for fraud detection
Data integrity: accuracy and reliability of data
Data integration: removing duplicate rows/columns
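As a quick illustration of these scrub steps, here is a minimal pandas sketch; the column names, values, and fill rules are all hypothetical:

```python
import pandas as pd

# Hypothetical raw data with a duplicate row, NULL values and an outlier
raw = pd.DataFrame({
    "age":    [25, None, 47, 47, 131],
    "income": [48000.0, 52000.0, None, None, 60000.0],
    "status": ["married", "single", "single", "single", "married"],
})

df = raw.drop_duplicates().copy()                 # data integration: drop duplicate rows
df["age"] = df["age"].fillna(df["age"].median())  # handle missing / NULL values
df["income"] = df["income"].fillna(df["income"].mean())
df = df[df["age"].between(0, 120)]                # handle outliers: drop implausible ages
# Data transformation: min-max normalization and categorical -> numerical
df["income_norm"] = (df["income"] - df["income"].min()) / \
                    (df["income"].max() - df["income"].min())
df["status_code"] = df["status"].astype("category").cat.codes
print(df)
```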
Skills Required
You need scripting tools such as Python or R to help you scrub the data.
Handling bigger data sets requires skills in Hadoop, MapReduce or Spark; these tools help you scrub the data by scripting.

Data Mining
The process of semi-automatically analyzing large databases to find patterns that are:
Valid: hold on new data with some certainty
Novel: non-obvious to the system
Useful: it should be possible to act on the item
Understandable: humans should be able to interpret the pattern
Web Mining
Web content mining
Web usage mining
Web structure mining

Artificial Intelligence
Refers to the ability of machines to perform cognitive tasks like thinking, perceiving, learning, problem solving and decision making.
Programs that behave externally like humans?
Programs that operate internally as humans do?
Computational systems that behave intelligently?
Rational behaviour?
Weak AI: simulated thinking; strong AI: actual thinking
Narrow AI: a single task; general AI: multiple tasks
Superintelligence: general and strong AI
Machine Learning
Machine learning is an application that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
The process of learning begins with observations of data, such as examples, direct experience or instruction, in order to look for patterns in the data.
The primary goal of ML is to allow computers to learn automatically without human help.
Types: supervised learning, unsupervised learning, and reinforcement learning.

Skills Required
You need to know how to use NumPy, Matplotlib, Pandas or SciPy; if you are using R, you will need GGplot2 or the data-exploration Swiss Army knife Dplyr. On top of that, you need knowledge and skills in inferential statistics and data visualization.
Visualization and Action
The results of modelling are presented in a meaningful manner and interpreted using various visualization tools for decision making.
The decisions arrived at are put into action.

Model Recomputation
The data, or the characteristics of the data, change over time, and so does the accuracy of the model.
The model needs to be recomputed or updated periodically to cater to new data produced since the present model was computed.
Model recomputation results in better accuracy.
Typology of Problems
Regression Problem
Once a model has been trained to learn the relationship between the features and the response using labeled data, you can use it to make predictions for houses where you don't know the price, based on the information contained in the features.
The goal of predictive modeling in this case is to be able to make a prediction that is close to the true value of the house.
Since we are predicting a numerical value on a continuous scale, this is called a regression problem.

Classification Problem
If a user is trying to make a qualitative prediction about the house, to answer a yes-or-no question such as "will this house go on sale within the next five years?" or "will the owner default on the mortgage?", we would be solving what is known as a classification problem.
Here, the goal is to answer the yes-or-no question correctly.

Classification and regression tasks are called supervised learning, a class of problems that relies on labeled data. These problems can be thought of as needing "supervision" by the known values of the target variable.
By contrast, there is also unsupervised learning, which relates to more open-ended questions of trying to find some sort of structure in a dataset that does not necessarily have labels.
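To make the contrast concrete, here is a minimal scikit-learn sketch of both supervised tasks; the toy feature matrix and labels are invented:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy features: [square metres, number of rooms] for five houses
X = np.array([[50, 2], [80, 3], [120, 4], [65, 2], [150, 5]])

# Regression: predict a numerical value on a continuous scale (the price)
prices = np.array([200_000, 310_000, 450_000, 240_000, 560_000])
reg = LinearRegression().fit(X, prices)
print(reg.predict([[100, 3]]))   # predicted price for an unseen house

# Classification: answer a yes/no question (will it sell within five years?)
sold = np.array([1, 0, 1, 0, 1])
clf = LogisticRegression().fit(X, sold)
print(clf.predict([[100, 3]]))   # predicted class label (0 or 1)
```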
Importance of Linear Algebra
Linear algebra is a fundamental part of data science; here we identify the most important concepts from linear algebra that are useful in data science. Linear algebra can be treated very theoretically and very formally; this short module covers the parts that have relevance to data science.
First, data representation is an important aspect of data science: data is usually represented in matrix form.
Second, from a data science perspective, if the data contains several variables of interest, we would like to know how many of these variables are really important, whether there are relationships between them, and, if there are, how to uncover these relationships. This is another interesting and important question we need to answer from the viewpoint of understanding data, and linear algebraic tools allow us to understand it.
Third, ideas from linear algebra become very important in all kinds of machine learning algorithms.
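As one concrete illustration of this matrix view (not from the source), the singular value decomposition in NumPy hints at how many variables really carry independent information; the data here is simulated so that two columns are nearly redundant:

```python
import numpy as np

# Data matrix: 6 observations (rows) x 3 variables (columns)
rng = np.random.default_rng(0)
x1 = rng.normal(size=6)
x2 = 2 * x1 + rng.normal(scale=0.01, size=6)   # nearly redundant with x1
x3 = rng.normal(size=6)
A = np.column_stack([x1, x2, x3])

# The singular values reveal the effective number of independent variables
singular_values = np.linalg.svd(A, compute_uv=False)
print(singular_values)   # one value is near zero -> x1 and x2 are related
```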
Statistics in Data Science
Statistics is a set of mathematical methods and tools that enable us to answer important questions about data. It is divided into two categories:
Descriptive Statistics: offers methods to summarise data by transforming raw observations into meaningful information that is easy to interpret and share.
Inferential Statistics: offers methods to study experiments done on small samples of data and extend the inferences to the entire population (the entire domain).
General Statistics Skills
How to define statistically answerable questions for effective decision making.
Calculating and interpreting common statistics, and using standard data visualization techniques to communicate findings.
Understanding how mathematical statistics is applied to the field, including concepts such as the central limit theorem and the law of large numbers.
Making inferences from estimates of location and variability (ANOVA).
How to identify the relationship between target variables and independent variables.
How to design statistical hypothesis-testing experiments, A/B testing, and so on.
How to calculate and interpret performance metrics like p-value, alpha, and type 1 and type 2 errors.
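As a small illustration of the hypothesis-testing skills above, here is a two-sample t-test with SciPy; the A/B-test numbers are invented:

```python
import numpy as np
from scipy import stats

# Hypothetical A/B test: time-on-page (seconds) for two page variants
a = np.array([42, 51, 39, 47, 55, 44, 49, 41])
b = np.array([53, 58, 49, 61, 50, 57, 52, 60])

t_stat, p_value = stats.ttest_ind(a, b)
alpha = 0.05  # significance level; controls the type 1 error rate
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the variants differ")
else:
    print("Fail to reject H0")
```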
Important Statistics Concepts
Getting started: understanding types of data (rectangular and non-rectangular), estimates of location, estimates of variability, data distributions, binary and categorical data, correlation, and relationships between different types of variables.
Distribution of a statistic: random numbers, the law of large numbers, the central limit theorem, standard error, and so on.
Data sampling and distributions: random sampling, sampling bias, selection bias, sampling distribution, bootstrapping, confidence intervals, and the normal, t, binomial, chi-square, F, Poisson and exponential distributions.
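As a brief illustration of one of these concepts, here is a NumPy sketch of a bootstrap confidence interval for the mean; the sample is simulated:

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.exponential(scale=10.0, size=200)   # a skewed sample

# Bootstrapping: resample with replacement, recompute the statistic
boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(5000)]

# 95% confidence interval for the mean from the bootstrap distribution
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {sample.mean():.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```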
Optimization for Data Science
From a mathematical foundation viewpoint, the three pillars of data science are Linear Algebra, Statistics and Optimization, and optimization is used in pretty much all data science algorithms. To understand optimization concepts, one also needs a good fundamental understanding of linear algebra.
Optimization is a problem where you maximize or minimize a real function by systematically choosing input values from an allowed set and computing the value of the function. That means when we talk about optimization we are always interested in finding the best solution. Say that one has some functional form; the general problem is:

minimize f0(x)
subject to fi(x) ≤ 0, i = 1, …, k
           hj(x) = 0, j = 1, …, l

A basic understanding of optimization will help you:
More deeply understand the working of machine learning algorithms.
Rationalize the working of an algorithm: if you get a result and want to interpret it, a deep understanding of optimization lets you see why you got that result.
At an even higher level of understanding, you might be able to develop new algorithms yourself.
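Here is a minimal SciPy sketch of the standard form above, minimizing a made-up objective under one inequality constraint:

```python
import numpy as np
from scipy.optimize import minimize

# Objective f0(x); its unconstrained minimum is at (3, -1)
f0 = lambda x: (x[0] - 3) ** 2 + (x[1] + 1) ** 2

# One inequality constraint f1(x) = x0 + x1 - 1 <= 0.
# scipy expects constraints in the form g(x) >= 0, so g(x) = -f1(x).
cons = [{"type": "ineq", "fun": lambda x: 1 - x[0] - x[1]}]

res = minimize(f0, x0=np.zeros(2), constraints=cons)
print(res.x, res.fun)   # constrained optimum near (2.5, -1.5)
```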
Types of Optimization Problems
Generally, an optimization problem has three components:
minimize f(x), w.r.t. x, subject to a ≤ x ≤ b
The objective function f(x): the function we are trying to either maximize or minimize. In general, we talk about minimization problems, simply because a maximization problem with f(x) can be converted into a minimization problem with -f(x). So, without loss of generality, we can look at minimization problems.
The decision variables x: the variables we can choose to minimize the function. So we write min f(x).
The constraints (a ≤ x ≤ b): the constraint restricts x to some set.

Depending on the types of constraints only:
Constrained optimization problems: where constraints are given and the solution has to satisfy them.
Unconstrained optimization problems: where the constraints are missing.
Depending on the types of objective functions, decision variables and constraints:
If the decision variable x is a continuous variable: a variable x is said to be continuous if it can take an infinite number of values; here, x can take an infinite number of values between -2 and 2.
min f(x), x ∈ (-2, 2)
If the decision variable x is an integer variable: all numbers whose fractional part is 0 (such as -3, -2, 0, 1, 10, 100) are integers.
min f(x), x ∈ {0, 1, 2, 3}
If the decision variable x is a mixed variable: if we combine continuous and integer variables, the decision variable is known as a mixed variable.
min f(x1, x2), x1 ∈ {0, 1, 2, 3} and x2 ∈ (-2, 2)

Structured Thinking
Structured thinking is a framework for solving unstructured problems, which covers almost all data science problems.
Using a structured approach to solve problems doesn't only help solve the problem faster; it also identifies the parts of the problem that may need extra attention.
Think of structured thinking like the map of a new city you're visiting.
Without a map, you will probably find it difficult to reach your destination. Even if you did eventually reach your destination, it would probably have taken you double the time you would have needed with a map.
Structured thinking is a framework, not a fixed mindset; it can be modified to match any problem you need to solve.

Six-Step Problem Solving Model
This technique uses an analytical approach to solve any given problem. As the name suggests, it uses 6 steps:
1. Have a clear and concise problem definition.
2. Study the roots of the problem.
3. Brainstorm possible solutions to the problem.
4. Examine the possible solutions and choose the best one.
5. Implement the solution effectively.
6. Evaluate the results.
This model follows a mindset of continuous development and improvement. So, at step 6, if your results didn't turn out the way you wanted, you can go back to step 4 and choose another solution, or to step 1 and try to define the problem differently.
The Drill-Down Technique
This technique is more suitable for complex and larger problems that multiple people will be working on. The whole purpose of the technique is to break a problem down to its roots to make finding solutions easier.
To use the drill-down technique, you first need to create a table. The first column of the table contains the outlined definition of the problem, followed by a second column containing the factors causing the problem. Finally, the third column contains the causes of the second column's contents, and you continue to drill down on each column until you reach the root of the problem.
Once you reach the root causes of the problem, you can use them to develop solutions for the bigger problem.

Eight Disciplines of Problem Solving
This technique offers a practical plan to solve a problem using an eight-step process, referred to as the eight disciplines D1-D8:
D1: Put together your team. Having a team with the set of skills needed to solve the project can make moving forward much easier.
D2: Define the problem. Describe the problem in quantifiable terms: the who, what, where, when, why, and how.
D3: Develop a working plan.
D4: Determine and identify root causes. Identify the root causes of the problem using cause-and-effect diagrams to map causes against their effects.
D5: Choose and verify permanent corrections. Based on the root causes, assess the work plan you developed earlier and edit it if needed.
D6: Implement the corrected action plan.
D7: Assess your results.
D8: Congratulate your team. After the end of a project, it's essential to take a step back and appreciate the work that has been done before jumping into a new project.

The Cynefin Framework
The Cynefin framework, like the rest of the techniques, works by breaking down a problem into its root causes to reach an efficient solution. It works by approaching the problem from one of 5 different perspectives, and it can be considered a higher-level approach: it requires you to place your problem into one of 5 contexts.
1. Obvious Contexts: the options are clear, and the cause-and-effect relationships are apparent and easy to point out.
2. Complicated Contexts: the problem might have several correct solutions; a clear relationship between cause and effect may exist, but it is not apparent to everyone.
3. Complex Contexts: problems with unpredictable answers, where it is impossible to find a direct answer. The best approach here is trial and error.
4. Chaotic Contexts: there is no relationship between cause and effect, so the main goal is to establish a correlation between the causes and effects to solve the problem.
5. Disorder: the most difficult context to categorize; the only way to do it is to eliminate the other contexts and gather further information.

The 5-Whys Technique
Also called the curious-child approach, this technique follows the simple approach of asking "why" 5 times: first, start with the main problem and ask why it occurred, then keep asking why until you reach the root cause of the problem. You may need to ask more or fewer than 5 whys to reach your answer.
Structured Data
Structured data, typically categorized as quantitative data, is highly organized and easily decipherable by machine learning algorithms. Developed by IBM in 1974, Structured Query Language (SQL) is the programming language used to manage structured data. By using a relational (SQL) database, business users can quickly input, search and manipulate structured data.
Examples of structured data include dates, names, addresses, credit card numbers, etc. Its benefits are tied to ease of use and access, while its liabilities revolve around data inflexibility.

Pros of structured data
Easily used by machine learning (ML) algorithms: the specific and organized architecture of structured data eases manipulation and querying of ML data.
Easily used by business users: structured data does not require an in-depth understanding of different types of data and how they function. With a basic understanding of the topic relative to the data, users can easily access and interpret the data.
Accessible by more tools: since structured data predates unstructured data, there are more tools available for using and analyzing it.
Cons of structured data
Limited usage: data with a predefined structure can only be used for its intended purpose, which limits its flexibility and usability.
Limited storage options: structured data is generally stored in data storage systems with rigid schemas (e.g., data warehouses). Therefore, changes in data requirements necessitate an update of all structured data, which leads to a massive expenditure of time and resources.

Structured data tools
OLAP: performs high-speed, multidimensional data analysis from unified, centralized data stores.
SQLite: implements a self-contained, serverless, zero-configuration, transactional relational database engine.
MySQL: embeds data into mass-deployed software, particularly mission-critical, heavy-load production systems.
PostgreSQL: supports SQL and JSON querying as well as high-tier programming languages (C/C++, Java, Python, etc.).
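As a small illustration, here is a sqlite3 session in Python querying structured row-and-column data; the table and values are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")           # serverless, zero-configuration
conn.execute("CREATE TABLE bookings (name TEXT, destination TEXT, price REAL)")
conn.executemany(
    "INSERT INTO bookings VALUES (?, ?, ?)",
    [("Alice", "Paris", 320.0), ("Bob", "Tokyo", 780.0)],
)
for row in conn.execute("SELECT name, price FROM bookings WHERE price < 500"):
    print(row)                               # ('Alice', 320.0)
conn.close()
```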
Use cases for structured data
Customer relationship management (CRM): CRM software runs structured data through analytical tools to create datasets that reveal customer behavior patterns and trends.
Online booking: hotel and ticket reservation data (e.g., dates, prices, destinations, etc.) fits the "rows and columns" format indicative of a pre-defined data model.
Accounting: accounting firms and departments use structured data to process and record financial transactions.

Unstructured data
Unstructured data, typically categorized as qualitative data, cannot be processed and analyzed via conventional data tools and methods.
Since unstructured data does not have a predefined data model, it is best managed in non-relational (NoSQL) databases. Another way to manage unstructured data is to use data lakes to preserve it in raw form.
Pros of unstructured data
Native format: unstructured data, stored in its native format, remains undefined until needed. Its adaptability increases the file formats in the database, which widens the data pool and enables data scientists to prepare and analyze only the data they need.
Fast accumulation rates: since there is no need to predefine the data, it can be collected quickly and easily.
Data lake storage: allows for massive storage and pay-as-you-use pricing, which cuts costs and eases scalability.

Cons of unstructured data
Requires expertise: due to its undefined/non-formatted nature, data science expertise is required to prepare and analyze unstructured data. This is beneficial to data analysts but alienates unspecialized business users who may not fully understand specialized data topics or how to utilize their data.
Specialized tools: specialized tools are required to manipulate unstructured data, which limits product choices for data managers.
(Courtesy of IBM)
Use cases for unstructured data
Data mining: enables businesses to use unstructured data to identify consumer behavior, product sentiment, and purchasing patterns to better accommodate their customer base.
Predictive data analytics: alerts businesses to important activity ahead of time so they can properly plan and adjust to significant market shifts.
Chatbots: perform text analysis to route customer questions to the appropriate answer sources.

Unstructured data tools
MongoDB: uses flexible documents to process data for cross-platform applications and services.
DynamoDB: delivers single-digit-millisecond performance at any scale via built-in security, in-memory caching, and backup and restore.
Hadoop: provides distributed processing of large data sets using simple programming models and no formatting requirements.
Azure: enables agile cloud computing for creating and managing apps through Microsoft's data centers.
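As a sketch of the non-relational, schema-free style, here is a pymongo snippet; it assumes a MongoDB server running on localhost, and the collection and fields are hypothetical:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
reviews = client["shop"]["reviews"]   # database and collection created lazily

# Unstructured/semi-structured: documents need not share a schema
reviews.insert_one({"user": "alice", "text": "Great product!", "stars": 5})
reviews.insert_one({"user": "bob", "text": "Broke after a week", "photos": 2})

print(reviews.find_one({"stars": {"$gte": 4}}))
client.close()
```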
Properties | Structured data | Semi-structured data | Unstructured data
Technology | Based on relational database tables | Based on XML/RDF (Resource Description Framework) | Based on character and binary data
Transaction management | Matured transactions and various concurrency techniques | Transaction management adapted from DBMS, not matured | No transaction management and no concurrency
Version management | Versioning over tuples, rows, tables | Versioning over tuples or graphs is possible | Versioned as a whole
Flexibility | Schema-dependent and less flexible | More flexible than structured data but less flexible than unstructured data | Very flexible; absence of schema
Robustness | Very robust | New technology, not very widespread | -
Query performance | Structured queries allow complex joining | Queries over anonymous nodes are possible | Only textual queries are possible
STATISTICAL FOUNDATIONS
Overview
Descriptive Statistics
Data Representation: Graphical Representation, Tabular Representation
Summary Statistics
Probability Distributions & Random Variables
Inferential Statistics
Confidence Intervals
Hypothesis Testing
Estimating Parameters to Establish Relations
Statistics - Introduction
What is Statistics? It is defined as the science which deals with the collection, analysis and interpretation of data.
Scope of statistics: it finds applications in almost all possible areas, such as planning, economics, business, biology, astronomy, medical science, psychology, education and even war.
Limitations of statistics: the following are some of its important limitations:
Statistics does not study individuals.
Statistical laws are not exact.
Statistics is liable to be misused.

Use of Data Analytics
Man vs Machine: IBM Deep Blue beats Grandmaster Garry Kasparov (1997). Deep Blue trained itself using historical chess game data.
Machine vs Machine: DeepMind's AlphaZero beats the best chess engine of the time, Stockfish (2017). AlphaZero uses reinforcement learning to train itself using just the set of rules of chess.
Analytics in Games: Oakland Athletics baseball team manager Billy Beane developed sabermetrics. Sabermetricians collect in-game activity data to make key strategic decisions during the game.
Descriptive Statistics
Descriptive statistics involves describing, presenting, summarizing, and organizing your data, either through numerical calculations or graphs or tables.
Some of the common measurements in descriptive statistics are central tendency and the variability of the dataset.
It helps us understand our data and is a very important part of machine learning.

What is Data?
Data are individual pieces of factual information recorded and used for the purpose of analysis. Data is broadly classified into:
Quantitative data: numerical values, either continuous or discrete.
Qualitative data: categorical; data is grouped into discrete groups.
Nominal: order does not exist. Example - marital status: married/unmarried.
Ordinal: order does exist. Example - player contract: A class, B class, C class, D class.
Summary Statistics: Measures of Statistical Dependence
Sample covariance
Correlation coefficient: for a pair of random variables X and Y with expected values µX and µY and standard deviations σX and σY, it is defined as

ρ(X, Y) = E[(X − µX)(Y − µY)] / (σX σY)

The value of the correlation coefficient is bounded between -1 and 1.

Outlier Analysis
Outliers are data objects that are grossly different from, or inconsistent with, the remaining set of data.
Causes:
Measurement / execution errors
Inherent data variability
Outliers may be valuable patterns, for example in fraud detection, customized marketing, and medical analysis.
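Here is a quick NumPy check of this definition against the built-in estimator, on invented data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly y = 2x

# Correlation from the definition: E[(X - muX)(Y - muY)] / (sigmaX * sigmaY)
r_manual = np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())
r_builtin = np.corrcoef(x, y)[0, 1]
print(r_manual, r_builtin)   # both close to 1, and always within [-1, 1]
```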
Outlier Mining
Given n data points and k, the expected number of outliers, find the top k dissimilar objects.
Define inconsistent data: residuals in regression.
Difficulties: multi-dimensional data, non-numeric data.
Mine the outliers: visualization-based methods (not applicable to cyclic plots, high-dimensional data or categorical data).

Outlier Mining Approaches
Statistical approach
Distance-based approach
Density-based outlier approach
Deviation-based approach
Variants of Outlier Detection Problems
Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t.
Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x).
Given a database D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D.

Applications
Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection.
Challenges
How many outliers are there in the data?
The method is unsupervised, so validation can be quite challenging (just like for clustering).

General Steps
Build a profile of the "normal" behavior; the profile can be patterns or summary statistics for the overall population.

A classical statistical test for a single outlier (Grubbs' test) declares the most extreme point an outlier when

G > (N − 1) · t(α/(2N), N−2) / √( N · (N − 2 + t²(α/(2N), N−2)) )

where t(α/(2N), N−2) is the upper critical value of the t-distribution with N − 2 degrees of freedom at significance level α/(2N).
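Here is a sketch implementing this threshold with SciPy, assuming a two-sided Grubbs' test on invented data:

```python
import numpy as np
from scipy import stats

def grubbs_is_outlier(x, alpha=0.05):
    """Return (G, threshold) for the most extreme point in sample x."""
    n = x.size
    g = np.max(np.abs(x - x.mean())) / x.std(ddof=1)   # Grubbs statistic
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)        # critical t value
    threshold = (n - 1) * t / np.sqrt(n * (n - 2 + t**2))
    return g, threshold

data = np.array([5.1, 4.9, 5.0, 5.2, 4.8, 9.7])        # 9.7 looks anomalous
g, thr = grubbs_is_outlier(data)
print(g, thr, g > thr)   # G above the threshold -> declare an outlier
```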
Statistical-based: Likelihood Approach
Assume the data set D contains samples from a mixture of two probability distributions: M (the majority distribution) and A (the anomalous distribution).
General approach:
Initially, assume all the data points belong to M.
Let Lt(D) be the log-likelihood of D at time t.
For each point xt that belongs to M, move it to A, and let Lt+1(D) be the new log-likelihood.
Compute the difference Δ = Lt(D) − Lt+1(D).
If Δ > c (some threshold), then xt is declared an anomaly and moved permanently from M to A.

Limitations
Most of the tests are for a single attribute.
In many cases, the data distribution may not be known.
For multi-dimensional data, it may be difficult to estimate the true distribution.
Distance-based Approaches
Data is represented as a vector of features.

Nearest-Neighbor Based Approach
Approach: compute the distance between every pair of data points.
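Here is a minimal sketch of this idea (not from the source), scoring each point by its distance to its k-th nearest neighbour; scipy's cdist computes the pairwise distances, and k is a hypothetical choice:

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [1.0, 0.8], [5.0, 5.0]])   # the last point is far from the rest

D = cdist(X, X)          # distance between every pair of data points
D.sort(axis=1)           # row i: distances from point i, ascending (self first)
k = 2
scores = D[:, k]         # distance to the k-th nearest neighbour
print(scores)            # the isolated point gets the largest score
```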