
Introduction to Data Science

Dr. Aakanksha Sharaff


Department of Computer Science and Engineering
National Institute of Technology Raipur C.G. India
Data Science

• An interdisciplinary field that uses scientific methods, processes, algorithms/techniques and systems to extract knowledge and insights from structured and unstructured data.
• Transforming data into meaningful insights that can be used even by a layman.
• E.g., a supermarket analyzing sales data to understand customer buying patterns.
The Ascendance of Data

• Today’s world is drowning in data.
• The Internet itself represents a huge graph of knowledge that contains (among other things) an enormous cross-referenced encyclopedia and domain-specific databases about movies, music, sports results, pinball machines, memes, and cocktails.
Role of the Internet

• 2004 Facebook
• 2005 YouTube
• 2010 Instagram
• 2011 Snapchat
Data Science Lifecycle

1 Business Problem
2 Data Acquisition
3 Data Preparation
4 Exploratory Data Analysis
5 Data Modeling
6 Visualization and Communication
7 Deployment and Maintenance
Business Problem

• What is the problem? Why does it need to be solved? Answering these questions frames the business problem.
Data Acquisition

• Web servers
• Logs
• Databases
• APIs
• Online repositories
Data Preparation

• Data Cleaning
• Transformation
Exploratory Data Analysis

• Most important step in the data science lifecycle
• Select feature variables
Data Modeling

• Machine Learning
  • SVM (Support Vector Machine)
  • NB (Naive Bayes)
  • RF (Random Forest)

• Deep Learning
  • CNN (Convolutional Neural Network)
  • RNN (Recurrent Neural Network)
  • LSTM (Long Short-Term Memory)
Visualization and Communication

• Tableau
• Power BI
• QlikView
The Art of Data Science

• “How Big Data is Changing the Whole Equation for Business,” Wall Street Journal, March 8, 2013
• The several “V”s of big data
Big Data (Big Deal!)

• Apache Spark
• Apache Hadoop
Machine Learning

• Herbert Alexander Simon: “Learning is any process by which a system improves performance from experience.”
• Machine Learning is concerned with computer programs that automatically improve their performance through experience.

Herbert Simon
Turing Award 1975
Nobel Prize in Economics 1978
Machine Learning

❖ Machine learning is an application of Artificial Intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.

❖ Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

❖ The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in data and make better decisions in the future based on the examples that we provide.

❖ The primary aim is to allow computers to learn automatically without human intervention or assistance and adjust actions accordingly.
Introduction to Machine Learning

Example: Spam Filter (Is this spam?)

• Misspelling of ‘Great’ (typo error)
• It’s rare that universities put exclamation points in their subject lines
• Does not address the recipient (such as Dear Sir/Madam...)
• URL is not a Stanford University URL

Each of these features can be combined in a classifier to give us some evidence as to whether the email is spam.
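The hand-crafted features above can be turned into a toy rule-based scorer. This is a minimal sketch; the feature functions, weights, and threshold are illustrative assumptions, not values from the slides:

```python
import re

# Each feature returns 1 if it looks "spammy", 0 otherwise.
# The features mirror the bullet points above.
def has_misspelling(text):
    # Toy check for common misspellings of "great".
    return 1 if re.search(r"\b(grate|graet|grat)\b", text, re.I) else 0

def exclamation_in_subject(subject):
    return 1 if "!" in subject else 0

def generic_salutation(text):
    return 1 if re.search(r"dear sir|dear madam", text, re.I) else 0

def non_university_url(text):
    urls = re.findall(r"https?://(\S+)", text)
    return 1 if any("stanford.edu" not in u for u in urls) else 0

def spam_score(subject, body):
    features = [
        has_misspelling(body),
        exclamation_in_subject(subject),
        generic_salutation(body),
        non_university_url(body),
    ]
    return sum(features)  # equal weights for simplicity

subject = "Congratulations! You won"
body = "Dear Sir, visit http://grate-prizes.example.com now"
print("spam" if spam_score(subject, body) >= 2 else "ham")
```

A real filter would learn the feature weights from labeled examples instead of fixing them by hand, which is exactly the supervised setting described next.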
Machine Learning Methods

Machine learning algorithms are often categorized as supervised or unsupervised.

❖ Supervised
❖ Unsupervised
❖ Semi-supervised
❖ Reinforcement
Supervised Learning

Supervised machine learning algorithms can apply what has been learned in
the past to new data using labeled examples to predict future events. Starting from
the analysis of a known training dataset, the learning algorithm produces an
inferred function to make predictions about the output values.

❖ The system can provide targets for any new input after sufficient training.
❖ The learning algorithm can also compare its output with the correct, intended
output and find errors in order to modify the model accordingly.
An Example: Supervised Learning
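The original slides showed a worked example as an image. As a stand-in, here is a minimal supervised-learning sketch. The library (scikit-learn) is an assumption, the slides prescribe none; it uses the Iris dataset and the SVM listed elsewhere in this deck:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Labeled examples: features X and known targets y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train an SVM (one of the ML methods listed under Data Modeling).
model = SVC(kernel="rbf")
model.fit(X_train, y_train)

# The inferred function now predicts targets for new, unseen inputs.
print("test accuracy:", model.score(X_test, y_test))
```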
Unsupervised Learning

• In contrast, unsupervised machine learning algorithms are used when the information used to train is neither classified nor labeled.

❖ Unsupervised learning studies how systems can infer a function to describe a hidden structure from unlabeled data.
❖ The system doesn’t figure out the right output, but it explores the data and can draw inferences from datasets to describe hidden structures in unlabeled data.
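A minimal unsupervised sketch, assuming scikit-learn’s k-means (the slides name no specific algorithm): the model groups unlabeled points into clusters without ever seeing a target.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two blobs of 2-D points, no targets provided.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=3.0, scale=0.5, size=(50, 2)),
])

# k-means infers hidden structure (two clusters) from X alone.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(kmeans.labels_))
print("cluster centers:\n", kmeans.cluster_centers_)
```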
Semi-supervised Machine Learning

• Semi-supervised machine learning algorithms fall somewhere in between supervised and unsupervised learning, since they use both labeled and unlabeled data for training – typically a small amount of labeled data and a large amount of unlabeled data.

❖ Systems that use this method can considerably improve learning accuracy.
❖ Usually, semi-supervised learning is chosen when acquiring labeled data requires skilled and relevant resources in order to train from it, whereas acquiring unlabeled data generally doesn’t require additional resources.
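One way to see this setting in code is a sketch using scikit-learn’s SelfTrainingClassifier (my choice of method; the slides do not name one), which masks most labels and lets the model pseudo-label the rest:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Simulate the typical setting: few labels, many unlabeled points.
# Unlabeled examples are marked with -1 by scikit-learn convention.
rng = np.random.default_rng(0)
y_partial = y.copy()
unlabeled = rng.random(len(y)) < 0.8  # hide ~80% of the labels
y_partial[unlabeled] = -1

# The base classifier is iteratively retrained on its own
# confident predictions for the unlabeled points.
base = SVC(probability=True)
model = SelfTrainingClassifier(base).fit(X, y_partial)
print("accuracy on all true labels:", model.score(X, y))
```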
Reinforcement Learning

• Reinforcement learning is a method in which the algorithm interacts with its environment by producing actions and discovering errors or rewards.
• Reinforcement Learning (RL) is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences.

❖ Trial-and-error search and delayed reward are the most relevant characteristics of reinforcement learning.
❖ This method allows machines and software agents to automatically determine the ideal behavior within a specific context in order to maximize performance.
❖ Reinforcement learning uses rewards and punishments as signals for positive and negative behavior.
❖ Simple reward feedback is required for the agent to learn which action is best; this is known as the reinforcement signal.
❖ Father of Reinforcement Learning: Richard Sutton
Reinforcement Learning (Contd.)

• It differs from supervised learning because the sample data set does not train the machine; instead, it learns by trial and error. A series of right decisions therefore strengthens the method as it better solves the problem.
• Compared to unsupervised learning, reinforcement learning differs in its goals. While the goal in unsupervised learning is to find similarities and differences between data points, in reinforcement learning the goal is to find a suitable action model that maximizes the total cumulative reward of the agent.

Applications

• RL is quite widely used in building AI for playing computer games.
• In robotics and industrial automation, RL is used to enable a robot to create an efficient adaptive control system for itself, which learns from its own experience and behavior.
• Other applications of RL include text summarization engines, dialog agents (text, speech) which can learn from user interactions and improve with time, learning optimal treatment policies in healthcare, and RL-based agents for online stock trading (see the sketch below).
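To make trial and error and delayed reward concrete, here is a tiny tabular Q-learning sketch on a made-up 1-D corridor world. Everything here (environment, learning rate, epsilon) is an illustrative assumption; the slides name no specific environment or algorithm:

```python
import numpy as np

# A toy 1-D corridor: states 0..4, reward 1 only on reaching state 4.
# Actions: 0 = step left, 1 = step right.
N_STATES, GOAL = 5, 4
alpha, gamma, eps = 0.5, 0.9, 0.1
Q = np.zeros((N_STATES, 2))
rng = np.random.default_rng(0)

def choose_action(s):
    # Epsilon-greedy with random tie-breaking: explore sometimes,
    # otherwise exploit the current value estimates.
    if rng.random() < eps or Q[s, 0] == Q[s, 1]:
        return int(rng.integers(2))
    return int(Q[s].argmax())

for episode in range(200):
    s = 0
    while s != GOAL:
        a = choose_action(s)
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == GOAL else 0.0
        # Q-learning update: nudge Q(s, a) toward reward + discounted
        # value of the best next action (delayed reward in action).
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print("learned greedy policy (1 = right):", Q[:GOAL].argmax(axis=1))
```

After training, the greedy policy moves right in every state, even though the reward only arrives at the far end of the corridor.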
ML vs DL

• ML refers to systems that can learn from experience (training data), while Deep Learning (DL) refers to systems that learn from experience on large data sets. ML can be considered a subset of AI, and DL is ML applied to large data sets.
• Machine learning algorithms typically require structured data, whereas deep learning networks rely on layers of artificial neural networks.
Datasets: ML
1. Spam SMS Classifier dataset
2. Spam Mails dataset
3. ImageNet
4. Iris Flower dataset
5. Breast Cancer Wisconsin (Diagnostic) dataset
6. Twitter Sentiment Analysis dataset
7. MNIST dataset (handwritten digits)
8. Amazon Review dataset
9. IMDB Reviews
10. Sentiment140
Predictions and Forecasts

• Predictions aim to identify one outcome,
• whereas forecasts encompass a range of outcomes.
• E.g., “it will rain tomorrow” is a prediction, but to say that “the chance of rain is 40%” (implying that the chance of no rain is 60%) is to make a forecast, as it lays out the range of possible outcomes with probabilities.
Theories and Models

• Data science is a field where theories are implemented using data, some of it big data. This is embodied in an inference stack comprising (in sequence): theories, models, intuition, causality, prediction, and correlation.
• Theories:
Theories are statements of how the world should be or is, and are derived from axioms (assumptions about the world) or from precedent theories.
• Models:
• Models are implementations of theory; in data science they are often algorithms based on theories that are run on data.
• The academic Thomas Davenport writes that models are key, and should not be eschewed as data grows.
Causality and Correlation

Causality applies to situations where one action, say X, causes an outcome, say Y, whereas correlation merely relates one action (X) to another action (Y); X does not necessarily cause Y.
Normal Distribution

• Bell-shaped curve
• Also called the Gaussian distribution
• The curve is symmetric around the mean.
• Mean, median, and mode are all the same.
• A normal distribution describes how the values of a variable are distributed. It is typically a symmetric distribution where most of the observations cluster around the central peak. The values further away from the mean taper off equally in both directions.
Poisson Distribution

• Named after the French mathematician Siméon Denis Poisson, it is a discrete distribution function describing the probability that an event will occur a certain number of times in a fixed time (or space) interval.
• It is used to model count-based data, like the number of emails arriving in your mailbox in one hour or the number of customers walking into a shop in one day.
• The Poisson distribution helps predict the probability of certain events happening when you know how often the event has occurred in the past. Businesses can use it to make forecasts about the number of customers on certain days and adjust supply according to demand.
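A quick sketch of that use case, assuming NumPy/SciPy and an arbitrary arrival rate: simulate daily customer counts and compare the empirical frequencies with the Poisson probabilities.

```python
import numpy as np
from scipy.stats import poisson

lam = 4.0  # assumed average of 4 customers per day
rng = np.random.default_rng(1)
counts = rng.poisson(lam, size=10_000)  # 10,000 simulated days

# Empirical frequency vs. theoretical Poisson probability.
for n in range(8):
    empirical = np.mean(counts == n)
    print(f"P(N={n}): simulated {empirical:.3f}, "
          f"theory {poisson.pmf(n, lam):.3f}")
```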
Applications of Data Science

• E-commerce
  • Analyze search patterns
  • Recommend products to customers
• Education
  • Explore current trends and find the latest courses as per industry need
  • Collect student feedback
  • Understand student requirements
• Internet Search
  • Take the user’s query
  • Provide results
  • Show relevant recommendations
• Advertising
  • Post ads on websites
  • Explore targeted customers and recommend products to them
• Recommendation
  • Products, Entertainment (video streaming, music)
Applications of Data Science (Contd.)

• Gene Text/Data Mining
• Logistics (Route Planning)
• Predictive Modeling
• Airline Companies
Intuition

• Intuition:
The results of running a model lead to intuition, i.e., a deeper understanding of the world based on theory, model, and data.
Once we have established intuition for the results of a model, it remains to be seen whether the relationships we observe are causal, predictive, or merely correlational. Theory may be causal and tested as such. Granger (1969) causality is often stated in mathematical form for two stationary time series of data as follows. X is said to Granger cause Y if, in the following equation system,

$$Y_t = a_1 + b_1 Y_{t-1} + c_1 X_{t-1} + e_1$$
$$X_t = a_2 + b_2 Y_{t-1} + c_2 X_{t-1} + e_2$$

the coefficient $c_1$ is significant and $b_2$ is not significant. Hence, X causes Y, but not vice versa. Causality is a hard property to establish, even with theoretical foundation, as the causal effect has to be well entrenched in the data.
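As a sketch, statsmodels ships a Granger causality test that regresses each series on lags of both, as in the two-equation system above. The data here is synthetic and built so that X genuinely leads Y (i.e., $c_1 \neq 0$, $b_2 = 0$ by construction):

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
T = 500
x = rng.normal(size=T)
y = np.zeros(T)
# Construct y so that lagged x drives it.
for t in range(1, T):
    y[t] = 0.2 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

# Tests whether the 2nd column Granger-causes the 1st column.
data = np.column_stack([y, x])
res = grangercausalitytests(data, maxlag=1, verbose=False)
pval = res[1][0]["ssr_ftest"][1]
print(f"p-value for 'x Granger-causes y': {pval:.4g}")  # tiny p-value
```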
Correlation

➢ Correlation:
Finally, there is correlation, at the end of the data science inference chain. Contemporaneous movement between two variables is quantified using correlation. In many cases, we uncover correlation, but no prediction or causality. Correlation has great value to firms attempting to tease out beneficial information from big data. And even though it is a linear relationship between variables, it lays the groundwork for uncovering nonlinear relationships, which are becoming easier to detect with more data.
Exponentials, Logarithms, and Compounding

• It is fitting to begin with the fundamental mathematical constant, $e = 2.718281828\ldots$, which is also the basis of the function $\exp(\cdot)$. We often write this function as $e^x$, where $x$ can be a real or complex variable. Given $y = e^x$, a fixed change in $x$ results in the same continuous percentage change in $y$. This is because $\ln(y) = x$, where $\ln(\cdot)$ is the natural logarithm function, the inverse of the exponential function.

• The constant $e$ is defined as the limit of a specific function: $e = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n$

• Exponential compounding is the limit of successively shorter intervals over discrete compounding. Given a horizon $t$ divided into $n$ intervals per year, one dollar compounded from time zero to time $t$ years over these $n$ intervals at per annum rate $r$ may be written as $\left(1 + \frac{r}{n}\right)^{nt}$.

• Continuous compounding is the limit of this expression when the number of periods $n$ goes to infinity:
$$\lim_{n \to \infty} \left(1 + \frac{r}{n}\right)^{nt} = \lim_{n \to \infty} \left[\left(1 + \frac{1}{n/r}\right)^{n/r}\right]^{tr} = e^{rt}$$
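A quick numerical check of this limit, with illustrative values for $r$ and $t$:

```python
import math

r, t = 0.05, 10  # assumed 5% annual rate over 10 years

# Discrete compounding approaches e^(rt) as intervals shrink.
for n in (1, 12, 365, 10_000, 1_000_000):
    discrete = (1 + r / n) ** (n * t)
    print(f"n = {n:>9}: {discrete:.8f}")
print(f"continuous : {math.exp(r * t):.8f}")
```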
Normal Distribution

• This distribution is the workhorse of many models in the social sciences, and is assumed to generate much of the data that comprises the Big Data universe.
• Interestingly, most phenomena (variables) in the real world are not normally distributed. They tend to be “power law” distributed, i.e., many observations of low value, and very few of high value. The probability distribution declines from left to right and does not have the characteristic hump shape of the normal distribution.
• We do need to learn about the normal distribution because it is important in statistics, and the central limit theorem does govern much of the data we look at. Examples of approximately normally distributed data are stock returns and human heights.
• If $x \sim N(\mu, \sigma^2)$, that is, $x$ is normally distributed with mean $\mu$ and variance $\sigma^2$, then the probability “density” function for $x$ is:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right]$$
Normal Distribution (Contd.)

• The cumulative probability is given by the “distribution” function:
$$F(x) = \int_{-\infty}^{x} f(u)\,du, \qquad F(x) = 1 - F(-x)$$
• because the normal distribution is symmetric. We often also use the notation $N(\cdot)$ or $\Phi(\cdot)$ instead of $F(\cdot)$.
• The “standard normal” distribution is $x \sim N(0, 1)$. For the standard normal distribution, $F(0) = 1/2$. The normal distribution has continuous support, i.e., a range of values of $x$ that goes continuously from $-\infty$ to $+\infty$.
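A short check of these facts, assuming SciPy (the slides prescribe no library):

```python
from scipy.stats import norm

# Standard normal: F(0) = 1/2, and symmetry F(x) = 1 - F(-x).
print(norm.cdf(0))                    # 0.5
x = 1.7
print(norm.cdf(x), 1 - norm.cdf(-x))  # equal by symmetry

# Density at the mean for N(mu, sigma^2): 1 / sqrt(2*pi*sigma^2).
mu, sigma = 2.0, 0.5
print(norm.pdf(mu, loc=mu, scale=sigma))
```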
Poisson Distribution

• The Poisson is also known as the rare-event distribution. Its density function is:
$$f(n; \lambda) = \frac{e^{-\lambda} \lambda^n}{n!}$$
• where there is only one parameter, i.e., the mean $\lambda$. The density function is over discrete values of $n$, the number of occurrences given the mean number of outcomes $\lambda$. The mean and variance of the Poisson distribution are both $\lambda$. The Poisson is a discrete-support distribution, with a range of values $n = \{0, 1, 2, \ldots\}$.
Moments of a Continuous Random Variable

• The following formulae are useful to review because any analysis of data begins with descriptive statistics, and the following statistical “moments” are computed in order to get a first handle on the data. Given a random variable $x$ with probability density function $f(x)$, the following are the first four moments.
• Mean (first moment or average): $E(x) = \int x f(x)\,dx$
• In like fashion, powers of the variable result in higher ($n$th order) moments. These are “non-central” moments, i.e., moments of the raw random variable $x$, not of its deviation from its mean, $[x - E(x)]$.
• $n$th moment: $E(x^n) = \int x^n f(x)\,dx$
• Central moments are moments of demeaned random variables. The second central moment is the variance:
$$\mathrm{Var}(x) = E[x - E(x)]^2 = E(x^2) - [E(x)]^2$$
Moments of a Continuous Random Variable (Contd.)

• The standard deviation is the square root of the variance, i.e., $\sigma = \sqrt{\mathrm{Var}(x)}$. The third central moment, normalized by the standard deviation to a suitable power, is the skewness:
$$\mathrm{Skewness} = \frac{E[x - E(x)]^3}{\mathrm{Var}(x)^{3/2}}$$
• The absolute value of skewness relates to the degree of asymmetry in the probability density. If more extreme values occur to the left than the right, the distribution is left-skewed; vice versa, it is right-skewed.
• Correspondingly, the fourth central, normalized moment is the kurtosis:
$$\mathrm{Kurtosis} = \frac{E[x - E(x)]^4}{[\mathrm{Var}(x)]^2}$$
• Kurtosis of the normal distribution has value 3. We define “Excess Kurtosis” to be kurtosis minus 3. When a probability distribution has positive excess kurtosis we call it “leptokurtic”. Such distributions have fatter tails (on either or both sides) than a normal distribution.
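These moments can be sanity-checked on simulated data; a sketch with NumPy and SciPy (sample-based estimates, so expect small deviations from theory):

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)

print("mean     :", x.mean())                   # ~2
print("variance :", x.var())                    # ~9
print("skewness :", skew(x))                    # ~0 (symmetric)
print("kurtosis :", kurtosis(x, fisher=False))  # ~3 for a normal
print("excess   :", kurtosis(x))                # ~0 (Fisher definition)
```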
Combining Random Variables

• Since we often have to deal with composites of random variables, i.e., more than one random variable, we review here some simple rules for moments of combinations of random variables. There are several other expressions for the same equations, but we examine just a few here, as these are the ones we will use more frequently.
• First, we see that means are additive and scalable, i.e.,
$$E(ax + by) = aE(x) + bE(y)$$
• where $x, y$ are random variables, and $a, b$ are scalar constants. The variance of scaled, summed random variables is as follows:
$$\mathrm{Var}(ax + by) = a^2 \mathrm{Var}(x) + b^2 \mathrm{Var}(y) + 2ab\,\mathrm{Cov}(x, y)$$
• And the covariance and correlation between two random variables are:
$$\mathrm{Cov}(x, y) = E(xy) - E(x)E(y), \qquad \mathrm{Corr}(x, y) = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}}$$
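A simulation sketch verifying the variance rule, with arbitrary constants $a, b$ and correlated synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, -1.5

# Correlated pair: y shares a component with x.
x = rng.normal(size=200_000)
y = 0.6 * x + 0.8 * rng.normal(size=200_000)

lhs = np.var(a * x + b * y)
cov = np.cov(x, y, bias=True)[0, 1]
rhs = a**2 * np.var(x) + b**2 * np.var(y) + 2 * a * b * cov
print(f"Var(ax+by) directly: {lhs:.4f}, via formula: {rhs:.4f}")
```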
Vector Algebra

• We will be using linear algebra in many of the models. Linear algebra requires the manipulation of vectors and matrices. We will also use vector calculus. Vector algebra and calculus are very powerful methods for tackling problems that involve solutions in spaces of several variables, i.e., in high dimension.
• Rather than work with an abstract exposition, it is better to introduce ideas using an example. We’ll examine the use of vectors in the context of stock portfolios. We define the returns for each stock in a portfolio, and a unit vector, as:
$$R = \begin{bmatrix} R_1 \\ R_2 \\ \vdots \\ R_N \end{bmatrix}, \qquad U = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}$$
• The use of this unit vector will become apparent shortly, but it will be used in myriad ways and is a useful analytical object.
Vector Algebra (Contd.)

• A portfolio vector is defined as a set of portfolio weights, i.e., the fraction of the portfolio that is invested in each stock:
$$w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix}$$
• The total of portfolio weights must add up to 1:
$$\sum_{i=1}^{N} w_i = 1, \qquad w'\mathbf{1} = 1$$
• Pay special attention to the line above. In it, there are two ways in which to describe the sum of portfolio weights. The first one uses summation notation, and the second one uses a simple vector-algebraic statement, i.e., that the transpose of $w$, denoted $w'$, times the unit vector $\mathbf{1}$ equals 1.
• The two elements on the left-hand side of the equation are vectors, and the 1 on the right-hand side is a scalar. The dimension of $w'$ is $(1 \times N)$ and the dimension of $\mathbf{1}$ is $(N \times 1)$. A $(1 \times N)$ vector multiplied by an $(N \times 1)$ vector results in a $(1 \times 1)$ matrix, i.e., a scalar.
Statistical Regression

• Consider a multivariate regression where a stock’s returns $R_i$ are regressed on several market factors $R_k$:
$$R_{it} = \sum_{j=0}^{k} \beta_{ij} R_{jt} + e_{it}, \quad \forall i$$
• where $t = \{1, 2, \ldots, T\}$ (i.e., there are $T$ items in the time series), there are $k$ independent variables, and usually the $j = 0$ term is the intercept. We could write this also as:
$$R_{it} = \beta_0 + \sum_{j=1}^{k} \beta_{ij} R_{jt} + e_{it}, \quad \forall i$$
• Compactly, using vector notation, the same regression may be written as $R_i = R_k \beta_i + e_i$, where $R_i, e_i \in \mathbb{R}^T$, $R_k \in \mathbb{R}^{T \times (k+1)}$, and $\beta_i \in \mathbb{R}^{k+1}$. If there is an intercept in the regression, then the first column of $R_k$ is $\mathbf{1}$, the unit vector. Without providing a derivation, you should know that each regression coefficient is:
$$\beta_{ik} = \frac{\mathrm{Cov}(R_i, R_k)}{\mathrm{Var}(R_k)}$$
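A compact sketch of this regression in NumPy (synthetic returns; the true betas are chosen arbitrarily so the estimates can be checked against them):

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 1000, 2
true_beta = np.array([0.01, 0.8, -0.3])  # intercept and two factors

# Factor matrix R_k with a leading column of ones for the intercept.
factors = rng.normal(size=(T, k))
Rk = np.column_stack([np.ones(T), factors])
Ri = Rk @ true_beta + 0.05 * rng.normal(size=T)

# Least-squares estimate of beta_i in R_i = R_k beta_i + e_i.
beta_hat, *_ = np.linalg.lstsq(Rk, Ri, rcond=None)
print("estimated betas:", beta_hat.round(3))
```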
Diversification

• It is useful to examine the power of using vector algebra with an application. Diversification occurs when we increase the number of non-perfectly correlated stocks in a portfolio, thereby reducing portfolio variance. In order to compute the variance of the portfolio we need to use the portfolio weights $w$ and the covariance matrix of stock returns $R$, denoted $\Sigma$. We first write down the formula for a portfolio’s return variance:
$$\mathrm{Var}(w'R) = w'\Sigma w = \sum_{i=1}^{n} w_i^2 \sigma_i^2 + \sum_{i=1}^{n} \sum_{j=1, i \neq j}^{n} w_i w_j \sigma_{ij}$$
• Readers are strongly encouraged to implement this by hand for $n = 2$ to convince themselves that the vector form of the expression for variance, $w'\Sigma w$, is the same thing as the long form on the right-hand side of the equation above (a sketch follows below). If returns are independent, then the formula collapses to:
$$\mathrm{Var}(w'R) = w'\Sigma w = \sum_{i=1}^{n} w_i^2 \sigma_i^2$$
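Following that suggestion, here is a sketch of the $n = 2$ check in NumPy, with illustrative weights, volatilities, and correlation:

```python
import numpy as np

# Assumed two-stock portfolio: weights, volatilities, correlation.
w = np.array([0.6, 0.4])
sigma = np.array([0.20, 0.30])
rho = 0.25

# Covariance matrix Sigma.
cov = rho * sigma[0] * sigma[1]
Sigma = np.array([[sigma[0]**2, cov],
                  [cov, sigma[1]**2]])

# Vector form w' Sigma w ...
var_vector = w @ Sigma @ w
# ... equals the long form: sum of w_i^2 s_i^2 plus cross terms.
var_long = (w[0]**2 * sigma[0]**2 + w[1]**2 * sigma[1]**2
            + 2 * w[0] * w[1] * cov)
print(var_vector, var_long)  # identical
```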
Matrix Equations

• Here we examine how matrices may be used to represent large systems of equations easily, and also to solve them. Using the values of matrices A, B and w from the previous section, we write out the following in long form:
$$Aw = B: \quad \begin{bmatrix} 3 & 2 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} 3 \\ 4 \end{bmatrix}$$
• Find the solution values $w_1$ and $w_2$ by hand. And then we may compute the solution for $w$ by “dividing” $B$ by $A$. This is not regular division because $A$ and $B$ are matrices. Instead we need to multiply the inverse of $A$ (which is its “reciprocal”) by $B$.
• The inverse of $A$ is:
$$A^{-1} = \begin{bmatrix} 0.500 & -0.250 \\ -0.250 & 0.375 \end{bmatrix}$$
• Now compute by hand:
$$w = A^{-1}B = \begin{bmatrix} 0.50 \\ 0.75 \end{bmatrix}$$
Inter and Intra Cluster
Questions?

• Difference between supervised and unsupervised learning, with examples.
• How can reinforcement learning be applied in healthcare and other domains? Give an example and explanation.
• Explain inter- and intra-cluster distance with an example.
• Give examples of a prediction and a forecast.
References

• Das, S. R. (2016). Data Science: Theories, Models, Algorithms, and Analytics.
• Sharaff, A., & Sinha, G. R. (Eds.). (2021). Data Science and Its Applications. CRC Press.
• Van der Aalst, W. (2016). Data science in action. In Process Mining (pp. 3-23). Springer, Berlin, Heidelberg.
• Provost, F., & Fawcett, T. (2013). Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking. O'Reilly Media, Inc.
