
Data Analytics & Visualization

Lec01 – Big Data, Analytics & Data Science


Who is Generating Big Data?
o Facebook: 1.39 billion monthly users
o YouTube: 300 hrs of video uploaded/min
o Connected devices: 4.9 billion (2015)
o Instagram: 70 million photos/day
o ASKAP: 36 antennas streaming 250 GB/sec/antenna
o Walmart: 1 million transactions each hour; holds 2.5 petabytes
o LHC at CERN: 15 petabytes/year

ASKAP: Australian Square Kilometre Array Pathfinder; LHC: Large Hadron Collider; CERN: European Organization for Nuclear Research
o Google: DB size = 50 billion sites
o Google server farms: 2 million machines (est.) — "The Million-Server Data Center"
Big Data Overview
Every minute (approx.):
o Instagram: 220K photos
o AppStore: 50K apps
o Twitter: 300K tweets
o Email: 200 million emails
o Amazon: $80K in sales
o 2.5 million shares

The V's of big data: Volume, Velocity, Variety, Variability, Veracity
What is Data Science
• Having platforms and big data alone is not enough; we need "Data Science" to fill the gap between platforms and data.
• Data science employs techniques and theories drawn from many fields, such as statistics and machine learning, to extract knowledge and insights from big data by leveraging big data platforms.
“Data Science” an Emerging
Field

O’Reilly Radar report


Some recent ML
Competitions
Data Science – One Definition
Time-Series Forecasting

► A set of evenly spaced numerical data
► Obtained by observing the response variable at regular time periods
► Forecasts are based only on past values; no other variables are important
► Assumes that factors influencing the past and present will continue to influence the future
Time-Series Components
► Trend
► Cyclical
► Seasonal
► Random
Components of Demand
[Figure 4.1: Demand for a product or service over 4 years (time axis: years 1–4), showing the trend component, seasonal peaks, the actual demand line, average demand over 4 years, and random variation.]
Trend Component
► Persistent, overall upward or downward pattern
► Changes due to population, technology, age, culture, etc.
► Typically several years in duration
Seasonal Component
► Regular pattern of up-and-down fluctuations
► Due to weather, customs, etc.
► Occurs within a single year

PERIOD LENGTH   "SEASON" LENGTH   NUMBER OF "SEASONS" IN PATTERN
Week            Day               7
Month           Week              4 – 4.5
Month           Day               28 – 31
Year            Quarter           4
Year            Month             12
Year            Week              52
Cyclical Component
► Repeating up and down movements
► Affected by business cycle, political, and economic factors
► Multiple years in duration
► Often causal or associative relationships
Random Component
► Erratic, unsystematic, 'residual' fluctuations
► Due to random variation or unforeseen events
► Short duration and nonrepeating
Naive Approach
► Assumes demand in the next period is the same as demand in the most recent period
► e.g., If January sales were 68, then February sales will be 68
► Sometimes cost-effective and efficient
► Can be a good starting point
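The naive approach can be sketched in a few lines; the monthly sales figures below are the hypothetical example from the slide (January sales of 68):

```python
# Naive forecast: next period's forecast equals the most recent observed demand.
def naive_forecast(series):
    """Return the naive forecast for the next period."""
    if not series:
        raise ValueError("need at least one observation")
    return series[-1]

sales = [70, 72, 68]                 # hypothetical monthly sales, ending with January = 68
february_forecast = naive_forecast(sales)
print(february_forecast)             # 68
```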
Moving Average Method
► MA is a series of arithmetic means
► Used if there is little or no trend
► Often used for smoothing
► Provides an overall impression of the data over time
Moving Average Example
MONTH       ACTUAL SHED SALES   3-MONTH MOVING AVERAGE
January     10
February    12
March       13
April       16                  (10 + 12 + 13)/3 = 11 2/3
May         19                  (12 + 13 + 16)/3 = 13 2/3
June        23                  (13 + 16 + 19)/3 = 16
July        26                  (16 + 19 + 23)/3 = 19 1/3
August      30                  (19 + 23 + 26)/3 = 22 2/3
September   28                  (23 + 26 + 30)/3 = 26 1/3
October     18                  (26 + 30 + 28)/3 = 28
November    16                  (30 + 28 + 18)/3 = 25 1/3
December    14                  (28 + 18 + 16)/3 = 20 2/3
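The table above can be reproduced with a short sketch; the list of shed sales is taken from the example:

```python
# 3-month simple moving average forecast for the shed-sales example.
def moving_average(series, n=3):
    """Forecast for each period as the mean of the previous n observations.
    The last entry is the forecast for the period after the data ends."""
    return [sum(series[i - n:i]) / n for i in range(n, len(series) + 1)]

sales = [10, 12, 13, 16, 19, 23, 26, 30, 28, 18, 16, 14]   # Jan..Dec
forecasts = moving_average(sales, n=3)
# forecasts[0] is the April forecast: (10 + 12 + 13)/3 = 11 2/3
```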
Weighted Moving Average
► Used when some trend might be present
► Older data is usually less important
► Weights are based on experience and intuition

Weighted moving average = Σ[(weight for period n) x (demand in period n)] / Σ(weights)
Weighted Moving Average
MONTH       ACTUAL SHED SALES   3-MONTH WEIGHTED MOVING AVERAGE
January     10
February    12
March       13
April       16                  [(3 x 13) + (2 x 12) + (10)]/6 = 12 1/6
May         19
June        23
July        26
August      30
September   28
October     18
November    16
December    14

WEIGHTS APPLIED   PERIOD
3                 Last month
2                 Two months ago
1                 Three months ago
6                 Sum of the weights

Forecast for this month =
[3 x Sales last mo. + 2 x Sales 2 mos. ago + 1 x Sales 3 mos. ago] / (Sum of the weights)
Weighted Moving Average
MONTH       ACTUAL SHED SALES   3-MONTH WEIGHTED MOVING AVERAGE
January     10
February    12
March       13
April       16                  [(3 x 13) + (2 x 12) + (10)]/6 = 12 1/6
May         19                  [(3 x 16) + (2 x 13) + (12)]/6 = 14 1/3
June        23                  [(3 x 19) + (2 x 16) + (13)]/6 = 17
July        26                  [(3 x 23) + (2 x 19) + (16)]/6 = 20 1/2
August      30                  [(3 x 26) + (2 x 23) + (19)]/6 = 23 5/6
September   28                  [(3 x 30) + (2 x 26) + (23)]/6 = 27 1/2
October     18                  [(3 x 28) + (2 x 30) + (26)]/6 = 28 1/3
November    16                  [(3 x 18) + (2 x 28) + (30)]/6 = 23 1/3
December    14                  [(3 x 16) + (2 x 18) + (28)]/6 = 18 2/3
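A minimal sketch of the weighted version, using the same shed-sales series and the weights 3/2/1 from the example:

```python
# 3-month weighted moving average: weight 3 on last month, 2 and 1 on earlier months.
def weighted_moving_average(series, weights=(1, 2, 3)):
    """weights are ordered oldest-to-newest; forecast = sum(w*x) / sum(w)."""
    n = len(weights)
    return [
        sum(w * x for w, x in zip(weights, series[i - n:i])) / sum(weights)
        for i in range(n, len(series) + 1)
    ]

sales = [10, 12, 13, 16, 19, 23, 26, 30, 28, 18, 16, 14]   # Jan..Dec
forecasts = weighted_moving_average(sales)
# forecasts[0] is the April forecast: [(3 x 13) + (2 x 12) + (1 x 10)]/6 = 12 1/6
```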
Potential Problems With Moving Average
► Increasing n smooths the forecast but makes it less sensitive to changes
► Does not forecast trends well
► Requires extensive historical data
Graph of Moving Averages
[Figure 4.2: Sales demand (5–30) by month (J–D), plotting actual sales, the moving average, and the weighted moving average.]
© 2014 Pearson Education, Inc.


Common Measures of Error

Mean Absolute Deviation (MAD):
MAD = Σ|Actual - Forecast| / n

Mean Squared Error (MSE):
MSE = Σ(Forecast errors)² / n

Mean Absolute Percent Error (MAPE):
MAPE = (Σ 100 x |Actual_i - Forecast_i| / Actual_i) / n


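The three error measures above can be sketched directly from their definitions (the demand and forecast numbers below are made up purely for illustration):

```python
# Forecast-error measures: MAD, MSE, and MAPE, computed from their definitions.
def mad(actual, forecast):
    """Mean Absolute Deviation."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mse(actual, forecast):
    """Mean Squared Error."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    """Mean Absolute Percent Error (actual values must be nonzero)."""
    return 100 * sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)

actual   = [100, 200]    # hypothetical demand
forecast = [110, 180]    # hypothetical forecasts
print(mad(actual, forecast))    # 15.0
print(mse(actual, forecast))    # 250.0
print(mape(actual, forecast))   # 10.0
```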
Correlation
► How strong is the linear relationship
between the variables?
► Correlation does not necessarily imply
causality!
► Coefficient of correlation, r, measures
degree of association
► Values range from -1 to +1
Correlation Coefficient

r = [nΣxy - (Σx)(Σy)] / √{[nΣx² - (Σx)²][nΣy² - (Σy)²]}
[Figure 4.10: Scatter plots of y versus x showing (a) perfect negative correlation, (b) negative correlation, (c) no correlation, (d) positive correlation, and (e) perfect positive correlation. Correlation coefficient values range from -1.0 to +1.0, from high through moderate to low association as |r| moves from 1 toward 0.]
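A minimal sketch of computing r from its definition (equivalent to the formula above); the two toy data sets illustrate the extreme cases from Figure 4.10:

```python
import math

# Pearson correlation coefficient r, computed from mean-centered sums.
def correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # perfect positive: 1.0
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))   # perfect negative: -1.0
```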
Regression
[Figure: scatter plot of the response variable (y) against the predictor variable (x).]

• Regression attempts to explain the variability in the dependent (target) variable in terms of the variability in independent (predictor) variables.
• If the independent variable(s) sufficiently explain the variability in the dependent variable, then the model can be used for prediction.
Examples of Regression Tasks

Task: Forecasting the monthly sales
  Independent variables (predictor), x: historical monthly sales and other predictor variables (inventory, etc.)
  Target/response variable, y: monthly sales at time t

Task: Predicting power consumption at data centers
  Independent variables (predictor), x: sensor measurements of temperature, fan speed, etc.
  Target/response variable, y: expected power consumed

Task: Predicting crime rate
  Independent variables (predictor), x: statistics about housing, population, job/income, education, etc.
  Target/response variable, y: crime rate in a given city or region
Problem Definition
Given:
• A training set {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, where each x_i corresponds to a set of predictor variables and y_i is the corresponding value of the target variable

Task:
• Learn a target function f(x; w) to predict the value of y for any given input x
• w denotes the model parameters
Examples of Regression Models
• Linear models
• Multiple linear regression
• Ridge regression

• Nonlinear models
• Neural networks
• Kernel ridge regression
• Support vector regression
• Locally weighted regression
• Regression trees
Multiple Linear Regression
• Assume the target function is linear:
  f(x; w) = w_0 + w_1 x_1 + … + w_d x_d = wᵀx

• Estimation: find w that minimizes the residual sum of squares:
  RSS(w) = Σ_i (y_i - wᵀx_i)²

• Prediction: given a test point x*, predict ŷ = wᵀx*
Implementation
• Suppose you have:
• Xtrain: N x d matrix of predictor variables (training set)
• Ytrain: N x 1 vector of response variables (training set)
• Xtest: M x d matrix of predictor variables (test set)
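Under this setup, fitting and predicting can be sketched with NumPy's least-squares solver (the tiny `Xtrain`/`Ytrain` arrays below are hypothetical stand-ins for the matrices described above; a column of ones is appended for the intercept):

```python
import numpy as np

# Least-squares fit: w = argmin ||X w - y||^2, solved with np.linalg.lstsq.
def fit(Xtrain, Ytrain):
    X = np.hstack([Xtrain, np.ones((Xtrain.shape[0], 1))])  # append intercept column
    w, *_ = np.linalg.lstsq(X, Ytrain, rcond=None)
    return w

def predict(Xtest, w):
    X = np.hstack([Xtest, np.ones((Xtest.shape[0], 1))])
    return X @ w

# Tiny illustration: points on the line y = 2x + 1 are recovered exactly.
Xtrain = np.array([[0.0], [1.0], [2.0]])   # N x d with N = 3, d = 1
Ytrain = np.array([1.0, 3.0, 5.0])
w = fit(Xtrain, Ytrain)
print(predict(np.array([[3.0]]), w))       # approximately [7.]
```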
Example
• 1-dimensional data:
• Generated 100 data points (use 20% for training)
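A hypothetical re-creation of that 1-D example (the true line y = 3x + 2, the noise level, and the random seed are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 noisy points on an assumed true line y = 3x + 2; use the first 20 (20%)
# for training and hold out the remaining 80 for testing.
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=100)

x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

# Fit slope and intercept by least squares on the training split only.
slope, intercept = np.polyfit(x_train, y_train, deg=1)
test_rmse = np.sqrt(np.mean((slope * x_test + intercept - y_test) ** 2))
print(slope, intercept, test_rmse)
```

With only 20% of the data used for fitting, the recovered slope and intercept should still land close to the assumed true values, and the held-out RMSE stays near the noise level.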
Multiple-Regression Analysis
If more than one independent variable is to be used in the model, linear regression can be extended to multiple regression to accommodate several independent variables.

Computationally, this is quite complex and is generally done on a computer.
Multiple-Regression Analysis
In the Nodel example, including interest rates in the model gives the new equation:

ŷ = 1.80 + .30x_1 - 5.0x_2

An improved correlation coefficient of r = .96 suggests this model does a better job of predicting the change in construction sales.

Sales = 1.80 + .30(6) - 5.0(.12) = 3.00
Sales = $3,000,000
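The arithmetic on this slide, checked in a couple of lines (x_1 = 6 and x_2 = .12 as given; sales are in $ millions):

```python
# Nodel example: evaluate Sales = 1.80 + .30 x1 - 5.0 x2 at x1 = 6, x2 = .12.
sales = 1.80 + 0.30 * 6 - 5.0 * 0.12
print(sales)                      # 3.0, i.e. $3,000,000
```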
Tutorial
• https://www.kaggle.com/code/ryanholbrook/linear-regression-with-time-series