
Data Analytics & Visualization

Lec01 – Big Data, Analytics & Data Science


Who is Generating Big Data?
o Facebook: 1.39 billion monthly users
o YouTube: 300 hrs of video uploaded/min
o Connected devices: 4.9 billion (2015)
o Instagram: 70 million photos/day
o ASKAP: 36 antennas streaming 250 GB/sec/antenna
o Walmart: 1 million transactions each hour; holds 2.5 petabytes
o LHC at CERN: 15 petabytes/year

ASKAP: Australian Square Kilometre Array Pathfinder; LHC: Large Hadron Collider; CERN: European Organization for Nuclear Research
o Google: DB size = 50 billion sites
o Google server farms: 2 million machines (est.) — "The Million-Server Data Center"
Big Data Overview
Every minute (approx.):
o Instagram: 220K photos
o AppStore: 50K apps
o Twitter: 300K tweets
o Email: 200 million emails
o Amazon: $80K in sales
o 2.5 million shares

The V's of big data: Volume, Velocity, Variety, Variability, Veracity
What is Data Science
• Having platforms and big data alone is not enough; we need "Data Science" to fill the gap between platforms and data.
• Data science employs techniques and theories drawn from many fields, such as statistics and machine learning, to extract knowledge and insights from big data by leveraging big data platforms.
“Data Science” an Emerging
Field

O’Reilly Radar report


Some recent ML
Competitions
Data Science – One Definition
Time-Series Forecasting

► A set of evenly spaced numerical data
► Obtained by observing the response variable at regular time periods
► Forecasts are based only on past values; no other variables are important
► Assumes that factors influencing the past and present will continue to influence the future
Time-Series Components
► Trend
► Cyclical
► Seasonal
► Random
Components of Demand
[Figure 4.1: Demand for a product or service over 4 years (time axis: years 1–4), showing the trend component, seasonal peaks, the actual demand line, average demand over 4 years, and random variation.]
Trend Component
► Persistent, overall upward or downward pattern
► Changes due to population, technology, age, culture, etc.
► Typically several years in duration
Seasonal Component
► Regular pattern of up-and-down fluctuations
► Due to weather, customs, etc.
► Occurs within a single year

PERIOD LENGTH   "SEASON" LENGTH   NUMBER OF "SEASONS" IN PATTERN
Week            Day               7
Month           Week              4 – 4.5
Month           Day               28 – 31
Year            Quarter           4
Year            Month             12
Year            Week              52
Cyclical Component
► Repeating up and down movements
► Affected by business cycle, political, and economic factors
► Multiple years in duration
► Often causal or associative relationships
Random Component
► Erratic, unsystematic, 'residual' fluctuations
► Due to random variation or unforeseen events
► Short duration and nonrepeating
Naive Approach
► Assumes demand in the next period is the same as demand in the most recent period
► e.g., If January sales were 68, then February sales will be 68
► Sometimes cost-effective and efficient
► Can be a good starting point
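The naive approach can be sketched in a few lines; the monthly sales figures below are the hypothetical example from the slide (January sales of 68):

```python
# Naive forecast: next period's forecast equals the most recent observed demand.
def naive_forecast(series):
    """Return the naive forecast for the next period."""
    if not series:
        raise ValueError("need at least one observation")
    return series[-1]

sales = [70, 72, 68]                 # hypothetical monthly sales, ending with January = 68
february_forecast = naive_forecast(sales)
print(february_forecast)             # 68
```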
Moving Average Method
► MA is a series of arithmetic means
► Used if there is little or no trend
► Often used for smoothing
► Provides an overall impression of the data over time
Moving Average Example
MONTH       ACTUAL SHED SALES   3-MONTH MOVING AVERAGE
January     10
February    12
March       13
April       16                  (10 + 12 + 13)/3 = 11 2/3
May         19                  (12 + 13 + 16)/3 = 13 2/3
June        23                  (13 + 16 + 19)/3 = 16
July        26                  (16 + 19 + 23)/3 = 19 1/3
August      30                  (19 + 23 + 26)/3 = 22 2/3
September   28                  (23 + 26 + 30)/3 = 26 1/3
October     18                  (26 + 30 + 28)/3 = 28
November    16                  (30 + 28 + 18)/3 = 25 1/3
December    14                  (28 + 18 + 16)/3 = 20 2/3
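The table above can be reproduced with a short sketch; the list of shed sales is taken from the example:

```python
# 3-month simple moving average forecast for the shed-sales example.
def moving_average(series, n=3):
    """Forecast for each period as the mean of the previous n observations.
    The last entry is the forecast for the period after the data ends."""
    return [sum(series[i - n:i]) / n for i in range(n, len(series) + 1)]

sales = [10, 12, 13, 16, 19, 23, 26, 30, 28, 18, 16, 14]   # Jan..Dec
forecasts = moving_average(sales, n=3)
# forecasts[0] is the April forecast: (10 + 12 + 13)/3 = 11 2/3
```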
Weighted Moving Average
► Used when some trend might be present
► Older data is usually less important
► Weights are based on experience and intuition

Weighted moving average = Σ[(weight for period n) x (demand in period n)] / Σ(weights)
Weighted Moving Average
MONTH       ACTUAL SHED SALES   3-MONTH WEIGHTED MOVING AVERAGE
January     10
February    12
March       13
April       16                  [(3 x 13) + (2 x 12) + (10)]/6 = 12 1/6
May         19
June        23
July        26
August      30
September   28
October     18
November    16
December    14

WEIGHTS APPLIED   PERIOD
3                 Last month
2                 Two months ago
1                 Three months ago
6                 Sum of the weights

Forecast for this month =
[3 x Sales last mo. + 2 x Sales 2 mos. ago + 1 x Sales 3 mos. ago] / (Sum of the weights)
Weighted Moving Average
MONTH       ACTUAL SHED SALES   3-MONTH WEIGHTED MOVING AVERAGE
January     10
February    12
March       13
April       16                  [(3 x 13) + (2 x 12) + (10)]/6 = 12 1/6
May         19                  [(3 x 16) + (2 x 13) + (12)]/6 = 14 1/3
June        23                  [(3 x 19) + (2 x 16) + (13)]/6 = 17
July        26                  [(3 x 23) + (2 x 19) + (16)]/6 = 20 1/2
August      30                  [(3 x 26) + (2 x 23) + (19)]/6 = 23 5/6
September   28                  [(3 x 30) + (2 x 26) + (23)]/6 = 27 1/2
October     18                  [(3 x 28) + (2 x 30) + (26)]/6 = 28 1/3
November    16                  [(3 x 18) + (2 x 28) + (30)]/6 = 23 1/3
December    14                  [(3 x 16) + (2 x 18) + (28)]/6 = 18 2/3
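A minimal sketch of the weighted version, using the same shed-sales series and the weights 3/2/1 from the example:

```python
# 3-month weighted moving average: weight 3 on last month, 2 and 1 on earlier months.
def weighted_moving_average(series, weights=(1, 2, 3)):
    """weights are ordered oldest-to-newest; forecast = sum(w*x) / sum(w)."""
    n = len(weights)
    return [
        sum(w * x for w, x in zip(weights, series[i - n:i])) / sum(weights)
        for i in range(n, len(series) + 1)
    ]

sales = [10, 12, 13, 16, 19, 23, 26, 30, 28, 18, 16, 14]   # Jan..Dec
forecasts = weighted_moving_average(sales)
# forecasts[0] is the April forecast: [(3 x 13) + (2 x 12) + (1 x 10)]/6 = 12 1/6
```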
Potential Problems With Moving Average
► Increasing n smooths the forecast but makes it less sensitive to changes
► Does not forecast trends well
► Requires extensive historical data
Graph of Moving Averages
[Figure 4.2: Sales demand (5–30) by month (J–D), plotting actual sales, the moving average, and the weighted moving average.]
© 2014 Pearson Education, Inc.


Common Measures of Error

Mean Absolute Deviation (MAD):
MAD = Σ|Actual - Forecast| / n

Mean Squared Error (MSE):
MSE = Σ(Forecast errors)² / n

Mean Absolute Percent Error (MAPE):
MAPE = (Σ 100 x |Actual_i - Forecast_i| / Actual_i) / n


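The three error measures above can be sketched directly from their definitions (the demand and forecast numbers below are made up purely for illustration):

```python
# Forecast-error measures: MAD, MSE, and MAPE, computed from their definitions.
def mad(actual, forecast):
    """Mean Absolute Deviation."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mse(actual, forecast):
    """Mean Squared Error."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    """Mean Absolute Percent Error (actual values must be nonzero)."""
    return 100 * sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)

actual   = [100, 200]    # hypothetical demand
forecast = [110, 180]    # hypothetical forecasts
print(mad(actual, forecast))    # 15.0
print(mse(actual, forecast))    # 250.0
print(mape(actual, forecast))   # 10.0
```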
Correlation
► How strong is the linear relationship
between the variables?
► Correlation does not necessarily imply
causality!
► Coefficient of correlation, r, measures
degree of association
► Values range from -1 to +1
Correlation Coefficient

r = [nΣxy - (Σx)(Σy)] / √{[nΣx² - (Σx)²][nΣy² - (Σy)²]}
[Figure 4.10: Scatter plots of y versus x showing (a) perfect negative correlation, (b) negative correlation, (c) no correlation, (d) positive correlation, and (e) perfect positive correlation. Correlation coefficient values range from -1.0 to +1.0, from high through moderate to low association as |r| moves from 1 toward 0.]
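A minimal sketch of computing r from its definition (equivalent to the formula above); the two toy data sets illustrate the extreme cases from Figure 4.10:

```python
import math

# Pearson correlation coefficient r, computed from mean-centered sums.
def correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # perfect positive: 1.0
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))   # perfect negative: -1.0
```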
Regression
[Figure: scatter plot of the response variable (y) against the predictor variable (x).]

• Regression attempts to explain the variability in the dependent (target) variable in terms of the variability in independent (predictor) variables.
• If the independent variable(s) sufficiently explain the variability in the dependent variable, then the model can be used for prediction.
Examples of Regression Tasks

Task: Forecasting the monthly sales
  Independent variables (predictor), x: historical monthly sales and other predictor variables (inventory, etc.)
  Target/response variable, y: monthly sales at time t

Task: Predicting power consumption at data centers
  Independent variables (predictor), x: sensor measurements of temperature, fan speed, etc.
  Target/response variable, y: expected power consumed

Task: Predicting crime rate
  Independent variables (predictor), x: statistics about housing, population, job/income, education, etc.
  Target/response variable, y: crime rate in a given city or region
Problem Definition
Given:
• A training set {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, where each x_i corresponds to a set of predictor variables and y_i is the corresponding value of the target variable

Task:
• Learn a target function f(x; w) to predict the value of y for any given input x
• w denotes the model parameters
Examples of Regression Models
• Linear models
• Multiple linear regression
• Ridge regression

• Nonlinear models
• Neural networks
• Kernel ridge regression
• Support vector regression
• Locally weighted regression
• Regression trees
Multiple Linear Regression
• Assume the target function is linear:
  f(x; w) = w_0 + w_1 x_1 + … + w_d x_d = wᵀx

• Estimation: find w that minimizes the residual sum of squares:
  RSS(w) = Σ_i (y_i - wᵀx_i)²

• Prediction: given a test point x*, predict ŷ = wᵀx*
Implementation
• Suppose you have:
• Xtrain: N x d matrix of predictor variables (training set)
• Ytrain: N x 1 vector of response variables (training set)
• Xtest: M x d matrix of predictor variables (test set)
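Under this setup, fitting and predicting can be sketched with NumPy's least-squares solver (the tiny `Xtrain`/`Ytrain` arrays below are hypothetical stand-ins for the matrices described above; a column of ones is appended for the intercept):

```python
import numpy as np

# Least-squares fit: w = argmin ||X w - y||^2, solved with np.linalg.lstsq.
def fit(Xtrain, Ytrain):
    X = np.hstack([Xtrain, np.ones((Xtrain.shape[0], 1))])  # append intercept column
    w, *_ = np.linalg.lstsq(X, Ytrain, rcond=None)
    return w

def predict(Xtest, w):
    X = np.hstack([Xtest, np.ones((Xtest.shape[0], 1))])
    return X @ w

# Tiny illustration: points on the line y = 2x + 1 are recovered exactly.
Xtrain = np.array([[0.0], [1.0], [2.0]])   # N x d with N = 3, d = 1
Ytrain = np.array([1.0, 3.0, 5.0])
w = fit(Xtrain, Ytrain)
print(predict(np.array([[3.0]]), w))       # approximately [7.]
```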
Example
• 1-dimensional data:
• Generated 100 data points (use 20% for training)
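A hypothetical re-creation of that 1-D example (the true line y = 3x + 2, the noise level, and the random seed are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 noisy points on an assumed true line y = 3x + 2; use the first 20 (20%)
# for training and hold out the remaining 80 for testing.
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=100)

x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

# Fit slope and intercept by least squares on the training split only.
slope, intercept = np.polyfit(x_train, y_train, deg=1)
test_rmse = np.sqrt(np.mean((slope * x_test + intercept - y_test) ** 2))
print(slope, intercept, test_rmse)
```

With only 20% of the data used for fitting, the recovered slope and intercept should still land close to the assumed true values, and the held-out RMSE stays near the noise level.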
Multiple-Regression Analysis
If more than one independent variable is to be used in the model, linear regression can be extended to multiple regression to accommodate several independent variables.

Computationally, this is quite complex and is generally done on a computer.
Multiple-Regression Analysis
In the Nodel example, including interest rates in the model gives the new equation:

ŷ = 1.80 + .30x_1 - 5.0x_2

An improved correlation coefficient of r = .96 suggests this model does a better job of predicting the change in construction sales.

Sales = 1.80 + .30(6) - 5.0(.12) = 3.00
Sales = $3,000,000
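The arithmetic on this slide, checked in a couple of lines (x_1 = 6 and x_2 = .12 as given; sales are in $ millions):

```python
# Nodel example: evaluate Sales = 1.80 + .30 x1 - 5.0 x2 at x1 = 6, x2 = .12.
sales = 1.80 + 0.30 * 6 - 5.0 * 0.12
print(sales)                      # 3.0, i.e. $3,000,000
```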
Tutorial
• https://www.kaggle.com/code/ryanholbrook/linear-regression-with-time-series