Lec10 - Big Data, Forecasting and Linear Regression
ASKA: Australian Square Kilometer Array
LHC: Large Hadron Collider
CERN: European Organization for Nuclear Research
DB size = 50 billion sites
Big Data Overview
[Figure: online activity in 1 minute. Labels recovered from the graphic: Instagram, photos, AppStore, 50K Apps, amazon, sales, shares. Other values recovered: 220K, $80K, 2.5 million.]
The 5 V's of big data: Volume, Velocity, Variety, Variability, Veracity
What is Data Science
• Platforms and big data alone are not enough; we need “Data Science” to fill the gap between platforms and data.
• Data science employs techniques and theories drawn from many fields, such as statistics and machine learning, to extract knowledge and insights from big data by leveraging big data platforms.
“Data Science”: an Emerging Field
Components of Demand
► Trend
► Seasonal
► Cyclical
► Random
[Figure 4.1: Demand for product or service plotted against time (years 1 to 4), showing the trend component, seasonal peaks, the actual demand line, average demand over 4 years, and random variation.]
Trend Component
► Persistent, overall upward or downward pattern
► Changes due to population, technology,
Random Component
► Erratic, unsystematic, ‘residual’ fluctuations
► Due to random variation or unforeseen events
► Short duration and nonrepeating
Naive Approach
► Assumes demand in the next period is the same as demand in the most recent period
► e.g., if January sales were 68, then February sales will be forecast as 68
► Sometimes cost-effective and efficient
► Can be a good starting point
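The naive approach can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name `naive_forecast` is hypothetical.

```python
def naive_forecast(history):
    """Forecast the next period as the most recent observed demand."""
    if not history:
        raise ValueError("need at least one observation")
    return history[-1]


# Slide example: January sales were 68, so the February forecast is 68.
january_sales = 68
print(naive_forecast([january_sales]))  # 68
```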
Moving Average Method
► MA is a series of arithmetic means
► Used if little or no trend
► Used often for smoothing
► Provides overall impression of data over time
Moving Average Example
MONTH       ACTUAL SHED SALES   3-MONTH MOVING AVERAGE
January     10
February    12
March       13
April       16                  (10 + 12 + 13)/3 = 11 2/3
May         19                  (12 + 13 + 16)/3 = 13 2/3
June        23                  (13 + 16 + 19)/3 = 16
July        26                  (16 + 19 + 23)/3 = 19 1/3
August      30                  (19 + 23 + 26)/3 = 22 2/3
September   28                  (23 + 26 + 30)/3 = 26 1/3
October     18                  (26 + 30 + 28)/3 = 28
November    16                  (30 + 28 + 18)/3 = 25 1/3
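The 3-month moving average forecasts in this example can be reproduced with a short Python sketch. The function name `moving_average_forecasts` is an illustrative assumption; the data are the January to November shed sales from the example.

```python
def moving_average_forecasts(sales, n=3):
    """Forecast each period as the mean of the n preceding actual values.

    Returns a list aligned with `sales`; the first n entries are None
    because there is not yet enough history to average.
    """
    forecasts = [None] * min(n, len(sales))
    for t in range(n, len(sales)):
        forecasts.append(sum(sales[t - n:t]) / n)
    return forecasts


# Actual shed sales, January through November.
shed_sales = [10, 12, 13, 16, 19, 23, 26, 30, 28, 18, 16]
forecasts = moving_average_forecasts(shed_sales, n=3)

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov"]
for month, actual, forecast in zip(months, shed_sales, forecasts):
    shown = "" if forecast is None else f"{forecast:.2f}"
    print(f"{month}  {actual:>3}  {shown}")
```

The April entry is (10 + 12 + 13)/3 and the October entry is (26 + 30 + 28)/3 = 28, matching the hand calculation.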
Weighted Moving Average
WEIGHTS APPLIED   PERIOD
3                 Last month
2                 Two months ago
1                 Three months ago
6                 Sum of the weights

Forecast for this month = [(3 x Sales last mo.) + (2 x Sales 2 mos. ago) + (1 x Sales 3 mos. ago)] / Sum of the weights

e.g., April forecast = [(3 x 13) + (2 x 12) + (10)]/6 = 12 1/6
Weighted Moving Average
MONTH       ACTUAL SHED SALES   3-MONTH WEIGHTED MOVING AVERAGE
January     10
February    12
March       13
April       16                  [(3 x 13) + (2 x 12) + (10)]/6 = 12 1/6
May         19                  [(3 x 16) + (2 x 13) + (12)]/6 = 14 1/3
June        23                  [(3 x 19) + (2 x 16) + (13)]/6 = 17
July        26                  [(3 x 23) + (2 x 19) + (16)]/6 = 20 1/2
August      30                  [(3 x 26) + (2 x 23) + (19)]/6 = 23 5/6
September   28                  [(3 x 30) + (2 x 26) + (23)]/6 = 27 1/2
October     18                  [(3 x 28) + (2 x 30) + (26)]/6 = 28 1/3
November    16                  [(3 x 18) + (2 x 28) + (30)]/6 = 23 1/3
December    14                  [(3 x 16) + (2 x 18) + (28)]/6 = 18 2/3
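A short Python sketch of the weighted moving average, using the weights 3, 2, 1 and the shed sales from the example. The function name and the convention that the first weight applies to the most recent month are assumptions made for this illustration.

```python
def weighted_moving_average_forecasts(sales, weights=(3, 2, 1)):
    """Forecast each period as a weighted mean of the preceding values.

    weights[0] applies to the most recent period, weights[1] to the
    period before that, and so on. The first len(weights) entries of
    the result are None because there is not enough history.
    """
    n = len(weights)
    total = sum(weights)
    forecasts = [None] * min(n, len(sales))
    for t in range(n, len(sales)):
        most_recent_first = sales[t - n:t][::-1]
        weighted_sum = sum(w * s for w, s in zip(weights, most_recent_first))
        forecasts.append(weighted_sum / total)
    return forecasts


# Actual shed sales, January through December.
shed_sales = [10, 12, 13, 16, 19, 23, 26, 30, 28, 18, 16, 14]
forecasts = weighted_moving_average_forecasts(shed_sales)
print(forecasts[5])  # June forecast: 17.0
```

Because the largest weight sits on the most recent month, the forecast reacts faster to recent changes than the unweighted 3-month average does.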
Potential Problems With Moving Average
► Increasing n smooths the forecast but makes it less sensitive to changes
► Does not forecast trends well
[Figure 4.2: Sales demand by month (J to D), comparing actual sales with the moving average.]
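To illustrate why a moving average lags behind a trend, and why a larger n lags more, here is a small Python sketch. The demand series is hypothetical (a perfectly linear trend rising by 2 units per period), and the helper names are illustrative only.

```python
def moving_average_forecast(history, n):
    """Mean of the last n observations in `history`."""
    return sum(history[-n:]) / n


def forecast_errors(series, n):
    """Actual minus forecast for every period with enough history."""
    return [
        series[t] - moving_average_forecast(series[:t], n)
        for t in range(n, len(series))
    ]


# Hypothetical demand with a steady upward trend: 10, 12, 14, ..., 32.
trend = list(range(10, 34, 2))

errors_3 = forecast_errors(trend, 3)
errors_5 = forecast_errors(trend, 5)
print(errors_3[0], errors_5[0])  # 4.0 6.0
```

On this series the 3-period forecast is always 4 units too low and the 5-period forecast is always 6 units too low: the forecast never catches up with the trend, and the smoother (larger n) average falls further behind.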
[Figure: Scatter plots of y against x for correlation coefficient values from –1.0 to 1.0: (a) perfect negative correlation, (b) negative correlation, (c) no correlation, (d) positive correlation, (e) perfect positive correlation.]
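The slides do not state which correlation coefficient is meant. Assuming the usual Pearson coefficient r, a minimal Python sketch of the calculation is shown below. The function name and the sample data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))   # perfect positive: 1.0
print(pearson_r(xs, [10, 8, 6, 4, 2]))   # perfect negative: -1.0
print(pearson_r(xs, [2, 1, 3, 1, 2]))    # no linear relationship: about 0
```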
Regression
[Figure: regression plot with the response variable (y) on the vertical axis.]
• Task
  • Learn a target function f(x; w) to predict the value of y for any given input x
  • w denotes the model parameters
Examples of Regression Models
• Linear models
  • Multiple linear regression
  • Ridge regression
• Nonlinear models
  • Neural networks
  • Kernel ridge regression
  • Support vector regression
  • Locally weighted regression
  • Regression trees
Multiple Linear Regression
• Assume the target function is linear
• Example with 1-dimensional data: 100 data points were generated, with 20% used for training
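The slides do not give the data-generating process for this example, so the following Python sketch makes its own assumptions: a hypothetical true line y = 1 + 2x with Gaussian noise, 100 generated points, and the first 20 (20%) used for training. It fits the line by ordinary least squares using the closed-form solution for one input variable.

```python
import random


def fit_line(xs, ys):
    """Least-squares estimates (w0, w1) for the model y = w0 + w1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1


# Hypothetical ground truth, chosen only for this illustration.
TRUE_W0, TRUE_W1 = 1.0, 2.0
rng = random.Random(0)  # fixed seed so the run is repeatable

xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [TRUE_W0 + TRUE_W1 * x + rng.gauss(0, 1) for x in xs]

# 20% of the points for training, the remaining 80% for evaluation.
train_x, train_y = xs[:20], ys[:20]
test_x, test_y = xs[20:], ys[20:]

w0, w1 = fit_line(train_x, train_y)
test_mse = sum((y - (w0 + w1 * x)) ** 2
               for x, y in zip(test_x, test_y)) / len(test_x)
print(f"w0={w0:.2f}, w1={w1:.2f}, test MSE={test_mse:.2f}")
```

Evaluating on the held-out 80% gives a rough check that the fitted parameters generalize beyond the training points.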
Multiple-Regression Analysis
If more than one independent variable is to be used in the model, linear regression can be extended to multiple regression to accommodate several independent variables.
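In the f(x; w) notation used earlier, the extension can be written as follows. The symbols are assumptions for illustration: k is the number of independent variables, x_1, ..., x_k are their values, w_0 is the intercept, and w_1, ..., w_k are the slope coefficients.

```latex
% Simple linear regression (one independent variable):
f(x; \mathbf{w}) = w_0 + w_1 x

% Multiple linear regression (k independent variables):
f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_k x_k
                          = w_0 + \sum_{j=1}^{k} w_j x_j
```

With k = 1 the second form reduces to the first, so simple linear regression is the special case with a single independent variable.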