Analytics Task-1

ANALYTICS TASK

Outlook Group
TASK 1
⚫ Refer to the dataset of “Incinerator”
⚫ Find 5 problem statements from the entire
dataset.
⚫ Model the relationships.
⚫ Find the solutions to the problem
statements.
⚫ Justify the solution with concrete details.
TASK 2
⚫ Refer to the dataset “Tree Making”
⚫ Analyze the dataset and the nature of the
variables.
⚫ Find out the course of action using a
suitable decision tree.
TASK 3
⚫ Refer to the dataset “MRA Analytics”
⚫ 2 problem statements should be analysed.
⚫ Choice of software is yours.
⚫ XLMiner may be used in its trial version.
⚫ Analyze all the parameters of the model
and prepare a descriptive report.
TASK 4
⚫ Refer to the dataset “Forecasting”
⚫ Fill in the Excel sheet with valid
calculations.
⚫ Give logic behind the same.
Submission Format
⚫ Task 4 should be done in the same Excel
sheet.
⚫ Other tasks may be done in Word, PowerPoint, or
Excel.
HINTS
MULTIVARIATE ANALYSIS
⚫ UNIVARIATE
⚫ MEASURES OF CENTRAL TENDENCY
⚫ MEASURES OF DISPERSION
⚫ BIVARIATE
⚫ CROSS-TABULATION
⚫ CHI-SQUARE
⚫ ONE-WAY ANOVA
⚫ CORRELATION ANALYSIS
⚫ SIMPLE REGRESSION ANALYSIS
⚫ MULTIVARIATE
⚫ MULTIPLE VARIABLES
⚫ MULTIPLE VARIATES
WHAT TYPE OF RELATIONSHIP IS BEING EXAMINED?
[Flowchart: selecting a multivariate technique]
⚫ DEPENDENCE: number of dependent variables
⚫ Several DVs in a single relationship: MANOVA
⚫ One DV in a single relationship: measurement scale of DV
⚫ Metric DV: MRA, CJA
⚫ Non-metric DV: MDA, LRA
⚫ Multiple relationships of DVs & IVs: SEM
⚫ INTERDEPENDENCE: structure of relationship is among
⚫ Variables: EFA
⚫ Cases/respondents: CLA
⚫ Objects: MDS
APPLICATIONS
⚫ Correlation
⚫ How strongly are sales related to advertising expenditure?
⚫ Is there an association between market share and size of
the sales force?
⚫ Partial correlation
⚫ How strongly are sales related to advertising expenditure
when the effect of price is controlled?
⚫ Is there an association between market share and size of
the sales force after adjusting for the effect of sales
promotion?
⚫ Regression
⚫ Can variation in sales be explained in terms of variation in
advertising expenditures? What is the structure and form
of this relationship, and can it be modeled?
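A minimal sketch of the first two computations in Python. The sales, advertising, and price figures are made up for illustration (they are not from the task datasets): the simple correlation is computed directly, and the partial correlation of sales with advertising, controlling for price, is obtained by correlating the residuals of each variable after regressing it on price.

```python
import numpy as np

# Illustrative data only (not from the task datasets)
sales = np.array([120, 135, 150, 160, 155, 170, 185, 190], dtype=float)
adv   = np.array([10, 12, 15, 16, 15, 18, 20, 21], dtype=float)      # advertising expenditure
price = np.array([9.5, 9.4, 9.0, 8.8, 9.1, 8.5, 8.2, 8.0])

# Simple correlation between sales and advertising
r_sales_adv = np.corrcoef(sales, adv)[0, 1]

# Partial correlation of sales and advertising, controlling for price:
# correlate the residuals after removing the linear effect of price
def residuals(y, x):
    b, a = np.polyfit(x, y, 1)          # slope, intercept
    return y - (a + b * x)

r_partial = np.corrcoef(residuals(sales, price), residuals(adv, price))[0, 1]

print(f"r(sales, adv)         = {r_sales_adv:.3f}")
print(f"r(sales, adv | price) = {r_partial:.3f}")
```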
REGRESSION ANALYSIS
Regression analysis examines associative relationships
between a metric dependent variable and one or more
independent variables in the following ways:
⚫ Determine whether the independent variables explain a
significant variation in the dependent variable: whether
a relationship exists.
⚫ Determine how much of the variation in the dependent
variable can be explained by the independent variables:
strength of the relationship.
⚫ Determine the structure or form of the relationship: the
mathematical equation relating the independent and
dependent variables.
⚫ Predict the values of the dependent variable.
⚫ Control for other independent variables when
evaluating the contributions of a specific variable or set
of variables.
STATISTICS ASSOCIATED WITH BIVARIATE
REGRESSION ANALYSIS
⚫ Bivariate regression model. The basic regression equation
is Yi = β0 + β1Xi + ei, where Y = dependent or criterion variable,
X = independent or predictor variable, β0 = intercept of the
line, β1 = slope of the line, and ei is the error term associated
with the i-th observation.

⚫ Coefficient of determination. The strength of association is
measured by the coefficient of determination, r². It varies
between 0 and 1 and signifies the proportion of the total
variation in Y that is accounted for by the variation in X.

⚫ Estimated or predicted value. The estimated or predicted
value of Yi is Ŷi = a + bXi, where Ŷi is the predicted value of Yi,
and a and b are estimators of β0 and β1, respectively.
BIVARIATE
REGRESSION ANALYSIS

⚫ Regression coefficient. The estimated


parameter b is usually referred to as the
non-standardized regression coefficient.
⚫ Scattergram. A scatter diagram, or
scattergram, is a plot of the values of two
variables for all the cases or observations.
Y
⚫ Standard error of estimate. This statistic,
SEE, is the standard deviation of the actual Y
values from the predicted values.
BIVARIATE
REGRESSION ANALYSIS

⚫ Standardized regression coefficient. Also
termed the beta coefficient or beta weight,
this is the slope obtained by the regression of
Y on X when the data are standardized.
⚫ Sum of squared errors. The distances of all
the points from the regression line are
squared and added together to arrive at the
sum of squared errors, Σej², which is a measure of
total error.
ANALYSIS
PLOT THE SCATTER DIAGRAM

⚫ A scatter diagram, or scattergram, is a
plot of the values of two variables for all
the cases or observations.
⚫ The most commonly used technique for
fitting a straight line to a scattergram is
the least-squares procedure. In fitting
the line, the least-squares procedure
minimizes the sum of squared errors, Σej².
STEPWISE REGRESSION
[Flowchart: stages in the regression decision process]
STAGE 1: RESEARCH PROBLEM
⚫ Select objective
⚫ Select independent & dependent variables
STAGE 2: RESEARCH DESIGN ISSUES
⚫ Obtain adequate sample size to ensure statistical power
⚫ Creating additional variables
⚫ Transformation to meet assumptions
⚫ Dummy variables
⚫ Polynomials for curvilinear relationships
⚫ Interaction terms for moderator effects
STAGE 3: ASSUMPTIONS
⚫ Normality
⚫ Linearity
⚫ Homoscedasticity
⚫ Independence of error terms
STAGE 4: EXAMINE STATISTICAL & PRACTICAL SIGNIFICANCE
⚫ Coefficient of determination
⚫ Adjusted coefficient of determination
⚫ Standard error of estimate
⚫ Statistical significance of regression coefficients
⚫ Generalizability
STAGE 5: INTERPRET THE REGRESSION VARIATE
⚫ Evaluate the prediction equation with the regression coefficients
⚫ Evaluate the relative importance of the independent variables with beta coefficients
⚫ Assess multicollinearity and its effects
STAGE 6: VALIDATE THE RESULTS
⚫ Split-sample analysis
FORMULATE THE BIVARIATE REGRESSION
MODEL

⚫ In the bivariate regression model, the general form of a
straight line is: Y = β0 + β1X
where
Y = dependent or criterion variable
X = independent or predictor variable
β0 = intercept of the line
β1 = slope of the line
⚫ The regression procedure adds an error term to
account for the probabilistic or stochastic
nature of the relationship:
Yi = β0 + β1Xi + ei
where ei is the error term associated with the i-th
observation.
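A minimal sketch of fitting this bivariate model by least squares in Python. The duration/attitude numbers are placeholders in the spirit of the plotted example (the actual values are not reproduced here); the code estimates a and b, the predicted values, r², and the standard error of estimate (SEE) exactly as defined above.

```python
import numpy as np

# Placeholder data in the spirit of the attitude-vs-duration example
duration = np.array([2, 3, 5, 6, 8, 9, 11, 12, 14, 15, 17, 18], dtype=float)  # X
attitude = np.array([3, 4, 4, 5, 6, 6, 7, 7, 8, 8, 9, 9], dtype=float)        # Y

n = len(attitude)
b, a = np.polyfit(duration, attitude, 1)      # slope b and intercept a
y_hat = a + b * duration                      # predicted values

ss_res = np.sum((attitude - y_hat) ** 2)              # sum of squared errors
ss_y = np.sum((attitude - attitude.mean()) ** 2)      # total variation in Y
r2 = 1 - ss_res / ss_y                                # coefficient of determination
see = np.sqrt(ss_res / (n - 2))                       # standard error of estimate

print(f"a = {a:.3f}, b = {b:.3f}, r^2 = {r2:.3f}, SEE = {see:.3f}")
```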
PLOT OF ATTITUDE WITH DURATION
[Scatter plot: Attitude (Y axis) vs. Duration of Residence (X axis, 2.25 to 18)]
WHICH STRAIGHT LINE IS BEST?
[Scatter plot with four candidate fitted lines (Line 1 to Line 4) over Attitude vs. Duration of Residence]


REGRESSION
[Figure: fitted line Y = β0 + β1X, showing an observed value Yj, its predicted value Ŷj, and the error ej at values X1 ... X5]
VARIATION IN BIVARIATE REGRESSION
[Figure: total variation SSy decomposed into explained variation SSreg and residual variation SSres around the fitted line]
MULTIPLE REGRESSION

⚫ The general form of the multiple regression model
is as follows:
Y = β0 + β1X1 + β2X2 + β3X3 + . . . + βkXk + e
⚫ which is estimated by the following equation:
Ŷ = a + b1X1 + b2X2 + b3X3 + . . . + bkXk

As before, the coefficient a represents the intercept,


but the b's are now the partial regression coefficients.
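A minimal sketch of estimating these partial regression coefficients by ordinary least squares with numpy; the response and the three predictors are made-up illustrative numbers, not one of the task datasets.

```python
import numpy as np

# Illustrative data: response y and three predictors X1, X2, X3
y = np.array([20.0, 23.5, 27.0, 25.0, 30.5, 33.0, 36.5, 40.0])
X = np.array([
    [2.0, 1.0, 5.0],
    [2.5, 1.2, 5.5],
    [3.0, 1.1, 6.0],
    [2.8, 1.5, 5.8],
    [3.5, 1.4, 6.5],
    [3.8, 1.8, 6.8],
    [4.2, 1.7, 7.2],
    [4.5, 2.0, 7.5],
])

# Add a column of ones so the first coefficient is the intercept a
X_design = np.column_stack([np.ones(len(y)), X])
coefs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
a, b = coefs[0], coefs[1:]          # intercept and partial regression coefficients

y_hat = X_design @ coefs            # predicted values
print("intercept a =", round(a, 3))
print("partial coefficients b1..b3 =", np.round(b, 3))
```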
STATISTICS ASSOCIATED WITH
MULTIPLE REGRESSION

⚫ Adjusted R². R², coefficient of multiple determination, is
adjusted for the number of independent variables and the
sample size to account for the diminishing returns. After the
first few variables, the additional independent variables do not
make much contribution.
⚫ Coefficient of multiple determination. The strength of
association in multiple regression is measured by the square of
the multiple correlation coefficient, R², which is also called the
coefficient of multiple determination.
ANALYSIS
STRENGTH OF ASSOCIATION

SSy = SSreg + SSres

where
SSy   = Σ (Yi − Ȳ)²,  summed over i = 1, ..., n
SSreg = Σ (Ŷi − Ȳ)²,  summed over i = 1, ..., n
SSres = Σ (Yi − Ŷi)²,  summed over i = 1, ..., n
ANALYSIS
STRENGTH OF ASSOCIATION

⚫ The strength of association is measured by the square
of the multiple correlation coefficient, R², which is also
called the coefficient of multiple determination:
R² = SSreg / SSy
⚫ R² is adjusted for the number of independent variables
and the sample size by using the following formula:
Adjusted R² = R² − k(1 − R²) / (n − k − 1)
ANALYSIS
SIGNIFICANCE TESTING

H0: R²pop = 0

⚫ This is equivalent to the following null hypothesis:
H0: β1 = β2 = β3 = . . . = βk = 0
⚫ The overall test can be conducted by using an F statistic:
F = (SSreg / k) / (SSres / (n − k − 1))
  = (R² / k) / ((1 − R²) / (n − k − 1))
which has an F distribution with k and (n − k − 1)
degrees of freedom.
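A minimal sketch of computing R², adjusted R², and the overall F statistic from these formulas. The helper function and the tiny y / y_hat example are hypothetical and for illustration only; scipy is used solely to obtain the F-distribution p-value.

```python
import numpy as np
from scipy import stats   # only for the F-distribution p-value


def overall_fit(y, y_hat, k):
    """R^2, adjusted R^2, and overall F test for a fitted regression
    with k independent variables, following the formulas above."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    ss_y = np.sum((y - y.mean()) ** 2)
    ss_res = np.sum((y - y_hat) ** 2)
    r2 = 1 - ss_res / ss_y
    adj_r2 = r2 - k * (1 - r2) / (n - k - 1)
    f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))
    p_value = 1 - stats.f.cdf(f_stat, k, n - k - 1)
    return r2, adj_r2, f_stat, p_value


# Tiny hypothetical example: actual values and predictions from a 2-predictor model
y = [20, 23, 27, 25, 30, 33]
y_hat = [21, 22, 26, 26, 31, 32]
print(overall_fit(y, y_hat, k=2))
```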
ANALYSIS
SIGNIFICANCE TESTING

⚫ Testing for the significance of the βi's can be done in a
manner similar to that in the bivariate case by using t tests.
The significance of the partial coefficient for importance
attached to weather may be tested by the following
equation:
t = b / SEb
which has a t distribution with n − k − 1 degrees
of freedom.
ANALYSIS
EXAMINATION OF RESIDUALS

⚫ A residual is the difference between the observed value of Yi
and the value predicted by the regression equation, Ŷi.
⚫ Scattergrams of the residuals, in which the residuals are
plotted against the predicted values Ŷi, time, or predictor
variables, provide useful insights in examining the
appropriateness of the underlying assumptions and
regression model fit.
⚫ The assumption of a normally distributed error term can be
examined by constructing a histogram of the residuals.
⚫ The assumption of constant variance of the error term can
be examined by plotting the residuals against the predicted
values of the dependent variable.
STEPWISE REGRESSION
The purpose of stepwise regression is to select, from a large
number of predictor variables, a small subset of variables that
account for most of the variation in the dependent or criterion
variable. In this procedure, the predictor variables enter or are
removed from the regression equation one at a time. There are
several approaches to stepwise regression.
⚫ Forward inclusion. Initially, there are no predictor variables in
the regression equation. Predictor variables are entered one at
a time, only if they meet certain criteria specified in terms of F
ratio. The order in which the variables are included is based on
the contribution to the explained variance.
⚫ Backward elimination. Initially, all the predictor variables are
included in the regression equation. Predictors are then
removed one at a time based on the F ratio for removal.
⚫ Stepwise solution. Forward inclusion is combined with the
removal of predictors that no longer meet the specified
criterion at each step.
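A minimal sketch of forward inclusion, the first stepwise approach described above: at each step the candidate predictor that most increases R² is added, and the procedure stops when no remaining predictor improves the fit by more than a chosen threshold. A simple R² gain is used here instead of a formal F-to-enter criterion, and the random data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 5
X = rng.normal(size=(n, p))                 # 5 candidate predictors
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)


def r_squared(X_sub, y):
    """R^2 of an OLS fit of y on the columns of X_sub (with intercept)."""
    Xd = np.column_stack([np.ones(len(y)), X_sub])
    coefs, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ coefs
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)


selected, remaining = [], list(range(p))
best_r2, min_gain = 0.0, 0.01               # stop when the R^2 gain is below 1%

while remaining:
    gains = {j: r_squared(X[:, selected + [j]], y) for j in remaining}
    j_best = max(gains, key=gains.get)
    if gains[j_best] - best_r2 < min_gain:
        break                               # no predictor adds enough explained variance
    selected.append(j_best)
    remaining.remove(j_best)
    best_r2 = gains[j_best]
    print(f"entered X{j_best}, R^2 = {best_r2:.3f}")
```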
MULTICOLLINEARITY

⚫ Multicollinearity arises when intercorrelations
among the predictors are very high.

⚫ Multicollinearity can result in several problems, including:


⚫ The partial regression coefficients may not be estimated
precisely. The standard errors are likely to be high.
⚫ The magnitudes, as well as the signs of the partial
regression coefficients, may change from sample to
sample.
⚫ It becomes difficult to assess the relative importance of
the independent variables in explaining the variation in
the dependent variable.
⚫ Predictor variables may be incorrectly included or
removed in stepwise regression.
RELATIVE IMPORTANCE OF PREDICTORS
Unfortunately, because the predictors are correlated, there
is no unambiguous measure of relative importance of the
predictors in regression analysis.
However, several approaches are commonly used to assess
the relative importance of predictor variables.

⚫ Statistical significance. If the partial regression coefficient
of a variable is not significant, as determined by an
incremental F test, that variable is judged to be unimportant.
An exception to this rule is made if there are strong
theoretical reasons for believing that the variable is
important.
⚫ Square of the simple correlation coefficient. This measure,
r², represents the proportion of the variation in the
dependent variable explained by the independent variable
in a simple regression.
RELATIVE IMPORTANCE OF PREDICTORS

⚫ Square of the partial correlation coefficient. This measure,
R²yxi.xjxk, is the coefficient of determination between the
dependent variable and the independent variable,
controlling for the effects of the other independent
variables.
⚫ Square of the part correlation coefficient. This coefficient
represents an increase in R 2 when a variable is entered into
a regression equation that already contains the other
independent variables.
⚫ Measures based on standardized coefficients or beta
weights. The most commonly used measures are the
absolute values of the beta weights, |Bi| , or the squared
values, Bi 2.
⚫ Stepwise regression. The order in which the predictors
enter or are removed from the regression equation is used
to infer their relative importance.
CLASSIFICATION AND REGRESSION
TREES (CART)
APPLICATIONS
⚫ Database marketing
⚫ Target customers who are more likely to
respond to a marketing campaign
⚫ Market research
⚫ Uncover the key drivers for customer
satisfaction
⚫ Credit risk scoring
⚫ Predict which customers are more likely to
TREES AND RULES
Goal: Classify or predict an outcome based on a set of
predictors
⚫ The output is a set of rules
Example:
⚫ Goal: classify a record as “will accept credit card
offer” or “will not accept”
⚫ Rule might be “IF (Income > 92.5) AND (Education
< 1.5) AND (Family <= 2.5) THEN Class = 0
(non-acceptor)”
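A minimal sketch of producing such rules with scikit-learn (used here in place of XLMiner). The tiny acceptance dataset is made up for illustration, and export_text prints the induced tree as IF/THEN-style splits.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up training data: [Income, Education, Family]; 1 = accepts the offer
X = [
    [45, 1, 2], [88, 2, 3], [120, 1, 1], [95, 3, 4],
    [60, 2, 2], [150, 1, 2], [98, 1, 2], [30, 3, 1],
]
y = [0, 0, 1, 1, 0, 1, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the splits as readable rules
print(export_text(tree, feature_names=["Income", "Education", "Family"]))
```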
KEY IDEAS
Recursive partitioning: Repeatedly split the
records into two parts so as to achieve maximum
homogeneity within the new parts

⚫ Pruning the tree: Simplify the tree by pruning
peripheral branches to avoid overfitting
RECURSIVE PARTITIONING
RECURSIVE PARTITIONING STEPS
⚫ Pick one of the predictor variables, xi
⚫ Pick a value of xi, say si, that divides the
training data into two (not necessarily
equal) portions
⚫ Measure how “pure” or homogeneous each
of the resulting portions is
(“pure” = containing records of mostly one
class)
⚫ Idea is to pick xi and si to maximize purity
EXAMPLE: RIDING MOWERS

⚫ Data: 24 households classified as owning or not
owning riding mowers
⚫ Predictors = Income, Lot Size
HOW TO SPLIT
⚫ Order records according to one variable, say lot size
⚫ Find midpoints between successive values
⚫ E.g., first midpoint is 14.4 (halfway between 14.0
and 14.8)
⚫ Divide records into those with lot size > 14.4 and
those < 14.4
⚫ After evaluating that split, try the next one, which is
15.4 (halfway between 14.8 and 16.0)
NOTE: CATEGORICAL
VARIABLES
⚫ Examine all possible ways in which the categories
can be split.
⚫ E.g. categories A, B, C can be split 3 ways
⚫ {A} and {B, C}
⚫ {B} and {A, C}
⚫ {C} and {A, B}
⚫ With many categories, # of splits becomes huge
⚫ XLMiner supports only binary categorical variables
THE FIRST SPLIT: LOT SIZE = 19,000
SECOND SPLIT: INCOME = $84,000
AFTER ALL SPLITS
MEASURING IMPURITY
GINI INDEX
⚫ Gini index for rectangle A containing m records:
I(A) = 1 − Σk pk²
where pk = proportion of cases in rectangle A that belong to class k
⚫ I(A) = 0 when all cases belong to the same class
⚫ I(A) is at a maximum when all classes are equally represented
(= 0.50 in the binary case)
⚫ Note: XLMiner uses a variant called the “delta splitting rule”
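A minimal sketch of the Gini impurity computation and of scoring a candidate split as the record-weighted average impurity of the two resulting partitions; the toy labels are illustrative only.

```python
import numpy as np


def gini(labels):
    """Gini impurity I(A) = 1 - sum(p_k^2) for the class labels in a partition."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def split_impurity(left_labels, right_labels):
    """Weighted average impurity of a candidate binary split."""
    n_left, n_right = len(left_labels), len(right_labels)
    n = n_left + n_right
    return (n_left / n) * gini(left_labels) + (n_right / n) * gini(right_labels)


# Toy example: a pure leaf vs. an evenly mixed leaf
print(gini(["owner"] * 6))                       # 0.0  (all one class)
print(gini(["owner", "nonowner"] * 3))           # 0.5  (maximum for the binary case)
print(split_impurity(["owner"] * 5 + ["nonowner"],
                     ["nonowner"] * 5 + ["owner"]))
```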
RECURSIVE PARTITIONING

⚫ Obtain overall impurity measure (weighted avg. of
individual rectangles)
⚫ At each successive stage, compare this measure
across all possible splits in all variables
⚫ Choose the split that reduces impurity the most
⚫ Chosen split points become nodes on the tree
FIRST SPLIT – THE TREE
TREE AFTER SECOND SPLIT
TREE STRUCTURE
⚫ Split points become nodes on tree (circles with split
value in center)
⚫ Rectangles represent “leaves” (terminal points, no
further splits, classification value noted)
⚫ Numbers on lines between nodes indicate # cases
⚫ Read down tree to derive rule, e.g.
⚫ If lot size < 19, and if income > 84.75, then class =
“owner”
DETERMINING LEAF NODE LABEL

⚫ Each leaf node label is determined by “voting” of the
records within it, and by the cutoff value
⚫ Records within each leaf node are from the training
data
⚫ Default cutoff = 0.5 means that the leaf node’s label is
the majority class
⚫ Cutoff = 0.75: requires a majority of 75% or more “1”
records in the leaf to label it a “1” node
TREE AFTER ALL SPLITS
THE OVERFITTING PROBLEM
STOPPING TREE GROWTH
⚫ Natural end of process is 100% purity in
each leaf
⚫ This overfits the data: the model ends up fitting
noise in the data
⚫ Overfitting leads to low predictive accuracy on
new data
⚫ Past a certain point, the error rate for the
validation data starts to increase
FULL TREE ERROR RATE
CHAID

⚫ CHAID, older than CART, uses a chi-square statistical
test to limit tree growth
⚫ Splitting stops when purity improvement is not
statistically significant
PRUNING
⚫ CART lets tree grow to full extent, then
prunes it back
⚫ Idea is to find that point at which the
validation error begins to rise
⚫ Generate successively smaller trees by
pruning leaves
⚫ At each pruning stage, multiple trees are
possible
⚫ Use cost complexity to choose the best tree at
each stage
COST COMPLEXITY
CC(T) = Err(T) + α L(T)
where
CC(T) = cost complexity of a tree
Err(T) = proportion of misclassified records
α = penalty factor attached to tree size (set by user)
L(T) = size of the tree (number of leaf nodes)

⚫ Among trees of a given size, choose the one with
lowest CC
⚫ Do this for each size of tree
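A minimal sketch of the cost-complexity comparison: for a few hypothetical pruned trees of different sizes, CC(T) = Err(T) + α·L(T) is computed and the tree with the lowest value is picked. All numbers are invented for illustration.

```python
# Hypothetical pruned trees: (number of leaf nodes, misclassification rate)
candidate_trees = {
    "full tree":    (15, 0.08),
    "pruned to 9":  (9,  0.10),
    "pruned to 5":  (5,  0.12),
    "pruned to 2":  (2,  0.21),
}

alpha = 0.01  # penalty per leaf, set by the user


def cost_complexity(n_leaves, err, alpha):
    """CC(T) = Err(T) + alpha * L(T)."""
    return err + alpha * n_leaves


scores = {name: cost_complexity(L, err, alpha)
          for name, (L, err) in candidate_trees.items()}
best = min(scores, key=scores.get)
print(scores)
print("tree with lowest cost complexity:", best)
```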
PRUNING RESULTS
⚫ This process yields a set of trees of different sizes
and associated error rates

⚫ Two trees of interest:
⚫ Minimum error tree
⚫ Has lowest error rate on validation data
⚫ Best pruned tree
⚫ Smallest tree within one std. error of min. error
⚫ This adds a bonus for simplicity/parsimony
ERROR RATES ON PRUNED TREES
REGRESSION TREES
REGRESSION TREES FOR
PREDICTION

⚫ Used with continuous outcome variable
⚫ Procedure similar to classification tree
⚫ Many splits attempted; choose the one that
minimizes impurity
DIFFERENCES FROM CT
⚫ Prediction is computed as the average of the
numerical target variable in the rectangle
(in CT it is the majority vote)
⚫ Impurity measured by sum of squared
deviations from leaf mean
⚫ Performance measured by RMSE (root
mean squared error)
ADVANTAGES OF TREES
⚫ Easy to use, understand
⚫ Produce rules that are easy to interpret &
implement
⚫ Variable selection & reduction is automatic
⚫ Do not require the assumptions of statistical models
⚫ Can work without extensive handling of missing
data
DISADVANTAGES
⚫ May not perform well where there is structure in
the data that is not well captured by horizontal or
vertical splits

⚫ Since the process deals with one variable at a time,
there is no way to capture interactions between variables
SUMMARY
⚫ Classification and Regression Trees are an easily
understandable and transparent method for
predicting or classifying new records
⚫ A tree is a graphical representation of a set of rules
⚫ Trees must be pruned to avoid over-fitting of the
training data

⚫ As trees do not make any assumptions about the
underlying data, they can be applied to a wide
range of problems
FORECASTING - DEFINITION
⚫ A planning tool that helps management in
its attempts to cope with the uncertainty of
the future, relying mainly on data from the
past and present and analysis of trends.
⚫ The use of historic data to determine the
direction of future trends.
⚫ A skillful mix of quantitative forecasting,
good judgment and common sense.
⚫ Forecasting should not be viewed as a substitute
for prophecy but rather as the best way of
identifying and extrapolating established patterns
in order to project the future.
FORECASTING - NECESSITY
⚫ How can RBI realistically adjust interest rates
without some notion of future economic
growth and inflationary pressure?
⚫ How can operations manager realistically set
productions schedules without some
estimation of future sales?
⚫ How can a company determine staffing for its
call centers without some guess of the future
demand for service?
⚫ How can a bank make realistic plans for future
lending without some forecast of deposits and
loan demand?
FORECASTING - STEPS
⚫ Problem formulation and data collection
⚫ Problem determines the appropriate data.
⚫ Data editing and cleaning
⚫ Missing values
⚫ Data conversion
⚫ Model building and evaluation
⚫ Fitting the collected data into a forecasting
model that is appropriate in terms of
minimizing forecasting error
⚫ Model implementation (the actual forecast)
FORECASTING - TERMS
⚫ Prediction vs. Forecast: Forecasting would be a
subset of prediction. Any time you predict into
the future it is a forecast. All forecasts are
predictions, but not all predictions are
forecasts, as when you would use regression to
explain the relationship between two variables.
⚫ Ex-ante forecast: A forecast that uses
information that would have been available at
the forecast origin: it does not use actual
values of variables from later periods
(Armstrong, 2001).
FORECASTING - TERMS
⚫ Short term forecasting
⚫ Provides information for tactical decisions
⚫ Helps in preparing suitable sales policy
⚫ Seasonal patterns are of much importance
⚫ It may cover a period of three months, six
months or one year

⚫ Long term forecasting


⚫ Provides information for major strategic
decisions
⚫ Helpful in suitable capital planning
BUSINESS FORECASTING

EXPLORING DATA PATTERNS


DATA COLLECTION
⚫ A forecast cannot be more accurate than
the data on which it is based:
⚫ Reliable
⚫ Relevant
⚫ Consistent
⚫ Timely
⚫ Types of data:
⚫ Cross-sectional: Observations collected at a
single point in time
⚫ Time series: Observations collected over
successive periods of time
TIME SERIES DATA
PATTERNS
⚫ Influences choice of appropriate forecasting
method
⚫ Four general types of patterns:
⚫ Horizontal/Stationary:
⚫ Data fluctuates around a constant level or mean
⚫ Basic statistical properties, such as mean and variance,
remain constant over time
⚫ Trend/Non-stationary: The long-term
component that represents the growth or
decline in the time series over an extended period
⚫ Cyclical: Wavelike fluctuation around the trend
⚫ Seasonal: A pattern of change that repeats itself
year after year
Cost    Age
859     8
682     5
471     3
708     9
1094    11
224     2
320     1
651     8
1049    12
Year  Operating Revenue   Year  Operating Revenue   Year  Operating Revenue

1955 3307 1972 10991 1989 53794


1956 3556 1973 12306 1990 55972
1957 3601 1974 13101 1991 57242
1958 3721 1975 13639 1992 52345
1959 4036 1976 14950 1993 50838
1960 4134 1977 17224 1994 54559
1961 4268 1978 17946 1995 34925
1962 4578 1979 17514 1996 38236
1963 5093 1980 25195 1997 41296
1964 5716 1981 27357 1998 41322
1965 6357 1982 30020 1999 41071
1966 6769 1983 35883 2000 40937
1967 7296 1984 38828 2001 36151
1968 8178 1985 40715 2002 30762
1969 8844 1986 44282 2003 23253
1970 9251 1987 48440 2004 19701
1971 10006 1988 50251
Quarter Kilowatts Quarter Kilowatts Quarter Kilowatts
Jan-80 1071 Jan-84 1047 Jan-88 953
Apr-80 648 Apr-84 667 Apr-88 604
Jul-80 480 Jul-84 495 Jul-88 508
Oct-80 746 Oct-84 794 Oct-88 758
Jan-81 965 Jan-85 1068 Jan-89 1054
Apr-81 661 Apr-85 625 Apr-89 635
Jul-81 501 Jul-85 499 Jul-89 538
Oct-81 768 Oct-85 850 Oct-89 752
Jan-82 1065 Jan-86 975 Jan-90 969
Apr-82 667 Apr-86 623 Apr-90 655
Jul-82 486 Jul-86 496 Jul-90 568
Oct-82 780 Oct-86 728 Oct-90 752
Jan-83 926 Jan-87 933 Jan-91 1085
Apr-83 618 Apr-87 582 Apr-91 692
Jul-83 483 Jul-87 490 Jul-91 568
Oct-83 757 Oct-87 708 Oct-91 783
AUTOCORRELATION
ANALYSIS
⚫ Autocorrelation is the correlation between a variable
lagged one or more time periods and itself.
⚫ Measured using autocorrelation coefficient.
⚫ When a variable is measured over time, observations
in different time periods are frequently correlated.
⚫ Example 3.1: Computation of lag 1 autocorrelation
coefficient
⚫ Calculation of autocorrelation coefficient using
XLMiner

rk = [ Σ t=k+1..n (Yt − Ȳ)(Yt−k − Ȳ) ] / [ Σ t=1..n (Yt − Ȳ)² ]
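A minimal sketch of the lag-k autocorrelation coefficient from this formula, applied to the first twelve quarterly kilowatt values from the table above.

```python
import numpy as np


def autocorr(y, k):
    """Lag-k autocorrelation coefficient r_k of a time series y."""
    y = np.asarray(y, dtype=float)
    y_bar = y.mean()
    num = np.sum((y[k:] - y_bar) * (y[:-k] - y_bar))   # sum over t = k+1, ..., n
    den = np.sum((y - y_bar) ** 2)                     # sum over t = 1, ..., n
    return num / den


# First 12 quarterly kilowatt observations (Jan-80 through Oct-82)
series = [1071, 648, 480, 746, 965, 661, 501, 768, 1065, 667, 486, 780]
print("lag-1 autocorrelation:", round(autocorr(series, 1), 3))
print("lag-4 autocorrelation:", round(autocorr(series, 4), 3))
```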
AUTOCORRELATION
ANALYSIS
⚫ Random data
⚫ Successive values of a time series are not related to each other
⚫ Autocorrelation coefficient for any time lag is close to zero
⚫ Example 3.3 (Table 3-3)
⚫ Stationary data
⚫ Autocorrelation coefficients decline to zero fairly rapidly,
generally after the second or third time lag
⚫ Trend (Non-stationary) data
⚫ Successive observations are highly correlated
⚫ Autocorrelation coefficients are significantly different from
zero for the first several time lags
⚫ Gradually drop towards zero as the number of lags increases
⚫ Example 3.4 (Table 3-4)
⚫ Seasonal data
⚫ A significant autocorrelation coefficient occurs at the
seasonal time lag (e.g., lag 4 for quarterly data, lag 12
for monthly data)
CHOOSING A FORECASTING
TECHNIQUE
⚫ Stationary data
⚫ Naïve method; Simple average; Moving average
⚫ Trend
⚫ Moving averages; Holt’s linear exponential
smoothing; Exponential models; ARIMA
⚫ Seasonal data
⚫ Classical decomposition; Winter’s exponential
smoothing; ARIMA
⚫ Cyclical series
⚫ Classical decomposition; Econometric models;
Multiple regression; ARIMA
⚫ Other factors
BUSINESS FORECASTING

METHODS OF FORECASTING
MEASURING FORECAST ERROR
⚫ Mean absolute deviation (MAD)
⚫ Average size of the “miss” regardless of direction
⚫ Mean squared error (MSE)
⚫ Penalizes large forecasting errors
⚫ A technique that produces moderate errors is preferable
to one that usually has small errors but occasionally
yields extremely large ones
⚫ Root mean squared error (RMSE)
⚫ Can be more easily interpreted since it has the same unit
as the series
⚫ Mean absolute percentage error (MAPE)
⚫ Useful when error relative to the time series is important
in evaluating accuracy, and when the Yt values are large
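A minimal sketch of these four error measures computed from paired actual and forecast values; the two short series are illustrative only.

```python
import numpy as np


def forecast_errors(actual, forecast):
    """MAD, MSE, RMSE, and MAPE for paired actual/forecast values."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    e = actual - forecast
    mad = np.mean(np.abs(e))                   # mean absolute deviation
    mse = np.mean(e ** 2)                      # mean squared error
    rmse = np.sqrt(mse)                        # root mean squared error
    mape = np.mean(np.abs(e / actual)) * 100   # mean absolute percentage error
    return {"MAD": mad, "MSE": mse, "RMSE": rmse, "MAPE (%)": mape}


actual = [220, 235, 250, 260, 255, 270]
forecast = [210, 240, 245, 270, 250, 275]
print(forecast_errors(actual, forecast))
```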
FORECASTING METHODS
⚫ Naïve: Used to develop simple models
assuming that very recent data provide the
best predictors of the future
⚫ Moving averages: Generates forecast based
on an average of past observations
⚫ Smoothing: Produce forecasts by averaging
past values of a series with a decreasing
(exponential) series of weights
[Figure: timeline showing past data ..., Yt-3, Yt-2, Yt-1, Yt (“you are here”),
and the periods to be forecast Ŷt+1, Yt+2, Yt+3, ...]
Yt is the most recent observation of the variable; Ŷt+1 is the forecast for one
period in the future; et is the forecast error, the difference between the
observed and the forecast value.
FORECASTING METHODS
⚫ Steps involved in evaluating forecasting
methods:
⚫ Forecasting method is selected based on the
forecaster’s analysis of and intuition about the
nature of the data
⚫ Data set is divided into two sections: training or
fitting section and test or forecasting section
⚫ Selected forecasting technique is used to
develop fitted values for the training section
⚫ The technique is then used to forecast the test
section, and the forecast errors are computed and evaluated
FORECASTING METHODS
⚫ Naïve models
⚫ Based on the most recent information available
⚫ Assume that the recent periods are the best
predictors of the future
⚫ Technique can be adjusted to take trend into
consideration
⚫ Simple averages
⚫ Uses the mean of all relevant historical
observations as the forecast for the next period
⚫ The objective is to use past data to develop a
forecasting model for future periods
FORECASTING METHODS
⚫ Moving averages
⚫ MA of order k is the mean value of k
consecutive observations
⚫ Equal weights are assigned to each observation
⚫ Deals only with the latest k periods of known
data
⚫ Does not handle trend or seasonality very well
⚫ The smaller the order k, the larger the weight given to
recent periods
⚫ The larger the order k, the greater the smoothing
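A minimal sketch of a moving-average forecast of order k: the forecast for the next period is simply the mean of the latest k observations. The demand series is illustrative only.

```python
import numpy as np


def moving_average_forecast(y, k):
    """Forecast for the next period = mean of the latest k observations."""
    y = np.asarray(y, dtype=float)
    return y[-k:].mean()


demand = [42, 40, 43, 41, 45, 44, 47, 46]
print("MA(3) forecast for next period:", moving_average_forecast(demand, 3))
print("MA(5) forecast for next period:", moving_average_forecast(demand, 5))
```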
BUSINESS FORECASTING

METHODS OF FORECASTING
FORECASTING METHODS –
EXPONENTIAL SMOOTHING
⚫ Smoothing: Produce forecasts by
averaging past values of a series with a
decreasing (exponential) series of weights
⚫ Exponential smoothing method:
Procedure for continually revising a
forecast in the light of more recent
observations
⚫ The smoothing constant α serves as the
weighting factor
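A minimal sketch of simple exponential smoothing with smoothing constant α: each new forecast is the previous forecast revised by a fraction α of the latest forecast error. The series and the α value are illustrative only.

```python
import numpy as np


def simple_exponential_smoothing(y, alpha):
    """One-step-ahead forecasts: F[t+1] = F[t] + alpha * (Y[t] - F[t])."""
    y = np.asarray(y, dtype=float)
    forecasts = [y[0]]                       # initialize the first forecast with the first observation
    for obs in y:
        forecasts.append(forecasts[-1] + alpha * (obs - forecasts[-1]))
    return np.array(forecasts[1:])           # forecasts for periods 2 .. n+1


demand = [42, 40, 43, 41, 45, 44, 47, 46]
print(simple_exponential_smoothing(demand, alpha=0.3))
```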
DOUBLE EXPONENTIAL
SMOOTHING – HOLT’S METHOD
⚫ Adjusted for trend variation in the data
⚫ Allows evolving local linear trends to
generate forecasts
⚫ Estimate of current level and current trend
(slope) is determined
⚫ Two smoothing constants: α (Smoothing
constant for level) and β (Smoothing
constant for trend) are estimated
⚫ α and β generated by minimizing MSE
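A minimal sketch of Holt's double exponential smoothing, maintaining a level and a trend estimate with smoothing constants α and β. The initialization, the sample series, and the chosen constants are illustrative; in practice α and β would be selected by minimizing the MSE as described above.

```python
import numpy as np


def holt_forecast(y, alpha, beta, horizon=1):
    """Holt's linear exponential smoothing: returns the h-step-ahead forecast."""
    y = np.asarray(y, dtype=float)
    level, trend = y[0], y[1] - y[0]         # simple initialization of level and trend
    for obs in y[1:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + horizon * trend


sales = [120, 126, 131, 138, 142, 149, 155, 161]
print("one-period-ahead forecast:", round(holt_forecast(sales, alpha=0.4, beta=0.2), 1))
print("three-period-ahead forecast:", round(holt_forecast(sales, alpha=0.4, beta=0.2, horizon=3), 1))
```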
EXPONENTIAL SMOOTHING –
WINTER’S METHOD
⚫ Adjusted for trend and seasonal variation in
the data
⚫ Three parameters α, β, and ϒ are determined
for estimating level, trend and seasonality.
⚫ Generated by minimizing MSE
⚫ Popular technique for short-term forecasting
⚫ Low cost and simple
⚫ Based on past values which include both
random fluctuations as well as information
⚫ Assumes that extreme fluctuations represent
randomness in a series
BUSINESS FORECASTING

TIME SERIES ANALYSIS


DECOMPOSITION
⚫ Identification of the component factors
that influence each value in a series
⚫ Used for both short-term and long-term
forecasting
⚫ Decomposition primarily a tool for
understanding a time series
⚫ The four components of the time series:
⚫ Trend (T): The underlying growth or decline in
a time series
⚫ Cyclical (C): Wavelike fluctuation around the trend
⚫ Seasonal (S): A pattern of change that repeats itself
year after year
⚫ Irregular (I): Random, unsystematic fluctuations
DECOMPOSITION
⚫ Additive model
⚫ Time series values are treated as a sum of the
components
⚫ Works best when the time series being analyzed
has roughly the same variability throughout the
length of the series
⚫ Multiplicative model
⚫ Time series values are treated as a product of the
components
⚫ Works best when the variability of the time
series increases with the level
⚫ Seasonal adjustment: Removal of the seasonal
component to obtain seasonally adjusted data
TREND
⚫ Long term movements described by
straight line or smooth curve
⚫ Forces producing trend: population
change, price change, technological
change, productivity increase, PLC
⚫ Convenient to fit
⚫ Indicates general direction
⚫ Can be removed from data to obtain
seasonality
⚫ Pattern
SEASONALITY
⚫ Repeats itself year after year
⚫ Not an issue for annual data
⚫ Seasonality index numbers are percentages
that show changes over time
⚫ An index of 1.25 for a month implies the
observation for that month is expected to be
25% more than an average month
⚫ Seasonal variation can be measured using
ratio-to-moving average method
⚫ Deseasonalising the time series:
Seasonally adjusted data
