Class Notes


What are customer delights and customer need satisfaction?

Linear model: the change in output is proportional to the change in input; the input and output variables move along the same line.
Non-linear model: the change in output is NOT proportional to the change in input.
SOV (Share of Voice): a brand's share of advertising relative to the entire industry.
Time series decomposition: the trend is identified and predicted over time after separating the following effects:
o Trend
o Cyclical
o Seasonal
o Intervention
o Irregular
Putting people into homogeneous buckets: cluster analysis.
BA can involve the following roles: data analyst, user analyst, tool developer, journalist.
KPI (Key Performance Indicator): identify what drives the business; those drivers give us the KPIs. A KPI should have a good ROI and be sustainable and measurable.
Gross Rating Points (GRPs): a measurement of eyeballs (audience exposure).
Copy wear-out chart: how long the message stays in the audience's mind.

SAS practical:
TABLE: tabulate the field. File names = tables; field names = variables. A frequency report gives you a Top-N analysis. A frequency distribution is always done to produce an ordering.

Procedures and options: CONTENTS, FREQ, MEANS, FIRSTOBS

Regression: Sales = f(base sales + TV + promotion + print + price + season + trend). dS/dA: the change in sales with respect to advertising. Sales = alpha + b1.x1 + b2.x2 + ... + bn.xn. The coefficients b1..bn are called the IMPORTANCE, WEIGHT, or SLOPE.
------------------------------------------------------------------------
Bank default = f(-interest rate + reason - economy - salary - amount + collateral)

Intercept = base. Standard error = the deviation from the mean; the error of the estimate. Pr > |t| is the probability that the coefficient is wrong; keep a variable only if this probability is below .05, otherwise remove it. Except the intercept and GRPs, the other variables have to go, as they are more error prone: for Others spend the standard error is large relative to the parameter estimate (on the order of 8890 vs. 1000). R-square / adjusted R-square: GRPs, intercept, MG spend, and Other spends together explain only 11.08% of New Users; this is the significance of adjusted R². Adjusted R² adjusts for the degrees of freedom, so it takes care of any wrong variable being put into the equation. More than 60% indicates a good model.
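A minimal sketch of the fitting and R² arithmetic described above, in pure Python. The spend and user figures here are made-up illustrative numbers (they only loosely echo the note's scale), not the actual course data:

```python
# Hypothetical data: TV spend (x) and users (y); values are illustrative only
x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [4265.0, 4350.0, 4420.0, 4495.0, 4575.0]

n = len(x)
mx = sum(x) / n
my = sum(y) / n

# Ordinary least-squares slope and intercept for y = b0 + b1*x
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
b1 = sxy / sxx            # slope: the "weight" attached to spend
b0 = my - b1 * mx         # intercept: the "base" level

# R-squared: share of the variance in y explained by the fitted line
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
sst = sum((b - my) ** 2 for b in y)
r2 = 1 - sse / sst
```

With these toy numbers the fit comes out close to the note's intercept-plus-slope form (base around 4191.5, slope 7.65), and r2 is high because the data were constructed to be nearly linear.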

Model for Current_Users_ Under 5%

This is the equation generated for us from the above figures: Current Users = 4189.52 + 0.76462 * TV Spend

data XYZ.MMM1;        /* destination file */
    set XYZ.MMM1;     /* source file */
    PRED_CURRENTUSER = 4189.52021 + TV_Spends * 0.76462;   /* prediction equation */
run;

SAS code. The most important step is to assign the library of the file:

LIBNAME XYZ 'C:\XyzAnalytics\ANALYTICS\';
PROC CONTENTS DATA=XYZ.MARKS; RUN;
PROC PRINT DATA=XYZ.MARKS; VAR name state gender; RUN;
PROC FREQ DATA=XYZ.MARKS ORDER=FREQ; TABLE STATE; RUN;
PROC MEANS DATA=XYZ.MARKS MIN MEAN MEDIAN MAX SUM N STD; VAR MARK1; RUN;
PROC FREQ DATA=XYZ.MARKS(FIRSTOBS=10); TABLE STATE; RUN;
PROC REG DATA=XYZ.MMM1; MODEL New_Users__ = GRP_s Print_Spends MG_Spends Others__Spends; RUN; QUIT;
PROC REG DATA=XYZ.MMM1; MODEL New_Users__ = GRP_s MG_Spends Others__Spends; RUN; QUIT;
PROC REG DATA=XYZ.MMM1; MODEL New_Users__ = GRP_s; RUN; QUIT;
PROC PRINT DATA=XYZ.MMM1; RUN;
PROC REG DATA=XYZ.MMM1; MODEL Current_Users_ = GRP_s SOV TV_Spends Print_Spends MG_Spends Others__Spends; RUN; QUIT;
PROC REG DATA=XYZ.MMM1; MODEL Current_Users_ = GRP_s SOV TV_Spends Print_Spends MG_Spends; RUN; QUIT;
PROC REG DATA=XYZ.MMM1; MODEL Current_Users_ = SOV TV_Spends; RUN; QUIT;
PROC REG DATA=XYZ.MMM1; MODEL Current_Users_ = TV_Spends; RUN; QUIT;
DATA XYZ.MMM1; SET XYZ.MMM1; PRED_CURRENTUSER = 4189.52021 + TV_Spends * 0.76462; RUN;
PROC GPLOT DATA=XYZ.MMM1;
  SYMBOL1 I=JOIN COLOR=BLUE;
  SYMBOL2 I=JOIN COLOR=RED;
  PLOT Current_Users_ * MONTHS = 1;
  PLOT PRED_CURRENTUSER * MONTHS = 2;
RUN;
PROC GPLOT DATA=XYZ.MMM1;
  SYMBOL1 I=JOIN COLOR=BLUE;
  SYMBOL2 I=JOIN COLOR=RED;
  PLOT Current_Users_ * MONTHS = 1;
  PLOT PRED_CURRENTUSER * MONTHS = 2 / OVERLAY;
RUN;
PROC GPLOT DATA=XYZ.MMM1;
  SYMBOL1 I=JOIN COLOR=BLUE;
  SYMBOL2 I=JOIN COLOR=RED;
  PLOT (Current_Users_ PRED_CURRENTUSER) * MONTHS / OVERLAY;
RUN;

DAY 2
Advertising has two properties: diminishing returns, and a persistence effect (so scheduling of ads is necessary). An ad has some effect after it has been displayed. Sales and advertising have a non-linear relationship, so the change in sales for a unit change in advertising decreases as spend goes on. The point where the marginal return flattens out is the saturation point, and advertisers are interested in knowing it.

[Figure: sales (y-axis) vs. advertising (x-axis), showing diminishing returns]

LAMBDA: the point of saturation; also the retention rate. A power transformation is done to transform a linear equation into a non-linear one; in simple terms, this is raising to a power. If the power is > 1, the curve rises and bends upward. If the power is < 1, the curve rises but bends downward. A power of .9 or .8 rises nearly as fast as 1, so the relationship is effectively linear. Raising to a power of .3 or .4 makes the relationship non-linear. Threshold: sales don't take off compared to the advertising; nothing happens when you first advertise, so the equation looks like X^2 or X^3.
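A tiny numeric illustration of the power-transformation claim above: a power near 1 behaves almost linearly, while a small power produces strong diminishing returns. The spend values are arbitrary:

```python
# Power transformation: response = spend ** p
# p close to 1 -> nearly linear; p well below 1 -> strong diminishing returns
spend = [1, 2, 4, 8]

nearly_linear = [s ** 0.9 for s in spend]
diminishing = [s ** 0.3 for s in spend]

# Multiplicative gain from doubling spend (last step of the series)
gain_linearish = nearly_linear[-1] / nearly_linear[-2]   # 2 ** 0.9
gain_diminish = diminishing[-1] / diminishing[-2]        # 2 ** 0.3
```

Doubling spend multiplies the response by about 1.87 at p = 0.9 but only about 1.23 at p = 0.3, which is the "curved downward" shape the notes describe.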

Persistence: memorability. AdStock: GRPs with a borrowed effect from the past; the stock of the ad in people's heads about the positioning, benefits, and message. Do you remember the ad after it is gone? We convert GRPs into AdStock.

Adstock equation: A_t = (G_t)^n + LAMBDA * A_(t-1), where n is the power transformation of the GRPs. AdStock today = GRPs today + what LAMBDA borrows from yesterday. LAMBDA is the retention rate. Multicollinearity is the cause of a reversal of sign. Comparing the Regress R-square and Total R-square tells me whether the variables are relevant: if there is a difference, or a decrease in the values, then the new variable is not contributing. You have to look at the Estimate and the Approx Pr columns.
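The adstock recursion above can be sketched in a few lines. The GRP series and the retention rate lam are illustrative values, and the power transformation is omitted (n = 1) for clarity:

```python
# Adstock: A_t = G_t + lam * A_(t-1), with lam the retention rate
def adstock(grps, lam):
    stock = []
    prev = 0.0
    for g in grps:
        prev = g + lam * prev   # today's GRPs plus a decayed carry-over
        stock.append(prev)
    return stock

# A burst of 100 GRPs decays by half each period, then a 50-GRP top-up arrives
a = adstock([100, 0, 0, 50], lam=0.5)
```

The carry-over is what lets a regression credit sales in quiet weeks to earlier advertising.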

The trend variable shows that, with a 5.51% probability of error, I can say that 10 customers are going to move away from this product. LAN: this says there are 302 customers with a propensity to move away from the product, with a .11 probability of being incorrect. Now drop Trend; the result looks like this:

The AR1 row tells us that when Trend is dropped, SAS introduces a new variable, AR1. The Regress/Total R² is lower than the earlier 84%, so Trend has to stay. AR1 is the headwind: remove AR1 at any cost. Multicollinearity check:

LAN and Change are highly correlated, to the tune of 49.237, with a probability of error of 0.27%, so this is highly significant. Difference between AUTOREG and REG: AUTOREG takes care of all the variables, including missing ones (such as SAS introducing AR1), to figure out if anything is missing. AUTOREG works only on time series; the REG model is for non-time-series data. Always put Trend into any time-series model.

Maximum likelihood: the probability of

Where OLS can't work:
- Multicollinearity between two independent variables
- Non-linear transformations (OLS handles linear transformations only)
- Heteroscedasticity (the standard deviation of Y should not change over time)

Business Forecasting: John Hanke (reference text).

New topic, Classification: putting similar things into buckets; basically this is segmentation of markets.

Insurance = an extreme event + a financial gain at the end.

Types of insurance policy:
- Term: covers only risk, no investment. When the term expires, no money.
- Endowment: covers life and death (risk) + some investment in bonds. You get money at the end of the term.
- Whole life: similar to term, but after your death your family can claim the money.
- ULIP: like endowment, but the investment is in mutual funds. It has a term.

Sum assured: the financial composition. Term, premium, benefits: at the end of the term you get the benefit = sum assured + benefits.

Stock market terms: fundamental analysis; technical analysis; value investing (buying cheap); growth investing; SIP (systematic investment plan).

Value Averaging: The correlation is built on the actual returns.

Return: R_t = (P_t - P_(t-1) + D_t) / P_(t-1)
Pension planning rules: buy SIGNIFICANTLY negatively (or zero) correlated stocks.
Factor analysis: it puts the variables into groups based on correlation, not on size. Rules:
1. It gives you as many components as there are entities: 72 companies, so 72 factors.
2. All components are uncorrelated.
3. The top component gives the maximum returns.
4. Run PROC FACTOR and choose the components with eigenvalues > 1.
After running the factor analysis, the following screen appeared. We need to take the variables flagged with *. These groups of flagged variables contribute towards the cumulative %. If a variable is flagged in one component, it should not be flagged in another.
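The return formula above is simple enough to encode directly. The prices and dividend below are made-up numbers for illustration:

```python
# One-period total return: R_t = (P_t - P_(t-1) + D_t) / P_(t-1)
def period_return(p_now, p_prev, dividend=0.0):
    return (p_now - p_prev + dividend) / p_prev

# Price moves 100 -> 110 and pays a 2.0 dividend: 12% total return
r = period_return(110.0, 100.0, dividend=2.0)
```

Correlations for the factor analysis are then built on a series of such returns, not on raw prices.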

Let's redo it with FLAG=0.6, as there are many stocks falling into different baskets. To make sure each stock falls into only one basket, we raise the flag to .6.

Check for the rotated factor:

The next screen

When more than one bucket contains a stock, keep increasing the FLAG= value so that each stock lands in only one bucket. Rules:
1. There are as many components as there are entities (stocks).
2. Components with eigenvalue > 1 contribute to the success.
3. Components with eigenvalue < 1 do not contribute to the success.
4. The first component has the highest eigenvalue and contributes most to the success.
5. The second component contributes second-most to the success, but doesn't duplicate the first (unique success).
6. Every entity contributes to every basket; however, stocks that are highly correlated with a component (say > .6) are termed useful.
7. An entity can be cast into only one component by using the flag criteria.
8. An entity can't have >= 2 *'s. If such a solution is obtained, redo the analysis with a higher correlation flag.
9. The best solution is one where every entity has a * and falls into exactly one category.
10. All the factors are uncorrelated with each other.
11. The cosine of the angle between two vectors = the correlation between them.

CLUSTER ANALYSIS: used when the business wishes to put customers into buckets based on the size of business they do with it; behaviour that brings in more money. A good retail customer has the following qualities:
- Recency
- Frequency
- Monetary value
- Range of products
- Full price: he doesn't bother about discounts
- Returns
- Cash/credit
- First bill value: love at first sight. Is it increasing?
- Tenure of the relationship
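The recency/frequency/monetary qualities above are the core of RFM scoring. A minimal sketch, with entirely hypothetical customers and cut-off thresholds:

```python
# Toy RFM scoring: bucket customers by recency, frequency, and monetary value.
# The thresholds (30 days, 6 visits, 500 spend) and records are illustrative.
customers = [
    {"id": "A", "days_since_last": 5, "visits": 12, "spend": 900.0},
    {"id": "B", "days_since_last": 90, "visits": 2, "spend": 120.0},
]

def rfm_score(c):
    r = 1 if c["days_since_last"] <= 30 else 0   # recent enough?
    f = 1 if c["visits"] >= 6 else 0             # frequent enough?
    m = 1 if c["spend"] >= 500 else 0            # valuable enough?
    return r + f + m

scores = {c["id"]: rfm_score(c) for c in customers}
```

In practice a clustering procedure would replace the hand-set thresholds, but the inputs are the same behavioural variables.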

Day 3
Attribution models: regression models are a post-mortem analysis; we attribute the independent variables' weights to the dependent variable. I can still use linear regression to predict the near future. In time series we also do forecasting, but we have only the dependent variable Y over a period of time; we don't need any independent variable.

Time Series:
1. Data that is recorded chronologically.
2. The same entity is measured at different points in time.
3. Data should be equi-spaced.
4. No missing values.

Forecasting:

1. Draw the picture, the time-series graph: PROC GPLOT.
2. Look for seasonal perturbation: seasonal trends (weather-related seasons such as winter, or religion-based sales like Diwali). A period of more than one year is called a cyclical perturbation.
3. Amplitude of the perturbation.
4. Any series can have an irregular component.
5. Level (upward trend, downward trend, constant), volatility (homoscedastic, heteroscedastic), cyclical, seasonal, and random are the components of a time series.
6. We need to know: is the variation around the trend line CONSTANT over time? That is volatility. SD measures the volatility around the mean.
7. Homoscedasticity: the volatility is constant around the mean; the amplitude does not change over time.
8. Heteroscedasticity: the volatility is not constant around the mean; the amplitude changes over time.
9. Interventions affect a time series. This is where regression comes into play: the marketing manager can learn how an increase in campaign spend will affect sales. So S_(t+1) = Level + Seasonal + Cyclical + Interventions + E_t (error).
10. We are not modeling the random component.
11. Rate = change in unit time; proportion, % = change in one variable with respect to another.
12. Mean forecast: next month's price equals the mean of all past values.
   Syntax: PROC MEANS DATA=...; VAR PRICE;
   The mean forecast does not take the trend or extreme situations into account, so it is not a good measure; it can be used on stationary data.
13. Moving average: the mean of the last 5/7/20 etc. values. A 7-day moving average carries more information about the recent past, i.e., it gives more weight to recent events than a 20-day moving average does. The total weight assigned can be 100 or less than 100. If it is 100, it is called a simple moving average; if it is < 100 and the weights are the same, it is called a linearly weighted moving average. Example series:
   1 1924
   2 2065
   3 2020
   4 2092
   5 2302
14. Exponential moving average: for example S_(t+1) = 2302(80) + 2092(40) + 2020(20) + 2065(10) + 1924(10), where the brackets represent the weights. (The decay factor should be constant; the decay factor is within the bracket.) SAS will decide the decay factor.
15. Summary of models so far: mean, simple moving average, linearly weighted moving average, exponential moving average.
16. Random walk model: the next step can't be predicted by knowing the previous walk. S_(t+1) = S_t + E_t, where E_t is the error, which can't be predicted. Error is not a mistake.
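The moving-average and exponential-smoothing forecasts described above can be sketched directly on the example price series from the notes. The smoothing constant alpha = 0.5 is an arbitrary illustrative choice:

```python
# Simple vs. exponentially weighted moving average on the notes' price series
prices = [1924, 2065, 2020, 2092, 2302]

# Simple moving average of the last 3 observations
sma3 = sum(prices[-3:]) / 3

# Exponential smoothing: recent points get geometrically larger weight
def ema(series, alpha):
    value = series[0]
    for p in series[1:]:
        value = alpha * p + (1 - alpha) * value
    return value

forecast = ema(prices, alpha=0.5)
```

The EMA forecast (about 2175.8) sits above the 3-period SMA (2138.0) because the series is trending up and the EMA weights the latest jump to 2302 most heavily.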

17. Random walk + drift model: S_(t+1) = alpha + S_t + E_t. Alpha is the trend (drift) component; it can have a positive or a negative sign. If in SAS the value of alpha is > or < 0, then we have a drift.

18. Box-Jenkins method: S_t = alpha + beta * S_(t-1) + E_t (order 1, as only 1 lagged variable). This is called the autoregressive process. BETA is the weight attached to the previous day's price. For order 2:

S_(t+1) = alpha + beta_1 * S_(t-1) + beta_2 * S_(t-2) + E_t (order 2, as 2 lagged variables)

Beta_1 > Beta_2, but we are not making all weights equal or exponential. E_t is the error term. ALPHA is the average price over a number of days, or the base price. In the EXPONENTIAL model the weight factor is constant, but here it is not.

a) Moving average model:

S_(t+1) = alpha + beta_1 * E_t + beta_2 * E_(t-1) + E_(t+1)

The E's are the errors. This model tries to tell how much (in %) the errors are incorrect; the betas are the weights associated with the errors.

b) ARMA (of order 2,2, as we take 2 autoregressive terms and 2 moving-average terms: prices S_t, S_(t-1) and errors E_t, E_(t-1). The 2,2 depends on how many lagged prices and how many lagged errors you take; in this case 2 prices and 2 errors):

S_(t+1) = alpha + beta_1 * S_t + beta_2 * S_(t-1) + beta_3 * E_t + beta_4 * E_(t-1) + E_(t+1)

Assume an E_(t+1) for today; this value gets calculated tomorrow, when you observe the price. Point to note: E_(t+1) should not be a big factor.
a. Stationary: a series is said to be stationary if its mean, variance, and autocorrelation remain constant over time. No observation should be correlated with the previous one.
b. All Box-Jenkins methods work only with stationary series.
19. ARIMA: one way to make a series stationary is to take differences. A trending series can become stationary if you difference it.

Design of Experiments
Definition: one of the statistical techniques in which the measured dependent variable is a number, while the independent variables are characters (and can be numbers as well).
General Linear Model: similar to the regression model; the only difference is that the right-hand side contains character variables. In an OLS model the right-hand-side variables are numbers, such as Adv. spend and Print spend.

In a GLM the beta is not to be treated like the regression beta. Since the right side has character data, we use GLM.
Full factorial design: mail to all 27 combinations.
Fractional factorial: it is not necessary to test all 27 cells of a 3^3 matrix; we eliminate some combinations of factors. It tells us to mail only a fraction of the mailers, and which combinations are relevant. By elimination:
- It is the minimum number of treatments needed for the information.
- It tells which combinations are good to target.
- It tells us what the not-so-good combinations would have contributed towards sales.
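The full vs. fractional factorial idea can be sketched by enumeration. The factor names and levels below are hypothetical, and the "keep cells whose level indices sum to 0 mod 3" rule is one standard way to cut a 3^3 design to a regular one-third fraction:

```python
from itertools import product

# Full factorial: every combination of three factors at three levels each
levels = {"offer": ["A", "B", "C"],
          "channel": ["mail", "email", "sms"],
          "creative": ["x", "y", "z"]}

full = list(product(*levels.values()))   # 3^3 = 27 treatments

# One-third fraction: keep cells whose level indices sum to 0 modulo 3
idx = {name: {lvl: i for i, lvl in enumerate(vals)}
       for name, vals in levels.items()}
fraction = [c for c in full
            if (idx["offer"][c[0]] + idx["channel"][c[1]]
                + idx["creative"][c[2]]) % 3 == 0]
```

The fraction keeps 9 of the 27 cells, which is the "mail only a fraction of the mailers" idea: each factor level still appears equally often, so main effects remain estimable.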

Day 4
A PowerPoint presentation on whatever we have learnt to date on analytics; around 50 slides.

Techniques used in analytics:
- Attribution model: attribute sales to advertising, etc.
- Classification: cluster & factor analysis, depending on correlation.
- Design of experiments: learn through experiments, from input to output.
- Time series: forecasting with ARIMA, ARMA, exponential smoothing.
- Simulation: when you wish to mimic a real-life situation and generate data.
- Optimization.
- Data mining.

Business model questions:
- What business are they in: manufacturing (buy raw material and produce a product), service, production (like mines), governance?
- What's the competition: who are the competitors, the market place, the customer profile (type of business, B2B or B2C), and the factors which can impact the B2B or B2C business model, such as age, gender, SEC, industry, size, geography, etc.?
- Know the business/revenue model: how am I going to generate money?
- Distribution model: how does the product get distributed?
- What are the KPIs that make the business a success? KPIs are financial and operational.

What technologies do these companies use to operate?

Life insurance company P&L and balance sheet: simulate the balance sheet under conditions such as:
- Fall in interest rates
- Fall in share prices
- Increase in claims
- Fall in commodity prices
- Increase in surrendered policies
- Increase in lapsed policies
The balance-sheet parameters are going to change the balance-sheet variables. Use Monte Carlo simulation.

Optimization:
- Moving average
- Mean reversion
- Pairs trading

Markowitz's theory: a prudent investor will invest in the option that gives him better returns; between equals, he should invest in the safer one. Quicker money is better than later money. Take a chance only with a little money.

How do I get the maximum possible return at the lowest possible risk? We can use standard deviation to measure the risk. This is what mutual fund managers do on a daily basis. http://support.sas.com/documentation/cdl/en/ormpug/59679/HTML/default/viewer.htm#qpsolver _sect13.htm
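The "maximum return at lowest risk" idea rests on portfolio variance. A minimal two-asset sketch, with illustrative volatilities, weights, and correlation (not data from the course):

```python
import math

# Two-asset portfolio risk:
# variance = (w1*s1)^2 + (w2*s2)^2 + 2*w1*w2*s1*s2*rho
def portfolio_std(w1, s1, s2, rho):
    w2 = 1 - w1
    var = (w1 * s1) ** 2 + (w2 * s2) ** 2 + 2 * w1 * w2 * s1 * s2 * rho
    return math.sqrt(var)

# Two assets, each 20% volatile, strongly negatively correlated (rho = -0.8):
# the 50/50 mix is far less risky than either asset alone
mixed = portfolio_std(0.5, 0.2, 0.2, rho=-0.8)
```

This is why the pension-planning rule earlier says to buy significantly negatively (or zero) correlated stocks: negative rho drives the cross term down and with it the portfolio's standard deviation.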

********************************************************************

Revision.
Definition of BA: mathematical techniques and processes that use data emanating from business processes to give actionable insight and provide a competitive advantage to a company over others.
Players in BA:
- Buyers of analytics
- Analytics consultants
- Tool manufacturers
- Academics
- Journalists who write on BA

- Students studying BA

Difference between BI & BA:
- BI: summarizing, telling what happened in the past. Cognos, SAP, Hyperion, DataStage, Informatica.
- BA: summarizing past data + predicting the future based on it. SAS, SPSS, STATISTICA, neural networks.

Data mining: let the mining tool go through the data and give me the insights, without knowing in advance what I am going to get.
Statistical analysis: write a hypothesis and try to prove/disprove it; the analyst knows what he has to do.

Impetus for the growth of BA:
- Competitive business
- A plethora of marketing channels
- Regulatory environment
- More customer-centric customers

How businesses align to BA:
- Captive workshops: HSBC, HP
- Extensions: IT companies offering analytics, management consultants, media agencies, market research companies
- Pure-play/standalone analytics: Rainman, Mu Sigma, eServe, DecisionCraft

What you need to know for BA:
- Business domains (market models, the KPIs we need to measure)
- BA tools
- BA techniques
- Technology

Techniques. Exploratory data analysis techniques: line chart, bar chart, histogram. You can do frequency distributions, cross-tabulations, means, sums, min/max, dispersion, SD, seasonal influence, trend, correlations, covariance.

Data types:
- Nominal: no numeric order.
- Ordinal: there is an order, but the order is not numeric. For example: good, bad, ugly; designations; dates.
- Numeric: all mathematical operations apply.

Cross-sectional data: everybody's marks for today; a view of the data at one point in time. Time series: data over a series of time, in chronological order from a starting point.

Attribution model: regression. Classification of regression (General Linear Model: the relationship between X & Y, where the DELTA remains constant over the entire line):
- If Y is numeric and X is linear: ordinary regression.
- If Y is binary and X is character: logistic regression.
- If Y is numeric and X is character: GLM.

The BETAs are the weights of the independent variables. The independent variables lose their separate significance and become the dependent variable Y once multiplied by the BETAs; that's why it is called the ATTRIBUTION model.

Regression assumptions:
- X & Y are linearly related.
- The X's are not correlated with each other.
- No autocorrelation: Y should not have a relation with the previous Y, and previous X's should not have any influence on the current X.
- Homoscedasticity: the variance of the errors should be constant across time.
- After regression we predict from the model; the difference between the actual and the predicted (the ERRORS) should follow a normal distribution. The errors are truly NORMAL.

Non linear Models: normally not followed

Logistic regression: the same as a regression, except the LHS is the logit (log of odds) of the event, modeled as a linear combination of explanatory variables. The logit can be expressed in probability, odds, or hazard units.

Forecasting: how does data get created? There is a mean model: it always takes the mean value; it is the easiest model. Trend: the mean shifts up or down. Then seasonal, cyclical, moving average, and exponential smoothing of the weights.
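The logit link above maps a linear predictor to a probability. A minimal sketch; the intercept and slope are hypothetical fitted coefficients, not values from the course:

```python
import math

# Logit link: convert log-odds (a linear function of x) into a probability
def logit_to_prob(log_odds):
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical model: log-odds = -2.0 + 0.5 * x
# At x = 4 the log-odds is 0, i.e., odds of 1:1, i.e., probability 0.5
p = logit_to_prob(-2.0 + 0.5 * 4)
```

This is why the LHS can be read in probability, odds, or log-odds units: they are deterministic transformations of one another.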

Classification: factors use correlation; clusters use proximity. Design of experiments: an experiment is conducted to

Questions in the BA final: the analytics industry; definitions; BA techniques applied to a case.
