Analytics Prepbook Laterals 2019-2020
Dear Reader,
The domain of analytics is quickly gaining popularity across corporate firms worldwide. From a
subject prominently deployed in research projects in the 19th and 20th centuries, to its role in the
Cambridge Analytica scandal and its demonstrated power to turn the tide of elections around the
world, analytics has surely come a long way. Its importance has grown to such an extent that firms
which do not use its capabilities in any of their operations are bound to lose out in the long run.
Hence corporates are either building their own analytics capabilities or leveraging the services of
prominent analytics firms.
Intuition is slowly taking a backseat and every decision gets scrutinized through the lens of analytics.
As a result, managers who have the technical know-how of analytics combined with the grit to lead
capability development activities in this field are in great demand today. This demand will grow
rapidly as more and more firms push themselves to develop such capabilities. The increase in the
number of analytics profiles visiting top B-Schools is hence no mere coincidence. With this thought,
the Analytics Society aims to keep students of IIMB ahead of the curve by inculcating the motivation
to learn analytics and providing the requisite tools at all stages.
As an effort in this direction, we present to you a compendium that can help you prepare for your
placement interviews. It is a three-part booklet. The first part focuses on prominent analytics
definitions and terminologies which are must-knows for any analytics interview, along with
techniques for solving guesstimates and sample scenarios from the same interviews. The second part
discusses Summer Insights collected through years of interview experiences of IIM Bangalore
students. We would like to thank all the PGP1 students who shared their interview experiences with
us. We hope you find this useful. All the best for your interviews!
Regards,
The Analytics Society of IIM Bangalore
[Figure: the predictive analytics process cycle. Steps recoverable from the diagram:
1. Project Definition: describing and recording business intentions for predictive analytics
2. Data Collection: data detection and assessment to assess your data’s readiness for predictive analytics
4. Data Modeling: developing modeling method, procedure and/or environment
6. Deployment: executing deployment procedures for implanting insights from predictive models into your business]
Data exploration is done to become familiar with the data. This step is especially important when
dealing with new data. There are a number of things you will want to do in this step –
• What is there in the data – look at the list of all the variables in the data set. Understand the
meaning of each variable using the data dictionary. Go back to the business for more information in
case of any confusion.
• How much data is there – look at the volume of the data (how many records) and the time frame
of the data (last 3 months, last 6 months, etc.).
• Quality of the data – check how much information is missing and the quality of data in each
variable. Are all fields usable? If a field has data for only 10% of the observations, then that field may
not be usable.
You will also identify some important variables and may investigate these more deeply, for example
by looking at averages, minimum and maximum values, and perhaps the 10th and 90th percentiles.
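The exploration checks above can be sketched in a few lines of Python. This is only an illustration: the `profile` helper, the `spend` field and the sample records are hypothetical and not part of the original booklet.

```python
import statistics

def profile(records, field):
    """Summarise one numeric field of a list of dict records (hypothetical helper)."""
    values = [r[field] for r in records if r.get(field) is not None]
    deciles = statistics.quantiles(values, n=10)  # 9 cut points: 10th ... 90th percentile
    return {
        "records": len(records),                            # how much data is there
        "missing_share": 1 - len(values) / len(records),    # quality of the field
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "p10": deciles[0],
        "p90": deciles[-1],
    }

# Made-up records: two of the ten observations have no value for "spend".
records = [{"spend": v} for v in [120, 80, None, 200, 150, 95, None, 310, 60, 175]]
summary = profile(records, "spend")
print(summary)
```

A field whose `missing_share` is very high (say, data for only 10% of the observations) is a candidate for being dropped, as described above.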
5. What does Data Preparation entail?
The first step is to identify variables with missing values. Assess the extent of missing values. A few
possible methods are –
Handling Missing Data
• Deletion
✓ Deleting Rows
✓ Deleting Columns
✓ Pairwise Deletion
• Imputation
✓ Mean, Median, Mode, Random Sample Imputation
✓ Linear Regression
✓ Logistic Regression
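As a minimal sketch of two of these options (row deletion and simple mean/median/mode imputation), assuming missing values are represented as `None`. The function names and the sample `ages` data are hypothetical.

```python
import statistics

def drop_missing_rows(rows):
    """Deletion: keep only rows that have no missing (None) value."""
    return [row for row in rows if None not in row]

def impute(values, strategy="mean"):
    """Imputation: replace None with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

ages = [25, None, 31, 40, None, 31]       # made-up column with two gaps
ages_mean = impute(ages, "mean")
ages_median = impute(ages, "median")
complete_rows = drop_missing_rows([(25, "A"), (None, "B"), (31, None)])
```

Regression-based imputation follows the same idea, except that the fill value for each row is predicted from the other variables instead of being a single constant.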
You can identify outliers using graphical analysis and univariate analysis. If there are only a few
outliers, you can assess them individually. If there are many, you may want to substitute the outlier
values with the 1st percentile or the 99th percentile values.
If there is a lot of data, you may decide to ignore records with outliers.
Not all extreme values are outliers. Not all outliers are extreme values.
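Substituting outliers with the 1st and 99th percentile values can be sketched as follows. The `cap_outliers` helper and the sample data are assumptions for illustration, and the exact percentile values depend on the interpolation method chosen.

```python
import statistics

def cap_outliers(values, lower_pct=1, upper_pct=99):
    """Replace values below/above the given percentiles with the percentile values."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")  # 99 cut points
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]

data = list(range(1, 100)) + [10_000]   # one obviously extreme value
capped = cap_outliers(data)
print(max(data), max(capped))
```

Values between the two percentiles are left untouched, so only the tails of the distribution change.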
8. Correlation
Correlation analysis is a method of statistical evaluation used to study the strength of linear
relationship between two, numerically measured, continuous variables.
9. Covariance
Covariance is the expected value of the product of the deviations of two random variables from their
expected values. It is a measure of how changes in one variable are associated with changes in a
second variable. A positive covariance means that higher values of one variable are associated with
higher values of the other variable.
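The relationship between the two measures can be sketched as below: correlation is covariance rescaled by the two standard deviations, so it always lies between -1 and 1. The sample data are made up, and the sample (n - 1) form of covariance is assumed.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance: average product of deviations from the means (n - 1 denominator)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]      # exactly 2 * xs, a perfect linear relationship
cov_xy = covariance(xs, ys)
corr_xy = correlation(xs, ys)
```

Unlike covariance, correlation does not change when a variable is rescaled (for example from rupees to lakhs), which is why it is used to compare the strength of linear relationships.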
10. What is Hypothesis Testing?
Hypothesis testing is the use of statistics to assess whether the observed data provide enough
evidence to reject a given hypothesis. The usual process of hypothesis testing consists of four steps.
i. Formulate the null hypothesis H0 (commonly, that the observations are the result of pure
chance) and the alternative hypothesis Ha (commonly, that the observations show a real
effect combined with a component of chance variation)
ii. Identify a test statistic that can be used to assess the truth of the null hypothesis
iii. Compute the p-value, which is the probability that a test statistic at least as extreme as the
one observed would be obtained assuming that the null hypothesis were true. The smaller
the p-value, the stronger the evidence against the null hypothesis
iv. Compare the p-value to an acceptable significance value alpha (sometimes called an alpha
value). If p <= alpha, the observed effect is statistically significant and the null hypothesis is
rejected in favour of the alternative hypothesis
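The four steps can be sketched with a one-sample z-test. This is a minimal illustration that assumes the population standard deviation is known; the numbers (sample mean 52, hypothesised mean 50, sigma 10, n = 100) are invented for the example.

```python
from math import sqrt
from statistics import NormalDist

def z_test_p_value(sample_mean, mu0, sigma, n):
    """Two-sided one-sample z-test; returns the test statistic and the p-value."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))      # step ii: test statistic
    p = 2 * (1 - NormalDist().cdf(abs(z)))           # step iii: two-sided p-value
    return z, p

# Step i: H0: population mean = 50, Ha: population mean != 50
z, p = z_test_p_value(sample_mean=52.0, mu0=50.0, sigma=10.0, n=100)

# Step iv: compare the p-value with the chosen significance level
alpha = 0.05
reject_null = p <= alpha
print(round(z, 2), round(p, 4), reject_null)
```

Here the p-value is just under 0.05, so the null hypothesis is rejected at the 5% level but would not be rejected at the 1% level.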
11. R² value
The coefficient of determination is the proportion of the variance of the dependent variable that can
be predicted from the independent variable(s). In regression, the R² coefficient of determination is a
statistical measure of how well the regression predictions approximate the real data points. An R² of
1 indicates that the regression predictions perfectly fit the data.
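One common way to compute the coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch follows; the function name and sample values are illustrative only.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))   # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)            # total variation
    return 1 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]
perfect_fit = r_squared(actual, [3.0, 5.0, 7.0, 9.0])     # predictions match exactly
mean_only = r_squared(actual, [6.0, 6.0, 6.0, 6.0])       # always predicting the mean
```

Predicting the mean for every observation gives an R² of 0, which is the baseline a regression model is compared against.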
12. P value
A probability that provides a measure of the evidence against the null hypothesis given by the sample.
Smaller values indicate more evidence against H0.
13. What is Bias and Variance in a model? What is Bias Variance Trade off?
Bias refers to model error and variance refers to the consistency in predictive accuracy of models
applied to other data sets. The best models have low bias (low error, high accuracy) and low variance
(consistency of accuracy from data set to data set).
Unfortunately, there is always a tradeoff between these two when building predictive models. You can
achieve low bias on training data, but may suffer from high variance on held-out data because the
models were overfit.
Error due to bias: The error due to bias is the difference between the expected (or average)
prediction of our model and the correct value which we are trying to predict. Imagine you could
repeat the whole model building process more than once: each time you gather new data and run a
new analysis, creating a new model. Due to randomness in the underlying datasets, the resulting
models will have a range of predictions. Bias measures how far off in general these models'
predictions are from the correct value. A high bias error means we have an under-fitting model which
keeps missing important patterns in the data.
Error due to variance: The error due to variance is taken as the variability of a model prediction for
a given data point. Again, imagine you can repeat the entire model building process multiple times.
The variance is how much the predictions for a given point vary between different iterations of the
model. A high variance model will over-fit on your training population and perform very badly on any
observation outside the training data.
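The thought experiment of repeating the model building process can be simulated directly. The sketch below is only an illustration under stated assumptions: the true relationship (y = x² plus Gaussian noise), the two toy models (a constant that predicts the average, and a 1-nearest-neighbour lookup) and all parameter values are hypothetical choices, not taken from the booklet.

```python
import random
import statistics

def true_f(x):
    return x ** 2

def sample_dataset(rng, n=20, noise_sd=0.3):
    """Gather a fresh noisy training set, as in the thought experiment."""
    return [(x, true_f(x) + rng.gauss(0, noise_sd))
            for x in (rng.random() for _ in range(n))]

def constant_model(data, x0):
    """Very simple model: always predict the average y (tends to under-fit)."""
    return statistics.mean([y for _, y in data])

def nearest_neighbour_model(data, x0):
    """Very flexible model: copy the y of the closest training point (tends to over-fit)."""
    return min(data, key=lambda point: abs(point[0] - x0))[1]

def bias_and_variance(model, x0, repeats=2000, seed=0):
    rng = random.Random(seed)
    predictions = [model(sample_dataset(rng), x0) for _ in range(repeats)]
    bias = statistics.mean(predictions) - true_f(x0)   # average prediction vs correct value
    variance = statistics.pvariance(predictions)       # spread of predictions across repeats
    return bias, variance

bias_const, var_const = bias_and_variance(constant_model, x0=0.9)
bias_nn, var_nn = bias_and_variance(nearest_neighbour_model, x0=0.9)
print(bias_const, var_const)
print(bias_nn, var_nn)
```

In this setup the constant model is far from the correct value on average (high bias) but barely changes between datasets, while the nearest-neighbour model is right on average but its prediction swings with the noise of whichever point happens to be closest (high variance).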
Model Ensembles, or simply “ensembles,” are combinations of two or more predictions from
predictive models into a single composite score. Advantages of ensembling -
• Improved model accuracy (low error) - Ensembles nearly always improve model predictive
accuracy and rarely predict worse than single models
• Improved model robustness (less overfitting) - when multiple models are averaged into a single
prediction, no single model dominates the final predicted value, reducing the
likelihood that a flaky prediction will be made (so long as all the models don’t agree on the
flaky prediction)
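The averaging effect can be illustrated with simulated predictions. This sketch assumes five hypothetical models whose errors are independent of one another, which is the favourable case described above; real models often make correlated errors, and the gain is then smaller.

```python
import random
import statistics

rng = random.Random(42)
truth = [rng.uniform(0, 100) for _ in range(500)]

# Five hypothetical models: each predicts the truth plus its own independent noise.
model_predictions = [[t + rng.gauss(0, 10) for t in truth] for _ in range(5)]

# Ensemble: average the five predictions for each observation.
ensemble = [statistics.mean(column) for column in zip(*model_predictions)]

def mse(predictions):
    """Mean squared error against the true values."""
    return statistics.mean([(p - t) ** 2 for p, t in zip(predictions, truth)])

single_mse = [mse(p) for p in model_predictions]
ensemble_mse = mse(ensemble)
print(single_mse, ensemble_mse)
```

With independent errors, the averaged prediction has a noticeably lower error than any single model, because the individual errors partly cancel out.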
• Segmentation
As the demand varies across various segments, segmentation needs to be done while solving the
guesstimate. Examples of segments are income bracket, age, rural/ urban, gender.
To determine the number of cars, income bracket could be used as the higher income people use
cars more. To determine the number of cigarettes, gender segmentation could be useful. On the
same lines, to determine number of burgers made in India, rural/urban segmentation could prove to
be helpful.
Sample Guesstimates
[Figure: estimation tree for the demand of tyres in India. Recoverable labels: Cars; Commercial; Household; 30 crore families (120 crore population); Demand of tyres.]
Hence, the total demand for tires in India is 3.5 crores per year.
3. Ola cab services are starting in Vizag. Estimate the number of cabs required in the first week.
Assumptions:
Population of Vizag - 20 lakh
There are 1.72 lakh potential customers for cabs. Considering public transportation and auto
rickshaws, it can be assumed that about 50% of the middle class and 30% of the upper class (who
own more cars) can be converted to use cabs.
Suppose 5% of them can be converted to Ola customers in the first week. This gives scope for almost
10K Ola customers in the first week. Considering the size of Vizag, it can be assumed that each
customer spends on average half an hour travelling by cab per day. Add a 50% factor for waiting
time, and assume that on average each cab driver works 9 hours a day.
Cabs required = 10000 * (1/2) * (1 + 1/2) / (9 hours) = 833 cabs
So, about 800 cabs need to be acquired in the first week.
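The arithmetic of this guesstimate can be written out as a short script, which makes it easy to see how the answer moves when an assumption changes. The variable names are illustrative; the values are the ones stated above, including the rounding of 5% of 1.72 lakh (8,600) up to roughly 10,000 customers.

```python
potential_customers = 172_000        # 1.72 lakh potential cab customers
first_week_conversion = 0.05         # 5% convert to Ola in the first week

exact_customers = potential_customers * first_week_conversion   # about 8,600
customers = 10_000                   # rounded up to "almost 10K", as in the estimate

ride_hours_per_customer = 0.5        # half an hour of travel per customer per day
waiting_factor = 1.5                 # 50% extra for waiting time
driver_hours_per_day = 9             # hours worked per cab per day

cab_hours_needed = customers * ride_hours_per_customer * waiting_factor
cabs_required = cab_hours_needed / driver_hours_per_day
print(round(cabs_required))          # 833
```

Using the unrounded 8,600 customers instead would give about 717 cabs, so the rounding step alone shifts the answer by more than 100 cabs.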
4. How many square feet of pizza are eaten in the United States each month?
Take your figure of 300 million people in America. How many people eat pizza?
Let’s say 200 million. Now let’s say the average pizza-eating person eats pizza twice a month and
eats two slices at a time. That’s four slices a month. If the average slice of pizza is perhaps six inches
at the base and 10 inches long, then the slice is 30 square inches of pizza. So, four pizza slices would
be 120 square inches. Since one square foot equals 144 square inches, let’s assume that each
Clarify type of books (here, notebooks, xerox study material, magazines excluded)
Number of books with faculty (for instance 10 areas, 10 profs each, average 5 years’ experience, 5
books added per year)
Library: 1 floor (operational currently), 10 racks per floor, 5 shelves per rack, 30 books per shelf
Sample Puzzles
1. Bag of Coins
You have 10 bags full of coins. Each bag contains an unlimited number of coins. But one bag is full of
forgeries, and you can’t remember which one. You do know that genuine coins weigh 1 gram each,
whereas forgeries weigh 1.1 grams each. You have to identify that bag in the minimum number of
readings. You are provided with a digital weighing machine.
Answer
1 reading
Explanation - Take 1 coin from the first bag, 2 coins from the second bag, 3 coins from the third bag
and so on. Eventually, we’ll get 55 (1+2+3…+9+10) coins. Now, weigh all the 55 coins together. If all
the coins were genuine, the reading would be exactly 55 grams; each forged coin adds 0.1 gram.
The excess over 55 grams therefore identifies the bag: if the reading ends with 0.4 then it is the 4th
bag, if it ends with 0.7 then it is the 7th bag and so on (a reading of exactly 56 means it is the
10th bag).
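The single-weighing strategy can be checked with a short simulation. The function names are hypothetical; the weights are those given in the puzzle.

```python
def weigh(forged_bag, n_bags=10, genuine=1.0, forged=1.1):
    """Reading when k coins are taken from bag k and all are weighed together."""
    return sum(k * (forged if k == forged_bag else genuine)
               for k in range(1, n_bags + 1))

def find_forged_bag(reading, n_bags=10, genuine=1.0, extra=0.1):
    """Recover the bag number from the excess weight over the all-genuine total."""
    total_coins = n_bags * (n_bags + 1) // 2     # 55 coins for 10 bags
    excess = reading - total_coins * genuine
    return round(excess / extra)                 # each forged coin adds 0.1 gram

results = {bag: find_forged_bag(weigh(bag)) for bag in range(1, 11)}
print(results)
```

Rounding is used because 0.1 cannot be represented exactly in floating point, so the excess comes out as a value very close to, but not exactly, a multiple of 0.1.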
2. Prisoners and hats
There are 100 prisoners, all sentenced to death. One night before the execution, the warden gives
them a chance to live if they all work on a strategy together. The execution scenario is as follows –
On the day of execution, all the prisoners will be made to stand in a straight line such that one
prisoner stands just behind another and so on. Each prisoner will be wearing a hat, either blue or red
in color. The prisoners don’t know the color of the hat they are wearing. Each prisoner can see all the
hats in front of him, so the prisoner standing at the back of the line can see all the other prisoners
(and the color of the hats they are wearing), while the prisoner standing at the front of the line
cannot see anything.
• Start the 7 minute sand timer and the 4 minute sand timer
• Once the 4 minute sand timer ends turn it upside down instantly
• Once the 7 minute sand timer ends turn it upside down instantly
• After the 4 minute sand timer ends again (at 8 minutes), turn the 7 minute sand timer upside
down (it now has 1 minute of sand in it)
So effectively 8 + 1 = 9 minutes
5. Chaotic Bus
There is a bus with 100 labeled seats (labeled from 1 to 100). There are 100 persons standing in a
queue. Persons are also labelled from 1 to 100.
People board the bus in sequence from 1 to 100. The rule is, if person ‘i’ boards the bus, he checks
if seat ‘i’ is empty. If it is empty, he sits there, else he randomly picks an empty seat and sits there.
Given that the 1st person picks a seat randomly, find the probability that the 100th person sits in his
own place, i.e. the 100th seat.
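The boarding rule can be simulated to get a feel for the answer. This is a Monte Carlo sketch with an arbitrary trial count and seed; the well-known analytical answer to this puzzle is 1/2, and the estimate should land close to it.

```python
import random

def last_person_gets_own_seat(n, rng):
    """Simulate one boarding of n people; True if person n ends up in seat n."""
    free = set(range(1, n + 1))
    free.remove(rng.choice(sorted(free)))            # person 1 picks a seat at random
    for person in range(2, n):
        if person in free:
            free.remove(person)                      # own seat is empty: sit there
        else:
            free.remove(rng.choice(sorted(free)))    # otherwise pick a random empty seat
    return n in free                                 # the one seat left for person n

def estimate_probability(n=100, trials=10_000, seed=7):
    rng = random.Random(seed)
    hits = sum(last_person_gets_own_seat(n, rng) for _ in range(trials))
    return hits / trials

estimate = estimate_probability()
print(estimate)
```

The intuition behind 1/2 is that the only seats that matter are seat 1 and seat 100: whichever of the two is taken first by a displaced passenger decides the outcome, and each is equally likely to be chosen.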
• On the first turn, choose a diagonally opposite pair of glasses and turn both glasses up
• On the second turn, choose two adjacent glasses. At least one will be up as a result of the
previous step. If the other is down, turn it up as well. If the bell does not ring, then there are
now three glasses up and one down
• On the third turn, choose a diagonally opposite pair of glasses. If one is down, turn it up and
the bell will ring. If both are up, turn one down. There are now two glasses down, and they
must be adjacent
• On the fourth turn, choose two adjacent glasses and reverse both. If both were in the same
orientation then the bell will ring. Otherwise there are now two glasses down and they must
be diagonally opposite
• On the fifth turn, choose a diagonally opposite pair of glasses and reverse both. The bell will
ring
Sample Questions
Question When do you generally use linear regression, and can you explain the basic
steps?
Interviewee A simple linear regression is used to predict the value of a dependent variable
using a single independent variable. In case of multiple linear regression, the
value of the dependent variable is predicted using two or more independent
variables.
Steps of the process were also explained.
Question What do you know about Clustering and Principal Component Analysis? Can you
talk about various types of Clustering?
holds for any integer n>2, where x,y and z are positive integers
Contributors
1. Srijit Mondal
2. Ravindra
3. Ayush Singh
4. Gouthami N
5. Kunal Kumar
6. Vignesh S
7. Uday Kumar
8. Ramya Kolli
9. Arpan Ghosh
10. Nilesh Paliwal
11. Shouvik Das
12. Abhishek Kumar Sachan
13. Ayush Singh
14. Avinash Deka
Contributors
1. Kedar Mahamuni
2. Lakshya Verma
Contributors
1. Anusmita Saha
2. Kausik Tamuli
3. Abhijit Khonde
4. Dr Nisha Sharma
5. Nishant Jaiswal
Sample Question
Question Why could a driver cancel a trip? How would you solve it?
Interviewee The driver may cancel the trip because the customer is too late. To solve this
issue, I would suggest incentivizing the driver.
(The interviewee then went on to approximate the amount that might be paid to
the driver as incentive)
Question Interview was based on problems faced by Uber and how we can tackle them.
Interviewers generally look for specific solutions and not generic ones. E.g.
rather than saying improve UI for booking a ride, one can pinpoint the exact issue.
2nd round – Major challenges faced by Uber and ways to circumvent them. Also, two
strategies that should be followed by Uber to increase its market share. How
does it differentiate itself from Ola?
Contributors
1. Nishanth Buddhavarapu
2. Amrit Satsangi
3. Kandalam Sai Madhuri
4. Navjot Kaur
5. Vikas Attri
Sample Question
Question Tell us the name of 3 startups you have heard about/ you know about.
Interviewee a. Udaan
b. WeWork
c. Neuralink
They chose WeWork and asked me to come up with a new product that WeWork
could potentially offer. Firstly, I was asked to define the criteria I would use to
determine the type of product and the demand for the product. I used the
CIRCLES approach and came up with 4-5 product ideas.
Then I was asked to select one of the presented ideas. I did a pros and cons
analysis of all the options and selected 2 options based on ease of
implementation and expected revenues.
Question Second part of interview was a guesstimate to estimate the number of people
flying in and out of Bangalore per day.
Contributors
1. Shubhankar Sarkar
2. Siddharth Gupta
3. Rishabh Garg
Sample Questions
Question 100 prisoners in jail are standing in a queue facing in one direction. Each prisoner
is wearing a hat of color either black or red. A prisoner can see the hats of all
prisoners in front of him in the queue, but cannot see his own hat or the hats of
prisoners standing behind him.
The jailer is going to ask the color of each prisoner’s hat, starting from the last
prisoner in the queue. If a prisoner tells the correct color, then he is saved, otherwise
he is executed. How many prisoners can be saved at most if they are allowed to
discuss a strategy before the jailer starts asking the colors of their hats?
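The booklet's answer to this question is not reproduced here. The sketch below simulates the standard parity strategy for this puzzle, in which the last prisoner uses his answer to announce whether he sees an odd or even number of red hats; this guarantees that at least 99 prisoners are saved. The encoding (1 = red, 0 = black) and the function name are choices made for this illustration.

```python
import random

def prisoners_saved(hats):
    """hats[0] is the front of the queue, hats[-1] the back; 1 = red, 0 = black.

    Returns how many prisoners answer correctly under the parity strategy.
    """
    n = len(hats)
    answers = [None] * n
    # The back prisoner says "red" (1) if he sees an odd number of red hats.
    announced = sum(hats[: n - 1]) % 2
    answers[n - 1] = announced
    # Everyone else combines the announcement, the answers heard, and the hats seen.
    for i in range(n - 2, -1, -1):
        seen_in_front = sum(hats[:i]) % 2
        heard_behind = sum(answers[i + 1 : n - 1]) % 2
        answers[i] = (announced - seen_in_front - heard_behind) % 2
    return sum(answer == hat for answer, hat in zip(answers, hats))

rng = random.Random(3)
trials = [[rng.randint(0, 1) for _ in range(100)] for _ in range(200)]
worst_case = min(prisoners_saved(hats) for hats in trials)
print(worst_case)
```

Only the last prisoner is at risk: his answer carries information for the others, so it matches his own hat only by chance, while every prisoner in front of him can deduce his own color exactly.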
If both of them repeat the process n times, which player has the larger
expected average area?
Interviewee Player A.
Let the random variables representing A’s n trials be $X_1, X_2, \dots, X_n$.
Let the random variables representing B’s trials be $(Y_1, Z_1), (Y_2, Z_2), \dots, (Y_n, Z_n)$ (B uses
the distribution twice).
A’s average area $= (X_1^2 + X_2^2 + \dots + X_n^2)/n$
B’s average area $= (Y_1 Z_1 + Y_2 Z_2 + \dots + Y_n Z_n)/n$
$E(X) = E(Y) = E(Z)$
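One way to see why Player A comes out ahead: E(X²) = Var(X) + E(X)², whereas for two independent draws E(YZ) = E(Y)E(Z) = E(X)², so A's expected area is larger by exactly the variance of the distribution. The simulation below is a sketch that assumes the two draws of B are independent and, purely for illustration, that the distribution is uniform on (0, 1); the original question's distribution is not shown in this excerpt.

```python
import random

rng = random.Random(1)
n = 100_000

# Player A: one draw per trial, used as both sides (area X^2).
area_a = sum(rng.random() ** 2 for _ in range(n)) / n

# Player B: two independent draws per trial (area Y * Z).
area_b = sum(rng.random() * rng.random() for _ in range(n)) / n

print(area_a, area_b)
```

For the uniform distribution the two averages settle near 1/3 and 1/4 respectively, and the gap of 1/12 is the variance of a uniform (0, 1) variable.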
Contributors
1. Abhishek Joshi GS
2. Adithya PSS
3. Kritika Saini
4. Saumya Dixit
5. Madhur Gupta
6. Nikitha Hegde
1. Finance
With the introduction of quantitative analytical models, large financial institutions and hedge funds
have shifted from manual trading to trading backed by technology. These models can be used to
analyse big data to
• Route planning - Big data can be used to understand and estimate users’ needs on different
routes and on multiple modes of transportation and then utilize route planning to reduce
their wait time
• Congestion management and traffic control - Using big data, real-time estimation of
congestion and traffic patterns is now possible. For example, people are using Google Maps
to locate the least traffic-prone routes
• Safety level of traffic - Using the real-time processing of big data and predictive analysis to
identify accident-prone areas can help reduce accidents and increase the safety level of
traffic
6. Telecom
Customer retention/improving customer loyalty
There is an increased focus on improving user experience. To do so, analysts are creating
sophisticated 360-degree profiles assembled from a number of sources such as voice, SMS and data
usage patterns, video choices, customer care history, social media, consumer demographics, service
usage, etc. This allows telecom companies to offer personalized services or products at every step
of the purchasing process. Businesses can tailor messages to appear on the right channels (e.g.,
mobile, web, call centre, in-store), in the right areas and in the right words and images.
Network optimization
Telecommunication companies tend to regard the customer's engagement process and internal
channels as a guarantee of the smooth functioning of operations. Network management and
optimization give an opportunity to define score points in operations and to identify the root
causes of complications. Looking into historical data and predicting possible future problems
or, on the contrary, beneficial scenarios is a great benefit for telecom providers.
7. Health care
A few examples of applying big data and predictive analytics in healthcare are
• Big data can be used to predict negative health events that seniors could experience from
home-care. At AlayaCare, the analysis reduced hospitalizations and ER visits by 73 percent,
and 64 percent amongst chronically ill patients
8. Manufacturing
• Supply chain management and big data go hand-in-hand, which is why manufacturing is one
of the top industries to benefit from the use of big data.
• Monitoring the performance of production sites is more efficient with big data analytics. The
use of analytics is also extremely useful for quality control, especially in large-scale
manufacturing projects.
• Big data analytics plays a key role in tracking and managing overhead and logistics across
multiple sites. For example, being able to accurately measure the cost of shop floor tasks
can help reduce labour costs.
• Then there’s predictive analytics software, which uses big data from sensors attached to
manufacturing equipment. Early detection of equipment malfunctions can save sites from
costly repairs capable of paralyzing production.