Project Report
Bachelor of Technology
by
&
Varun Singhal
(Roll- B516051)
Undertaking
I declare that the work presented in this report, titled Emerging Business
Opportunities for Food and Beverages Manufacturer and submitted
to the Department of Computer Science and Engineering, International
Institute of Information Technology, Bhubaneswar, for the award of the
Bachelor of Technology degree in Computer Science and Engineering,
is my original work. I have not plagiarized or submitted the same work for
the award of any other degree. In case this undertaking is found incorrect, I
accept that my degree may be unconditionally withdrawn.
Varun Singhal
B516051
International Institute of Information
Technology Bhubaneswar
Bhubaneswar Odisha -751 003, India. www.iiit-bh.ac.in
Certificate
This is to certify that the work in the report entitled Emerging Business
Opportunities for Food and Beverages Manufacturer by Swayam
Prakash Pal & Varun Singhal is a record of original research work carried
out by them under my supervision and guidance in partial fulfilment of the
requirements for the award of the degree of Bachelor of Technology in
Computer Science and Engineering. Neither this thesis nor any part of it has
been submitted for any degree or academic award elsewhere.
Varun Singhal
ABSTRACT
The food and beverages industry manufactures products across a variety of
themes, ranging from fruits and poultry to organic products and seafood. Each
manufacturer has its own brand for a product of a particular theme; for example,
potato chips are sold under brands such as Lay’s by PepsiCo and Bingo by ITC.
The purpose of this project is to understand the growth patterns of consumer
preferences (themes) and to calculate the positioning of brands across different
themes for the given client. By analysing the growth patterns of themes, we can
further identify the key drivers behind the sales of these products.
The major challenge in this project was to map the data available across different
datasets. We had 6 other major data sources which were entirely product specific,
such as the sales data of products, search volume across different search engines,
posts across different social media channels, a product-to-theme mapping and a
product-to-vendor mapping. Our objective in this project is to understand the
market share of a particular vendor for any given theme and thus identify its
potential competitors across those themes. We also identified the themes that
were emerging in the different data sources (sales, social and search) and
checked whether there was any visible trend or order in which a theme
appeared across the data sources. Generally, a product first trends on social
media because of the advertisements run by the company; then, if people get
interested in the product, they search for it and compare specifications, price
etc. before making a final choice and ending up buying the product. We tried to
validate this trend using hypothesis testing and also measured the lag period
for the trend to shift from one medium to another.
Finally, after all the data aggregations and transformations, model building was
carried out to find the key predictors that lead to increased sales of products for
a given theme.
Contents
Abstract
1. Introduction
1.1 Data Analysis and Exploration
3.5 Latency Observed in Trend Shift
4. Conclusion
CHAPTER 1: INTRODUCTION
In this project we try to identify the high-opportunity themes for a given
vendor and the key driving factors which the vendor can leverage to
increase the sales of products across the given set of themes. The project can
be divided into several stages that act as a pipeline, showing the flow of raw
data from the processing stage to the final modelling phase. The first objective
is to understand the data present across the different data sources: the
granularity of the data (weekly, monthly, yearly), the outliers, and the
treatment of any missing values. We then look for the fields by which we can
map the additional data sources to get the unique themes present in our sales,
social and search data. After getting the unique themes, we find the preferred
themes based on total sales, social posts and search volume, and also find the
themes that can be classified as ‘emerging’ in the given data sources. The next
objective is to calculate the overall market share of our vendor in comparison
to its competitors, which gives a better understanding of our vendor’s market
hold; for some common themes we also compared the sales value of our vendor
to that of other vendors. Generally, when a new product is launched in the
market, it tends to follow a certain marketing trend before actual sales start;
this includes paid advertisements on social media channels, paid ads on search
channels etc. Here we tried to figure out the trend that most of the themes
followed: whether a theme was first visible on social media and then searched
for, or first searched for in some search engine and then posted about on social
media. After getting the trend and the lag between the different channels, we
aggregate the data sources and move on to the modelling phase. For modelling
we have used Multiple Linear Regression with a Backward Elimination based
approach. A more comprehensive explanation of all the steps is presented in
the following sections.
As we have multiple independent variables (predictors), it is important to
include only the ones that are necessary. In this project we therefore used a
feature selection technique known as Backward Elimination to remove the
non-essential predictors from the model. Backward Elimination works in 5
simple steps.
CHAPTER 2: PROJECT DESCRIPTION AND OVERVIEW
By analysing the historical data available to us on the sales of the
products and on the mediums used to advertise them, i.e. social media
and search channels, we can learn which product themes are popular in
the market, what the market share of our vendor is compared to its
competitors, and whether there is a time lag in the shift of a trend from
one channel to another, so that the marketing team can focus on
advertising more to reduce the gap. We can also learn which themes are
profitable for the vendor and which have a negative impact on profit.
Finally, we can identify the controllable factors the client can leverage to
increase their profit and by what margin they can increase it.
2.2 PROBLEM STATEMENT AND DESCRIPTION
The problem statement assigned to us was to understand the growth patterns
of consumer preferences (themes), i.e. different varieties of food and
beverages products, for a given CPG client and to evaluate the positioning of
the client’s brand across different themes. Further, the client also wants
to understand the KPIs of their products which they could leverage to increase
product sales and, in turn, the net profit of the company.
The problem statement can be broken down into 4 small subsections which
indicate the flow of our work.
1. The data provided for this project is of 3 different types. First, we have
the sales data for the client and competitors at a UPC level. Second, we
have the social media data, which has mentions of themes across all social
media platforms. Third, we have Google Search data, which has the
search volume information for all the themes. We found the number
of unique themes present in the 3 data sources and also looked for the
consumer preferences. The main challenge here was to understand the
time granularity, as it varied from weekly to daily level data.
2. The second objective of the project was to assimilate the required data
sources after proper mapping of the themes to products. After this step we
get product-level data on sales, number of posts and search volume,
which helps us in further analysis. The market shares of the client and of
the other vendors were computed. We also looked for themes that
contributed more towards sales than other themes in the last 3 years
and considered them as emerging themes.
3. The third objective involves performing any kind of transformation or
and model results to pick out the themes that showed signs for high
business opportunity. KPIs to drive the sales of our client were also
identified. The main challenge for any manufacturer is to correctly
estimate by what margin its sales would increase if it adjusts its
strategy. With the help of this model we get certain insights, such as the
coefficient values of the controllable factors, which can give a nearly
correct answer to this question.
Other developers should easily understand the code base and pick up the
project later without difficulties.
Training the system’s models must take a reasonable amount of time.
Working model and model fit metrics like R-square, Adjusted R-square
etc. that assess model accuracy and performance.
Documentation of the models and key insights, as well as presentation of
these insights to the company’s management.
Well documented Python production code.
The project is therefore structured around these constraints so that none of them is violated.
CHAPTER 3: DATA ANALYSIS AND SOLUTION APPROACH
Fig 3.2.1. Stats from Sales Dataset
From the above chart we can infer that there are no missing values in this
data source. We also generated box plots and scatter plots using the
sales units and sales dollars fields.
From the box plot we can tell that the median falls under the 500 mark while
the outliers lie above the 4000 mark. A scatter plot was generated between the
sales dollars value and the sales lbs value, as it captures the relationship
between two continuous variables.
Fig 3.2.4. Scatter Plot (dollar’s vs lbs)
From the scatter plot, we can see that as the value of lbs increases there is
an almost linear increase in the sales dollars value. Similarly, we also
plotted box plots for the search and social media data and identified the
outlier regions.
Fig 3.2.5. Box plot for Search Media
Fig 3.2.6. Box plot for Social Media
From the box plots we observe that for search media the median lies around
the 1000 mark, the minimum value is 1 and the outliers fall above the 12000
mark. Similarly, for social media the median falls around the 50 mark while
the outliers lie beyond the 800 mark. For search media we have no missing
values, while for social media we observe some missing values in the theme id
column.
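The medians and outlier regions read off the box plots can also be computed directly. The following is a minimal sketch, assuming the conventional 1.5 × IQR whisker rule (the report does not state which rule the plots used); the function name and the sample values are hypothetical and are not taken from the project data.

```python
import statistics

def box_plot_summary(values, whisker=1.5):
    """Median, fences and outliers as a standard box plot would draw them."""
    # Quartiles of the data (the "inclusive" method treats the data as a full population).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr   # assumption: 1.5 * IQR rule
    upper_fence = q3 + whisker * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "median": median,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }

# Hypothetical weekly post counts for one theme.
sample = [1, 40, 45, 50, 55, 60, 70, 900]
summary = box_plot_summary(sample)
print(summary)
```

Points beyond either fence are the ones a box plot marks individually, which is how an "outliers above a given mark" statement can be checked numerically.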
Fig 3.2.1.1 Pie chart for Vendor Market Share
From the given chart we can conclude that the client’s market share is 25.1%,
which is fairly good compared to other vendors. We also identified the vendors
who can be treated as potential competitors of our client at a theme level
of granularity. For better analysis we plotted bar plots showing the vendors
and their sales dollars value for a particular theme.
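A market-share figure of this kind and the per-theme vendor comparison can be derived from the sales data with two group-by aggregations. The sketch below assumes pandas and uses hypothetical column names (`vendor`, `theme`, `sales_dollars`) and made-up values; it is an illustration, not the project's actual code.

```python
import pandas as pd

# Hypothetical product-level sales rows (values are illustrative only).
sales = pd.DataFrame({
    "vendor": ["A", "A", "B", "C", "B"],
    "theme": ["salmon", "blueberry", "salmon", "salmon", "blueberry"],
    "sales_dollars": [300.0, 100.0, 250.0, 150.0, 200.0],
})

def market_share(df):
    """Percentage of total sales dollars held by each vendor."""
    totals = df.groupby("vendor")["sales_dollars"].sum()
    return (100 * totals / totals.sum()).round(1).sort_values(ascending=False)

def theme_competitors(df, theme):
    """Sales dollars per vendor for one theme, largest first."""
    subset = df[df["theme"] == theme]
    return subset.groupby("vendor")["sales_dollars"].sum().sort_values(ascending=False)

share = market_share(sales)
salmon = theme_competitors(sales, "salmon")
print(share)
print(salmon)
```

The first result corresponds to the data behind a market-share pie chart; the second to the data behind a per-theme bar plot.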
Fig 3.2.1.2 Bar plot for theme salmon
From the above bar plot we can say vendor A outperforms its competitors for
the theme salmon. Similar bar plots are generated for all unique themes of our
client.
For each data source we first calculated, at a theme level, the percentage
share of each theme and assigned ranks accordingly. Next, we applied the rule
that with each passing year the % share of a theme should increase while its
rank number should decrease (i.e. move towards rank 1), and kept only those
themes that followed this rule. After obtaining the resultant data we plotted
3 bar charts, one for each data source, to depict the emerging themes more
clearly.
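The emerging-theme rule described above can be sketched as follows. This is an illustration under stated assumptions: pandas is used, the column names (`year`, `theme`, `sales`) and the figures are hypothetical, and "increase" and "decrease" are read as strict year-over-year changes.

```python
import pandas as pd

# Hypothetical yearly totals per theme (illustrative values only).
data = pd.DataFrame({
    "year": [2016] * 3 + [2017] * 3 + [2018] * 3,
    "theme": ["salmon", "blueberry", "kale"] * 3,
    "sales": [50, 40, 10, 45, 25, 30, 30, 30, 40],
})

def emerging_themes(df, value_col="sales"):
    """Themes whose % share rises and whose rank number falls every year."""
    df = df.copy()
    yearly_total = df.groupby("year")[value_col].transform("sum")
    df["share"] = 100 * df[value_col] / yearly_total
    # Rank 1 is the theme with the largest share in that year.
    df["rank"] = df.groupby("year")["share"].rank(ascending=False, method="min")
    df = df.sort_values(["theme", "year"])
    emerging = []
    for theme, grp in df.groupby("theme"):
        if len(grp) < 2:
            continue  # a single year cannot show a trend
        share_up = grp["share"].diff().dropna().gt(0).all()
        rank_down = grp["rank"].diff().dropna().lt(0).all()
        if share_up and rank_down:
            emerging.append(theme)
    return emerging

print(emerging_themes(data))
```

The same function can be applied to the social and search data by passing the relevant volume column as `value_col`.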
Fig 3.3.1 Emerging themes in Sales data
From the above bar charts, we can see the themes which have come out as
emerging over the period of 4 years. The client can use this information to
find out which themes are going out of demand and, at the same time, work out
how to increase their sales and turn them into emerging themes.
The next step in the project was to validate the hypothesis that a trend flows
from social to search to sales, i.e. a product first trends on social
media, then people start searching for it in different search engines, and
finally they decide whether to buy it or not. To validate this hypothesis we
made use of the sales, posts and search volume information.
We assumed that if the total posts for a theme increase or decrease from one
year to another, then there would be a similar impact on search volume and
sales units. We formulated our Null and Alternative Hypotheses based on this.
Null Hypothesis: There is no trend between the data sources, i.e. change in total
posts would not affect change in search volume and change in sales units for a
particular theme.
Alternative Hypothesis: There exists a trend between the data sources. Total
posts, search volume and sales unit’s value for a particular theme are related
with each other.
We used Student’s t-test for hypothesis testing and calculated both the t-value
and the p-value. A confidence level of 95% was chosen, which corresponds to a
significance level of 0.05. If the calculated p-value is greater than 0.05 we
fail to reject the null hypothesis; otherwise we reject it. We used a paired
t-test, first between social media posts and search volume, then
between search volume and sales units.
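To make the test concrete, here is a minimal sketch of a paired t-test using only the Python standard library. The function names and the sample numbers are hypothetical; the p-value is obtained by numerically integrating the Student's t density with Simpson's rule, which is a stand-in for what a statistics library such as SciPy (`scipy.stats.ttest_rel`) would report.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

def paired_t_test(first, second, steps=10_000):
    """Return (t statistic, two-sided p-value) for two paired samples."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    df = n - 1
    t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    # Simpson's rule (steps must be even) for the area between 0 and |t|.
    upper = abs(t_stat)
    h = upper / steps
    area = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    p_value = max(0.0, 1 - 2 * area)
    return t_stat, p_value

# Hypothetical paired observations for one theme (illustrative values only).
posts = [10, 12, 9, 14, 11]
search = [11, 14, 12, 18, 16]
t_stat, p_value = paired_t_test(posts, search)
print(round(t_stat, 4), round(p_value, 4))
print("reject null" if p_value < 0.05 else "fail to reject null")
```

Note that a paired t-test evaluates whether the mean of the paired differences is zero, so the two series should be on comparable scales before they are compared.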
Fig 3.4.1 p-values for common themes
The above results are for themes whose calculated p-values are less than
0.05, i.e. themes for which the null hypothesis is rejected. These themes
follow the trend of social to search and then to sales.
After the testing we also found a few themes that do not follow the trend:
their p-values are greater than 0.05, so the null hypothesis cannot be rejected
for them.
they can properly plan their advertising or promotional agendas so as to decrease
the latency and, in return, increase demand and sales in a shorter period of
time.
From preliminary data analysis we found that there was a lot of noise in the
datasets, as the starting and ending dates of themes varied significantly across
them; the difference in days goes up to 1 year or more in certain
cases. To resolve this ambiguity we used a method of Weighted
Days, in which for each theme in a particular dataset we found a corresponding
average date. The average date calculation proceeds as follows:
We start with 2 data sources, say the social media and search data sources. For
the social media data we first find the minimum date for a theme, i.e. the first
date of occurrence, and consider this as our reference date; we do the same for
the Google search data. Next, we declare two lists l1 and l2. In l1 we have the
minimum dates for the given theme in both datasets, and in l2 we have the
maximum dates for the given theme in both datasets. Now we filter the dates for
the theme on the condition that the date is greater than max (value in l1) and
less than min (value in l2), i.e. it falls in the period covered by both
datasets. After filtering the dates we store them in a temp variable.
Now we assign weights to the filtered dates by multiplying the corresponding
posts/search volume value by the number of days elapsed since the reference date.
temp['weight'] = temp['total_post'] * temp['days_diff']
Here days_diff = actual date – reference date for a given theme.
Then we calculate the weighted number of days by dividing the sum of weights by
the sum of posts. The number of days obtained is added to the minimum reference
date, which gives the average date corresponding to the given theme.
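The Weighted Days calculation can be put together as in the sketch below. It assumes pandas, a `date` column of datetime type, and hypothetical volume columns and values; the window bounds are treated as inclusive here, which is an assumption about how the filter was applied.

```python
import pandas as pd

def weighted_average_date(df, value_col, start, end):
    """Volume-weighted average date of one theme within the common window."""
    reference = df["date"].min()  # first date of occurrence in this dataset
    # Assumption: the common window [start, end] is treated as inclusive.
    temp = df[(df["date"] >= start) & (df["date"] <= end)].copy()
    temp["days_diff"] = (temp["date"] - reference).dt.days
    temp["weight"] = temp[value_col] * temp["days_diff"]
    avg_days = temp["weight"].sum() / temp[value_col].sum()
    return reference + pd.Timedelta(days=float(avg_days))

# Hypothetical data for a single theme (illustrative values only).
social = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2019-01-11", "2019-01-21", "2019-01-31"]),
    "total_post": [10, 20, 40, 20],
})
search = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-11", "2019-01-21", "2019-01-31", "2019-02-10"]),
    "search_volume": [100, 100, 200, 50],
})

# Common window: latest of the start dates to earliest of the end dates.
start = max(social["date"].min(), search["date"].min())
end = min(social["date"].max(), search["date"].max())

avg_social = weighted_average_date(social, "total_post", start, end)
avg_search = weighted_average_date(search, "search_volume", start, end)
print(avg_social, avg_search)
```

Because the result is a weighted mean, the choice of reference date does not change the average date; it only keeps the day offsets small and easy to inspect.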
This process is then repeated for the search data and the sales data, giving a
similar average date for each dataset.
For the latency between two datasets, we subtract the average date column of one
dataset from that of the other, which gives the latency in days. The sign of the
latency may be positive or negative depending on which dataset the theme
appeared in first.
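As a small illustration of the latency step, the sketch below joins two tables of per-theme average dates and subtracts them. It assumes pandas; the column names and dates are hypothetical.

```python
import pandas as pd

# Hypothetical per-theme average dates (illustrative values only).
social_avg = pd.DataFrame({
    "theme": ["salmon", "kale"],
    "avg_date": pd.to_datetime(["2019-01-21", "2019-03-10"]),
})
search_avg = pd.DataFrame({
    "theme": ["salmon", "kale"],
    "avg_date": pd.to_datetime(["2019-02-10", "2019-03-01"]),
})

latency = social_avg.merge(search_avg, on="theme", suffixes=("_social", "_search"))
# Positive: the theme appeared on social media first; negative: in search first.
latency["latency_days"] = (
    latency["avg_date_search"] - latency["avg_date_social"]
).dt.days
print(latency[["theme", "latency_days"]])
print(latency["latency_days"].mean())
```

Averaging the `latency_days` column over all themes gives a mean latency between the two channels.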
The mean latency observed between search and sales data is around 48 days.
Fig 3.5.1 Latency between social-search
Fig 3.6.1 Dataset after aggregation
Fig 3.6.3. Model Output for Theme Salmon
Similarly, we created a model and obtained its summary for each theme in
our data. As we can see from the above 2 model statistics, for blueberry we
have per_unit_price_A and other_vendors_price as strongly significant factors,
and for salmon we have per_unit_price_A, price_per_lbs_A and units_per_lbs_A
coming out as significant. So, for each theme we can identify the key
factors that the client can control to increase its sales.
We also performed EDA on the final data and identified the features that have a
high correlation with the dependent variable, as well as the features that have
a very high correlation among themselves. If two features are highly correlated
with each other, we can include only one of them in our model instead
of both.
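One way to apply this rule is to scan the correlation matrix and drop the second feature of every highly correlated pair. The sketch below assumes pandas and a cut-off of 0.9, which is an illustrative threshold and not a value stated in the report; the data values and the `total_post` column are likewise hypothetical.

```python
import pandas as pd

def drop_correlated(features, threshold=0.9):
    """Keep one feature from each pair whose |correlation| exceeds threshold."""
    corr = features.corr().abs()
    cols = list(corr.columns)
    to_drop = set()
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in to_drop or b in to_drop:
                continue
            if corr.loc[a, b] > threshold:
                to_drop.add(b)  # keep the first feature, drop the second
    return features.drop(columns=sorted(to_drop))

# Hypothetical feature table (illustrative values only).
features = pd.DataFrame({
    "per_unit_price_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "price_per_lbs_A": [2.0, 4.0, 6.0, 8.0, 10.5],
    "total_post": [5.0, 1.0, 4.0, 2.0, 3.0],
})
reduced = drop_correlated(features)
print(list(reduced.columns))
```

Which feature of a pair to keep is a modelling decision; here the earlier column wins simply to keep the example deterministic.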
For our final model we selectively picked features that are significant across
themes and can be controlled by our vendor. We used Multiple Linear
Regression with backward elimination as the feature selection approach,
since there are multiple predictors. For feature elimination we
looked at the p-values after each modelling step; the variable with the highest
p-value above the threshold of 0.05 was removed from the model before the next
iteration.
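The elimination loop described above can be sketched with NumPy as follows. This is an illustration under stated assumptions: the p-values use a normal approximation to the t distribution (a regression library such as statsmodels would report exact t-based p-values), the helper names are hypothetical, and the data is synthetic and constructed so that the second feature has no relationship with the target.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Fit OLS; return coefficients and approximate two-sided p-values."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - k)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t_stats = beta / np.sqrt(np.diag(cov))
    # Assumption: normal approximation to the t distribution.
    p_values = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])
    return beta, p_values

def backward_elimination(X, y, names, threshold=0.05):
    """Repeatedly drop the predictor with the highest p-value above threshold."""
    names = list(names)
    X = np.column_stack([np.ones(len(y)), X])  # intercept in column 0
    while names:
        _, p = ols_p_values(X, y)
        worst = int(np.argmax(p[1:]))  # the intercept is never removed
        if p[1:][worst] <= threshold:
            break
        X = np.delete(X, worst + 1, axis=1)
        del names[worst]
    return names

# Synthetic data: y depends on the first feature only.
x1 = np.arange(8, dtype=float)
x2 = np.array([1, -1, -1, 1, 1, -1, -1, 1], dtype=float)
noise = 0.5 * np.array([1, 1, -1, -1, -1, -1, 1, 1], dtype=float)
y = 2 + 3 * x1 + noise

selected = backward_elimination(
    np.column_stack([x1, x2]), y, ["per_unit_price_A", "unrelated_feature"]
)
print(selected)
```

Refitting after every removal matters because dropping one predictor changes the p-values of the remaining ones.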
Fig 3.6.4 Final Model Output
Root Mean Square Error – It is the standard deviation of the residuals
(prediction errors). Residuals measure how far the data points are from the
regression line. RMSE tells us how spread out the residuals are, i.e. how
concentrated the data is around the line of best fit.
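For reference, RMSE can be computed alongside the R-square and Adjusted R-square fit metrics using their standard definitions. The sketch below uses only the standard library; the function name and the numbers are hypothetical.

```python
import math

def regression_metrics(actual, predicted, n_predictors):
    """Return (RMSE, R-square, Adjusted R-square) for a fitted model."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    rmse = math.sqrt(ss_res / n)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R-square penalises the number of predictors in the model.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return rmse, r2, adj_r2

# Hypothetical actual and predicted sales values (illustrative only).
actual = [10.0, 12.0, 14.0, 16.0, 18.0]
predicted = [11.0, 11.0, 14.0, 17.0, 17.0]
rmse, r2, adj_r2 = regression_metrics(actual, predicted, n_predictors=1)
print(round(rmse, 4), round(r2, 4), round(adj_r2, 4))
```

Adjusted R-square is the more useful of the two when comparing models with different numbers of predictors, as happens during backward elimination.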
CHAPTER 4: CONCLUSION
The project covers all the above-mentioned objectives through a series of data
pre-processing, exploratory data analysis, hypothesis testing and model
building steps. We were able to identify the factors that the company can
leverage to increase the sales of the products it manufactures under a
particular theme. This helps the client gain a better
understanding of their current position in the market, as they get to know their
potential competitors for each theme as well as how
their products are trending across social media channels like Twitter and
Facebook and search channels like Google and Amazon. The marketing teams can
work with this information and plan campaigns that help decrease
the latency in the shift of a trend from one channel to another. The aim is to
help the client achieve a fair % increase in their sales, and we address
this by identifying the controllable factors for each of
their themes on which they can work and thus gain better control over their
profit margin.