Project Report


Emerging Business Opportunities
for Food and Beverages Manufacturer
A report submitted to the Department of
Computer Science and Engineering
of
International Institute of Information Technology Bhubaneswar
in partial fulfilment of the requirements
for the degree of
Bachelor of Technology

by

Swayam Prakash Pal


(Roll- B516046)

&

Varun Singhal
(Roll- B516051)

under the supervision of


Prof. Swati Vipsita

Computer Science and Engineering


International Institute of Information Technology Bhubaneswar
Bhubaneswar Odisha - 751003, India
2020
International Institute of Information
Technology Bhubaneswar
Bhubaneswar Odisha -751 003, India. www.iiit-bh.ac.in

May 25, 2020

Undertaking

We declare that the work presented in this report titled Emerging Business
Opportunities for Food and Beverages Manufacturer, submitted
to the Department of Computer Science and Engineering, International
Institute of Information Technology, Bhubaneswar, for the award of the
Bachelor of Technology degree in Computer Science and Engineering,
is our original work. We have not plagiarized or submitted the same work for
the award of any other degree. In case this undertaking is found incorrect, we
accept that our degrees may be unconditionally withdrawn.

Swayam Prakash Pal


B516046

Varun Singhal
B516051

Certificate

This is to certify that the work in the report entitled Emerging Business
Opportunities for Food and Beverages Manufacturer by Swayam
Prakash Pal & Varun Singhal is a record of an original research work carried
out by them under my supervision and guidance in partial fulfilment of the
requirements for the award of the degree of Bachelor of Technology in
Computer Science and Engineering. Neither this thesis nor any part of it has
been submitted for any degree or academic award elsewhere.

Prof. Swati Vipsita


Acknowledgment

The elation and gratification of this project would be incomplete without
mentioning all the people who helped make it possible, and whose guidance
and encouragement were invaluable to us. We would like to thank God,
the almighty, our supreme guide, for bestowing his blessings upon us in
our entire endeavor. We express our sincere gratitude to Prof. Swati Vipsita
for her guidance and support, and to the students of our class for their
support and suggestions.

Swayam Prakash Pal

Varun Singhal
ABSTRACT
The food and beverages industry manufactures products across a variety of
themes ranging from fruits and poultry to organic products and seafood. Each
manufacturer has its own brand for a product of any particular theme; for
example, potato chips are sold under different brands such as Lay's by PepsiCo
and Bingo by ITC. The purpose of this project is to understand the growth
patterns of consumer preferences (themes) and to evaluate the positioning of
brands across different themes for the given client. By analysing the growth
patterns of themes, we can further identify the key drivers behind the sales of
these products.

The major challenge in this project was to map the data available across different
datasets. We had six major data sources, all product specific: the sales data of
products, search volume across different search engines, posts across different
social media channels, a product-to-theme mapping, a product-to-vendor
mapping, and so on. Our objective in this project is to understand the market
share of a particular vendor for any given theme and thus identify its potential
competitors across those themes. We also identified the themes that were
emerging in the different data sources (sales, social and search) and checked
whether there was any visible trend or order in which a theme appeared across
the data sources. Generally, a product first trends on social media due to the
advertisements run by the company; then, if people get interested in the product,
they search for it and compare the specifications, price, etc. before making a
final choice and ending up buying the product. We validated this trend using
hypothesis testing and also found the lag period for the shift of a trend from one
medium to another.

Finally, after all the data aggregations and transformations, model building was
carried out to find the key predictors that lead to increased sales of products for
a given theme.
Contents

Abstract v

1. Introduction 8
1.1 Data Analysis and Exploration……………………………………………………9

1.2 Multiple Linear Regression with Backward Elimination……………….10

2. Project Description and Overview 11

2.1 Project overview…………………………………………………………………11

2.2 Problem statement and description…………………………………….12

2.3 Project Constraints…………………………………………………………….13

2.4 Technologies used………………………………………………………………14

3. Data Analysis and Solution Approach 15

3.1 Data Understanding……………………………………………………15

3.2 Exploratory Data Analysis……………………………………………15


3.2.1 Market Share Analysis………………………………………….17

3.3 Finding Emerging Themes………………………………………......19

3.4 Hypothesis Testing for Flow of Trend………………………….20

3.5 Latency Observed in Trend Shift………………………………...22

3.6 Sales Model Building…………………………………………………23


3.6.1 Model Evaluation………………………………………………26
3.6.2 Model Insights…………………………………………………27

4. Conclusion 28

CHAPTER 1: INTRODUCTION

In this project we try to figure out the high-opportunity themes for a given
vendor, along with the key driving factors the vendor can leverage to
increase the sales of products across the given set of themes. The project can
be divided into several stages that act as a pipeline showing the flow of raw
data from the processing stage to the final modelling phase. The first
objective is to understand the data present across the different data sources:
the granularity of the data (weekly, monthly, yearly), outliers, missing values
and their treatment, and so on. Then we look for the fields by which we can
map in additional data sources to get the unique themes present in our sales,
social and search data. After getting the unique themes, we find the preferred
themes based on total sales, social posts and search volume, and identify the
themes that can be classified as 'emerging' in the given data sources. The next
objective is to calculate the overall market share of our vendor in comparison
to its competitors, which gives a better understanding of our vendor's hold on
the market; for some common themes we also compared the sales value of our
vendor to that of other vendors. Generally, when a new product is launched in
the market, it tends to follow a certain marketing trend before actual sales
start: paid advertisements in social media channels, paid ads in search
channels, etc. Here we tried to figure out what trend most of the themes
followed, whether a theme was first visible in social media and then searched,
or first searched in some search engine and then posted on social media. After
getting the trend and the lag between the different channels, we aggregate the
data sources and move on to the modelling phase. For modelling we have used
a Multiple Linear Regression with Backward Elimination based approach. A
more comprehensive explanation of all the steps is presented in the following
sections.

1.1. DATA ANALYSIS AND EXPLORATION


Data analysis can be defined as the process of cleaning, transforming, and
modelling data to discover useful information for business decision-making.
The purpose of data analysis is to extract useful information from data and to
make decisions based on it. Data-driven businesses make decisions based on
data, which means they can be more confident that their actions will bring
success, since there is data to support them. It is good practice to understand
the data first and gather as many insights from it as possible. By performing
Exploratory Data Analysis on our raw data, we can discover patterns, spot
anomalies or exceptions, test our hypotheses and check any assumptions with
the help of detailed summary statistics and graphical representations. In this
project we carried out EDA in the first phase, cleaning the data and looking
for outliers. We also plotted a few graphs that highlighted the outliers.
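The outlier check that a box plot visualises can be sketched with pandas using the common 1.5×IQR rule; the column values below are purely illustrative, not the project's actual data:

```python
import pandas as pd

# Hypothetical sales-units values; real field names in the project data may differ.
sales = pd.Series([120, 340, 410, 95, 4800, 260, 310, 5200, 180, 290])

print(sales.describe())  # the summary statistics gathered during EDA

# Flag outliers with the 1.5*IQR rule that underlies a box plot's whiskers.
q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
outliers = sales[(sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)]
print(outliers.tolist())  # the two extreme values stand out
```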

1.2. Multiple Linear Regression with Backward


Elimination

Multiple linear regression (MLR), also known simply as multiple regression, is
a statistical technique that uses several independent variables to predict the
outcome of a dependent variable. The goal of MLR is to model the linear
relationship between the explanatory (independent) variables and the response
(dependent) variable. It is an extension of ordinary least-squares (OLS)
regression, which involves only one independent variable.

As we have multiple independent variables or predictors, it is important to
include only the ones that are necessary. So, in this project we have used a
feature selection technique known as Backward Elimination to remove the
non-essential predictors from the model. Backward Elimination works in a few
simple steps:

1. Select a significance level. Usually a 5% significance level is chosen, so
the p-value threshold is 0.05.
2. Fit the model with all the selected predictors.
3. Identify the predictor with the highest p-value.
4. If the highest p-value is greater than the significance level, remove that
predictor and move to step 5. If it is less than the significance level,
move to step 6.
5. After removing the predictor, fit the model again with the remaining
features and jump back to step 3. This continues until we reach a point
in step 4 where the highest p-value among all remaining features is less
than the significance level selected in step 1.
6. We are done with the feature selection process, having used backward
elimination to filter out the features that were not significant enough
for our model.

CHAPTER 2: PROJECT
DESCRIPTION AND OVERVIEW

2.1 PROJECT OVERVIEW


A food and beverages manufacturer falls under the consumer-packaged goods
(CPG) industry. In our case the manufacturer releases products in the market
that vary over a large spectrum of themes such as vegetables, protein, poultry,
seafood, etc. Our vendor also has competitors in the market with a similar set
of products. So, the success rate of any new product launched by our vendor
depends upon certain controllable and non-controllable factors. The
controllable factors for a product can be price per unit, weight per unit,
demand in certain areas, discounts on certain products, etc. All of these can be
leveraged by the vendor to increase product sales, whereas non-controllable
factors such as other vendors' prices and sales values cannot.

So, by applying data analysis to the historical data available to us regarding
product sales and the mediums used to advertise the products (i.e. social
media and search channels), we can learn which product themes are popular
in the market; what the market share of our vendor is compared to its
competitors; whether there is any time lag in the shift of a trend from one
channel to another, so that the marketing team can focus on advertising to
reduce the gap; and which themes are profitable for the vendor and which
have a negative impact on profit. Finally, we can also determine the
controllable factors the client can leverage to increase their profit and by
what margin they can increase it.

2.2 PROBLEM STATEMENT AND DESCRIPTION
The problem statement of the project we were assigned was to understand the
growth patterns of consumer preferences (themes), i.e. different varieties of
food and beverages products, for a given CPG client and to evaluate the
positioning of their brand across different themes. Further, the client also
wants to understand the KPIs of their products which they could leverage to
increase product sales and, in return, increase the net profit of the company.

The problem statement can be broken down into 4 small subsections which
indicate the flow of our work.

1. The data provided for this project is of 3 different types. First, we have
the sales data for the client and competitors at a UPC level. Second, we
have the social media data, which has mentions of themes across all social
media platforms, and third, we have Google Search data, which has the
search volume information for all the themes. We found the number of
unique themes present in the 3 data sources and also looked at consumer
preferences. The main challenge here was to handle the time granularity,
as it varied from weekly to daily level data.

2. The second objective of the project was to assimilate the required data
sources after proper mapping of themes to products. After this step we get
product-level data on sales, number of posts and search volume, which
helps in further analysis. The client's and other vendors' market shares
were computed. We also looked for themes that contributed more towards
sales compared to other themes over the last 3 years and considered them
emerging themes.

3. The third objective involves performing any transformation or
aggregation required on the data. We created some new independent
variables from the existing set of predictor variables that might come out
as controllable factors during later analysis. After this step the dependent
variable was chosen, and the right modelling technique was selected
keeping in mind that this is a multivariate model.
4. The final objective of the project was to use Exploratory Data Analysis
and model results to pick out the themes that showed signs of high
business opportunity. KPIs to drive the sales of our client were also
identified. The main challenge for any manufacturer is to correctly
estimate by what margin their sales would increase if they made some
adjustment to their strategy. With the help of this model we get insights,
such as the coefficient values of the controllable factors, which can give a
nearly correct answer to this question.

2.3 PROJECT CONSTRAINTS


Quite naturally, the designed solution must also follow certain guidelines
enforced by the company:

• Other developers should easily understand the code base and be able to
pick up the project later without difficulties.
• Training the system's models must take a reasonable amount of time.
• A working model must be delivered with model fit metrics like R-square,
Adjusted R-square, etc. that assess model accuracy and performance.
• The models and key insights must be documented, and the insights
presented to the company's management.
• Production Python code must be well documented.

The project is therefore articulated around those constraints to not violate them.

2.4 TECHNOLOGIES USED IN PROJECT


• Python 3.7.3
• RStudio
• Python libraries:
o Matplotlib: used for data visualization
o Seaborn: used for data visualization
o Pandas: used for data manipulation
o scikit-learn: used for machine learning utilities
o NumPy: used for numerical computation
o datetime: used for handling dates and times
• R libraries:
o dplyr: used for data manipulation
o caTools: used for data splitting and basic statistics

CHAPTER 3: DATA ANALYSIS AND
SOLUTION APPROACH

3.1 DATA UNDERSTANDING


The data provided by the client for this project can be divided into two
categories: one set of files is used for mapping, to relate products to the
themes they belong to, and the other set gives us information regarding sales,
search volume and social media posts at product-level granularity. We went
through the files in depth and later also added a few new variables for our
analysis based on the existing variables.

3.2 Exploratory Data Analysis


Exploratory Data Analysis (EDA) is a method for analysing datasets to
summarize their main characteristics, often with visual methods. It is used
to see what the data can tell us before the modelling task. Using EDA, we
obtained the summary statistics of our 3 major data sources.

Fig 3.2.1. Stats from Sales Dataset

Fig 3.2.2. Missing values data

From the above chart we can infer that there are no missing values in this
data source. We also generated box plots and scatter plots using the sales
units value and sales dollars value fields.

Fig 3.2.3. Box plot for sales units

From the box plot we can tell that the median falls below the 500 mark while
the outliers lie above the 4,000 mark. A scatter plot was generated between
the sales dollars value and the sales lbs value, as it captures the relationship
between two continuous variables.

Fig 3.2.4. Scatter Plot (dollar’s vs lbs)

From the scatter plot we can see that as the lbs value increases there is an
almost linear increase in the sales dollars value. Similarly, we have also
plotted box plots for the search and social media data and identified the
outlier regions.

Fig 3.2.5. Box plot for Search Media Fig 3.2.6. Box plot for Social Media

From the box plots we observe that for search media the median lies around
the 1,000 mark while the minimum value is 1, and outliers fall above the
12,000 mark. Similarly, for social media the median falls around the 50 mark
while outliers lie beyond the 800 mark. For search media we have no missing
values, while for social media we observe some missing values in the theme
id column.

3.2.1 Market Share Analysis


The next challenge was to get the overall market share of our client in
comparison to the other vendors. For this analysis we chose the sales units
value as the comparison field and generated a pie chart. The client is referred
to as vendor A.

Fig 3.2.1.1 Pie chart for Vendor Market Share

From the chart we can conclude that the client's share in the market is 25.1%,
which is fairly good compared to the other vendors. We also identified
vendors who can be treated as potential competitors for our client at a
theme-level granularity. For better analysis we plotted bar plots showing
vendors and the sales dollars value for a particular theme.

Fig 3.2.1.2 Bar plot for theme salmon

From the above bar plot we can say that vendor A outperforms its competitors
for the theme salmon. Similar bar plots were generated for all unique themes
of our client.

3.3 Finding Emerging Themes


The next objective in our project was to pick out the themes that are emerging
across all the data sources. Emerging themes cannot simply be the themes with
the highest sales units in one year; they should have been performing
consistently well over the past few years. Themes can also be considered
emerging if their sales were poor in the past but have come out strongly in
recent years and are in high demand. So, for any given theme we calculated
its overall percentage market share for each year and assigned it a rank
accordingly. We had sales, search and social media data for 4 years, from
2016 to 2019.

For each data source, at a theme level, we first calculated the percentage share
and assigned the rank accordingly. Next, we applied the rule that with each
passing year the % share should increase while the rank should decrease
(i.e. improve), and filtered out only those themes that followed this rule. After
obtaining the resultant data we plotted 3 bar charts, one for each data source,
to depict the emerging themes more easily.
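The share-and-rank filter described above can be sketched with pandas; the theme names and sales figures below are made up for illustration, and the real data spans 2016-2019 rather than two years:

```python
import pandas as pd

# Hypothetical yearly sales by theme (two years shown for brevity).
df = pd.DataFrame({
    "year":  [2018, 2018, 2018, 2019, 2019, 2019],
    "theme": ["salmon", "low carb", "poultry"] * 2,
    "sales": [100, 300, 600, 250, 450, 300],
})

# Percentage share of each theme within its year, and rank (1 = largest share).
df["share"] = df["sales"] / df.groupby("year")["sales"].transform("sum") * 100
df["rank"] = df.groupby("year")["share"].rank(ascending=False)

# Emerging: the share rises while the rank improves (numerically decreases).
wide = df.pivot(index="theme", columns="year", values=["share", "rank"])
emerging = wide[(wide["share"][2019] > wide["share"][2018]) &
                (wide["rank"][2019] < wide["rank"][2018])].index.tolist()
print(emerging)  # only 'low carb' gains share and climbs in rank
```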

Fig 3.3.1 Emerging themes in Sales data

Fig 3.3.2 Emerging themes in Search data

From the above bar charts we can see the themes which have come out as
emerging over the period of 4 years. The client can leverage this information
to find out which themes are going out of demand and figure out how to
increase their sales and convert them into emerging themes.

3.4 Hypothesis Testing for Flow of Trend

The next step in the project was to validate the hypothesis that trends flow
from social to search to sales, i.e. any product first trends on social media,
then people start searching for it in different search engines, and finally they
decide whether to buy it or not. To validate this hypothesis we made use of
the sales, posts and search volume information.

We assumed that if the total posts for a theme increase or decrease from one
year to the next, there would be a similar impact on the search volume and
sales units value. We formulated our null and alternative hypotheses based on
this.

Null hypothesis: There is no trend between the data sources, i.e. a change in
total posts does not affect the change in search volume or the change in sales
units for a particular theme.

Alternative hypothesis: There exists a trend between the data sources; total
posts, search volume and sales units value for a particular theme are related
to each other.

We used Student's t-test for hypothesis testing and calculated both the t-value
and the p-value. A significance level of 0.05 was chosen, corresponding to a
95% confidence level. If the calculated p-value is greater than 0.05 we fail to
reject the null hypothesis; otherwise we reject it. We used a paired t-test,
first between social media posts and search volume, then between search
volume and sales units value.
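A paired t-test of this kind can be sketched with SciPy's `ttest_rel`; the arrays below are synthetic stand-ins, not the project's actual theme-level aggregates:

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations per theme: social-post changes vs
# search-volume changes (synthetic data for illustration only).
rng = np.random.default_rng(42)
post_change = rng.normal(loc=5.0, scale=1.0, size=30)
search_change = post_change * 2 + rng.normal(scale=0.5, size=30)

t_stat, p_value = stats.ttest_rel(post_change, search_change)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis at the 5% significance level.")
else:
    print("Fail to reject the null hypothesis.")
```

The second paired test described in the text (search volume vs sales units) follows the same pattern with the other pair of columns.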

Fig 3.4.1 p-values for common themes

The above results are for themes where the calculated p-values are less than
0.05, i.e. the themes for which we reject the null hypothesis. These themes
follow the trend from social to search and then to sales. After the testing we
also found a few themes that do not follow the trend: their p-values are
greater than 0.05, so we fail to reject the null hypothesis for them.

3.5 Latency Observed in Trend Shift


Latency can be defined as the gap, in number of days, observed when a
particular theme moves from one channel to another, where channel refers to
social, search and sales. This information would be highly beneficial for a
vendor, as they can properly plan their advertising or promotional agendas so
as to decrease the latency and, in return, increase demand and sales in a
shorter period of time.

From preliminary data analysis we found there was a lot of noise in the
datasets, as the starting and ending dates of themes varied significantly across
them; the difference could go up to a year or more in certain cases. Hence, to
fix this ambiguity, we used a method of Weighted Days in which, for each
theme in a particular dataset, we found a corresponding average date. The
average date calculation proceeds as follows:
We start with 2 data sources, say the social media and search data sources.
For the social media data we first find the minimum date for a theme, i.e. its
first date of occurrence, and consider this our reference date; the same is done
for the Google Search data. Next, we declare two lists, l1 and l2. In l1 we
store the minimum dates for a given theme in both datasets, and in l2 the
maximum dates for the theme in both datasets. We then filter all dates for the
theme on the condition that the date is greater than max(l1) and less than
min(l2), and store them in a temporary dataframe.

We then assign weights to the filtered dates by multiplying them with the
corresponding posts/search volume value:

temp['weight'] = temp['total_post'] * temp['days_diff']

Here days_diff = actual date − reference date for the given theme.

We then calculate the number of days by dividing the sum of weights by the
sum of posts. Adding the resulting number of days to the minimum reference
date gives the average date corresponding to the theme. This process is then
repeated for the search and sales data, giving a similar average date for each
dataset.
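The Weighted Days calculation above can be sketched as follows, reusing the weight formula from the text; the dates and post counts are hypothetical:

```python
import pandas as pd

# Hypothetical filtered daily posts for one theme in one channel.
temp = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2019-01-03", "2019-01-11"]),
    "total_post": [10, 30, 60],
})

ref_date = temp["date"].min()                        # first date of occurrence
temp["days_diff"] = (temp["date"] - ref_date).dt.days
temp["weight"] = temp["total_post"] * temp["days_diff"]

# Weighted number of days, added back to the reference date.
avg_days = temp["weight"].sum() / temp["total_post"].sum()
avg_date = ref_date + pd.Timedelta(days=round(avg_days))
print(avg_date.date())  # average date skewed toward the high-post days
```

The latency between two channels is then the difference between their per-theme average dates.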

For the latency between two datasets, we subtract the average date column of
one dataset from that of the other, giving the latency in days. The sign of the
latency may be positive or negative depending on which dataset the theme
appeared in first. The mean latency observed between the search and sales
data is around 48 days.

Fig 3.5.1 Latency between social-search

Fig 3.5.2 Latency between search-sales

3.6 Sales Model Building


The final step of our project is the model building stage. We started by
aggregating the data sources so as to get all the independent variables and the
dependent variable (sales_units_value) in one dataset. Keeping the sales data
as our base dataset, we added total posts from the social media dataset and
search volume from the search dataset. We also applied feature engineering to
create new independent variables derived from the existing ones. These
variables were specific to both our client and its competitors. The variables
created are price_per_unit, price_per_lbs and units_per_lbs. The dependent
variable chosen for our analysis is sales_units_value for our client, i.e.
Vendor A.

Fig 3.6.1 Dataset after aggregation

After the aggregation and independent variable creation, we start model
building using a multivariate regression approach. As we have different
themes in the final data, each theme may have its own particular set of
variables that come out as significant after modelling, so we created one
model per unique theme, i.e. 18 different models.

Fig 3.6.2. Model Output for Theme Blueberry

Fig 3.6.3. Model Output for Theme Salmon

Similarly, we created a model and obtained its summary for each theme in
our data. As we can see from the above 2 model statistics, for blueberry we
have per_unit_price_A and other_vendors_price as strongly significant
factors, and for salmon we have per_unit_price_A, price_per_lbs_A and
units_per_lbs_A coming out as significant. So, for each theme we can
identify the key factors that the client can control to increase its sales.

We also performed EDA on the final data and found the features that have a
high correlation with the dependent variable, as well as the features that had a
very high correlation among themselves. If two features are highly correlated
with each other, we can include only one of them in our model instead of
both.

For our final model we selectively pick features that are significant across
themes and can be controlled by our vendor. We used Multiple Linear
Regression with backward elimination as the feature selection approach, as
this is a case of multivariate regression. For feature elimination we looked at
the p-values after each modelling step, and the variable with the highest
p-value above the threshold of 0.05 was removed from the model for the next
iteration.

Fig 3.6.4 Final Model Output

3.6.1 Model Evaluation


For model evaluation we used the R-square and adjusted R-square metrics,
along with the residual standard error and root mean square error (RMSE).

• R-Square – A goodness-of-fit measure for linear regression models. This
statistic indicates the percentage of the variance in the dependent variable
that the independent predictors can explain collectively. It measures the
strength of the relationship between the model and the dependent variable
on a 0-100% scale. Our model gives an R-Squared value of around 97%,
which is a very good value on average.

• Adjusted R-Squared – A slightly modified version of R-Squared that has
been adjusted for the number of predictors in the model. It increases only
if a new variable improves the model more than would be expected by
chance, and decreases if it improves the model less than expected. The
value can be negative, and it is always lower than the R-Squared value.
The model gives an Adjusted R-Squared value of around 96%.

• Root Mean Square Error – The standard deviation of the residuals, or
prediction errors. Residuals are a measure of how far the data points are
from the regression line. RMSE tells us how spread out the residuals are,
i.e. how concentrated the data is around the line of best fit.
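The three metrics can be computed directly from their definitions with NumPy; the actual-vs-predicted values below are illustrative only, not the project's model output:

```python
import numpy as np

# Hypothetical actual vs predicted sales units for illustration.
y_true = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
y_pred = np.array([110.0, 140.0, 205.0, 245.0, 310.0])
n, p = len(y_true), 2  # p = number of predictors in the fitted model

ss_res = np.sum((y_true - y_pred) ** 2)              # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)       # total sum of squares
r2 = 1 - ss_res / ss_tot                             # R-Squared
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)        # penalised for p predictors
rmse = np.sqrt(ss_res / n)                           # root mean square error

print(round(r2, 3), round(adj_r2, 3), round(rmse, 3))
```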

3.6.2 Model Insights


The insights that can be drawn from the model output are that the themes that
are significant and contribute to high sales for our client are blueberry, low
carb, salmon, soy foods, and no additives/preservatives.

The controllable factor that came out as significant is per_unit_price_A, and
its estimate is negative, which means that if the per-unit price for a theme
increases, the sales units decrease. By building models for each theme we
obtain the controllable factors for each theme; hence the client can adjust
these factors accordingly and get a desired % increase in sales.

CHAPTER 4: CONCLUSION

In the food and beverages industry there is always competition between
different vendors to get the best product out in the market and hold a fair
amount of market share. To achieve this, a company has to analyse market
trends properly: which products are in high demand currently, which products
are trending across different social and search channels, what changes they
can bring in to increase their profit, and by what margin they can increase it.

The project covers all the above-mentioned objectives through a series of data
pre-processing, exploratory data analysis, hypothesis testing and model
building steps. We are successful at predicting which factors the company can
leverage to increase the sales of the products they manufacture belonging to a
particular theme. This helps the client immensely to better understand their
current position in the market, as they get to know their potential competitors
for each theme, along with information on how their products are trending
across social media channels like Twitter, Facebook, etc., and search channels
like Google, Amazon, etc. The marketing teams can work with this
information and plan campaigns that would help decrease the latency in the
shift of a trend from one channel to another. The aim here is to help the client
achieve a fair % increase in their sales, and we achieve this by helping them
know the possible controllable factors for each of their themes, on which they
can work and thus have better control over their profit margin.
