Data Science: Sales Forecasting For Marketing
1. INTRODUCTION
1.1. Data science
Data science is the process of deriving knowledge and insights from a huge and diverse set of
data through organizing, processing and analyzing the data. It involves many different
disciplines, such as mathematical and statistical modeling, extracting data from its source and
applying data visualization techniques. Often it also involves handling big data technologies
to gather both structured and unstructured data.
Below we will see some example scenarios where Data science is used.
Recommendation system: Create models predicting the shopper’s needs and show
the products the shopper is most likely to buy.
Financial Risk management: The financial risk involved in loans and credit is better
analysed by using the customer's past spending habits, past defaults and other financial
commitments. The outcome is minimizing loss for the financial organization by
avoiding bad debt.
Improvement in Health Care services: The health care industry deals with a variety
of data which can be classified into technical data, financial data, patient information,
drug information and legal rules. All this data needs to be analysed to produce insights
that will save cost both for the health care provider and care receiver.
Computer Vision: The advancement in recognizing an image by a computer
involves processing large sets of image data from multiple objects of the same category.
For example, face recognition.
The programming requirements of data science demand a versatile and flexible language
in which code is simple to write but which can handle highly complex mathematical processing.
Python is well suited to such requirements, as it has already established itself as a
language for both general and scientific computing. Moreover, it is being
continuously extended through new additions to its plethora of libraries aimed at different
programming requirements.
Machine learning is a discipline that deals with programming the systems so as to make them
automatically learn and improve with experience. Here, learning implies recognizing and
understanding the input data and taking informed decisions based on the supplied data. It is
very difficult to consider all the decisions based on all possible inputs.
To solve this problem, algorithms are developed that build knowledge from specific data
and past experience by applying the principles of statistical science, probability, logic,
mathematical optimization, reinforcement learning, and control theory.
For example, machine learning programs can scan and process huge databases detecting
patterns that are beyond the scope of human perception.
The developed machine learning algorithms are used in various applications such as
Vision processing
Language processing
Pattern recognition
Games
Data mining
Expert systems
Robotics
Unsupervised Learning
Reinforcement Learning
Supervised Learning:
Supervised learning involves building a machine learning model that is based on labeled
samples. Learning data comes with description, labels, targets or desired outputs and the
objective is to find a general rule that maps inputs to outputs. This kind of learning data is
called labeled data.
For example, if we build a system to estimate the price of a plot of land or a house based on
various features, such as size, location, and so on, we first need to create a database and label
it. We need to teach the algorithm what features correspond to what prices. Based on this
data, the algorithm will learn how to calculate the price of real estate using the values of the
input features.
Regression trains on and predicts a continuous-valued response, for example predicting real
estate prices.
Regression algorithms:
Linear regression
Polynomial Regression
Classification algorithms:
Logistic regression
Random Forests
Unsupervised Learning:
Unsupervised learning algorithms are extremely powerful tools for analyzing data and for
identifying patterns and trends. They are most commonly used for clustering similar inputs
into logical groups. Unsupervised learning algorithms include
Clustering algorithms
K-means
Reinforcement Learning
Here the learning data gives feedback so that the system adjusts to dynamic conditions in order
to achieve a certain objective. The system evaluates its performance based on the feedback
responses and reacts accordingly. The best-known instances include self-driving cars and
AlphaGo, the program that plays the board game Go.
1.3. Project Introduction
The sales forecast indicates how much of a particular product is likely to be sold
in a specified future period in a specified market at a specified price. Sales forecasting is the
process of making such predictions based on a dataset of past observations. Causal forecasting
attempts to predict a variable by trying to explain what factors cause it to change. For
example, a causal model to predict market demand for a product might use the product's price,
competitors' prices and the amount of money spent on advertising to explain what product
demand might be six months in the future.
For this prediction we use the K-Means and Linear Regression machine learning
algorithms, where K-Means is a clustering algorithm and Linear Regression is a regression
algorithm. Regression models a target prediction value based on independent variables. It is
mostly used for finding out the relationship between variables and for forecasting.
Here we are using a dataset called “Advertising”. To do a data science project we must know
some Python libraries such as
NumPy
Pandas
Scikit-learn
Matplotlib
and IDEs such as
Jupyter
Spyder
2. INSTALLATIONS
2.1 ANACONDA:
Anaconda is a package manager, an environment manager,
and Python distribution that contain a collection of many open source packages. This is
advantageous as when you are working on a data science project, you will find that you need
many different packages (NumPy, Scikit-learn, SciPy, pandas to name a few), which an
installation of Anaconda comes preinstalled with.
1. Go to the Anaconda Website and choose a Python 3.x graphical installer (A) or a Python
2.x graphical installer (B). If you aren't sure which Python version you want to install, choose
Python 3. Do not choose both.
7. Click on Next.
8. Click on Next
9. Click on Finish.
Anaconda provides various IDEs, such as Jupyter and Spyder, which you can launch and use
once the installation is complete.
Jupyter:
The Jupyter Notebook is an incredibly powerful tool for interactively developing and
presenting data science projects.
A notebook integrates code and its output into a single document that combines
visualisations, narrative text, mathematical equations, and other rich media.
It is possible to use many different programming languages within Jupyter Notebooks;
this report focuses on Python, as it is the most common use case.
Spyder:
Spyder was developed specifically for data science.
Spyder is an open source, cross-platform IDE for data science.
Spyder integrates essential data science libraries such as
IPython, SciPy, Matplotlib and NumPy.
Spyder has features such as code completion, a text editor with syntax highlighting, and
a variable explorer whose values you can edit using a GUI.
It includes an online help browser that allows users to search and view Python and package
documentation inside the IDE.
Libraries:
NumPy:
Pandas:
Pandas is a Python module that contains high-level data structures and tools designed
for fast and easy data analysis operations.
Pandas is built on NumPy, which makes it easy to use in NumPy-centric
applications.
Pandas also makes it easy to handle missing data, and it is a well-suited tool for
data munging.
Matplotlib:
Scikit-Learn:
Hardware Requirements:
Processor : i3 or higher
Processor Speed : minimum 500 MHz
Hard Disk : minimum 30 GB
Input Devices : Keyboard, Mouse
RAM : 8 GB or higher
Software Requirements:
1. Data Inspection
2. Data Scrubbing
3. Data Exploring
4. Model Building
5. Model Evaluation
4.1. Data Inspection:
The first step in sales prediction is to load the relevant data. The location and
structure of the data to be analyzed will vary from person to person and application to
application.
import pandas as pd
advertising = pd.DataFrame(pd.read_csv("advertising.csv"))  # load the CSV file into a data frame
CSV (Comma Separated Values) is a common file format for transferring and storing
data, in which commas separate the columns and newlines
separate the rows.
Data frames are the Pandas data type for storing 2-dimensional data in which each column
contains values of one variable and each row contains one set of values from each
column.
Data frames support methods for slicing and dicing the data by rows and
columns; slicing is used for viewing and segmenting the data.
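As a minimal sketch of the loading and slicing described above, the snippet below reads CSV text into a data frame and selects columns and rows. The column names (TV, Radio, Newspaper, Sales) and the four sample rows are assumptions made for illustration; an in-memory string stands in for the real advertising.csv file so that the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical sample rows standing in for advertising.csv (values are made up).
csv_text = """TV,Radio,Newspaper,Sales
230.1,37.8,69.2,22.1
44.5,39.3,45.1,10.4
17.2,45.9,69.3,9.3
151.5,41.3,58.5,18.5
"""

# read_csv accepts a file path or any file-like object and returns a DataFrame.
advertising = pd.read_csv(io.StringIO(csv_text))

print(advertising.shape)    # (number of rows, number of columns)
print(advertising.head(2))  # first two rows

# Slicing: pick columns by name, or rows by position.
tv_and_sales = advertising[["TV", "Sales"]]
first_rows = advertising.iloc[0:2]
```

With the real file, the same call would take the file path in place of the `io.StringIO` object.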
4.2. Data Scrubbing:
It is the process of identifying and correcting inaccurate data from a dataset. In our
dataset there are incomplete and incorrect data that we have cleaned by using some
libraries.
After cleansing, a dataset should be consistent with other similar data sets in the
system. The inconsistencies detected or removed may have been originally caused by
user entry errors, by corruption in transmission or storage, or by different data
dictionary definitions of similar entities in different stores. Data cleaning differs from
data validation in that validation almost invariably means data is rejected from the
system at entry and is performed at the time of entry, rather than on batches of data.
3. Remove duplicates
4. Highlight Errors
Our dataset, advertising.csv, contains some missing values in the newspaper attribute.
Those missing values are represented as NaN (not a number). By scrubbing the dataset we fill
those missing values with the mean value of the newspaper attribute.
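The mean-fill step can be sketched as follows. The small data frame and the column name `Newspaper` are assumptions for illustration, not the project's actual data.

```python
import numpy as np
import pandas as pd

# Made-up rows; two Newspaper values are missing (NaN).
advertising = pd.DataFrame({
    "Newspaper": [69.2, np.nan, 45.1, np.nan, 58.4],
    "Sales": [22.1, 10.4, 9.3, 18.5, 12.9],
})

missing_before = int(advertising["Newspaper"].isna().sum())

# Series.mean() skips NaN by default, so this is the mean of the known values.
newspaper_mean = advertising["Newspaper"].mean()

# Replace every NaN in the column with that mean.
advertising["Newspaper"] = advertising["Newspaper"].fillna(newspaper_mean)

missing_after = int(advertising["Newspaper"].isna().sum())
print(missing_before, missing_after)
```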
SAI SPURTHI INSTITUTE OF TECHNOLOGY Page 20
4.3. Data Analysis:
Data exploring is the first step in data analysis and typically involves summarizing the
main characteristics of a dataset including its size, accuracy and attributes.
We use box plots to analyze the attributes. A box plot is a graph that gives a
good indication of how the values in the data are spread out. The dataset consists of predictor
variables, which are the one or more variables used to determine the target variable, and a
target variable, whose values are to be modeled and predicted from the other variables.
A box plot is suitable for comparing the range and distribution of numerical data.
Univariate Analysis:
It is the simplest form of data analysis. “Uni” means one; in other words, the
data has only one variable.
Outlier Analysis:
An Outlier is a rare chance of occurrence within a given data set. In Data Science, an
Outlier is an observation point that is distant from other observations. An Outlier may
be due to variability in the measurement or it may indicate experimental error.
Outliers, being the most extreme observations, may include the sample maximum or
sample minimum, or both, depending on whether they are extremely high or low.
However, the sample maximum and minimum are not always outliers because they
may not be unusually far from other observations.
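One common way to flag outliers, and the rule a box plot typically uses for its whiskers, is to mark values lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to made-up numbers; the 1.5 multiplier is a convention, not something fixed by this report.

```python
import numpy as np

# Made-up measurements; 95.0 is deliberately far from the rest.
values = np.array([12.0, 15.0, 14.0, 13.0, 16.0, 15.0, 14.0, 95.0])

# Quartiles: the same quantities a box plot draws as the box and its median line.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers)
```

In this data the sample maximum is flagged while the sample minimum is not, which illustrates that extremes are not automatically outliers.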
Correlation Matrix:
A correlation matrix is a table showing correlation coefficients between variables.
Each cell in the table shows the correlation between two variables. A correlation matrix is
used as a way to summarize data, as an input into a more advanced analysis, and as a
diagnostic for advanced analyses.
What is Correlation?
It can be useful in data analysis and modeling to better understand the relationships between
variables. The statistical relationship between two variables is referred to as their correlation.
A correlation could be positive, meaning both variables move in the same direction, or
negative, meaning that when one variable’s value increases, the other variable’s value
decreases. Correlation can also be neutral or zero, meaning that the variables are unrelated.
We may also be interested in the correlation between the input variables and the output variable
in order to provide insight into which variables may or may not be relevant as input for
developing a model.
The structure of the relationship may be known, e.g. it may be linear, or we may have no idea
whether a relationship exists between two variables or what structure it may take. Depending
on what is known about the relationship and the distribution of the variables, different
correlation scores can be calculated.
How to Use Correlation to Understand the Relationship between Variables
There may be complex and unknown relationships between the variables in your dataset.
It is important to discover and quantify the degree to which variables in your dataset are
dependent upon each other. This knowledge can help you better prepare your data to meet the
expectations of machine learning algorithms, such as linear regression, whose performance
will degrade with the presence of these interdependencies.
1. To summarize a large amount of data where the goal is to see patterns. In our
example above, the observable pattern is that all the variables highly correlate with
each other.
2. To input into other analyses. For example, people commonly use correlation matrices
as inputs for exploratory factor analysis, confirmatory factor analysis, structural
equation models, and linear regression when excluding missing values pairwise.
3. As a diagnostic when checking other analyses. For example, with linear regression a
high amount of correlations suggests that the linear regression’s estimates will be
unreliable.
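A correlation matrix of this kind can be computed with the pandas `DataFrame.corr` method, which uses the Pearson coefficient by default. The data below is synthetic and generated only for illustration: the column names and the coefficients used to build `Sales` are assumptions, not values from the project.

```python
import numpy as np
import pandas as pd

# Synthetic data: Sales depends mostly on TV, a little on Radio, not on Newspaper.
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, size=200)
radio = rng.uniform(0, 50, size=200)
newspaper = rng.uniform(0, 100, size=200)
sales = 7.0 + 0.05 * tv + 0.1 * radio + rng.normal(0, 1.0, size=200)

df = pd.DataFrame({"TV": tv, "Radio": radio, "Newspaper": newspaper, "Sales": sales})

# Pairwise Pearson correlation between all columns.
corr = df.corr()
print(corr.round(2))

# Which input is most strongly correlated with the output?
strongest = corr["Sales"].drop("Sales").abs().idxmax()
print(strongest)
```

Reading the `Sales` column of the matrix gives a quick indication of which inputs may be relevant for a model.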
4.4. Model Building:
K-Means clustering is a type of unsupervised machine learning that groups data on the basis
of similarities. K-Means is one technique for finding subgroups within datasets. One
difference between K-Means and some other clustering methods is that in K-Means the
number of clusters is fixed in advance, whereas some other techniques do not require that we
predefine the number of clusters. The algorithm begins by randomly assigning each data point to a
specific cluster, with no data point belonging to more than one cluster. It then calculates the
centroid, or mean, of the points in each cluster.
K-Means is a popular clustering algorithm used for unsupervised machine learning. In this
example, we feed the 200 records of the advertising data into a K-Means algorithm developed in
Python using Pandas, NumPy and Scikit-learn, and cluster the data based on similarities between
the data points.
K-means working:
K-means simply partitions the given dataset into various clusters (groups) with different
features.
How exactly?
K refers to the total number of clusters to be defined in the entire dataset. There is a centroid
chosen for a given cluster type which is used to calculate the distance of a given data point.
The distance essentially represents the similarity of features of a data point to a cluster type.
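To make the assign-and-update idea concrete, here is a minimal NumPy sketch of the k-means loop. It is an illustration, not the project's code: the function name `kmeans`, the synthetic two-group data, and the choice to initialise centroids from randomly picked data points (rather than from a random assignment of points to clusters) are all assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Illustrative k-means: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Euclidean distance from every point to every centroid, shape (n, k).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Assignment step: each point joins its nearest centroid.
        labels = distances.argmin(axis=1)
        # Update step: each centroid becomes the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Synthetic data: two groups of 50 two-dimensional points.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal([0, 0], 0.5, size=(50, 2)),
    rng.normal([10, 10], 0.5, size=(50, 2)),
])
labels, centroids = kmeans(X, k=2)
print(centroids)
```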
Why use K-means?
Using k-means, the data is clustered after analyzing the data itself, rather than by
assigning it to groups in advance based on pre-defined labels. Each
centroid is a collection of feature values that essentially represents the type of cluster
it belongs to. Thus a centroid can be used to interpret the type of cluster
formed.
Real life applications:
1. Segmentation for customer retail dividing into clusters for various data analytics
application for e.g. for knowing loyal customers, for analyzing the spending behavior
of a customer or needs of certain types of customer.
2. Fraud detection for cyber crime frauds.
3. MP3 files, cellular phones are the general areas that use this technique.
Constraints:
Only numerical data can be used. K-means is easiest to inspect on 2-dimensional
numerical data, since visualization is possible only for 2D or 3D data. But in reality there are
usually multiple features to be considered at a time.
Multi-dimensional data can still be used, but dimensionality reduction is often performed on
the data before applying k-means.
Finding the optimum value of k, that is, how many clusters the data should be grouped
into, is done by testing different values of k and looking for the “elbow point”:
we plot the mean distance between each point in a cluster and its centroid against
the value of k used for the test, which helps determine a suitable number of clusters for this
dataset.
The steps below describe the code in detail and how these packages were used:
Pandas and NumPy are packages for scientific computing with Python, and Scikit-learn is a package
for machine learning; from sklearn.cluster we use KMeans for clustering, from
matplotlib we use the pyplot package to create a visual representation of the data, and from
SciPy we use the Euclidean distance calculation for the elbow method. The data comes from the
advertising.csv file.
2. Importing and preparing data
Here we import the data from the advertising.csv file into a Pandas data frame object with column
names f1 and f2, where f1 refers to TV, Newspaper, Radio and f2 refers to Sales.
Then we combine the two vectors f1 and f2 into a NumPy array object named X. X
represents our primary data model to fit with the Scikit-learn KMeans algorithm later.
Here we calculate an initial set of random centroids for a K value of 2, and plot both the raw
data and the initial centroids on a scatter plot. K represents the number of clusters; we start by
setting it to 2. The NumPy package is used to generate random values and assign them to the centroids.
In the output of this code, the initial centroids are indicated as green stars and green dots
represent the raw data. This is what the initial data looks like before running clustering.
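The data preparation and random initial centroids described above might look like the sketch below. The five rows are made up, TV is used as the single f1 attribute purely for illustration, and drawing each centroid coordinate uniformly within the observed range of its feature is one possible choice, assumed here rather than taken from the project code.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the advertising data.
df = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8],
    "Sales": [22.1, 10.4, 9.3, 18.5, 12.9],
})

f1 = df["TV"].values      # advertising attribute
f2 = df["Sales"].values   # target attribute
X = np.column_stack((f1, f2))  # array of shape (n_samples, 2)

k = 2  # number of clusters
rng = np.random.default_rng(42)

# One random centroid per cluster, each coordinate drawn uniformly
# between the minimum and maximum of the corresponding feature.
centroids = rng.uniform(X.min(axis=0), X.max(axis=0), size=(k, 2))
print(centroids)
```

A scatter plot of `X` together with `centroids` (for example with matplotlib's `pyplot.scatter`) would then show the starting position before clustering.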
4. Build K-Means Model to Run with a given K-value
Provided we have the data in required format “X”, here we create the K-Means model and
specify the number of clusters (K value). Next, we fit the model to the data, generate cluster
labels and compute centroids.
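A hedged sketch of this step with scikit-learn's `KMeans` follows. The array `X` is synthetic (two made-up groups of points), and the parameter values `n_init=10` and `random_state=0` are assumptions chosen for reproducibility, not settings stated in this report.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for X: two well-separated groups of 100 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([50, 8], 5.0, size=(100, 2)),
    rng.normal([250, 20], 5.0, size=(100, 2)),
])

# Create the model with the chosen number of clusters (K value) and fit it.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(X)

labels = model.labels_              # cluster index assigned to each row of X
centroids = model.cluster_centers_  # one centroid per cluster
print(centroids)
```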
Now we run the K-Means algorithm for K values 2, 4 and 6. We take the K value as 2
because we want to show the relationship between the attributes in the given dataset. For
each K value, we compute the centroids and keep iterating until the error value
reaches zero. In other words, for each cluster, when the distance between the old centroid
value and the newly calculated centroid value becomes zero, we stop. This is the stopping
criterion for calculating the centroids of each cluster accurately. However, for coding purposes,
we don’t need to do the error calculation explicitly, as it is already taken care of inside the
Scikit-learn KMeans model.
We ran the code with k = 2, 4 and 6. The squared Euclidean distance measures the distance
between each data point and the centroid, and the centroids are recalculated until the stopping
criterion is met.
6. Selecting optimum k
The technique we use to determine optimum K, the number of clusters, is called the
elbow method.
By plotting the number of centroids and the average distance between a data point and the
centroid within the cluster we arrive at the following graph.
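The mathematics behind clustering, in very simple terms, involves minimizing the within-cluster
sum of squares (WCSS), the sum of squared distances between each cluster centroid and its
associated data points:
$$\mathrm{WCSS} = \sum_{j=1}^{K} \sum_{x_i \in S_j} \lVert x_i - C_j \rVert^2$$
K = number of clusters
C_j = centroid of cluster j
S_j = set of data points assigned to cluster j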
As the number of clusters increases, the WCSS keeps decreasing. The decrease of WCSS
is initially steep and then the rate of decrease slows down, resulting in an elbow plot. The
number of clusters at the elbow usually gives an indication of the optimum
number of clusters. This, combined with specific knowledge of the business requirement,
should be used to decide on the optimum number of clusters.
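The elbow method can be sketched by fitting K-Means for several values of k and recording the WCSS, which scikit-learn exposes as the `inertia_` attribute of a fitted model. The data here is synthetic (three made-up groups), and the range of k values is an assumption for illustration; the project's own code may instead compute mean distances with SciPy.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with three groups, so the elbow is expected near k = 3.
rng = np.random.default_rng(0)
centers = [(0, 0), (10, 0), (5, 9)]
X = np.vstack([rng.normal(c, 0.6, size=(60, 2)) for c in centers])

# Fit one model per k and store its within-cluster sum of squares.
wcss = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss[k] = model.inertia_

for k, value in wcss.items():
    print(k, round(value, 1))
```

Plotting `wcss` against k with matplotlib's `pyplot.plot` would give the elbow curve; the k at which the curve flattens is the candidate number of clusters.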
Visualize the final clusters:
The plot shows the distribution of the 3 clusters. We could interpret them as the following
customer segments:
With this plot we showed that “TV” has the highest ratings compared to both
Radio and Newspaper.
Advantages:
Easy to implement
An instance can change cluster (move to another cluster) when the centroids are
recomputed.
Disadvantages:
Linear Regression:
Simple linear regression is an approach for predicting a response using a single feature.
It is assumed that the two variables are linearly related. Hence, we try to find a linear function
that predicts the response value(y) as accurately as possible as a function of the feature or
independent variable(x).
The first step is to estimate the mean and the variance of both the input and output variables
from the training data.
The mean of a list of numbers can be calculated as:
Mean(x)=sum(x)/count(x)
The variance is the sum of the squared differences of each value from the mean value. Variance
for a list of numbers can be calculated as:
Variance(x)=sum((x-Mean(x))^2)
2. Calculate Covariance
The covariance of two groups of numbers describes how those numbers change together.
3. Make Predictions
The simple linear regression model is a line defined by coefficients estimated from training
data.
Once the coefficients are estimated, we can use them to make predictions.
4. Estimate Coefficients
We must estimate the values for two coefficients in simple linear regression.
Different techniques can be used to prepare or train the linear regression equation from data,
the most common of which is called Ordinary Least Squares.
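The steps above can be sketched in plain Python. The function names and the five sample points are assumptions for illustration. Following the text, `variance` is the plain sum of squared differences (not divided by the count); since `covariance` is defined the same way, the slope, which is their ratio, is unaffected.

```python
def mean(values):
    return sum(values) / len(values)

def variance(values):
    # Sum of squared differences from the mean, as described in the text.
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def covariance(x, y):
    # Sum of products of the deviations of x and y from their means.
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

def coefficients(x, y):
    # Slope b1 and intercept b0 of the line y = b0 + b1 * x.
    b1 = covariance(x, y) / variance(x)
    b0 = mean(y) - b1 * mean(x)
    return b0, b1

# Made-up training data.
x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]

b0, b1 = coefficients(x, y)
predictions = [b0 + b1 * xi for xi in x]
print(b0, b1)
```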
Ordinary Least Squares
When we have more than one input we can use Ordinary Least Squares to estimate the values
of the coefficients.
The Ordinary least squares procedure seeks to minimize the sum of the squared residuals.
This means that given a regression line through the data we calculate the distance from each
data point to the regression line, square it, and sum all of the squared errors together. This is
the quantity that ordinary least squares seeks to minimize.
This approach treats the data as a matrix and uses linear algebra operations to estimate the
optimal values for the coefficients. It means that all of the data must be available and you
must have enough memory to fit the data and perform matrix operations.
y = b0 + b1 × x
where y is the predicted response value, x is the input feature,
b0 is the intercept and b1 is the slope of the line.
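For more than one input, the matrix form of ordinary least squares can be sketched with NumPy's `numpy.linalg.lstsq`, which finds the coefficients minimizing the sum of squared residuals. The data is synthetic and the "true" coefficients (3.0, 0.05, 0.2) are made up for illustration.

```python
import numpy as np

# Synthetic data with two inputs and known coefficients plus noise.
rng = np.random.default_rng(0)
n = 200
tv = rng.uniform(0, 300, n)
radio = rng.uniform(0, 50, n)
y = 3.0 + 0.05 * tv + 0.2 * radio + rng.normal(0, 0.5, n)

# Design matrix: a column of ones for the intercept, then one column per input.
A = np.column_stack((np.ones(n), tv, radio))

# Least-squares solution of A @ coef ≈ y.
coef, residual_ss, rank, _ = np.linalg.lstsq(A, y, rcond=None)
b0, b_tv, b_radio = coef
print(b0, b_tv, b_radio)
```

Because the whole matrix `A` is held in memory, this approach reflects the requirement that all of the data be available at once.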
Now we need to split our dataset into two sets: a training set and a test set. We will
train our machine learning models on the training set, i.e. the models
will try to learn the relationships in the training set, and then we will test the
models on the test set to check how accurately they can predict.
A general rule of thumb is to allocate 80% of the dataset to the training set and the
remaining 20% to the test set, but in our project we split our dataset into 70% training
data and 30% test data.
For this task, we will import train_test_split from the Scikit-learn library.
# Splitting the dataset into the training set (70%) and test set (30%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
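A self-contained sketch of the split followed by model fitting is shown below. The synthetic `X` and `y`, the 70/30 split taken from the text, and the use of scikit-learn's `LinearRegression` are assumptions for illustration; the results reported in this section may have been produced with a different tool.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: one input (think of it as TV spend) and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 300, size=(200, 1))
y = 7.0 + 0.05 * X[:, 0] + rng.normal(0, 1.0, size=200)

# Hold out 30% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit on the training set only, then score on the unseen test set.
model = LinearRegression().fit(X_train, y_train)
r2_test = model.score(X_test, y_test)  # R-squared on the test set
print(model.intercept_, model.coef_[0], r2_test)
```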
2. R-squared is 0.816
Meaning that 81.6% of the variance in Sales is explained by TV. This is
a decent R-squared value.
3. The F statistic has a very low p-value (practically zero)
Meaning that the model fit is statistically significant, and the explained variance isn't
purely by chance.
The fit is significant. Let's visualize how well the model fits the data. From the
parameters that we get, our linear regression equation becomes:
Sales=6.948+0.054×TV
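As a small worked example, the fitted equation can be wrapped in a function and evaluated for a few TV values. The function name `predict_sales` and the input values are hypothetical; the coefficients are the ones in the equation above.

```python
def predict_sales(tv):
    # Sales = intercept + slope * TV, using the fitted coefficients from the text.
    return 6.948 + 0.054 * tv

for tv in (0, 100, 200):
    print(tv, round(predict_sales(tv), 3))
```

The intercept is the predicted Sales when TV is zero, and each additional unit of TV adds 0.054 to the prediction.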
Residual analysis:
To validate the assumptions of the model, and hence the reliability of inference, we examine
the residuals.
Distribution of the error terms
We need to check whether the error terms are normally distributed (which is, in fact, one of the
major assumptions of linear regression). Let us plot the histogram of the error terms and see
what it looks like.
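A minimal sketch of the residual check: fit a line, compute the residuals (observed minus predicted), and bin them as a histogram would. The data is synthetic, and `numpy.polyfit` is used here only as a convenient least-squares line fit; it is an assumption, not the project's fitting code.

```python
import numpy as np

# Synthetic data with normally distributed noise around a straight line.
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, 200)
sales = 7.0 + 0.05 * tv + rng.normal(0, 1.0, 200)

# Least-squares line; polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(tv, sales, deg=1)

# Residuals (error terms): observed minus predicted.
residuals = sales - (intercept + slope * tv)

# Histogram counts, the same numbers a plotted histogram would display.
counts, bin_edges = np.histogram(residuals, bins=10)
print(counts)
print(round(residuals.mean(), 6))
```

A roughly bell-shaped set of counts centred on zero is consistent with the normality assumption; passing `residuals` to matplotlib's `pyplot.hist` draws the same histogram.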
Advantages and Disadvantages:
The principal advantage of linear regression is its simplicity, interpretability,
scientific acceptance, and widespread availability.
Linear regression is the first method to use for many problems. Analysts can use
linear regression together with techniques such as variable recoding, transformation,
or segmentation.
Its principal disadvantage is that many real-world phenomena simply do not correspond
to the assumptions of a linear model; in these cases, it is difficult or impossible to
produce useful results with linear regression.
The K-Means algorithm is one of the clustering algorithms that come under the unsupervised
learning method. In clustering the idea is to assign different classes to different data points
depending on how they group together, whereas the linear regression algorithm is one of the
regression algorithms that come under supervised learning. It predicts continuous-valued
output. It is a statistical model used to predict numeric data instead of labels. It
can also identify distribution trends based on the available or historic data.
Accurate prediction of Sales from advertising through TV, Radio and Newspaper can be done by
using the regression algorithm.
5. CONCLUSION
By applying some statistical measures, the sales and growth of the company
can be predicted and we can get an idea of the rise and fall of the company. So
the accurate prediction of sales plays a vital role in the development of the company, which
can lead to higher profits.
Sales forecasting is a critical part of the strategic planning process and allows a company to
predict how their company will perform in the future. It allows them to not only plan for new
opportunities, but also allows them to avert negative trends that appear in the forecast. A
mission statement is important because it allows an organization to know exactly why they
exist and serves as a guide for decisions. Both concepts are important to the success of the
company and should not be overlooked throughout the strategic planning process.