Data Science: Sales Forecasting For Marketing

SALES FORECASTING FOR MARKETING INTRODUCTION
1. INTRODUCTION
1.1. Data science
Data science is the process of deriving knowledge and insights from large and diverse sets of
data by organizing, processing and analyzing them. It draws on many different
disciplines, including mathematical and statistical modeling, extracting data from its source and
applying data visualization techniques. It often also involves big data technologies
for gathering both structured and unstructured data.
Some example scenarios where data science is used:
• Recommendation system: Create models that predict the shopper's needs and show
the products the shopper is most likely to buy.
• Financial risk management: The financial risk involved in loans and credit is better
analysed using the customer's past spending habits, past defaults and other financial
commitments. The outcome is minimized loss for the financial organization by
avoiding bad debt.
• Improvement in health care services: The health care industry deals with a variety
of data which can be classified into technical data, financial data, patient information,
drug information and legal rules. All this data needs to be analysed to produce insights
that will save cost for both the health care provider and the care receiver.
• Computer vision: Advances in image recognition by computers
involve processing large sets of image data from multiple objects of the same category,
for example in face recognition.
Python in Data Science:
The programming requirements of data science demand a versatile and flexible language
in which code is simple to write but which can handle highly complex mathematical processing.
Python is well suited to these requirements, as it has already established itself as a
language for both general and scientific computing. Moreover, it is
continuously extended with new additions to its large collection of libraries aimed at different
programming requirements.
SAI SPURTHI INSTITUTE OF TECHNOLOGY

1.2. Machine learning
Machine learning is a discipline that deals with programming the systems so as to make them
automatically learn and improve with experience. Here, learning implies recognizing and
understanding the input data and taking informed decisions based on the supplied data. It is
very difficult to consider all the decisions based on all possible inputs.
To solve this problem, algorithms are developed that build knowledge from specific data
and past experience by applying the principles of statistical science, probability, logic,
mathematical optimization, reinforcement learning, and control theory.
For example, machine learning programs can scan and process huge databases detecting
patterns that are beyond the scope of human perception.
Applications of Machine Learning
The developed machine learning algorithms are used in various applications such as
• Vision processing
• Language processing
• Forecasting things like stock market trends, weather
• Pattern recognition
• Games
• Data mining
• Expert systems
• Robotics
Types of machine learning algorithms

• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Supervised Learning:
Supervised learning involves building a machine learning model that is based on labeled
samples. Learning data comes with description, labels, targets or desired outputs and the
objective is to find a general rule that maps inputs to outputs. This kind of learning data is
called labeled data.
For example, if we build a system to estimate the price of a plot of land or a house based on
various features, such as size, location, and so on, we first need to create a database and label
it. We need to teach the algorithm what features correspond to what prices. Based on this
data, the algorithm will learn how to calculate the price of real estate using the values of the
input features.
Supervised learning can be further classified into two types: Regression and Classification.
Regression trains on and predicts a continuous-valued response, for example predicting real
estate prices.
Regression algorithms:
• Linear regression
• Polynomial regression
• Stepwise regression etc.
Classification attempts to find the appropriate class label, such as analyzing
positive/negative sentiment, male and female persons, benign and malignant tumors, secured
and unsecured loans etc.
Classification algorithms:
• Logistic regression (despite its name, it predicts class labels)
• Decision tree algorithms
• K Nearest Neighbor algorithms
• Support Vector Machine algorithms
• Naïve Bayes algorithms etc.

Unsupervised learning:
In unsupervised learning there is no labelled data. When learning data contains only some
indications without any description or labels, it is up to the coder or the algorithm to find
the structure of the underlying data, to discover hidden patterns, or to determine how to
describe the data. This kind of learning data is called unlabeled data.
Unsupervised learning algorithms are powerful tools for analyzing data and for
identifying patterns and trends. They are most commonly used for clustering similar inputs
into logical groups. Unsupervised learning algorithms include
Clustering algorithms
• K-means
• Hierarchical clustering etc.
Dimensionality reduction algorithms
• PCA (Principal Component Analysis).
Reinforcement Learning
Here the learning data gives feedback so that the system adjusts to dynamic conditions in order
to achieve a certain objective. The system evaluates its performance based on the feedback
responses and reacts accordingly. The best known instances include self-driving cars and
the Go-playing program AlphaGo.
1.3. Project Introduction
The sales forecast indicates how much of a particular product is likely to be sold
in a specified future period, in a specified market, at a specified price. Sales forecasting is the
process of making such predictions from a dataset. Causal forecasting attempts to predict a
variable by explaining which factors cause it to change. For example, a causal model to
predict market demand for a product might use the product's price, competitors' prices and the
amount of money spent on advertising to explain what product demand might be six months
in the future.
For this prediction we use the K-Means and Linear Regression machine learning
algorithms, where K-Means is a clustering algorithm and Linear Regression is a regression
algorithm. Regression models a target prediction value based on independent variables. It is
mostly used for finding the relationship between variables and for forecasting.
Here we use a dataset called "Advertising". To carry out a data science project we must be
familiar with Python libraries such as
• NumPy
• Pandas
• Scikit-learn
• Matplotlib
and IDEs such as
• Jupyter
• Spyder
2. INSTALLATIONS
2.1 ANACONDA:
Anaconda is a package manager, an environment manager
and a Python distribution that contains a collection of many open source packages. This is
advantageous because a data science project needs
many different packages (NumPy, Scikit-learn, SciPy and pandas, to name a few), and these
come preinstalled with Anaconda.
Download and Install Anaconda:
1. Go to the Anaconda Website and choose a Python 3.x graphical installer (A) or a Python
2.x graphical installer (B). If you aren't sure which Python version you want to install, choose
Python 3. Do not choose both.
2. Locate your download and double click it.
The installer then starts. When the first installer screen appears, click on Next.
3. Read the license agreement and click on I Agree
4. Click on Next.
5. Note your installation location and then click Next.

6. This is an important part of the installation process. The recommended approach is to not
check the box to add Anaconda to your path. This means you will have to use Anaconda
Navigator or the Anaconda Command Prompt when you wish to use Anaconda. If you want
to be able to use Anaconda in your command prompt, use the alternative approach and
check the box.
7. Click on Next.
8. Click on Next.
9. Click on Finish.
Anaconda provides various IDEs such as Jupyter and Spyder, which you can launch and use.

2.2. Integrated Development Environment (IDE):
Jupyter:
• The Jupyter Notebook is a powerful tool for interactively developing and
presenting data science projects.
• A notebook integrates code and its output into a single document that combines
visualisations, narrative text, mathematical equations, and other rich media.
• Many different programming languages can be used within Jupyter Notebooks;
this project uses Python, which is the most common use case.
Spyder:
• Spyder is an open source, cross-platform IDE developed specifically for data science.
• Spyder integrates essential data science libraries such as
IPython, SciPy, Matplotlib and NumPy.
• Spyder has features like code completion, a text editor with syntax highlighting, and
a variable explorer whose values you may edit using a GUI.
• It includes an online help browser that lets users search and view Python and package
documentation inside the IDE.
3. PYTHON LIBRARIES & SYSTEM SPECIFICATIONS
Libraries:
NumPy:
• NumPy is an open source extension module for Python.
• It makes it easy to work with large multidimensional arrays and matrices.
• Another advantage of NumPy is that you can apply standard mathematical operations
to an entire data set without having to write loops.
• Even though NumPy does not itself provide powerful data analysis functionality,
understanding NumPy arrays and array-oriented computing will help you use other
Python data analysis tools more effectively.
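As a minimal sketch of the loop-free style described above, the snippet below applies arithmetic to a whole array at once. The array values and variable names are illustrative assumptions, not taken from the project data.

```python
import numpy as np

# Illustrative advertising spend values (assumed, not from advertising.csv).
tv_spend = np.array([230.1, 44.5, 17.2, 151.5])

# Each operation is applied element-wise to the whole array, with no explicit loop.
scaled = tv_spend / tv_spend.max()      # rescale so the largest value becomes 1
centered = tv_spend - tv_spend.mean()   # subtract the mean from every element

print(scaled)
print(centered)
```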
Pandas:
• Pandas is a Python module that contains high-level data structures and tools designed
for fast and easy data analysis operations.
• Pandas is built on NumPy, which makes it easy to use in NumPy-centric applications.
• It is also easy to handle missing data using Pandas. Pandas is well suited to
data munging.
Matplotlib:
• Matplotlib is a Python module for visualization.
• Matplotlib allows you to quickly make line graphs, pie charts, histograms and other
professional grade figures.
• Using Matplotlib, you can customise every aspect of a figure.
• Matplotlib has interactive features like zooming and panning.
Scikit-Learn:
• Scikit-Learn is a Python package for machine learning.
• It provides a set of common machine learning algorithms to users through a consistent
interface.
• Scikit-Learn helps to quickly implement popular algorithms on datasets.
System Specifications:
Hardware Requirements:
• Processor : i3 or higher
• Processor Speed : minimum 500 MHz
• Hard Disk : minimum 30 GB
• Input Devices : Keyboard, Mouse
• RAM : 8 GB or higher
Software Requirements:
• Operating System : Windows 10
• Coding Language : Python
• Libraries : NumPy, Pandas, Matplotlib, Scikit-learn
• Tool : Jupyter, Spyder
• Dataset : Advertising.csv
4. DATA SCIENCE PROJECT LIFE CYCLE

This covers every step of the data science project lifecycle from end to end.
1. Data Inspection
2. Data Scrubbing
3. Data Exploring
4. Model Building
5. Model Evaluation
4.1. Data Inspection:
The first step in sales prediction is to load the relevant data. The location and
structure of the data to be analyzed will vary from person to person and application to
application.
In this case the data we analyse is numerical and stored in advertising.csv. We load it
into a pandas DataFrame:
import pandas as pd
advertising = pd.read_csv("advertising.csv")
CSV (Comma Separated Values) is a common file format for transferring and storing
data, in which commas separate the columns and newlines
separate the rows.
Pandas is an open source library providing high-performance, easy-to-use data
structures and data analysis tools for Python.
A data frame is the Pandas data type for storing 2-dimensional data, in which each column
contains the values of one variable and each row contains one set of values, one from each
column.
Data frames support methods for slicing and dicing the data by rows and
columns, where slicing is used for viewing and segmenting the data.
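The following is a small sketch of loading CSV data and slicing a data frame. To keep it self-contained it reads from an in-memory string instead of the real advertising.csv; the column names TV, Radio, Newspaper and Sales follow the attributes named in this report, while the row values are illustrative assumptions.

```python
import io
import pandas as pd

# Illustrative rows only; the real advertising.csv is assumed to use these column names.
csv_text = """TV,Radio,Newspaper,Sales
230.1,37.8,69.2,22.1
44.5,39.3,45.1,10.4
17.2,45.9,69.3,9.3
151.5,41.3,58.5,18.5
"""
advertising = pd.read_csv(io.StringIO(csv_text))

tv_and_sales = advertising[["TV", "Sales"]]      # select two columns
first_two_rows = advertising.iloc[:2]            # slice rows by position
high_tv = advertising[advertising["TV"] > 100]   # keep rows that satisfy a condition

print(advertising.shape)
print(high_tv)
```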
4.2. Data Scrubbing:
It is the process of identifying and correcting inaccurate data in a dataset. Our
dataset contains incomplete and incorrect data, which we have cleaned using library
functions.
After cleansing, a dataset should be consistent with other similar data sets in the
system. The inconsistencies detected or removed may have been originally caused by
user entry errors, by corruption in transmission or storage, or by different data
dictionary definitions of similar entities in different stores. Data cleaning differs from
data validation in that validation almost invariably means data is rejected from the
system at entry and is performed at the time of entry, rather than on batches of data.
Steps in Data Cleaning:
1. Get rid of extra spaces
2. Select and treat all blank cells
3. Remove duplicates
4. Highlight Errors
5. Change text to lower/upper case
Our dataset, advertising.csv, contains some missing values in the Newspaper attribute.
These missing values are represented as NaN (Not a Number). By scrubbing the dataset we fill
them with the mean value of the Newspaper attribute.
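Filling missing values with the column mean can be sketched as follows. Only the column name Newspaper comes from the text; the values and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative values with two missing entries (not the project's data).
advertising = pd.DataFrame({"Newspaper": [69.2, np.nan, 69.3, np.nan, 58.4]})

missing_before = advertising["Newspaper"].isna().sum()
newspaper_mean = advertising["Newspaper"].mean()   # NaN entries are skipped by default
advertising["Newspaper"] = advertising["Newspaper"].fillna(newspaper_mean)

print(missing_before, advertising["Newspaper"].isna().sum())
```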
4.3. Data Analysis:
Data exploring is the first step in data analysis and typically involves summarizing the
main characteristics of a dataset including its size, accuracy and attributes.
Issues involved in Data analysis:
1. Having the necessary skills to analyze
2. Selecting data collection methods & appropriate analysis
3. Determining statistical significance
4. Manner of presenting data
We use box plots to analyse the attributes. A box plot is a graph that gives a
good indication of how the values in the data are spread out. We plot both the predictor
variables (the one or more variables used to determine the target variable) and the target
variable (the variable whose values are to be modeled and predicted from the other variables).
A box plot is suitable for comparing the range and distribution of numerical data.
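A box plot of a single attribute might be drawn as in the sketch below. The values are illustrative assumptions, and the non-interactive "Agg" backend is chosen only so that the example runs without a display; in Jupyter or Spyder that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Illustrative Newspaper spend values (not the project's data).
newspaper = np.array([69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5, 11.6])

fig, ax = plt.subplots()
parts = ax.boxplot(newspaper)   # draws the box (quartiles), whiskers and any outliers
ax.set_ylabel("Newspaper")

# The box itself spans the first to third quartile, with a line at the median.
q1, median, q3 = np.percentile(newspaper, [25, 50, 75])
print(q1, median, q3)
```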
Univariate Analysis:
It is the simplest form of data analysis. "Uni" means one, so univariate analysis looks at
only one variable at a time.
Outlier Analysis:
An Outlier is a rare chance of occurrence within a given data set. In Data Science, an
Outlier is an observation point that is distant from other observations. An Outlier may
be due to variability in the measurement or it may indicate experimental error.
Outliers, being the most extreme observations, may include the sample maximum or
sample minimum, or both, depending on whether they are extremely high or low.
However, the sample maximum and minimum are not always outliers because they
may not be unusually far from other observations.
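One common way to flag outliers is the 1.5 x IQR rule, which is also the default whisker rule in many box plot implementations. The report does not state which rule was used, so the rule, the values and the names below are assumptions for illustration.

```python
import numpy as np

# Illustrative values with one obviously distant observation.
values = np.array([10, 12, 11, 13, 12, 14, 11, 100])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1                      # interquartile range
lower = q1 - 1.5 * iqr             # anything below this is flagged
upper = q3 + 1.5 * iqr             # anything above this is flagged

outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

Note that the sample minimum (10) is not flagged here, which matches the point above that the extremes are not always outliers.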
Correlation Matrix:
A correlation matrix is a table showing correlation coefficients between variables.
Each cell in the table shows the correlation between two variables. A correlation matrix is
used as a way to summarize data, as an input into a more advanced analysis, and as a
diagnostic for advanced analyses.
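A correlation matrix can be computed directly from a data frame, as in this sketch. The column names follow the attributes named in this report; the values are illustrative assumptions, and `DataFrame.corr()` uses the Pearson coefficient by default.

```python
import pandas as pd

# Illustrative values (not the project's data).
advertising = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8],
    "Radio": [37.8, 39.3, 45.9, 41.3, 10.8],
    "Newspaper": [69.2, 45.1, 69.3, 58.5, 58.4],
    "Sales": [22.1, 10.4, 9.3, 18.5, 12.9],
})

corr = advertising.corr()   # one row and one column per variable

# Correlation of each predictor with the target, strongest first.
sales_corr = corr["Sales"].drop("Sales").sort_values(ascending=False)
print(corr.round(2))
print(sales_corr.round(2))
```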
What is Correlation?
Variables within a dataset can be related for lots of reasons. For example:
• One variable could cause or depend on the values of another variable.
• One variable could be lightly associated with another variable.
• Two variables could depend on a third unknown variable.
It can be useful in data analysis and modeling to better understand the relationships between
variables. The statistical relationship between two variables is referred to as their correlation.
A correlation could be positive, meaning both variables move in the same direction, or
negative, meaning that when one variable's value increases, the other variable's value
decreases. Correlation can also be neutral or zero, meaning that the variables are unrelated.
• Positive Correlation: both variables change in the same direction.
• Neutral Correlation: no relationship in the change of the variables.
• Negative Correlation: variables change in opposite directions.
We may also be interested in the correlation between the input variables and the output
variable, in order to provide insight into which variables may or may not be relevant as input
for developing a model.
The structure of the relationship may be known, e.g. it may be linear, or we may have no idea
whether a relationship exists between two variables or what structure it may take. Depending
on what is known about the relationship and the distribution of the variables, different
correlation scores can be calculated.
How to Use Correlation to Understand the Relationship between Variables
There may be complex and unknown relationships between the variables in your dataset.
It is important to discover and quantify the degree to which variables in your dataset are
dependent upon each other. This knowledge can help you better prepare your data to meet the
expectations of machine learning algorithms, such as linear regression, whose performance
will degrade with the presence of these interdependencies.
Applications of a correlation matrix:
There are three broad reasons for computing a correlation matrix.
1. To summarize a large amount of data where the goal is to see patterns. In our
example above, the observable pattern is that all the variables highly correlate with
each other.
2. To input into other analyses. For example, people commonly use correlation matrices
as inputs for exploratory factor analysis, confirmatory factor analysis, structural
equation models, and linear regression when excluding missing values pairwise.
3. As a diagnostic when checking other analyses. For example, with linear regression a
high amount of correlations suggests that the linear regression’s estimates will be
unreliable.
4.4. Model Building:
4.4.1 K-Means Clustering Algorithm:
K-Means clustering is a type of unsupervised machine learning that groups data on the basis
of similarities. K-Means is one technique for finding subgroups within datasets. One
difference between K-Means and other clustering methods is that K-Means needs a
predetermined number of clusters, whereas some other techniques do not require the number
of clusters to be predefined. The algorithm begins by randomly assigning each data point to a
specific cluster, with no data point belonging to more than one cluster. It then calculates the
centroid, or mean, of the points in each cluster.
In this project, we feed the 200 records of the advertising data into the K-Means algorithm,
implemented in Python using Pandas, NumPy and Scikit-learn, and cluster the data based on
similarities between the data points.
K-means working:
K-means simply partitions the given dataset into various clusters (groups) with different
features.
How exactly?
K refers to the total number of clusters to be defined in the entire dataset. There is a centroid
chosen for a given cluster type which is used to calculate the distance of a given data point.
The distance essentially represents the similarity of features of a data point to a cluster type.
Why use K-means?
With k-means, the data is clustered after analyzing the data itself, rather than
being assigned in advance to groups based on pre-defined labels. Each
centroid is a collection of feature values that essentially represents the type of cluster
it belongs to. Thus a centroid can be used to interpret the type of cluster
formed.
Real life applications:
1. Customer segmentation in retail: dividing customers into clusters for various data
analytics applications, e.g. identifying loyal customers or analyzing the spending behavior
or needs of certain types of customer.
2. Fraud detection for cyber crime.
3. MP3 files and cellular phones are general areas that use this technique.
Constraints:
Only numerical data can be used. K-means results are easiest to inspect for 2-dimensional
numerical data, since visualization is possible only in 2D or 3D. In reality, however, there are
usually multiple features to be considered at a time.
Multi-dimensional data can therefore be used, but dimensionality reduction is often applied
to the data before k-means.
To find the optimum k, i.e. how many clusters the data should be grouped
into, we test different values of k and use the "elbow point": we plot the mean distance
between each point in a cluster and its centroid against
the value of k used for the test, which helps determine a suitable number of clusters for this
dataset.
The steps below describe the code in detail and how these packages were used:
1. Set up the environment and load the data
Here we import the following packages:
Pandas and NumPy for scientific computing with Python; Scikit-learn
for machine learning, with sklearn.cluster providing KMeans for clustering; pyplot from
Matplotlib to create a visual representation of the data; and
SciPy for the Euclidean distance calculation in the elbow method. We use the
advertising.csv file.
2. Importing and preparing data
Here we import the data from the advertising.csv file into a Pandas data frame object with
column names f1 and f2, where f1 refers to one of TV, Newspaper or Radio and f2 refers to Sales.
Then we combine the two vectors f1 and f2 into a NumPy array object named X. X
represents our primary data model to fit with the Scikit-learn KMeans algorithm later.
Visualizing our data:
Figure: relationship between the attributes "TV" and "Sales"
Figure: relationship between the attributes "Newspaper" and "Sales"
Figure: relationship between the attributes "Radio" and "Sales"
3. Visualize Raw Data and initial centroids
Here we calculate an initial set of random centroids for K = 2, and plot both the raw data and
the initial centroids on a scatter plot. K represents the number of clusters; we start by setting
it to 2. The NumPy package is used to generate random values and assign them to the centroids.
In the output of the code, the initial centroids are indicated as green stars and the green dots
represent the raw data. This is how the data looks before running clustering.
4. Build K-Means Model to Run with a given K-value
Provided we have the data in required format “X”, here we create the K-Means model and
specify the number of clusters (K value). Next, we fit the model to the data, generate cluster
labels and compute centroids.
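A minimal sketch of this step with scikit-learn is given below. The names f1, f2 and X follow the description above; the data values are illustrative assumptions, and the `n_init` and `random_state` settings are choices made here for reproducibility, not settings reported by the project.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative (TV, Sales) pairs forming two loose groups; not the project's data.
f1 = np.array([17.2, 44.5, 8.7, 57.5, 230.1, 180.8, 151.5, 199.8])
f2 = np.array([9.3, 10.4, 7.2, 11.8, 22.1, 17.9, 18.5, 20.6])
X = np.column_stack((f1, f2))           # one row per record, one column per feature

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)          # cluster index assigned to every row of X
centroids = kmeans.cluster_centers_     # one (f1, f2) centroid per cluster

print(labels)
print(centroids)
```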
5. Select k, run algorithm and plot the clusters
Now we run the K-Means algorithm for K values 2, 4 and 6. We take K = 2 as the starting
value because we want to show the relationship between the attributes in the given dataset. For
each K value, we compute the centroids and iterate until the error value
reaches zero. In other words, for each cluster, we stop when the distance between the old
centroid and the newly calculated centroid becomes zero. This is the stopping
criterion for calculating the centroids of each cluster accurately. For coding purposes, however,
we do not need to do the error calculation explicitly, as it is already taken care of inside the
scikit-learn KMeans model.
We ran the code for k = 2, 4 and 6. The squared Euclidean distance measures the distance
between each data point and the centroid, and the centroids are recalculated until the stopping
criterion is met.
6. Selecting optimum k
The technique we use to determine the optimum K, the number of clusters, is called the
elbow method.
It plots the number of centroids against the average distance between a data point and the
centroid of its cluster.
The mathematics of clustering
The mathematics behind clustering, in very simple terms, involves minimizing the sum of
the squared distances between each cluster centroid and its associated data points, where
• K = number of clusters
• N = number of data points
• cj = centroid of cluster j
• (xij - cj) = distance between data point xij and the centroid cj to which it is assigned
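Written out, the objective described above can be sketched as follows. The symbols K, x_ij and c_j are the ones defined in the list; N_j, the number of data points assigned to cluster j (so that the N_j add up to N), is notation introduced here for clarity and is an assumption, not part of the original text.

```latex
J = \sum_{j=1}^{K} \sum_{i=1}^{N_j} \left\lVert x_{ij} - c_j \right\rVert^{2},
\qquad \sum_{j=1}^{K} N_j = N
```

K-means searches for the cluster assignments and centroids that make J as small as possible.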
Deciding on the optimum number of clusters ‘K’

The main input for k-means clustering is the number of clusters. This is derived using the
concept of minimizing the within-cluster sum of squares (WCSS). A scree plot is created which
plots the number of clusters on the x-axis and the WCSS for each number of clusters on the
y-axis.
As the number of clusters increases, the WCSS keeps decreasing. The decrease
is initially steep and then slows down, resulting in an elbow-shaped plot. The
number of clusters at the elbow usually gives an indication of the optimum
number of clusters. This, combined with specific knowledge of the business requirement,
should be used to decide on the optimum number of clusters.
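The elbow procedure can be sketched by fitting K-Means for several values of K and recording the WCSS, which scikit-learn exposes as the `inertia_` attribute. The data here are three synthetic blobs generated purely for illustration, so the elbow is expected near K = 3; the project's own data and K range may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three well-separated blobs (an assumption for illustration).
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(30, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(30, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(30, 2)),
])

k_values = range(1, 7)
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)   # within-cluster sum of squared distances

for k, value in zip(k_values, wcss):
    print(k, round(value, 2))
```

Plotting `k_values` against `wcss` (for example with Matplotlib's `plt.plot`) gives the scree plot described above.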
Visualize the final clusters:
The plot shows the distribution of the 3 clusters. We could interpret them as follows:
1. Cluster 1: shows the ratings of the "TV" attribute
2. Cluster 2: shows the ratings of the "Newspaper" attribute
3. Cluster 3: shows the ratings of the "Radio" attribute
This plot shows that "TV" has the highest ratings compared to both
Radio and Newspaper.
Advantages:
• Easy to implement
• With a large number of variables, K-Means may be computationally faster than
hierarchical clustering (if K is small)
• K-Means may produce tighter clusters than hierarchical clustering
• An instance can change cluster (move to another cluster) when the centroids are
re-computed.
Disadvantages:
• Difficult to predict the number of clusters (K value)
• Initial seeds have a strong impact on the final results
• The order of the data has an impact on the final results
• Sensitive to scale: rescaling your datasets (normalization or standardization) will
completely change results. While this itself is not bad, not realizing that you have to
spend extra attention on scaling your data might be bad.
4.4.2 Performing Simple Linear Regression
Linear Regression:
Regression analysis is a form of predictive modeling technique which investigates
the relationship between dependent and independent variables.
Linear regression was developed in the field of statistics and is studied as a
model for understanding the relationship between input and output numerical variables, but
has been borrowed by machine learning. It is both a statistical algorithm and a machine
learning algorithm.
There are two types of linear regression model:
Simple linear regression
Multiple linear regression
Simple Linear Regression
Simple linear regression is an approach for predicting a response using a single feature.
It is assumed that the two variables are linearly related. Hence, we try to find a linear function
that predicts the response value(y) as accurately as possible as a function of the feature or
independent variable(x).
1. Calculate Mean and Variance.

2. Calculate Covariance.
3. Make Predictions.
4. Estimate Coefficients.
1. Calculate Mean and Variance
The first step is to estimate the mean and the variance of both the input and output variables
from the training data.
The mean of a list of numbers can be calculated as:
Mean(x)=sum(x)/count(x)
The variance is the sum of the squared differences of each value from the mean value. Variance
for a list of numbers can be calculated as:
Variance(x)=sum((x-Mean(x))^2)
2. Calculate Covariance
The covariance of two groups of numbers describes how those numbers change together.
Covariance is a generalization of correlation. Correlation describes the relationship between
two groups of numbers, whereas covariance can describe the relationship between two or
more groups of numbers.
Additionally, covariance can be normalized to produce a correlation value.
3. Make Predictions
The simple linear regression model is a line defined by coefficients estimated from training
data.
Once the coefficients are estimated, we can use them to make predictions.
4. Estimate Coefficients
We must estimate the values for two coefficients in simple linear regression.
Different techniques can be used to prepare or train the linear regression equation from data,
the most common of which is called Ordinary Least Squares.
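The four steps can be sketched in plain Python as below. The estimates used are the standard simple linear regression ones, b1 = covariance(x, y) / variance(x) and b0 = mean(y) - b1 * mean(x); the function names and the small data set are assumptions for illustration, not the project's code or data.

```python
def mean(values):
    return sum(values) / len(values)

def variance(values, mean_value):
    # Sum of squared differences from the mean (not divided by the count).
    return sum((v - mean_value) ** 2 for v in values)

def covariance(x, mean_x, y, mean_y):
    # Sum of the products of the deviations of x and y from their means.
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

def coefficients(x, y):
    mean_x, mean_y = mean(x), mean(y)
    b1 = covariance(x, mean_x, y, mean_y) / variance(x, mean_x)   # slope
    b0 = mean_y - b1 * mean_x                                      # intercept
    return b0, b1

def predict(x, b0, b1):
    return [b0 + b1 * xi for xi in x]

# Tiny illustrative data set.
x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]
b0, b1 = coefficients(x, y)
predictions = predict([6], b0, b1)
print(b0, b1, predictions)
```

The coefficients have to be estimated before predictions can be made, so in code the estimation step runs first.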
Ordinary Least Squares
When we have more than one input we can use Ordinary Least Squares to estimate the values
of the coefficients.
The Ordinary least squares procedure seeks to minimize the sum of the squared residuals.
This means that given a regression line through the data we calculate the distance from each
data point to the regression line, square it, and sum all of the squared errors together. This is
the quantity that ordinary least squares seeks to minimize.
This approach treats the data as a matrix and uses linear algebra operations to estimate the
optimal values for the coefficients. It means that all of the data must be available and you
must have enough memory to fit the data and perform matrix operations.
It is unusual to implement the Ordinary Least Squares procedure yourself unless as an
exercise in linear algebra. It is more likely that you will call a procedure in a linear algebra
library. This procedure is very fast to calculate.
• Here we use the equation of a straight line, y = b1x + b0, which generalizes to several
inputs as y = b0 + b1x1 + b2x2 + ... + bnxn, where
y is the response,
b1 is the slope,
b0 is the intercept.
The values b1, ..., bn are called the model coefficients or model parameters.
We check the goodness of fit using the R-squared method. The R-squared value is a
statistical measure of how close the data are to the fitted regression line. It is also known as
the coefficient of determination.
The formula is $R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}$, where $\hat{y}_i$ is the predicted
value, $y_i$ the observed value and $\bar{y}$ the mean of the observed values.
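As a small worked sketch, R-squared can be computed as the ratio of the explained sum of squares to the total sum of squares. The observed and predicted values below are illustrative assumptions (the predictions come from the line y = 0.4 + 0.8x fitted by least squares to x = 1, 2, 4, 3, 5), and the function name is hypothetical.

```python
def r_squared(observed, predicted):
    mean_observed = sum(observed) / len(observed)
    # Explained variation: how far the predictions are from the mean.
    explained = sum((p - mean_observed) ** 2 for p in predicted)
    # Total variation: how far the observations are from the mean.
    total = sum((o - mean_observed) ** 2 for o in observed)
    return explained / total

y = [1, 3, 3, 2, 5]                  # observed values
y_hat = [1.2, 2.0, 3.6, 2.8, 4.4]    # predictions from y = 0.4 + 0.8x
r2 = r_squared(y, y_hat)
print(round(r2, 4))
```

For a least-squares line with an intercept this ratio equals 1 minus the residual sum of squares over the total sum of squares, which is the form most libraries report.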
Splitting our dataset
• Now we need to split our dataset into two sets: a training set and a test set. We
train our machine learning models on the training set, i.e. the models
try to learn the relationships in the training data, and then we test the
models on the test set to check how accurately they can predict.
• A general rule of thumb is to allocate 80% of the dataset to the training set and the
remaining 20% to the test set, but in our project we split the dataset into 70% training
data and 30% test data.
• For this task, we import train_test_split from the scikit-learn library:
# Splitting the dataset into the training set and test set (70% / 30%)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
• We have 200 records, so
70% train data: 140 records
30% test data: 60 records
• In our dataset, we consider TV, Newspaper, Radio and Sales for modelling.
Train data is split into X_train and y_train.
Test data is split into X_test and y_test.
After splitting our dataset into train and test data, we apply the linear regression
algorithm.
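The split-then-fit workflow can be sketched with scikit-learn as follows. The 200 records here are synthetic stand-ins for the TV and Sales columns, generated from an assumed linear relationship plus noise, so the printed numbers are not the project's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for TV spend and Sales (200 records); not the real data.
tv = rng.uniform(0, 300, size=200)
sales = 7.0 + 0.05 * tv + rng.normal(0, 1.0, size=200)

X = tv.reshape(-1, 1)   # scikit-learn expects a 2-D feature matrix
y = sales
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
r2_test = model.score(X_test, y_test)   # R-squared on the held-out test set

print(len(X_train), len(X_test))
print(model.intercept_, model.coef_[0], r2_test)
```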
Looking at some key statistics from the summary, the
values we are concerned with are:
1. The coefficients and their significance (p-values)
2. R-squared
3. F statistic and its significance
1. The coefficient for TV is 0.054, with a very low p-value.
The coefficient is statistically significant, so the association is not purely by chance.
2. R-squared is 0.816,
meaning that 81.6% of the variance in Sales is explained by TV. This is
a decent R-squared value.
3. The F statistic has a very low p-value (practically zero),
meaning that the model fit is statistically significant, and the explained variance isn't
purely by chance.
The fit is significant. Let's visualize how well the model fits the data. From the
parameters that we get, our linear regression equation becomes:
Sales=6.948+0.054×TV
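The fitted equation can be applied directly to a new TV value, as in this small sketch. The intercept and slope are the ones reported above; the function name and the example input of 100 are assumptions for illustration.

```python
INTERCEPT = 6.948   # b0 reported in the regression summary
SLOPE_TV = 0.054    # b1 (coefficient for TV) reported in the regression summary

def predict_sales(tv_spend):
    """Predicted Sales for a given TV advertising value, using the fitted line."""
    return INTERCEPT + SLOPE_TV * tv_spend

example = predict_sales(100)
print(round(example, 3))
```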
Figure: train data prediction for the attributes TV and Sales
Figure: test data prediction for the attributes TV and Sales
Figure: train data prediction for the attributes Newspaper and Sales
Figure: test data prediction for the attributes Newspaper and Sales
Figure: train data prediction for the attributes Radio and Sales
Figure: test data prediction for the attributes Radio and Sales
4.5. Model Evaluation:
Residual analysis:
Residual analysis validates the assumptions of the model, and hence the reliability of
inferences drawn from it.
Distribution of the error terms
We need to check whether the error terms are normally distributed (which is, in fact, one of
the major assumptions of linear regression). To do so, we plot a histogram of the error terms
and see what it looks like.
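A minimal sketch of this check, assuming a one-feature least-squares fit: compute the error terms (actual minus predicted) and bin them into a coarse text histogram. The data values, the helper `fit_line`, and the bin width of 0.1 are all made up for illustration; in the project the histogram would be drawn with a plotting library instead.

```python
import math
from collections import Counter


def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - slope * mean_x, slope


# Made-up data for illustration (not the advertising dataset)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

intercept, slope = fit_line(x, y)
errors = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Coarse text histogram: key k covers errors in [k * 0.1, (k + 1) * 0.1)
bins = Counter(math.floor(e / 0.1) for e in errors)
for k in sorted(bins):
    print(f"{k * 0.1:+.1f} {'#' * bins[k]}")

mean_error = sum(errors) / len(errors)
print("mean error:", round(mean_error, 6))
```

For a least-squares fit with an intercept the errors average to zero by construction, so the thing to look for in the histogram is its shape: roughly symmetric and bell-shaped around zero if the normality assumption is reasonable.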
Advantages and Disadvantages:
 The principal advantages of linear regression are its simplicity, interpretability,
scientific acceptance, and widespread availability.
 Linear regression is the first method to use for many problems. Analysts can use
linear regression together with techniques such as variable recoding, transformation,
or segmentation.
 Its principal disadvantage is that many real-world phenomena simply do not correspond
to the assumptions of a linear model; in these cases, it is difficult or impossible to
produce useful results with linear regression.
 Linear regression is widely available in statistical software packages and
business intelligence tools.
Clustering vs Regression:
K-Means is a clustering algorithm, which falls under unsupervised learning. In clustering,
the idea is to assign data points to different groups depending on how they cluster
together. Linear regression, by contrast, is a regression algorithm, which falls under
supervised learning. It predicts a continuous-valued output: it is a statistical model used
to predict numeric values instead of labels. It can also identify trends based on the
available or historical data.
Sales can therefore be predicted from advertising spend on TV, Radio and Newspaper by
using the regression algorithm.
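The difference in output type can be seen in a small sketch. The first function shows only the assignment step of K-Means (picking the nearest centroid), which returns a group label; the second evaluates a fitted line, which returns a number. The centroids and coefficients are hypothetical values chosen only for illustration.

```python
def nearest_centroid_label(point, centroids):
    """Clustering-style output: index of the closest centroid (a label)."""
    distances = [abs(point - c) for c in centroids]
    return distances.index(min(distances))


def regression_prediction(x, intercept, slope):
    """Regression-style output: a continuous numeric value."""
    return intercept + slope * x


# Hypothetical centroids and coefficients, for illustration only
centroids = [50.0, 150.0, 250.0]
label = nearest_centroid_label(120.0, centroids)
value = regression_prediction(120.0, intercept=7.0, slope=0.05)
print(label, value)  # 1 13.0
```

The same input yields a group membership in one case and a numeric estimate in the other, which is why forecasting a Sales figure calls for regression.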

SALES FORECASTING FOR MARKETING CONCLUSION
5. CONCLUSION
By applying statistical measures, we can predict the sales and growth of the company
and get an idea of its rise and fall. Accurate prediction of sales therefore plays a
vital role in the development of the company and can lead to higher profits.
Sales forecasting is a critical part of the strategic planning process and allows a company to
predict how it will perform in the future. It allows the company not only to plan for new
opportunities, but also to avert negative trends that appear in the forecast. A
mission statement is important because it tells an organization exactly why it
exists and serves as a guide for decisions. Both concepts are important to the success of the
company and should not be overlooked throughout the strategic planning process.
