Final Report-Inqer Printers and Spares-Final - Edited

PRINTERS AND SPARES
PG DIPLOMA DATA SCIENCE-2019-2020 CAPSTONE PROJECT
DEMAND FORECASTING TO ASSIST AN UPCOMING PRINTER MANUFACTURER
TEAM MEMBERS
ABHYUDAYA MARYA
JEEVA RAMPRASAD
ROMA DADHIRAO
SUDARSHAN RAKHMAJI KADGE
G TAROON SUBRAMANIAM
AMITY UNIVERSITY ONLINE

Final Report-Amity Capstone Demand Forecasting-Printer Spare Parts
Table of Contents
1. Introduction
2. Project Description & Tools Used
3. Role of Machine Learning
4. Data Exploration
5. Data Manipulation
6. Feature Engineering
7. Building Training-Test Sample
8. Model Selection and Evaluation
9. Key Contributions
10. Analysis of Results
11. Tableau Visualizations
Introduction
This project is carried out for the benefit of an organization (Inqer Printers and Spares*) that
sells printer spare parts. We integrate its stock/inventory and demand data, collated over the
past 15 months, to track the supply chain effectively, improve decision making and expedite
grievance redressal using Machine Learning algorithms. Using statistical techniques, we
forecast demand in the short, medium and long term. In addition, we aim to optimize the
inventory to maximize profits (while remaining cognizant of the need for a buffer stock). This
should support better sales and operations plans across departments and improve resourcing
efficiency through supply plans based on prioritized demands, allocations and supply chain
constraints.
Dataset Description:
1. The data set consists of sales of a company dealing with a large number of
components.
2. It consists of inventory of the components in 3 different warehouses->AME, APJ,
EMEA.
3. It also consists of parameters to prioritize inventory planning through Local Area
Stock Code, PSMS*, D-Chain Status**, SPT***
D-Chain    PSMS
25         C2, C4
55         C5, S6, C8, S8
60, 61     S9
69         C5, S6, C8, S8
*The data is real data supplied by an organization (with certain realistic changes); a fictitious name is used to protect its privacy.
D-Chain    Description
25         Part is not currently available for sale but will be at a future date.
           Allows pricing and costs but does not allow orders.
55         The part is actively available.
60         The part is no longer available. Orders and pricing are blocked.
           *Orders can be manually created on GCSS.
61         The part is no longer available but does have a replacement.
           All D-Chain 61 parts must have a corresponding Material Determination record.
69         The part is actively available but under allocation.
           It allows orders but prevents shipment until the order is released.
Planned Delivery Time
Time it takes to receive a part after a purchase order (PO) is placed with a supplier
Yield Rate (For Repair Parts)
Special Procurement Type (SPT)
1. Part is repairable
2. Part is set to be returned to the OEM for warranty coverage
3. If a part is non-returnable, it is assumed it cannot be repaired
PSMS    Description
C2      In development: the default initial value for corporate parts
C4      Pre launch: the default initial value for SC specific parts
C5      NPI: part is released for support
S6      Sustaining: supported part which is > 180 past the SAP FCS date
C8      Supplier (or ‘Supply’) EOL: POs still possible with potential limitations
S8      LTB analysis done, LTB PO raised where required
S9      EOSL: part is no longer supported
C9      Obsolete: blocks all inventory and financial transactions
Project Objectives and Tools Used

The following tools are used in the project to forecast demand:
a. Excel->Converting a bulky dataset into a csv file and collating the demand data with
the spare parts data
b. Tableau->An initial descriptive report to build business acumen before we assign
values and weights to the dataset for manipulation in Python
c. Jupyter Notebook (Python)->Run through Anaconda, a live web-based environment for
data analytics that handles large datasets and helps gather insights effectively
The following Python libraries are most commonly used:
a. SciKit Learn->Covers the entire data analytics life cycle from preprocessing to learning
   i. Label Encoder, One Hot Encoder (Encoding Data)
   ii. Standard Scaler, Normalizer, MinMaxScaler (Scaling/Normalizing Data)
   iii. PCA, Shuffle, Column Transformer (Feature Engineering)
   iv. Train_Test_Split (Hold-out method for supervised learning)
   v. Supervised learning methods such as Naïve Bayes (Gaussian, Bernoulli, Multinomial),
      SVM, Decision Tree, Random Forest, Regression (Logistic, Linear)
   vi. Evaluation metrics such as accuracy score, precision score, recall score, f1 score
b. Pandas->Used to work on dataframes, from their creation to complex manipulations
c. Numpy->Used predominantly, but not only, to perform linear algebra operations on the
   dataframes (and other forms of datasets too)
d. Visualization
   i. Matplotlib
   ii. Seaborn
e. Other
   i. Statistics
   ii. Itertools
Objective
1. In this project, the target is to predict demand for a product based on its prior sales.
2. We are also trying to create a system to manage the inventory of the different
warehouses, depending on the sales of a product.
3. This will give an idea of how much quantity of a product is to be ordered.
*PSMS->Plant Specific Material Status; it indicates the part's current position in the life cycle
**D-Chain->Determines if part is for sale and available for immediate delivery
***SPT->Special Procurement Type->Returnable or non-returnable
Role of Machine Learning

Machine Learning Model
For demand planning, since we have past sales data, we can perform supervised learning that
takes into account the price of the item (business acumen suggests that lower priced/support
system items would usually need a bulk order). As is evident in the preliminary results, on
experimenting with various supervised learning techniques, the best results were obtained
with tree-based methods, i.e. Random Forest (an ensemble of decision trees) and Decision
Tree. We also recognized that it would be important to compartmentalize demand into blocks
such as low, medium, high, very high and booming, given the very large variance expected in
a business of such a wide ambit.
The manipulation is done using Python and the SciKit Learn library.
Given the categorical nature of the target variable (the variable being predicted), we
employed the following supervised learning techniques:
a. GaussianNB
b. Random Forest Classifier
c. Decision Tree Classifier
d. MultinomialNB
e. SVM
f. Bernoulli NB
The accuracies obtained are mentioned below (post feature engineering).
As is evident, the Random Forest and Decision Tree classifiers (tree-based methods) give us
the best results, hence we base our results on them. Across several trials, including shuffling
the data in the hold-out split and using random samples from the dataset, Random Forest is
consistently the best method.
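The comparison described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic (generated with make_classification, since the Inqer dataset is not reproduced here), the six classes stand in for the demand categories, and all hyperparameters are scikit-learn defaults rather than the settings used in the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the project data (assumption): 6 demand classes.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)

# 70/30 hold-out split, as in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=0
)

# MultinomialNB requires non-negative features, so scale to [0, 1].
# The scaler is fitted on the training part only; test values are clipped.
scaler = MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = np.clip(scaler.transform(X_test), 0.0, 1.0)

models = {
    "GaussianNB": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "MultinomialNB": MultinomialNB(),
    "SVM": SVC(),
    "BernoulliNB": BernoulliNB(),
}

# Fit every model on the training set and record its test accuracy.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in sorted(accuracies.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} {acc:.3f}")
```

The ranking printed for synthetic data says nothing about the ranking on the real dataset; the sketch only shows the shape of the experiment.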
Data Exploration
This was the first step of the demand forecasting process, as it is of most data analytics life
cycles in general. After the required libraries (Numpy, Pandas, ScikitLearn, Matplotlib etc.)
are imported, the steps are:
a. Load the csv file and check the datatypes
b. Read the first few rows (head() function)
c. Move the target variable to a different dataframe (it must be dropped from the features
later). Here, business acumen is of utmost importance.
   i. Local Stock Advice Code expresses the priority of the product as 1, 2, 3, 4; the
      weights assigned follow the same priority
   ii. There is a usual trade-off between quantity and price in the bulk of orders. To
      keep the results objective, we multiply LOFNETQTY by Price (e.g. a cartridge will
      always have far more individual parts in orders than a xerox machine, but the
      standard for us to deem it high demand is different)
d. Fill in NULL values in the dataframe and handle other details such as separating
month and year and correcting spellings
e. Note also that PSMS values have been given weights in accordance with priority (a manual
label encoding). A PSMS of S6 is extremely high priority because the part is sustaining,
while the others are not. (Based on discussion with the lead, S6 is given 4 times the
base priority of 1)
f. We one-hot encoded the month since our initial assumption/null hypothesis is that
the month per se will not have an impact. If it does, the machine learning model
will account for it based on the equal weights assigned through OHE. Similarly,
OHE was applied to the region too.
g. We bin the target variable as very low, low, medium, high, very high, booming
based on the needs of the lead. By trial and error based on the 25th, 50th, 75th,
85th, 95th and 99th percentiles of the target variable values, the bins were chosen as:
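As a rough illustration of steps c to g (not the project's actual code or bin edges, which are not reproduced here), the sketch below builds a small synthetic dataframe and applies the weighting, the quantity times price target, one-hot encoding and percentile-based binning. The column names other than LOFNETQTY and Price, the weights for statuses other than S6, and the five cut points are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in rows (assumption): real column names may differ.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "PSMS": rng.choice(["C5", "S6", "C8", "S8"], size=n),
    "LOFNETQTY": rng.integers(1, 500, size=n),
    "Price": rng.uniform(1, 300, size=n).round(2),
    "Month": rng.choice(["Jan", "Feb", "Mar"], size=n),
    "Region": rng.choice(["AME", "APJ", "EMEA"], size=n),
})

# Manual label encoding of PSMS. Only "S6 = 4 x the base priority of 1"
# comes from the report; the other weights are placeholders.
psms_weights = {"S6": 4, "C5": 1, "C8": 1, "S8": 1}
df["PSMS_weight"] = df["PSMS"].map(psms_weights)

# Target before binning: quantity multiplied by price.
df["demand_value"] = df["LOFNETQTY"] * df["Price"]

# One-hot encode month and region so each category gets an equal weight.
df = pd.get_dummies(df, columns=["Month", "Region"])

# Percentile-based binning into the six demand classes. The report lists
# candidate percentiles; the five cut points used here are an assumption.
labels = ["very low", "low", "medium", "high", "very high", "booming"]
cut_points = np.percentile(df["demand_value"], [25, 50, 75, 85, 95])
edges = [-np.inf, *cut_points, np.inf]
df["demand_class"] = pd.cut(df["demand_value"], bins=edges, labels=labels)

print(df["demand_class"].value_counts().reindex(labels))
```

Using open-ended outer edges (minus and plus infinity) means no value can fall outside the bins, which matters if the same edges are later applied to new data.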
Data Manipulation
a. Drop redundant/non-numerical columns. Drop month and region since they were already
one-hot encoded.
b. Normalize all columns except local stock advice code (it is essentially a weight)
c. Price, DChain and Inventory values had a few NaN/invalid values, so these were replaced
with the median
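A minimal sketch of steps b and c on a toy dataframe is shown below. The column names are hypothetical, and MinMaxScaler is an assumption: the report does not state which of the scikit-learn scalers was used for normalization.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data with gaps (hypothetical column names and values).
df = pd.DataFrame({
    "Price": [10.0, np.nan, 30.0, 50.0],
    "DChain": [55, 55, np.nan, 69],
    "AME_Inventory_Value": [100.0, 250.0, np.nan, 400.0],
    "Local_Stock_Advice_Code": [1, 2, 3, 4],
})

# Step c: replace NaN values with the median of each column.
cols_with_gaps = ["Price", "DChain", "AME_Inventory_Value"]
df[cols_with_gaps] = df[cols_with_gaps].fillna(df[cols_with_gaps].median())

# Step b: scale every column except the stock advice code, which acts
# as a weight and is therefore left untouched.
to_scale = [c for c in df.columns if c != "Local_Stock_Advice_Code"]
df[to_scale] = MinMaxScaler().fit_transform(df[to_scale])

print(df)
```

Imputing before scaling keeps the median on the original scale of each column; in a full workflow the scaler would normally be fitted on the training split only.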
Feature Engineering
Before delving into the feature engineering process, we first present an analysis of the
various supervised learning methods (after the hold-out method was applied).
(Only three basic code snippets are shown as examples, otherwise they would fill the whole page)
With the accuracy results (first just the basic metric)
Hence, it was imperative to filter out the important features. Random Forest was used for
this, since it was the most accurate method here.
Plot of features in order of importance
We can choose the top 10 features for our need (APJ_INVentory_Value,
AME_INVentory_Value,....., AMS) and discard the others.
A significant improvement of 4-5% in accuracy is seen in some cases. The best methods are
still the tree-based methods, Random Forest and Decision Tree.
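Ranking features with a Random Forest and keeping the top 10 can be sketched as below. The data and the feature names (feature_0 ... feature_19) are synthetic placeholders, and n_estimators=200 is an arbitrary choice, so this is an illustration of the technique rather than the project's snippet.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 20 candidate features (assumption).
X_arr, y = make_classification(
    n_samples=500, n_features=20, n_informative=8,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(20)])

# Fit a forest and read the impurity-based importance of each feature.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)

# Keep the 10 most important features and discard the rest.
top_features = importances.head(10).index.tolist()
X_selected = X[top_features]

print(importances.head(10))
```

A bar plot of `importances` (for example `importances.plot.bar()` with Matplotlib installed) gives the kind of ordered importance plot referred to above.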
Building Training-Test Sample

We use the hold-out validation method, wherein the larger part of the dataset is used to train
and build a model, and the trained model is then checked against the smaller test set to see
whether it has learned correctly, and hence to evaluate the model.
In this case,
Training->70% of the dataset
Test->30% of the dataset

Validation Process
The training and test sets were first built with the shuffle option set to “True” so that
the sets are formed randomly. After running the entire workflow 5 times, shuffle was set to
“False” so that the rows keep their original order, which makes it feasible for us to label
the evaluated dataset and hence provide insights with utmost accuracy.
The process was performed twice->first on the dataset as it is and then again, post the feature
engineering process.
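The 70/30 hold-out split and the effect of the shuffle option can be illustrated with scikit-learn's train_test_split. The arrays below are toy placeholders, not project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 50 rows, 2 features, 5 classes (placeholders).
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 5

# shuffle=True: rows are assigned to train/test at random.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=0
)

# shuffle=False: the first 70% of rows train, the last 30% test,
# so each test row can be traced back to its original position.
X_train_o, X_test_o, y_train_o, y_test_o = train_test_split(
    X, y, test_size=0.3, shuffle=False
)

print(len(X_train), len(X_test))  # sizes of the two parts
print(X_test_o[:3])               # first ordered test rows
```

With shuffle=False the split is only representative if the row order carries no pattern (for example, rows sorted by date or by part), which is why the shuffled runs were done first.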
Model Selection and Evaluation

Since we had a target variable, we limited the modelling to supervised learning techniques.
Regression would not prove useful in this case, because the large number of encodings and
weight assignments could skew the results through either underfitting or overfitting. Hence
we limited ourselves to the following methods:
a. Naïve Bayes->GaussianNB, MultinomialNB, BernoulliNB
b. Tree-based methods->Random Forest (ensemble), Decision Tree
c. Vector methods->Support Vector Machines
d. Neighbors->K-Nearest Neighbors Classifier
Initially, only the accuracy was tested to check the potential for feature engineering:
While the accuracy is decent for a few methods (Random Forest, DT, KNeighbors, SVM),
there is potential to improve both accuracy and computation time. Hence, it is
imperative to resort to feature selection.
Feature Selection Snippet

Random Forest was used for this purpose because it had the highest accuracy earlier
Accuracy Post Feature Selection
Feature selection did not bode well for the Naïve Bayes and SVM based methods. However, in
terms of the tree-based (Random Forest and DT) and Neighbors based (KNN) methods, the
improvement is highly significant. While any of the three can be chosen, we stuck to the
Random Forest Classifier.
Other Evaluation Metrics

Note->It is not recommended to use Area Under Curve for comparison, since it first requires
binarization of the data, which would make a categorical target variable less interpretable
Precision and Recall Score for Random Forest
Confusion Matrix
With values of over 75% in all metrics, this method could get the go-ahead for the
purpose.
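For reference, precision, recall, F1 and the confusion matrix for a multi-class Random Forest can be computed along the following lines. The data is synthetic and the "weighted" averaging is an assumption; the report does not state which averaging was used for its scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    confusion_matrix, f1_score, precision_score, recall_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 6 demand classes (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Multi-class scores need an averaging scheme; "weighted" weights each
# class by its number of true samples.
precision = precision_score(y_test, y_pred, average="weighted", zero_division=0)
recall = recall_score(y_test, y_pred, average="weighted", zero_division=0)
f1 = f1_score(y_test, y_pred, average="weighted", zero_division=0)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=list(range(6)))

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(cm)
```

Note that weighted recall equals plain accuracy, so of these scores it is precision and the confusion matrix that add information beyond the accuracy figure.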
To further improve accuracy, it would be recommended to:

a. Improve business acumen for better binning
b. Improve business acumen for better weight assignment
c. Gather more data; a year may not be enough to account for biases
Key Contributions
The advances in the subfields of Data Science, viz. Machine Learning, Artificial Intelligence
and Deep Learning, already contribute to our day-to-day activities, and their role grows
every day, so much so that the present millennial generation and the generations to come may
find a life without them unfathomable. While it may be far-sighted at this stage, the zenith
of this is expounded by Murray Shanahan as the “Technological Singularity”, which in very
rudimentary terms would involve our entire life processes being driven by technology to the
point of it being the sole decision maker.
Without digressing further into the philosophical, Machine Learning is used in a swathe of
industries, ranging from effective assembly lines in core industries to demand and inventory
planning in the service and e-commerce sector, and it has even forayed deeply into the
primary sector of Agriculture in a range of processes. Here, using real-world data, we
exemplified its significance for a small to medium scale Printers and Spares organization,
showing how it could aid effective decision making and inventory planning. Today,
employing machine learning for a business may be a luxury, but in the coming years it will
be a sine qua non. The following aspects were covered, as they usually are in any analytics
process:
a. Business Acumen=>Often domain-specific knowledge
b. Comprehensive knowledge of statistics as well as the mathematics of Machine Learning
c. Data Exploration processes, both technical and logical/face value based
d. Data preparation, both for valuable insights and to improve computation time
e. Data Modelling and Evaluation: choose the right techniques and evaluate which is the
most appropriate
f. Deployment->Export the learnings into a csv file for further analysis
g. Descriptive and Predictive visualizations in Tableau/Excel/Power BI
Analysis of Results
With at least 75% accuracy, we can offer the proprietor of Inqer Printers and Spares a
reasonably dependable prediction of demand. (Not to be too pedantic, but the data ran up to
Feb-2020, right before the economic slump due to the pandemic, so realistically the analysts
might have had a shock from how far the actual results diverged. This further highlights the
importance of dynamically accounting for such factors (economists use the phrases “animal
spirits” and “black swan events”), which is a learning in itself from the project.)
Nevertheless, following is how the expectations have been classified into the various demand
categories in the graph
As the trend suggested, a lot of products are indeed expected in “Low” demand, as they
followed those metrics; however, very few fall in “Very Low” and an encouraging number
in “Very High” and “High”. It is alarming, however, that there are several more products
with low demand than with medium demand, which makes it imperative to clear out the stock
that has not moved off the shelves, through:
a. Sale at throwaway prices
b. Depending on e-commerce platforms for sale
c. On the shelf marketing around local stores rather than solely own store
d. Chain marketing
e. Grassroot level lead generation for potential bulk orders (although executing this may
be expensive and might not have a great trade off)
For the few within “Very Low”, it clearly has to be through sale at throwaway prices.
The goal is to bring as many products as possible into the booming category. This is,
however, not realistic, and in any case would only involve raising the yardsticks further.
A high amount of demand within the “medium” category indicates that Inqer is not threatened
by competition, or at most that it has upped the ante; however, it must always be on the
lookout.
Tableau Visualizations
Descriptive
Predictions