
Name: Anu Krishnan

Email address: anu.unni12@gmail.com

Contact number: 9986545036

Anydesk address:

Years of Work Experience:

Date: 11/03/2022

Self Case Study -1: Instacart Market Basket Analysis

“After you have completed the document, please submit it in the classroom in the pdf format.”

Please check this video before you get started:


https://www.youtube.com/watch?time_continue=1&v=LBGU1_JO3kg

Overview
*** Write an overview of the case study that you are working on. (MINIMUM 200 words)
***

1. Instacart is an American company that operates a grocery delivery and pick-up service in
the United States and Canada.
2. Instacart's consumers order large, diverse baskets of food over and over again, so the
company has a higher density of user-behavior data than almost any other e-commerce company.
3. With this density of user-behavior data, Instacart invests heavily in recommending
products to its customers.
4. This problem focuses on improving the recommendation system, giving the user a more
hands-on experience by suggesting products to fill the basket; hence "Instacart Market
Basket Analysis".
5. Based on previous orders, the model has to predict what a user will ultimately reorder
and, with more precision, the N most applicable recommendations for that user.
6. Currently they use transactional data to develop models that predict which products a
user will buy again, try for the first time, or add to their cart next during a session.

Problem Definition:

- Given an order_id, predict all the products that the user will reorder.

Objective:

 Predicting reorders in advance is of huge business value for any delivery business, as it
can significantly help optimize the supply-chain management system.
 It can also be used to enhance the customer experience, e.g. by reminding the customer to
reorder a product.
 It can also be used to improve the recommender system.
 An ML model can help here by predicting the reorders.
 Any delivery or retail business can benefit from a solution to this problem.

Constraints:

 We want our model to perform equally well across all orders.
 Predicting reorders of frequently reordered products is more important than predicting
those of rarely reordered products.
 Recommending a product that a customer is not going to reorder may simply lead the
customer to explore new products, so a few False Positives do little harm.
 False Negatives on highly reordered products, however, can be costly.

Dataset Description:

1. Orders (3.4m rows, 206k users):


It contains the order_id, user_id, eval_set, order_number, order_dow, order_hour_of_day
and days_since_prior_order for every prior, train and test order.

 order_id: order identifier
 user_id: customer identifier
 eval_set: which evaluation set this order belongs to (see SET described below)
 order_number: the order sequence number for this user (1 = first, n = nth)
 order_dow: the day of the week the order was placed
 order_hour_of_day: the hour of the day the order was placed
 days_since_prior: days since the last order, capped at 30 (with NAs for order_number = 1)

2. Products (50k rows):

It contains the product_name, product_id, department_id and aisle_id of every product.

 product_id: product identifier
 product_name: name of the product
 aisle_id: foreign key
 department_id: foreign key

3. Aisles (134 rows):

It contains the aisle_id and name of each aisle.

 aisle_id: aisle identifier
 aisle: the name of the aisle

4. Departments (21 rows):

It contains the department_id and name of each department.

 department_id: department identifier
 department: the name of the department

5. order_products__prior.csv: contains the order_id, product_id, add_to_cart_order and
reordered flag for every product in every prior order (i.e. all previous orders of each user).

6. order_products__train.csv: contains the same columns for the train orders (the last
order of each train user).

 order_id: foreign key
 product_id: foreign key
 add_to_cart_order: order in which each product was added to the cart
 reordered: 1 if this product has been ordered by this user in the past, 0 otherwise

where SET is one of the three following evaluation sets (eval_set in orders):

 "prior": orders prior to that user's most recent order (~3.2m orders)
 "train": training data supplied to participants (~131k orders)
 "test": test data reserved for machine learning competitions (~75k orders)

Evaluation Metric
 F1 Score Maximization: the F1 score is computed separately for each order; every order
gets its own local threshold, and we pick the products that maximize that order's F1 score.

Reason:

 We want our model to give equal weightage to each order.

 We want a model with both good Precision and Recall.

 The Confusion Matrix will be used as a secondary metric.
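As a sketch of this metric, the per-order F1 and its mean over orders can be computed as below. The dict-of-sets representation and function names are my own choices for illustration, not part of the competition code:

```python
def order_f1(true_products, pred_products):
    """F1 score for a single order, treating its products as sets."""
    if not true_products and not pred_products:
        return 1.0  # both actual and predicted are "None": a perfect match
    tp = len(true_products & pred_products)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_products)
    recall = tp / len(true_products)
    return 2 * precision * recall / (precision + recall)

def mean_order_f1(y_true, y_pred):
    """Average F1 over all orders, giving each order equal weight."""
    return sum(order_f1(y_true[o], y_pred.get(o, set())) for o in y_true) / len(y_true)

# Hypothetical example: two orders, each with a partially correct prediction.
y_true = {1: {10, 20}, 2: {30}}
y_pred = {1: {10}, 2: {30, 40}}
score = mean_order_f1(y_true, y_pred)
```

Because the mean is taken over orders, a small order contributes exactly as much as a large one, which matches the "equal weightage to each order" reasoning above.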

Train Test Split:

Train_data: first take order_products__train and left-merge it with the orders file on
order_id, then add the prior products of each user from order_products__prior, labelling
any prior product that does not appear in the train order with reordered = 0.

Test_data: from the orders file take all the test order_ids; for each corresponding
user_id, take all of that user's prior products and merge them with the test orders.

Cross_validation: from the train_data take all the unique users and split off 20 percent
of them as the validation set.
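The merges described above can be sketched with pandas. The tiny DataFrames below are synthetic stand-ins for the Kaggle files (values invented for illustration; column names follow the dataset description):

```python
import pandas as pd

# Synthetic stand-ins for orders.csv, order_products__prior.csv and order_products__train.csv.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "user_id":  [7, 7, 8, 8],
    "eval_set": ["prior", "train", "prior", "test"],
})
prior = pd.DataFrame({"order_id": [1, 1, 3], "product_id": [100, 200, 300], "reordered": [0, 1, 0]})
train = pd.DataFrame({"order_id": [2], "product_id": [100], "reordered": [1]})

# Train data: each user's last ("train") order joined with the order metadata.
train_data = train.merge(orders[orders.eval_set == "train"], on="order_id", how="left")

# Candidate products per user come from all of their prior orders.
prior_full = prior.merge(orders[orders.eval_set == "prior"], on="order_id", how="left")
user_products = prior_full[["user_id", "product_id"]].drop_duplicates()

# Test data: for each test order, every product the user has bought before.
test_data = orders[orders.eval_set == "test"].merge(user_products, on="user_id", how="left")
```

In the real pipeline the labelling step (reordered = 0 for prior products absent from the last order) would follow these merges.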
Research-Papers/Solutions/Architectures/Kernels
*** Mention the urls of existing research-papers/solutions/kernels on your problem statement
and in your own words write a detailed summary for each one of them. If needed you can
include images or explain with your own diagrams. it is mandatory to write a brief description
about that paper. Without understanding of the resource please don’t mention it***

1. https://www.lexjansen.com/sesug/2019/SESUG2019_Paper-252_Final_PDF.pdf

1. Kaggle Solution by Paulantoine (fixed threshold)

Here Paulantoine built 4 types of features:

Observations:

 User Features

 Product Features

 Time Features

 User-Product Features

 After converting the data into a binary classification problem, he used a fixed threshold
(tuned using GridSearchCV) to predict the possible set of recommendations for each order.
 A probability threshold is used to decide which products a user will reorder, and those
products are recommended to the user. Generally, if P(X) > 0.5 → class 1 (reorder, in this
case), else class 0. So we can select the products that belong to class 1 (i.e. P(X) > 0.5)
and recommend them to the user. This 0.5 threshold can be changed in order to improve the
performance of the model.
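A minimal sketch of the fixed-threshold step (function name and probabilities below are hypothetical; in the actual solution the threshold value would come out of the grid search):

```python
def recommend_fixed_threshold(order_probs, threshold=0.5):
    """Pick every candidate product whose predicted reorder probability
    exceeds a single global threshold, shared by all orders."""
    return [pid for pid, p in order_probs if p > threshold]

# Hypothetical (product_id, predicted reorder probability) pairs for one order.
probs = [(101, 0.72), (102, 0.31), (103, 0.55)]
recs = recommend_fixed_threshold(probs)
```

The key limitation, addressed by the per-order thresholds of later solutions, is that one global cut-off ignores how many products each individual user tends to reorder.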
2. Kaggle 2nd Place Solution by Kazuki Onodera ( Variable Threshold )

Observations:

 Onodera built 4 categories of features:

o User Features

o Product Features

o Time Features

o User-Product Features

 23 XGBoost models were built with 70+ features to solve this problem.

 17 models were used to predict None (the probability that the user's next order contains
no reordered product) and 6 models were used to predict the reorder probability (the
probability of a product being reordered by the user in the next order).

 To convert these probabilities into binary labels and compute the threshold, a special
F1 Score Maximization algorithm is used, which outputs the combination of products that
results in the maximum F1 score. The probability threshold is decided per order based on
its F1 score.

 A weighted average of all these probabilities was used to generate recommendations with
a high F1 score.

 F1 Score Maximization
o Decides the probability threshold for each order (instead of a fixed rule such as
P(X) > 0.5 → class 1).
o Example setup: assume we can choose between product A and product B for order 1.
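A simplified sketch of per-order threshold selection: sort candidates by predicted probability and pick the top-k that maximizes an approximate expected F1, E[F1] ≈ 2·Σ(top-k p) / (k + Σ(all p)). This closed-form approximation (expected true positives in the numerator, expected basket sizes in the denominator) is my simplification, not Onodera's exact algorithm:

```python
def best_topk_expected_f1(probs):
    """Return (k, approx expected F1) for the best top-k cut over
    candidates sorted by descending reorder probability."""
    probs = sorted(probs, reverse=True)
    total = sum(probs)            # expected number of truly reordered products
    best_k, best_f1 = 0, 0.0
    running = 0.0                 # expected true positives among the top-k
    for k, p in enumerate(probs, start=1):
        running += p
        f1 = 2 * running / (k + total)
        if f1 > best_f1:
            best_k, best_f1 = k, f1
    return best_k, best_f1

k, f1 = best_topk_expected_f1([0.9, 0.6, 0.1])
```

Note how the chosen k varies with the probability profile of each order, which is exactly the "variable threshold" idea: a confident order keeps more products, a doubtful one fewer (or none).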

First Cut Approach


*** Explain in steps about how you want to approach this problem and the initial experiments
that you want to do. (MINIMUM 200 words) ***

*** When you are doing the basic EDA and building the First Cut Approach you should not refer
any blogs or papers ***

Based on the research and reading I have done, I will follow the steps below:

1. At first, it seems like Multi-Label Classification, but there are 49688 products, and total
product recommendations could be anywhere from None to N. Therefore, this problem is
restructured into a binary classification problem, where we will predict the probability of
an item being reordered.

2. To build a model, we need to extract features from previous orders to understand each
user's purchase pattern and how popular a particular product is. I extract the following
features from the users' transactional data.

3. For each order, we will use these probabilities to pick the top K most probable products
that will be reordered, and recommend those to the user.

4. Feature engineering is one of the most important aspects of this Kaggle problem, so I
will come up with various aggregated features based on the purchase data.
i. Add a None product with reordered == 1 for every order_id where the sum of the
reordered flags of that order is 0, so that we can also predict None when the
user does not reorder anything.

ii. User_product_ratio

iii. Day of Week reordered_ratio

iv. Hour_of_day Reordered Ratio

v. Days_since_prior_order Reordered ratio

vi. Product hour of day reordered ratio

vii. Product_dayofweek Ratio

viii. User_dow reordered ratio

ix. User hour reordered_ratio

x. Days since prior order for a product

5. We can see which features are more useful in predicting product reorders, and come up
with more features based on those findings.

6. We can utilize this anonymized transactional data of customer orders over time to predict
which previously purchased products will be in a user’s next order. This would help
recommend the products to a user.

7. Using the extracted features, prepare a dataframe which shows all the products a user
has bought previously, user-level features, product-level features, aisle- and
department-level features, user-product-level features and the information of the current
order such as its day-of-week, hour-of-day, etc. The target would be 'reordered', which
shows how many of the previously purchased items the user ordered this time.

8. We make use of the F1-score maximization metric, which gives a list of products with a
high reorder probability.

9. Modelling – from the extracted features we will implement various models such as
Logistic Regression, Random Forest, Decision Tree and XGBoost, and select the model that
performs fast and well.
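Two of the ratio features listed in step 4 can be sketched as pandas aggregations. The exact definitions below (product share of a user's purchases; per-weekday reorder rate) are my assumptions about what the feature names mean; column names follow the dataset:

```python
import pandas as pd

# Tiny synthetic prior-order history (user_id, product_id, order_dow, reordered).
prior = pd.DataFrame({
    "user_id":    [7, 7, 7, 8, 8],
    "product_id": [100, 100, 200, 100, 300],
    "order_dow":  [0, 1, 0, 1, 1],
    "reordered":  [0, 1, 0, 1, 0],
})

# User_product_ratio: share of a user's purchases that are this product.
up = prior.groupby(["user_id", "product_id"]).size().rename("up_count").reset_index()
u = prior.groupby("user_id").size().rename("u_count").reset_index()
up = up.merge(u, on="user_id")
up["user_product_ratio"] = up["up_count"] / up["u_count"]

# Day-of-week reordered ratio: fraction of purchases on each weekday that are reorders.
dow_ratio = prior.groupby("order_dow")["reordered"].mean().rename("dow_reorder_ratio")
```

The remaining ratios (hour-of-day, days-since-prior, product-by-weekday, etc.) follow the same group-by-and-divide pattern with different keys.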
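The model-comparison step (step 9) can be sketched with scikit-learn. The feature matrix below is random stand-in data, not the engineered features; in the real pipeline X and y would come from the dataframe built in step 7:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic binary-classification data standing in for (features, reordered).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit each candidate model and compare validation F1 scores.
for model in [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)]:
    model.fit(X_tr, y_tr)
    score = f1_score(y_va, model.predict(X_va))
    print(type(model).__name__, round(score, 3))
```

XGBoost would slot into the same loop; the winner is whichever model balances validation F1 against prediction latency, since the final function must also be fast.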
Notes when you build your final notebook:

1. You should not train any model – whether an ML model, a DL model, a CountVectorizer,
or even a simple StandardScaler.
2. You should not read the train data files.
3. Function 1 takes only one argument "X" (a single data point, i.e. a 1×d feature
vector), and inside the function you will preprocess the data point the same way you
featurized your train data.
a. Ex: consider you are doing the taxi demand prediction case study (problem
definition: given a time and location, predict the number of pickups that can
happen)
b. so in your final notebook, you need to pass only those two values
c. def final(X):
       # preprocess data, i.e. data cleaning, filling missing values etc.
       # compute features based on this X
       # use the pre-trained model
       return predicted_outputs
   final([time, location])

d. in the instructions, we have mentioned two functions – one with original values
and one without
e. final([time, location]) # this function needs to return the predictions; no
need to compute the metric
f. final(set of [time, location] values, corresponding Y values) # when you pass
the Y values, we can compute the error metric(Y, y_predict)
4. After you have preprocessed the data point, you will featurize it with the help of the
trained vectorizers or methods you followed for your train data.
5. Assume this function productionizes the best model you have built; you need to measure
the time taken to predict and report it. Make sure you keep this time as low as possible.
6. Check this live session: https://www.appliedaicourse.com/lecture/11/applied-machine-
learning-online-course/4148/hands-on-live-session-deploy-an-ml-model-using-apis-on-
aws/5/module-5-feature-engineering-productionization-and-deployment-of-ml-models
