Kaggle: Instacart Market Basket Analysis

Nov 18, 2020 · 7 min read

Introduction
Business Problem
Machine Learning Problem
Data Explanation
Exploratory Data Analysis
Feature Engineering
Machine Learning Models
References

Introduction

Instacart is an American company that operates a grocery delivery and pick-up service in the United States and Canada. The company offers its services via a website and mobile app. The service allows customers to order groceries from participating retailers with the shopping being done by a personal shopper.

In this service model, a personal shopper picks, packs, and delivers the order within the customer’s designated time frame, which can be as soon as one hour or scheduled up to five days in advance. Instacart also offers premium plans that give subscribers additional delivery privileges.

Business Problem

Many people use this platform day to day to order groceries straight to their door. Like any other shop, the platform has regular customers who purchase from it often.

Instacart’s aim in this problem statement is to make it easy for customers to buy the items they order frequently within a given period of time. This saves customers the time they would otherwise spend searching for each product and re-ordering it.

The problem sounds like product recommendation, but it is different from recommending similar products. Here, based on previous purchases, we try to recommend the products that the user might “buy again” on their next visit to the platform.

Machine Learning Problem

The Instacart data science team wants machine learning engineers and researchers to come up with new methods to predict the probability that a user will “buy again” a given product.

The dataset consists of 6 files, each containing different information about Instacart. It includes information about the products, the aisle in which each product is kept, which products were reordered, how many days passed before the user came back to shop, and so on.

The dataset is divided into 3 parts: Prior, Train, and Test.

The Prior set contains each user’s previous orders, and the user’s last order is placed in either the Train or the Test set. The number of orders per user ranges from 4 to 100. There are almost 50K products and 3M orders in total.

The dataset is not balanced in terms of ‘reordered’, and the number of orders differs from product to product. There is also a chance that the customer orders nothing from his / her previously ordered products, so ‘None’ can also be the answer to a user’s next purchase. Thus we should treat ‘None’ as a separate product alongside the others.
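The ‘None’ handling described above can be sketched as follows. This is a minimal illustration, not the code used in the competition: the function name `build_target`, the order keys, and the toy product ids are all hypothetical.

```python
def build_target(reordered_products):
    """Return the label list for one order.

    An order in which nothing was reordered gets the pseudo-product 'None',
    so that 'None' can be predicted like any other product.
    """
    products = sorted(set(reordered_products))
    return products if products else ["None"]


# Toy example (hypothetical ids): one order with reorders, one without.
targets = {
    "order_a": build_target([24852, 13176, 24852]),
    "order_b": build_target([]),
}
print(targets)
```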

Data Explanation

Here we are provided with 6 tables:

Aisles
Department
Prior orders
Train orders
Orders
Products

All the tables are related to each other, as displayed below.

Explanation of a few terms:
add_to_cart_order: The position at which the product was added to the cart while shopping.
reordered: Whether the product was reordered or not.
order_dow: Day of the week on which the order was made.
order_hour_of_day: Hour of the day at which the order was made.
days_since_prior_order: Number of days between the user’s previous order and this one.

The data contains around 50K products and the purchase history of 206,209 users.
As mentioned above, the dataset is divided into 3 parts.
There are 3,214,874 prior orders, which are used to create features, along with 131,209 train orders and 75,000 test orders.

Each user has between 4 and 100 orders.
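The split counts above can be checked with a few lines of pandas. The sketch below uses a tiny made-up orders table rather than the real file, and it assumes the split is stored in a column named `eval_set` with values `prior`, `train`, and `test`; that column name is an assumption here, since it is not listed among the terms above.

```python
import pandas as pd

# Toy stand-in for the orders table (values are made up for illustration).
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "user_id": [1, 1, 1, 2, 2, 2],
    "eval_set": ["prior", "prior", "train", "prior", "prior", "test"],
    "order_number": [1, 2, 3, 1, 2, 3],
})

# Number of orders in each part of the dataset.
split_counts = orders["eval_set"].value_counts().to_dict()

# Number of orders per user (the highest order_number seen for that user).
orders_per_user = orders.groupby("user_id")["order_number"].max()

print(split_counts)
print(orders_per_user.min(), orders_per_user.max())
```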

Exploratory Data Analysis

Let’s be Dora and explore the data.

At what time of day do people mostly prefer to shop?
Answer: Between 10am and 4pm

After how many days do most users tend to come back to shop?
Answer: Mostly after 7 days or 30 days.

Which departments sell the most products?
Answer: Dairy Eggs, Produce

What is the basket size when the user checks out?
Answer: 4–6

Check whether the most reordered item is also the item the user adds to the cart first.

Which product is reordered the most in every hour of the day?
Answer: Banana

Which product is reordered the most on every day of the week?
Answer: Banana
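Questions like these reduce to a handful of group-by operations. The following is a rough sketch on made-up toy tables, not the original notebook; the table layout and the `product_name` column joined directly onto the order lines are simplifying assumptions.

```python
import pandas as pd

# Toy tables (made-up values) standing in for orders and order lines.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "order_hour_of_day": [10, 14, 10, 20],
})
order_products = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 3, 4],
    "product_name": ["Banana", "Milk", "Banana", "Bread", "Banana", "Milk"],
    "reordered": [0, 0, 1, 0, 1, 1],
})

# How many orders are placed in each hour of the day?
orders_by_hour = orders["order_hour_of_day"].value_counts().sort_index()

# How many items are in each basket at checkout?
basket_size = order_products.groupby("order_id").size()

# Which product is reordered most often in each hour?
merged = order_products.merge(orders, on="order_id")
reorders = merged[merged["reordered"] == 1]
top_reordered_by_hour = reorders.groupby("order_hour_of_day")["product_name"].agg(
    lambda names: names.value_counts().idxmax()
)

print(orders_by_hour)
print(basket_size.mean())
print(top_reordered_by_hour)
```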

Feature Engineering

I created 4 types of features:

User Features: What is the user like?
Product Features: What is the product like?
User x Product Features: How does the user feel about the product?
Datetime Features: On which day and at what time does the user purchase the item?

User Features

How long the user has been using Instacart for shopping.
Average number of days between the user’s visits.
Time of day of the user’s visits.
Average basket size when the user checks out.
Distinct products purchased by the user.
How often the user reorders an item.
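A few of these user features can be sketched with pandas named aggregation. This is a minimal illustration on made-up data: the feature names follow the ones explained later in this post, but the exact definitions used in the competition may differ, and the helper columns `user_total_items` and `user_total_orders` are hypothetical.

```python
import pandas as pd

# Toy prior order lines and orders (made-up values).
prior = pd.DataFrame({
    "user_id":    [1, 1, 1, 1, 1, 2, 2],
    "order_id":   [10, 10, 11, 11, 11, 20, 21],
    "product_id": [100, 200, 100, 200, 300, 100, 100],
    "reordered":  [0, 0, 1, 1, 0, 0, 1],
})
orders = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "order_id": [10, 11, 20, 21],
    # The first order of each user has no prior order, hence NaN.
    "days_since_prior_order": [float("nan"), 7.0, float("nan"), 30.0],
})

user_features = prior.groupby("user_id").agg(
    user_distinct_products=("product_id", "nunique"),
    user_reorder_ratio=("reordered", "mean"),
    user_total_items=("product_id", "size"),
    user_total_orders=("order_id", "nunique"),
)
# Average number of items per order.
user_features["user_average_basket"] = (
    user_features["user_total_items"] / user_features["user_total_orders"]
)
# Mean gap between visits; NaN values are skipped by mean().
user_features["user_mean_days_prior_order"] = (
    orders.groupby("user_id")["days_since_prior_order"].mean()
)

print(user_features)
```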

Product Features

How many users purchased this product.
How often the product is reordered.
Average position of the product in the cart.
The streak of consecutive orders containing the product.
Reorder ratio of the product’s aisle.
Reorder ratio of the product’s department.
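Product features follow the same pattern, grouped by product instead of by user. The sketch below is an assumption-laden toy example: apart from `prod_user_uniq_reordered`, which is named later in this post, the feature names are hypothetical.

```python
import pandas as pd

# Toy prior order lines (made-up values).
prior = pd.DataFrame({
    "user_id":           [1, 1, 2, 2, 3],
    "product_id":        [100, 100, 100, 200, 200],
    "add_to_cart_order": [1, 2, 1, 3, 1],
    "reordered":         [0, 1, 0, 0, 0],
})

product_features = prior.groupby("product_id").agg(
    prod_unique_users=("user_id", "nunique"),
    prod_reorder_ratio=("reordered", "mean"),
    prod_mean_cart_position=("add_to_cart_order", "mean"),
)

# Number of distinct users who reordered the product at least once;
# products that were never reordered get 0.
product_features["prod_user_uniq_reordered"] = (
    prior[prior["reordered"] == 1]
    .groupby("product_id")["user_id"]
    .nunique()
    .reindex(product_features.index, fill_value=0)
)

print(product_features)
```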

User x Product Features

How often the user reordered a particular product.
After how many days the user purchases a particular product again.
Average day of the week on which the user purchased the product.
Average cart position of the product for the user.
Mean hour of the day at which the user purchased the product.
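User x Product features are computed per (user, product) pair. A minimal sketch on made-up data is shown here; `up_order_rate` is named later in this post, while `up_orders` and `up_mean_cart_position` are hypothetical names, and the rate is assumed to be "orders containing the product divided by the user's total orders".

```python
import pandas as pd

# Toy prior order lines (made-up values).
prior = pd.DataFrame({
    "user_id":           [1, 1, 1, 1, 2, 2],
    "order_number":      [1, 2, 3, 3, 1, 2],
    "product_id":        [100, 100, 100, 200, 100, 300],
    "add_to_cart_order": [1, 2, 3, 1, 1, 1],
})

up = (
    prior.groupby(["user_id", "product_id"])
    .agg(
        up_orders=("order_number", "nunique"),
        up_mean_cart_position=("add_to_cart_order", "mean"),
    )
    .reset_index()
)

# Total number of orders per user, looked up for every (user, product) row.
user_total_orders = prior.groupby("user_id")["order_number"].max()
up["up_order_rate"] = up["up_orders"] / up["user_id"].map(user_total_orders)

up_indexed = up.set_index(["user_id", "product_id"])
print(up_indexed)
```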

In total, I ended up with 78 features for each ordered product.

To explain some top features:

user_reorder_ratio: How often the user reorders products, out of all the products purchased so far.
user_mean_days_prior_order: Mean number of days after which the user comes back to shop.
user_average_basket: Size of the basket when the user checks out.
user_distinct_products: How many distinct products the user has purchased so far.
prod_user_uniq_reordered: How many users reordered a particular product.
user_order_starts_at: How long the user has been an Instacart customer.
dep_reordered_ratio: Reorder ratio of the product’s department, i.e. which departments have the most reordered products.
up_order_rate: The rate at which the user buys a product, i.e. how frequently the product is bought by the user.

Machine Learning Models

For training the model, we already had the train set and test set provided by Instacart.

At first I had 92 features. When I initially fed all 92 to XGBoost, my threshold-based F1 score was 0.28, which was very low. To improve it, I decided to compute feature importances, which could help me reduce the dimensionality of the data. To do this, I performed cross-validation using the LightGBM and XGBoost models. I got better results with LightGBM.
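To make "threshold-based F1 score" concrete: the model outputs a reorder probability for each candidate product, products at or above a threshold are predicted, and the prediction is compared with the actual basket using F1. The sketch below is a simplified, hypothetical illustration for a single order with made-up probabilities; the function names and the choice to fall back to 'None' when nothing clears the threshold are assumptions, not the original code.

```python
def select_products(probabilities, threshold):
    """Pick products whose reorder probability reaches the threshold."""
    chosen = [p for p, prob in probabilities.items() if prob >= threshold]
    # If nothing clears the threshold, predict the pseudo-product 'None'.
    return chosen if chosen else ["None"]


def f1_for_order(predicted, actual):
    """F1 score between the predicted and actual product sets of one order."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Made-up probabilities for one order and its actual reordered products.
probabilities = {"banana": 0.9, "milk": 0.4, "bread": 0.1}
actual = ["banana", "milk"]

scores = {
    threshold: f1_for_order(select_products(probabilities, threshold), actual)
    for threshold in (0.05, 0.3, 0.5)
}
print(scores)
```

In this toy case a threshold that is too low adds a wrong product and one that is too high drops a correct one, which is why the threshold itself is worth tuning.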

I printed the feature importance of every feature I had and kept 78 of them.
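One simple way to do such a selection is to rank features by importance and keep the top k. The sketch below assumes that is roughly what happened here; the importance values are made up, `not_in_post_feature` is a placeholder name, and the helper `top_k_features` is hypothetical.

```python
def top_k_features(importances, k):
    """Return the names of the k features with the highest importance."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Made-up importance scores for illustration only.
importances = {
    "up_order_rate": 120,
    "user_reorder_ratio": 95,
    "dep_reordered_ratio": 40,
    "not_in_post_feature": 3,
}

selected = top_k_features(importances, k=3)
print(selected)
```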

For my first-cut approach I tried an XGBoost model, with which I reached the top 20% of the Kaggle leaderboard.

Later I tried the same data with LightGBM, which improved the score by a few decimal places.

So I stacked these models using a custom stacking ensemble that I wrote, as shown below:

Flowchart of how the fit method works

Explanation of the predict method

The test data is passed through all the trained models, and their predicted probabilities are added together. Finally, the sum is divided by self.loop to get the mean probability.

This gave me a score of 0.40377 on the public leaderboard and a rank of 144 out of 2,622, placing me in the top 6%.
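The averaging step of the predict method can be sketched as below. Only the idea of summing probabilities and dividing by `self.loop` comes from the description above; the class names `AveragingEnsemble` and `ConstantModel`, the assumption that `self.loop` equals the number of trained models, and the assumption that each model's `predict` returns probabilities are all hypothetical, and the fit logic is deliberately left out.

```python
import numpy as np


class ConstantModel:
    """Hypothetical stand-in for a trained model; always predicts one probability."""

    def __init__(self, probability):
        self.probability = probability

    def predict(self, X):
        return np.full(len(X), self.probability)


class AveragingEnsemble:
    """Averages the predicted probabilities of already trained models."""

    def __init__(self, models):
        self.models = models
        # Assumption: self.loop is the number of trained models.
        self.loop = len(models)

    def predict(self, X):
        total = np.zeros(len(X))
        for model in self.models:
            total += model.predict(X)  # add up the probabilities
        return total / self.loop       # mean probability per row


ensemble = AveragingEnsemble([ConstantModel(0.2), ConstantModel(0.6)])
mean_probability = ensemble.predict([[0], [1], [2]])
print(mean_probability)
```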