Machine Learning — Kaggle Competition

INSTACART MARKET BASKET ANALYSIS

WHICH PRODUCTS WILL AN INSTACART CONSUMER PURCHASE AGAIN?

I. INTRODUCTION

Whether you shop from meticulously planned grocery lists or let whimsy guide your grazing, our unique food rituals define who we are. Instacart, a grocery ordering and delivery app, aims to make it easy to fill your refrigerator and pantry with your personal favorites and staples when you need them. After you select products through the Instacart app, personal shoppers review your order and do the in-store shopping and delivery for you.

II. TECHNICAL INFORMATION

The dataset for this competition is a relational set of files describing customers’ orders over time. The goal of the competition is to predict which products will be in a user’s next order. The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, between 4 and 100 orders are provided, with the sequence of products purchased in each order. The dataset also gives the day of the week and hour of day at which each order was placed, and a relative measure of time between orders. For more information, see the blog post accompanying its public release.[3]

III. DEVELOPMENTS AND RESULTS

1. Features computation

We separated the features into seven categories, each identified by a prefix:
- u : User
- p : Product
- f : User / Product
- d : Department
- a : Aisle
- ud : User / Department
- ua : User / Aisle
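
As an illustration of the first three categories (u, p and f), the following is a minimal sketch in plain Python, not the code used for the competition. The toy rows, the tuple layout (user, order number, product) and every feature name such as u_n_orders or f_order_rate are assumptions made for this example. It also assumes that a product appears at most once per order.

```python
from collections import defaultdict

# Hypothetical toy order history: (user_id, order_number, product).
# Layout and values are assumptions for illustration only.
PRIOR = [
    (1, 1, "milk"), (1, 1, "eggs"),
    (1, 2, "milk"),
    (1, 3, "milk"), (1, 3, "bread"),
    (2, 1, "eggs"),
    (2, 2, "eggs"), (2, 2, "milk"),
]


def compute_features(rows):
    """Return (u, p, f): user, product and user/product feature dictionaries."""
    user_orders = defaultdict(set)      # user -> distinct order numbers
    product_buyers = defaultdict(set)   # product -> distinct users
    product_count = defaultdict(int)    # product -> total purchases
    pair_count = defaultdict(int)       # (user, product) -> purchases by that user

    for user, order, product in rows:
        user_orders[user].add(order)
        product_buyers[product].add(user)
        product_count[product] += 1
        pair_count[(user, product)] += 1

    # u : one row of features per user
    u = {user: {"u_n_orders": len(orders)} for user, orders in user_orders.items()}

    # p : one row of features per product
    p = {
        prod: {"p_n_purchases": n, "p_n_users": len(product_buyers[prod])}
        for prod, n in product_count.items()
    }

    # f : one row of features per (user, product) pair
    f = {
        (user, prod): {
            "f_n_purchases": n,
            # share of the user's orders that contain this product
            "f_order_rate": n / u[user]["u_n_orders"],
        }
        for (user, prod), n in pair_count.items()
    }
    return u, p, f


u, p, f = compute_features(PRIOR)
print(f[(1, "milk")])
```

The department and aisle categories (d, a, ud, ua) would follow the same pattern once each product is mapped to its department and aisle.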

2. Classifier choice

We tested every basic algorithm provided by scikit-learn[4]. XGBoost is currently one of the fastest learning algorithms. It is actually a meta-classifier (an ensemble of boosted decision trees), but a very efficient one.

3. Number of estimators and learning rate

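Many classifiers expose these parameters. In a boosted model such as XGBoost, the number of estimators (or number of rounds) is the number of trees added one after another, each new tree being fitted to correct the errors of the ensemble built so far. The learning rate scales the contribution of each new tree, so a smaller learning rate usually requires more rounds. The final classifier is the combination of all these trees. Generally, the more data you have, the higher the number of estimators should be.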
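
To make the role of these two parameters concrete, here is a minimal sketch of boosting with one-split trees (stumps) on a squared-error loss, written in plain Python. It is a teaching toy and not XGBoost itself. The function names (fit_stump, boost, mse) and the data XS and YS are invented for this example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on xs that best fits the residuals."""
    best = None
    # Exclude the largest value so that both sides of the split are non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return t, lv, rv


def boost(xs, ys, n_estimators, learning_rate):
    """Add n_estimators stumps in sequence, each fitted to the current residuals."""
    base = sum(ys) / len(ys)          # initial prediction: the mean target
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        # Each stump only moves the prediction by a fraction (learning_rate).
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
    return base, stumps, preds


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data (assumed): one numeric feature and a 0/1 target.
XS = [1, 2, 3, 4, 5, 6, 7, 8]
YS = [0, 0, 0, 1, 0, 1, 1, 1]

for rounds in (1, 10, 100):
    _, _, preds = boost(XS, YS, rounds, learning_rate=0.1)
    print(rounds, "rounds -> training MSE", round(mse(YS, preds), 4))
```

The training error keeps falling as rounds are added, which is why the number of estimators is normally chosen on held-out data rather than on the training set.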

4. Product selection method

The classifier assigns each product a probability of being reordered. A second step is then to use these probabilities to select the products that go into the predicted order.
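
Two simple selection rules can be sketched as follows. This is an illustration under stated assumptions, not the selection method used for our submissions. The first rule keeps every product above a fixed threshold. The second keeps the top k products, with k chosen to maximize a rough approximation of the expected F1 score, 2 * (sum of the top k probabilities) / (k + sum of all probabilities). The function names and the example probabilities are hypothetical.

```python
def select_products(probs, threshold=None):
    """Select products from a dict {product: probability of reorder}.

    With a threshold, keep every product whose probability reaches it.
    Without one, keep the top k products, where k maximizes an
    approximate expected F1 (ratio of expectations, not the exact value).
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        return [prod for prod, p in ranked if p >= threshold]

    total = sum(probs.values())   # expected number of truly reordered products
    best_k, best_score, running = 0, 0.0, 0.0
    for k, (_, p) in enumerate(ranked, start=1):
        running += p              # expected true positives among the top k
        score = 2 * running / (k + total)
        if score > best_score:
            best_k, best_score = k, score
    return [prod for prod, _ in ranked[:best_k]]


def f1_score(predicted, actual):
    """F1 score between a predicted basket and the actual basket."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical classifier output for one user.
probs = {"milk": 0.9, "eggs": 0.6, "bread": 0.3, "salt": 0.05}

print(select_products(probs, threshold=0.25))  # fixed-threshold rule
print(select_products(probs))                  # approximate expected-F1 rule
print(f1_score(["milk", "eggs"], ["milk", "bread"]))
```

A fixed threshold is simpler, while the second rule adapts the basket size to how confident the classifier is for each user.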

IV. CONCLUSION

Our objective was to earn a Bronze medal. When we started, the best score was just above 40% and our first submission scored 37.47%, so the goal seemed within reach.
One week before the end of the contest, our score was 38.32%, but the former leader (with 40.1%) published his code to the Kaggle community… What a waste!
With that code, it would take anyone only an hour to compute the score and submit it to Kaggle.

This timeline presents our F1 Score during the contest.