Machine Learning — Kaggle Competition
INSTACART MARKET BASKET ANALYSIS
WHICH PRODUCTS WILL AN INSTACART CONSUMER PURCHASE AGAIN?
I. INTRODUCTION
Whether you shop from meticulously planned grocery lists or let whimsy guide your grazing, our unique food rituals define who we are. Instacart, a grocery ordering and delivery app, aims to make it easy to fill your refrigerator and pantry with your personal favorites and staples when you need them. After selecting products through the Instacart app, personal shoppers review your order and do the in-store shopping and delivery for you.
Instacart’s data science team plays a big part in providing this delightful shopping experience. Currently they use transactional data to develop models that predict which products a user will buy again, try for the first time, or add to their cart next during a session. Recently, Instacart open sourced this data — see their blog post on 3 Million Instacart Orders, Open Sourced.
In this competition, Instacart is challenging the Kaggle[1] community to use this anonymized data on customer orders over time to predict which previously purchased products will be in a user’s next order. They’re not only looking for the best model; Instacart is also looking for machine learning engineers to grow their team.[2]
We did this competition as a couple: my wife, Constance Morel Béguier, and me, Nicolas Béguier.
May 7, 2017 — Beginning of the contest.
July 29, 2017 — Our start of the competition.
August 14, 2017 — Final submission deadline.
II. TECHNICAL INFORMATION
The dataset for this competition is a relational set of files describing customers’ orders over time. The goal of the competition is to predict which products will be in a user’s next order. The dataset is anonymized and contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order. We also provide the week and hour of day the order was placed, and a relative measure of time between orders. For more information, see the blog post accompanying its public release.[3]
III. DEVELOPMENTS AND RESULTS
1. Feature computation
We have separated the features into seven categories:
- u: User
- p: Product
- f: User / Product
- d: Department
- a: Aisle
- ud: User / Department
- ua: User / Aisle
a. The top 10 features we had
- f4 (8.5%): Number of days since this user last purchased this product
- f3 (5.9%): Number of orders since this user last purchased this product
- f0 (5.1%): Percentage of this user's orders that contain this product
- u13 (4.9%): Number of days since this user made their last order
- p5 (4.3%): [Number of users who have re-ordered this product] / [Number of users who have ordered this product]
- f6 (3.6%): [n-1] / n, where 'n' is the number of orders containing the product
- p8 (3.3%): Day frequency, i.e. the number of days between two consecutive orders of this product
- u14 (3.2%): [Number of products purchased at least twice by this user] / [Number of different products purchased by this user]
- f5 (3.2%): Mean number of days between two orders containing the product
- p1 (3.2%): [Number of users who have ordered it at least once] * 100 / [Number of users]
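To give an idea of what these look like in practice, here is a minimal sketch of how two of them could be computed with pandas, assuming the standard Kaggle file layout (orders.csv and order_products__prior.csv). The variable and column naming below is illustrative, not our exact pipeline.

```python
import pandas as pd

# Assumed Kaggle file layout: orders.csv (order_id, user_id, eval_set,
# order_number, ...) and order_products__prior.csv (order_id, product_id, ...).
orders = pd.read_csv("orders.csv")
prior = pd.read_csv("order_products__prior.csv")
prior = prior.merge(orders[["order_id", "user_id", "order_number"]], on="order_id")

# 'u'-type feature: total number of prior orders per user.
user_orders = (orders[orders.eval_set == "prior"]
               .groupby("user_id")["order_number"].max()
               .rename("u_nb_orders")
               .reset_index())

# 'f'-type aggregates, per (user, product) pair.
up = (prior.groupby(["user_id", "product_id"])
           .agg(f_nb_orders_with_product=("order_id", "count"),
                f_last_order_number=("order_number", "max"))
           .reset_index()
           .merge(user_orders, on="user_id"))

# f0-like: share of the user's orders that contain the product.
up["f0"] = up["f_nb_orders_with_product"] / up["u_nb_orders"]
# f3-like: number of orders since the user last purchased the product.
up["f3"] = up["u_nb_orders"] - up["f_last_order_number"]
```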
b. Gathering features
Creating a feature takes time, and a lot of it. First, you need an idea and the corresponding mathematical formula. Then, you should try every variant you can think of: if you have an average, also try a median, a maximum, a minimum, and so on, as in the sketch below.
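For instance, pandas makes it cheap to compute several variants of the same idea in one pass. A small sketch, with illustrative feature names:

```python
import pandas as pd

# One idea (time between a user's orders), several statistical variants at once.
orders = pd.read_csv("orders.csv")
user_gap_stats = (orders.groupby("user_id")["days_since_prior_order"]
                        .agg(["mean", "median", "max", "min", "std"])
                        .add_prefix("u_days_between_orders_"))
```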
We started the contest with 36 different features (some of them quite similar) and finished with 38. In fact, it is not really the number of features that puts you at the top.
At one point, despite having more than 50 features, our score was only going down. With too many features, the risk is that the model learns badly: a feature that carries no information should be ignored by the algorithm, but in practice its weight is never exactly zero.
After each iteration, we had to look at every feature's weight and choose wisely which ones to remove. A feature with a low weight brings little or no information to the problem at hand.
Another reason to limit the number of features is to avoid overfitting: the algorithm could otherwise latch onto a peculiarity of the training set.
Finally, with too many features, the computation time explodes, since each new feature adds another dimension to the space.
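As an illustration of this pruning loop, here is a sketch using XGBoost's feature_importances_; the feature names and data are placeholders invented for the example, not our real tables.

```python
import numpy as np
import xgboost as xgb

# Placeholder data standing in for the real (user, product) feature matrix.
rng = np.random.default_rng(0)
feature_names = ["f0", "f3", "f4", "u13", "p5"]
X = rng.random((1000, len(feature_names)))
y = rng.integers(0, 2, size=1000)

model = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1)
model.fit(X, y)

# Rank features by the weight the booster assigns them; low-weight
# features bring little information and are candidates for removal.
ranked = sorted(zip(feature_names, model.feature_importances_), key=lambda t: t[1])
for name, weight in ranked:
    print(f"{name}: {weight:.4f}")
```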
2. Classifier choice
We tested every basic algorithm provided by scikit-learn[4]. Currently, XGBoost is one of the fastest learning algorithms. It is actually a meta-classifier, but a very efficient one.
For example, I got the same result with a Random Forest, but it took about 100 times longer. That is why 3 hours of XGBoost computation are worth it.
For XGBoost, we started with the default parameters. To tune them, we followed this great article.[5]
The resulting score will not be dramatically better, but keep in mind that a Kaggle contest is won by thousandths of a percent. After a while, gaining 0.1% is a true victory.
Be careful: parameter tuning should be done ONLY once you are done with your features. If you modify your training dataset (add/remove features, add/remove users), you have to start the tuning from the beginning.
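As a sketch of what one tuning stage can look like, in the spirit of the guide cited in [5] but with arbitrary values and placeholder data rather than our actual settings:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the real training matrix.
rng = np.random.default_rng(0)
X, y = rng.random((1000, 10)), rng.integers(0, 2, size=1000)

# Freeze most parameters and grid-search a couple of them at a time.
base = xgb.XGBClassifier(n_estimators=200, learning_rate=0.1,
                         subsample=0.8, colsample_bytree=0.8)
grid = GridSearchCV(base,
                    {"max_depth": [4, 6, 8], "min_child_weight": [1, 5, 10]},
                    scoring="roc_auc", cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```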
Once we were satisfied with our XGBoost results, we started a similar training with LightGBM. Each classifier then gave us, for every product, a prediction of its probability of being purchased again, and our final prediction was the maximum of the two.
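A minimal sketch of this combination, on placeholder data rather than our real feature tables:

```python
import numpy as np
import xgboost as xgb
import lightgbm as lgb

# Placeholder matrices standing in for the engineered feature tables.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((1000, 10)), rng.integers(0, 2, size=1000)
X_test = rng.random((200, 10))

xgb_model = xgb.XGBClassifier(n_estimators=200, learning_rate=0.1).fit(X_train, y_train)
lgb_model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.1).fit(X_train, y_train)

# Each classifier predicts a re-order probability per (user, product) pair;
# the combined prediction keeps the higher of the two.
proba = np.maximum(xgb_model.predict_proba(X_test)[:, 1],
                   lgb_model.predict_proba(X_test)[:, 1])
```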
3. Number of estimators and learning rate
These parameters are present in many classifiers. The number of estimators (or number of rounds) is the number of times the algorithm runs with different settings; at the end, you get the best learned classifier that was found. Generally, the more data you have, the higher the number of estimators should be.
The learning rate controls how quickly the algorithm is allowed to learn at each round. Generally, the noisier your data, the higher it should be.
Bad parameters can give bad results, because of overfitting (too many estimators) or of getting stuck in a local minimum (learning rate too low).
Finally, choosing too many estimators or too high a learning rate will add computation time.
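Purely to illustrate the trade-off, two arbitrary configurations (not our competition settings):

```python
import xgboost as xgb

# A lower learning rate generally needs more rounds to converge, and every
# extra round adds computation time; both configurations are arbitrary examples.
fast_rough = xgb.XGBClassifier(n_estimators=100, learning_rate=0.3)
slow_fine = xgb.XGBClassifier(n_estimators=1000, learning_rate=0.03)
```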
4. Product selection method
The classifier gives every product a probability of being re-ordered. The second task is to select which products to submit based on these probabilities.
Threshold method
One method is to select every product whose probability is greater than a threshold. Our best setting with this method was 0.22 (22%).
N best products
Another method is to take the ’n’ products with the highest probability of being re-ordered.
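Both selection rules above fit in a few lines; the probabilities and product ids below are illustrative:

```python
import numpy as np

# Re-order probabilities for one user's candidate products (illustrative values).
product_ids = np.array([101, 102, 103, 104, 105])
proba = np.array([0.61, 0.35, 0.22, 0.15, 0.08])

# Threshold method: keep every product above a fixed probability.
THRESHOLD = 0.22                      # our best setting with this method
by_threshold = product_ids[proba > THRESHOLD]

# N best products: keep the n most probable products.
n = 3
by_top_n = product_ids[np.argsort(-proba)[:n]]
```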
F1 optimization
This was our best choice. We tested an open source library: F1Optimizer.
Each product is associated with a probability of being re-ordered. Assuming this probability is the truth, the choice is whether or not to include the product in the final result.
The F1Optimizer function computes the F1 score with and without this product. When the risk of adding it becomes too high, we stop and do not select any more products.
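To illustrate the idea (this is not the actual F1Optimizer code), here is a simplified sketch that estimates the expected F1 score by Monte Carlo simulation and adds products greedily until the expected score stops improving; the real library computes the expectation exactly, and the probabilities below are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_f1(mask, proba, n_sims=2000):
    """Monte Carlo estimate of the expected F1 score when the products flagged
    in `mask` are submitted and product i is truly re-ordered with probability
    proba[i]. A simplified stand-in for the exact computation in F1Optimizer."""
    truths = rng.random((n_sims, proba.size)) < proba    # simulated ground truths
    tp = (truths & mask).sum(axis=1)
    denom = mask.sum() + truths.sum(axis=1)
    # F1 = 2*TP / (|prediction| + |truth|); count an empty/empty case as perfect.
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 1.0)
    return f1.mean()

# Re-order probabilities for one user's candidate products (illustrative values).
proba = np.array([0.85, 0.60, 0.40, 0.25, 0.10])
order = np.argsort(-proba)            # try products from most to least likely

best_mask = np.zeros(proba.size, dtype=bool)
best_score = expected_f1(best_mask, proba)
for idx in order:
    candidate = best_mask.copy()
    candidate[idx] = True
    score = expected_f1(candidate, proba)
    if score <= best_score:           # adding this product no longer helps: stop
        break
    best_mask, best_score = candidate, score

selected_products = np.flatnonzero(best_mask)
```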
IV. CONCLUSION
Our objective was to win a Bronze medal. When we started, the best score was just above 40% and our first submission scored 37.47%. It seemed within reach.
One week before the end of the contest, our score was 38.32%, but the then leader (at 40.1%) published his code to the Kaggle community… What a waste!
It would take only an hour to compute that score and submit it to Kaggle.
For a week, we could not improve our score, while a lot of people used this cheat to score more than 40%.
In the end, we used the F1 optimization for a better selection of products and jumped more than 1000 places.
Our final score was 39.80%, 535th out of 2623. The winner scored 40.91%.
All our code is available on my GitHub account: nbeguier/Kaggle-Instacart
V. REFERENCES
[1] Kaggle website: https://www.kaggle.com/
[2] Kaggle Instacart Description
[3] Kaggle Instacart Data
[4] Python Sklearn, ensemble: http://scikit-learn.org/stable/modules/ensemble.html
[5] Analytics Vidhya: complete-guide-parameter-tuning-xgboost-with-codes-python