AMEX - Default Prediction Kaggle Competition Summary

Info

Author: Vincent, Published on 2021-06-06, Reading time: approx. 6 minutes, WeChat article link:

1 Overview

American Express (AMEX), a well-known financial services company, hosted a data science competition on Kaggle. Participants were tasked with predicting whether a credit cardholder would default in the future based on anonymized credit card billing data. AMEX provided explanations for feature prefixes:

D_* = Delinquency-related variables
S_* = Spending-related variables
P_* = Payment information
B_* = Balance information
R_* = Risk-related variables

The table below provides a sample of the competition data (values are fictional and for reference only):

customer_ID	S_2	P_2	...	B_2	D_41	target
000002399d6bd597023	2017-04-07	0.9366	...	0.1243	0.2824	1
0000099d6bd597052ca	2017-03-31	0.3466	...	0.5155	0.0087	0

The following features are categorical: 'B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68'. The objective is to predict, for each customer_ID, the probability of default (target = 1). The negative class (target = 0) was subsampled at a rate of 5%. The competition has ended, and this article summarizes publicly available solutions and discussions to share insights from the community.

2 Preparation Work

Because the dataset is large, memory optimization was crucial. Community efforts included converting floating-point data to integers and storing the result in parquet format, as in the dataset AMEX data - integer dtypes - parquet format. Another compressed version of the data was AMEX-Feather-Dataset.

60M sample_submission.csv
32G test_data.csv
16G train_data.csv
30M train_labels.csv
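The dtype conversion mentioned above can be sketched roughly as follows. This is a minimal illustration, not the procedure used by the linked datasets: the helper name `shrink_dtypes`, the `-1` placeholder for missing values, and the toy columns are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def shrink_dtypes(df, missing_value=-1):
    """Return a copy of df with float64 columns stored more compactly.

    Hypothetical helper: columns whose values are all whole numbers are
    stored as the smallest signed integer type (NaN becomes missing_value);
    every other float64 column is stored as float32.
    """
    out = df.copy()
    for col in out.select_dtypes(include="float64").columns:
        filled = out[col].fillna(missing_value)
        if np.all(filled == np.round(filled)):
            # Whole-number column: downcast to int8/int16/... as needed.
            out[col] = pd.to_numeric(filled, downcast="integer")
        else:
            # Genuinely continuous column: halve the storage with float32.
            out[col] = out[col].astype(np.float32)
    return out


# Toy frame standing in for a few statement columns (values are made up).
raw = pd.DataFrame({
    "D_x": [0.0, 1.0, np.nan],   # discrete values stored as floats
    "P_2": [0.12, 0.50, 0.90],   # continuous values
})
small = shrink_dtypes(raw)
bytes_before = raw.memory_usage(deep=True).sum()
bytes_after = small.memory_usage(deep=True).sum()
```

The compacted frame could then be written with `small.to_parquet(...)`, which needs a parquet engine such as pyarrow to be installed.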

The competition used a custom evaluation metric: the mean of the normalized Gini coefficient and the default rate captured in the top 4% of predictions. Many solutions referred to Amex Competition Metric (Python) and Metric without DF for performance evaluation.
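As a rough sketch of how such a metric can be computed with NumPy: the code below follows the publicly described definition, in which each negative sample is weighted 20 to compensate for the 5% subsampling. The function names are my own, and this is not a copy of the notebooks named above.

```python
import numpy as np

NEG_WEIGHT = 20.0  # each negative counts 20x because negatives were subsampled at 5%


def _weighted_gini(y_true, order_by):
    """Unnormalized weighted Gini of y_true when sorted by order_by (descending)."""
    order = np.argsort(-order_by, kind="stable")
    y = y_true[order]
    w = np.where(y == 0, NEG_WEIGHT, 1.0)
    cum_random = np.cumsum(w / w.sum())              # share of total weight seen so far
    cum_found = np.cumsum(y * w) / np.sum(y * w)     # share of defaults found so far
    return np.sum((cum_found - cum_random) * w)


def top_four_capture(y_true, y_pred):
    """Share of all defaults that fall in the top 4% of weighted predictions."""
    order = np.argsort(-y_pred, kind="stable")
    y = y_true[order]
    w = np.where(y == 0, NEG_WEIGHT, 1.0)
    in_top = np.cumsum(w) <= int(0.04 * w.sum())
    return y[in_top].sum() / y.sum()


def amex_metric(y_true, y_pred):
    """Mean of the normalized Gini coefficient and the top-4% capture rate."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    gini = _weighted_gini(y_true, y_pred) / _weighted_gini(y_true, y_true)
    return 0.5 * (gini + top_four_capture(y_true, y_pred))


# Toy check: 2 defaults and 10 non-defaults, scored in perfect and reversed order.
y = np.array([1, 1] + [0] * 10)
perfect = amex_metric(y, np.linspace(1.0, 0.0, len(y)))
reversed_score = amex_metric(y, np.linspace(0.0, 1.0, len(y)))
```

Both components depend only on the ordering of the predictions, so any monotonic rescaling of the scores leaves the metric unchanged.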

3 Exploratory Data Analysis (EDA)

Understanding the dataset thoroughly before modeling is essential. In this competition, key EDA tasks included:

Checking missing values
Identifying duplicate records
Examining label distribution
Analyzing credit card statement counts per customer
Investigating categorical and numerical feature distributions
Detecting anomalies and feature correlations
Identifying artificial noise
Comparing feature distributions between training and test sets
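The first few checks in this list can be sketched with pandas as below. The tiny frames are made-up stand-ins for the statement and label files; only the column names customer_ID, S_2, P_2, B_2 and target come from the competition data.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the statement data: one row per customer statement.
train = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "b", "c"],
    "S_2": pd.to_datetime([
        "2017-03-01", "2017-04-01", "2017-05-01",
        "2017-04-07", "2017-05-07", "2017-05-20",
    ]),
    "P_2": [0.90, np.nan, 0.80, 0.30, 0.40, 0.50],
    "B_2": [0.10, 0.20, 0.10, 0.50, 0.60, 0.70],
})
labels = pd.DataFrame({"customer_ID": ["a", "b", "c"], "target": [1, 0, 0]})

# Missing values: share of NaN per column, worst first.
missing_rate = train.isna().mean().sort_values(ascending=False)

# Duplicate records: the same customer with the same statement date.
n_duplicates = train.duplicated(subset=["customer_ID", "S_2"]).sum()

# Label distribution: share of each target value.
label_share = labels["target"].value_counts(normalize=True)

# Statement counts per customer.
statements_per_customer = train.groupby("customer_ID").size()
```

On the real data the same calls would be run on the loaded train and label files; the remaining checks (distributions, correlations, train/test drift) typically build on these basic summaries.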

Notable high-score notebooks for EDA:

4 Feature Engineering & Modeling

4.1 Feature Engineering

As each customer has multiple statements, aggregating them was a focus of many solutions:

For continuous variables: calculating mean, standard deviation, min, max, last statement value, and differences/ratios between last and first statement.
For categorical variables: counting occurrences, tracking last statement value, converting frequencies into numerical features, and encoding accordingly.
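A minimal sketch of this per-customer aggregation with pandas, assuming statements are ordered by the S_2 date. The helper name `aggregate_statements`, the output column names, and the toy values are illustrative choices, not taken from any specific notebook.

```python
import pandas as pd


def aggregate_statements(df, num_cols, cat_cols):
    """Collapse many statements per customer into one feature row (sketch)."""
    df = df.sort_values(["customer_ID", "S_2"])  # so first/last follow time order
    grouped = df.groupby("customer_ID")

    # Continuous variables: summary statistics plus first and last statement.
    num = grouped[num_cols].agg(["mean", "std", "min", "max", "first", "last"])
    num.columns = ["_".join(col) for col in num.columns]
    for col in num_cols:
        num[f"{col}_last_minus_first"] = num[f"{col}_last"] - num[f"{col}_first"]
        # Note: this ratio is inf or NaN when the first value is 0.
        num[f"{col}_last_over_first"] = num[f"{col}_last"] / num[f"{col}_first"]

    # Categorical variables: occurrence count, last value, distinct values.
    cat = grouped[cat_cols].agg(["count", "last", "nunique"])
    cat.columns = ["_".join(col) for col in cat.columns]

    return num.join(cat)


# Toy statements for two customers (values are made up).
statements = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "b"],
    "S_2": pd.to_datetime(
        ["2017-03-01", "2017-04-01", "2017-05-01", "2017-04-07", "2017-05-07"]
    ),
    "P_2": [0.2, 0.4, 0.6, 0.5, 0.25],
    "D_63": ["CO", "CO", "CR", "CL", "CL"],
})
features = aggregate_statements(statements, num_cols=["P_2"], cat_cols=["D_63"])
```

The result has one row per customer_ID. String-valued columns such as the last categorical value still need encoding before most models can use them.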

High-score notebooks for feature engineering:

4.2 Model Design, Training & Inference

Top solutions used XGBoost, LightGBM, CatBoost, Transformer, TabNet, or ensembles of these models. Chris Deotte, a Kaggle Grandmaster at Nvidia, contributed foundational solutions such as XGBoost Starter, TensorFlow GRU, and TensorFlow Transformer. His final 15th-place solution used a Transformer with LightGBM knowledge distillation (15th Place Gold).
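The article does not say how the ensembles were built. One common option, shown here purely as an assumption, is to average the per-model ranks of the predictions, which suits a metric that depends only on ordering. The function name `rank_average` and the model keys are hypothetical.

```python
import pandas as pd


def rank_average(predictions, weights=None):
    """Blend model outputs by (weighted) averaging of their percentile ranks.

    predictions: dict mapping a model name to its scores for the same rows.
    weights: optional dict mapping model name to blend weight (default: equal).
    """
    ranks = pd.DataFrame(predictions).rank(pct=True)  # each column mapped to (0, 1]
    if weights is None:
        weights = {name: 1.0 for name in ranks.columns}
    w = pd.Series(weights)[ranks.columns]
    return (ranks * w).sum(axis=1) / w.sum()


# Made-up scores from two hypothetical models for three customers.
preds = {
    "xgb": [0.90, 0.10, 0.50],
    "lgbm": [0.80, 0.30, 0.20],
}
blended = rank_average(preds)
```

Ranking first puts models with differently scaled outputs on a common footing before they are averaged.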

The second-place team (2nd place solution - team JuneHomes) highlighted best practices, including team collaboration using AWS, thorough version control, and model selection strategies.

The first-place solution (1st solution) was highly complex, and its author did not describe it in detail.

5 Conclusion

Many useful insights emerged from the competition's Discussion section:

Notable notebooks include:

This competition showcased impressive ML techniques and strategies, offering valuable learning experiences for the community.
