
AMEX - Default Prediction Kaggle Competition Summary

Info

Author: Vincent, Published on 2021-06-06, Reading time: approx. 6 minutes, WeChat article link:

1 Overview

American Express (AMEX), a well-known financial services company, hosted a data science competition on Kaggle. Participants were tasked with predicting whether a credit cardholder would default in the future based on anonymized credit card billing data. AMEX provided explanations for feature prefixes:

D_* = Delinquency-related variables
S_* = Spending-related variables
P_* = Payment information
B_* = Balance information
R_* = Risk-related variables

The table below provides a sample of the competition data (values are fictional and for reference only):

customer_ID          S_2         P_2     ...  B_2     D_41    target
000002399d6bd597023  2017-04-07  0.9366  ...  0.1243  0.2824  1
0000099d6bd597052ca  2017-03-23  0.3466  ...  0.5155  0.0087  0

Certain features such as 'B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', and 'D_68' are categorical. The objective is to predict, for each customer_ID, the probability of a future default (target = 1). The negative class (non-defaulters) was subsampled at 5%, so each retained negative effectively stands for 20 customers. The competition has ended, and this article summarizes publicly available solutions and discussions to share insights from the community.

2 Preparation Work

Because the raw CSV files are large (see the listing below), memory optimization was crucial. Popular approaches included converting floating-point columns to integer dtypes and storing the data in Parquet format, as in AMEX data - integer dtypes - parquet format; another compressed version was the AMEX-Feather-Dataset. A minimal sketch follows the file listing.

60M sample_submission.csv
32G test_data.csv
16G train_data.csv
30M  train_labels.csv
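
As an illustration, the sketch below shows the simplest version of this trick: downcasting float64 columns and re-saving to Parquet. The file names come from the listing above; the referenced dataset goes further and maps de-noised values to small integer dtypes.

```python
import pandas as pd

# Simple memory-reduction sketch using the file names listed above.
# (In practice the 16G CSV is usually read in chunks or with explicit dtypes.)
train = pd.read_csv('train_data.csv')

float_cols = train.select_dtypes('float64').columns
train[float_cols] = train[float_cols].astype('float32')  # halves memory for these columns

train.to_parquet('train.parquet')  # Parquet reloads far faster than CSV
```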

The competition used a custom evaluation metric M = 0.5 * (G + D), where G is a weighted, normalized Gini coefficient and D is the default rate captured in the top 4% of predictions, with negative samples weighted by 20 to compensate for the subsampling. Many solutions referred to Amex Competition Metric (Python) and Metric without DF for performance evaluation.
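
For reference, the NumPy sketch below implements the metric as publicly described; it mirrors the logic of the linked notebooks but is not a verbatim copy, and it assumes y_true and y_pred are plain 0/1 and probability arrays.

```python
import numpy as np

def amex_metric(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Sketch of M = 0.5 * (G + D); negative rows get weight 20."""

    def weighted_gini(labels: np.ndarray, preds: np.ndarray) -> float:
        order = np.argsort(-preds)                                  # best score first
        labels, w = labels[order], np.where(labels[order] == 0, 20.0, 1.0)
        lorentz = (labels * w).cumsum() / (labels * w).sum()        # cumulative positives captured
        random = w.cumsum() / w.sum()                               # diagonal baseline
        return ((lorentz - random) * w).sum()

    def top_four_percent_captured(labels: np.ndarray, preds: np.ndarray) -> float:
        order = np.argsort(-preds)
        labels, w = labels[order], np.where(labels[order] == 0, 20.0, 1.0)
        top = w.cumsum() <= 0.04 * w.sum()                          # top 4% by weighted count
        return labels[top].sum() / labels.sum()

    g = weighted_gini(y_true, y_pred) / weighted_gini(y_true, y_true)  # normalize by perfect ranking
    d = top_four_percent_captured(y_true, y_pred)
    return 0.5 * (g + d)
```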

3 Exploratory Data Analysis (EDA)

Understanding the dataset thoroughly before modeling is essential. In this competition, key EDA tasks included the following (a short pandas sketch follows the list):

  • Checking missing values
  • Identifying duplicate records
  • Examining label distribution
  • Analyzing credit card statement counts per customer
  • Investigating categorical and numerical feature distributions
  • Detecting anomalies and feature correlations
  • Identifying artificial noise
  • Comparing feature distributions between training and test sets
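
The sketch below runs a few of these checks. It assumes the data has already been converted to Parquet as in section 2; 'train.parquet' is a hypothetical path, and the column names follow the sample table in section 1.

```python
import pandas as pd

# Hypothetical paths; column names follow the sample table in section 1.
train = pd.read_parquet('train.parquet')
labels = pd.read_csv('train_labels.csv')

# Missing values: share of NaNs per feature, worst first
print(train.isna().mean().sort_values(ascending=False).head(20))

# Duplicate records
print(train.duplicated().sum())

# Label distribution (remember that negatives were subsampled to 5%)
print(labels['target'].value_counts(normalize=True))

# Credit card statement counts per customer (at most 13 monthly statements each)
print(train.groupby('customer_ID').size().describe())
```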

Notable high-score notebooks for EDA:

4 Feature Engineering & Modeling

4.1 Feature Engineering

As each customer has multiple statements, aggregating them into a single row per customer_ID was the focus of many solutions (a sketch follows this list):

  • For continuous variables: calculating mean, standard deviation, min, max, last statement value, and differences/ratios between last and first statement.
  • For categorical variables: counting occurrences, tracking last statement value, converting frequencies into numerical features, and encoding accordingly.
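
A minimal groupby sketch of these aggregations is shown below. It assumes the per-statement table is sorted by customer_ID and statement date S_2 (so that 'last' is truly the most recent statement); the categorical column list comes from section 1, and build_customer_features is a hypothetical helper name.

```python
import pandas as pd

CAT_COLS = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120',
            'D_126', 'D_63', 'D_64', 'D_66', 'D_68']

def build_customer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse per-statement rows into one row per customer_ID."""
    num_cols = [c for c in df.columns if c not in CAT_COLS + ['customer_ID', 'S_2']]
    grouped_num = df.groupby('customer_ID')[num_cols]

    # Continuous variables: mean, std, min, max and last statement value
    num_agg = grouped_num.agg(['mean', 'std', 'min', 'max', 'last'])
    num_agg.columns = ['_'.join(c) for c in num_agg.columns]

    # Difference between the last and the first statement value
    diff = (grouped_num.last() - grouped_num.first()).add_suffix('_last_minus_first')

    # Categorical variables: occurrence count, number of distinct values, last value
    cat_agg = df.groupby('customer_ID')[CAT_COLS].agg(['count', 'nunique', 'last'])
    cat_agg.columns = ['_'.join(c) for c in cat_agg.columns]

    return pd.concat([num_agg, diff, cat_agg], axis=1)
```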

High-score notebooks for feature engineering:

4.2 Model Design, Training & Inference

Top solutions used XGBoost, LightGBM, CatBoost, Transformer, TabNet, or ensembles of these models. Chris Deotte, a Kaggle Grandmaster at NVIDIA, shared strong baseline notebooks such as XGBoost Starter, TensorFlow GRU, and TensorFlow Transformer. His final 15th-place solution used a Transformer with LightGBM knowledge distillation (15th Place Gold).
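
As a rough illustration of the GBM side, the sketch below trains a LightGBM classifier on aggregated one-row-per-customer features (section 4.1) and scores it with the amex_metric sketch from section 2. The features and labels variables are hypothetical placeholders, and the hyperparameters are generic values rather than any team's tuned settings.

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# 'features': one row per customer_ID (section 4.1); 'labels': matching 0/1 targets.
# Both are hypothetical placeholders here.
X_train, X_valid, y_train, y_valid = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42)

model = lgb.LGBMClassifier(
    n_estimators=2000,
    learning_rate=0.05,
    num_leaves=64,
    colsample_bytree=0.6,
    random_state=42,
)
model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    callbacks=[lgb.early_stopping(100), lgb.log_evaluation(100)],
)

# Score the hold-out set with the competition metric sketch from section 2
valid_pred = model.predict_proba(X_valid)[:, 1]
print('AMEX metric:', amex_metric(np.asarray(y_valid), valid_pred))
```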

The second-place team (2nd place solution - team JuneHomes) highlighted best practices, including team collaboration using AWS, thorough version control, and model selection strategies.

The first-place solution (1st solution) was highly complex and is not covered in detail here.

5 Conclusion

Many useful insights emerged from the competition's Discussion section:

Notable notebooks include:

This competition showcased impressive ML techniques and strategies, offering valuable learning experiences for the community.

