
Credit RP Diagnostic

A regulatory-focused diagnostic tool that quantifies feature redundancy in credit risk models using sparse random projections. It estimates the effective dimensionality of a credit dataset to flag over-parameterization and to support model validation under the Basel IRB and IFRS 9 frameworks.


Python Machine Learning Scikit-learn Dimensionality Reduction Regulatory Compliance

Motivation

The Over-Parameterization Problem

Credit risk models in Basel IRB and IFRS 9 frameworks must demonstrate stability and predictive power. A common failure mode is feature bloat: dozens of correlated features make the model unstable under stress testing and hard to defend under regulatory scrutiny. This diagnostic estimates how many effective dimensions the feature set actually carries.

Methodology

Sparse Random Projection Approach

Rather than PCA or autoencoders, this project uses sparse random projections, a computationally efficient form of dimensionality reduction that approximately preserves pairwise distances (the Johnson-Lindenstrauss property). The approach:

Baseline Model

Train a logistic regression classifier with balanced class weights on the full feature set (59 credit features after preprocessing) using the Home Credit Default Risk dataset (roughly 300k samples, binary default target).
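A minimal sketch of such a baseline, assuming scikit-learn and a synthetic stand-in for the preprocessed data. The sample size, the 8% positive rate, the train/test split, and the scaling step are illustrative assumptions, not details taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed credit matrix (assumption):
# 59 features and an imbalanced binary target with about 8% positives.
X, y = make_classification(
    n_samples=5000, n_features=59, n_informative=20, n_redundant=15,
    weights=[0.92, 0.08], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" reweights each class inversely to its frequency,
# so the minority (default) class is not drowned out during fitting.
baseline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
baseline.fit(X_train, y_train)

# AUC is computed from predicted probabilities of the positive class.
baseline_auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])
print(f"baseline AUC: {baseline_auc:.4f}")
```

The AUC printed here describes the synthetic data only and is not comparable to the figure reported for the real dataset.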

Incremental Projection

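Apply sparse random projection to reduce the feature space step by step, from the full feature count down to a single dimension, refitting the classifier and measuring AUC at each dimension. The "elbow" point, below which AUC starts to fall away from its plateau, indicates the intrinsic dimensionality.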
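The sweep could look roughly like the following sketch, which uses scikit-learn's `SparseRandomProjection`. The helper name `auc_sweep` and the small synthetic dataset are hypothetical and chosen only to keep the example quick to run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.random_projection import SparseRandomProjection


def auc_sweep(X_train, X_test, y_train, y_test, dims, seed=0):
    """Return the test AUC after projecting to each dimension in dims."""
    aucs = []
    for d in dims:
        # Fit the random projection on training data, reuse it on test data.
        rp = SparseRandomProjection(n_components=d, random_state=seed)
        Z_train = rp.fit_transform(X_train)
        Z_test = rp.transform(X_test)
        clf = LogisticRegression(class_weight="balanced", max_iter=1000)
        clf.fit(Z_train, y_train)
        aucs.append(roc_auc_score(y_test, clf.predict_proba(Z_test)[:, 1]))
    return np.array(aucs)


# Small synthetic dataset (assumption) standing in for the credit features.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8, n_redundant=6,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

dims = list(range(X.shape[1], 0, -1))  # full feature count down to 1
aucs = auc_sweep(X_train, X_test, y_train, y_test, dims)
for d, a in zip(dims, aucs):
    print(f"{d:2d} dims -> AUC {a:.3f}")
```

A single run like this depends on the random seed, so the curve can be noisy at low dimensions.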

Ensemble Averaging

Run the projection pipeline 20 times with different random seeds and average the AUC curves. This stabilizes results against seed-dependent variance.
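As an illustration of the averaging step, the sketch below repeats the sweep over 20 seeds and reduces the resulting curves to a mean and a standard deviation per dimension. The function name `mean_auc_curve` and the synthetic data are assumptions; reporting the standard deviation alongside the mean is an addition of this sketch, not something the notebook is stated to do.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.random_projection import SparseRandomProjection


def mean_auc_curve(X_train, X_test, y_train, y_test, dims, n_runs=20):
    """Average the AUC-vs-dimension curve over n_runs projection seeds."""
    curves = np.empty((n_runs, len(dims)))
    for seed in range(n_runs):
        for j, d in enumerate(dims):
            # Only the projection matrix changes between runs.
            rp = SparseRandomProjection(n_components=d, random_state=seed)
            clf = LogisticRegression(class_weight="balanced", max_iter=1000)
            clf.fit(rp.fit_transform(X_train), y_train)
            scores = clf.predict_proba(rp.transform(X_test))[:, 1]
            curves[seed, j] = roc_auc_score(y_test, scores)
    return curves.mean(axis=0), curves.std(axis=0)


# Small synthetic dataset (assumption) so the 20 x 12 fits finish quickly.
X, y = make_classification(
    n_samples=1000, n_features=12, n_informative=6, n_redundant=4,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

dims = list(range(X.shape[1], 0, -1))
mean_curve, std_curve = mean_auc_curve(X_train, X_test, y_train, y_test, dims)
for d, m, s in zip(dims, mean_curve, std_curve):
    print(f"{d:2d} dims -> mean AUC {m:.3f} (std {s:.3f})")
```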

Redundancy Quantification

Calculate the redundancy ratio as (total features − intrinsic dimensions) / total features. This ratio is the summary metric carried into the compliance report.
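The ratio itself is simple arithmetic; the sketch below pairs it with one possible elbow rule. The rule used here (the smallest dimension whose AUC stays within a tolerance of the best AUC) and the `tol` value are assumptions, since the exact elbow criterion is not specified. With 59 features, a ratio of 15.3% corresponds to 50 retained dimensions, because 9 / 59 ≈ 0.153.

```python
def intrinsic_dimension(dims, aucs, tol=0.005):
    """Smallest dimension whose AUC is within tol of the best AUC.

    This plateau rule is an assumed stand-in for the elbow criterion.
    """
    best = max(aucs)
    return min(d for d, a in zip(dims, aucs) if a >= best - tol)


def redundancy_ratio(total_features, intrinsic_dims):
    """(total features - intrinsic dimensions) / total features."""
    return (total_features - intrinsic_dims) / total_features


# Toy AUC curve (made-up numbers, for illustration only).
toy_dims = [6, 5, 4, 3, 2, 1]
toy_aucs = [0.720, 0.719, 0.718, 0.700, 0.650, 0.580]
toy_intrinsic = intrinsic_dimension(toy_dims, toy_aucs)
print(toy_intrinsic)                                  # 4
print(f"{redundancy_ratio(6, toy_intrinsic):.1%}")    # 33.3%

# Arithmetic check against the figures quoted in this write-up.
print(f"{redundancy_ratio(59, 50):.1%}")              # 15.3%
```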

Results

Diagnostic Findings

Model Complexity Audit

Original Feature Count: 59
Intrinsic Dimensions: 50 (implied by the redundancy ratio: 9 of 59 features redundant)
Redundancy Ratio: 15.3%
Baseline AUC: 0.7251

Regulatory Verdict: PASS. Dimensionality is well calibrated. The model is not over-parameterized: redundancy is modest, and the effective feature space is close to the intrinsic dimensionality of the data. Low redundancy supports the argument that the model is not fitting spurious correlations, although it does not by itself rule out test-set leakage, which needs separate validation checks.

The diagnostic assumes that features were not engineered to be redundant by construction. If several features derive from the same underlying source (e.g., multiple ratios computed from the same balance sheet), their correlation is expected and reflects the structure of the data, not model bloat.

Deliverables

Output & Artifacts

The project is implemented as a self-contained Jupyter notebook (single-file execution). It includes:

Data loading and preprocessing: Home Credit dataset preparation, feature engineering
Baseline model training: Logistic regression with balanced class weighting
Projection pipeline: Sparse RP dimensionality sweeps with ensemble averaging
Visualization: AUC curves, dimensionality elbow plots, redundancy heatmaps
Compliance report: Structured output summarizing regulatory findings
Dependencies: scikit-learn, pandas, numpy, matplotlib
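As a rough illustration of what a structured compliance output might look like, the sketch below serializes the headline figures to JSON using only the standard library. The class name `DiagnosticReport`, the field names, and the `max_redundancy` pass threshold of 0.30 are all hypothetical; the notebook's actual report format and pass criterion are not described here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DiagnosticReport:
    """Hypothetical container for the diagnostic's headline figures."""
    original_features: int
    intrinsic_dimensions: int
    baseline_auc: float

    @property
    def redundancy_ratio(self) -> float:
        return (
            self.original_features - self.intrinsic_dimensions
        ) / self.original_features

    def to_json(self, max_redundancy: float = 0.30) -> str:
        # max_redundancy is an assumed threshold, not a regulatory figure.
        payload = asdict(self)
        payload["redundancy_ratio"] = round(self.redundancy_ratio, 4)
        payload["verdict"] = (
            "PASS" if self.redundancy_ratio <= max_redundancy else "REVIEW"
        )
        return json.dumps(payload, indent=2)


report = DiagnosticReport(
    original_features=59, intrinsic_dimensions=50, baseline_auc=0.7251
)
print(report.to_json())
```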

Regulatory Context

Basel IRB & IFRS 9 Alignment

This diagnostic directly addresses Basel III IRB validation requirements, which mandate:

Stability testing: Models must remain stable under economic stress (addressed by redundancy reduction)
Predictive power: Metrics like AUC must be documented and defended (quantified here)
Feature justification: Every feature must have a documented business and statistical rationale (this tool justifies exclusion of redundant ones)
Transparency: Model complexity must be communicated to risk committees and regulators (the redundancy ratio is the key metric)

Similarly, IFRS 9 Expected Credit Loss (ECL) models rely on stable, well-calibrated PD estimates. Over-parameterized models tend to degrade in out-of-sample periods and in regulatory backtests. This diagnostic provides quantitative evidence on whether the model's feature set is appropriately parsimonious.