
Credit RP Diagnostic

A regulatory-focused diagnostic tool that quantifies feature redundancy in credit risk models using sparse random projections. It estimates the effective dimensionality of a credit dataset to flag over-parameterization and to support model validation under the Basel IRB and IFRS 9 frameworks.

Python, Machine Learning, Scikit-learn, Dimensionality Reduction, Regulatory Compliance
Motivation

The Over-Parameterization Problem

Credit risk models used in Basel IRB and IFRS 9 frameworks must demonstrate stability and predictive power. A common failure mode is feature bloat: a model incorporates dozens of features, many of them highly correlated or redundant, which makes it unstable under stress testing and hard to defend under regulatory scrutiny. This diagnostic estimates how many effectively independent dimensions the feature set actually contains.

Methodology

Sparse Random Projection Approach

Rather than traditional dimensionality reduction (PCA, autoencoders), this project uses sparse random projections, a computationally efficient method that approximately preserves pairwise distances while projecting high-dimensional data into a lower-dimensional space. The approach:

1. Baseline Model

Train a balanced logistic regression classifier on the full feature set (59 credit features after preprocessing) using the Home Credit Default Risk dataset (300k samples, binary classification).

2. Incremental Projection

Apply sparse random projection to reduce the feature space to k dimensions, stepping k from the full feature count down to 1 and measuring AUC at each k. The "elbow" point where AUC plateaus indicates the intrinsic dimensionality.

3. Ensemble Averaging

Run the projection pipeline 20 times with different random seeds and average the AUC curves. This stabilizes the results against seed-dependent variance.

4. Redundancy Quantification

Calculate the redundancy ratio as (total features − intrinsic dimensions) / total features. This ratio is the compliance metric reported to regulators.
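Steps 1 to 3 above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it uses a small synthetic dataset from `make_classification` in place of the Home Credit data, and the helper names (`fit_auc`, `projected_auc_curve`), the standardization step, the train/test split, and the reduced seed count in the demo run are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.random_projection import SparseRandomProjection


def fit_auc(X_train, X_test, y_train, y_test):
    """Fit a class-balanced logistic regression and return the test AUC."""
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    )
    model.fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])


def projected_auc_curve(X_train, X_test, y_train, y_test, dims, n_seeds=20):
    """Mean test AUC at each projected dimension, averaged over seeds."""
    # Standardize first so no single raw feature dominates the projection
    # (an assumption of this sketch, not something the source specifies).
    scaler = StandardScaler().fit(X_train)
    Xs_train, Xs_test = scaler.transform(X_train), scaler.transform(X_test)

    curves = np.zeros((n_seeds, len(dims)))
    for seed in range(n_seeds):
        for j, d in enumerate(dims):
            proj = SparseRandomProjection(n_components=d, random_state=seed)
            Z_train = proj.fit_transform(Xs_train)
            Z_test = proj.transform(Xs_test)
            curves[seed, j] = fit_auc(Z_train, Z_test, y_train, y_test)
    return curves.mean(axis=0)


# Synthetic stand-in for the credit data: 20 features, imbalanced target.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8, n_redundant=6,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

baseline_auc = fit_auc(X_train, X_test, y_train, y_test)
dims = list(range(1, X.shape[1] + 1))
# 5 seeds keeps the demo quick; the write-up averages over 20.
mean_auc = projected_auc_curve(X_train, X_test, y_train, y_test, dims, n_seeds=5)

print(f"baseline AUC: {baseline_auc:.4f}")
for d, auc in zip(dims, mean_auc):
    print(f"k={d:2d}  mean AUC={auc:.4f}")
```

Plotting `mean_auc` against `dims` gives the averaged curve on which the plateau is read off. On real data the loop would run over the 59 processed features rather than the 20 synthetic ones used here.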

Results

Diagnostic Findings

Model Complexity Audit

Original Feature Count: 59
Intrinsic Dimensions: 50
Redundancy Ratio: 15.3%
Baseline AUC: 0.7251
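The reported ratio follows directly from the feature count and the intrinsic dimension: (59 − 50) / 59 ≈ 15.3%. The sketch below reproduces that arithmetic and shows one possible way to read an intrinsic dimension off an averaged AUC curve. The plateau rule (smallest dimension whose mean AUC is within a tolerance of the baseline) and the names `intrinsic_dimension`, `redundancy_ratio`, and `tol` are assumptions for illustration; the source does not state how the elbow is located.

```python
import numpy as np


def intrinsic_dimension(dims, mean_auc, baseline_auc, tol=0.005):
    """Smallest dimension whose mean AUC is within `tol` of the baseline.

    Hypothetical plateau rule; falls back to the largest dimension
    if no point on the curve reaches the threshold.
    """
    dims = np.asarray(dims)
    mean_auc = np.asarray(mean_auc)
    reached = mean_auc >= baseline_auc - tol
    return int(dims[reached].min()) if reached.any() else int(dims.max())


def redundancy_ratio(total_features, intrinsic_dims):
    """(total features - intrinsic dimensions) / total features."""
    return (total_features - intrinsic_dims) / total_features


# Figures reported in the audit: 59 features, 50 intrinsic dimensions.
ratio = redundancy_ratio(59, 50)
print(f"Redundancy ratio: {ratio:.1%}")  # Redundancy ratio: 15.3%
```

The tolerance controls how strict the plateau test is: a tighter `tol` pushes the estimated intrinsic dimension up and the redundancy ratio down, so the value chosen should be fixed and documented alongside the result.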

Regulatory Verdict: PASS. Dimensionality is well calibrated: the model is not over-parameterized, redundancy is modest (9 of 59 features, 15.3%), and the effective feature space closely matches the intrinsic structure of the data. This supports the model's defense against claims that it fits spurious correlations or benefits from hidden test-set leakage.

The diagnostic assumes that redundancy among features is not a by-product of how they were engineered. If several features are derived from the same underlying source (e.g., multiple ratios computed from the same balance sheet), their correlation is expected and reflects data structure, not model bloat.

Deliverables

Output & Artifacts

The project is implemented as a self-contained Jupyter notebook (single-file execution). It includes:

Regulatory Context

Basel IRB & IFRS 9 Alignment

This diagnostic directly addresses Basel III IRB validation requirements, which mandate:

Similarly, IFRS 9 Expected Credit Loss (ECL) models rely on stable, well-calibrated PD estimates. Over-parameterized models tend to degrade in out-of-sample periods and regulatory backtests. This diagnostic provides quantitative evidence that the model's feature set is appropriately sized.