
Polish Bankruptcy RP Diagnostic

A diagnostic for financial distress prediction models that uses sparse random projections to quantify feature redundancy. It is applied to the Polish Companies Bankruptcy dataset to estimate intrinsic dimensionality and to check model complexity against regulatory and financial soundness standards.

Tags: Python, Machine Learning, Scikit-learn, Financial Ratios, Bankruptcy Prediction

Motivation

Financial Ratio Redundancy in Distress Prediction

Financial analysts and machine learning practitioners often compute dozens of ratios from company financial statements: liquidity ratios, leverage ratios, profitability metrics, efficiency measures, and more. Intuition suggests that more ratios improve prediction, but correlation among ratios creates feature bloat that can destabilize model coefficients and obscure the true drivers of bankruptcy risk.

The Polish Companies Bankruptcy dataset contains 64 financial ratio features with rich structure. This diagnostic asks a simple question: how many of these 64 ratios are truly independent? The answer informs feature selection, model complexity audits, and interpretability.

Methodology

Sparse Random Projection Analysis

This project applies the same sparse random projection methodology used in credit risk diagnostics, adapted for the bankruptcy prediction domain. The approach quantifies feature redundancy without requiring explicit feature selection.

1. Baseline Model Training

Train a balanced logistic regression classifier on all 64 financial ratio features using the Polish Companies Bankruptcy dataset (5-year prediction horizon, binary bankrupt/non-bankrupt classification).

2. Dimensionality Sweep

Apply sparse random projection to systematically reduce the feature space from 64 dimensions down to 1, measuring predictive performance (AUC) at each step.

3. Ensemble Stabilization

Execute the full projection pipeline 20 times with different random initializations and average the resulting AUC curves to reduce noise from any single projection.

4. Redundancy Metric

Identify the "elbow" point where the averaged AUC curve plateaus, take that dimension as the intrinsic dimensionality, and calculate the redundancy ratio as (64 − intrinsic dimensions) / 64, expressed as a percentage.
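The four steps can be sketched end to end in scikit-learn. This is a minimal illustration under stated assumptions, not the notebook's actual code: it runs on synthetic data from `make_classification` rather than the Polish dataset, the helper names `fit_auc` and `redundancy_diagnostic` are hypothetical, and the elbow rule (smallest dimension whose mean AUC is within a tolerance of the best mean AUC) is one plausible choice that the write-up does not specify.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.random_projection import SparseRandomProjection


def fit_auc(X_tr, X_te, y_tr, y_te, *steps):
    """Scale, apply optional extra steps, fit balanced logistic regression, return test AUC."""
    model = make_pipeline(
        StandardScaler(),
        *steps,
        LogisticRegression(class_weight="balanced", max_iter=1000),
    )
    model.fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])


def redundancy_diagnostic(X, y, dims, n_runs=20, tolerance=0.01, seed=0):
    n_features = X.shape[1]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    # Step 1: baseline on all original features.
    baseline_auc = fit_auc(X_tr, X_te, y_tr, y_te)

    # Steps 2-3: sweep the projected dimension, repeated over n_runs seeds.
    auc = np.empty((n_runs, len(dims)))
    for run in range(n_runs):
        for j, k in enumerate(dims):
            proj = SparseRandomProjection(n_components=int(k), random_state=seed + run)
            auc[run, j] = fit_auc(X_tr, X_te, y_tr, y_te, proj)
    mean_auc = auc.mean(axis=0)

    # Step 4 (assumed elbow rule): smallest k within `tolerance` of the best mean AUC.
    on_plateau = mean_auc >= mean_auc.max() - tolerance
    intrinsic_dim = int(np.asarray(dims)[on_plateau].min())
    redundancy_pct = 100.0 * (n_features - intrinsic_dim) / n_features
    return {
        "baseline_auc": baseline_auc,
        "mean_auc": mean_auc,
        "intrinsic_dim": intrinsic_dim,
        "redundancy_pct": redundancy_pct,
    }


# Synthetic stand-in for the 64 ratios (assumption: NOT the Polish dataset).
X, y = make_classification(
    n_samples=600, n_features=64, n_informative=10, n_redundant=30,
    weights=[0.9, 0.1], random_state=0,
)
# Coarse grid and 3 runs keep the demo fast; the analysis described above
# sweeps every dimension from 64 down to 1 with 20 runs.
dims = [64, 48, 32, 16, 8, 4, 2, 1]
result = redundancy_diagnostic(X, y, dims, n_runs=3)
print(result["intrinsic_dim"], round(result["redundancy_pct"], 1))
```

To mirror the described setup, pass `dims=range(64, 0, -1)` and `n_runs=20`. The tolerance is a judgment call: a tighter value pushes the elbow toward higher dimensions and lowers the reported redundancy.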

Results

Model Complexity Audit

Feature Redundancy Analysis

Original Feature Count: 64
Intrinsic Dimensions: 50
Redundancy Ratio: 21.9%
Independent Ratios: 50 of 64

Interpretation: The averaged AUC curve plateaus at about 50 projected dimensions, so roughly 50 of the 64 financial ratios carry independent information about financial health. The remaining 14 (14/64 = 21.9%) behave as linear combinations of, or near-collinear with, the others, and predictive power levels off before all 64 features are needed.

This redundancy is not surprising. For example, Return on Assets (ROA), Return on Equity (ROE), and Profit Margin are mathematically linked through the DuPont decomposition. Similarly, current ratio and quick ratio measure overlapping aspects of liquidity. The diagnostic quantifies this structure systematically.
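The DuPont link can be checked numerically. The sketch below uses made-up statement figures (hypothetical values, not drawn from the dataset) and a hypothetical helper `dupont` to show that once profit margin, asset turnover, and the equity multiplier are known, ROA and ROE are fully determined and add no new information.

```python
import math


def dupont(net_income, revenue, total_assets, equity):
    """Return the DuPont components plus the ROA and ROE they imply."""
    profit_margin = net_income / revenue
    asset_turnover = revenue / total_assets
    equity_multiplier = total_assets / equity
    roa = profit_margin * asset_turnover   # equals net_income / total_assets
    roe = roa * equity_multiplier          # equals net_income / equity
    return {
        "profit_margin": profit_margin,
        "asset_turnover": asset_turnover,
        "equity_multiplier": equity_multiplier,
        "roa": roa,
        "roe": roe,
    }


# Hypothetical figures for illustration only.
r = dupont(net_income=50.0, revenue=1000.0, total_assets=500.0, equity=200.0)

# The decomposed values match the ratios computed directly from the statements.
assert math.isclose(r["roa"], 50.0 / 500.0)
assert math.isclose(r["roe"], 50.0 / 200.0)
```

The relationship is multiplicative, so the ratios are exactly collinear only after a log transform; in raw form they are strongly but not perfectly correlated, which is consistent with the "near-collinear" reading above.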

Analysis

Business & Model Implications

Feature Selection Guidance

A parsimonious model using 50 carefully chosen ratios should achieve comparable predictive power to the full 64-feature model, while offering better generalization, faster inference, and improved interpretability. Practitioners can use this diagnostic as justification for dropping correlated ratios rather than including them for marginal improvements.
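The random projection diagnostic indicates how many dimensions matter but not which ratios to drop. One common follow-up, shown here as an assumption rather than as part of this project's method, is a greedy correlation filter: keep a feature only if its absolute correlation with every feature already kept stays below a threshold. The function name `drop_correlated`, the 0.95 threshold, and the toy data are all illustrative.

```python
import numpy as np


def drop_correlated(X, threshold=0.95):
    """Greedily keep columns whose |correlation| with all kept columns is below threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, i] < threshold for i in keep):
            keep.append(j)
    return keep


# Toy data: three independent columns plus a noisy rescaled copy of column 0.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
near_copy = 2.0 * base[:, 0] + 0.01 * rng.normal(size=500)
X_toy = np.column_stack([base, near_copy])

kept = drop_correlated(X_toy)
print(kept)  # column 3 is dropped as a near-duplicate of column 0
```

The filter is order-dependent: when two ratios are near-duplicates, the one that appears first survives. Ordering columns by interpretability or business relevance before filtering lets that choice be deliberate.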

Regulatory Alignment

In credit risk and financial regulation, simpler models are generally preferred when performance is equivalent. Frameworks such as Basel III and Sarbanes-Oxley (SOX) emphasize transparency and documented controls, which parsimonious models make easier to demonstrate. Showing that 21.9% of features are redundant provides a quantitative rationale for model simplification.

Out-of-Sample Stability

High-dimensional models with correlated features are prone to overfitting and to performance degradation under new economic regimes (e.g., recessions, industry downturns). Pruning redundant features can improve robustness and reduce the risk of "spurious correlation" failures.

Note: Redundancy does not imply a lack of importance, and the diagnostic says nothing about causation. A correlated feature may still be interpretable and communicate risk clearly to stakeholders. The diagnostic informs dimensionality but does not dictate which features to keep.

Deliverables

Project Artifacts

The analysis is packaged as a self-contained Jupyter notebook with:

Extensions

Potential Directions

This diagnostic can be extended to: