Polish Bankruptcy RP Diagnostic
A financial distress prediction model diagnostic using sparse random projections to quantify feature redundancy in bankruptcy prediction. Applied to the Polish Companies Bankruptcy dataset to measure intrinsic dimensionality and validate model complexity against regulatory and financial soundness standards.
Financial Ratio Redundancy in Distress Prediction
Financial analysts and machine learning practitioners often compute dozens of ratios from company balance sheets: liquidity ratios, leverage ratios, profitability metrics, efficiency measures, and more. While intuition suggests that more ratios improve prediction, correlation among ratios creates feature bloat that destabilizes models and obscures the true drivers of bankruptcy risk.
The Polish Companies Bankruptcy dataset contains 64 financial ratio features with rich structure. This diagnostic asks a simple question: how many of these 64 ratios are truly independent? The answer informs feature selection, model complexity audits, and interpretability.
Sparse Random Projection Analysis
This project applies the same sparse random projection methodology used in credit risk diagnostics, adapted for the bankruptcy prediction domain. The approach quantifies feature redundancy without requiring explicit feature selection.
Train a balanced logistic regression classifier on all 64 financial ratio features using the Polish Companies Bankruptcy dataset (5-year prediction horizon, binary bankrupt/non-bankrupt classification).
Apply sparse random projection to systematically reduce the feature space from 64 down to 1 dimension, measuring predictive performance (AUC) at each step.
Execute the full projection pipeline 20 times with different random initializations and average the AUC curves to reduce noise and improve robustness.
Identify the "elbow" point where the AUC curve plateaus, take the dimensionality at that point as the intrinsic dimensionality, and calculate the redundancy ratio as (64 − intrinsic dimensions) / 64, expressed as a percentage.
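The four steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code. It uses synthetic data from `make_classification` in place of the Polish dataset, and the function names, the train/test split, the coarse dimensionality grid, and the plateau rule (smallest dimensionality whose mean AUC is within `tol` of the best) are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.random_projection import SparseRandomProjection


def auc_curve(X, y, dims, n_runs=20, seed=0):
    """Mean test AUC per projected dimensionality, averaged over n_runs projections."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    curves = np.zeros((n_runs, len(dims)))
    for run in range(n_runs):
        for j, k in enumerate(dims):
            model = make_pipeline(
                SimpleImputer(strategy="median"),  # fill missing ratios
                StandardScaler(),
                # A different random_state per run gives a different projection.
                SparseRandomProjection(n_components=k, random_state=seed + run),
                LogisticRegression(class_weight="balanced", max_iter=1000),
            )
            model.fit(X_tr, y_tr)
            scores = model.predict_proba(X_te)[:, 1]
            curves[run, j] = roc_auc_score(y_te, scores)
    return curves.mean(axis=0)


def intrinsic_dim(dims, mean_auc, tol=0.01):
    """Assumed elbow rule: smallest dimensionality within tol of the best mean AUC."""
    dims = np.asarray(dims)
    mean_auc = np.asarray(mean_auc)
    plateau = mean_auc >= mean_auc.max() - tol
    return int(dims[plateau].min())


def redundancy_ratio(n_features, n_intrinsic):
    """Redundancy as a percentage: (n_features - n_intrinsic) / n_features."""
    return 100.0 * (n_features - n_intrinsic) / n_features


# Synthetic stand-in for the 64-ratio dataset (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=64, n_informative=20, n_redundant=30,
    weights=[0.9, 0.1], random_state=0,
)
dims_demo = [1, 2, 4, 8, 16, 32, 64]  # coarse grid to keep the demo fast
mean_auc_demo = auc_curve(X_demo, y_demo, dims_demo, n_runs=3)
k_demo = intrinsic_dim(dims_demo, mean_auc_demo)
print(k_demo, redundancy_ratio(64, k_demo))
```

A full sweep would use `dims = range(64, 0, -1)` and `n_runs=20` as described above. The estimated intrinsic dimensionality depends on the chosen tolerance, so it is worth reporting alongside the result.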
Model Complexity Audit
Feature Redundancy Analysis
Interpretation: Of the 64 financial ratios, only about 50 represent independent dimensions of financial health. The remaining 14 (14/64 ≈ 21.9%) are linear combinations of, or nearly collinear with, other ratios, so predictive power plateaus at roughly 50 dimensions rather than requiring all 64 features.
This redundancy is not surprising. For example, Return on Assets (ROA), Return on Equity (ROE), and Profit Margin are mathematically linked through the DuPont decomposition. Similarly, current ratio and quick ratio measure overlapping aspects of liquidity. The diagnostic quantifies this structure systematically.
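For reference, the standard DuPont identity that links these three ratios can be written as below. The labels are generic accounting terms and are not taken from the dataset's attribute definitions.

```latex
\mathrm{ROA} = \underbrace{\frac{\text{Net income}}{\text{Revenue}}}_{\text{profit margin}}
               \times
               \underbrace{\frac{\text{Revenue}}{\text{Total assets}}}_{\text{asset turnover}},
\qquad
\mathrm{ROE} = \mathrm{ROA} \times
               \underbrace{\frac{\text{Total assets}}{\text{Equity}}}_{\text{equity multiplier}}
```

The relationship is multiplicative, so it is exactly linear only in logarithms. A linear diagnostic such as random projection followed by logistic regression would therefore be expected to capture this kind of dependence only approximately.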
Business & Model Implications
Feature Selection Guidance
A parsimonious model using about 50 carefully chosen ratios should achieve predictive power comparable to the full 64-feature model, while offering better generalization, faster inference, and improved interpretability. Practitioners can use this diagnostic as justification for dropping correlated ratios rather than including them for marginal improvements.
Regulatory Alignment
In credit risk and financial regulation, simpler models are generally preferred when performance is equivalent. Model governance under frameworks such as Basel III and Sarbanes-Oxley (SOX) emphasizes transparency and documentation, which are easier to demonstrate for a parsimonious model. Demonstrating that 21.9% of features are redundant provides a quantitative rationale for model simplification.
Out-of-Sample Stability
High-dimensional models with correlated features are prone to overfitting and to performance degradation under new economic regimes (e.g., recessions, industry downturns). Pruning redundant features improves robustness and reduces the risk of failures caused by spurious correlations.
Redundancy does not imply that a feature is unimportant. A correlated feature may still be interpretable and communicate risk clearly to stakeholders. The diagnostic estimates how many dimensions are needed but does not dictate which features to keep.
Project Artifacts
The analysis is packaged as a self-contained Jupyter notebook with:
- Data loading: UCI Polish Companies Bankruptcy dataset (multiple years available)
- Preprocessing: Normalization, handling missing values, class balancing
- Baseline modeling: Logistic regression with AUC metric
- RP pipeline: Sparse random projection with dimensionality sweeps and ensemble averaging
- Visualization: Redundancy curves, feature importance rankings, elbow plots
- Summary report: Redundancy ratio, intrinsic dimensionality, compliance notes
- Dependencies: scikit-learn, pandas, numpy, matplotlib, scipy
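As an illustration of the data-loading step, the sketch below reads one year of the dataset with `scipy.io.arff.loadarff`. It assumes the data is stored as ARFF files with the binary class label as the last attribute. The helper name `load_polish_arff` and the file path in the usage comment are hypothetical and not taken from the notebook.

```python
import numpy as np
import pandas as pd
from scipy.io import arff


def load_polish_arff(source):
    """Read one ARFF file (path or file-like object) and return (X, y).

    Assumes the last attribute is the binary class label and every
    other attribute is numeric (an assumption about the file layout).
    """
    data, meta = arff.loadarff(source)
    names = meta.names()
    label = names[-1]
    # Nominal values arrive as bytes (e.g. b"1"); int() accepts them directly.
    y = np.array([int(v) for v in data[label]])
    # Missing entries ("?" in ARFF) are parsed as NaN for numeric attributes.
    X = pd.DataFrame({name: data[name] for name in names[:-1]}).astype(float)
    return X, y


# Usage with a hypothetical local path:
# X, y = load_polish_arff("1year.arff")
```

Missing values are left as NaN here so that imputation can happen inside the modeling pipeline, which keeps the imputation statistics from leaking test data.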
Potential Directions
This diagnostic can be extended to:
- Non-linear methods: Compare sparse RP results with kernel PCA or autoencoders
- Time-series stability: Measure redundancy across different economic periods (pre-crisis, crisis, recovery)
- Feature attribution: Combine with SHAP or permutation importance to rank features by both redundancy and predictive contribution
- Multi-horizon prediction: Analyze redundancy for different prediction horizons (1-year, 3-year, 5-year)