Polish Bankruptcy RP Diagnostic
A financial distress prediction model diagnostic using sparse random projections to quantify feature redundancy in bankruptcy prediction. Applied to the Polish Companies Bankruptcy dataset to measure intrinsic dimensionality and validate model complexity against regulatory and financial soundness standards.
Financial Ratio Redundancy in Distress Prediction
Financial analysts compute dozens of ratios from company balance sheets — liquidity, leverage, profitability, efficiency. While intuition suggests more ratios improve prediction, correlation among them creates feature bloat that destabilizes models and obscures the true drivers of bankruptcy risk. The Polish Companies Bankruptcy dataset contains 64 financial ratio features. This diagnostic asks: how many are truly independent?
Sparse Random Projection Analysis
The methodology is the same as in the credit risk diagnostic, adapted for bankruptcy prediction. It quantifies feature redundancy without requiring explicit feature selection.
Train a balanced logistic regression classifier on all 64 financial ratio features using the Polish Companies Bankruptcy dataset (5-year prediction horizon, binary bankrupt/non-bankrupt classification).
Apply sparse random projection to systematically reduce the feature space from 64 down to 1 dimension, measuring predictive performance (AUC) at each step.
Execute the full projection pipeline 20 times with different random initializations and average the AUC curves to reduce noise and improve robustness.
Identify the "elbow" point where the averaged AUC curve plateaus, take the dimension at that point as the intrinsic dimensionality, and calculate the redundancy ratio as (64 − intrinsic dimensions) / 64, expressed as a percentage.
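The sweep and averaging steps above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `mean_auc_curve`, the 70/30 stratified split, the use of `class_weight="balanced"` for the "balanced" classifier, and refitting the classifier at every projected dimension are all assumptions. It also assumes `X` has already had missing values imputed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.random_projection import SparseRandomProjection


def mean_auc_curve(X, y, dims, n_runs=20, test_size=0.3, seed=0):
    """Return the test AUC at each projected dimension, averaged over n_runs.

    Hypothetical helper: one fixed train/test split, with a fresh sparse
    random projection drawn for every (run, dimension) pair.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Fit the scaler on the training split only to avoid leakage.
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

    aucs = np.zeros((n_runs, len(dims)))
    for run in range(n_runs):
        for j, k in enumerate(dims):
            # A different random_state gives a different projection matrix.
            rp = SparseRandomProjection(
                n_components=k, random_state=seed + 1000 * run + k
            )
            Z_tr = rp.fit_transform(X_tr)
            Z_te = rp.transform(X_te)
            clf = LogisticRegression(class_weight="balanced", max_iter=1000)
            clf.fit(Z_tr, y_tr)
            aucs[run, j] = roc_auc_score(y_te, clf.predict_proba(Z_te)[:, 1])
    # Average across runs to smooth out projection-to-projection noise.
    return aucs.mean(axis=0)
```

For the full sweep described above, `dims` would be `list(range(64, 0, -1))` and `n_runs` would stay at 20, which means 1,280 model fits in total.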
Model Complexity Audit
Feature Redundancy Analysis
Interpretation: The averaged AUC curve plateaus at about 50 projected dimensions, so the 64 financial ratios span roughly 50 independent dimensions of financial health. The remaining 14 dimensions (14/64 = 21.9%) carry information that is already captured by linear combinations of, or near-collinear relationships among, the other ratios, so predictive power levels off before all 64 features are needed.
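The arithmetic behind the 21.9% figure, together with one possible way to locate the plateau, can be sketched as below. The source does not state how the elbow is detected, so the tolerance rule in `intrinsic_dim` (the smallest dimension whose AUC is within `tol` of the best AUC) is an assumption. The AUC values in the usage lines are made up purely for illustration.

```python
def intrinsic_dim(dims, aucs, tol=0.005):
    """Smallest dimension whose mean AUC is within tol of the best AUC.

    Hypothetical elbow rule; other definitions (e.g. curvature-based) exist.
    """
    best = max(aucs)
    within_tol = [d for d, a in zip(dims, aucs) if a >= best - tol]
    return min(within_tol)


def redundancy_ratio(n_features, n_intrinsic):
    """Fraction of the feature space that is redundant."""
    return (n_features - n_intrinsic) / n_features


# Illustrative (made-up) curve: AUC is flat down to 50 dimensions, then drops.
dims = [64, 50, 32, 16]
aucs = [0.900, 0.899, 0.850, 0.700]
k = intrinsic_dim(dims, aucs)
print(k, f"{redundancy_ratio(64, k):.1%}")
```

With the reported intrinsic dimensionality of 50, the ratio is (64 − 50) / 64 = 0.21875, which rounds to 21.9%.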
This redundancy is not surprising. For example, Return on Assets (ROA), Return on Equity (ROE), and Profit Margin are mathematically linked through the DuPont decomposition. Similarly, current ratio and quick ratio measure overlapping aspects of liquidity. The diagnostic quantifies this structure systematically.
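As a small worked check of the DuPont link mentioned above, the snippet below computes the ratios from hypothetical statement figures (the numbers are invented for illustration) and confirms that ROA and ROE are fully determined by profit margin, asset turnover, and the equity multiplier. The relationship is multiplicative, so it is exactly linear only in logarithms; in raw ratios it shows up as strong but not perfect collinearity.

```python
import math

# Hypothetical figures for a single company (illustrative only).
net_income, revenue, total_assets, equity = 12.0, 200.0, 150.0, 60.0

profit_margin = net_income / revenue        # 0.06
asset_turnover = revenue / total_assets     # about 1.333
equity_multiplier = total_assets / equity   # 2.5

roa = net_income / total_assets
roe = net_income / equity

# DuPont identities: ROA = margin * turnover, ROE = ROA * equity multiplier.
assert math.isclose(roa, profit_margin * asset_turnover)
assert math.isclose(roe, roa * equity_multiplier)

# In log space the identity becomes an exact linear relationship.
assert math.isclose(
    math.log(roe),
    math.log(profit_margin) + math.log(asset_turnover) + math.log(equity_multiplier),
)
```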
Business & Model Implications
Feature Selection Guidance
A parsimonious model using about 50 carefully chosen ratios should achieve predictive power comparable to the full 64-feature model, with better generalization and interpretability. Practitioners can cite this diagnostic as supporting evidence when dropping correlated ratios.
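The diagnostic itself does not say which ratios to drop. One simple option, shown here only as an illustrative sketch and not as part of the original analysis, is a greedy filter on pairwise correlation. The function name `drop_correlated` and the 0.95 threshold are assumptions.

```python
import numpy as np


def drop_correlated(X, names, threshold=0.95):
    """Greedily keep a feature only if its absolute Pearson correlation
    with every already-kept feature is below the threshold.

    Features are scanned in column order, so earlier columns win ties.
    Constant columns produce NaN correlations and are dropped, unless the
    constant column comes first, in which case it is kept.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]


# Usage on synthetic data: "b" is almost a rescaled copy of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
c = rng.normal(size=200)
X_demo = np.column_stack([a, 2 * a + 0.01 * rng.normal(size=200), c])
print(drop_correlated(X_demo, ["a", "b", "c"]))
```

Because the order of columns decides which member of a correlated pair survives, domain judgment about interpretability should set that order.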
Regulatory Alignment
Frameworks such as Basel III and Sarbanes-Oxley emphasize transparency and documented controls, and a simpler model is easier to validate and explain when performance is equivalent. Demonstrating that 21.9% of the feature space is redundant provides a quantitative rationale for model simplification.
Out-of-Sample Stability
High-dimensional models with correlated features tend to overfit and can degrade under new economic regimes. Pruning redundant features can improve robustness and reduce the risk of failures driven by spurious correlations.
Redundancy is a statistical property: it does not imply that a feature is unimportant, and it says nothing about causation. A correlated feature may still be interpretable and communicate risk clearly to stakeholders. The diagnostic informs dimensionality but does not dictate which features to keep.
Project Artifacts
The analysis is packaged as a self-contained Jupyter notebook with:
- Data loading: UCI Polish Companies Bankruptcy dataset (multiple years available)
- Preprocessing: Normalization, handling missing values, class balancing
- Baseline modeling: Logistic regression with AUC metric
- RP pipeline: Sparse random projection with dimensionality sweeps and ensemble averaging
- Visualization: Redundancy curves, feature importance rankings, elbow plots
- Summary report: Redundancy ratio, intrinsic dimensionality, compliance notes
- Dependencies: scikit-learn, pandas, numpy, matplotlib, scipy
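The preprocessing and baseline modeling items above might look roughly like the following scikit-learn pipeline. This is a sketch under assumptions rather than the notebook's code: median imputation, standard scaling, `class_weight="balanced"` as the class-balancing mechanism, and 5-fold stratified cross-validation are all choices made here for illustration, and `baseline_auc` is a hypothetical name.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def baseline_auc(X, y, n_splits=5, seed=0):
    """Cross-validated ROC AUC of a balanced logistic regression baseline."""
    model = make_pipeline(
        SimpleImputer(strategy="median"),   # fill missing ratio values
        StandardScaler(),                   # put ratios on a common scale
        LogisticRegression(class_weight="balanced", max_iter=1000),
    )
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Imputer and scaler are refit inside each fold, so no test-fold leakage.
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
```

Keeping imputation and scaling inside the pipeline means the same steps are applied consistently whenever the model is refit, for example at each point of a dimensionality sweep.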
Potential Directions
This diagnostic can be extended to:
- Non-linear methods: Compare sparse RP results with kernel PCA or autoencoders
- Time-series stability: Measure redundancy across different economic periods (pre-crisis, crisis, recovery)
- Feature attribution: Combine with SHAP or permutation importance to rank features by both redundancy and predictive contribution
- Multi-horizon prediction: Analyze redundancy for different prediction horizons (1-year, 3-year, 5-year)