Polish Bankruptcy RP Diagnostic

A financial distress prediction model diagnostic using sparse random projections to quantify feature redundancy in bankruptcy prediction. Applied to the Polish Companies Bankruptcy dataset to measure intrinsic dimensionality and validate model complexity against regulatory and financial soundness standards.

Python, Machine Learning, Scikit-learn, Financial Ratios, Bankruptcy Prediction

Motivation

Financial Ratio Redundancy in Distress Prediction

Financial analysts compute dozens of ratios from company balance sheets, covering liquidity, leverage, profitability, and efficiency. Intuition suggests that more ratios improve prediction, but strong correlation among them inflates the feature set, can destabilize model coefficients, and obscures the true drivers of bankruptcy risk. The Polish Companies Bankruptcy dataset contains 64 financial ratio features. This diagnostic asks: how many of them carry independent information?

Methodology

Sparse Random Projection Analysis

This diagnostic uses the same methodology as the credit risk diagnostic, adapted for bankruptcy prediction. It quantifies feature redundancy without requiring explicit feature selection.

Baseline Model Training

Train a balanced logistic regression classifier on all 64 financial ratio features using the Polish Companies Bankruptcy dataset (5-year prediction horizon, binary bankrupt/non-bankrupt classification).
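A minimal sketch of this baseline step follows. It is not the notebook's code: the data are a synthetic stand-in generated with make_classification, and the split ratio, scaler, and max_iter setting are assumptions made for illustration. "Balanced" is interpreted here as scikit-learn's class_weight="balanced" option.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 64 ratio features (assumption: the real
# analysis loads the Polish Companies Bankruptcy data instead).
X, y = make_classification(
    n_samples=2000, n_features=64, n_informative=20, n_redundant=30,
    n_clusters_per_class=1, weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" reweights each class inversely to its frequency,
# which matters because bankruptcies are the rare class.
baseline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
baseline.fit(X_train, y_train)

# AUC is computed from predicted probabilities of the positive class.
baseline_auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])
print(f"Baseline AUC on 64 features: {baseline_auc:.3f}")
```

The AUC printed here describes the synthetic data only and says nothing about the real dataset.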

Dimensionality Sweep

Apply sparse random projection to reduce the feature space step by step from 64 dimensions down to 1, measuring predictive performance (AUC) at each target dimension.
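The sweep can be sketched as below, assuming scikit-learn's SparseRandomProjection with its default density. The helper name auc_sweep, the synthetic data, and the coarse grid of dimensions are illustrative choices; the sweep described above visits every dimension from 64 to 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.random_projection import SparseRandomProjection


def auc_sweep(X_train, X_test, y_train, y_test, dims, seed=0):
    """Test AUC of a balanced logistic regression at each projected dimension."""
    aucs = []
    for k in dims:
        # Fit the projection on training data and reuse it for the test data.
        rp = SparseRandomProjection(n_components=k, random_state=seed)
        Z_train = rp.fit_transform(X_train)
        Z_test = rp.transform(X_test)
        clf = LogisticRegression(class_weight="balanced", max_iter=1000)
        clf.fit(Z_train, y_train)
        aucs.append(roc_auc_score(y_test, clf.predict_proba(Z_test)[:, 1]))
    return np.array(aucs)


# Synthetic stand-in data (assumption; not the Polish dataset).
X, y = make_classification(
    n_samples=1500, n_features=64, n_informative=20, n_redundant=30,
    n_clusters_per_class=1, weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

dims = [64, 32, 16, 8, 4, 2, 1]  # coarse grid to keep the example fast
aucs = auc_sweep(X_train, X_test, y_train, y_test, dims)
for k, auc in zip(dims, aucs):
    print(f"k={k:2d}  AUC={auc:.3f}")
```

A single sweep depends on the random projection matrix, so one curve on its own can be noisy, especially at small dimensions.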

Ensemble Stabilization

Execute the full projection pipeline 20 times with different random initializations and average the AUC curves to reduce noise and improve robustness.
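One way to sketch the averaging is to repeat the projection with a different seed per run and take the mean AUC at each dimension. The function names, the synthetic data, and the choice to also return the standard deviation are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.random_projection import SparseRandomProjection


def projected_auc(k, seed, X_train, X_test, y_train, y_test):
    """AUC after projecting to k dimensions with one random seed."""
    rp = SparseRandomProjection(n_components=k, random_state=seed)
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(rp.fit_transform(X_train), y_train)
    scores = clf.predict_proba(rp.transform(X_test))[:, 1]
    return roc_auc_score(y_test, scores)


def averaged_auc_curve(X_train, X_test, y_train, y_test, dims, n_runs=20):
    """Mean and standard deviation of the AUC curve over n_runs projections."""
    curves = np.empty((n_runs, len(dims)))
    for run in range(n_runs):
        for j, k in enumerate(dims):
            # The run index doubles as the random seed of the projection.
            curves[run, j] = projected_auc(
                k, run, X_train, X_test, y_train, y_test
            )
    return curves.mean(axis=0), curves.std(axis=0)


# Synthetic stand-in data (assumption; not the Polish dataset).
X, y = make_classification(
    n_samples=1000, n_features=64, n_informative=20, n_redundant=30,
    n_clusters_per_class=1, weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

dims = [64, 32, 16, 8, 4]
mean_curve, std_curve = averaged_auc_curve(
    X_train, X_test, y_train, y_test, dims, n_runs=20
)
```

The per-dimension standard deviation is a convenient by-product: it shows where the averaged curve is still noisy.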

Redundancy Metric

Identify the "elbow" point where the averaged AUC curve plateaus, take the dimension at that point as the intrinsic dimensionality, and calculate the redundancy ratio as (64 − intrinsic dimensions) / 64, expressed as a percentage.
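The source does not state how the elbow is located, so the sketch below uses one simple rule as an assumption: the plateau is the run of dimensions, counted down from the largest, whose AUC stays within a tolerance of the best AUC on the curve. The function names, the tolerance, and the AUC values are hypothetical; the AUC values are made up to illustrate the arithmetic and are not the project's measurements.

```python
def intrinsic_dimension(dims, mean_auc, tol=0.005):
    """Smallest dimension on the plateau, scanning from the largest dimension down."""
    pairs = sorted(zip(dims, mean_auc), reverse=True)  # largest dimension first
    best = max(mean_auc)
    elbow = pairs[0][0]
    for k, auc in pairs:
        if auc < best - tol:
            break  # the curve has left the plateau
        elbow = k
    return elbow


def redundancy_ratio(n_features, n_intrinsic):
    """Redundancy as a percentage: (n_features - n_intrinsic) / n_features."""
    return 100.0 * (n_features - n_intrinsic) / n_features


# Made-up illustrative curve (assumption; not measured results).
dims = [64, 56, 50, 40, 30, 20, 10]
mean_auc = [0.900, 0.899, 0.898, 0.880, 0.850, 0.800, 0.700]

elbow = intrinsic_dimension(dims, mean_auc)
ratio = redundancy_ratio(64, elbow)
print(elbow, f"{ratio:.1f}%")  # 50 21.9%
```

The result is sensitive to the tolerance, so any reported ratio should be accompanied by the rule and threshold used to find the plateau.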

Results

Model Complexity Audit

Feature Redundancy Analysis

Original Feature Count: 64
Intrinsic Dimensions: 50
Redundancy Ratio: 21.9%
Independent Ratios: 50 of 64

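Interpretation: The averaged AUC curve plateaus at about 50 projected dimensions, which suggests that the 64 financial ratios span roughly 50 effectively independent dimensions of financial health. The remaining 14 dimensions (14 / 64 = 21.9%) add little predictive information, consistent with some ratios being linear combinations of, or nearly collinear with, others. Predictive power therefore levels off before all 64 features are needed.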

This redundancy is not surprising. For example, Return on Assets (ROA), Return on Equity (ROE), and Profit Margin are mathematically linked through the DuPont decomposition. Similarly, current ratio and quick ratio measure overlapping aspects of liquidity. The diagnostic quantifies this structure systematically.
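The DuPont link can be checked with a few lines of arithmetic. The figures below are hypothetical and chosen only to make the identities easy to verify; the variable names are illustrative.

```python
import math

# Hypothetical statement figures (assumption, for illustration only).
net_income, revenue, total_assets, equity = 12.0, 150.0, 200.0, 80.0

profit_margin = net_income / revenue        # profitability
asset_turnover = revenue / total_assets     # efficiency
equity_multiplier = total_assets / equity   # leverage

roa = net_income / total_assets
roe = net_income / equity

# DuPont decomposition: ROA and ROE are products of the other ratios.
assert math.isclose(roa, profit_margin * asset_turnover)
assert math.isclose(roe, profit_margin * asset_turnover * equity_multiplier)
```

The relationships are multiplicative rather than linear, so the ratios are strongly dependent without being exact linear combinations of one another.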

Analysis

Business & Model Implications

Feature Selection Guidance

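A parsimonious model using about 50 carefully chosen ratios should achieve predictive power comparable to the full 64-feature model, with better generalization and interpretability. Because random projections mix ratios rather than select them, the diagnostic indicates how many dimensions are needed, not which ratios to drop. Practitioners can cite it as quantitative support for removing correlated ratios, then choose the specific ratios with a separate selection step.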

Regulatory Alignment

Frameworks such as Basel III and Sarbanes-Oxley do not prescribe a feature count, but model validation and governance practice under them generally favors transparency and parsimony when performance is equivalent. Demonstrating that 21.9% of the feature space is redundant provides a quantitative rationale for model simplification.

Out-of-Sample Stability

High-dimensional models with correlated features tend to overfit and can degrade when applied to new economic regimes. Pruning redundant features can improve robustness and reduce the risk of failures caused by spurious correlations.

Redundancy is a statistical property and does not imply that a feature is unimportant. A correlated feature may still be interpretable and communicate risk clearly to stakeholders. The diagnostic informs dimensionality but does not dictate which features to keep.

Deliverables

Project Artifacts

The analysis is packaged as a self-contained Jupyter notebook with:

Data loading: UCI Polish Companies Bankruptcy dataset (multiple years available)
Preprocessing: Normalization, handling missing values, class balancing
Baseline modeling: Logistic regression with AUC metric
RP pipeline: Sparse random projection with dimensionality sweeps and ensemble averaging
Visualization: Redundancy curves, feature importance rankings, elbow plots
Summary report: Redundancy ratio, intrinsic dimensionality, compliance notes
Dependencies: scikit-learn, pandas, numpy, matplotlib, scipy
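As a rough illustration of the preprocessing item above, the following sketch imputes missing values and standardizes the features with a scikit-learn pipeline. The median imputation strategy and the random stand-in data are assumptions; the notebook's actual choices are not stated here. Class balancing is assumed to be handled in the classifier (for example through class weights) rather than in this step.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random stand-in for the 64 ratio features with about 5% missing entries
# (assumption; the real analysis loads the UCI dataset instead).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))
X[rng.random(X.shape) < 0.05] = np.nan

# Median imputation followed by standardization to zero mean, unit variance.
preprocess = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X_clean = preprocess.fit_transform(X)

print(X_clean.shape, int(np.isnan(X_clean).sum()))
```

In a train/test setting the pipeline would be fitted on the training split only and then applied to the test split, so that test data do not influence the imputation values or scaling parameters.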

Extensions

Potential Directions

This diagnostic can be extended to:

Non-linear methods: Compare sparse RP results with kernel PCA or autoencoders
Time-series stability: Measure redundancy across different economic periods (pre-crisis, crisis, recovery)
Feature attribution: Combine with SHAP or permutation importance to rank features by both redundancy and predictive contribution
Multi-horizon prediction: Analyze redundancy for different prediction horizons (1-year, 3-year, 5-year)