Łukasz JANKOWSKI and Rafał JANKOWSKI
AGH University of Krakow, Poland
The growing use of data analytics in the debt collection sector has increased the demand for accurate and transparent models capable of predicting recovery levels in large portfolios of mass receivables. Although machine-learning methods are increasingly important in financial analysis, the literature still lacks empirical studies based on real operational data from large-scale portfolios. This study addresses that gap by comparing three tree-based machine-learning algorithms: decision tree, random forest and XGBoost. The models were trained on ex-ante data derived from 389,250 actual receivables serviced by an entity operating on the Polish debt collection market.
The approach combined extensive hyperparameter tuning with an evaluation of predictive performance using the MAE, RMSE and R² metrics. To improve interpretability and provide the transparency required for managerial and regulatory analysis, SHAP values were used to identify the variables with the greatest influence on model outcomes.
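For readers less familiar with the three evaluation metrics, the following minimal sketch shows how MAE, RMSE and R² are conventionally computed from observed and predicted values. The function name `regression_metrics` and the toy recovery rates are hypothetical illustrations and are not taken from the study's data or code.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for observed vs. predicted values.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    # Mean absolute error: average magnitude of the prediction errors.
    mae = np.mean(np.abs(residuals))
    # Root mean squared error: penalises large errors more strongly.
    rmse = np.sqrt(np.mean(residuals ** 2))

    # R^2: share of the variance in y_true explained by the predictions.
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    # R^2 is undefined when the observed values are constant.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"MAE": float(mae), "RMSE": float(rmse), "R2": float(r2)}


# Toy recovery rates (made-up values, for illustration only).
observed = [0.0, 0.5, 1.0]
predicted = [0.0, 0.5, 0.5]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

Lower MAE and RMSE indicate smaller prediction errors, while an R² closer to 1 indicates that the model explains more of the variation in observed recovery levels.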
The results indicate that the random forest model offers the best balance between accuracy and generalisation ability, outperforming the single decision tree and achieving slightly better results than XGBoost. The most significant predictors were the nominal claim value, the purchase price and the debtor’s age, complemented by regional characteristics and legal-form attributes.
These findings have important managerial and economic implications, supporting more precise portfolio valuation, more effective risk assessment, better allocation of operational resources and improved planning of both amicable and enforcement strategies.