Preprint
Article

This version is not peer-reviewed.

Machine Learning and Frequency–Severity Decomposition for Insurance Pricing

Submitted: 31 March 2026
Posted: 01 April 2026


Abstract
Insurance pricing plays a central role in risk management and financial decision-making, 2 as accurate premium estimation directly impacts portfolio stability and profitability. This 3 study investigates insurance pure premium estimation by integrating classical actuar- 4 ial models with modern machine learning techniques. We compare the traditional fre- 5 quency–severity decomposition framework with direct modeling approaches, including 6 XGBoost and Tweedie models. For claim frequency, we evaluate Poisson-based models, 7 generalized additive models, and XGBoost. For claim severity, we compare a Gamma gen- 8 eralized linear model with XGBoost. The results show that XGBoost significantly improves 9 predictive performance for both components. Within the decomposition framework, the 10 XGBoost–XGBoost model achieves the best overall prediction accuracy. However, lift-based 11 analysis reveals that the XGBoost–Gamma model provides superior risk segmentation, 12 highlighting a trade-off between prediction accuracy and risk ranking. Direct modeling 13 approaches, while competitive, do not outperform the decomposition framework. Overall, 14 the findings demonstrate that machine learning enhances predictive performance, but its 15 effectiveness is maximized within the frequency–severity framework. The results further 16 indicate that claim frequency is the primary driver of risk differentiation, while claim sever- 17 ity contributes more to prediction accuracy. These findings have important implications for 18 risk management and pricing strategies in insurance portfolios.

1. Introduction

Accurate estimation of expected losses is a central challenge in automobile insurance pricing. Insurers must determine premiums that reflect the underlying risk of each policyholder while ensuring fairness, competitiveness, and regulatory compliance, as premium estimation directly affects portfolio stability and profitability.
A cornerstone of actuarial practice is the frequency–severity framework, which decomposes the expected loss—or pure premium—into two latent components: the expected number of claims and the expected claim size. By modeling these components separately, actuaries can employ statistical distributions tailored to the distinct stochastic properties of each process before recombining them for final loss estimation. This framework provides both interpretability and flexibility, making it fundamental to modern insurance ratemaking.
Generalized Linear Models (GLMs) have long served as the industry standard for actuarial modeling. Existing literature highlights the capacity of GLMs to provide an interpretable yet flexible structure for non-normal, skewed, and heteroscedastic insurance data [3,7,8]. While Poisson and Negative Binomial models are commonly used for claim frequency, Gamma models are widely applied to claim severity due to the strictly positive and right-skewed nature of loss amounts. Extensions such as Generalized Additive Models (GAMs) allow for nonlinear relationships while preserving model transparency [4,10].
Machine learning (ML) has emerged as a powerful tool in insurance analytics. Tree-based algorithms and gradient boosting methods are particularly effective in capturing complex interactions and nonlinear relationships that traditional parametric models may overlook. Empirical evidence suggests that models such as XGBoost can significantly improve performance in tasks ranging from claim frequency prediction to fraud detection [2,6,9].
Recent studies have explored both direct and component-based modeling strategies using machine learning. Direct modeling of total claim amounts, using methods such as Support Vector Regression (SVR), XGBoost, and neural networks, has demonstrated strong predictive accuracy for aggregate loss estimation [18]. Other studies incorporate machine learning within the classical frequency–severity framework, where gradient boosting models are compared with traditional GLMs [19]. These findings indicate that machine learning consistently improves claim frequency prediction, while results for claim severity remain less conclusive due to the high variability of loss amounts.
Despite these advances, direct modeling approaches do not explicitly account for the structural decomposition of insurance losses into frequency and severity components. Moreover, existing evidence on the relative performance of decomposition-based and direct modeling approaches remains inconclusive, particularly when evaluated under both predictive accuracy and risk segmentation criteria.
Classical actuarial models continue to play an important role due to their interpretability and regulatory transparency. Consequently, recent research has focused on hybrid approaches that integrate machine learning techniques within the frequency–severity framework [11,12]. These approaches aim to enhance predictive performance while preserving the structural foundations of actuarial pricing.
Alternative unified modeling approaches have also been proposed, most notably through the Tweedie distribution, which provides a compound Poisson–Gamma representation of aggregate losses [15,16]. While such models offer strong theoretical appeal, it remains unclear whether they can outperform decomposition-based methods in practical insurance applications.
Motivated by these considerations and the lack of consensus in the literature, this study investigates whether machine learning improves insurance pricing by replacing the classical decomposition framework or by enhancing its individual components. In particular, we evaluate model performance not only in terms of prediction accuracy but also in terms of the ability to identify high-risk policies.
To address these questions, we develop a comprehensive modeling framework that integrates classical actuarial models with modern machine learning techniques. We evaluate several models for claim frequency, including Poisson GLMs, spline-based models, and XGBoost. For claim severity, we compare a Gamma GLM with XGBoost. These models are combined within the frequency–severity decomposition framework to estimate pure premium. In addition, we examine direct modeling approaches, including XGBoost and Tweedie models, for predicting aggregate losses.
The contribution of this study is threefold. First, we demonstrate that machine learning methods significantly improve predictive performance for both claim frequency and claim severity, particularly in capturing nonlinear relationships and interactions. Second, we provide empirical evidence that the frequency–severity decomposition framework remains superior to direct modeling approaches, even when advanced machine learning methods are applied. Third, we identify a trade-off between prediction accuracy and risk segmentation: while models combining machine learning for both components achieve the best overall accuracy, models emphasizing frequency modeling provide superior identification of high-risk policies.
Overall, this study contributes to the literature on actuarial science and financial mathematics by providing a comprehensive comparison of classical and modern approaches to insurance pricing. The results highlight the continued relevance of the frequency–severity decomposition framework while demonstrating how machine learning can be effectively integrated to enhance predictive performance and risk differentiation.
The remainder of the paper is organized as follows. Section 2 describes the dataset and data preparation procedures; Section 3 presents the modeling framework and statistical methods; Section 4 reports the empirical findings; and Section 5 concludes with a discussion of the results and their implications for insurance pricing and risk management.

2. Data Description and Preparation

This study utilizes the French motor third-party liability (freMTPL2) datasets from the CASdatasets package [1]. The raw data comprise two files: freMTPL2freq, containing 677,991 policy-year records, and freMTPL2sev, containing 26,444 individual claim amounts. These datasets provide a comprehensive set of risk features, including driver demographics, vehicle attributes, and geographic factors.
Table 1 summarizes the variables included in the frequency dataset. These covariates represent standard rating factors used in automobile insurance pricing, including driver age, vehicle characteristics, geographic region, and exposure.
The severity dataset contains 26,444 claim-level observations and includes the policy identifier and the corresponding claim amount for each reported claim. Because the frequency data are recorded at the policy level while the severity data are recorded at the claim level, the two datasets must be reconciled before modeling.

2.1. Data Integration and Preprocessing

To reconcile the policy-level frequency data with the claim-level severity data, all individual claim amounts in freMTPL2sev were aggregated by policy identifier (IDpol). This produced, for each policy, the total claim amount incurred during the exposure period and the number of associated claims. Policies with no reported claims were assigned a total loss of zero. The aggregated severity information was then merged with the frequency dataset to create a unified policy-level file containing exposure, claim counts, total incurred losses, and all rating variables.
For policies with at least one claim, an average claim severity variable was also computed for descriptive purposes. Pure premium values were obtained by dividing the total incurred loss by the policy’s exposure. These derived quantities are used only for exploratory analysis in this section; the formal notation and modeling framework are introduced later in Section 3.
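The aggregation and merge described above can be sketched as follows. This is a minimal pure-Python illustration (the study itself works with the CASdatasets R files); the record layout and field names other than the dataset's own `IDpol` identifier are assumed for the example.

```python
from collections import defaultdict

def build_policy_table(freq_records, sev_records):
    """Aggregate claim-level severities to the policy level and merge
    them with the frequency data, as described in Section 2.1."""
    # Sum individual claim amounts per policy identifier (IDpol).
    total_loss = defaultdict(float)
    for claim in sev_records:
        total_loss[claim["IDpol"]] += claim["ClaimAmount"]

    merged = []
    for pol in freq_records:
        loss = total_loss.get(pol["IDpol"], 0.0)  # zero for claim-free policies
        row = dict(pol)
        row["TotalLoss"] = loss
        # Pure premium: total incurred loss divided by exposure (in years).
        row["PurePremium"] = loss / row["Exposure"]
        # Average severity, defined only for policies with claims.
        row["AvgSeverity"] = loss / pol["ClaimNb"] if pol["ClaimNb"] > 0 else None
        merged.append(row)
    return merged
```

The left-join semantics matter here: every frequency record is retained, and policies absent from the severity file receive a total loss of zero rather than being dropped.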

2.2. Exploratory Data Analysis

The raw insurance portfolio exhibits the extreme class imbalance and heavy-tailed distributions typical of motor liability risks. Figure 1 illustrates these characteristics using log-scaled axes to visualize the full range of the data.
The frequency distribution (Figure 1, left) is dominated by zero-claim policies, with a rapid decay in frequency as claim counts increase. Notably, the raw data contain rare observations of up to 16 claims per policy, appearing as isolated points in the extreme tail. Similarly, the claim severity and pure premium distributions (Figure 1, center and right) span several orders of magnitude. The severity plot reveals a significant concentration of claims around 1,000 EUR, but also shows an exceptionally long right tail reaching towards 10^6 EUR, representing catastrophic losses.
These visualizations highlight the necessity of the preprocessing steps described in Section 2.3. Specifically, the extreme sparsity of high claim counts and the high-leverage outliers in the severity tail motivate the use of capping (winsorization) to ensure the numerical stability and generalizability of the frequency and severity models.

2.3. Data Cleaning and Preparation

This section details the preprocessing steps implemented to ensure model stability and mitigate the influence of extreme observations. These procedures are critical for both the Poisson frequency models and the Gamma severity models, which are sensitive to high-leverage outliers.
Exposure Filtering
Policies with extremely low exposure are prone to producing artificial variance in annualized claim rates. We excluded all records with an exposure below 0.1 years. This threshold ensures that the observations used for training represent a meaningful period of risk. As shown in Table 2, this step improved the mean exposure from 0.529 to 0.632 years.
Capping and Winsorization
Given the extreme right-skewness observed in Figure 1, we applied capping to both frequency and severity components:
  • Claim Frequency: Observations with more than 4 claims were capped at 4. Although the raw data contained counts as high as 16, these represented a negligible fraction of the portfolio (<0.01%) but could disproportionately influence the maximum likelihood estimation.
  • Claim Severity: Individual claim amounts were winsorized at 100,000 EUR. This prevents catastrophic “black swan” events—such as the observed maximum loss of 4.07 million EUR—from distorting the Gamma GLM parameters and the subsequent pure premium calculations.
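The two capping rules above amount to simple elementwise clipping; a minimal sketch, with the thresholds taken directly from the text:

```python
FREQ_CAP = 4           # maximum claim count retained per policy (Section 2.3)
SEV_CAP = 100_000.0    # winsorization threshold for individual claims, in EUR

def cap_claim_count(n_claims: int) -> int:
    """Cap the per-policy claim count at FREQ_CAP."""
    return min(n_claims, FREQ_CAP)

def winsorize_severity(amount: float) -> float:
    """Winsorize an individual claim amount at SEV_CAP."""
    return min(amount, SEV_CAP)
```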

2.4. Final Modeling Dataset Profile

The resulting dataset provides a stabilized foundation for estimating expected losses. Table 3 summarizes the response variables that will be utilized in Section 3. The disparity between the mean and median values across all metrics reinforces the inherent skewness that persists even after cleaning, necessitating the specialized modeling framework proposed in this study.

3. Methodology

This study investigates alternative approaches to modeling insurance pure premium by comparing the classical frequency–severity decomposition with direct modeling of aggregate claim amounts. Let $Y_i$ denote the total claim amount for policy $i$, $e_i > 0$ the exposure, and $x_i$ the associated covariates. The target of interest is the pure premium,
$$\pi_i = \frac{\mathbb{E}[Y_i \mid x_i]}{e_i}.$$

3.1. Frequency–Severity Decomposition Framework

The classical actuarial framework expresses expected aggregate loss as
$$\mathbb{E}[Y_i \mid x_i] = e_i \, \lambda_i \, m_i,$$
where $\lambda_i$ denotes the expected claim frequency per unit exposure and $m_i$ denotes the expected claim severity conditional on at least one claim. Dividing by exposure, the pure premium is given by
$$\pi_i = \lambda_i m_i.$$

3.2. Claim Frequency Models

We evaluate multiple models for claim frequency and select the best-performing models for use in the decomposition framework.

3.2.1. Poisson Generalized Linear Model

The number of claims $N_i$ is assumed to follow a Poisson distribution:
$$N_i \sim \mathrm{Poisson}(\mu_i),$$
with mean $\mu_i = e_i \lambda_i$. Using a log-link function:
$$\log(\mu_i) = \log(e_i) + x_i^{\top} \beta,$$
where $\log(e_i)$ is included as an offset.
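The role of the offset can be made concrete: under the log-link, exposure enters multiplicatively, so a policy's expected count scales linearly with its time at risk while the per-unit-exposure rate $\lambda_i$ stays fixed. A minimal pure-Python sketch (the coefficient values are purely illustrative, not fitted):

```python
import math

def poisson_mean(exposure, x, beta0, beta):
    """Expected claim count under the log-link with an exposure offset:
    log(mu) = log(exposure) + beta0 + x . beta,
    i.e. mu = exposure * exp(beta0 + x . beta)."""
    eta = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return exposure * math.exp(eta)

def poisson_loglik(n, mu):
    """Poisson log-likelihood contribution for an observed count n."""
    return n * math.log(mu) - mu - math.lgamma(n + 1)
```

Doubling the exposure doubles $\mu_i$ exactly, which is the behavior the offset is meant to enforce.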

3.2.2. Negative Binomial Model

To account for overdispersion, the Negative Binomial model is considered:
$$\mathrm{Var}(N_i) = \mu_i + \alpha \mu_i^2,$$
with the same log-link structure as the Poisson model.

3.2.3. Poisson GLM with Natural Cubic Splines

To capture nonlinear relationships, selected predictors are modeled using natural cubic splines:
$$\log(\mu_i) = \log(e_i) + \beta_0 + \sum_{j=1}^{4} f_j(z_{ij}) + \sum_{k=1}^{p} \gamma_k x_{ik}.$$

3.2.4. Generalized Additive Model (GAM)

The GAM extends this approach by estimating smooth functions:
$$\log(\mu_i) = \log(e_i) + \beta_0 + \sum_{j=1}^{4} s_j(z_{ij}) + \sum_{k=1}^{p} \gamma_k x_{ik}.$$

3.2.5. XGBoost for Frequency

To capture complex nonlinearities and interactions, we apply XGBoost:
$$\hat{\mu}_i = \sum_{t=1}^{T} f_t(x_i), \qquad f_t \in \mathcal{F}.$$

3.2.6. Model Selection for Frequency

Based on predictive performance and risk segmentation metrics, the two best-performing models—Poisson GLM with splines and XGBoost—are selected for use in pure premium estimation.

3.3. Claim Severity Models

We consider two alternative approaches for modeling claim severity.

3.3.1. Gamma GLM

Claim severity $Z_i$ is modeled using a Gamma distribution with log-link:
$$\log(\theta_i) = x_i^{\top} \phi.$$

3.3.2. XGBoost for Severity

To capture nonlinear relationships, we apply XGBoost to model the logarithm of severity:
$$\log(Z_i) = \sum_{k=1}^{K} f_k(x_i) + \varepsilon_i.$$
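Because the model targets log severity, predictions must be mapped back to the euro scale. A standard caveat, not stated in the model specification above but worth noting: the naive back-transform $\exp(\cdot)$ targets the conditional median rather than the mean under log-normal errors, so a variance correction is sometimes applied. A hedged sketch, where `sigma2` is an assumed residual-variance estimate:

```python
import math

def backtransform_log_prediction(log_pred, sigma2=0.0):
    """Map a log-scale severity prediction back to the euro scale.
    With sigma2 = 0 this is the naive exp back-transform; a positive
    sigma2 applies the log-normal mean correction exp(pred + sigma2/2)."""
    return math.exp(log_pred + sigma2 / 2.0)
```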

3.4. Pure Premium Estimation

The estimated pure premium under the decomposition approach is:
$$\hat{\pi}_i^{\mathrm{decomp}} = \hat{\lambda}_i \hat{m}_i.$$
We consider four model combinations:
  • Spline–Gamma: Poisson spline frequency + Gamma severity,
  • Spline–XGBoost: Poisson spline frequency + XGBoost severity,
  • XGBoost–Gamma: XGBoost frequency + Gamma severity,
  • XGBoost–XGBoost: XGBoost frequency + XGBoost severity.
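Whichever pair of component models is chosen, the combination step itself is identical: multiply the predicted per-unit-exposure frequency by the predicted conditional severity, policy by policy. A minimal sketch; the model objects are assumed to expose a `predict` method, and all names are illustrative:

```python
def pure_premium_decomposition(freq_model, sev_model, X):
    """Combine frequency and severity predictions into a pure premium:
    pi_hat = lambda_hat * m_hat, per policy (Section 3.4)."""
    lam = freq_model.predict(X)  # expected claims per unit exposure
    m = sev_model.predict(X)     # expected claim size given a claim
    return [l * s for l, s in zip(lam, m)]
```

The four combinations listed above differ only in which objects are passed in as `freq_model` and `sev_model`.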

3.5. Direct Modeling of Aggregate Loss

To provide a benchmark, we consider models that directly predict aggregate claim amounts.

3.5.1. Tweedie Model

The Tweedie distribution satisfies:
$$\mathrm{Var}(Y_i \mid x_i) = \phi \mu_i^{p}, \qquad 1 < p < 2,$$
corresponding to a compound Poisson–Gamma model. The mean is modeled as:
$$\log(\mu_i) = \log(e_i) + x_i^{\top} \beta.$$
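The power parameter has a direct compound Poisson–Gamma reading. Writing the aggregate loss as $Y = \sum_{j=1}^{N} Z_j$ with $N \sim \mathrm{Poisson}(\lambda)$ and $Z_j \sim \mathrm{Gamma}(\alpha, \theta)$ (shape $\alpha$, scale $\theta$), the standard identification is:

```latex
p = \frac{\alpha + 2}{\alpha + 1}, \qquad \mu = \lambda \, \alpha \theta,
```

so $1 < p < 2$ for any Gamma shape $\alpha > 0$, with $p \to 2$ as the severity distribution becomes heavier-tailed ($\alpha \to 0$) and $p \to 1$ in the Poisson-like limit ($\alpha \to \infty$). For instance, an estimate of $p = 1.52$ corresponds to a Gamma shape of roughly $\alpha \approx 0.92$.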

3.5.2. XGBoost for Total Loss

We also apply XGBoost to directly model total claim amounts:
$$\log(1 + Y_i) = f(x_i).$$

3.6. Model Evaluation

Model performance is evaluated using both accuracy and risk segmentation metrics.
Prediction accuracy is assessed using Mean Squared Error (MSE), Mean Absolute Error (MAE), and Pearson correlation.
Risk segmentation is evaluated using decile-based analysis, where policies are ranked by predicted values and grouped into ten deciles. A well-performing model should exhibit strong concentration of losses in the highest-risk deciles.
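The decile analysis can be sketched as follows: policies are sorted by predicted value, split into ten equal groups, and each group's mean observed loss is compared with the portfolio mean. A minimal pure-Python illustration; ties and non-divisible sample sizes are handled naively:

```python
def decile_lift(predicted, observed):
    """Rank policies by prediction, split into 10 deciles, and return
    each decile's mean observed loss relative to the overall mean.
    lifts[-1] is the top-decile lift (values > 1 indicate that the
    model concentrates losses in its highest-risk segment)."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    overall = sum(observed) / len(observed)
    n = len(order)
    lifts = []
    for d in range(10):
        idx = order[d * n // 10 : (d + 1) * n // 10]
        group_mean = sum(observed[i] for i in idx) / len(idx)
        lifts.append(group_mean / overall)
    return lifts
```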

4. Results

This section presents the empirical results for the claim frequency, claim severity, and pure premium models. We begin by evaluating the performance of alternative models for claim frequency, followed by claim severity, and then assess the combined frequency–severity decomposition and direct modeling approaches.

4.1. Claim Frequency Model Performance

Table 4 summarizes the performance of the claim frequency models.
Among all models, XGBoost achieves the best performance across all evaluation metrics, including the lowest MSE and MAE, as well as the highest correlation and lift. In particular, the lift in the top decile indicates a strong ability to identify high-risk policies, highlighting the effectiveness of machine learning in risk segmentation.
Among the parametric models, the Poisson model with spline terms and the GAM provide modest improvements over the standard Poisson GLM, suggesting that incorporating nonlinear effects enhances predictive performance. In contrast, the Negative Binomial model performs similarly to the Poisson GLM, indicating that overdispersion does not substantially affect predictive accuracy in this dataset.
Overall, these results demonstrate that machine learning models provide superior predictive performance and improved risk differentiation compared to classical parametric approaches.
Claim Capture Curves
Figure 2 presents the cumulative claim capture curves for the competing models. The XGBoost model consistently captures a larger proportion of claims in the highest-risk segments, followed by the Poisson model with spline terms and the GAM. The standard Poisson and Negative Binomial models exhibit weaker concentration of claims.
These results are consistent with the lift statistics and confirm that models incorporating nonlinear effects improve risk segmentation, with machine learning providing additional gains.
Decile Lift Analysis
Figure 3 presents the decile lift curves for the competing models. The XGBoost model achieves the highest lift in the top decile, followed by the Poisson model with spline terms and the GAM. Differences across the middle deciles are relatively small, reflecting the low overall claim frequency and limited variability in moderate-risk groups.
Interpretation of Model Components
Variable importance results from the XGBoost model (Figure 4) indicate that the bonus–malus score is the dominant predictor of claim frequency, followed by driver age, population density, and selected vehicle characteristics. These findings are consistent with actuarial intuition and highlight the importance of past claims experience and demographic factors in risk assessment.
The smooth functions estimated by the GAM (Figure 5) reveal nonlinear effects for several continuous predictors. In particular, the bonus–malus variable exhibits a strong nonlinear relationship with claim frequency, while vehicle age and population density display more moderate nonlinear patterns.
Model Selection
The results indicate that XGBoost provides the strongest overall predictive performance, achieving the lowest error measures, the highest correlation, and the greatest lift in the highest-risk decile. In addition, the Poisson model with spline terms delivers competitive performance while retaining the interpretability and transparency of the generalized linear modeling framework widely used in actuarial practice.
To enable a direct comparison between a traditional actuarial approach and a modern machine learning method in subsequent analyses, both models are retained. The Poisson spline model serves as the representative GLM-based specification, while XGBoost represents the machine learning approach. This dual-model strategy allows us to evaluate not only predictive accuracy but also the trade-offs between interpretability and performance in insurance pricing applications. These findings justify the selection of XGBoost and spline-based GLMs for subsequent pure premium modeling.

4.2. Claim Severity Model Performance

Table 5 summarizes the performance of the claim severity models.
The XGBoost model substantially outperforms the Gamma GLM across all evaluation metrics. In particular, it achieves a markedly lower MSE and MAE, along with a higher correlation with observed claim severity. These results indicate that machine learning methods are more effective in capturing the complex nonlinear relationships and interactions inherent in claim size data.
The Gamma GLM, while appropriate for modeling positive and right-skewed outcomes, assumes a log-linear relationship between covariates and severity. This assumption limits its ability to capture nonlinear effects and interactions, leading to inferior predictive performance.
Despite the improvement provided by XGBoost, the correlation remains relatively low, reflecting the inherent variability and stochastic nature of claim severity. This suggests that, although machine learning improves predictive accuracy, a substantial portion of severity variation remains unexplained.
These findings help explain the improved performance of decomposition models that incorporate XGBoost for severity, particularly in terms of prediction accuracy.
Figure 6 illustrates the relationship between actual and predicted claim severity. The Gamma GLM produces highly concentrated predictions, indicating strong shrinkage toward the mean and limited ability to capture variability in claim sizes. In contrast, the XGBoost model exhibits greater dispersion and improved alignment with observed values, highlighting its ability to capture nonlinear relationships and complex interactions.
However, both models display considerable scatter, underscoring the inherently stochastic nature of claim severity and the difficulty of accurately predicting individual claim amounts. Observations with zero or near-zero values were excluded from the log-scale visualization to avoid numerical issues associated with logarithmic transformation.

4.3. Results for Frequency–Severity Decomposition for Pure Premium

Table 6 presents the performance of the frequency–severity decomposition models using both error-based and ranking-based metrics.
Among the models considered, the XGBoost–XGBoost configuration achieves the best overall prediction accuracy, with the lowest MSE and MAE, as well as the highest correlation. This indicates that incorporating machine learning methods for both frequency and severity improves pure premium estimation.
In contrast, the ranking-based results reveal a different pattern. The XGBoost–Gamma model achieves the highest lift, indicating superior performance in identifying high-risk policies. This suggests that improvements in modeling claim frequency have a greater impact on risk segmentation than enhancements in severity modeling.
The difference between error-based and lift-based metrics highlights an important trade-off. While the XGBoost–XGBoost model minimizes prediction error, the XGBoost–Gamma model is more effective in concentrating losses within the highest-risk segment. This indicates that different model configurations may be preferred depending on the objective, such as pricing accuracy versus risk classification.
The classical Spline–Gamma model yields the weakest performance across all metrics, although it remains a useful benchmark due to its interpretability.
Overall, these results demonstrate that machine learning improves predictive performance within the decomposition framework, but its impact differs between accuracy and risk segmentation.
Risk Segmentation Performance
Figure 7 illustrates the decile lift curves for the decomposition models. All models exhibit a strong concentration of losses in the highest-risk decile, indicating effective identification of high-risk policies.
Among the models, the XGBoost–Gamma configuration achieves the highest lift, confirming its superior performance in risk segmentation. In contrast, although the XGBoost–XGBoost model provides the best overall prediction accuracy, its lift is lower, reinforcing the trade-off between prediction accuracy and risk ranking performance.
These findings suggest that claim frequency plays a dominant role in risk differentiation, while claim severity contributes more to overall prediction accuracy. Consequently, improvements in frequency modeling have a greater impact on identifying high-risk policies than enhancements in severity modeling.
The non-monotonic behavior observed in lower-risk deciles reflects the inherent variability of insurance losses, particularly for policies with low exposure. These results provide empirical evidence supporting the continued use of frequency–severity decomposition, particularly when combined with machine learning methods.

4.4. Comparison Between Decomposition and Direct Modeling for Total Loss

To evaluate the effectiveness of direct modeling approaches, we compare the best-performing decomposition model (XGBoost–XGBoost) with direct XGBoost and Tweedie models for total claim amount prediction.
Table 7 summarizes the performance of decomposition and direct modeling approaches.
The decomposition-based XGBoost–XGBoost model consistently outperforms both direct approaches across all evaluation metrics. It achieves the lowest MSE and MAE, as well as the highest correlation, demonstrating the effectiveness of modeling claim frequency and severity separately.
The direct XGBoost model provides competitive performance, indicating that flexible machine learning methods can capture nonlinear relationships in aggregate losses. However, it does not match the accuracy of the decomposition framework, suggesting that modeling aggregate losses directly may overlook important structural information captured by the frequency–severity approach.
The Tweedie model, despite its theoretical foundation as a compound Poisson–Gamma model, exhibits weaker performance across all metrics. This indicates that, although the data exhibit a compound structure, modeling aggregate losses within a single unified framework may be less effective than decomposing the problem into separate components.
The estimated Tweedie parameter ($\hat{p} = 1.52$) confirms that the data follow a compound Poisson–Gamma structure. However, the inferior predictive performance of the Tweedie model suggests that imposing a single functional relationship between covariates and aggregate loss may be overly restrictive. In contrast, the decomposition framework allows for distinct covariate effects on frequency and severity, resulting in greater modeling flexibility and improved predictive accuracy.
Overall, these results highlight the importance of structural decomposition in actuarial modeling. While machine learning enhances predictive performance within individual components, explicitly separating frequency and severity yields superior results. These findings provide empirical support for the continued use of the frequency–severity decomposition framework in modern insurance pricing.

5. Conclusions and Discussion

This study examines the integration of classical actuarial models and modern machine learning techniques for insurance pricing, with particular emphasis on comparing the traditional frequency–severity decomposition framework with direct modeling approaches. The results provide important insights into predictive performance and practical implications for risk management.
The empirical findings show that machine learning methods, particularly XGBoost, substantially improve predictive accuracy for both claim frequency and claim severity. The improvement is especially evident for severity modeling, where the Gamma GLM tends to shrink predictions toward the mean and fails to capture the variability of claim sizes. In contrast, XGBoost effectively models nonlinear relationships and interactions, leading to lower prediction errors. Despite these improvements, correlation values remain relatively low, reflecting the inherently stochastic nature of claim severity.
Within the decomposition framework, the XGBoost–XGBoost model achieves the highest overall predictive accuracy, confirming that machine learning enhances pure premium estimation when applied to both components. However, improvements in error-based metrics are moderate, which highlights the high variability of insurance losses, particularly for policies with low exposure.
An important contribution of this study is the identification of a trade-off between prediction accuracy and risk segmentation. Although the XGBoost–XGBoost model minimizes prediction error, the XGBoost–Gamma model achieves the highest lift and provides superior identification of high-risk policies. This result indicates that claim frequency plays a dominant role in risk differentiation, while claim severity contributes more to overall prediction accuracy. Consequently, improvements in frequency modeling have a greater impact on ranking policies by risk level than enhancements in severity modeling.
The comparison between decomposition and direct modeling approaches provides further insight. While direct XGBoost and Tweedie models deliver competitive performance, they do not outperform the decomposition-based framework in this study. This finding contrasts with the results of [17], which suggest that direct modeling of aggregate losses can yield lower prediction error.
This discrepancy highlights an important distinction between general predictive modeling and actuarial applications. Direct modeling may perform well in settings where minimizing prediction error is the sole objective. However, in insurance pricing, the frequency–severity decomposition reflects the underlying structure of claim processes and enables separate modeling of distinct risk components. This structural advantage allows for greater flexibility in capturing heterogeneous risk factors and improves risk segmentation.
In addition, direct modeling imposes a single functional relationship between covariates and aggregate loss, which may be overly restrictive when frequency and severity are influenced by different drivers. In contrast, the decomposition framework accommodates distinct covariate effects for each component, resulting in improved predictive performance in practice.
From a practical perspective, these findings have important implications for insurance pricing and risk management. If the objective is to minimize prediction error, models that incorporate machine learning for both frequency and severity are preferred. If the goal is to identify high-risk policies for underwriting or portfolio management, models that emphasize frequency modeling are more effective.
The results are also consistent with prior studies comparing gradient boosting methods with classical GLMs within the frequency–severity framework [19]. As in previous work, machine learning significantly improves claim frequency prediction. However, while earlier studies report mixed results for severity modeling, the present analysis shows that XGBoost can outperform the Gamma GLM when appropriately specified and tuned. This suggests that the relative performance of machine learning for severity is context-dependent and may improve with richer data and more flexible model configurations.
Overall, this study demonstrates that machine learning enhances predictive performance, but its effectiveness is maximized when combined with the classical frequency–severity decomposition framework. The results indicate that theoretical advantages of direct modeling do not necessarily translate into superior performance in real-world actuarial applications. These findings provide empirical support for decomposition-based pricing approaches and offer a foundation for future research on hybrid models that balance predictive accuracy and risk segmentation in complex insurance environments.

Funding

This research received no external funding.

Data Availability Statement

The data used in this study are part of the CASdatasets R package. Specifically, the freMTPL2freq and freMTPL2sev datasets were used. These data are publicly available for research purposes and can be accessed via the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=CASdatasets or the Zenodo repository (DOI: 10.57745/P0KHAG).

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AIC: Akaike Information Criterion
GAM: Generalized Additive Model
GLM: Generalized Linear Model
MAE: Mean Absolute Error
ML: Machine Learning
MSE: Mean Squared Error
NB: Negative Binomial
REML: Restricted Maximum Likelihood
XGBoost: Extreme Gradient Boosting

References

  1. Dutang, C.; Charpentier, A. CASdatasets: Insurance datasets. R package version 1.2-0 2024. [CrossRef]
  2. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016; pp. 785–794. [Google Scholar]
  3. de Jong, P.; Heller, G. Z. Generalized Linear Models for Insurance Data; Cambridge University Press, 2008. [Google Scholar]
  4. Frees, E. W. Regression Modeling with Actuarial and Financial Applications; Cambridge University Press, 2014. [Google Scholar]
  5. Friedman, J. H. Greedy function approximation: A gradient boosting machine. Annals of Statistics 2001, 29(5), 1189–1232. [Google Scholar] [CrossRef]
  6. Henckaerts, R.; Côté, M.-P.; Antonio, K.; Verbelen, R. Boosting insights in insurance tariff plans with tree-based machine learning methods. North American Actuarial Journal 2021, 25(2), 255–285. [Google Scholar] [CrossRef]
  7. Klugman, S. A.; Panjer, H. H.; Willmot, G. E. Loss Models: From Data to Decisions; Wiley, 2012. [Google Scholar]
  8. McCullagh, P.; Nelder, J. A. Generalized Linear Models, 2nd ed.; Chapman & Hall, 1989. [Google Scholar]
  9. Noll, A.; Salzmann, R.; Wüthrich, M. V. Case study: French motor third-party liability claims. SSRN Electronic Journal 2018. [Google Scholar] [CrossRef]
  10. Ohlsson, E.; Johansson, B. Non-Life Insurance Pricing with Generalized Linear Models; Springer, 2010. [Google Scholar]
  11. Richman, R.; Wüthrich, M. V. Neural network embedding of the over-dispersed Poisson model. Scandinavian Actuarial Journal 2019, 2019(6), 451–470. [Google Scholar]
  12. Wüthrich, M. V. Bias regularization in neural network models for general insurance pricing. European Actuarial Journal 2020, 10(1), 179–202. [Google Scholar] [CrossRef]
  13. Denuit, M.; Marechal, X.; Pitrebois, S.; Walhin, J.-F. Actuarial Modelling of Claim Counts: Risk Classification, Credibility and Bonus-Malus Systems; Wiley, 2007. [Google Scholar]
  14. Dickson, D. C. M.; Hardy, M. R.; Waters, H. R. Actuarial Mathematics for Life Contingent Risks; Cambridge University Press, 2009. [Google Scholar]
  15. Jørgensen, B. The Theory of Dispersion Models; Chapman & Hall, 1997. [Google Scholar]
  16. Smyth, G. K.; Jørgensen, B. Fitting Tweedie’s compound Poisson model to insurance claims data. ASTIN Bulletin 2002, 32(1), 143–157. [Google Scholar] [CrossRef]
  17. Moonoo, D.; Hosein, P. Predicting automobile insurance claim rate versus through severity and frequency predictions. In Proceedings of the 2024 IEEE International Conference on Technology Management, Operations and Decisions (ICTMOD), Sharjah, United Arab Emirates, 2024; pp. 1–6. [Google Scholar] [CrossRef]
  18. Bekkaye, C.; Oukhoya, H.; Zari, T.; Guerbaz, R.; El Bouanani, H. Advanced strategies for predicting and managing auto insurance claims using machine learning models. Statistics, Optimization & Information Computing 2025, 14(3), 1440–1457. [Google Scholar] [CrossRef]
  19. Clemente, C.; Guerreiro, G. R.; Bravo, J. M. Modelling motor insurance claim frequency and severity using gradient boosting. Risks 2023, 11(9), 163. [Google Scholar] [CrossRef]
Figure 1. Log-scaled distributions of claim frequency (left), claim severity (center), and pure premium (right) based on the raw dataset.
Figure 2. Comparison of Claim Capture Curves Across Frequency Models.
Figure 3. Decile Lift Chart for Frequency Models.
Figure 4. Variable Importance from the XGBoost Frequency Model.
Figure 5. Estimated Smooth Effects from the GAM Frequency Model.
Figure 6. Comparison of actual and predicted claim severity for the Gamma GLM and XGBoost models on a log scale. Observations with zero values are excluded. Color represents the magnitude of observed claim severity.
Figure 7. Decile lift curves for pure premium predictions across different model configurations. Decile 1 corresponds to the highest predicted risk.
Table 1. Variables in the freMTPL2freq dataset.
Variable Type Description
IDpol Integer Unique policy identifier.
ClaimNb Integer Number of claims during the exposure period.
Exposure Numeric Fraction of the year the policy was in force.
Area Categorical Geographic area classification.
VehPower Categorical Vehicle power category.
VehAge Integer Age of the vehicle (years).
DrivAge Integer Age of the driver (years).
BonusMalus Numeric Bonus–malus coefficient.
VehBrand Categorical Vehicle manufacturer or brand.
VehGas Categorical Fuel type (gasoline or diesel).
Density Numeric Population density of the insured area.
Region Categorical Administrative region of residence.
Table 2. Comparison of Dataset Statistics Before and After Cleaning.

Exposure Distribution
Statistic   Before     After
Min         0.0027     0.1000
Mean        0.5287     0.6318
Max         2.0100     2.0100
N           677,991    556,439

Claim Severity Quantiles (EUR)
Quantile       Raw         Capped
50% (Median)   1,172       1,172
95%            4,765       4,765
99%            16,451      16,451
100% (Max)     4,075,400   100,000
Table 3. Summary Statistics of the Final Modeling Dataset.
Variable Mean Median Min Max
ClaimNb (Capped) 0.045 0 0 4
Exposure 0.632 0.630 0.10 2.01
Total Claim Amount (EUR) 82.09 0 0 115,600
Average Severity (EUR) 76.46 0 0 100,000
Table 4. Performance Comparison of Claim Frequency Models.
Model MSE MAE Correlation Lift (Top 10%)
Poisson GLM 0.04654 0.08432 0.136 2.48
Poisson + Splines 0.04643 0.08415 0.144 2.76
Negative Binomial 0.04656 0.08445 0.136 2.48
GAM 0.04637 0.08418 0.147 2.62
XGBoost 0.04573 0.08349 0.185 3.06
Table 5. Performance of Claim Severity Models.
Model MSE MAE Correlation
Gamma GLM 29,341,925 1457.49 0.0161
XGBoost 3,842,868 1409.51 0.0461
Table 6. Performance of Frequency–Severity Decomposition Models.
Model MSE MAE Correlation Lift (Top 10%)
Spline–Gamma 15,553,426 244.57 0.0066 1.509
XGBoost–Gamma 15,548,539 243.19 0.0126 2.039
Spline–XGBoost 15,550,927 242.83 0.0112 1.452
XGBoost–XGBoost 15,547,197 241.64 0.0165 1.663
Table 7. Comparison of Decomposition and Direct Models for Total Claim Amount.
Model MSE MAE Correlation
XGBoost–XGBoost (Decomposition) 1,641,356 138.99 0.0656
XGBoost (Direct) 1,644,657 150.43 0.0460
Tweedie GLM (Direct) 1,649,283 162.65 0.0329
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.