Is the validity of logistic regression models developed with a medico-administrative database inferior to models developed from clinical databases?

Alain Bernard; Jonathan Cottenet; Pierre Benoit Pages; Catherine Quantin

doi:10.20944/preprints202310.0097.v1

Submitted: 02 October 2023

Posted: 03 October 2023


Abstract

In medico-administrative databases, certain prognostic factors cannot be taken into account. The main objective was to estimate the performance of two models based on two databases: the Epithor clinical database and the medico-administrative database. For each of the two databases, we randomly sampled a development dataset with 70% of the data and a validation dataset with 30%. Model performance was assessed by the Brier score, the area under the receiver operating characteristic curve (AUC ROC) and the calibration of the model. For the Epithor and medico-administrative databases, the development dataset included 10,516 patients (with 227 (2.16%) and 283 (2.7%) deaths, respectively) and the validation dataset included 4,507 patients (with 93 (2%) and 119 (2.64%) deaths, respectively). Fifteen predictors were selected in the models (including FEV1, body mass index, ASA score and TNM stage for Epithor). The Brier score values were similar in the models of the two databases. For validation data, the AUC ROC was 0.73 [0.68-0.78] for the Epithor database and 0.8 [0.76-0.84] for the medico-administrative database. The slope of the calibration plot was less than 1 for the two databases. This work shows the good performance of a model developed from a medico-administrative database, despite the absence of clinical variables used in practice by surgeons, such as FEV1, ASA score or TNM stage.

Keywords: Model performance; medico-administrative database; clinical database; Brier score; area under the receiver operating characteristic; discrimination; calibration

Subject: Medicine and Pharmacology - Epidemiology and Infectious Diseases

1. Introduction

Lung cancer, one of the deadliest cancers worldwide despite the therapeutic progress made in recent years [1,2], is the 3rd most common cancer in France [3] and remains the leading cause of cancer mortality in men and the second in women, responsible for 22,761 and 10,356 deaths, respectively. Age-standardised all-stage net survival is 17% at 5 years (16% in men, 20% in women) and 10% at 10 years (9% in men, 13% in women). For surgically treatable early stages, 5-year survival ranges from 92% to 77% for stage IA and is 68% for stage IB, 60% for stage IIA and 53% for stage IIB [4].

For early-stage bronchial cancers (stages IA and IB), lung resection surgery associated with mediastinal lymph node dissection is the first-line treatment [5,6]. Thus, for patients with normal respiratory capacity, the standard treatment is lung lobectomy combined with mediastinal lymph node dissection. In some cases, for patients with a tumour ≤ 2 cm in diameter, in the absence of scissural and/or hilar lymph node metastasis, in favourable topographical situations, or in particular clinical situations (high operative risk (expected mortality for a lobectomy 5%), synchronous or metachronous multifocal tumours), an anatomical segmentectomy can be proposed [5,6]. In recent years, minimally invasive thoracic surgery has developed considerably, mainly in Western countries, with the successive arrival of video-assisted thoracoscopic surgery (VATS) and, more recently, robot-assisted surgery (RATS). In fact, in 2007 and 2008, senior surgeons were less inclined to perform VATS than younger surgeons. Since then, the minimally invasive approach has gained popularity, for example in the USA and Spain, thanks to its efficacy and safety. The use of minimally invasive approaches is now recommended for early-stage localized lung cancer, as shown by the ESMO Clinical Practice Guidelines [7], depending on the surgeon's expertise, provided that he or she is able to perform a complete removal of the tumour [5,6].

To assess quality of care [8,9], several publications are based on two quality indicators (e.g. 30-day mortality and failure-to-rescue). A national administrative database is an important tool for assessing the quality of care, and provides data relative to all patients and all care centers nationwide. For example, the French national administrative database for hospital care (PMSI) provides a huge amount of epidemiological information concerning hospitalized French patients [10,11,12]: it includes a large national cohort of patients operated on for lung cancer (about 8 500 patients/year), with exhaustive recruitment (in all hospitals: about 300 in 2010).

However, it is often said that certain prognostic factors cannot be taken into account in medico-administrative databases [13,14,15]. For example, for lung cancer surgery, variables such as preoperative forced expiratory volume (FEV1), American Society of Anesthesiologists score (ASA score), body mass index (BMI) and TNM stage are not considered in studies based on these data [13,14,15].

On the other hand, these variables are present in clinical databases such as the STS database or the French Epithor database (national database of thoracic surgery in lung cancer), collected thanks to the collaboration of 112 thoracic surgery centers [16,17]. Even though the demographic characteristics, risk factors and outcomes in our previous study population [13] were very similar to those in previous French studies from the Epithor database [6], the absence of prognostic factors may call into question the validity of the various models based on a medico-administrative database.

To address these concerns, we developed two logistic regression models to analyse 30-day mortality, one with the French national medico-administrative database and the other with the Epithor clinical database.

The main objective of this work was to estimate the performance of the two models based on the two different databases (medico-administrative database and the Epithor clinical database) using several statistical criteria (such as calibration and discrimination). The second objective was to examine the results obtained for the prognostic factors that can be identified in the two databases.

2. Materials and Methods

2.1. Medico-Administrative Database

For this retrospective cohort study, all data for patients who underwent pulmonary resection for lung cancer in France were collected from January 2005 to December 2020 from the national administrative database. This database, called PMSI for “Programme de Médicalisation des Systèmes d’Information”, was inspired by the US Medicare system. The reliability and validity of PMSI data have already been assessed [10,11,12]. Routinely collected medical information includes the principal diagnosis, secondary diagnoses and procedures performed. Diagnoses identified during the hospital stay are coded according to the International Classification of Diseases, tenth revision (ICD-10) [18,19]. All patients undergoing lung cancer surgery in France were included. We thus selected patients for whom a diagnosis of primary lung cancer was coded as the principal discharge diagnosis (all codes C34), associated with a procedure of lung cancer surgery (thoracotomy, video-assisted thoracic surgery (VATS) or robot-assisted surgery) during the same hospital stay. As we only included patients who had undergone surgery, no patients with stage IV disease were included. Procedures are coded according to the Classification Commune des Actes Médicaux (CCAM) [20]. For all patients, lung cancer was proven by pathology analyses according to the 2004 World Health Organization classification of lung cancer [18]. Surgery-related variables included the surgical approach (thoracotomy, VATS or robot-assisted surgery), the type of resection (limited resection, lobectomy, bi-lobectomy and pneumonectomy), bronchoplasty, and the extent of the pulmonary resection (to the chest wall, the left atrium, the carina, the diaphragm, and the superior vena cava).

Patient Characteristics

Baseline demographics included age and gender. From the national administrative database, we included the following comorbidities: pulmonary disease (chronic bronchitis, emphysema), heart disease (coronary artery disease, cardiac arrhythmia, congestive heart failure, valvular heart disease, pulmonary artery hypertension, pulmonary embolism), peripheral vascular disease, liver disease, cerebrovascular events, neurological diseases (hemiplegia or paraplegia), renal disease, hematologic disease (leukemia, lymphoma), metabolic disease (including obesity), anemia, other therapies (preoperative chemotherapy including neoadjuvant therapies, steroids) and infectious disease. We also calculated a modified Charlson Comorbidity Index (CCI) as a marker of comorbidity [21].

Ethics

Patient consent was not required, and patient-identifying information was not used in the research as this national retrospective study was based on pseudonymized data. In fact, the French national administrative hospital database does not contain any patient-identifying data. The patient's identity is pseudonymized, making it possible to link data from the same patient without knowing his or her identity. The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the National Committee for data protection: declaration of conformity to the methodology of reference 05 obtained on 7/08/2018 under the number 2204633 v0.

2.2. Clinical Database Epithor

The database of the French Society of Thoracic and Cardiovascular Surgery, Epithor, was created in 2003 [22,23,24,25,26]. Currently, 112 centers use this database to store their data. Epithor underwent a significant transformation in 2016, and surgeons can now save patient data directly to a website called web Epithor. Several published articles have based their research on data extracted from the Epithor database [22,23,24,25,26]. As in these previous descriptions of the population, we included patients operated on for lung cancer from 1 January 2016 to 31 December 2018 and entered in the Epithor database.

Patient Characteristics

Baseline demographic data included sex, age, body mass index (BMI), performance status, ASA score, FEV, dyspnoea score and TNM stage. The number of comorbidities per patient was treated as a ranked variable grouped into 4 values (0, 1, 2, and ≥3) [23]. The following surgical details were also recorded: surgical approach (open thoracotomy and video-assisted thoracoscopy) and type of surgery (wedge, lobectomy, bilobectomy, pneumonectomy).

Ethics

Use of this database was approved by the National Commission for Data Protection (CNIL No 809833), and this study adhered to the tenets of the Declaration of Helsinki.

Outcome Measurements

To assess the quality of care, we chose one outcome indicator identified at the patient level: 30-day mortality.

Thirty-day mortality was defined as death in hospital (including transferred patients) within the first 30 days after the operation, or later during the same hospitalization.

Statistical Analysis

First, we sampled the medico-administrative database with the same number of patients as the Epithor database, over the same period.

Then, for each of the two databases, we randomly sampled a development data set with 70% of the data and a validation data set with 30%.
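The random 70%/30% split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the seed and the use of row indices are assumptions, and the cohort size of 15,023 is taken from the Results.

```python
import random

def split_development_validation(n_patients, dev_fraction=0.70, seed=2023):
    """Randomly split patient indices into development and validation sets."""
    rng = random.Random(seed)            # fixed seed (arbitrary) for a reproducible split
    indices = list(range(n_patients))
    rng.shuffle(indices)                 # random order of patients
    n_dev = round(dev_fraction * n_patients)
    return indices[:n_dev], indices[n_dev:]

# Cohort size reported in the Results for each database
dev_idx, val_idx = split_development_validation(15023)
print(len(dev_idx), len(val_idx))
```

With 15,023 patients, a 70% share corresponds to the 10,516 development patients and 4,507 validation patients reported in the Results.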

We used a bootstrap backward procedure to determine which of these factors were significantly associated with the outcome in the logistic regression models for the medico-administrative database and the Epithor clinical database. Using this approach, 1000 bootstrap samples were drawn from the original data. Risk factors selected in at least 500 (50%) of the replicates were included in the model [27].
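The inclusion-frequency rule can be illustrated with a short sketch. It is not the authors' implementation: the backward elimination step is replaced by a deliberately simple, hypothetical selector (`crude_selector`, based on a crude risk difference), the data are simulated, and all names are assumptions. Only the resampling and the 50% threshold follow the description above.

```python
import random
from collections import Counter

def bootstrap_inclusion(rows, select_fn, n_boot=1000, threshold=0.5, seed=1):
    """Keep predictors chosen by `select_fn` in at least `threshold` of bootstrap replicates."""
    rng = random.Random(seed)
    counts = Counter()
    n = len(rows)
    for _ in range(n_boot):
        # Draw n rows with replacement from the original data
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        counts.update(select_fn(sample))
    freq = {name: c / n_boot for name, c in counts.items()}
    kept = sorted(name for name, f in freq.items() if f >= threshold)
    return kept, freq

def crude_selector(sample, names=("x_strong", "x_noise"), min_diff=0.10):
    """Toy stand-in for backward elimination (assumption): keep a binary
    predictor when the absolute crude risk difference exceeds `min_diff`."""
    selected = []
    for name in names:
        exposed = [r["death"] for r in sample if r[name] == 1]
        unexposed = [r["death"] for r in sample if r[name] == 0]
        if exposed and unexposed:
            diff = sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)
            if abs(diff) > min_diff:
                selected.append(name)
    return selected

# Simulated data: x_strong is related to death, x_noise is not
data_rng = random.Random(0)
rows = []
for i in range(400):
    x_strong = i % 2
    risk = 0.50 if x_strong else 0.05
    rows.append({"death": int(data_rng.random() < risk),
                 "x_strong": x_strong,
                 "x_noise": int(data_rng.random() < 0.5)})

kept, freq = bootstrap_inclusion(rows, crude_selector, n_boot=200)
print(kept, freq)
```

In a real analysis, `select_fn` would run a backward logistic regression on each bootstrap sample and return the names of the retained variables.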

For continuous variables, we tested various extensions of the basic “linear predictor” model that can relax the linearity assumption, such as restricted cubic splines and fractional polynomials [28].
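As an illustration of how a restricted cubic spline relaxes linearity, the sketch below builds the usual truncated-power basis for one value of a continuous predictor. The knot locations are hypothetical (the text does not report them), and the function name is an assumption.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis (truncated power form) for one value x.

    Returns [x, s_1, ..., s_{k-2}] for k knots. The resulting curve is
    linear below the first knot and above the last knot.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("at least 3 knots are needed")

    def cube(u):
        # Truncated cubic term (u)+^3
        return max(u, 0.0) ** 3

    scale = (t[-1] - t[0]) ** 2      # optional rescaling to the scale of x
    d = t[-1] - t[-2]
    terms = [x]
    for j in range(k - 2):
        s = (cube(x - t[j])
             - cube(x - t[-2]) * (t[-1] - t[j]) / d
             + cube(x - t[-1]) * (t[-2] - t[j]) / d)
        terms.append(s / scale)
    return terms

# Hypothetical knots for a predictor such as FEV (percent of predicted)
knots = [50, 75, 90, 110]
print(rcs_basis(80, knots))
```

With four knots the basis has three columns (one linear and two nonlinear terms), each of which receives its own coefficient in the logistic model.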

Validation of Models

Performance of models was assessed by R2 value, Brier score, Brier Max and Brier scaled [29].
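A minimal sketch of the Brier-based measures is given below. It assumes the common definitions in which the maximum Brier score is that of a non-informative model predicting the overall event rate for every patient, and the scaled Brier score is one minus the ratio of the two; the function name and toy data are illustrative.

```python
def brier_metrics(y, p):
    """Brier score, maximum (non-informative) Brier score and scaled Brier score."""
    n = len(y)
    brier = sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / n
    event_rate = sum(y) / n
    # A non-informative model predicts the overall event rate for everyone
    brier_max = event_rate * (1 - event_rate)
    brier_scaled = 1 - brier / brier_max
    return brier, brier_max, brier_scaled

# Toy example: observed outcomes (1 = death) and predicted probabilities
y = [0, 0, 0, 1]
p = [0.1, 0.2, 0.1, 0.6]
brier, brier_max, brier_scaled = brier_metrics(y, p)
print(brier, brier_max, brier_scaled)
```

Under these definitions, the reference value depends on the event rate: for a mortality close to 2%, the maximum Brier score is about 0.02 × 0.98 ≈ 0.0196.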

The area under the receiver operating characteristic (ROC) curve, the concordance statistic and the discrimination slope were used to measure the discriminatory ability of the model [29].
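The two discrimination measures can be sketched as follows, assuming the usual definitions: the concordance statistic (equal to the area under the ROC curve for a binary outcome) is the proportion of death/survivor pairs in which the deceased patient has the higher predicted risk, and the discrimination slope is the difference in mean predicted risk between the two groups. Names and data are illustrative.

```python
def auc_concordance(y, p):
    """Concordance statistic: share of event/non-event pairs ranked correctly
    (ties count one half)."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    score = 0.0
    for pe in events:
        for pn in nonevents:
            if pe > pn:
                score += 1.0
            elif pe == pn:
                score += 0.5
    return score / (len(events) * len(nonevents))

def discrimination_slope(y, p):
    """Mean predicted probability in events minus that in non-events."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    return sum(events) / len(events) - sum(nonevents) / len(nonevents)

# Toy example: observed outcomes (1 = death) and predicted probabilities
y = [0, 0, 1, 0, 1, 1]
p = [0.1, 0.3, 0.2, 0.4, 0.7, 0.9]
auc = auc_concordance(y, p)
slope = discrimination_slope(y, p)
print(auc, slope)
```

The pairwise loop is quadratic in the number of patients, which remains manageable for validation sets of a few thousand patients with around a hundred deaths.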

The calibration of the model was estimated by the relationship between the predicted probability and the observed outcome in that sample. Calibration by plotting predicted against observed probability can estimate intercepts and slopes of curves to quantify overfitting. Well-calibrated models have a slope of 1, whereas models that provide overly extreme predictions have a slope of less than 1: low predicted probabilities are too low, and high predicted probabilities are too high.
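The text does not state how the intercept and slope were computed. A common formulation, given here as an assumption, regresses the observed outcome $Y_i$ on the logit of the predicted probability $\hat{p}_i$ in the validation sample:

```latex
\operatorname{logit} P(Y_i = 1) = \alpha + \beta \, \operatorname{logit}(\hat{p}_i),
\qquad
\operatorname{logit}(p) = \ln\frac{p}{1 - p}
```

Here $\beta$ is the calibration slope and $\alpha$ the calibration intercept. Perfect calibration corresponds to $\alpha = 0$ and $\beta = 1$, while $\beta < 1$ indicates predictions that are too extreme, as described above.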

The calibration of the model was assessed with the Hosmer–Lemeshow goodness-of-fit test [30]. We used the Integrated Calibration Index (ICI), E50, E90 and Emax to quantify the calibration of the logistic regression models [31]. The ICI can be interpreted as a weighted difference between observed and predicted probabilities, in which observations are weighted by the empirical density function of the predicted probabilities.
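These indices can be written compactly, following the definitions in [31]. In this sketch, $\hat{p}_i$ denotes the predicted probability for patient $i$ and $\hat{p}_i^{\,c}$ the smoothed observed probability at that prediction (typically from a loess fit of the outcome on the predictions); the notation is ours.

```latex
\mathrm{ICI} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{p}_i - \hat{p}_i^{\,c} \right|,
\qquad
\mathrm{E}_{50} = \operatorname{median}_i \left| \hat{p}_i - \hat{p}_i^{\,c} \right|,
\qquad
\mathrm{E}_{90} = Q_{0.90}\!\left( \left| \hat{p}_i - \hat{p}_i^{\,c} \right| \right),
\qquad
\mathrm{E}_{\max} = \max_i \left| \hat{p}_i - \hat{p}_i^{\,c} \right|
```

Averaging over patients is what weights the absolute differences by the empirical density of the predicted probabilities.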

The calculations for the logistic regression models were carried out using STATA 18 software (StataCorp, College Station, Tex) and R version 4.2.2 statistical software (http://www.r-project.org).

3. Results

From the Epithor clinical database, 15,023 lung cancer surgery patients were analyzed, with the development data set comprising 10,516 patients and the validation data set comprising 4,507 patients. The number of deaths was 227 (2.16%) in the development data and 93 (2%) in the validation data.

In parallel, the medico-administrative database sample included 15,023 lung cancer surgery patients, with 10,516 in the development data and 4,507 in the validation data. The number of deaths was 283 (2.7%) in the development data and 119 (2.64%) in the validation data.

Description of Predictors

For the Epithor clinical database and the medico-administrative database, the description of patient and hospital characteristics is reported in the supplementary material (S1, S2, S3, S4), which shows the differences between the two databases. Variables such as FEV, body mass index, ASA score, performance status and TNM stage are available only in the Epithor clinical database (supplementary material). We created an additional class for missing data on the TNM stage variables (supplementary material). Hospital characteristics were not significantly related to postoperative mortality; hospital volume, however, was significantly related to mortality in the medico-administrative database (supplementary material S1, S2, S3, S4).

Development Model

For the Epithor clinical database, 15 predictors were selected in >50% of bootstrap samples (Table 1). We used the restricted cubic spline function for the FEV variable, which was tested to make the model as stable as possible (Table 1). For the body mass index variable, we applied a cubic transformation to relax the linearity assumption (Table 1). For age, the linearity assumption was valid (Table 1).

For the medico-administrative database, 15 predictors were selected in >50% of bootstrap samples (Table 2). For the age variable, we used the restricted spline function, and we log-transformed the hospital volume variable because the linearity assumption did not hold and to make the model more stable (Table 2).

Model Validity

Overall performance measures of the two models are reported in Table 3. For both models, the R2 was similar in the development sample and the validation sample (Table 3). For the development data, the R2 was 20% for the medico-administrative database model versus 13% for the Epithor clinical database model (Table 3). The variables selected in the medico-administrative database model explained 20% of the variability in mortality, while the Epithor clinical database model explained 13% of the variability.

The Brier score was identical in the development data and validation data for both databases (Table 3). The Brier score values were similar in the models of the two databases (Table 3). The estimated Brier scores for both models were far from 0.5, a value that would reflect a non-informative model.

Discriminative ability was estimated by the area under the receiver operating characteristic (ROC) curve, which was 0.83 (95% confidence interval (CI) 0.8-0.85) for development data from the medico-administrative database and 0.8 (95% CI 0.76-0.84) for validation data (Table 3). For development data from the Epithor clinical database, the AUC ROC was 0.78 (95% CI 0.75-0.81), and for validation data it was 0.73 (95% CI 0.68-0.78) (Table 3). The model developed from the medico-administrative database discriminated better between living and deceased patients than the Epithor clinical database model, particularly for the validation data.

For goodness of fit, we used the Hosmer-Lemeshow test, which was non-significant for both databases (Table 3). The calibration plots are shown in Figure 1 and Figure 2. For the validation data of the medico-administrative database, the slope was less than 1 (Figure 1): high predicted probabilities were too high. The calibration plot for validation data from the Epithor clinical database showed that the slope was even further from 1, because low predicted probabilities were too low and high predicted probabilities were too high (Figure 2).

The Integrated Calibration Index was comparable for both databases (Table 3), estimated at 0.0037 for validation data from the medico-administrative database and 0.003 for validation data from the Epithor clinical database.

4. Discussion

The model developed from the medico-administrative database showed a good discriminative value with an area under the ROC curve of 0.83. The model developed from the Epithor clinical database had a slightly poorer discriminative ability, with an area under the ROC curve of 0.78. Overfitting was greater for validation data from the Epithor clinical database than for validation data from the medico-administrative database. Otherwise, the other measures of model performance were not very different between the two databases.

This work showed that the performance of the models based on the PMSI medico-administrative database was similar to that of the Epithor clinical database, using several statistical criteria (such as calibration and discrimination). The performance of these two models might have been expected to be less similar, given that the medico-administrative database does not include variables with a major prognostic role, such as TNM stage, ASA score, FEV1 and body mass index, which are taken into account by surgeons and enable them to make the operative indication.

TNM stage is the strong point of the Epithor clinical database; as we have already indicated, many authors insist that this variable is essential in a prognostic/predictive model [23,24,25,26]. Our analysis did not confirm this assertion, since the model developed from the Epithor clinical database performed no better than the model developed from the medico-administrative database. However, it is difficult to find a formal explanation for this finding, and we can only formulate hypotheses. Variables describing the type of surgery, such as “extended resection” or “approach”, seemed to have more impact in the medico-administrative model than in the Epithor clinical database model. In fact, the indication for surgery is based on the TNM stage: for example, surgeons will only be able to propose a minimally invasive approach (VATS or robot) for patients with a tumor classified T1a or T1b. On the other hand, extended resection will be performed on patients classified as T2 or T3. It is therefore possible that these variables describing surgery, which have an important effect in the medico-administrative database model, partially compensate for the absence of the TNM stage variable.

Other clinical variables available in the Epithor clinical database, such as FEV1, dyspnea score or Gold score, are important to consider in a prognostic model as they reflect the patient's pulmonary pathology at the time of lung resection. However, even if they are not included as such in the PMSI database, the patient's respiratory status is described globally by the variable pulmonary disease in the medico-administrative database model, and this variable does have a significant effect in this model.

In the Epithor clinical database, the variable "performance status" is used to indicate the general condition of patients. This variable is not available directly in the medico-administrative database, but can be taken into account indirectly in the model through other comorbidities, such as the Charlson score.

However, the quality of coding information in the administrative database can be questioned. For example, the risk of underestimating certain comorbidities cannot be ruled out. Indeed, coding practices may vary from one hospital to another, and involve different personnel (clinicians or technicians specialized in coding). Nevertheless, the quality of coding is verified by medical information professionals in each hospital (internal quality assessment). It has been shown that the quality of comorbidity coding has increased significantly in recent years [32] in France, due to its impact on hospital charges. In addition, a national external quality assessment program has been set up to verify the quality of discharge summaries in each hospital. All these measures contribute to the quality of the data in this database, as our study seems to confirm.

The quality of the data in the Epithor clinical database may also be questioned, as suggested by the work we carried out in 2019 [33]. We showed that some centers did not systematically register all their postoperative deaths. We also found that the TNM stage variable could present missing data, as shown by the calibration with a significant overfit for the validation data. We have also shown that not all teams participate in the Epithor clinical database [33].

To overcome the shortcomings of these 2 databases, it would be interesting to be able to link them. This would enable us to supplement medico-administrative data with clinical information available in Epithor data, for patients common to both databases. We can reasonably assume that patients present in the Epithor database are also present in the PMSI database. In fact, medico-administrative data are entered in a standardized and compulsory way for all hospitalized patients, which would provide us with data on all patients operated on for lung cancer.

5. Conclusions

In conclusion, we have demonstrated the performance of a model developed from a medico-administrative database, despite the absence of clinical variables used in practice by surgeons, such as FEV1, ASA score or TNM stage. To compensate for the absence of these variables, it would be interesting to link medico-administrative data with Epithor data, which would provide data for all patients operated on and clinical information for patients common to both databases.

Supplementary Materials

The following supporting information can be downloaded at: www.mdpi.com/xxx/s1.

Author Contributions

Conceptualization, AB and CQ; methodology, AB and CQ; software, AB and JC; validation, AB, JC, PBP and CQ; formal analysis, AB and JC; writing (original draft preparation), AB; writing (review and editing), AB, JC and CQ; supervision, AB and CQ. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been supported by the Fondation ARC pour la recherche sur le cancer (www.fondation-arc.org).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the French National Commission for Data protection (No 1576793).

Informed Consent Statement

Patient consent was not required as this national retrospective study was based on anonymous data.

Data Availability Statement

The use of these data by our department was approved by the National Committee for data protection. We are not allowed to transmit these data. PMSI data are available for researchers who meet the criteria for access to these French confidential data (this access is submitted to the approval of the National Committee for data protection) from the national agency for the management of hospitalization (ATIH - Agence technique de l'information sur l'hospitalisation). Address: Agence technique de l'information sur l'hospitalisation, 117 boulevard Marius Vivier Merle, 69329 Lyon Cedex 03

Acknowledgments

The authors thank Suzanne Rankin for reviewing the English and Gwenaëlle Periard for her help with the layout and management of this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Siegel, R.L.; Miller, K.D.; Fuchs, H.E.; Jemal, A. Cancer statistics, 2022. CA Cancer J Clin. 2022, 72, 7–33.
2. Pujol, J.L.; Thomas, P.A.; Giraud, P.; Denis, M.G.; Tretarre, B.; Roch, B.; et al. Lung Cancer in France. J Thorac Oncol. Jan 2021, 16, 21–9.
3. INCA. Le cancer du poumon.
4. Goldstraw, P.; Chansky, K.; Crowley, J.; Rami-Porta, R.; Asamura, H.; Eberhardt, W.E.E.; et al. The IASLC Lung Cancer Staging Project: Proposals for Revision of the TNM Stage Groupings in the Forthcoming (Eighth) Edition of the TNM Classification for Lung Cancer. J Thorac Oncol. Jan 2016, 11, 39–51.
5. Howington, J.A.; Blum, M.G.; Chang, A.C.; Balekian, A.A.; Murthy, S.C. Treatment of stage I and II non-small cell lung cancer: Diagnosis and management of lung cancer, 3rd ed: American College of Chest Physicians evidence-based clinical practice guidelines. Chest. May 2013, 143, e278S–e313S.
6. Vansteenkiste, J.; Crinò, L.; Dooms, C.; Douillard, J.Y.; Faivre-Finn, C.; Lim, E.; et al. 2nd ESMO Consensus Conference on Lung Cancer: early-stage non-small-cell lung cancer consensus on diagnosis, treatment and follow-up. Ann Oncol. Aug 2014, 25, 1462–74.
7. Postmus, P.E.; Kerr, K.M.; Oudkerk, M.; Senan, S.; Waller, D.A.; Vansteenkiste, J.; et al. Early and locally advanced non-small-cell lung cancer (NSCLC): ESMO Clinical Practice Guidelines for diagnosis, treatment and follow-up. Ann Oncol. Jul 2017, 28, iv1–21.
8. Fernandez, F.G.; Kosinski, A.S.; Burfeind, W.; Park, B.; DeCamp, M.M.; Seder, C.; et al. The Society of Thoracic Surgeons Lung Cancer Resection Risk Model: Higher Quality Data and Superior Outcomes. Ann Thorac Surg. Aug 2016, 102, 370–7.
9. Farjah, F.; Backhus, L.; Cheng, A.; Englum, B.; Kim, S.; Saha-Chaudhuri, P.; et al. Failure to rescue and pulmonary resection for lung cancer. J Thorac Cardiovasc Surg. May 2015, 149, 1365–71, discussion 1371-1373.e3.
10. Lorgis, L.; Cottenet, J.; Molins, G.; Benzenine, E.; Zeller, M.; Aube, H.; et al. Outcomes after acute myocardial infarction in HIV-infected patients: analysis of data from a French nationwide hospital medical information database. Circulation. 30 Apr 2013, 127, 1767–74.
11. Lainay, C.; Benzenine, E.; Durier, J.; Daubail, B.; Giroud, M.; Quantin, C.; et al. Hospitalization within the first year after stroke: the Dijon stroke registry. Stroke. Jan 2015, 46, 190–6.
12. Abdulmalak, C.; Cottenet, J.; Beltramo, G.; Georges, M.; Camus, P.; Bonniaud, P.; et al. Haemoptysis in adults: a 5-year study using the French nationwide hospital administrative database. Eur Respir J. Aug 2015, 46, 503–11.
13. Pagès, P.B.; Cottenet, J.; Mariet, A.S.; Bernard, A.; Quantin, C. In-hospital mortality following lung cancer resection: nationwide administrative database. Eur Respir J. Jun 2016, 47, 1809–17.
14. Bernard, A.; Cottenet, J.; Pages, P.B.; Quantin, C. Is there variation between hospitals within each region in postoperative mortality for lung cancer surgery in France? A nationwide study from 2013 to 2020. Front Med (Lausanne). 2023, 10, 1110977.
15. Bernard, A.; Cottenet, J.; Pages, P.B.; Quantin, C. Diffusion of Minimally Invasive Approach for Lung Cancer Surgery in France: A Nationwide, Population-Based Retrospective Cohort Study. Cancers (Basel). 22 Jun 2023, 15, 3283.
16. Kozower, B.D.; Sheng, S.; O’Brien, S.M.; Liptay, M.J.; Lau, C.L.; Jones, D.R.; et al. STS database risk models: predictors of mortality and major morbidity for lung cancer resection. Ann Thorac Surg. Sep 2010, 90, 875–81, discussion 881-883.
17. Falcoz, P.E.; Conti, M.; Brouchet, L.; Chocron, S.; Puyraveau, M.; Mercier, M.; et al. The Thoracic Surgery Scoring System (Thoracoscore): risk model for in-hospital death in 15,183 patients requiring thoracic surgery. J Thorac Cardiovasc Surg. Feb 2007, 133, 325–32.
18. World Health Organization. International Statistical Classification of Diseases and Related Health Problems 10th Revision. http://apps.who.int/classifications/icd10/browse/2016/en. Date last updated: 2016. Date last accessed: 1 March 2016.
19. Iezzoni, L.I. Assessing quality using administrative data. Ann Intern Med. 15 Oct 1997, 127 (8 Pt 2), 666–74.
20. Travis, W.D.; Brambilla, E.; Müller-Hermelink, H.K.; et al. Pathology and Genetics: Tumours of the Lung, Pleura, Thymus and Heart. Lyon, IARC Press, 2004.
21. Charlson, M.; Szatrowski, T.P.; Peterson, J.; Gold, J. Validation of a combined comorbidity index. J Clin Epidemiol. Nov 1994, 47, 1245–51.
22. Delpy, J.P.; Pagès, P.B.; Mordant, P.; Falcoz, P.E.; Thomas, P.; Le Pimpec-Barthes, F.; et al. Surgical management of spontaneous pneumothorax: are there any prognostic factors influencing postoperative complications? Eur J Cardiothorac Surg. Mar 2016, 49, 862–7.
23. Bernard, A.; Rivera, C.; Pages, P.B.; Falcoz, P.E.; Vicaut, E.; Dahan, M. Risk model of in-hospital mortality after pulmonary resection for cancer: a national database of the French Society of Thoracic and Cardiovascular Surgery (Epithor). J Thorac Cardiovasc Surg. Feb 2011, 141, 449–58.
24. Morgant, M.C.; Pagès, P.B.; Orsini, B.; Falcoz, P.E.; Thomas, P.A.; Barthes, F.L.P.; et al. Time trends in surgery for lung cancer in France from 2005 to 2012: a nationwide study. Eur Respir J. Oct 2015, 46, 1131–9.
25. Pagès, P.B.; Mordant, P.; Renaud, S.; Brouchet, L.; Thomas, P.A.; Dahan, M.; et al. Sleeve lobectomy may provide better outcomes than pneumonectomy for non-small cell lung cancer. A decade in a nationwide study. J Thorac Cardiovasc Surg. Jan 2017, 153, 184–195.
26. Pforr, A.; Pagès, P.B.; Baste, J.M.; Thomas, P.; Falcoz, P.E.; Lepimpec Barthes, F.; et al. A Predictive Score for Bronchopleural Fistula Established Using the French Database Epithor. Ann Thorac Surg. Jan 2016, 101, 287–93.
27. Harrell, F.E. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis; Springer: Paris, France, 2001.
28. Heinze, G.; Wallisch, C.; Dunkler, D. Variable selection - A review and recommendations for the practicing statistician. Biom J. May 2018, 60, 431–49.
29. Steyerberg, E.W. Clinical Prediction Models: A Practical Approach to Development, Validation and Updating; Springer: New York, NY, USA, 2009.
30. Hosmer, D.W.; Hosmer, T.; Le Cessie, S.; Lemeshow, S. A comparison of goodness-of-fit tests for the logistic regression model. Stat Med. 15 May 1997, 16, 965–80.
31. Austin, P.C.; Steyerberg, E.W. The Integrated Calibration Index (ICI) and related metrics for quantifying the calibration of logistic regression models. Stat Med. 20 Sep 2019, 38, 4051–65.
32. Tassi, M.F.; le Meur, N.; Stéfic, K.; Grammatico-Guillon, L. Performance of French medico-administrative databases in epidemiology of infectious diseases: a scoping review. Front Public Health. 2023, 11, 1161550.
33. Bernard, A.; Falcoz, P.E.; Thomas, P.A.; Rivera, C.; Brouchet, L.; Baste, J.M.; et al. Comparison of Epithor clinical national database and medico-administrative database to identify the influence of case-mix on the estimation of hospital outliers. PLoS One. 2019, 14, e0219672.

Figure 1. Calibration plots of observed versus predicted mortality for the medico-administrative database: development data (n=10,516) and validation data (n=4,507).

Figure 2. Calibration plots of observed versus predicted mortality for the Epithor clinical database: development data (n=10,516) and validation data (n=4,507).

Table 1. Logistic regression model developed on the development data of the Epithor clinical database (n=10,516).

	Coef	S.E.	Wald test	P value
Intercept	-8.0726	1.7008	-4.75	<0.0001
FEV 1*	0.0192	0.0134	1.43	0.1518
FEV 2	-0.0835	0.0416	-2.01	0.0445
FEV 3	0.3055	0.1609	1.90	0.0576
Age	0.0247	0.0080	3.09	0.0020
Body mass index** (X/10)	1.3608	0.8036	1.69	0.0904
Body mass index (X^3)	-0.0813	0.0374	-2.17	0.0298
Performance status 2	0.3381	0.1559	2.17	0.0301
Performance status ≥3	0.7057	0.2269	3.11	0.0019
Dyspnea score ≥4	1.3531	0.2899	4.67	<0.0001
GOLD score ≥3	0.2870	0.2128	1.35	0.1774
Pneumonectomy	0.3655	0.2362	1.55	0.1219
Sleeve	0.9130	0.3111	2.94	0.0033
VATS	-0.0331	0.1540	-0.22	0.8296
Extended resection	0.3369	0.3686	0.91	0.3606
T2	0.1873	0.1898	0.99	0.3237
T3	0.6567	0.2056	3.19	0.0014
T4	0.8566	0.2560	3.35	0.0008
T missing	2.1008	0.8099	2.59	0.0095
N2	0.3969	0.2089	1.90	0.0575
N missing	0.7002	0.1853	3.78	0.0002
M2	0.7117	0.2662	2.67	0.0075
M missing	-2.3664	0.8259	-2.87	0.0042
Female	-0.7042	0.1807	-3.90	<0.0001
ASA score 2	0.3854	0.2779	1.39	0.1656
ASA score 3	0.6555	0.2822	2.32	0.0202
Comorbidity score ≥3	0.2693	0.1479	1.82	0.0687

* Restricted cubic splines; ** fractional polynomial.

Table 2. Logistic regression model developed on the development data of the medico-administrative database (n=10,516).

	Coef	S.E.	Wald test	P value
Intercept	-4.3107	1.0408	-4.14	<0.0001
Pulmonary disease	1.4882	0.1401	10.62	<0.0001
Heart disease	0.4115	0.1412	2.91	0.0036
Peripheral vascular disease	0.4095	0.1652	2.48	0.0131
Neurological disease	0.5063	0.2213	2.29	0.0221
Liver disease	1.9270	0.3037	6.35	<0.0001
Renal disease	0.7027	0.2410	2.92	0.0035
Metabolic disease	-0.3874	0.1867	-2.07	0.0380
Anemia	0.3951	0.1483	2.66	0.0077
Infectious disease	1.4135	0.3245	4.36	<0.0001
Other disease	0.4097	0.1292	3.17	0.0015
Extended resection	0.6063	0.1632	3.71	0.0002
Sleeve	0.8957	0.2966	3.02	0.0025
Female	-0.7435	0.1680	-4.43	<0.0001
VATS/robot	-0.6010	0.1590	-3.78	0.0002
Age 1*	0.0054	0.0163	0.33	0.7379
Age 2	0.0405	0.0161	2.52	0.0117
Logarithm hospital volume	-0.2163	0.0661	-3.27	0.0011

* Restricted Cubic Splines.
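
The coefficients in Tables 1 and 2 are on the log-odds scale. As a reading aid, the sketch below converts a few rows of Table 2 into Wald statistics, odds ratios and approximate 95% confidence intervals. It is not the authors' code: the helper name `summarize` and the normal critical value of 1.96 are assumptions made for illustration, and the spline terms for age are left out because a single spline coefficient has no odds-ratio interpretation on its own.

```python
import math

# (coefficient, standard error) pairs copied from Table 2.
TABLE2 = {
    "Pulmonary disease": (1.4882, 0.1401),
    "Liver disease": (1.9270, 0.3037),
    "Female": (-0.7435, 0.1680),
    "VATS/robot": (-0.6010, 0.1590),
}


def summarize(coef, se, z_crit=1.96):
    """Return the Wald statistic, odds ratio and approximate 95% CI.

    `summarize` is a hypothetical helper; z_crit=1.96 assumes a
    normal approximation for the coefficient estimate.
    """
    return {
        "wald_z": coef / se,                      # Wald statistic = coef / S.E.
        "odds_ratio": math.exp(coef),             # log-odds -> odds ratio
        "ci_low": math.exp(coef - z_crit * se),   # lower CI bound
        "ci_high": math.exp(coef + z_crit * se),  # upper CI bound
    }


for name, (coef, se) in TABLE2.items():
    s = summarize(coef, se)
    print(
        f"{name}: z={s['wald_z']:.2f}, OR={s['odds_ratio']:.2f} "
        f"[{s['ci_low']:.2f}-{s['ci_high']:.2f}]"
    )
```

The Wald statistics recomputed this way agree with the "Wald test" column of Table 2 (for example 10.62 for pulmonary disease). An odds ratio below 1, as for female sex and VATS/robot, corresponds to a negative coefficient.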

Table 3. Performance evaluation of the logistic regression models.

	Medico-administrative database		Clinical database Epithor
	Development data (n=10516)	Validation data (n=4507)	Development data (n=10516)	Validation data (n=4507)
Performance measures
R²	20%	19%	12%	13%
Brier score	0.024	0.024	0.02	0.02
Brier max	0.026	0.026	0.021	0.02
Brier scaled	0.08	0.07	0.03	0.08
Discriminative ability
AUC ROC	0.83 [0.80-0.85]	0.80 [0.76-0.84]	0.78 [0.75-0.81]	0.73 [0.68-0.78]
Concordance statistic	–	0.82	–	0.73
Discrimination slope	–	0.08	–	0.03
Calibration
Hosmer-Lemeshow test, Χ² (P value)	8.7 (0.36)	8 (0.5)	10.4 (0.24)	9 (0.43)
ICI	–	0.0037	–	0.003
E50	–	0.003	–	0.002
E90	–	0.006	–	0.005
Emax	–	0.15	–	0.68
Abs calibration error*	–	0.006	–	0.005
Unreliability P value	–	0.2	–	0.05

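ICI: Integrated Calibration Index; E50, E90 and Emax: median, 90th percentile and maximum absolute difference between predicted and smoothed observed probabilities; * Mean absolute calibration error.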
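
Several of the measures in Table 3 follow from simple formulas. The sketch below is an illustration only, not the authors' code. It assumes the usual definitions: Brier max is the score of a model that predicts the observed death rate for every patient, Brier scaled is 1 minus the ratio of Brier score to Brier max, the discrimination slope is the difference in mean predicted risk between deaths and survivors, and the AUC is the probability that a death is ranked above a survivor. The function names and the five-patient toy data are hypothetical.

```python
def brier_score(y, p):
    """Mean squared difference between outcome (0/1) and predicted risk."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def brier_max(prevalence):
    """Brier score of a model predicting the prevalence for everyone."""
    return prevalence * (1 - prevalence)


def brier_scaled(brier, prevalence):
    """Scaled Brier score: 1 - Brier / Brier max."""
    return 1 - brier / brier_max(prevalence)


def discrimination_slope(y, p):
    """Mean predicted risk in events minus mean predicted risk in non-events."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    non_events = [pi for yi, pi in zip(y, p) if yi == 0]
    return sum(events) / len(events) - sum(non_events) / len(non_events)


def auc(y, p):
    """Share of (event, non-event) pairs ranked correctly; ties count 0.5."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    non_events = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(
        1.0 if e > n else 0.5 if e == n else 0.0
        for e in events
        for n in non_events
    )
    return wins / (len(events) * len(non_events))


# Figures reported for the medico-administrative validation data:
# 119 deaths among 4,507 patients, Brier score 0.024 (rounded).
prevalence = 119 / 4507
print(round(brier_max(prevalence), 3))            # about 0.026
print(round(brier_scaled(0.024, prevalence), 2))  # about 0.07

# Hypothetical toy data, for illustration only (not study data).
y_toy = [0, 0, 1, 0, 1]
p_toy = [0.01, 0.02, 0.03, 0.05, 0.20]
print(brier_score(y_toy, p_toy))
print(discrimination_slope(y_toy, p_toy))
print(auc(y_toy, p_toy))
```

With the reported death rate and the rounded Brier score, the first two lines give values consistent with the Brier max and Brier scaled entries of Table 3 for the medico-administrative validation data. The agreement is only approximate because the published Brier score is rounded to three decimals.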


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

