Background: Machine Learning (ML) is a potent tool for analyzing Real-World Data (RWD), where missing values of varying degrees are a common problem. Limited research exists on how multiple imputation (MI) methods and degrees of missingness affect ML performance for RWD problems such as overall survival (OS) prediction in advanced lung cancer (mLC).
Objectives: To evaluate different MI approaches for handling varying degrees of feature missingness in ML-based prediction of OS in mLC.
Methods: In a cohort of adults identified from their first recorded diagnosis of mLC in the large nationwide IQVIA Oncology EMR – US database (2015-2020), ML algorithms were trained and validated to predict 90-day mortality. Baseline features included demographics, vital signs, stage, TNM, histology, biomarkers (e.g., EGFR, HER2, KRAS, BRAF, and cMET), chemotherapy, targeted therapy and immunotherapy, and functional laboratory tests. Steps conducted were: i) In addition to the full cohort (C100), 4 analytic cohorts (C75, C50, C25 and C0) were created by retaining only features with missingness proportions of < 75%, < 50%, < 25% and 0%, respectively; ii) Each cohort was split 70/30 into training and testing sets; iii) Data cleaning included removing extreme outliers and clustering chemotherapies and histology; iv) MI methods included MICE and MissForest; v) As a reference, single imputation was used, with the median for continuous variables and the most frequent category for categorical variables; and vi) ML algorithms included XGBoost and Random Forest (RF). Performance metrics for ML models included AUC, accuracy, F1, sensitivity and specificity.
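Steps i) and v) above can be illustrated with a minimal sketch. It is not the study's code: the toy records, feature names, and the helper functions `missing_fraction`, `select_features` and `single_impute` are hypothetical, and it assumes missing values are coded as `None` and that every retained feature has at least one observed value.

```python
from collections import Counter
from statistics import median

# Hypothetical toy records (not study data); None marks a missing value.
rows = [
    {"age": 70, "stage": "IV", "egfr": None, "albumin": 3.9},
    {"age": 64, "stage": "IV", "egfr": None, "albumin": None},
    {"age": 81, "stage": "IIIB", "egfr": "positive", "albumin": 3.1},
    {"age": 58, "stage": None, "egfr": None, "albumin": None},
]


def missing_fraction(rows):
    """Return the proportion of missing (None) values for each feature."""
    return {f: sum(r[f] is None for r in rows) / len(rows) for f in rows[0]}


def select_features(rows, max_missing):
    """Pick the features of one analytic cohort.

    max_missing=None keeps every feature (C100), 0 keeps only complete
    features (C0), and any other value keeps features whose missingness
    is strictly below the threshold (C75, C50, C25).
    """
    frac = missing_fraction(rows)
    if max_missing is None:
        return list(frac)
    if max_missing == 0:
        return [f for f, p in frac.items() if p == 0]
    return [f for f, p in frac.items() if p < max_missing]


def single_impute(rows, features):
    """Reference imputation: median for numeric features, most frequent
    category otherwise. Assumes each feature has an observed value."""
    fill = {}
    for f in features:
        observed = [r[f] for r in rows if r[f] is not None]
        if all(isinstance(v, (int, float)) for v in observed):
            fill[f] = median(observed)
        else:
            fill[f] = Counter(observed).most_common(1)[0][0]
    return [
        {f: (r[f] if r[f] is not None else fill[f]) for f in features}
        for r in rows
    ]


thresholds = [("C100", None), ("C75", 0.75), ("C50", 0.50), ("C25", 0.25), ("C0", 0)]
cohorts = {name: select_features(rows, t) for name, t in thresholds}
imputed_c75 = single_impute(rows, cohorts["C75"])
```

In a setup like the one described, the fill values would typically be derived from the training split only and then applied to the test split, so that no test information leaks into the imputation.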
Results: The full study cohort included 19,751 adult patients with mLC, of whom 31.2% were aged 75 years or older (median 69 years, IQR 62-76 years) and 52.2% were male. Stages IIIB, IIIC and IV were recorded in 12.3%, 1.4% and 86.3% of patients, respectively. Overall, 9% of patients died within the 90-day follow-up. The number of features dropped from 66 in the full cohort to 37, 30, 21 and 20 for the C75, C50, C25 and C0 cohorts, respectively. AUCs of 0.86, 0.85, 0.83, 0.73 and 0.73 for XGBoost and 0.84, 0.84, 0.81, 0.71 and 0.71 for RF were observed for the C100, C75, C50, C25 and C0 cohorts, respectively. Similar AUC trends across the 5 cohorts were found with different data cleaning methods and with single imputation. MissForest appeared to perform better than MICE, and XGBoost appeared to perform better than RF.
Conclusions: Based on AUC, MI approaches that incorporated all features, regardless of missingness level, demonstrated the best performance in predicting mortality in this study. Although certain MI and ML methods showed slightly superior performance, additional research is warranted to validate these findings across different diseases and databases.