(122) Implementation of a cluster-based High-Performance Computing solution for logistic regression, random forest and XGBoost algorithms to predict memory impairment in older adults
Background: Machine learning is crucial in analyzing large healthcare data and predicting clinical outcomes. We present an approach aided by High-Performance Computing (HPC) to analyze one of the largest and most comprehensive datasets of geriatric assessments, the interRAI Home Care Assessment System (interRAI-HC).
Objectives: This study combines mainstream machine-learning algorithms with the power of HPC to predict memory impairment in adults aged 65 years and above. Logistic regression, random forest and XGBoost algorithms were considered.
Methods: We used anonymized data of community-dwelling adults in New Zealand aged 65 years and above who received an interRAI-HC assessment between June 1, 2012, and June 30, 2014. The dataset can be linked to other healthcare data, including demographic, psychosocial, and clinical variables. The interRAI dataset currently consists of 250 variables. Using the HPC cluster (GW4 Isambard, run on XCI Marvell ThunderX2 nodes), we identified 24 variables that are key discriminators of memory impairment risk and used them to train predictors based on logistic regression, random forest and XGBoost. Each predictor was tested by 100-fold cross-validation, parallelized over a single node with 64 cores using the map function in the R package purrr. We evaluated model performance using the area under the receiver-operating characteristic curve (AUC), F1, accuracy, sensitivity, specificity, and negative and positive predictive values.
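The evaluation metrics listed in the Methods can be illustrated with a minimal Python sketch. It is not the study's code (the analysis was run in R); the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration. AUC is computed here as the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as one half.

```python
from typing import Dict, List


def confusion_counts(y_true: List[int], y_pred: List[int]) -> Dict[str, int]:
    """Count true/false positives and negatives for binary labels (1 = impaired)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def classification_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Threshold-based metrics; assumes both classes occur in y_true and y_pred."""
    c = confusion_counts(y_true, y_pred)
    tp, fp, tn, fn = c["tp"], c["fp"], c["tn"], c["fn"]
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    ppv = tp / (tp + fp)          # positive predictive value (precision)
    npv = tn / (tn + fn)          # negative predictive value
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
    }


def auc(y_true: List[int], scores: List[float]) -> float:
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not from the interRAI dataset).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
metrics["auc"] = auc(y_true, scores)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The pairwise AUC is quadratic in the number of cases, which is fine for a sketch; a rank-based formulation would be the usual choice at the scale described here.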
Results: The mean AUC for logistic regression was 0.787 (SE = 0.00283) at lambda = 0.0001. For random forest, the mean AUC was 0.718 (SE = 0.00271), and for XGBoost it was 0.799 (SE = 0.00300). The XGBoost model achieved the highest accuracy (81%). The biggest contributors to cognitive impairment were related to social or occupational functioning, including higher IADL dependency and poor mobility.
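The following Python sketch shows one way a mean AUC and its standard error could be obtained by mapping a per-fold evaluation over cross-validation folds. It rests on several assumptions that the abstract does not confirm: the standard error is taken as the sample standard deviation of the fold AUCs divided by the square root of the number of folds; the folds are stratified only so that every toy fold contains both classes; the "model" is a hypothetical one-feature stand-in, not logistic regression, random forest or XGBoost; and a thread pool's map stands in for the purrr map call used in the study. The data are synthetic and k is set to 10 to keep the example small.

```python
import math
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def auc(y_true, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_folds(y, k, seed=0):
    """Shuffle rows, group by class, then deal round-robin into k folds.

    Grouping by class makes the folds stratified (an assumption of this sketch).
    """
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    indices.sort(key=lambda i: y[i])  # stable sort keeps shuffled order per class
    return [indices[i::k] for i in range(k)]


def evaluate_fold(x, y, folds, held_out):
    """Fit a hypothetical stand-in model on k-1 folds, return AUC on the held-out fold."""
    test_idx = folds[held_out]
    train_idx = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
    mean_pos = statistics.mean(x[i] for i in train_idx if y[i] == 1)
    mean_neg = statistics.mean(x[i] for i in train_idx if y[i] == 0)
    # "Training" only learns whether larger x indicates the positive class.
    direction = 1.0 if mean_pos >= mean_neg else -1.0
    scores = [direction * x[i] for i in test_idx]
    return auc([y[i] for i in test_idx], scores)


def summarize(values):
    """Mean and standard error (sample SD / sqrt(n)) of per-fold metrics."""
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, se


# Synthetic data (not interRAI): one feature shifted by one unit for positives.
rng = random.Random(42)
y = [i % 2 for i in range(1000)]
x = [rng.gauss(label, 1.0) for label in y]

K = 10  # the study used 100 folds; 10 keeps this toy example quick
folds = make_folds(y, K)

# Map the fold evaluator over fold numbers; results come back in fold order.
with ThreadPoolExecutor(max_workers=4) as pool:
    fold_aucs = list(pool.map(lambda j: evaluate_fold(x, y, folds, j), range(K)))

mean_auc, se_auc = summarize(fold_aucs)
print(f"mean AUC = {mean_auc:.3f} (SE = {se_auc:.5f})")
```

Threads are used here only to show the map-over-folds pattern; in CPython they would not speed up CPU-bound model fitting, so a process pool or a cluster scheduler would be the closer analogue of spreading folds across 64 cores.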
Conclusions: With the aid of HPC, we successfully ran machine learning algorithms on large datasets within 6 hours. The cluster-based HPC cross-validation solution indicated that XGBoost performs best compared with logistic regression and random forest. This cluster-based HPC solution for logistic regression, random forest and XGBoost algorithms will enable analysis of the large interRAI dataset, which currently covers more than a million individuals and is growing.