A Machine Learning Approach to Identify Distinct Subgroups of Veterans at Risk for Hospitalization or Death Using Administrative and Electronic Health Record Data

Ravi B. Parikh
Kristin A. Linn
Jiali Yan
Matthew L. Maciejewski
Ann-Marie Rosland
Kevin G. Volpp
Peter W. Groeneveld
Amol S. Navathe
Peer-Reviewed Article
February 2021


Use of machine learning clustering algorithms revealed 30 distinct subgroups of patients among high-risk veterans, indicating a need for tailored approaches to health care.


The Veterans Health Administration developed the Care Assessment Needs (CAN) score to predict the risk of future hospitalization or mortality for all veterans who receive primary care. Interventions offered to all veterans, such as telemedicine or case management, have had limited effectiveness in improving care for individuals with high CAN scores. Because approaches to subgrouping high-risk patients that rely on diagnosis or disease criteria or on expert opinion are prone to human error, this study examined whether machine learning algorithms could identify distinct subpopulations of individuals at high risk for hospitalization or mortality using electronic health record and administrative data.


Within a national randomized sample of 110,000 veterans identified by high CAN scores, this study found 30 unique subgroups with a wide range of characteristics, health care utilization, and mortality risk. Though most subgroups were defined by chronic conditions, the analysis identified several subgroups based on non-clinical factors, including sociodemographic (e.g., Medicaid, Hispanic ethnicity) and psychobehavioral (e.g., specific polysubstance use, psychoses without polysubstance use) characteristics. About 25 percent of the high-risk veterans did not fit clearly into one subgroup, either because their characteristics spanned multiple subgroups or because their characteristics were not captured.


Care management programs for high-risk patients should be tailored to the distinct needs of the population. The clustering methods described in this article can guide health care organizations in using both clinical and administrative data to identify unique needs and inform appropriate interventions for targeted subgroups. 
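
As a rough illustration of what clustering patients on combined clinical and administrative indicators can look like, the sketch below runs a plain k-means on synthetic binary patient records. It is an assumption-laden toy, not the study's method: this summary does not name the algorithm used, so k-means is a stand-in, and the feature names, profiles, sample size, and the choice of three clusters are all hypothetical.

```python
import random

# Hypothetical indicator columns mixing clinical and administrative data.
FEATURES = ["heart_failure", "diabetes", "substance_use", "medicaid"]

# Hypothetical probability of each indicator within three made-up patient profiles.
PROFILES = [
    [0.9, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.1, 0.1],
    [0.1, 0.1, 0.9, 0.8],
]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(rows):
    """Column-wise mean of a non-empty list of vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def kmeans(rows, k, max_iter=100, seed=0):
    """Plain k-means: returns one cluster label per row and the k centroids."""
    rng = random.Random(seed)
    # Start from k randomly chosen rows.
    centroids = [[float(v) for v in row] for row in rng.sample(rows, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every row.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(row, centroids[j]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_vector(members)
    return labels, centroids


# Synthetic data: 40 simulated patients per hypothetical profile.
data_rng = random.Random(42)
patients = [
    [1 if data_rng.random() < p else 0 for p in profile]
    for profile in PROFILES
    for _ in range(40)
]

labels, centroids = kmeans(patients, k=3)

# Describe each cluster by its size and the prevalence of each indicator.
for j, centroid in enumerate(centroids):
    size = labels.count(j)
    summary = ", ".join(f"{name}={value:.2f}" for name, value in zip(FEATURES, centroid))
    print(f"cluster {j}: n={size}; {summary}")
```

Each centroid reads as the share of cluster members with a given indicator, which is one simple way to characterize a subgroup before deciding which intervention might suit it. Real data would call for a method suited to mixed data types and a principled way to choose the number of clusters.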

Posted to The Playbook on