Application of Machine Learning to Osteoporosis and Osteopenia Screening Using Hand Radiographs

Purpose Fragility fractures associated with osteoporosis and osteopenia are a common cause of morbidity and mortality. Current methods of diagnosing low bone mineral density require specialized dual x-ray absorptiometry (DXA) scans. Plain hand radiographs may have utility as an alternative screening tool, although optimal diagnostic radiographic parameters are unknown, and measurement is prone to human error. The aim of the present study was to develop and validate an artificial intelligence algorithm to screen for osteoporosis and osteopenia using standard hand radiographs.
Methods Institutional review board approval was obtained. An institutional database was queried to identify all patients between 1998 and 2019 who underwent both a DXA scan and a hand radiograph within 12 months of each other. The reports for the DXA scan within 12 months of the radiograph were obtained and the T-scores were extracted. High-resolution images of corresponding posteroanterior view hand radiographs were exported from our institution's Picture Archiving and Communication System. Hand radiograph images were labeled with DXA T-score and category (osteoporosis, osteopenia, or normal). Definitions of categories followed the standard World Health Organization (WHO) definitions using T-scores21 as follows: normal, T >= -1.0; osteopenia, -2.5 < T < -1.0; and osteoporosis, T <= -2.5.
Results There was a total of 687 images in the normal category, 607 images in the osteopenia category, and 130 images in the osteoporosis category, for a total of 1,424 images. When predicting low bone density (osteopenia or osteoporosis) versus normal bone density, sensitivity was 88.5%, specificity was 65.4%, overall accuracy was 80.8%, and the area under the curve was 0.891, at the standard threshold of 0.5. If optimizing for both sensitivity and specificity, at a threshold of 0.655, the model achieved a sensitivity of 84.6% at a specificity of 84.6%.
Conclusions The findings represent a possible step toward more accessible, cost-effective, automated diagnosis and therefore earlier treatment of osteoporosis/osteopenia. (J Hand Surg Am. 2024;-(-):-e-. Copyright © 2024 by the American Society for Surgery of the Hand. All rights are reserved, including those for text and data mining, AI training, and similar technologies.)
Type of study/level of evidence Diagnostic II
Key words Artificial intelligence, bone mineral density, machine learning, osteopenia, osteoporosis.

From the *Division of Plastic and Reconstructive Surgery, Department of Surgery, Stanford University School of Medicine, Stanford, CA; †Department of Orthopedics, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan, Republic of China; ‡Department of Orthopaedic Surgery, Stanford University School of Medicine, Stanford, CA; and §Robert A. Chase Hand and Upper Limb Center, Department of Orthopaedic Surgery, Stanford University Medical Center, Redwood City, CA.

Received for publication February 6, 2024; accepted in revised form September 10, 2024.

Corresponding author: Jeffrey Yao, MD, Robert A. Chase Hand and Upper Limb Center, Department of Orthopaedic Surgery, Stanford University Medical Center, 450 Broadway Street, MC 6342, Redwood City, CA 94063; e-mail: jyao@stanford.edu.

0363-5023/24/---0001$36.00/0 https://doi.org/10.1016/j.jhsa.2024.09.008

Background

OSTEOPOROSIS AND OSTEOPENIA are common conditions with considerable morbidity. The lifetime incidence of any osteoporotic fragility fracture is estimated to be 40% to 50% in women and 13% to 22% in men.1 An estimated 9 million osteoporotic fractures occur worldwide annually.2,3 Fragility fractures may decrease function and contribute to nearly six million disability-adjusted life years lost annually.2 Moreover, patients with hip fractures experience a five-fold to eight-fold increase in all-cause mortality post-fracture.4 The rates of intervention for osteoporosis remain low, despite evidence that screening and treatment decrease future risk of fragility fractures. Improved screening to identify and appropriately treat individuals with poor bone health may help decrease the morbidity and mortality associated with fragility fractures.
Methods

Data collection

Institutional review board approval was obtained. An institutional database was queried to identify all patients between 1998 and 2019 who underwent both a DXA scan and a hand radiograph within 12 months of each other. The reports for the DXA scan within 12 months of the radiograph were obtained and the T-scores were extracted. High-resolution images of corresponding posteroanterior view hand radiographs were exported from our institution's Picture Archiving and Communication System. Hand radiograph images were labeled with DXA T-score and category (osteoporosis, osteopenia, or normal). Definitions of categories followed the standard World Health Organization (WHO) definitions using T-scores21 as follows: normal, T >= -1.0; osteopenia, -2.5 < T < -1.0; and osteoporosis, T <= -2.5.
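The labeling rule above can be written as a short function. The following is a minimal sketch rather than the study's own code; the names `who_category` and `is_low_bmd` are hypothetical and introduced only for illustration.

```python
def who_category(t_score: float) -> str:
    """Map a DXA T-score to the WHO category used to label each radiograph.

    Thresholds follow the definitions quoted in the text:
    normal T >= -1.0, osteopenia -2.5 < T < -1.0, osteoporosis T <= -2.5.
    """
    if t_score <= -2.5:
        return "osteoporosis"
    if t_score < -1.0:
        return "osteopenia"
    return "normal"


def is_low_bmd(t_score: float) -> bool:
    """Binary screening label: osteopenia or osteoporosis versus normal."""
    return who_category(t_score) != "normal"


if __name__ == "__main__":
    for t in (-0.11, -1.0, -1.66, -2.5, -2.89):
        print(t, who_category(t), is_low_bmd(t))
```

The boundary values matter here: a T-score of exactly -1.0 is labeled normal and a T-score of exactly -2.5 is labeled osteoporosis.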

FIGURE 1: Diagram of model architecture.

Neural Network Algorithm Development

All image preprocessing, model execution, and performance evaluation were performed using Python. A model was designed using the ResNet-50 algorithm, which comprises 49 convolution layers and one fully connected layer constructed into 16 residual blocks. ResNet architectures employ residual connections to avoid the problem of gradient vanishing, where gradients diminish after passing too many layers. The base model was pretrained on the ImageNet data set, a large data set containing millions of images. The neural network was programmed with the PyTorch 2.0 framework and trained for 35 epochs.
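The layer counts quoted above follow from the standard ResNet-50 layout, and the arithmetic can be checked directly. The sketch below assumes the conventional configuration of four stages with 3, 4, 6, and 3 bottleneck blocks, each containing three convolutions, plus one stem convolution; by the usual convention the projection convolutions on the shortcut paths are not counted. It is an illustration of the architecture's bookkeeping, not the study's training code, and the function names are hypothetical.

```python
# Bottleneck blocks per stage in the conventional ResNet-50 layout (assumed).
BLOCKS_PER_STAGE = (3, 4, 6, 3)
CONVS_PER_BOTTLENECK = 3  # 1x1 reduce, 3x3, 1x1 expand
STEM_CONVS = 1            # the initial 7x7 convolution
FC_LAYERS = 1             # final fully connected classifier


def resnet50_layer_counts() -> dict:
    """Count residual blocks and weighted layers for ResNet-50."""
    blocks = sum(BLOCKS_PER_STAGE)
    convs = STEM_CONVS + blocks * CONVS_PER_BOTTLENECK
    return {
        "residual_blocks": blocks,
        "conv_layers": convs,
        "fc_layers": FC_LAYERS,
        "weighted_layers": convs + FC_LAYERS,
    }


def residual(x, f):
    """Toy residual connection: output = input + f(input), element-wise."""
    return [xi + fi for xi, fi in zip(x, f(x))]


if __name__ == "__main__":
    print(resnet50_layer_counts())
    # If the learned branch contributes nothing, the input passes through
    # unchanged, which is why gradients can still flow through deep stacks.
    print(residual([1.0, 2.0], lambda v: [0.0 for _ in v]))
```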
Results

There was a total of 687 images in the normal category, 607 images in the osteopenia category, and 130 images in the osteoporosis category, for a total of 1,424 images. When predicting low bone density (osteopenia or osteoporosis) versus normal bone density, sensitivity was 88.5%, specificity was 65.4%, overall accuracy was 80.8%, and the area under the curve was 0.891, at the standard threshold of 0.5. If optimizing for both sensitivity and specificity, at a threshold of 0.655, the model achieved a sensitivity of 84.6% at a specificity of 84.6%.
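The text does not state how the threshold of 0.655 was selected. One common approach, shown here purely as an assumption, is to sweep candidate thresholds and keep the one where sensitivity and specificity are closest to each other. The scores and labels below are hypothetical toy values, not study data, and the function names are illustrative.

```python
def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when score >= threshold predicts low BMD (label 1)."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def balanced_threshold(scores, labels):
    """Candidate threshold minimizing |sensitivity - specificity| (assumed criterion)."""
    def gap(t):
        sens, spec = sens_spec(scores, labels, t)
        return abs(sens - spec)
    return min(sorted(set(scores)), key=gap)


# Hypothetical model outputs: six low-BMD images (1) and six normal images (0).
SCORES = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.75, 0.55, 0.3, 0.2, 0.1, 0.05]
LABELS = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

if __name__ == "__main__":
    t = balanced_threshold(SCORES, LABELS)
    print(t, sens_spec(SCORES, LABELS, t), sens_spec(SCORES, LABELS, 0.5))
```

On this toy data, raising the threshold above 0.5 trades a little sensitivity for specificity until the two meet, which mirrors the pattern reported for the model.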

Discussion

In this study, a neural network was developed and validated to screen for osteoporosis and osteopenia in routine hand radiographs. Specifically, a convolutional neural network (CNN) was trained to identify low bone mineral density (BMD) in hand radiographs against a reference standard based on DXA hip T-scores and was found to have a sensitivity of 88.5%, a specificity of 65.4%, and a diagnostic accuracy of 80.8%. The high sensitivity, and therefore low false negative rate, reflects the potential utility of the algorithm as a screening tool to identify patients with low BMD.
There was a total of 687 images in the normal category, 607 images in the osteopenia category, and 130 images in the osteoporosis category, for a total of 1,424 images (Table 1). Twelve images were excluded due to overlying splint/casting material or lack of a posteroanterior view. Two patients were excluded due to incomplete DXA reports. Eight pediatric patients were excluded. There were 68 patients without a left hip DXA; of these, the right hip was used for 47 patients, the spine for six patients, and the forearm for 15 patients. Women accounted for 86.8% of images, and 53.3% of patients were White, 20.9% were Asian, 14.3% were Hispanic, 3.4% were Black, and 8.1% were of another or unknown ethnicity. The mean T-score overall was -1.02 ± 1.13 (mean ± SD), with a mean T-score of -0.11 ± 0.78 in the normal group, -1.66 ± 0.38 in the osteopenia group, and -2.89 ± 0.51 in the osteoporosis group.
Of these, 26 normal radiographs, 26 osteopenia radiographs, and 26 osteoporosis radiographs were used as the balanced validation set for a total of 78 images. The remaining 660 normal radiographs, 582 osteopenia radiographs, and 104 osteoporosis radiographs were used as the training set for a total of 1,346 images.

When predicting low BMD (osteopenia or osteoporosis) versus normal BMD, sensitivity was 88.5%, specificity was 65.4%, and precision was 83.6% at the standard classification threshold of 0.5 (Fig. 2A). Overall accuracy was 80.8%. The F1-score was 0.86, and the AUC was 0.891 (Fig. 2B). When optimizing for both sensitivity and specificity, at a threshold of 0.655, the model achieved a sensitivity of 84.6% at a specificity of 84.6%.
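The reported metrics can be reproduced from a two-way confusion matrix. The counts below are not given in the text; they are inferred here from the reported percentages and the balanced validation set of 78 images (52 low BMD, 26 normal), so they should be read as an assumption. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # low BMD correctly flagged
    fn: int  # low BMD missed
    fp: int  # normal flagged as low BMD
    tn: int  # normal correctly passed


def metrics(c: Confusion) -> dict:
    """Standard binary classification metrics from confusion counts."""
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    precision = c.tp / (c.tp + c.fp)
    accuracy = (c.tp + c.tn) / (c.tp + c.fn + c.fp + c.tn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Inferred counts (assumption), consistent with the reported percentages.
AT_0_5 = Confusion(tp=46, fn=6, fp=9, tn=17)
AT_0_655 = Confusion(tp=44, fn=8, fp=4, tn=22)

if __name__ == "__main__":
    for name, c in (("0.5", AT_0_5), ("0.655", AT_0_655)):
        print(name, {k: round(v, 3) for k, v in metrics(c).items()})
```

Under these inferred counts, moving the threshold from 0.5 to 0.655 converts five false positives into true negatives at the cost of two additional missed low-BMD images.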
FIGURE 3: Performance of the model in predicting osteoporosis from non-osteoporosis (normal or osteopenia). A Two-way confusion matrix at the standard classification threshold of 0.5. BMD, bone mineral density. NPV, negative predictive value. PPV, positive predictive value (also referred to as precision). B Receiver operating characteristic curve of model.
FIGURE 4: Three-way confusion matrix for the model's prediction of normal, osteopenia, or osteoporosis.
regarding bone quality. We previously published a study examining the correlation of the second metacarpal cortical percentage (2MCP) on hand radiographs with hip BMD from DXA scans.15 In comparison to our model's sensitivity of 88.5% and specificity of 65.4% for detecting low BMD, trained human measurement of 2MCP demonstrated a sensitivity of 88% and a specificity of 60% for distinguishing osteopenic from normal individuals, and a sensitivity of 100% and a specificity of 91% for distinguishing osteoporotic from normal individuals.15 However, measurement relies on trained personnel and is subject to potential human error and bias. A subsequent study by Tecle et al20 demonstrated a CNN's ability to process a radiograph image to identify osteoporosis using the 2MCP as a proxy osteoporosis predictor. Their model had a sensitivity of 82.4% and a specificity of 94.3% but used the 2MCP as the reference standard for osteoporosis prediction.20 Comparisons are difficult given that our model uses DXA T-scores as the reference standard, which may account for some differences in our results; differences may also arise from our smaller, disparate data set, as we only included radiographs with a corresponding DXA. Additionally, in the absence of knowledge of the most reliable indicator of bone density in the hand, we used AI to correlate hand radiographs to DXA scans in an agnostic fashion without image segmentation. Our findings synthesize those of prior studies to further support the potential utility of hand radiographs in screening for osteoporosis/osteopenia, with correlation to the gold standard reference of DXA BMD.

There are several strengths of this study. Inclusion of the whole hand radiograph as the input image allows the algorithm to define important features from the whole hand and uses the radiographic image as produced, without additional segmentation. Our study uses DXA scans as the reference standard for ground truth bone density classifications and analyses of model performance. Although DXA testing itself is prone to some limitations, it remains the current reference standard for the quantification of bone density. The ResNet architecture was selected for efficiency in processing images, without the issues of overfitting that can occur in alternative larger models such as Vision Transformer. Finally, incorporation of explainable AI methods allows for some degree of transparency regarding the model. Although not perfect, this additional step in the pipeline can help with model interpretability and alleviate concerns regarding the 'black box' nature of deep learning algorithms, which, at baseline, lack explanations for their decision-making.
The model presented in this study achieved high sensitivity in detecting low BMD on hand radiographs. From a clinical standpoint, there are several reasons to screen for both osteopenia and osteoporosis, rather than osteoporosis alone. First, patients with osteopenia are also at a high risk for fragility fractures. Siris et al28 found that only 6.4% of fragility fractures in women occurred in those meeting the WHO definition of osteoporosis. In another study of women 65 years or older, 54% of patients who sustained hip fractures were non-osteoporotic.29 Similarly, a prospective cohort study of both men and women ≥55 years found that only 44% of women and 21% of men sustaining fractures were osteoporotic.30 These data suggest the need for more sensitive screening to identify patients at risk of fracture. Screening for patients with not only osteoporosis but also osteopenia captures a larger at-risk patient population, who may have additional factors that contribute to bone fragility. The patients identified through screening can then be referred to clinicians for further risk assessment and appropriate treatment. Overall, higher sensitivity is prioritized as a hand radiograph is an inexpensive and low-risk screening tool, and detection of preclinical disease may result in earlier treatment and better outcomes.

Accuracy decreased when the model attempted to further stratify low BMD by performing a three-way differentiation among normal BMD, osteopenia, and osteoporosis. The model also performed slightly less well when differentiating osteoporosis images from non-osteoporosis (osteopenia or normal) images. There are several potential explanations for this finding. First, bone density exists along a continuum even on DXA imaging, and although discrete categories of T-scores are defined as osteopenia or osteoporosis, the characteristics found on radiographs likely follow a continuum as well. As supported by the clinical fracture data, there are likely many patients categorized as osteopenic who have features of poor bone health. Second, it is possible that we could better detect differences between the categories with a larger image set for both training and validation.

Although various advanced imaging modalities have been proposed as an alternative to DXA imaging, radiographs remain one of the most common and easily accessible imaging modalities. Specifically, among extremity radiographs, hand radiographs are routinely obtained in numerous practice settings and for many indications. Plain radiographs are also associated with lower radiation exposure.31 Finally, hand radiographs are readily available and inexpensive. Hand radiographs have costs similar to DXA but are more accessible and less costly than current alternatives such as magnetic resonance imaging and quantitative computed tomography.

Limitations of the study include those inherent to retrospective cohort studies and those using AI algorithms. The model architecture in this study was only trained on, and is therefore currently only applicable to, hand radiographs and not other types of radiographs. Overall, our cohort had a higher proportion of women than the general population, as this is the patient population that tends to have DXA examinations; it is therefore also representative of the clinical population currently indicated for bone health screening. The low proportion of men may have contributed to overfitting when incorporating sex as a factor. Detailed information regarding any history of treatment for poor bone quality, medications, family history, and substance use is unknown. Individuals with comorbidities that may place them at higher risk for bone disease, such as cancer or renal disease, were not excluded. It is possible that these factors could affect the BMD as seen on both DXA and hand radiographs. Similarly, the presence of other osseous lesions such as fractures, post-traumatic localized osteoporosis, and arthrosis was not a criterion for exclusion and could also affect the observed findings. These characteristics were not used directly as inputs into the algorithm to avoid overfitting, but their inclusion increases the generalizability of our results and reflects real-world performance for screening. Handedness may be associated with asymmetries in extremity BMD and may therefore affect both radiologic findings and DXA results, although these differences are estimated at 1% to 2%.33,34 Finally, use of a randomly selected imbalanced validation set would have more accurately reflected the overall study population. However, this would be unlikely to substantially affect the sensitivity or specificity, the primary metrics of interest, although there may be larger effects on measured precision. Use of a balanced validation set allows for greater interpretability of metrics such as accuracy, which primarily represents the majority class in an imbalanced set. Overall, a larger data set could potentially improve the performance of the model. Clinical utility of this tool for screening should be validated through further study.

In summary, the study demonstrates the ability of a deep learning algorithm to successfully detect low bone density from standard hand radiographs with high sensitivity and specificity. Compared with the current reference standard (DXA scans), this method uses plain hand radiographs, which are easily accessible and inexpensive. This may allow for a more rapid, cost-effective, and readily available screening tool for osteoporosis and osteopenia, capitalizing on a common imaging modality already widely used for other conditions. We envision that this could be used to expand indications for screening in a cost-effective manner, which may in turn allow for earlier and improved diagnosis and treatment.
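The remark in the limitations that accuracy mainly reflects the majority class in an imbalanced set can be illustrated with the image counts reported in the Results. The sketch below assumes a trivial baseline that always predicts the larger class in the osteoporosis versus non-osteoporosis task; this baseline is hypothetical and was not evaluated in the study.

```python
def majority_baseline_accuracy(n_positive: int, n_negative: int) -> float:
    """Accuracy of a classifier that always predicts the larger class."""
    return max(n_positive, n_negative) / (n_positive + n_negative)


# Full image set: 130 osteoporosis versus 687 normal + 607 osteopenia.
FULL_SET = majority_baseline_accuracy(130, 687 + 607)

# Balanced validation set: 26 osteoporosis versus 26 normal + 26 osteopenia.
BALANCED_SET = majority_baseline_accuracy(26, 26 + 26)

if __name__ == "__main__":
    print(f"full set: {FULL_SET:.3f}, balanced validation set: {BALANCED_SET:.3f}")
```

With the class proportions of the full image set, a model that never predicts osteoporosis would already score about 91% accuracy, whereas on the balanced validation set the same trivial strategy scores about 67%, so accuracy there says more about the model itself.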

CONFLICTS OF INTEREST

No benefits in any form have been received or will be received related directly to this article.

ACKNOWLEDGMENTS

This work was supported by the J2022 American Association for Hand Surgery Annual Research Grant. The authors thank Akousist for their assistance on this study.

REFERENCES