Original research
Machine learning model for differentiating malignant from benign thyroid nodules based on the thyroid function data
  1. Fuqiang Ma1,2,
  2. Fengchang Yu3,
  3. Shenhui Lv4,
  4. Lihua Zhang5,
  5. Zhilin Lu2,
  6. Quan Zhou2,
  7. He-Rong Mao2,
  8. Lele Zhang1,
  9. Nan Xiang4
  1. 1 Department of Integrated Traditional and Western Medicine,The First Affiliated Hospital, and College of Clinical Medicine of Henan University of Science and Technology, Luoyang, China
  2. 2 Hubei University of Chinese Medicine, Wuhan, Hubei, China
  3. 3 Wuhan University, School of Information Management, Wuhan, China
  4. 4 Hubei Provincial Hospital of Traditional Chinese Medicine, Wuhan, Hubei, China
  5. 5 Huanggang Hospital of Chinese Medicine, Hubei University of Chinese Medicine, Huanggang, Hubei, China
  1. Correspondence to Nan Xiang; 2727{at}hbucm.edu.cn; Lele Zhang; 876758328{at}qq.com; He-Rong Mao; mhr1752{at}hbucm.edu.cn

Abstract

Objectives To develop and validate a machine learning (ML) model to differentiate malignant from benign thyroid nodules (TNs) based on the routine data and provide diagnostic assistance for medical professionals.

Setting A qualified panel of 1649 patients with TNs from one hospital were stratified by gender, age, free triiodothyronine (FT3), free thyroxine (FT4) and thyroid peroxidase antibody (TPOAB).

Participants Thyroid function (TF) data of 1649 patients with TNs (273 males and 1376 females) were collected in a single centre from January 2018 to June 2022.

Measures Seven popular ML models (Random Forest, Decision Tree, Logistic Regression (LR), K-Neighbours, Gaussian Naive Bayes, Multilayer Perceptron and Gradient Boosting) were developed to predict malignant and benign TNs. Their performance was assessed by area under the curve (AUC), accuracy, recall, precision and F1 score.

Results A total of 1649 patients were enrolled in this study, with a mean age of 45.15±13.41 years and a male to female ratio of 1:5.055. In the multivariate LR analysis, statistically significant differences existed between the TNs group and thyroid cancer group in gender, age, FT3, FT4 and TPOAB. Among the seven tested ML models, the Gradient Boosting model achieved the best performance in terms of precision, AUC, accuracy, recall and F1 score, with an AUC of 0.82, accuracy of 79.4% and precision of 0.814 after experimental verification. FT4, TPOAB and FT3 were the top three features in the Gradient Boosting model.

Conclusions This study innovatively developed a predictive model for benign and malignant TNs based on the Gradient Boosting Decision Tree algorithm. For the first time, it validated the clinical predictive value of TF parameters (FT4, FT3) and TPOAB as key biomarkers.

  • Machine Learning
  • Thyroid disease
  • DIABETES & ENDOCRINOLOGY

Data availability statement

Data are available upon reasonable request. The data sets used and analysed in this study are available from the corresponding author NX on reasonable request.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

STRENGTHS AND LIMITATIONS OF THIS STUDY

  • Using common thyroid function data, this study employed seven popular machine learning (ML) models and provided a comparison of multiple prediction methods.

  • This study takes into consideration hyperparameters to optimise the performance of ML models.

  • This study employed a retrospective analysis solely based on the single-centre data, which may introduce potential selection bias.

  • The model lacks a graphical user interface.

Background

Thyroid nodules (TNs), one of the most common disorders in endocrinology, are mainly caused by local abnormal growth of thyroid cells.1 Varying according to countries and regions, it is estimated that the prevalence of TNs in the general population is higher than 67%,2 3 with a palpation detection rate of 4%–7%4 and an ultrasound (US) detection rate of 19%–68%.1 Thyroid cancer (TC), which may develop from TNs, is predicted to become the fourth most common type of cancer worldwide.5 China, the USA and India have the highest incidence of TC in the world,5 6 which imposes a tremendous medical and financial burden on society; the cost of well-identified TC care in the USA alone has reached $3.5 billion.7 Differentiating malignant from benign nodules plays a crucial role in the management of thyroid diseases, and early identification of malignant TNs is regarded as a clinical consensus.8 Therefore, clinicians and researchers need to optimise the diagnostic and therapeutic process for screening malignant TNs.

Currently, US, fine needle aspiration biopsy (FNAB) examination, frozen section (FS), molecular detection and surgery,9 alone or in combination, are often used to distinguish benign from malignant TNs. However, each of these techniques has clear limitations. For instance, US diagnosis of thyroid diseases can be influenced by the clinical experience of physicians,10 while the clinical value and necessity of FNAB and FS remain controversial because they are complex and time-consuming.11 12 In addition, molecular testing is still not easily accessible in many countries such as China,13 which necessitates the development of a diagnostic tool that can use data from existing diagnostic tests to improve prediction of the benign or malignant risk of TNs.14

As artificial intelligence (AI) has been widely used in analysing healthcare data,15 machine learning (ML) models are now adopted to predict various diseases by learning the embedded relationships between features. Researchers worldwide have reported a variety of AI applications in the field of thyroid disease,16 but most of them focus on the prediction of TNs using US, radiology and genomics data.17–20 ML algorithms, however, have not been adequately used to predict malignant TNs from numerical thyroid function (TF) data, which are among the most common, readily available and cost-effective data in today’s clinical practice.

TF indicators have low sensitivity to malignant TNs, and some malignant TNs may not affect TF at all. The accuracy of TF-based prediction can also be affected by a variety of factors such as concurrent disease and medication. However, an increasing number of studies have shown that TF is significantly correlated with TC.21 For some patients, the chance of success for preferred treatment protocols would be greatly impacted.22 Other studies suggest that patients would feel more satisfied when treatments are given with consideration of their preferences.23

This study aims to investigate the utility of numerical TF data obtained during initial consultations of patients with TNs to train an optimised ML model to predict the occurrence of malignant TNs. The proposed predictive tool is designed to assist clinical decision-making by providing patients with valuable prognostic information, potentially enhancing the successful treatment rates through timely medical intervention and facilitating informed shared decision-making regarding the optimal treatment strategies in the current medical management system.24 Participants’ consents are unnecessary, since the data we utilised in this study were anonymous and all patients’ information was deidentified at the point of collection.

Methods

Participants

A total of 1649 cases with TNs were collected from January 2018 to June 2022 in the Endocrinology Department of Hubei Provincial Hospital of Traditional Chinese Medicine. Among them, 1096 patients (150 males and 946 females) were diagnosed with benign TNs, with a mean age of 45.31±13.34 years, while the other 553 (123 males and 430 females) had pathologically confirmed TC, with a mean age of 46.35±14.08 years.

Inclusion criteria

The included cases are those conforming to TNs and TC diagnostic criteria designated in the 2015 American Thyroid Association Management Guidelines for Adult Patients with Thyroid Nodules and Differentiated Thyroid Cancer.9

Exclusion criteria

Excluded cases were pregnant or lactating patients, patients with acute infections, immune system diseases or neurological diseases, and patients without follow-up information. This study strictly observed the Declaration of Helsinki, and all patients’ information was kept confidential.

Clinical data collection

Clinical data of patients were collected by well-trained professional medical workers from the Information Department of Hubei Provincial Hospital of Traditional Chinese Medicine, including demographic information (gender and age), TF profiles (five biochemical parameters), thyroid US findings, and surgical pathology reports.

Patient and public involvement statement

None.

Data processing

Data validity assessment

The original data were cross-checked by two workers, and researchers then randomly selected 5% of the processed data for comparison.

Data processing and cleansing

In the classification framework, patients with pathologically confirmed TC were designated as positive cases, whereas those with TNs but without malignant tumours were classified as negative cases. Regarding the evaluation of model performance, when a single predictive model generated multiple area under the curve (AUC) values (ie, multiple test results derived from the same model), the result that demonstrated the best model performance was selected.

Machine learning (ML) training

The seven types of ML algorithms employed for data modelling were Random Forest (RF), Decision Tree, Logistic Regression (LR), K-Neighbours (k-nearest neighbours, k-NN), Gaussian Naive Bayes (GNB), Multilayer Perceptron (MLP) and Gradient Boosting. RF is an ensemble learning method that performs classification or regression tasks by constructing multiple Decision Trees and aggregating their outputs to improve accuracy and reduce overfitting. Decision Tree is an interpretable classification and regression method that partitions data into subsets based on feature values through a series of binary decisions, typically represented as ‘yes’ or ‘no’ questions, to assign data points to specific categories or predicted numerical values. LR is a statistical method for binary or multiclass classification that models class probabilities with a logistic function, with coefficients estimated by maximising the likelihood of the observed data. k-NN is a non-parametric classification and regression method that predicts outcomes by measuring the similarity between instances with a distance metric, such as Euclidean distance, and assigning each instance the label chosen by majority vote among its k nearest neighbours. GNB is a probabilistic classifier based on Bayes’ theorem with the assumption of feature independence, which simplifies computations while maintaining reasonable performance for many applications. MLP is a type of feedforward artificial neural network consisting of an input layer, one or more hidden layers and an output layer, where information propagates through fully connected layers to learn complex patterns in the data. Gradient Boosting is an ensemble learning algorithm that iteratively adds weak prediction models, typically Decision Trees, to minimise a loss function and enhance predictive performance.

To evaluate the predictive ability of each model, the receiver operating characteristic (ROC) curve was plotted,25 and accuracy, AUC, precision, recall and F1 score were used to compare the performance of the ML algorithms. To reduce the impact of data partitioning on model performance, fivefold cross-validation was used for each ML model.26 Finally, the importance of each feature in each model was explored.27
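As an illustration of how the threshold-based metrics named above relate to one another, the following minimal sketch computes accuracy, precision, recall and F1 score from the four cells of a confusion matrix, with TC as the positive class. The class name `ConfusionCounts`, the function name and the counts themselves are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # malignant nodules predicted as malignant
    fp: int  # benign nodules predicted as malignant
    fn: int  # malignant nodules predicted as benign
    tn: int  # benign nodules predicted as benign


def classification_metrics(c: ConfusionCounts) -> dict:
    """Return accuracy, precision, recall and F1 for the positive (TC) class."""
    total = c.tp + c.fp + c.fn + c.tn
    accuracy = (c.tp + c.tn) / total
    # Guard against empty denominators (no predicted or no actual positives).
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for one held-out fold (illustrative only).
counts = ConfusionCounts(tp=60, fp=15, fn=50, tn=205)
metrics = classification_metrics(counts)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the benign class is roughly twice as large as the TC class in this cohort, accuracy alone can look favourable even when recall for TC is modest, which is one reason to report all of these metrics together.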

To optimise the performance of ML models, hyperparameters should be taken into consideration. In our study, the employed ML algorithms have their own specific hyperparameters, which are detailed as follows:

For the RF algorithm, 100 trees with the Gini criterion were used. The minimum number of samples required to split an internal node was set to 2, and the minimum number of samples required at a leaf node was set to 1. For the Decision Tree, the Gini criterion was also used to select the best split at each node, and the minimum number of samples required to split an internal node was 2. The LR model was configured with an L2 norm penalty and a stopping tolerance of 1e-4. The K-Neighbours algorithm was set with five neighbours, with all points in each neighbourhood weighted equally, and the Minkowski distance with p=2 (ie, Euclidean distance) was used as the distance metric. For the GNB, the portion of the largest variance of all features added to variances for calculation stability was set to 1e-9. The MLP model used the ReLU activation function for the hidden layer, the Adam optimiser, an L2 regularisation strength of 0.0001 and a constant learning rate of 0.001. The Gradient Boosting algorithm used the deviance (log-loss) loss function, which reduces to binomial deviance for this two-class problem, with a learning rate of 0.1 and 100 boosting stages. All these algorithms were implemented using the Scikit-Learn library in Python.
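The configuration described above can be sketched in Scikit-Learn as follows. This is a minimal illustration under stated assumptions, not the study’s code: the data are synthetic (generated with `make_classification`, with seven features standing in for gender, age and the five TF parameters), and the stratified, shuffled folds and the `random_state` values are assumptions added for reproducibility. The L2 penalty for LR and the log-loss for Gradient Boosting are the library defaults and are therefore left implicit.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cohort: 7 features, about one third positives.
X, y = make_classification(n_samples=600, n_features=7, n_informative=4,
                           n_redundant=0, weights=[0.66, 0.34], random_state=0)

# Hyperparameters as stated in the text; random_state is an assumption.
models = {
    "Random Forest": RandomForestClassifier(
        n_estimators=100, criterion="gini", min_samples_split=2,
        min_samples_leaf=1, random_state=0),
    "Decision Tree": DecisionTreeClassifier(
        criterion="gini", min_samples_split=2, random_state=0),
    # L2 penalty is the library default.
    "Logistic Regression": LogisticRegression(tol=1e-4),
    "K-Neighbours": KNeighborsClassifier(
        n_neighbors=5, weights="uniform", metric="minkowski", p=2),
    "Gaussian Naive Bayes": GaussianNB(var_smoothing=1e-9),
    "MLP": MLPClassifier(
        activation="relu", solver="adam", alpha=1e-4,
        learning_rate="constant", learning_rate_init=1e-3, random_state=0),
    # Log-loss (deviance) is the library default loss.
    "Gradient Boosting": GradientBoostingClassifier(
        learning_rate=0.1, n_estimators=100, random_state=0),
}

# Fivefold cross-validation; stratification and shuffling are assumptions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["roc_auc", "accuracy", "precision", "recall", "f1"]

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}

for name, row in results.items():
    print(name, {m: round(v, 3) for m, v in row.items()})
```

The scores printed here describe the synthetic data only and say nothing about the performance reported in table 3.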

Statistical analysis

Count data were described as the number of cases (%) and compared between groups with the χ2 test. Measurement data conforming to a normal distribution were expressed as mean±SD and compared with the t-test; those not conforming to a normal distribution were expressed as the median (M) and IQR and compared with the Mann-Whitney U test. Variables showing statistical differences in the univariate analysis were entered into the LR analysis for further screening. The influence of each factor was assessed by calculating the OR of each variable. All statistical tests were two-tailed, and p<0.05 was considered statistically significant.
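To make the χ2 test and the OR concrete, the sketch below applies both to the gender counts given in the Participants section (TC: 123 males, 430 females; benign TNs: 150 males, 946 females). It is a crude, unadjusted illustration using the standard 2×2 formulas without continuity correction, so its output need not match the estimates in table 2, which come from the study’s own LR analysis. The function names are hypothetical.

```python
import math

# 2x2 table from the Participants section.
tc_male, tc_female = 123, 430          # pathologically confirmed TC
benign_male, benign_female = 150, 946  # benign TNs


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple:
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def odds_ratio_with_ci(a: int, b: int, c: int, d: int, z: float = 1.96) -> tuple:
    """Crude OR with a Wald 95% CI computed on the log scale."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


chi2, p_value = chi_square_2x2(tc_male, tc_female, benign_male, benign_female)
odds_ratio, ci_low, ci_high = odds_ratio_with_ci(
    tc_male, tc_female, benign_male, benign_female)

print(f"chi2 = {chi2:.2f}, p = {p_value:.2g}")
print(f"crude OR (male vs female, for TC) = {odds_ratio:.2f} "
      f"(95% CI {ci_low:.2f} to {ci_high:.2f})")
```

A multivariate LR analysis, as used in the study, would additionally adjust this association for age and the TF parameters.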

Results

Demographic characteristics

A total of 1649 patients were included in the study, of whom 553 had TC confirmed by postsurgical pathological findings and 1096 were randomly selected patients with benign TNs. Patients’ ages ranged from 9 to 89 years, with a mean of 45.15±13.41 years, and the male to female ratio was 1:5.055. Details are shown in table 1.

Table 1

Baseline distribution of 1649 patients

Univariate and multivariate LR analysis of benign and malignant TNs

In the univariate LR analysis, statistically significant differences existed between the two groups in age, gender, FT3, FT4, thyroglobulin antibody (TGAB) and thyroid peroxidase antibody (TPOAB) (p<0.05). In the multivariate LR analysis, there were statistically significant differences between the two groups in age, gender, FT3, FT4 and TPOAB (p<0.05). Details are shown in table 2.

Table 2

Univariate and multivariate LR analysis of the influence on benign and malignant TNs

Development and validation of the predictive model

Seven distinct ML algorithms were constructed to develop predictive models, with precision scores ranging from 0.571 to 0.814. Comparative analysis revealed that the Gradient Boosting model outperformed other algorithms across multiple evaluation metrics, including precision, AUC, accuracy, recall and F1 score, as detailed in table 3. Based on its superior performance, the Gradient Boosting algorithm was consequently selected as the optimal predictive model. The corresponding ROC curves, illustrating the performance characteristics of all developed models, are presented in figure 1.

Table 3

Comparative validation of the predictive performance of seven ML algorithms

Figure 1

ROC curve of the ML models. (A) Decision Tree; (B) Gaussian Naive Bayes; (C) Gradient Boosting; (D) K-Neighbours; (E) Logistic Regression; (F) MLP; (G) Random Forest. ML, machine learning; MLP, Multilayer Perceptron; ROC, receiver operating characteristic.
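The AUC summarised by such ROC curves has a simple interpretation: it is the probability that a randomly chosen malignant case receives a higher predicted score than a randomly chosen benign case, with ties counted as one half. The sketch below computes it directly from that definition. The labels and scores are hypothetical, and the function name `roc_auc` is illustrative rather than a library call.

```python
def roc_auc(labels: list, scores: list) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("Both classes must be present to compute the AUC.")
    correctly_ranked = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                correctly_ranked += 1.0
            elif pos_score == neg_score:
                correctly_ranked += 0.5  # ties count as half
    return correctly_ranked / (len(positives) * len(negatives))


# Hypothetical predicted probabilities of malignancy (1 = TC, 0 = benign).
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.35, 0.4, 0.3, 0.2, 0.1, 0.6]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.4f}")  # 15 of 16 pairs are ranked correctly: 0.9375
```

This pairwise form is quadratic in the sample size and is meant for exposition; library implementations obtain the same value by integrating the ROC curve.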

Relative importance of features in the ML algorithms

Figure 2 illustrates the variable importance profiles across the different ML algorithms for distinguishing between benign and malignant TNs. The feature importance rankings were broadly consistent. While minor variations in variable importance were observed among the algorithms, the thyroid-related biomarkers FT4, FT3 and TPOAB consistently demonstrated high predictive importance across all the models. Notably, in the optimal Gradient Boosting model, FT4, TPOAB and FT3 were the three top-ranked features, in that order.

Figure 2

Importance ranking of features in the ML models. (A) Decision Tree; (B) Gaussian Naive Bayes; (C) Gradient Boosting; (D) K-Neighbours; (E) Logistic Regression; (F) MLP; (G) Random Forest. FT3, free triiodothyronine; FT4, free thyroxine; ML, machine learning; MLP, Multilayer Perceptron; TGAB, thyroglobulin antibody; TPOAB, thyroid peroxidase antibody; TSH, thyroid stimulating hormone.
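The article does not state how feature importance was computed for each algorithm, so the following is only one plausible approach, sketched on synthetic data. For tree ensembles such as Gradient Boosting, Scikit-Learn exposes impurity-based importances through `feature_importances_`; for models without such an attribute (eg, K-Neighbours or GNB), permutation importance is a model-agnostic alternative. The synthetic labels below are deliberately driven by the columns named FT4 and TPOAB, so the resulting ranking is an artefact of the simulation and not a finding.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

feature_names = ["gender", "age", "FT3", "FT4", "TSH", "TGAB", "TPOAB"]

# Synthetic data: the label depends only on the FT4 and TPOAB columns
# (an arbitrary choice made for this demonstration).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, len(feature_names)))
signal = 1.5 * X[:, 3] + 1.0 * X[:, 6]
y = (signal + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = GradientBoostingClassifier(
    learning_rate=0.1, n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Impurity-based importances (normalised to sum to 1).
impurity_rank = sorted(zip(feature_names, model.feature_importances_),
                       key=lambda item: item[1], reverse=True)

# Permutation importance on held-out data: mean drop in accuracy
# when one feature column is shuffled.
perm = permutation_importance(model, X_test, y_test,
                              n_repeats=10, random_state=0)
perm_rank = sorted(zip(feature_names, perm.importances_mean),
                   key=lambda item: item[1], reverse=True)

print("Impurity-based:", [(n, round(v, 3)) for n, v in impurity_rank[:3]])
print("Permutation:   ", [(n, round(v, 3)) for n, v in perm_rank[:3]])
```

Impurity-based and permutation rankings can disagree when features are correlated, so comparing rankings across algorithms is most meaningful when a single method is used for all of them.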

Discussions

In this study, seven ML algorithms were established for differentiating benign and malignant TNs using TF profiles from patients with TNs. Through the comprehensive analysis, the Gradient Boosting algorithm was identified as the most robust predictive model, demonstrating superior accuracy and reliability in TC prediction. Furthermore, an in-depth feature importance analysis was conducted to elucidate the relative contribution of various clinical parameters within the optimal model.

Previous studies on ML-based prediction of malignant TNs have predominantly focused on imaging and histological characteristics, including morphological features, nodule dimensions and textural patterns as primary predictive indicators.18 This preference stems from the perceived limited diagnostic utility of TF indexes, which are often considered less reliable due to their susceptibility to various confounding factors. However, emerging evidence has increasingly demonstrated a significant association between TF profiles and TC risk.28 Previous epidemiological studies have shown that thyroid dysfunction is associated with elevated TC risk,29 while specific thyroid hormone levels show distinct correlation patterns: FT3 exhibits an inverse relationship with TC risk,30 and elevated FT4 levels are positively correlated with TC incidence. These results underscore the clinical relevance of our study, providing valuable insights for enhancing patient counselling and clinical decision-making. By providing more accurate risk stratification, our findings may inform critical decisions regarding therapeutic approaches, potentially guiding the choice between pharmacological intervention and surgical management.31

The ML models in this study demonstrated predictive capability for malignant TNs by leveraging TF data, offering a more direct and objective assessment of patients’ conditions. This approach may provide an objective basis to support the clinical diagnosis of thyroid disorders. In countries with a high incidence of TC, such as China and India, prediction based on readily available and affordable indicators can reduce expenditure on further medical examinations, thereby mitigating the financial burden on society. Given that evidence-based consultations have been shown to yield better outcomes when followed by shared decision-making and the application of patients’ preferred treatment options,23 31 this personalised and patient-oriented predictive model has the potential to enhance treatment efficacy using simple parameters.

To minimise the impact of confounding factors on the TF data, this study collected and analysed the TF data from patients’ initial visits. Multivariate LR analysis revealed significant differences in age, gender, FT3, FT4 and TPOAB between the two groups, identifying these variables as independent factors for differentiating benign from malignant TNs. Previous studies have found that the incidence of TC is age-dependent, with a notably higher prevalence among women compared with men.32 In our study, the TC group exhibited a younger age distribution than the TN group, potentially attributable to earlier detection and intervention in TC cases. While some studies have linked lower levels of FT3 and FT4 to an increased risk of TC,33 others have found no significant association28 or even reported an inverse correlation between elevated FT3 levels and TC risk.34 Furthermore, the relationship between FT4 levels and TC appears to be non-linear, with a marked increase in TC risk observed when FT4 levels exceed approximately 2.2 ng/dL.35 Our findings align with certain prior studies, underscoring the intricate relationship between TNs, TF and TC.36 Although TPOAB is primarily used for diagnosing Hashimoto’s thyroiditis and Graves’ disease,37 two fine needle aspiration cytology (FNAC) studies from the same institution have established TPOAB as an independent risk factor for malignant thyroid tumours, highlighting a critical association between TPOAB and TC.38 This conclusion is consistent with the results of our study.

This study has several advantages over previous studies in predicting the risk of malignant TNs. First, while significant differences in the TF data have been observed between patients with benign TNs and TC, few studies have focused on TF as a predictive marker. In contrast, our study leveraged the TF data routinely collected and readily accessible in clinical practice. Furthermore, although ML methods have shown diagnostic potential in distinguishing benign from malignant TNs using genomics, radiological and US data, few studies have employed ML algorithms with TF as a key feature. To the best of our knowledge, this study is among the first to develop an ML-based predictive model using TF data for risk assessment of TNs, with the best model achieving an AUC of 0.82 and an accuracy of 79.4%.

However, this study has several limitations. First, all data in the study were collected from a single centre, and the findings lack validation using external data sets. Second, the retrospective design of this work may introduce selection and information bias, potentially affecting the robustness of the results. Third, the model lacks a graphical user interface. These factors should be addressed in future studies to improve the generalisability and precision of the findings.

Conclusions

This study innovatively developed a predictive model for benign and malignant TNs based on the gradient boosting decision tree algorithm. For the first time, it validated the clinical predictive value of TF parameters (FT4, FT3) and TPOAB as key biomarkers. By leveraging ML interpretability techniques, the dose–response relationship between these indicators and the malignant risk of nodules was elucidated, providing a quantitative basis for early TC screening. This study will facilitate the transition of TN diagnosis and treatment from static assessment to a dynamic intelligent decision-making framework, offering a novel paradigm for the application of precision medicine in endocrine tumour management.

Ethics statements

Patient consent for publication

Ethics approval

This study was reviewed and approved by the Ethics Committee of Hubei Provincial Hospital of Traditional Chinese Medicine (approval no: HBZY2023-C27-02). All methods were carried out in accordance with relevant guidelines and regulations.

Acknowledgments

We express our heartfelt thanks to Director Qing Liu and Director Wei Yan from Hubei Hospital of Traditional Chinese Medicine for their constructive suggestions in the writing of this paper.

References

Footnotes

  • FM, FY and SL contributed equally.

  • Contributors NX, LLZ and H-RM conceptualised and designed the study, and NX has full access to all the data used in the study. FQM and H-RM wrote the article, FCY performed the machine learning and SHL performed the statistical analysis. LHZ, ZLL and QZ collected and analysed the clinical data. All authors revised the manuscript and approved it before submission. FQM, FCY and SHL contributed equally to this work. NX is responsible for the overall content as guarantor.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting or dissemination plans of this research.

  • Provenance and peer review Not commissioned; externally peer reviewed.