Abstract
Objective This study aimed to propose a data-driven framework for classifying people at risk of cardiovascular outcomes with regard to obesity and metabolic syndrome.
Design A population-based prospective cohort study with a long-term follow-up.
Setting Data from the Tehran Lipid and Glucose Study (TLGS) were interrogated.
Participants 12 808 participants of the TLGS cohort, aged ≥20 years, who were followed for over 15 years were assessed.
Main outcome measures Data for 12 808 participants aged ≥20 years who were followed for over 15 years, collected through the TLGS as a prospective, population-based cohort study, were analysed. Feature engineering followed by hierarchical clustering was used to determine meaningful clusters and novel endophenotypes. Cox regression was used to demonstrate the clinical validity of phenomapping. The performance of the endophenotype classification compared with traditional classifications was evaluated using the Akaike information criterion and the Bayesian information criterion. R software V.4.2 was employed.
Results The mean age was 42.1±14.9 years, 56.2% were female, and 13.1%, 2.8% and 6.2% had experienced cardiovascular disease (CVD), CVD mortality and hard CVD, respectively. The low-risk cluster differed significantly from the high-risk cluster in age, body mass index, waist-to-hip ratio, 2-hour post-load plasma glucose, triglyceride, triglyceride to high-density lipoprotein ratio, education, marital status, smoking and the presence of metabolic syndrome. Eight distinct endophenotypes were detected with significantly different clinical characteristics and outcomes.
Conclusion Phenomapping resulted in a novel classification of the population with respect to cardiovascular outcomes, which can better stratify individuals into homogeneous subclasses for prevention and intervention as an alternative to traditional methods based solely on either obesity or metabolic status. These findings have important clinical implications for a particular part of the Middle Eastern population for which it is a common practice to use tools/evidence derived from western populations with substantially different backgrounds and risk profiles.
- epidemiology
- cardiac epidemiology
- lipid disorders
- obesity
Data availability statement
Data are available upon reasonable request.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
STRENGTHS AND LIMITATIONS OF THIS STUDY
Data from a large cohort study with over 15 years of follow-up were analysed.
Validity of extracted clusters, identified by phenomapping, was tested using survival analysis.
We employed a machine learning method for classification of cardiovascular outcomes.
The method is sophisticated and needs high skill for application.
Cardiovascular disease-related diseases were not excluded at baseline.
Introduction
Nearly one-third of the world’s population is considered obese.1 Moreover, in many countries, the prevalence of obesity is rapidly increasing, and this creates a huge burden on individuals, societies and the health system. In short, one could call obesity a modern global epidemic.2 The global burden of cardiovascular disease (CVD) has almost doubled in recent decades, attributable in large part to obesity. Obesity is a heterogeneous disorder3 and a leading risk factor for several diseases, which heavily affects many aspects of an individual’s life.4 For example, obesity is a significant risk factor for diabetes, high blood pressure and CVD.5–9 Although numerous studies have shown a strong positive correlation between overweight and obesity and heart diseases, other studies argue that the heterogeneity among these relationships may be due to other risk factors such as age, sex, race and metabolic abnormalities as well as the effect of confounding variables, interactions or comorbidities.7 10 In addition, the significant heterogeneity of overweight and obese populations is directly associated with the heterogeneous participation of molecules, genes and cells, in addition to environmental, social and economic factors. Moreover, the inability to correctly classify patients by considering the existing heterogeneity among obese subjects may limit the effectiveness of therapeutic interventions intended by clinical experts. So far, limited studies have been conducted on the heterogeneity of obese people at the population level.11 The definition of obesity has evolved in the recent decade to improve the classification of populations at risk of future adverse outcomes, including CVD, from simple anthropometric indices to more complex classifications, that is, metabolically healthy obese (MHO) or metabolically abnormal obese.
This significant shift in distinguishing subjects is largely associated with our better understanding of adipose tissue function, biomarkers and the inherent heterogeneity of obesity due to unrecognised interactions of the mentioned factors in molecular and genetic studies. Therefore, a description of subjects grounded merely on anthropometric indices or metabolic profile would be a simplified solution for a heterogeneous entity. This approach has already begun in the classification of the most complex diseases, such as heart failure and Alzheimer’s disease, in recent publications.12 13
The problem of unresolved heterogeneity is not unique to medicine; it appears often in various fields such as document classification and image processing. Shah et al proposed machine learning, which is defined as the process of using data to learn relationships between objects, as an ideal candidate to tackle this problem.14 Machine learning methods are typically divided into two main categories: supervised and unsupervised. Supervised learning seeks to predict specified outputs or outcomes, while unsupervised learning (eg, clustering) aims to learn the intrinsic structure within data. As an example of unsupervised learning, a recent study14 used a clustering method and phenomapping to determine the best subgroups, particularly in heterogeneous disorders. In the clustering approach, people can be divided into groups that may be different from their initial phenotypes. This method can be used for drug optimisation, behavioural changes to improve health, disease prediction and prognosis and identification of disease biomarkers.15 Therefore, one could argue that clustering of study participants can be applied to investigate the heterogeneous factors.16
In light of recent advances in machine learning, the purpose of this study was to categorise the study population in a way that supports more effective treatments and interventions. This classification is based on predicting the incidence of, or death due to, cardiovascular disease through the examination of direct and indirect factors. In this research, a different categorisation of people is presented. Then, the survival time of the determined clusters until the occurrence of cardiovascular disease or death is investigated. Using this framework, one could identify people at risk of obesity and predict the occurrence of cardiac death or complications. The results can be used for appropriate and effective interventions.
Methods
Study design and population
This research analysed data collected through the Tehran Lipid and Glucose Study (TLGS), which is a prospective, population-based cohort study to determine the risk factors for non-communicable diseases among a representative Tehran urban population.
A detailed description of the TLGS has been reported elsewhere.17 Data collection is ongoing, with follow-up examinations at about 3-year intervals. In our study, 12 808 participants aged ≥20 years from phase I (between February 1999 and August 2001) and phase II (between 2002 and 2005) who were followed up for over 15 years were selected.
Patient and public involvement
In our study, no patient was involved.
Data collection
To begin, all subjects were interviewed by trained interviewers using pretested questionnaires. Information on age, gender, medical history of CVD, medication use, smoking habit, physical examination and family history of premature coronary artery disease was also collected. Anthropometric measures included weight, height and waist and hip circumference. Systolic blood pressure (SBP) and diastolic blood pressure (DBP) were measured twice in a seated position after a 15-minute rest period using a standard mercury sphygmomanometer.18 19 A blood sample was taken after a 12–14-hour overnight fast. All blood analyses were done at the TLGS research laboratory on the day of blood collection. Two-hour post-load plasma glucose (2h-PLPG) was measured in participants without treated diabetes, along with plasma glucose and serum lipids, as previously described.20
Details of the collection of CVD outcome data have been published before.19 Coronary heart disease included cases of definite myocardial infarction (diagnostic electrocardiographic results and biomarkers), probable myocardial infarction (positive electrocardiographic findings plus cardiac symptoms or signs plus missing biomarkers or positive electrocardiographic findings plus equivocal biomarkers), angiographically proven coronary heart disease and death due to coronary heart disease. CVD was defined as any coronary heart disease event, stroke (a new neurological deficit that lasted 24 hours) or CVD death; hard CVD was defined as a composite of death from CVD, non-fatal myocardial infarction or non-fatal stroke. People who had the disease at entry were not included in this study.
Definition of overweight and obesity
For adults, overweight was defined as a body mass index (BMI) in the range 25.0–29.9 kg/m2 and obesity as BMI ≥30.0 kg/m2. Metabolic syndrome was defined as the presence of three or more of these criteria: fasting plasma glucose (FPG) level ≥100 mg/dL (5.6 mmol/L) or previously diagnosed diabetes mellitus (DM); triglyceride (TG) level ≥150 mg/dL (1.7 mmol/L) or being on treatment medications; high-density lipoprotein cholesterol (HDL-c) level <50 mg/dL (1.3 mmol/L) or the use of cholesterol-lowering drugs; SBP ≥130 mm Hg and/or DBP ≥85 mm Hg or use of medications; and a waist circumference >90 cm.
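The definitions above can be expressed as a small rule set. The following is a minimal Python sketch, not the authors' code (the study reports using R); the `Participant` class, its field names and the three treatment flags are hypothetical names introduced only for illustration, while the thresholds are those stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    bmi: float                      # body mass index, kg/m2
    fpg: float                      # fasting plasma glucose, mg/dL
    tg: float                       # triglycerides, mg/dL
    hdl: float                      # HDL cholesterol, mg/dL
    sbp: float                      # systolic blood pressure, mm Hg
    dbp: float                      # diastolic blood pressure, mm Hg
    waist: float                    # waist circumference, cm
    diabetes: bool = False          # previously diagnosed DM
    tg_treatment: bool = False      # hypothetical flag: on TG-lowering treatment
    cholesterol_drug: bool = False  # hypothetical flag: cholesterol-lowering drug
    bp_medication: bool = False     # hypothetical flag: antihypertensive medication


def bmi_category(bmi: float) -> str:
    """Overweight: BMI 25.0-29.9 kg/m2; obese: BMI >= 30.0 kg/m2."""
    if bmi >= 30.0:
        return "obese"
    if bmi >= 25.0:
        return "overweight"
    return "not overweight"


def metabolic_criteria(p: Participant) -> list:
    """One boolean per criterion, in the order listed in the text."""
    return [
        p.fpg >= 100 or p.diabetes,
        p.tg >= 150 or p.tg_treatment,
        p.hdl < 50 or p.cholesterol_drug,
        p.sbp >= 130 or p.dbp >= 85 or p.bp_medication,
        p.waist > 90,
    ]


def has_metabolic_syndrome(p: Participant) -> bool:
    """Metabolic syndrome: three or more of the five criteria."""
    return sum(metabolic_criteria(p)) >= 3


# Illustrative values only, not study data.
example = Participant(bmi=27.4, fpg=110, tg=160, hdl=55, sbp=120, dbp=80, waist=95)
print(bmi_category(example.bmi), has_metabolic_syndrome(example))
```

Returning the individual criteria as a list keeps the count of abnormal factors available for later steps, which matters because the endophenotypes are later described in terms of a number of metabolic abnormalities.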
Framework
Exploratory data analysis
The primary purpose of this analysis was to gain a deep understanding of our data set and its underlying structure. In exploratory data analysis, by summarising and visualising the data, we extracted informative graphs, possible initial baseline models, a list of outliers, a ranked list of key features and many more relevant insights.21 We used the mice R package for handling missing data.22
Feature engineering
Feature engineering is a process of using domain knowledge and/or special relevant techniques to extract informative features from raw (ie, initial) data. Using informative features can reduce the model complexity and improve its performance and interpretability, so that the system is capable of delivering high-quality results.23 In our framework, one-hot encoding was used for encoding the discrete variables.24 We used the Z-score transformation for continuous data so that variables with different units and ranges, which were not comparable beforehand, could be compared after normalisation. To alleviate the curse of dimensionality, we used feature selection and feature extraction techniques. For feature selection, the correlation coefficient was used to select a set of informative features from an existing set. We used the polycor R package to first measure the correlation between variables and then to select a small subset (eg, one feature) from each set of correlated variables. We also leveraged domain knowledge and previous studies for extracting relevant features; as a simple example, we used BMI as one feature instead of weight and height.24–26
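The three preprocessing steps described above (Z-score scaling, one-hot encoding and correlation-based feature selection) can be illustrated with a short Python sketch. It is an illustration under stated assumptions rather than the pipeline used in the study: plain Pearson correlation stands in for the correlations computed with the polycor R package, the greedy keep-first rule in `drop_correlated` is an assumed selection strategy, the 0.8 threshold mirrors the r>0.8 cut-off mentioned in the Results, and the toy values are invented.

```python
import math
from statistics import mean, stdev


def z_score(values):
    """Centre on the mean and scale by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def one_hot(values):
    """Map a discrete variable to one 0/1 indicator column per level."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}


def pearson(x, y):
    """Pearson correlation (an assumption; polycor also covers ordinal data)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(features, threshold=0.8):
    """Greedy filter: keep a feature only if its absolute correlation with
    every feature already kept does not exceed the threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy values only, not study data.
toy = {
    "bmi": [21, 24, 28, 31, 35],
    "weight": [60, 70, 80, 90, 100],
    "sbp": [130, 110, 125, 115, 120],
}
selected = drop_correlated(toy)           # weight is dropped in favour of bmi
scaled = {name: z_score(toy[name]) for name in selected}
smoking = one_hot(["never", "current", "never"])
```

Because the filter keeps the first feature of each correlated set, the order in which features are listed encodes a preference; listing BMI before weight reflects the domain choice described in the text.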
Clustering
We used agglomerative hierarchical clustering, an unsupervised machine learning approach, with hclust in R27 to classify participants and phenotypic variables. For quantitative evaluation of various clustering settings, we employed a popular heuristic approach called the elbow method28 to determine an optimal number of clusters for our dataset. For qualitative evaluation, various visualisation methods, such as dendrograms and heatmaps, were employed.29–32 We then leveraged the domain knowledge of subject matter experts (clinicians in the hospital) to check whether each cluster has a unique and meaningful characteristic.
Various analyses were conducted on the clustered data, such as the extracted endophenotype groups. Analysis of variance was used to evaluate differences in continuous variables. Differences in categorical variables were assessed using the χ2 test.
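To make the clustering step concrete, the sketch below implements bottom-up hierarchical clustering and the within-cluster sum of squares that the elbow method inspects. It is a minimal pure-Python illustration, not the hclust routine used in the study; average linkage and Euclidean distance are assumptions, since the linkage and distance settings are not stated in the text, and the six points are invented.

```python
import math


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(points, n_clusters):
    """Average-linkage agglomerative clustering.
    Starts with one cluster per point and repeatedly merges the closest pair.
    Returns a list of clusters, each a list of point indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = sum(euclid(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def within_ss(points, clusters):
    """Total within-cluster sum of squared distances to each centroid."""
    dim = len(points[0])
    total = 0.0
    for members in clusters:
        centroid = [sum(points[m][d] for m in members) / len(members)
                    for d in range(dim)]
        total += sum(euclid(points[m], centroid) ** 2 for m in members)
    return total


# Toy, already standardised points forming two obvious groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 4.9)]
two = agglomerate(points, 2)
wss = {k: within_ss(points, agglomerate(points, k)) for k in range(1, 5)}
for k in sorted(wss):
    print(k, round(wss[k], 3))
```

In the elbow method, the number of clusters is chosen where the curve of `wss` against `k` stops dropping sharply; on the toy points the drop from one to two clusters dominates. The quadratic search over cluster pairs is acceptable for a sketch but would be slow on a cohort of thousands of participants.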
Survival analysis
A Cox proportional hazards model was used to validate the extracted clusters against cardiovascular events, defined as cardiovascular disease, CVD mortality and hard CVD.33 In all analyses, p<0.05 was considered statistically significant.
Results
The present study analysed the data for 12 808 participants aged 20–90 years collected between 1999 and 2018. They were followed up for over 15 years in terms of cardiovascular outcomes. Among them, 7197 (56.2%) were women and 5611 (43.8%) were men, with a mean age of 42.1±14.9 years. Cardiovascular disease occurred in 1660 (13.1%) participants, 683 (9.5%) females and 977 (17.4%) males. A total of 359 (2.8%) individuals experienced CVD mortality and 790 (6.2%) hard CVD (table 1).
Table 1 Summary of baseline characteristics and procedural features
Extracted endophenotypes
As a result of this study, we have extracted the following eight endophenotypes (table 2).
Table 2 Definition of endophenotype groups
As shown in figure 1, we discovered a few sets of highly correlated endophenotypes (r>0.8). We further examined our selection with a domain expert. As a result of this process, we selected 20 endophenotypes for use in the clustering analysis. The visualisation represents the correlation between endophenotypes, where red shows high correlation and blue represents low correlation between variables.
Figure 1 Endophenotype Heatmap (EndophenoMap) of obesity: a heatmap visualisation which represents the correlation between endophenotypes, where red colour shows high correlation and blue colour represents low correlation.
Based on our clustering result, almost all obese people were assigned to the high-risk group, whereas people with normal weight were categorised in the low-risk group. The most important and effective variables, including age, body mass index, waist to hip ratio, glomerular filtration rate (GFR), 2h-PLPG, SBP, DBP, TG, TG to HDL ratio, smoking status, abortion history and hormone therapy for menstruation, were selected under the supervision of clinical experts to determine endophenotypes.
Cut-offs for the number of abnormal factors, from two to six out of 10 factors, were analysed separately for all three outcomes of the study. Goodness of fit and model performance were assessed using the c-statistic, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). The model with a cut-off of 3 abnormal factors out of 10 factors (waist to hip ratio, GFR, 2h-PLPG, SBP, DBP, TG, TG to HDL ratio, smoking status, abortion history and hormone therapy for menstruation) had a higher c-statistic and lower AIC and BIC values (0.79, 28 402 and 28 440, respectively) than other models (0.77, 28 543 and 28 543, respectively). In brief, the endophenotype model with the lowest AIC and BIC and the highest c-statistic is preferred (table 3), which means that endophenotypes could predict cardiovascular outcomes more precisely than traditional obesity models.
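For readers unfamiliar with the two information criteria, the standard definitions are AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where ln L is the maximised log-likelihood (for a Cox model, the partial log-likelihood), k the number of estimated parameters and n the sample size. The Python sketch below applies these definitions to two hypothetical models; the log-likelihoods, parameter counts and model names are invented for illustration and are not the values behind table 3, and whether n should be the number of participants or the number of events is a choice the text does not state.

```python
import math


def aic(log_lik: float, k: int) -> float:
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_lik


def bic(log_lik: float, k: int, n: int) -> float:
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_lik


def preferred(models: dict, n: int) -> str:
    """Name of the model with the lowest BIC (lower is better)."""
    return min(models, key=lambda name: bic(models[name][0], models[name][1], n))


# Hypothetical (log-likelihood, number of parameters) pairs, not study results.
models = {
    "endophenotype": (-14190.0, 7),
    "traditional": (-14265.0, 3),
}
n = 12808  # assumption: n taken as the number of participants
for name, (ll, k) in models.items():
    print(name, round(aic(ll, k), 1), round(bic(ll, k, n), 1))
print(preferred(models, n))
```

Both criteria reward fit through the log-likelihood and penalise extra parameters, BIC more heavily for large n, which is why a model with more categories is preferred only when its improvement in fit outweighs the penalty.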
Table 3 Comparison of the novel and traditional obesity methods in cardiovascular outcomes
As reported in table 4, among all participants, the HRs for cardiovascular disease, CVD mortality and hard CVD in endophenotype 6 compared with endophenotype 1 (the reference group) were 30.3 (95% CI 21.9 to 41.8), 78.4 (95% CI 31.8 to 193.5) and 53.9 (95% CI 30.5 to 95.3), respectively.
Table 4 HR estimated in Cox model based on cardiovascular outcomes by endophenotypes
In the women subgroup, the HR for CVD was significantly higher in endophenotype 8, at 54.0 (95% CI 29.7 to 98.2) compared with the reference group, and the HR for hard CVD was 59.6 (95% CI 22.1 to 160.2) compared with endophenotype 1 (the reference group). It should be noted that in this subgroup it was not possible to analyse the outcome of cardiovascular death due to the small sample size.
Within the men subgroup, endophenotype 6 compared with endophenotype 1 (the reference group) was associated with HRs (95% CI) for CVD, CVD mortality and hard CVD of 19.6 (13.5 to 28.5), 43.1 (15.6 to 119.0) and 37.9 (19.2 to 75.1), respectively.
Based on the χ2 test, all the results were statistically significant (p<0.001). The assumption of proportionality of hazards was met for all outcomes.
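The relationship between a Cox regression coefficient and the reported HRs can be sketched as follows. In a Cox model the HR for a group relative to the reference is exp(β), and a Wald 95% CI is exp(β ± 1.96 × SE). The Python sketch below assumes that the reported intervals are Wald intervals, symmetric on the log scale (the text does not say how they were computed), and uses that assumption to recover β and SE from the interval reported for CVD in endophenotype 6; the function names are hypothetical.

```python
import math


def hazard_ratio_ci(beta: float, se: float, z: float = 1.96):
    """HR and Wald confidence limits from a Cox coefficient and its SE."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def coef_from_ci(lower: float, upper: float, z: float = 1.96):
    """Recover beta and SE from confidence limits on the HR scale,
    assuming the interval is symmetric around beta on the log scale."""
    beta = (math.log(lower) + math.log(upper)) / 2
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se


# Reported interval for CVD, endophenotype 6 vs endophenotype 1: 21.9 to 41.8.
beta, se = coef_from_ci(21.9, 41.8)
hr, lower, upper = hazard_ratio_ci(beta, se)
print(round(beta, 3), round(se, 3), round(hr, 1))
```

Under this assumption the recovered point estimate is the geometric mean of the two limits, which agrees with the reported HR of 30.3 to within rounding. The wide intervals for CVD mortality reflect larger standard errors on the log scale, as expected when an outcome has fewer events.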
Discussion
Unsupervised data-driven cluster analysis, an exploratory data-mining technique, when applied to the TLGS population cohort with over 15 years of follow-up in our study, identified subgroups with varying anthropometric and metabolic features, both among those with general obesity (BMI >25 kg/m2) and among those with BMI <25 kg/m2. It is now well established that conventional staging of obesity with central or general obesity indices is arguably too simplistic for individual decision-making, which is why further studies have addressed the inherent heterogeneity of this complex disease.
Obesity defined by BMI is a well-known predictor of cardiovascular outcome,34 but findings on long-term outcomes, especially cardiovascular ones, among the obese population from various trials have been conflicting throughout recent decades.35
This approach has been questioned by recent literature suggesting the coexistence of various phenotypes with different cardiovascular risk profiles within the same BMI category.36 Hence, available reports have documented that neither obesity nor metabolic profile categories alone could classify subjects for future cardiovascular outcomes, which are the most significant predictors of mortality and morbidity in all population cohorts.
Contemporary studies employing novel clustering methods rather than clinician-based algorithms emerged originally in patients with asthma or heart failure, who share varying laboratory, physical and clinical features within inherent phenotypes.14 37 Both publications supported the concept of heterogeneity among patients with bronchial asthma and heart failure with preserved ejection fraction and demonstrated that statistical learning algorithms, applied to dense phenotypic data, may allow for improved classification of heterogeneous clinical syndromes. The goal of this type of research is to define therapeutically homogeneous patient subclasses. Although recent studies in this field relied on maximal input data for individual patients rather than on large cohorts of subjects, the development of machine learning algorithms is of paramount importance.
The present study is the first to determine the endophenotypes of subjects regarding cardiovascular outcomes based on the large population-based TLGS cohort.
We presented a data-driven framework using machine learning techniques to select 12 significant variables for endophenotyping. We first put aside two well-established factors (age and BMI), then evaluated goodness of fit of the models based on the 10 remaining variables. The rationale behind this was to select the optimal minimum/maximum number of factors for endophenotyping to achieve a more feasible classification based on routine laboratory and clinical measurements. These variables were divided into eight endophenotypes. The supervised clinical validation of the occurrence of cardiovascular events for endophenotypes was estimated by Cox analysis of each endophenotype for individual outcomes of interest.
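One way to read the eight groups is as the combinations of three binary splits: age, BMI and the number of metabolic abnormalities. The Python sketch below encodes that reading. It is an interpretation, not the definition in table 2: the cut-offs (age >45 years, BMI >25 kg/m2, at least 3 abnormalities) are taken from the descriptions of endophenotypes 6 and 8 in this Discussion, only those two labels are stated in the text, and the function names are hypothetical.

```python
from typing import Optional, Tuple

# Only these two labels are stated in the text; the numbering of the other
# six combinations is given in table 2 and is deliberately not guessed here.
KNOWN_LABELS = {
    (True, False, True): 6,  # age >45 years, BMI <25 kg/m2, >=3 abnormalities
    (True, True, True): 8,   # age >45 years, BMI >25 kg/m2, >=3 abnormalities
}


def split_flags(age: float, bmi: float, n_abnormal: int) -> Tuple[bool, bool, bool]:
    """Three binary splits: (older, higher BMI, many abnormal factors).
    Assumption: values exactly at a cut-off fall in the lower group for
    age and BMI, because the text only gives strict inequalities."""
    return (age > 45, bmi > 25, n_abnormal >= 3)


def endophenotype_label(age: float, bmi: float, n_abnormal: int) -> Optional[int]:
    """Endophenotype number when the text states it, otherwise None."""
    return KNOWN_LABELS.get(split_flags(age, bmi, n_abnormal))


print(endophenotype_label(60, 23.0, 4))
print(endophenotype_label(60, 31.0, 3))
print(endophenotype_label(30, 22.0, 0))
```

Three binary splits yield exactly eight combinations, which is consistent with the eight endophenotypes reported, but the mapping of the remaining six combinations to numbers should be taken from table 2 rather than from this sketch.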
As mentioned above, the variable distribution of cardiovascular disease risk factors, as well as attributed mortality and morbidity, across the spectrum of BMI has resulted in the development of phenotypes combining BMI and metabolic profile in the medical literature. There are two common phenotypes of obesity: (1) MHO with a postulated resistance to CV morbidity and (2) metabolically unhealthy normal weight. Data regarding the effects of different obesity phenotypes based on the above criteria are conflicting, especially in longer follow-up periods. In short-term follow-ups, MHO was a benign condition,38 whereas findings of a few studies with long-term follow-ups documented contradictory results.39
A recent study by Mirzaei et al on the TLGS population over 12 years of follow-up revealed that although cardiovascular risk did not increase in MHO subjects, all other metabolically unhealthy phenotypes were at a higher risk of CVD adverse outcomes.40
Accordingly, the present endophenotyping method revealed that the HR for incident CVD mortality in the total population, including women and men, was 78.4 (95% CI 31.8 to 193.5) and 44.4 (95% CI 18.3 to 107.9) for endophenotype 6 (age >45 years, BMI <25 kg/m2 and at least 3 metabolic abnormalities) and endophenotype 8 (age >45 years, BMI >25 kg/m2 and at least 3 metabolic abnormalities), respectively. A significantly higher HR in endophenotype 6 compared with endophenotype 8 was also observed for the other outcomes of interest.
This result could be interpreted in two ways. First, the clustering method applied in the present study categorised all subjects based on feasible and simple measurements used in recent studies. Surprisingly, the HRs for cardiovascular outcomes were approximately 30–70-fold greater in endophenotype 6 and endophenotype 8 than in endophenotype 1 (the reference endophenotype). This suggests that the method could be applied confidently for patient classification, either for treatment strategies or for preventive goals in general populations. The higher HRs in the present study compared with previous literature could be partly explained by the selection of GFR, with machine learning methods, as a predictor of cardiovascular outcomes, which is missed in conventional classifications of metabolically healthy and unhealthy subjects.
Second, the concept of obesity paradox, the better prognosis in overweight/obese individuals affected by CVD compared with leaner subjects, was replicated in our results.
The present study has several strengths and limitations. Being the first large population-based study in Iran and the Middle East with over 15 years of follow-up is the main strength. We also used survival analysis to validate the extracted clusters identified by phenomapping. Another strength is that we employed a machine learning technique for classification of cardiovascular outcomes. Two limitations could be addressed in future research. First, the study focused on cardiovascular outcomes, and CVD-related diseases were not excluded at baseline. Second, the method is sophisticated and requires a high degree of skill and knowledge to apply.
Conclusions
This is the first study to conduct high-density phenotypic classification (ie, phenomapping) of a general population regarding cardiovascular outcomes. We have shown that unbiased cluster analysis of dense phenotypic data from multiple domains is feasible and can result in meaningful categories of the general population at a higher risk of developing cardiovascular adverse events. Given the heterogeneous nature of obesity, phenomapping could be helpful for improved classification and categorisation of obese subjects and may lead to the development of novel targeted therapies. Furthermore, phenomapping could help inform the design and conduct of future clinical trials and may be used to identify high-risk populations, thereby helping to clarify the controversial findings in this field. Moreover, the results of this study hold significant clinical implications for a specific part of the Middle Eastern population, for which tools/evidence from western populations with significantly different backgrounds and risk profiles are typically used.
Ethics statements
Patient consent for publication
Ethics approval
Written informed consents were obtained from all participants at baseline. The study protocol complied with the 1975 Ethical Guidelines of the Declaration of Helsinki and was approved by the Research Ethics Committee of School of Public Health and Neuroscience Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran (approval ID: IR.SBMU.PHNS.REC.1400.025). The ethics committee of the Research Institute for Endocrine Sciences has approved the Tehran Lipid and Glucose Study and informed written consent has been obtained from all subjects.
Acknowledgments
The authors would like to express their appreciation to the staff of the Obesity Research Center of Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran, for providing the necessary services. We also thank deputy for research and technology, Shahid Beheshti University of Medical Sciences, for supporting this study with grant number 30557.
References
Footnotes
Contributors YM designed the study. EZB, YM and DK analysed the data from the Tehran Lipid and Glucose Study population. EZB and MM wrote the manuscript; MV and MHP corrected the manuscript. DK and MM revised the manuscript. All authors read and approved the final manuscript. YM is responsible for the overall content as the guarantor.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.
Provenance and peer review Not commissioned; externally peer reviewed.