

Original research
Artificial intelligence method to classify ophthalmic emergency severity based on symptoms: a validation study
  1. Hyunmin Ahn
  1. Ophthalmology, Armed Forces Daegu Hospital, Daegu, Korea (the Republic of)
  1. Correspondence to Hyunmin Ahn; overhyun31{at}gmail.com

Abstract

Objectives We investigated the usefulness of machine learning artificial intelligence (AI) in classifying the severity of ophthalmic emergencies to support timely hospital visits.

Study design This retrospective study analysed the patients who first visited the Armed Forces Daegu Hospital between May and December 2019. General patient information, events and symptoms were input variables. Events, symptoms, diagnoses and treatments were output variables. The output variables were classified into four classes (red, orange, yellow and green, indicating immediate to no emergency cases). About 200 cases of the class-balanced validation data set were randomly selected before all training procedures. An ensemble AI model using combinations of fully connected neural networks with the synthetic minority oversampling technique algorithm was adopted.

Participants A total of 1681 patients were included.

Major outcomes Model performance was evaluated using accuracy, precision, recall and F1 scores.

Results The accuracy of the model was 99.05%. The precision of each class (red, orange, yellow and green) was 100%, 98.10%, 92.73% and 100%. The recalls of each class were 100%, 100%, 98.08% and 95.33%. The F1 scores of each class were 100%, 99.04%, 95.33% and 96.00%.

Conclusions We provided support for an AI method to classify ophthalmic emergency severity based on symptoms.

  • ophthalmology
  • accident & emergency medicine
  • biotechnology & bioinformatics

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.


Strengths and limitations of this study

  • This artificial intelligence model used various available mathematical and technical methods to improve performance by overcoming class imbalance in the training and validation samples.

  • It included a larger data set than previous studies and reflected more detailed symptom presentations.

  • The model was trained to reflect not only the known emergency symptoms but also the final diagnoses and treatments.

  • The utility of the AI model could be further improved if a larger and more diverse data set were obtained.

Introduction

South Korea has a population of about 50 million and is considered to have an advanced economy. In 2018, the gross domestic product per capita was about 35 000 dollars.1 The key feature of the Korean medical system is the National Health Insurance System, and virtually all citizens have joined this insurance system. In 2016, South Korea ranked 25th out of 195 countries on the healthcare access and quality (HAQ) index. Aside from tuberculosis, stroke and non-melanoma skin cancer, most diseases received very high HAQ index scores, with an average of 90 out of 100. South Korea ranked second highest among Asian countries, behind only Japan (ranked 12th).2

To our knowledge, there has been no study or report on HAQ index scores for eye diseases in South Korea. According to a statistical report by Statistics Korea (KOSTAT, the national statistical office), there are 4.1 eye clinics per 100 000 people, and eye clinics are mainly concentrated around the Seoul capital area and provincial metropolitan cities (1088 of 2057). Unfortunately, 20 of 185 cities are without an ophthalmologist.3 There is also no specific study on the Korean population regarding accessibility, quality of management and prognostic differences in ophthalmic diseases, although medical access and quality of patient management have been studied for other diseases, such as acute myocardial infarction and cancer.4 5 The lack of ophthalmologists may make the management of eye diseases difficult in some regions. In addition, ophthalmic diagnosis and treatment by non-ophthalmologists in the primary healthcare system are very limited. The lack of an ophthalmologic educational programme in the medical curriculum and in hospital training for general practitioners, together with the specificity of diagnosing and treating eye diseases, underlies these problems.6 7 Although South Korea has good primary HAQ across the board, there are blind spots, such as patients missing the proper time and opportunity for management in eye healthcare.

Artificial intelligence (AI) is used in various fields, including medicine, and ophthalmology is no exception. In particular, research on fundus image analysis using AI has been actively conducted.8 9 Considering the technological development of AI in other fields, such as autonomous vehicles or climate forecasting, the applicability of AI in medical fields, and in ophthalmology in particular, could become more versatile.10 11 One of the primary applications of AI within emergency medicine is triage.12 AI systems could potentially improve the emergency decision-making process in triage by supporting patient flow, wait times, resource utilisation and risk stratification.12 However, to our knowledge, there is no AI available for ophthalmic emergencies; only symptom checkers with poor accuracy are in use.13–16

This study aimed to determine whether AI can guide patients who, because of a lack of awareness of eye diseases, do not know how urgent their symptoms are, to an ophthalmologist for eye treatment. Specifically, we focused on whether AI can distinguish genuinely urgent patients.

Methods

Study population and data set

All 1681 patients who first visited the Armed Forces Daegu Hospital in Daegu, South Korea, between June and December 2019 were retrospectively analysed. Prior to any data processing, patient data were randomly divided into two data sets, that is, a training set and a validation set. The target ratio of the training set to the validation set was about 7 to 1, and about 200 cases were required in the validation set. The validation set consisted of about 50 randomly extracted cases from each of the four emergency code classes (more details below), forming a class-balanced validation data set.
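A minimal sketch of how such a class-balanced validation set could be drawn is shown below. It assumes a pandas DataFrame `patients_df` with an `ecode` label column; the column name, helper name and sample counts are illustrative only, not taken from the study.

```python
import pandas as pd

def split_balanced_validation(df: pd.DataFrame, label_col: str = "ecode",
                              per_class: int = 50, seed: int = 42):
    """Randomly draw ~per_class cases per emergency class for validation;
    the remaining cases form the training set (target ratio roughly 7:1)."""
    val_parts = []
    for _, group in df.groupby(label_col):
        n = min(per_class, len(group))          # guard against small classes
        val_parts.append(group.sample(n=n, random_state=seed))
    val_df = pd.concat(val_parts)
    train_df = df.drop(val_df.index)
    return train_df, val_df

# Example (hypothetical): train_df, val_df = split_balanced_validation(patients_df)
```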

Emergency severity classification

Emergency severity was classified into four classes (red, orange, yellow and green, indicating immediate to no emergency cases, respectively), following previous articles with modifications, based on events, symptoms, diagnoses and treatments.17 18 Two labelling methods were used. In the first (ECode I), only diagnoses and treatments were used. In the second (ECode II), events, symptoms, diagnoses and treatments were used. The corresponding criteria for the classes were as follows (table 1); an illustrative sketch of this labelling is given after the criteria.

Table 1

Emergency severity classification

Red (immediate)

Events (pain in a postoperative eye (less than 2 months after surgery), stab/penetrating trauma and chemical burn), symptoms (sudden continuous visual disturbance, moderate-to-severe continuous eyeball pain or photophobia), diagnoses (acute glaucoma, corneal laceration, central retinal artery/vein occlusion under 24 hours, giant cell arteritis with visual disturbance, globe perforation, iris prolapse, intraocular foreign body, orbital cellulitis and third cranial nerve palsy with pupil involvement) and treatment (urgent operation).

Orange (within 24 hours, except for red)

Events (arc eye, blunt trauma, contact lens-related injury, foreign body without globe injury and corneal graft patient), symptoms (sudden/recent onset with marked one-side redness), diagnoses (corneal abrasion, corneal foreign body, corneal ulcer, corneal opacities with pain, foreign body in conjunctival sac, hyphoema, iridocyclitis, orbital wall fracture, retinal detachment, retinal tear and vitreous haemorrhage) and treatments (needing continuous monitoring for 24–72 hours).

Yellow (within 7 days, except for red and orange)

No event, symptom (diplopia or other visual disturbance), diagnosis (aggravating retinal disease such as age-related macular degeneration, central serous chorioretinopathy with fluid collection, herpes zoster ophthalmicus, non-managed episcleritis, scleritis, posterior vitreous detachment, Bell’s palsy, optic neuritis, preseptal orbital cellulitis, severe infectious conjunctivitis, retinal vein occlusion over 24 hours and proliferative diabetic retinopathy) and no treatment.

Green (not emergency)

No event, symptom (long-lasting squint), diagnosis (allergic conjunctivitis, blepharitis, chalazion, dry eye, ectropion, watery eye, subconjunctival haemorrhage, non-proliferative diabetic retinopathy without macular oedema and senile cataract) and no specific treatment.
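For illustration only, the criteria above could be encoded as a simple rule lookup. The diagnosis strings, event strings and the `label_ecode_ii` helper below are hypothetical simplifications of table 1, not the study's actual implementation.

```python
# Hypothetical, highly simplified encoding of the ECode II criteria above.
# Inputs are sets of lowercase strings; only a few example items are shown.
RED_DIAGNOSES = {"acute glaucoma", "corneal laceration", "globe perforation"}
ORANGE_DIAGNOSES = {"corneal ulcer", "retinal detachment", "hyphoema"}
YELLOW_DIAGNOSES = {"optic neuritis", "herpes zoster ophthalmicus"}

RED_EVENTS = {"chemical burn", "stab/penetrating trauma"}
ORANGE_EVENTS = {"blunt trauma", "arc eye"}

def label_ecode_ii(events, symptoms, diagnoses, urgent_operation=False):
    """Return the most urgent class triggered by any criterion
    (red > orange > yellow > green)."""
    if urgent_operation or RED_EVENTS & events or RED_DIAGNOSES & diagnoses \
            or "sudden continuous visual disturbance" in symptoms:
        return "red"
    if ORANGE_EVENTS & events or ORANGE_DIAGNOSES & diagnoses:
        return "orange"
    if YELLOW_DIAGNOSES & diagnoses or "diplopia" in symptoms:
        return "yellow"
    return "green"
```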

Reliability analysis of emergency severity classification

We used the intraclass correlation coefficient (ICC; p<0.05; 95% CI) to assess the reliability of the emergency severity classification made by the responsible author. ECode II was compared with the classifications of one general practitioner and one other ophthalmologist.
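As an illustration, an ICC of this kind can be computed with the `pingouin` Python package; the long-format table, numeric class coding and column names below are assumptions made for the sketch, not the study's data.

```python
import pandas as pd
import pingouin as pg

# Long-format table: one row per (case, rater) pair, with the assigned class
# coded numerically (e.g. red=4, orange=3, yellow=2, green=1). Values are illustrative.
ratings = pd.DataFrame({
    "case":  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "rater": ["author", "gp", "ophthalmologist"] * 4,
    "ecode": [4, 4, 3, 1, 1, 1, 2, 2, 2, 3, 3, 4],
})

icc = pg.intraclass_corr(data=ratings, targets="case",
                         raters="rater", ratings="ecode")
print(icc[["Type", "ICC", "pval", "CI95%"]])
```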

Clinical variables for training and validation

Three categories of variables (general patient information, events and symptoms) were used as predictors. For general patient information, seven variables (age, sex, systemic disease, allergy, eye disease/trauma, eye operation and operation elapsed time) were used. Systemic disease, eye disease, eye operation and operation elapsed time (if there was a history of multiple surgeries) were explored with multiple-choice questions. The event category included trauma events (stab/penetrating and blunt), other events (lens-related/foreign body-related, chemical burn and arc injury) and the common cold. Corrected visual acuity determined with a near vision card was also checked. The symptom category was divided into three subcategories (vision, pain and other symptoms) to emphasise vision and pain themselves and the temporal relationships among vision, pain and other symptoms. Each symptom subcategory included the aspect of time, the aspect of the symptoms and the number of symptomatic eyes. The aspect of time consisted of duration, persistence (even with behaviours such as blinking, artificial tear usage or daytime changes) and worsening (over the disease period). In the vision subcategory, nine visual problems (poor vision, visual field defect, metamorphopsia, glare, photopsia, colorosis and nyctalopia) were used. In the pain subcategory, the aspect of pain was classified into eye pain (sharp, dull and aggressive pain (patients typically described it as ‘my eyes feel like they’re going to pop out’)) and eyelid pain. Visual Analogue Scale scores for the self-reported degree of pain in both eyes and of eyelid pain were also included. The other symptoms subcategory included diplopia, lazy eye, nystagmus, eyelid mass, eyelid swelling, ptosis, enophthalmos, exophthalmos, white eyeball pigmentation/abnormalities, conjunctival swelling, red eye, dryness, stinging, burning sense, fatigue, itching, foreign body sense, discharge (sleep), tearing and neurologic signs (nausea, vomiting and headache). The symptoms were classified by referring to the American Academy of Ophthalmology (AAO) symptom list (https://www.aao.org/eye-health/symptoms-list) and were selected based on usual patient complaints. To establish the symptom variables that patients complained of, prior screening was conducted. If an exceptional symptom was difficult to classify using the AAO list because of regional and national characteristics, attempts were still made to comply consistently with the AAO list. A total of 57 questions (32 symptoms) were used (table 2).
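As described later in the model development section, multiselect answers were converted to dummy (binary indicator) variables. A minimal sketch of this kind of preprocessing is shown below; the question and answer names are illustrative only.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative multi-select answers for the "systemic disease" question.
answers = pd.Series([["diabetes", "hypertension"], ["none"], ["hypertension"]])

mlb = MultiLabelBinarizer()
dummies = pd.DataFrame(mlb.fit_transform(answers),
                       columns=[f"systemic_{c}" for c in mlb.classes_])
print(dummies)
# One binary column per possible answer, ready to be concatenated with the
# other predictor columns before training.
```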

Table 2

The clinical variables for training set and validation set

Model development

An ensemble AI model using combinations of fully connected neural networks was adopted (figure 1). Input variables included multiselect questions, which were preprocessed by dummy processing. In the training set, the class imbalance was handled using the synthetic minority oversampling technique (SMOTE) algorithm, the most widely used algorithm for generating artificial training samples.19 The model included K-fold cross-validation (K=10) to improve model performance with a relatively small data volume.20 This model is distinctive in that the vision and pain variables were separated from the other variables and first processed by their own multilayer perceptrons. After this step, the pain and vision representations were concatenated with the general patient information, event and other symptom variables. Overfitting was controlled by dropout (50%), L2 regularisation and early stopping. Prediction performance was measured by the accuracy on the whole data set and the precision, recall and F1 score of each class. Accuracy alone was not informative owing to the imbalance in the number of cases in each class, which is common in medical reports.21 22 Accuracy was high because of the high proportion of the mild severity class (the green class in this paper); therefore, precision, recall and the F1 score were also calculated. Additionally, an ‘acceptable prediction percentage’ was calculated: a prediction was counted as ‘true’ when the predicted label was the same as, or more urgent than, the actual class. For example, when the AI predicted a yellow-class patient as a red-class or yellow-class patient, the prediction was considered ‘true’. An earlier hospital visit was considered acceptable in view of the risk of delayed treatment; the clinically dangerous situation is an urgent patient visiting the hospital late.
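A minimal sketch of this kind of architecture and training loop is shown below, assuming Keras/TensorFlow, scikit-learn and imbalanced-learn. The layer sizes, feature ordering and training settings are illustrative assumptions, not the study's reported configuration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, regularizers, callbacks
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedKFold

def build_model(n_vision, n_pain, n_other, n_classes=4, l2=1e-3):
    """Two small MLP branches for vision and pain, concatenated with the
    remaining predictors, as described in the text (sizes are illustrative)."""
    vision_in = layers.Input(shape=(n_vision,), name="vision")
    pain_in = layers.Input(shape=(n_pain,), name="pain")
    other_in = layers.Input(shape=(n_other,), name="other")

    v = layers.Dense(32, activation="relu",
                     kernel_regularizer=regularizers.l2(l2))(vision_in)
    p = layers.Dense(32, activation="relu",
                     kernel_regularizer=regularizers.l2(l2))(pain_in)

    x = layers.Concatenate()([v, p, other_in])
    x = layers.Dense(64, activation="relu",
                     kernel_regularizer=regularizers.l2(l2))(x)
    x = layers.Dropout(0.5)(x)                      # 50% dropout
    out = layers.Dense(n_classes, activation="softmax")(x)

    model = tf.keras.Model([vision_in, pain_in, other_in], out)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def train_kfold(X_vision, X_pain, X_other, y, k=10):
    """10-fold cross-validation with SMOTE applied to each training fold only;
    the per-fold models form the ensemble."""
    X_all = np.hstack([X_vision, X_pain, X_other])
    nv, npain = X_vision.shape[1], X_pain.shape[1]
    models = []
    for train_idx, _ in StratifiedKFold(n_splits=k, shuffle=True,
                                        random_state=0).split(X_all, y):
        X_res, y_res = SMOTE(random_state=0).fit_resample(X_all[train_idx],
                                                          y[train_idx])
        model = build_model(nv, npain, X_all.shape[1] - nv - npain)
        model.fit([X_res[:, :nv], X_res[:, nv:nv + npain], X_res[:, nv + npain:]],
                  y_res, epochs=200, batch_size=32, validation_split=0.1,
                  callbacks=[callbacks.EarlyStopping(patience=10,
                                                     restore_best_weights=True)],
                  verbose=0)
        models.append(model)
    return models
```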

Figure 1

Data set, sampling and the model architecture. Prior to any data processing, patient data were randomly divided into two data sets, that is, a training set and a validation set (ratio 7:1). Similar numbers of validation samples were extracted for each class. In the training set, the class imbalance was handled using the synthetic minority oversampling technique (SMOTE) algorithm. The model included K-fold cross-validation (K=10) to improve model performance. In the machine learning algorithm, vision and pain variables were processed by their own multilayer perceptrons and then concatenated with the general patient information and other symptom variables. Overfitting was controlled by 50% dropout, L2 regularisation and early stopping. Prediction performance was measured by the accuracy on the whole data set and the precision, recall and F1 score of each class.
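The reported metrics and the ‘acceptable prediction percentage’ defined in the model development section could be computed as in the following sketch, which assumes an ordinal class coding (0=green, 1=yellow, 2=orange, 3=red) purely for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# Assumed ordinal coding: 0=green, 1=yellow, 2=orange, 3=red.
def acceptable_prediction_percentage(y_true, y_pred):
    """A prediction counts as 'true' if it is the same class as, or more
    urgent than, the actual class (an earlier visit is clinically acceptable)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * np.mean(y_pred >= y_true)

y_true = [3, 2, 1, 0, 1]
y_pred = [3, 2, 2, 0, 1]   # one yellow case predicted as orange: still acceptable

print(f"accuracy: {accuracy_score(y_true, y_pred):.2%}")
print(classification_report(y_true, y_pred,
                            target_names=["green", "yellow", "orange", "red"]))
print(f"acceptable prediction percentage: "
      f"{acceptable_prediction_percentage(y_true, y_pred):.1f}%")
```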

The performance comparison between the model with ECode I and the model with ECode II

Two steps were used to analyse the difference between the model with ECode I and the model with ECode II. First, Spearman’s correlation analysis was used to assess the correlation between the ECode I and ECode II labels over the whole data set, and the relationships among the classes under ECode I and ECode II were also analysed. Second, in the validation set, the accuracies of the model oversampled by the SMOTE algorithm with ECode I (ECcSMOTE I) and with ECode II (ECcSMOTE II) were compared, with p<0.05 indicating statistical significance.
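The Spearman correlation can be computed with SciPy. The article does not name the test used to compare the two accuracies, so the McNemar test in the sketch below is only one plausible, assumed choice; all data shown are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr
from statsmodels.stats.contingency_tables import mcnemar

# Illustrative ordinal labels (0=green ... 3=red) from the two labelling methods.
ecode_i = np.array([3, 2, 2, 1, 0, 0, 1, 3])
ecode_ii = np.array([3, 2, 1, 1, 0, 0, 2, 3])
rho, p = spearmanr(ecode_i, ecode_ii)
print(f"Spearman rho = {rho:.3f}, p = {p:.3g}")

# Hypothetical paired comparison of per-case correctness of the two models
# (the article does not state which test was used for the accuracy comparison).
correct_i = np.array([1, 1, 0, 1, 1, 1, 1, 1], dtype=bool)
correct_ii = np.array([1, 1, 1, 1, 1, 0, 1, 1], dtype=bool)
table = [[np.sum(correct_i & correct_ii), np.sum(correct_i & ~correct_ii)],
         [np.sum(~correct_i & correct_ii), np.sum(~correct_i & ~correct_ii)]]
print(f"McNemar p = {mcnemar(table, exact=True).pvalue:.3f}")
```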

Patient and public involvement

No patient was involved in developing the research question, outcome measurement and design of the study. We are unable to disseminate the findings of the research directly to the study participants.

Results

Reliability of emergency severity classification

The ICC score for ECode II among the general practitioner, the other ophthalmologist and the responsible author was 0.979 (p<0.001, 95% CI 0.975 to 0.982).

Model performance

In ECcSMOTE I, 1456 cases (49 (3.39%), 63 (4.30%), 115 (7.92%) and 1229 (84.39%) in the red, orange, yellow and green classes, respectively) were enrolled in the training set, and 225 cases (61 (27.11%), 61 (27.11%), 52 (23.11%) and 51 (22.67%) in the red, orange, yellow and green classes, respectively) were enrolled in the validation set. In ECcSMOTE II, 1471 cases (77 (5.20%), 110 (7.47%), 120 (8.14%) and 1164 (79.19%), respectively) were enrolled in the training set and 210 cases (54 (25.72%), 52 (24.76%), 52 (24.76%) and 52 (24.76%), respectively) were enrolled in the validation set (table 3). The correlation coefficient between ECode I and ECode II was 0.837 (p<0.001). The relationships among the classes between ECode I and ECode II are shown in table 4. The accuracy of the ECcSMOTE I model, calculated with the validation set, was 98.88%. The accuracy of ECcSMOTE II was 99.05%. There was no statistically significant difference between ECcSMOTE I and ECcSMOTE II accuracy (p=0.825). The precision of each class was 100%, 99.04%, 97.14% and 100% in ECcSMOTE I. The recalls were 100%, 100%, 99.04% and 97.16%. The F1 scores were 100%, 99.52%, 98.08% and 98.56%, respectively. The acceptable prediction percentage of ECcSMOTE I was 100%. The precision of each class was 100%, 98.10%, 92.73% and 100%, respectively, with ECcSMOTE II. The recalls were 100%, 100%, 98.08% and 95.33%, respectively. The F1 scores were 100%, 99.04%, 95.33% and 96.00%, respectively. The acceptable prediction percentage of ECcSMOTE II was also 100% (table 5).

Table 3

The proportion of ECode I and ECode II

Table 4

The relationships of ECode I and ECode II

Table 5

The model performance of ECcSMOTE II

Discussion

Previous research

Currently, mobile phone applications and internet sites that diagnose diseases and classify emergency severity based on symptoms are used to decide whether patients require a hospital visit.13–16 Unfortunately, previous studies have reported negative results for these tools: the accuracy of disease diagnosis was less than 50% and that of emergency severity classification was about 80%. A recent study was also performed in the field of ophthalmology, although the sample size was insufficient.23 In that study, the overall accuracy of disease diagnosis was 26% (95% CI 12% to 40%), and the accuracy of emergency triage was 39% (95% CI 14% to 64%) in emergency situations and 88% (95% CI 73% to 100%) in non-emergency situations.

Strengths and limitations

The results of our study differed from those of preceding studies, for three main reasons. First, the biggest difference was the sample size.24 In the present study, 1681 cases were studied, a relatively large sample compared with previous studies, which predicted about 50–99 diagnoses and 3–5 triage classes from around 20–30 symptoms. As the ratio of the number of input variables to the sample size increases, the strength attributed to each variable is diluted, so a variable whose actual influence is high may not attain sufficient strength. As the ratio of the number of output classes to the sample size increases, there may be very few or no samples in some output classes, so no significant difference between classes can be learnt.25–28 Second, more detailed input variables were employed in this study. Fifty-seven questions were used as input variables, clearly more than the 20–30 questions used in previous studies. Variables such as persistence, aggravation and degree were used in addition to the presence or absence of symptoms. This refinement allowed the model to represent completely different presentations of the same symptom variables. Third, emergency severity prediction, not diagnostic prediction, was assessed. The reason why diagnostic prediction is more difficult than emergency severity prediction was mentioned above. Moreover, different diseases often present with the same symptoms, and in ophthalmology it is very difficult to distinguish diseases clinically by symptoms without examination. For example, wet-type age-related macular degeneration and central serous chorioretinopathy were labelled in the same ‘yellow’ class in this study, although they are diagnosed differently; the only symptom of both diseases was persistent visual disturbance.

In this paper, emergency severity was classified using two methods. The first classification (ECode I) used only diagnoses and treatments, excluding symptoms such as red flag signs and events. The second classification (ECode II) used symptoms, events, diagnoses and treatments. Both ECode I and ECode II showed excellent accuracy with the SMOTE algorithm (ECcSMOTE I and II: 98.88% and 99.05%, respectively). Contrary to the concern that the ECcSMOTE I model might predict milder classes than the actual severity, its predictions were equal to or more urgent than the true values (acceptable prediction percentage 100%). Also, the values for each class in ECode II were equal to or higher than those in ECode I (table 5). Nevertheless, the accuracy of ECcSMOTE I was not lower than that of ECcSMOTE II (p=0.825). This shows that labelling for emergency severity machine learning is possible even if red flag signs and events are not considered. This is because the input variables of the training set already included the patients’ symptoms and events, and this input variation was correlated with the output variation during the training process.

There were two major limitations in this study. First, the data set was small. Although the sample size was larger than in previous reports, it was still relatively insufficient considering the number of input variables. The classification formula implemented in the model differs from simple linear regression, but the lack of samples remains a limitation for machine learning based on statistical operations.29 Additional studies with larger, multicentre validation sets are needed for clinical application. The second limitation was the imbalance among the classes, a common problem in medical research, which greatly degrades the accuracy of minority classes.21 22 30 The accuracy of the ECsSMOTE II model was only 79.52%, similar to other studies, because of this limitation. To compensate for it, the SMOTE algorithm, an oversampling method that mathematically emphasises class relationships, was used in this study.20 Considering the bias that oversampling can introduce when evaluating model performance, actual cases were randomly extracted from the entire data set as the validation set before oversampling. The SMOTE algorithm does not show superior performance compared with undersampling methods, but it can be used for high-dimensional class-imbalanced data.31 Further research that overcomes these limitations is needed to address the risks of clinical use of AI in ophthalmic emergencies.

Conclusions

The accuracies of the models using the SMOTE algorithm (ECcSMOTE I and II) were close to 100% (98.88% and 99.05%, respectively). This preliminary study yielded excellent predictions, in contrast to previous reports, by using various available mathematical and technical methods to compensate for its limitations, although studies in more diverse populations would be ideal. In conclusion, we propose an AI method to classify ophthalmic emergency severity based on symptoms.

References

Footnotes

  • Contributors HA: substantial contributions to the conception and design of the work; the acquisition, analysis and interpretation of data for the work.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Patient consent for publication Not required.

  • Ethics approval Approval was obtained from the Medical Research Ethics Committee, the Armed Force Medical Command, Republic of Korea (protocol number AFMC-19126-IRB-19-096). All research protocols were conducted in accordance with the Declaration of Helsinki.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data availability statement All data relevant to the study are included in the article. Data are available in a public, open access repository.