Migraine outcome model development
A migraine outcome model was defined by a panel of two headache specialists (NAH and RC). To construct the model, the specialists, drawing on their clinical experience, defined data elements (structured or unstructured) likely to be captured in the EHR that reflect the diagnosis and progression of migraine. Structured data comprised predetermined fields in fixed formats, typically used to collect data for payment or for regulatory or public health purposes, while unstructured data comprised the narrative written by the physician to record information used in patient management (and to maintain the medico-legal record). Although unstructured data often capture more complete information from a patient visit than structured data do, their interpretation requires either human review with manual extraction, or sophisticated software for automated extraction.
The selected features were migraine-associated headache; headache severity (mild, moderate, or severe); headache severity descriptors (including pulsating, debilitating, stabbing, throbbing, disabling, and piercing); headache progression (documented improvement or worsening); and commonly reported associated symptoms: photophobia, phonophobia, nausea, and vomiting.
The model focused on symptoms because these reflect the patient’s migraine experience. Each selected data element was weighted to define a 10-point scale encompassing headache severity (1–7 points) and associated features (0–3 points), a procedure consistent with current US regulatory guidance for measuring response to acute treatment [11]. In this model, headache severity was scored as none or no headache documented (1 point); mild or severity not documented (3 points); moderate (5 points); or severe or a severe headache descriptor (7 points). Encounters documenting multiple headache severities were assigned the highest severity represented. Associated features (nausea, vomiting, and either photophobia or phonophobia) scored 1 point each when present.
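As a minimal sketch of this scoring logic (the function and field names below are illustrative assumptions, not the study’s implementation):

```python
# Illustrative sketch of the 10-point migraine outcome score described above.
# Labels and structure are hypothetical; the study's code is not public.

SEVERITY_POINTS = {
    "none": 1,            # no headache documented
    "not_documented": 3,  # headache documented without severity scores as mild
    "mild": 3,
    "moderate": 5,
    "severe": 7,          # or any severe headache descriptor
}

def migraine_outcome_score(severities, nausea, vomiting, photo_or_phonophobia):
    """Return the 1-10 outcome score for one encounter.

    severities: severity labels documented in the encounter (the highest
                represented severity is scored, per the model above)
    nausea, vomiting, photo_or_phonophobia: booleans; 1 point each when present
    """
    if not severities:
        severities = ["none"]
    severity_score = max(SEVERITY_POINTS[s] for s in severities)
    associated = sum([nausea, vomiting, photo_or_phonophobia])  # 0-3 points
    return severity_score + associated

# Example: moderate headache with nausea and photophobia -> 5 + 2 = 7
print(migraine_outcome_score(["mild", "moderate"], nausea=True,
                             vomiting=False, photo_or_phonophobia=True))
```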
Technology optimization for extraction of features
Data source and study population
Deidentified EHR data from a US tertiary care academic medical center, recorded between 2018 and 2020, were studied to identify primary care and neurology encounters for inclusion in the study. To enrich the sample for visits focused on migraine, records were selected by random sampling with patient-level and encounter-level filters applied. Patient-level filters required the presence of migraine in the structured or unstructured medical record and the presence of at least two evaluable encounters.
For each selected patient, two evaluable encounters separated by at least 2 weeks were selected for the study. Evaluable encounters were those in which the primary reason for the consultation was a complaint of headache, or in which the encounter narrative mentioned headache at least twice. Patients and their associated encounters were divided into training and validation data sets.
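This selection rule can be sketched as follows; the encounter fields (date, primary_complaint, narrative) are hypothetical stand-ins for the actual EHR schema:

```python
from datetime import timedelta

def evaluable(encounter):
    """An encounter is evaluable if headache is the primary complaint
    or the narrative mentions headache at least twice (per the criteria above)."""
    return (encounter["primary_complaint"] == "headache"
            or encounter["narrative"].lower().count("headache") >= 2)

def select_encounter_pair(encounters):
    """Return two evaluable encounters separated by at least 2 weeks, if any."""
    candidates = sorted((e for e in encounters if evaluable(e)),
                        key=lambda e: e["date"])
    for i, first in enumerate(candidates):
        for second in candidates[i + 1:]:
            if second["date"] - first["date"] >= timedelta(weeks=2):
                return first, second
    return None  # patient fails the patient-level filter
```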
Reference standard creation
To optimize and validate the accuracy of automated extraction of the data elements included in the migraine outcome model, a reference standard was created. Two independent, trained annotators with clinical degrees manually reviewed each record and labeled each feature and its associated meta-data. Before reviewing and annotating encounters, annotators received training on the annotation application, the feature and meta-data definitions, and their appropriate usage. Features included clinical concepts such as headache, migraine, nausea, vomiting, photophobia, and phonophobia. Meta-data included attributes that change the meaning of a documented feature, such as negation, severity, and descriptors. Migraine features were tested at the level of each encounter; that is, it was not assumed that the same symptoms persisted longitudinally from one encounter to the next. Accuracy therefore required that a feature be correctly identified in a specific encounter.
To ensure adequate quality in reference standard creation, annotators were blinded to each other’s annotation and inter-annotator agreement was measured daily by Cohen’s kappa score. A minimum kappa score of 0.7 was required for the reference standard to be considered adequate. After kappa score calculation, all cases of disagreement were reviewed by both annotators for resolution. Unresolved cases were escalated to a third annotator for resolution.
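For reference, Cohen’s kappa compares observed agreement with the agreement expected by chance from each annotator’s marginal label rates. A generic sketch for binary labels (not the study’s code) is:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' binary (0/1) labels on the same items.
    A value >= 0.7 met the study's adequacy threshold. Assumes the expected
    chance agreement is < 1 (otherwise kappa is undefined)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's rate of positive labels
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)
```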
Automated feature extraction
This study deployed natural language processing (NLP) and machine learning algorithms, both aspects of artificial intelligence (AI), to extract features from the unstructured data of filter-enriched encounters. For example, NLP may identify the features headache, nausea, and vomiting in different parts of an encounter narrative, and machine-learned associations may identify patterns supporting the disambiguation of abbreviations such as “MA” as “migraine with aura” rather than “mass,” “medical assistant,” or “Massachusetts.” The NLP architecture and pipeline employed have been described previously [12].
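As a toy illustration of such disambiguation, a model might weigh surrounding context terms toward one expansion. The cue terms and scoring below are invented for illustration and are far simpler than a machine-learned disambiguator:

```python
# Toy context-based disambiguation of the abbreviation "MA". The cue sets are
# hypothetical; the study used machine-learned associations, not keyword lists.
EXPANSION_CUES = {
    "migraine with aura": {"headache", "photophobia", "aura", "triptan"},
    "mass": {"lesion", "imaging", "biopsy"},
    "medical assistant": {"roomed", "vitals", "intake"},
}

def disambiguate_ma(context_tokens):
    """Pick the expansion whose cue terms best overlap the surrounding text."""
    context = set(context_tokens)
    return max(EXPANSION_CUES, key=lambda exp: len(EXPANSION_CUES[exp] & context))

print(disambiguate_ma(["pt", "reports", "photophobia", "and", "aura"]))
# -> 'migraine with aura'
```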
Both the NLP and the inference rules were optimized for feature extraction in this clinical domain by Verantos, Inc. (Menlo Park, CA). Because structured data are often used to identify clinical concepts in RWE, features were also extracted separately from the structured data in the EHRs using structured query language (SQL), providing a comparison for the accuracy of feature extraction from unstructured data.
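A hypothetical example of such a structured-data pull, with an illustrative schema (the actual EHR tables and the specific codes queried in the study are not described here):

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative. Structured
# features such as ICD-10 migraine diagnoses (G43.*) can be pulled with plain
# SQL for comparison against NLP extraction from the narrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE diagnoses (encounter_id TEXT, code TEXT)")
conn.executemany("INSERT INTO diagnoses VALUES (?, ?)",
                 [("e1", "G43.109"), ("e2", "I10"), ("e3", "G43.909")])

migraine_encounters = conn.execute(
    "SELECT encounter_id, code FROM diagnoses WHERE code LIKE 'G43%'"
).fetchall()
print(migraine_encounters)  # [('e1', 'G43.109'), ('e3', 'G43.909')]
```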
Statistical analyses
The primary study objective was to evaluate the accuracy of automated scoring of migraine severity from elements extracted from the EHR compared with manual scoring. Accuracy for the migraine outcome model was determined using the R programming language, version 3.3.2. Results were reported as the percentage of encounters with matching migraine outcome scores based on automated versus manual feature extraction. Matches were defined as ‘exact’ (matching the manual reference score exactly on the 10-point scale) or ‘close’ (matching the manual reference score within 1 point on the 10-point scale). For this study, success was defined as achieving a close match in migraine outcome score for at least 70% of encounters.
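A minimal sketch of these match definitions, assuming parallel lists of automated and manual scores:

```python
def match_rates(auto_scores, manual_scores):
    """Percentage of encounters with exact and close (within 1 point) matches
    between automated and manual 1-10 outcome scores."""
    n = len(manual_scores)
    exact = 100 * sum(a == m for a, m in zip(auto_scores, manual_scores)) / n
    close = 100 * sum(abs(a - m) <= 1
                      for a, m in zip(auto_scores, manual_scores)) / n
    return exact, close  # study success criterion: close >= 70%
```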
We also evaluated the accuracy of automated feature extraction, as this was critical to automated scoring. Each element of the outcome model was therefore compared against the manual reference standard in terms of recall, precision, and F1 score. Recall (sensitivity) was defined as the percentage of data elements identified by manual annotation that were also identified through automated annotation. Precision (positive predictive value) was defined as the percentage of data elements identified through automated annotation that were also identified by manual annotation. The F1 score was calculated as the harmonic mean of precision and recall. For this study, an average F1 score threshold of 80% was set to demonstrate sufficient accuracy of automated feature extraction. Concepts with fewer than 20 occurrences were excluded from accuracy measurements. Average accuracy measures were weighted by reference standard occurrence counts to account for variability in feature occurrence among encounters. Microsoft Excel 365 was used for all other data analyses.
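These accuracy measures can be sketched from per-feature counts; tp, fp, and fn below denote true-positive, false-positive, and false-negative counts against the reference standard (an illustrative sketch, not the study’s code):

```python
def feature_accuracy(tp, fp, fn):
    """Recall, precision, and F1 for one feature, from true-positive,
    false-positive, and false-negative counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

def weighted_average_f1(per_feature):
    """Average F1 weighted by reference-standard occurrence counts (tp + fn),
    excluding features with fewer than 20 occurrences, as in the study."""
    kept = [(tp, fp, fn) for tp, fp, fn in per_feature if tp + fn >= 20]
    weights = [tp + fn for tp, fp, fn in kept]
    f1s = [feature_accuracy(tp, fp, fn)[2] for tp, fp, fn in kept]
    return sum(w * f for w, f in zip(weights, f1s)) / sum(weights)
```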