Ju M, Short AD, Thompson P, Bakerly ND, Gkoutos GV, Tsaprouni L, Ananiadou S.
JAMIA Open, Volume 2, Issue 2, July 2019, Pages 261–271
Objective
Chronic obstructive pulmonary disease (COPD) phenotypes cover a range of lung abnormalities. To allow text mining methods to identify pertinent and potentially complex information about these phenotypes from textual data, we have developed a novel annotated corpus, which we use to train a neural network-based named entity recognizer to detect fine-grained COPD phenotypic information.
Materials and methods
Since COPD phenotype descriptions often mention other concepts within them (proteins, treatments, etc.), our corpus annotations include both outermost phenotype descriptions and concepts nested within them. Our neural layered bidirectional long short-term memory conditional random field (BiLSTM-CRF) network first recognizes nested mentions, which are then fed into subsequent BiLSTM-CRF layers to help recognize the enclosing phenotype mentions.
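The layered control flow described above can be sketched in a few lines. This is only an illustration of the inner-to-outer idea, not the network used in the paper: the trained BiLSTM-CRF layer is replaced by a hypothetical rule-based `toy_tagger`, and the labels `Protein` and `Phenotype`, the pattern table, and the example sentence are all assumptions made for the sketch. Each pass tags the current sequence, merges every detected mention into a single unit, and passes the merged sequence to the next pass until nothing new is found.

```python
from typing import Callable, List, Tuple

Unit = Tuple[int, int, str]   # (token_start, token_end_exclusive, symbol)
Span = Tuple[int, int, str]   # (unit_start, unit_end_exclusive, label)
Tagger = Callable[[List[Unit]], List[Span]]

# Hypothetical patterns standing in for a trained flat tagger (assumption).
PATTERNS = {
    ("alpha-1", "antitrypsin"): "Protein",
    ("<Protein>", "deficiency"): "Phenotype",
}


def toy_tagger(units: List[Unit]) -> List[Span]:
    """Stand-in for one flat tagging layer: left-to-right pattern matching."""
    symbols = [u[2] for u in units]
    spans: List[Span] = []
    i = 0
    while i < len(symbols):
        for pattern, label in PATTERNS.items():
            n = len(pattern)
            if tuple(symbols[i:i + n]) == pattern:
                spans.append((i, i + n, label))
                i += n
                break
        else:
            i += 1  # no pattern starts here, move on
    return spans


def layered_ner(tokens: List[str], tagger: Tagger,
                max_layers: int = 5) -> List[Tuple[int, int, str]]:
    """Run the tagger repeatedly, feeding merged inner mentions to outer passes."""
    units: List[Unit] = [(i, i + 1, tok.lower()) for i, tok in enumerate(tokens)]
    found: List[Tuple[int, int, str]] = []
    for _ in range(max_layers):
        spans = tagger(units)
        if not spans:
            break  # no new mentions, so no further layer is needed
        merged: List[Unit] = []
        pos = 0
        for start, end, label in spans:
            merged.extend(units[pos:start])
            tok_start, tok_end = units[start][0], units[end - 1][1]
            found.append((tok_start, tok_end, label))
            # The detected mention becomes one unit in the next layer's input.
            merged.append((tok_start, tok_end, f"<{label}>"))
            pos = end
        merged.extend(units[pos:])
        units = merged
    return found


tokens = "Patients with alpha-1 antitrypsin deficiency were excluded".split()
entities = layered_ner(tokens, toy_tagger)
for start, end, label in entities:
    print(label, "->", " ".join(tokens[start:end]))
```

In this toy run the inner mention is found on the first pass and the enclosing mention on the second, because the second pass sees the inner mention as a single labelled unit. In a neural version, the merged unit would presumably carry a learned representation rather than a label string.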
Results
Our corpus of 30 full papers (available at: http://www.nactem.ac.uk/COPD) is annotated by experts with 27 030 phenotype-related concept mentions, most of which are automatically linked to UMLS Metathesaurus concepts. When trained using the corpus, our BiLSTM-CRF network outperforms other popular approaches in recognizing detailed phenotypic information.
Discussion
Information extracted by our method can facilitate efficient location and exploration of detailed information about phenotypes, for example, those specifically concerning reactions to treatments.
Conclusion
The importance of our corpus for developing methods to extract fine-grained information about COPD phenotypes is demonstrated through its successful use to train a layered BiLSTM-CRF network to extract phenotypic information at various levels of granularity. The minimal human intervention needed for training should permit ready adaptation to extracting phenotypic information about other diseases.
Chair of Clinical Bioinformatics at University of Birmingham
A biochemist by training, Professor Georgios Gkoutos was initially involved in the field of Computational Biology following an MSc degree by research on correlated mutations analysis on G-Protein...
Alcohol use and burden for 195 countries and territories, 1990-2016: a systematic analysis for the Global Burden of Disease Study 2016
23 August 2018
GBD 2016 Alcohol Collaborators The Lancet (2018) 392(10152):1015-1035 Background: Alcohol use is a leading risk factor for death and disability, but its overall association with health remains...
Antimicrobial-impregnated central venous catheters for prevention of neonatal bloodstream infection (PREVAIL): an open-label, parallel-group, pragmatic, randomised controlled trial
1 June 2019
Gilbert R, Brown M, Rainford N, Donohue C, Fraser C, Sinha A, Dorling J, Gray J, McGuire W, Gamble C, Oddie SJ, PREVAIL trial team The Lancet, Child & Adolescent Health (2019)...