Machine Learning Predicts Accurately Mycobacterium tuberculosis Drug Resistance From Whole Genome Sequencing Data

Wouter Deelder, Sofia Christakoudi, Jody Phelan, Ernest Diez Benavente, Susana Campino, Ruth McNerney, Luigi Palla and Taane G. Clark

Frontiers in Genetics (2019), 10: 922.

Technical summary

Background: Tuberculosis disease, caused by Mycobacterium tuberculosis, is a major
public health problem. The emergence of M. tuberculosis strains resistant to existing
treatments threatens to derail control efforts. Resistance is mainly conferred by mutations
in genes coding for drug targets or converting enzymes, but our knowledge of these
mutations is incomplete. Whole genome sequencing (WGS) is an increasingly common
approach to rapidly characterize isolates and identify mutations predicting antimicrobial
resistance, thereby providing a diagnostic tool to assist clinical decision making.
Methods: We applied machine learning approaches to 16,688 M. tuberculosis isolates
that have undergone WGS and laboratory drug-susceptibility testing (DST) across 14
antituberculosis drugs, with 22.5% of samples being multidrug resistant and 2.1% being
extensively drug resistant. We used non-parametric classification-tree and gradient-boosted-tree
models to predict drug resistance and uncover any associated novel putative
mutations. We fitted separate models for each drug, with and without “co-occurrent
resistance” markers known to cause resistance to drugs other than the one of interest.
Predictive performance was measured using sensitivity, specificity, and the area under the
receiver operating characteristic curve, assuming DST results as the gold standard.
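The three performance measures named above can be illustrated with a minimal sketch. It is not the authors' code: the toy labels, the toy scores, the 0.5 decision threshold, and the function names are all assumptions made for illustration. DST results are treated as the reference standard (1 = resistant, 0 = susceptible), and the area under the curve is computed as the proportion of resistant/susceptible pairs in which the resistant isolate receives the higher score, with ties counted as half.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int],
                            y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity); 1 = resistant, 0 = susceptible."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (resistant, susceptible) pairs in which the
    resistant isolate has the higher score; tied pairs count as 0.5.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: DST labels and model scores for eight isolates.
dst = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]

# Assumed decision threshold of 0.5 for calling an isolate resistant.
pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(dst, pred)
auc = roc_auc(dst, scores)
```

Sensitivity and specificity depend on the chosen threshold, whereas the area under the curve summarizes performance across all thresholds, which is one reason to report all three.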
Results: The predictive performance was highest for resistance to first-line drugs,
amikacin, kanamycin, ciprofloxacin, moxifloxacin, and multidrug-resistant tuberculosis
(area under the receiver operating characteristic curve above 96%), and lowest for third-line
drugs such as D-cycloserine and para-aminosalicylic acid (area under the curve below
85%). The inclusion of co-occurrent resistance markers led to improved performance
for some drugs and superior results when compared to similar models in other large-scale
studies, which had smaller sample sizes. Overall, the gradient-boosted-tree models
performed better than the classification-tree models. The mutation-rank analysis detected
no new single nucleotide polymorphisms linked to drug resistance. Discordance between
DST and genotypically inferred resistance may be explained by DST errors, novel rare
mutations, hetero-resistance, and nongenomic drivers such as efflux-pump upregulation.
Conclusion: Our work demonstrates the utility of machine learning as a flexible approach
to drug resistance prediction that is able to accommodate a much larger number of
predictors and to summarize their predictive ability, thus assisting clinical decision
making and single nucleotide polymorphism detection in an era of increasing WGS data
generation.
