Green Algorithms for Health Data Science

Loïc Lannelongue: Department of Public Health and Primary Care, University of Cambridge (UK)

Jason Grealey: Baker Heart and Diabetes Institute; Department of Mathematics and Statistics, La Trobe University (Australia)

Michael Inouye: Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge; Baker Heart and Diabetes Institute; The Alan Turing Institute (UK); Health Data Research UK Cambridge

In the words of Dr Richard Horton, Editor in Chief of The Lancet [1]:

“The climate emergency that we’re facing today is the most important existential crisis facing the human species, and since medicine is all about protecting and strengthening the human species it should be absolutely foundational to the practice of what we do every single day.”

We, in the field of health data science, also abide by the Hippocratic Oath and should rise to the challenge of our climate emergency.

We tend to think that our main contribution is to design algorithms to solve the big health challenges of our time. Although there is little doubt that data science will be a key tool to tackle climate change, we often forget to consider how our own work also adds to the problem. The infrastructure we use, the algorithms and code that we write, all consume large quantities of electricity whose production is responsible for significant greenhouse gas emissions.

There is a widespread underappreciation, and indeed naivete, in our community as to the effects of our algorithms on carbon emissions and global warming. In searching for “data science and climate change”, one has to pass the first ten pages of Google results to find the first article addressing the potential negative impact of algorithms [2]. This needs to change.

In today’s data-driven world, a problem doesn’t exist until we can obtain data on it, which makes quantifying our carbon footprint a necessary first step to understand the depth of the issue before taking steps to reduce our impact.

Fewer than a handful of studies have tried to quantify algorithmic emissions [3]–[5]. These concluded that more sustainable research is needed, particularly in image analysis and natural language processing, as these models are famously power-hungry and expensive to train. For example, the latest of Google’s chatbots was trained continuously for 30 days, using over 2,000 TPU cores [6]. While some initial steps have been made, we remain largely in the dark about the footprint of much of the computational research carried out worldwide.

With that in mind, we propose a global framework enabling all scientists to easily measure their carbon emissions using an online calculator, available at www.green-algorithms.org. We use carbon-dioxide equivalent^[1] (CO2e) as a single indicator of carbon impact. However, “2.3 kg of CO2e” is difficult to relate to, so we contextualise emissions calculations by also providing “tree-months”: the amount of CO2 a mature tree is able to sequester (absorb) in one month, which is estimated at approximately 1 kg^[2]. If a project’s emissions correspond to 120 tree-months, this means that 10 trees would need one full year to absorb its emissions. As we have found, this threshold is reached worryingly quickly.
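
The tree-month conversion can be sketched in a few lines. This is a minimal illustration rather than the calculator's own code: it assumes the figure of 11.4 kg of CO2 per mature tree per year given in the footnote, and the function names are hypothetical.

```python
# Sequestration rate of one mature tree, taken from the footnote (11.4 kg per year).
KG_CO2_PER_TREE_YEAR = 11.4
KG_CO2_PER_TREE_MONTH = KG_CO2_PER_TREE_YEAR / 12  # about 0.95 kg


def tree_months(kg_co2e: float) -> float:
    """Convert emissions in kg CO2e into tree-months (hypothetical helper)."""
    return kg_co2e / KG_CO2_PER_TREE_MONTH


def trees_for_one_year(n_tree_months: float) -> float:
    """Number of trees needed to absorb the emissions within one year."""
    return n_tree_months / 12


if __name__ == "__main__":
    print(f"{tree_months(2.3):.1f} tree-months")  # 2.4 tree-months
    print(trees_for_one_year(120))                # 10.0
```

With the rounded value of 1 kg per tree-month used in the text, 2.3 kg of CO2e is simply about 2.3 tree-months; the exact footnote value gives a slightly higher figure.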

Many health data scientists are involved in genetic analyses, where performing a genome-wide association study (GWAS) is a frequent, sometimes daily, occurrence. For popular resources such as the UK Biobank, having 1,000 researchers worldwide each run a GWAS on 100 traits would release 290 tonnes of CO2e into the atmosphere. It would take 1,000 mature trees over 25 years to absorb this amount of CO2. Two hundred and ninety tonnes of CO2e is also approximately equivalent to each of the 1,000 scientists driving 1,000 miles (about 1,600 km) in an average European car. Other tools and analyses, such as de novo human genome assembly or molecular dynamics simulations, produce similarly jarring carbon emissions when used at scale^[3].
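
The arithmetic behind these comparisons can be checked with a short sketch. It uses only the numbers quoted above and the footnote's 11.4 kg per tree per year; the per-trait and per-kilometre values are simply what those numbers imply, not independent measurements, and the variable names are illustrative.

```python
# Figures quoted in the text.
TOTAL_KG_CO2E = 290 * 1000      # 290 tonnes, in kg
N_RESEARCHERS = 1000
TRAITS_PER_RESEARCHER = 100
N_TREES = 1000
KM_DRIVEN_PER_RESEARCHER = 1600

# Figure from the footnote: CO2 absorbed by one mature tree in a year.
KG_CO2_PER_TREE_YEAR = 11.4

# Years needed for 1,000 trees to absorb the total emissions.
years_to_absorb = TOTAL_KG_CO2E / (N_TREES * KG_CO2_PER_TREE_YEAR)

# Emissions implied per single-trait GWAS run.
kg_per_gwas = TOTAL_KG_CO2E / (N_RESEARCHERS * TRAITS_PER_RESEARCHER)

# Car emissions per km implied by the driving comparison.
kg_per_km = TOTAL_KG_CO2E / (N_RESEARCHERS * KM_DRIVEN_PER_RESEARCHER)

if __name__ == "__main__":
    print(f"{years_to_absorb:.1f} years")        # 25.4 years
    print(f"{kg_per_gwas:.1f} kg per GWAS")      # 2.9 kg per GWAS
    print(f"{kg_per_km:.2f} kg CO2e per km")     # 0.18 kg CO2e per km
```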

Now that we have a measurement tool, what can be done? One approach might be to better capture the cost of computation: not just the financial cost but also the carbon cost. This could inform the design, selection, and implementation of algorithms, as well as which analyses one needs to run. While financial costs are largely disclosed in grants, we would posit that there is a public interest in the transparency of carbon emissions: CO2e estimates could be noted in academic publications.

Finally, there are ways for individual health data scientists themselves to limit their carbon impact:

Only request the minimum memory you need. Power consumption depends on requested memory, not the actual memory usage. Requesting more memory “to be safe” is common but easily fixable.
Test the algorithm or pipeline on small subsets of data. That way, a power-hungry analysis may need to be run only once.
Optimise the algorithm. Limit the running time, CPU and memory requirements. If using established software, this often simply involves updating to the most recent version.
Favour energy-efficient data centres. Many data centres make their power usage effectiveness (PUE) available so that researchers can make informed decisions.
Carefully choose the location of the servers if possible. For example, running the same algorithm on the same task would produce 64 times more CO2e in Australia than it would in Switzerland.
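
To see how these levers interact, here is a deliberately simplified sketch. It is an assumption-laden model, not necessarily the method used by the online calculator: energy is taken as running time multiplied by the power drawn by cores and requested memory, scaled by the data centre's PUE, and emissions as energy multiplied by the carbon intensity of the local electricity. The function name, parameter names and all numeric values in the example are hypothetical placeholders.

```python
def carbon_footprint_g(
    runtime_h: float,
    n_cores: int,
    power_per_core_w: float,
    requested_memory_gb: float,
    power_per_gb_w: float,
    pue: float,
    carbon_intensity_g_per_kwh: float,
) -> float:
    """Estimate emissions in g CO2e under a simplified, assumed model."""
    # Power draw depends on the memory requested, not the memory actually used.
    power_w = n_cores * power_per_core_w + requested_memory_gb * power_per_gb_w
    # The PUE scales the energy to account for data centre overheads.
    energy_kwh = runtime_h * power_w * pue / 1000
    return energy_kwh * carbon_intensity_g_per_kwh


if __name__ == "__main__":
    # All values below are made-up placeholders, not measurements.
    job = dict(runtime_h=12, n_cores=8, power_per_core_w=10.0,
               requested_memory_gb=64, power_per_gb_w=0.4, pue=1.6)
    low = carbon_footprint_g(**job, carbon_intensity_g_per_kwh=10.0)
    high = carbon_footprint_g(**job, carbon_intensity_g_per_kwh=640.0)
    # Emissions scale linearly with carbon intensity, so a 64-fold
    # difference in intensity gives a 64-fold difference in emissions.
    print(high / low)
```

In this model every item on the list maps to one argument: shorter running times, fewer cores, less requested memory, a lower PUE and a cleaner electricity grid each reduce the estimate proportionally.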

Perhaps most simply, run only the analysis you need. Overall, the environmental impact of data science and computational research is an emerging and rapidly evolving field of study with seemingly endless research areas, including infrastructure, algorithms, software and best practice. We are hopeful that these initial steps lay the foundation for future studies and promote debate amongst researchers so that we can work towards greener algorithms.

^[1] The amount of CO2 that would have the same effect on global warming.

^[2] 950 g to be exact, based on an estimate of 11.4 kg per year [7].

^[3] https://twitter.com/Jason_Grealey/status/1236985636373925888