Machine learning projection model for Covid-19

Youyang Gu, an independent data scientist, has created a statistical projection model for Covid-19 that uses machine learning techniques to fit a classical SEIR infectious disease model to the data for daily confirmed cases and deaths, taking into account the effects of social distancing and other factors. From the results I’ve looked at, it appears to be one of the better performing models around. The plots below show results for Switzerland, the USA and the United Kingdom based on data up to 31 May.

The second plot for each country shows R-t, the effective reproduction number at time t. When R-t is greater than 1, the epidemic is growing exponentially, and when R-t is less than 1 it is declining. The basic reproduction number in the absence of interventions to reduce transmission, R0, is typically around 2 for most countries, depending on factors such as population density and crowding. In New York, for example, R0 was close to 4.
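To make the role of the reproduction number concrete, here is a minimal sketch of a textbook SEIR model stepped forward one day at a time. It is not Gu's model or code: the function name `simulate_seir`, the 5-day incubation and infectious periods, the population size and the constant reproduction number are all assumptions chosen for illustration, and there is no machine learning fit to data here.

```python
def simulate_seir(reproduction_number, days=60, population=1_000_000,
                  initial_infectious=100, incubation_days=5.0,
                  infectious_days=5.0):
    """Daily-step SEIR sketch with a constant reproduction number.

    All default values are illustrative assumptions, not fitted values.
    Returns a list with the number of new infections on each day.
    """
    sigma = 1.0 / incubation_days          # daily rate of moving from E to I
    gamma = 1.0 / infectious_days          # daily rate of moving from I to R
    beta = reproduction_number * gamma     # transmission rate implied by R

    s = float(population - initial_infectious)  # susceptible
    e = 0.0                                     # exposed, not yet infectious
    i = float(initial_infectious)               # infectious
    r = 0.0                                     # recovered or removed

    new_infections = []
    for _ in range(days):
        infections = beta * s * i / population
        onsets = sigma * e
        recoveries = gamma * i
        s -= infections
        e += infections - onsets
        i += onsets - recoveries
        r += recoveries
        new_infections.append(infections)
    return new_infections


if __name__ == "__main__":
    # Compare a reproduction number slightly above 1 with one slightly below 1.
    for value in (1.1, 0.9):
        series = simulate_seir(value)
        print(value, round(series[30], 2), round(series[-1], 2))
```

Because almost everyone in this toy population remains susceptible over 60 days, the reproduction number passed in behaves approximately like R-t. Once the first few days of transient have passed, daily infections rise slowly when it is 1.1 and fall slowly when it is 0.9, which is the threshold behaviour described above.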

Looking across the country projections, it is interesting that R-t is currently slightly below 1 for countries such as Switzerland and the UK, and marginally above 1 for the USA. It is more substantially below 1 for a few countries such as Norway and Australia, and above 1 for some countries, e.g. Brazil, Russia and Nigeria.

A lot of people have now published strong criticisms of the IHME modelling, many identifying the major problem, which I noticed early on, of fitting a mathematically symmetric curve to the epidemic. Youyang Gu also compares the IHME projections with his own and shows severe under- and over-estimation issues with the IHME projections, which change wildly with model updates and iterations. See the plot below for a comparison.

Gu concludes:

“Models are going to make wrong predictions, but it’s important that we correct them as soon as new data shows otherwise. The problem with IHME is that they refused to recognize and update their wrong assumptions for many weeks. Throughout April, millions of Americans were falsely led to believe that the epidemic would be over by June because of IHME’s projections.

“On April 30, the director of the IHME, Dr. Chris Murray, appeared on CNN and continued to advocate their model’s 72,000 deaths projection by August. On that day, the US reported 63,000 deaths, with 13,000 deaths coming from the previous week alone. Four days later, IHME nearly doubled their projections to 135,000 deaths by August. One week after Dr. Murray’s CNN appearance, the US surpassed his 72,000 deaths by August estimate. It seems like an ill-advised decision to go on national television and proclaim 72,000 deaths by August only to double the projections a mere four days later.

“Unfortunately, by the time IHME revised their projections in May, millions of Americans have heard their 60,000-70,000 estimate. It may take a while to undo that misconception and undo the policies that were put in place as a result of this misleading estimate.”


2 Responses to Machine learning projection model for Covid-19

  1. Peter Byass says:

    Thanks Colin & Gu, interesting. I notice that especially in the UK and USA death data there are strong weekly harmonics, suggesting these are deaths by day of registration rather than by day of death? Does this matter in relation to the future projection or its UI? If you look at English hospital deaths by actual date of death (which account for most of UK deaths anyway) the curve is much smoother.

    • colinmathers says:

      It should make little difference to the projections, as long as there is not a significant time trend in the delay between death and reporting, which seems unlikely except perhaps at the very start of the epidemic when reporting systems are being established. Moving to use day of death would require deciding how long to wait to get relatively complete data for the day of death, since there will be a distribution of the delay with a tail. And this would likely have more impact on the quality of the projection than using the reported data. A similar issue arises with analysis of annual registered deaths. I’m familiar with the situation in Australia, where there is around 2-3 months after the end of the year till the national dataset is relatively complete for the deaths that occurred in the year. But by relatively complete, I mean that non-injury deaths are relatively complete, apart from a small proportion delayed by autopsies or inquiries. But for injury deaths a larger proportion are delayed by autopsies and coronial investigations, and in the case of potential murders or suicides for much longer. And those high-profile media cases where someone disappears and the body is never found become deaths not registered till 7 years after the disappearance.
