POLICY FORUM
BIG DATA
The Parable of Google Flu: Traps in Big Data Analysis
Large errors in flu prediction were largely avoidable, which offers lessons for the use of big data.
David Lazer,1,2* Ryan Kennedy,1,3,4 Gary King,3 Alessandro Vespignani3,5,6
In February 2013, Google Flu Trends (GFT) made headlines, but not for a reason that Google executives or the creators of the flu-tracking system would have hoped. Nature reported that GFT was predicting more than double the proportion of doctor visits for influenza-like illness (ILI) reported by the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from laboratories across the United States (1, 2). This happened despite the fact that GFT was built to predict CDC reports. Given that GFT is often held up as an exemplary use of big data (3, 4), what lessons can we draw from this error?
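As a rough sketch of the comparison behind that headline figure, the snippet below lines up weekly model estimates against CDC ILI surveillance and flags weeks where the estimate is more than double the surveillance value. The file names, column names, and data are hypothetical placeholders; this is not GFT's actual pipeline or data.

```python
# Hypothetical comparison of weekly model estimates with CDC ILI surveillance.
import pandas as pd

# Placeholder inputs (assumed columns: week, gft_ili_pct / cdc_ili_pct).
gft = pd.read_csv("gft_weekly.csv", parse_dates=["week"])
cdc = pd.read_csv("cdc_ilinet.csv", parse_dates=["week"])

# Align the two series on the reporting week and compute the ratio.
merged = gft.merge(cdc, on="week", how="inner")
merged["ratio"] = merged["gft_ili_pct"] / merged["cdc_ili_pct"]

# Weeks where the model estimate exceeds twice the surveillance value.
overshoot = merged[merged["ratio"] > 2.0]
print(overshoot[["week", "gft_ili_pct", "cdc_ili_pct", "ratio"]])
```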
The problems we identify are not limited to GFT. Research on whether search or social media can predict x has become commonplace (5–7) and is often put in sharp contrast with traditional methods and hypotheses. Although these studies have shown the value of these data, we are far from a place where they can supplant more traditional methods or theories (8). We explore two issues that contributed to GFT's mistakes, big data hubris and algorithm dynamics, and offer lessons for moving forward in the big data age.
Big Data Hubris
"Big data hubris" is the often implicit assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis. Elsewhere, we have asserted that there are enormous scientific possibilities in big data (9–11). However, quantity of data does not mean that one can ignore foundational issues of measurement, construct validity and reliability, and dependencies among data (12).
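To make that point concrete, here is a purely illustrative simulation, not drawn from the article's analysis: a measurement with a fixed systematic bias is averaged over ever larger samples, and while the sampling noise shrinks, the error settles at the bias rather than vanishing.

```python
# Toy illustration of "big data hubris": more data shrinks variance, not bias.
# All quantities here are synthetic assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)
true_rate = 0.02  # true proportion we would like to estimate


def biased_sample_mean(n, bias=0.02):
    # Each observation over-reports by a fixed amount, standing in for a
    # systematically skewed measurement instrument.
    draws = rng.binomial(1, true_rate, size=n) + bias
    return draws.mean()


for n in (1_000, 100_000, 10_000_000):
    est = biased_sample_mean(n)
    print(f"n={n:>10,}  estimate={est:.4f}  error={est - true_rate:+.4f}")
# The error settles near the bias (about +0.02) no matter how large n gets.
```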
1Lazer Laboratory, Northeastern University, Boston, MA 02115, USA. 2Harvard Kennedy School, Harvard University, Cambridge, MA 02138, USA.
References
2. D. R. Olson et al., PLOS Comput. Biol. 9, e1003256 (2013).
3. A. McAfee, E. Brynjolfsson, Harv. Bus. Rev. 90, 60 (2012).
4. S. Goel et al., Proc. Natl. Acad. Sci. U.S.A. 107, 17486 (2010).
5. …, Georgia, 11 to 15 July 2010 (Association for the Advancement of Artificial Intelligence, 2010), pp. …
6. J. Bollen et al., J. Comput. Sci. 2, 1 (2011).
7. F. Ciulla et al., EPJ Data Sci. 1, 8 (2012).
8. …, Boston, MA, 9 to 11 October 2011 (IEEE, 2011), p. 165; doi:10.1109/PASSAT/SocialCom.2011.98.
9. D. Lazer et al., Science 323, 721 (2009).
10. A. Vespignani, Science 325, 425 (2009).
11. G. King, Science 331, 719 (2011).
12. D. Boyd, K. Crawford, Inform. Commun. Soc. 15, 662 (2012).
13. J. Ginsberg et al., Nature 457, 1012 (2009).
14. S. Cook et al., PLOS ONE 6, e23610 (2011).
15. P. Copeland et al., Int. Soc. Negl. Trop. Dis. 2013, 3 (2013).
16. C. Viboud et al., Am. J. Epidemiol. 158, 996 (2003).
17. W. W. Thompson et al., J. Infect. Dis. 194 (suppl. 2), S82–S91 (2006).
18. I. M. Hall et al., Epidemiol. Infect. 135, 372 (2007).
19. J. B. S. Ong et al., PLOS ONE 5, e10036 (2010).
20. J. R. Ortiz et al., PLOS ONE 6, e18687 (2011).
23. E. Mustafaraj, P. Metaxas, in Proceedings of the WebSci10, Raleigh, NC, 26 and 27 April 2010 (Web Science Trust, 2010); http://journal.webscience.org/317/.
24. …, Francisco, CA, 7 to 11 August 2011 (AAAI, 2011), p. …
25. G. King, PS Polit. Sci. Polit. 28, 443 (1995).
27. R. Lazarus et al., BMC Public Health 1, 9 (2001).
28. R. Chunara et al., Online J. Public Health Inform. 5, e133 (2013).
29. D. Balcan et al., Proc. Natl. Acad. Sci. U.S.A. 106, 21484 (2009).
30. D. L. Chao et al., PLOS Comput. Biol. 6, e1000656 (2010).
31. J. Shaman, A. Karspeck, Proc. Natl. Acad. Sci. U.S.A. 109, 20425 (2012).
32. J. Shaman et al., Nat. Commun. 4, 2837 (2013).
33. E. O. Nsoesie et al., PLOS ONE 8, e67164 (2013).
34. …, New York, 2013), pp. 527–538.
35. A. J. Berinsky et al., Polit. Anal. 20, 351–368 (2012).