Title:

Big data, Google, and infectious disease prediction: A statistical perspective

Abstract:

Big data generated from the Internet present a great opportunity for real-time surveillance and tracking of infectious diseases, such as flu in the United States or dengue fever in tropical countries. Real-time predictions built on these data could help public health officials make timely decisions to save lives. We propose a theoretically rooted method that leads to robust and accurate real-time tracking of infectious diseases. Our method significantly outperforms all previous internet-based tracking models, including Google Flu Trends and Google Dengue Trends.

In the case of flu, we introduce our real-time digital flu detection method ARGO (AutoRegressive with GOogle data), which combines time series information with Google search data. ARGO is derived from a hidden Markov structure of the data-generating mechanism. With a sliding two-year training window and an L1 penalty, ARGO can incorporate new information as it becomes available and can automatically select or adjust the most useful Google search queries. We extended ARGO to track dengue fever with great success in five tropical countries: Brazil, Mexico, Thailand, Singapore, and Taiwan. ARGO was then further extended to incorporate cloud-based electronic health records and to generate near-future predictions weeks ahead. Our latest development extends the method to finer spatial scales. Thanks to the ubiquity of internet search data, ARGO is now capable of real-time disease tracking not only at the national level but also at the regional level. The upgraded ARGO uses penalized spatial-temporal information pooling, making it flexible, self-correcting, robust, and scalable.
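As a rough illustration of the sliding-window, L1-penalized idea described above, the Python sketch below refits a lasso regression at every time step on the most recent window of data, using lagged disease counts and same-week search volumes as predictors. It is a minimal sketch and not the authors' implementation: the function names lasso_cd and argo_style_nowcast, the 104-week window (which assumes weekly data), the number of lags, the penalty value, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=200):
    """L1-penalized least squares by cyclic coordinate descent.

    Minimizes (1/2n) * ||y - b0 - X b||^2 + lam * ||b||_1,
    leaving the intercept b0 unpenalized.
    """
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centering absorbs the intercept
    col_sq = (Xc ** 2).sum(axis=0) / n
    beta = np.zeros(p)
    resid = yc.copy()                        # residual for the current beta
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue                     # constant column: leave at zero
            # Correlation of column j with the residual that excludes column j.
            rho = Xc[:, j] @ resid / n + col_sq[j] * beta[j]
            # Soft-thresholding update for coordinate j.
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += Xc[:, j] * (beta[j] - new)
            beta[j] = new
    intercept = y_mean - x_mean @ beta
    return intercept, beta


def argo_style_nowcast(y, G, window=104, n_lags=3, lam=0.01):
    """Estimate y[t] from its own lags and search volumes G[t].

    The model is refit at every t on the previous `window` observations,
    so the selected search terms can change over time.
    """
    def features(s):
        # Most recent lag first, followed by the search volumes at time s.
        return np.concatenate([y[s - n_lags:s][::-1], G[s]])

    preds = {}
    for t in range(window + n_lags, len(y)):
        X = np.array([features(s) for s in range(t - window, t)])
        b0, b = lasso_cd(X, y[t - window:t], lam)
        preds[t] = b0 + features(t) @ b
    return preds


# Synthetic example (hypothetical data): 5 search terms, 2 of them informative.
rng = np.random.default_rng(0)
T = 150
G = rng.normal(size=(T, 5))
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.6 * y[t - 1] + 0.8 * G[t, 0] - 0.5 * G[t, 1] + 0.1 * rng.normal()

preds = argo_style_nowcast(y, G)
errors = np.array([preds[t] - y[t] for t in preds])
print("out-of-sample RMSE:", np.sqrt(np.mean(errors ** 2)))
```

Refitting inside the loop is what lets the estimate adapt as new data arrive, and the L1 penalty sets the coefficients of uninformative search terms to exactly zero. In practice the penalty strength would be chosen by cross-validation rather than fixed.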

 

Bio:

Shihao Yang is a PhD candidate in the Department of Statistics at Harvard University, advised by Prof. Samuel Kou. His primary research interest is harnessing the power of big data to solve real-life problems, with a focus on three areas: methodological development, computational tools, and probabilistic modeling.

In methodological development, he has developed methods for infectious disease forecasting based on internet search data, and built a tailor-made matching method to study cancer immunotherapy with electronic health data. In computation, he has introduced a new method for parallelizable Markov chain Monte Carlo, and a fast approximation method for inference in dynamic systems via constrained Gaussian processes. In probabilistic modeling, he has proposed novel stochastic differential equations to capture the underlying dynamics of high-speed, high-volume financial market transactions.