Conceived and designed the experiments: PMP AMS AS. Performed the experiments: PMP AMS AS. Analyzed the data: PMP AMS AS. Contributed reagents/materials/analysis tools: AMS AS. Wrote the paper: PMP AMS AS.
The authors have declared that no competing interests exist.
Twitter is a free social networking and micro-blogging service that enables its millions of users to send and read each other's “tweets,” or short, 140-character messages. The service has more than 190 million registered users and processes about 55 million tweets per day. Useful information about news and geopolitical events lies embedded in the Twitter stream, which embodies, in the aggregate, Twitter users' perspectives and reactions to current events. By virtue of sheer volume, content embedded in the Twitter stream may be useful for tracking or even forecasting behavior if it can be extracted in an efficient manner. In this study, we examine the use of information embedded in the Twitter stream to (1) track rapidly-evolving public sentiment with respect to H1N1 or swine flu, and (2) track and measure actual disease activity. We also show that Twitter can be used as a measure of public interest or concern about health-related events. Our results show that estimates of influenza-like illness derived from Twitter chatter accurately track reported disease levels.
An estimated 113 million people in the United States use the Internet to find
health-related information
Search query data provide one view of Internet activity (i.e., the proportion of individuals searching for a particular topic over time), albeit one that is both noisy and coarse. The general idea is that increasing search query activity approximates increasing interest in a given health topic. Since some search query data also carry geographic information (generally based on the issuing IP address), it may also be possible to detect simple geospatial patterns. But search query data do not provide any contextual information; questions like why the search was initiated in the first place are difficult to answer. People search for health information for any number of reasons: concern about themselves, their family, or their friends. Some searches are simply due to general interest, perhaps instigated by a news report or a recent scientific publication. Without sufficient contextual information, the relation between search query activity and underlying disease trends remains somewhat unclear.
Twitter is a free social networking and micro-blogging service that enables its
millions of users to send and read each other's “tweets,” or short
messages limited to 140 characters. Users determine whether their tweets can be read
by the general public or are restricted to preselected “followers.” The
service has more than 190 million registered users and processes about 55 million
tweets per day
These examples suggest that useful information about news and geopolitical events
lies embedded in the Twitter stream. Although the Twitter stream contains much
useless chatter, by virtue of the sheer number of tweets, it will still contain
enough useful information for tracking or even forecasting behavior when extracted
in an appropriate manner. For example, Twitter data has been used to measure
political opinion, to measure public anxiety related to stock market prices
In order to explore public concerns regarding rapidly evolving H1N1 activity, we
collected and stored a large sample of public tweets beginning April 29, 2009
that matched a set of pre-specified search terms:
Color-coded dots represent tweets issued by users (shown at the users' self-declared home location). Hovering over the dot displays the content of the tweet; here, the user name is intentionally obscured. A client-side JavaScript application updates the map in near-real time, showing the 500 most recent tweets matching the preselected influenza-related keywords.
Beginning on October 1, 2009, we collected an expanded sample of tweets using
Twitter's new streaming application programming interface (API)
Note that the Twitter stream is filtered in accordance with Twitter's API documentation; the tweets analyzed here therefore constitute a representative subset of the stream rather than the stream in its entirety.
Moreover, because our main interest was to monitor influenza-related traffic
within the United States, we also excluded all tweets tagged as originating
outside the U.S., tweets from users with a non-U.S. timezone, and any tweets not
written in English. We also excluded all tweets having fewer than 5 characters,
those containing non-ASCII characters, and tweets sent by a client identifying
itself as “API” (the latter are usually generated by computer and
therefore tend to be “spam”). The remaining tweets were used to
produce a dictionary of English words, from which all commonly-used keywords
comprising Twitter's informal messaging conventions (e.g., #hashtag, @user,
RT, links, etc.) were removed. Porter's Stemming Algorithm
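As a rough sketch, the exclusion and tokenization rules above might look like the following. This is an illustrative stand-in, not the authors' pipeline: the tweets are hypothetical, and the one-rule suffix stripper is a crude placeholder for Porter's stemming algorithm.

```python
import re

def keep_tweet(text, client):
    """Apply the exclusion rules: minimum length, ASCII-only, non-"API" clients."""
    if len(text) < 5:
        return False
    if not text.isascii():          # drop tweets containing non-ASCII characters
        return False
    if client == "API":             # usually computer-generated "spam"
        return False
    return True

def tokenize(text):
    """Split into words, dropping Twitter conventions: #hashtags, @users, RT, links."""
    words = []
    for tok in text.split():
        if tok.startswith(("#", "@")) or tok == "RT" or tok.startswith("http"):
            continue
        tok = re.sub(r"[^a-z]", "", tok.lower())  # keep the alphabetic core only
        if tok:
            words.append(tok)
    return words

def stem(word):
    """Crude suffix-stripping stand-in for Porter's stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical (text, client) pairs standing in for collected tweets.
tweets = [
    ("Feeling feverish, coughing all day #flu", "web"),
    ("RT @user check http://example.org", "API"),
    ("ok", "web"),
]
dictionary = set()
for text, client in tweets:
    if keep_tweet(text, client):
        dictionary.update(stem(w) for w in tokenize(text))
```

Only the first tweet survives the filters; its hashtag is dropped before stemming, so the resulting dictionary contains stems such as "cough" and "feel" but not "flu".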
Although influenza is not a nationally notifiable disease in the U.S., an
influenza surveillance program does exist
The weekly term-usage statistics described previously were used to estimate
weekly ILI. To determine the relative contribution of each influenza-related
Twitter term, we used Support Vector Regression
A classification system categorizes examples as instances of some class or concept. For example, one might build a classification system to discriminate between low and high risk for hospital readmission on the basis of information provided in a patient record. A learning method attempts to automatically construct a classification system from a collection, or training set, of input examples. Elements of the training set are usually represented as a collection of values for prespecified features or attributes; for this example, these features could be such measurable properties as age, recent hospitalizations, recent clinic visits, etc. Training set elements are marked a priori with their outcome, or class membership (e.g., “high risk”). Once generated, the classification system can then be used to predict the outcome of future examples on the basis of their respective feature values. Commonly-used learning methods include neural networks, Bayesian classifiers, nearest-neighbor methods, and so on; here, we use SVMs.
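To make the preceding description concrete, here is a minimal sketch of one of the listed learning methods, a nearest-neighbor classifier, applied to the hypothetical hospital-readmission example; the feature values and labels are invented for illustration.

```python
def predict_1nn(training_set, features):
    """Classify a new example by the label of its closest training example
    (squared Euclidean distance over the feature vector)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(training_set, key=lambda ex: dist(ex[0], features))
    return label

# Hypothetical training set: (age, recent hospitalizations, recent clinic visits),
# each marked a priori with its outcome class.
training_set = [
    ((72, 3, 5), "high risk"),
    ((45, 0, 1), "low risk"),
    ((80, 2, 4), "high risk"),
    ((30, 0, 2), "low risk"),
]
print(predict_1nn(training_set, (70, 2, 3)))  # prints "high risk"
```

In practice the features would be rescaled first (here, age would otherwise dominate the distance), and a learned model such as an SVM would replace the raw nearest-neighbor lookup.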
SVMs use quadratic programming, a numerical optimization technique, to calculate
a maximum-margin hyperplane separating classes of input data points.
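For reference, the ε-insensitive support vector regression problem solved by this optimization can be written, in standard textbook form (not reproduced from the paper), as the quadratic program

```latex
\min_{w,\,b,\,\xi,\,\xi^*} \;\; \tfrac{1}{2}\lVert w\rVert^2 + C \sum_{i=1}^{n} \left(\xi_i + \xi_i^*\right)
\quad \text{subject to} \quad
\begin{cases}
  y_i - \langle w, x_i\rangle - b \le \varepsilon + \xi_i, \\
  \langle w, x_i\rangle + b - y_i \le \varepsilon + \xi_i^*, \\
  \xi_i,\ \xi_i^* \ge 0,
\end{cases}
```

where the x_i are feature vectors, the y_i are target values, and C and ε are user-chosen regularization and tolerance parameters.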
When used for regression, SVMs produce a nonlinear model that minimizes a preselected linear-error-cost function where features serve as regression variables. Each input data point (or tweet) is described as a collection of values for a known set of variables or features: here, the feature set is defined as the collection of terms in the dictionary appearing more than 10 times per week. For each time interval, the value of a feature is given by its usage statistic for the corresponding term. Thus each tweet is encoded as a feature vector of length equal to the number of dictionary terms occurring more than 10 times per week, where the value assigned is the fraction of total tweets in that time interval that contain the corresponding dictionary term after stemming.
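The feature encoding just described can be sketched as follows; the terms, counts, and threshold are hypothetical toy values (the real feature set uses the collected dictionary and a 10-per-week cutoff).

```python
def weekly_features(weekly_tweets, min_count):
    """Encode each week as a vector of term-usage statistics: the fraction of
    that week's tweets containing each retained dictionary term."""
    # Count, per week, how many tweets contain each term (once per tweet).
    counts = []
    for tweets in weekly_tweets:
        week = {}
        for terms in tweets:
            for t in set(terms):
                week[t] = week.get(t, 0) + 1
        counts.append(week)
    # Feature set: terms whose count exceeds the threshold (here, in any week).
    vocab = sorted({t for week in counts for t, c in week.items() if c > min_count})
    # Feature value: fraction of the week's tweets containing the term.
    vectors = []
    for tweets, week in zip(weekly_tweets, counts):
        n = len(tweets)
        vectors.append([week.get(t, 0) / n for t in vocab])
    return vocab, vectors

# Toy example: two "weeks" of already-stemmed tweets.
week1 = [["flu", "fever"], ["flu"], ["fever", "cough"], ["flu", "cough"]]
week2 = [["flu"], ["flu", "fever"], ["vaccin"]]
vocab, vectors = weekly_features([week1, week2], min_count=2)
```

With these toy counts only "flu" clears the threshold, so each week is encoded as a one-element vector: 3/4 of week-1 tweets and 2/3 of week-2 tweets contain it.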
For the estimation work reported here, we relied on the widely adopted
open-source libSVM implementation
The first data set consists of 951,697 tweets containing keywords
The red line represents the number of H1N1-related tweets (i.e.,
containing keywords
Percentage of observed influenza-related tweets that also contain hand-hygiene-related keywords (red line) or mask-related keywords (green line). Spikes correspond to increased interest in these particular disease countermeasures, perhaps in reaction to, e.g., a report in the popular media.
Percentage of observed influenza-related tweets that also contain travel-related keywords (red and green lines) or pork consumption-related keywords (blue line). The relative rate of public concern about pork consumption fell steadily during the month of May, in contrast to increased public concerns about travel-related disease transmission.
Percentage of observed influenza-related tweets containing references to specific anti-viral drugs.
The second data set consists of 4,199,166 tweets selected from the roughly 8
million influenza-related tweets (i.e., keywords
Percentage of observed influenza-related tweets containing vaccination-related terms.
Percentage of observed H1N1 vaccination-related tweets containing terms related to pregnancy (green line) or vaccine shortage (red line). The relatively low rates observed may indicate either a lack of public concern or a lack of public awareness.
Percentage of observed H1N1 vaccination-related tweets containing terms related to risk perception (red line) or Guillain–Barré syndrome (green line).
In contrast to the descriptive results above, we next focus on making
quantitative estimates of ILI values based on the Twitter stream using
support-vector regression. Weekly ILI values were estimated using a model
trained on the roughly 1 million influenza-related tweets from the second data
set (October 1, 2009 through May 20, 2010) that were unambiguously tagged with
US locations, using CDC-reported ILI values across the entire United States as
the objective. To verify the accuracy of our method, we used a standard
leave-one-out cross-validation methodology
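The validation scheme can be sketched generically as follows; the `fit`/`predict` pair below is a trivial mean predictor standing in for the support vector regression model actually used, and the ILI values are illustrative.

```python
def leave_one_out(xs, ys, fit, predict):
    """For each week i, train on all other weeks and estimate week i."""
    estimates = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        estimates.append(predict(model, xs[i]))
    return estimates

# Stand-in learner: ignores the features and predicts the mean training ILI%.
fit = lambda xs, ys: sum(ys) / len(ys)
predict = lambda model, x: model

ys = [1.0, 2.0, 3.0]
print(leave_one_out([None] * 3, ys, fit, predict))  # prints [2.5, 2.0, 1.5]
```

Each estimate is produced by a model that never saw the held-out week, exactly as in the 33-week evaluation described in the text.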
The green line shows the CDC's measured ILI% for the 33-week period starting in Week 40 (October 2009) through Week 20 (May 2010). The red line shows the output of a leave-one-out cross-validation test of our SVM-based estimator. Each estimated datapoint is produced by applying a model to the specified week of tweets after training on the other 32 weeks of data and their respective CDC ILI% values.
We next move beyond estimating national ILI levels to making real-time estimates of ILI activity in a single CDC region. Real-time estimates constitute an important tool for public health practitioners, since CDC-reported data are generally only available one to two weeks after the fact.
Using support vector regression, we fit geolocated tweets to CDC region ILI readings from nine of the ten CDC regions to construct a model. We then use the model to estimate ILI values for the remaining CDC region (Region 2, New Jersey and New York). Since many tweets lacked geographic information, this model was trained and tested on significantly less data (905,497 tweets for which we could accurately infer the US state of origin, less 90,000 of these belonging to Region 2); the remaining tweets were excluded from this analysis.
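The region hold-out procedure can be sketched as follows. Here, one-feature ordinary least squares stands in for the support vector regression actually used, and all region data points are invented for illustration (only the identity of Region 2 as the held-out region comes from the text).

```python
def fit_ols(xs, ys):
    """One-feature ordinary least squares, y ~ a*x + b (stand-in for SVR)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Hypothetical weekly data per CDC region: (flu-term usage statistic, CDC ILI%).
regions = {
    "Region 1": [(0.10, 1.1), (0.20, 2.1)],
    "Region 2": [(0.15, 1.6)],                 # held out of training
    "Region 3": [(0.30, 3.1), (0.40, 4.1)],
}
# Train on every region except Region 2 ...
train = [pt for r, pts in regions.items() if r != "Region 2" for pt in pts]
a, b = fit_ols([x for x, _ in train], [y for _, y in train])
# ... then estimate ILI% for the held-out region's weeks.
estimates = [a * x + b for x, _ in regions["Region 2"]]
```

Because the held-out region contributes nothing to the fit, its estimates test how well a model learned elsewhere transfers, which is the point of the evaluation described above.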
The green line shows the CDC's measured ILI% for Region 2 (New Jersey/New York) for the 33-week period starting in Week 40 (October 2009) through Week 20 (May 2010). The red line shows the output of our SVM-based estimator when applied to Region 2 tweet data. The estimator is first trained on all data from outside Region 2 and their respective region's CDC ILI% values.
Our results demonstrate that Twitter traffic can be used not only descriptively, i.e., to track users' interest and concerns related to H1N1 influenza, but also to estimate disease activity in real time, i.e., 1–2 weeks faster than current practice allows.
From a descriptive perspective, since no comparable data (e.g., survey results) are available, it is not possible to validate our results. But the trends observed are prima facie reasonable and quite consistent with expectations. For example, Twitter users' initial interest in antiviral drugs such as oseltamivir dropped at about the same time as official disease reports indicated most cases were relatively mild in nature, even though the overall number of cases was still increasing. Also, interest in hand hygiene and face masks seemed to coincide with public health messages from the CDC about the outbreak in early May. Interestingly, in October of 2009, concern regarding shortages did not appear, nor did interest in rare side effects, perhaps because they did not occur in any widespread fashion. Here, absence of a sustained detectable signal may indicate an apathetic public, or may simply indicate a lack of information in the media. In either case, our work proposes a mechanism to capture these concerns in real time, pending future studies to confirm our results using appropriate techniques for analyzing autocorrelated data.
Influenza reoccurs each season in regular cycles, but the geographic location,
timing, and size of each outbreak vary, complicating efforts to produce reliable and
timely estimates of influenza activity using traditional time series models. Indeed,
epidemics are the most difficult to anticipate and model
Using actual tweet contents, which often reflected the user's own level of disease and discomfort (i.e., users were tweeting about their symptoms and body temperature), we devised an estimation method based on well-understood machine learning methods. The accuracy of the resulting real-time ILI estimates clearly demonstrates that the subset of tweets identified and used in our models contains information closely associated with disease activity. Our results show that we were able to establish a distinct relationship between Twitter data and the epidemic curve of the 2009 H1N1 outbreak, both at a national level and within geographic regions.
Our Twitter-based model, in contrast to other approaches
Although, in theory, it is possible to gather diagnosis-level data in near-real time
from emergency department visits
Despite these promising results, there are several limitations to our study. First,
the use of Twitter is uniform across neither time nor geography. Mondays are usually
the busiest for Twitter traffic, while the fewest tweets are issued on Sundays;
also, people in California and New York produce far more tweets per person than
those in the Midwestern states (or, for that matter, in Europe). When and where
tweets are less frequent (or where only a subset of tweets contain geographic
information), the performance of our model may suffer. The difference in accuracy at
a national level and regional level observed in the
If future results are consistent with our findings, Twitter-based surveillance
efforts like ours and similar efforts underway in two European research groups
The authors wish to thank Ted Herman for his help and encouragement.
Note: This work was presented in part at the 9th Annual Conference of the International Society for Disease Surveillance in Park City, UT (December, 2010).