Start Your #Infodemic Research with These Newly Released COVID-19 Twitter IDs Datasets from @SMLabTO’s COVID-19 Twitter Pandemic Archive
Today, we are pleased to announce the formal release of a new COVID-19 Twitter Pandemic Archive, a catalog of datasets containing billions of Tweet IDs for COVID-19 tweets and a set of data visualizations featuring high-level monthly stats about the COVID-19 conversations on Twitter. The datasets are being offered as-is for archiving and non-commercial research purposes and are free to download and reuse.
The tweets in these datasets are collected via Twitter’s COVID-19 Streaming Endpoint (API) using a custom script developed by the Social Media Lab. According to Twitter, this new streaming endpoint has no data volume or throughput limitations, and offers a real-time, full-fidelity stream of public Tweets containing the full conversation about COVID-19. (For more information about what tweets are included in this collection see Twitter’s filtering rules for this endpoint.)
As per Twitter’s API Terms, each dataset only includes Tweet IDs (as opposed to the actual tweets and associated metadata). New datasets are uploaded to the web at the beginning of each month.
For each month, we prepare two data files:
- one file with Tweet IDs for all COVID-19 related tweets that we collect via the API, and
- a second file containing the subset of those Tweet IDs whose tweets also contain a vaccine-related word (i.e., a word starting with vaccin, vacin, or vax).
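The vaccine-related filter described above can be approximated with a regular expression. This is only a sketch: the exact matching rules used by the Social Media Lab are not published here, so the case-insensitive match and the word-boundary requirement are assumptions, and `mentions_vaccine` is a hypothetical helper name.

```python
import re

# Assumed pattern: a word that starts with one of the three prefixes
# named in the post (vaccin, vacin, vax). Case-insensitivity and the
# leading word boundary are assumptions, not documented behaviour.
VACCINE_WORD = re.compile(r"\b(?:vaccin|vacin|vax)\w*", re.IGNORECASE)


def mentions_vaccine(text: str) -> bool:
    """Return True if the text contains a word starting with a vaccine prefix."""
    return VACCINE_WORD.search(text) is not None
```

Under these assumptions, "Got my vaccine today" and "#Vaxxed" would match, while "antivax" would not, because the prefix does not begin the word.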
To rehydrate tweets from one of the datasets in the COVID-19 Twitter Pandemic Archive (or from a newly created random sample of Tweet IDs; see below), you can use third-party programs such as Hydrator, the Python library Twarc, or Communalytic Pro (dataset limit of 10M Tweet IDs).
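Because some rehydration tools cap the number of Tweet IDs per dataset, it may help to split a large ID file into smaller pieces first. The sketch below assumes the IDs arrive one per line; `chunk_ids` is a hypothetical helper, and the default of 10,000,000 simply mirrors the Communalytic Pro limit mentioned above.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunk_ids(lines: Iterable[str], max_per_chunk: int = 10_000_000) -> Iterator[List[str]]:
    """Yield lists of at most max_per_chunk Tweet IDs, skipping blank lines."""
    # Lazily strip whitespace and drop empty lines so large files
    # are never fully loaded into memory at once.
    cleaned = (line.strip() for line in lines if line.strip())
    while True:
        chunk = list(islice(cleaned, max_per_chunk))
        if not chunk:
            return
        yield chunk
```

Each yielded chunk could then be written to its own file and passed to the rehydration tool of your choice.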
As part of the release of this new research data resource, we are also releasing a companion Tweets Sampling Toolkit, which allows researchers to create smaller random samples of Tweet IDs drawn from one of the larger datasets available in the new COVID-19 Twitter Pandemic Archive.
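To illustrate the idea of drawing a random sample from a very large list of Tweet IDs, here is a minimal sketch using reservoir sampling, which needs only one pass over the data. It is not the Tweets Sampling Toolkit's own implementation; the function name `sample_tweet_ids` and the one-ID-per-line input format are assumptions.

```python
import random
from typing import Iterable, List, Optional


def sample_tweet_ids(lines: Iterable[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Return up to k Tweet IDs chosen uniformly at random in a single pass."""
    rng = random.Random(seed)
    reservoir: List[str] = []
    seen = 0  # number of non-blank IDs read so far
    for line in lines:
        tweet_id = line.strip()
        if not tweet_id:
            continue
        seen += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k IDs.
            reservoir.append(tweet_id)
        else:
            # Replace a random slot with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = tweet_id
    return reservoir
```

Passing an open file handle as `lines` keeps memory use proportional to the sample size rather than the size of the full dataset.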
In addition to creating a random sample, the Tweets Sampling Toolkit can also perform set operations such as union, difference, and intersection to compare two or more datasets. For example, if you have previously collected your own dataset of COVID-19 related tweets using Twitter’s Standard Search or Streaming API, you could compare it with one of the datasets published in the COVID-19 Twitter Pandemic Archive. You can use the “union” function provided in the Tweets Sampling Toolkit to merge two or more datasets of Tweet IDs while excluding duplicates. Alternatively, you can use the “difference” function to identify and recollect only those tweets (based on their Tweet IDs) that are not part of your original dataset. Finally, you can use the “intersection” function to locate Tweet IDs that appear in two or more datasets.
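The three set operations can be sketched with Python's built-in `set` type. This is an illustration of the concepts only, not the Toolkit's interface; `load_ids` is a hypothetical helper and the sample IDs are made up.

```python
from typing import Iterable, Set


def load_ids(lines: Iterable[str]) -> Set[str]:
    """Build a set of Tweet IDs from lines, ignoring blanks and duplicates."""
    return {line.strip() for line in lines if line.strip()}


# Made-up IDs standing in for your own collection and an archive dataset.
my_ids = load_ids(["101", "102", "103"])
archive_ids = load_ids(["102", "103", "104", "105"])

merged = my_ids | archive_ids    # union: every ID, duplicates removed
missing = archive_ids - my_ids   # difference: IDs you have not collected yet
shared = my_ids & archive_ids    # intersection: IDs present in both datasets
```

Here `missing` holds the IDs you would rehydrate to fill the gaps in your original collection.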