What's Coming Next from the COVID-19 Canada Open Data Working Group in 2021

Dec 31, 2020 9 min read blog

TL;DR

The COVID-19 Canada Open Data Working Group has been bringing you COVID-19 data and insights since March. This will continue as long as the pandemic remains relevant.

Since V-Day, our dataset has included data on Canada’s vaccine rollout. We will be expanding our data offerings further in 2021, as well as bringing you new ways to access and interact with our data, such as the upcoming Covid19CanadaData R package.

Our dashboard will also be developed further, with new ways to visualize health region-level trends and a public release of the source code. Watch this space.

Finally, I will continue to develop and expand the Archive of COVID-19 Data from Canadian Government Sources. My goal is to make it the go-to source for when someone asks “What happened during the COVID-19 epidemic in Canada?”

Happy New Year, everyone. We at the COVID-19 Canada Open Data Working Group look forward to keeping you all abreast of the latest trends in the COVID-19 epidemic in the year ahead. 2021 will start to look a lot rosier come spring, I promise.

What a year

My colleague Isha Berry and I founded the COVID-19 Canada Open Data Working Group in early March of this year to fill a critical data gap: the lack of a pan-Canadian picture of the developing COVID-19 pandemic. What began as a simple dataset of cases and deaths and a basic R Shiny dashboard has expanded into much more, most recently VaxView, our tracker for the vaccine rollout in Canada.

What’s next

This pandemic is far from over. This winter, we face what will in many ways be the most challenging phase of the pandemic, even as the hope offered by widespread vaccination in the spring grows ever closer. We will continue tracking this virus as long as it remains relevant.

That being said, some changes are coming in 2021. A lot, actually. This month, we ran a two-week survey on our dataset and dashboard and received a tremendous 242 complete responses, hearing from people all over Canada who use and/or consume our data. The feedback from these surveys will help us shape our priorities going into 2021. In the coming year, we aim to bring you more data and more ways to explore and interact with these data.

Our datasets are changing and expanding

The growing number of cases in Canada’s second wave has created a growing demand for manual data input from our team. Together, we have identified many parts of our data workflow that could benefit from automation, especially as data offerings from the provinces have become more consistent. This will allow us to work more efficiently and redirect our energy toward expanding our data offerings.

For example, we have identified many provinces with health region-level recovered and testing data, which we do not currently collect. We have also made progress on bringing more demographic data into our individual-level datasets. Our plan is to gradually incorporate these additional datasets into our offerings in the new year.

Another feature of our dataset is that we generally report data based on the date it was publicly reported, since this is the only date that is consistently offered by all jurisdictions. This is sub-optimal for some uses, especially for users focused on a single province, and creates undesirable patterns in the data (for example, for jurisdictions that don’t report on weekends and then report 3 days of data on Mondays).

There is a solution: converting official datasets (e.g., the CSV files offered by many provinces) to be compatible with our dataset to serve as a drop-in replacement. Ultimately, this should be incorporated into the JSON API (e.g., allowing users to download our dataset but with one province’s data being substituted with the official version).
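As a rough illustration of the drop-in idea, the sketch below swaps one province's rows in a combined time series for rows taken from an official source. The column names (`province`, `date`, `cases`), the function name and the sample values are all assumptions made for this example; they are not the actual schema of our dataset or of any provincial file.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def substitute_province(dataset: List[Row], official: List[Row], province: str) -> List[Row]:
    """Return a new list where `province` rows come from `official` instead of `dataset`."""
    # Keep every row that does not belong to the province being replaced
    kept = [row for row in dataset if row["province"] != province]
    # Accept only official rows for the requested province
    replacement = [row for row in official if row["province"] == province]
    # ISO dates sort correctly as strings, so this gives a stable province/date order
    return sorted(kept + replacement, key=lambda row: (row["province"], row["date"]))


# Hypothetical combined dataset: Province A did not report on the weekend,
# so Monday's row bundles three days of cases.
combined_dataset = [
    {"province": "Province A", "date": "2020-12-05", "cases": 0},
    {"province": "Province A", "date": "2020-12-06", "cases": 0},
    {"province": "Province A", "date": "2020-12-07", "cases": 1500},
    {"province": "Province B", "date": "2020-12-07", "cases": 300},
]

# Hypothetical official file for Province A, already converted to the same columns
official_a = [
    {"province": "Province A", "date": "2020-12-05", "cases": 500},
    {"province": "Province A", "date": "2020-12-06", "cases": 480},
    {"province": "Province A", "date": "2020-12-07", "cases": 520},
]

result = substitute_province(combined_dataset, official_a, "Province A")
for row in result:
    print(row["province"], row["date"], row["cases"])
```

The total for Province A is unchanged in this example, but the Monday spike is spread over the days the cases actually belong to. The hard part in practice is the conversion step that precedes this, which is specific to each official file.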

Some feedback in our evaluation surveys focused on making our data easier to link to other datasets, for example by linking the health region names we use to alternate health region names through the health region unique identifier (HR_UID). Our correspondence files provide this information and population values for all provinces and health regions, but we acknowledge it would be easier if these data were directly incorporated into the time series datasets themselves.
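To make the linking step concrete, here is a minimal sketch of joining a time series to a correspondence table on HR_UID. Only the HR_UID key comes from the text above; the other column names (`alt_name`, `population`, `health_region`), the identifiers and the values are hypothetical placeholders.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def add_region_metadata(time_series: List[Row], correspondence: List[Row]) -> List[Row]:
    """Attach alternate name and population to each time series row, matched on HR_UID."""
    # Index the correspondence rows by their unique identifier for fast lookup
    lookup = {row["HR_UID"]: row for row in correspondence}
    enriched = []
    for row in time_series:
        meta = lookup.get(row["HR_UID"], {})
        new_row = dict(row)  # copy so the original rows are left untouched
        # Unmatched identifiers get None rather than raising an error
        new_row["alt_name"] = meta.get("alt_name")
        new_row["population"] = meta.get("population")
        enriched.append(new_row)
    return enriched


# Hypothetical rows; the identifiers and values are made up for illustration
time_series = [
    {"HR_UID": 9901, "health_region": "Region One", "date": "2020-12-30", "cases": 12},
    {"HR_UID": 9902, "health_region": "Region Two", "date": "2020-12-30", "cases": 7},
    {"HR_UID": 9999, "health_region": "Not Reported", "date": "2020-12-30", "cases": 1},
]
correspondence = [
    {"HR_UID": 9901, "alt_name": "Region One Health Unit", "population": 100000},
    {"HR_UID": 9902, "alt_name": "Region Two Health Authority", "population": 250000},
]

enriched = add_region_metadata(time_series, correspondence)

# With population attached, a per-capita rate is a one-liner
rate_region_one = enriched[0]["cases"] * 100_000 / enriched[0]["population"]
print(enriched[0]["alt_name"], rate_region_one)
```

Because the join only adds columns and leaves the existing ones alone, code written against the current columns keeps working on the enriched rows.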

In the new year, we will draw up a proposal for a new data structure (which will only consist of adding new columns to the existing time series). We will then enter into a transition period where datasets will be provided in both old and new formats before the new format replaces the old format as the primary dataset.

If you’d like to be informed of upcoming changes to the dataset or have any comments or feedback, please send us an email at: ccodwg [at] gmail [dot] com.

We’re building an ecosystem for our data

We’re building an R package ecosystem to make it easier to access, explore and use our data to generate insights. The GitHub repository for our dataset is called Covid19Canada, so it’s only natural that any derivatives of this project build on this naming scheme.

Covid19CanadaAPI

Our JSON API launched in September as an alternate way of accessing our dataset. In particular, it makes it easy to access an always up-to-date pre-processed (e.g., only show time series for Ontario after a particular date) or summarized dataset (e.g., only show the most recent numbers for each province).
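The two kinds of output mentioned above, pre-processed and summarized, can be sketched locally as follows. This is a stand-in for the kind of processing described, not the API's actual endpoints, parameters or response format; the function names, column names and values are assumptions, and "after" is treated here as "on or after".

```python
from datetime import date
from typing import Any, Dict, List

Row = Dict[str, Any]


def filter_time_series(rows: List[Row], province: str, after: date) -> List[Row]:
    """Pre-processing: keep one province's rows dated on or after `after`."""
    return [row for row in rows if row["province"] == province and row["date"] >= after]


def latest_by_province(rows: List[Row]) -> Dict[str, Row]:
    """Summarizing: keep only the most recent row for each province."""
    latest: Dict[str, Row] = {}
    for row in rows:
        current = latest.get(row["province"])
        if current is None or row["date"] > current["date"]:
            latest[row["province"]] = row
    return latest


# Hypothetical time series rows
rows = [
    {"province": "Province A", "date": date(2020, 12, 29), "cases": 10},
    {"province": "Province A", "date": date(2020, 12, 30), "cases": 12},
    {"province": "Province A", "date": date(2020, 12, 31), "cases": 15},
    {"province": "Province B", "date": date(2020, 12, 30), "cases": 4},
    {"province": "Province B", "date": date(2020, 12, 31), "cases": 6},
]

recent_a = filter_time_series(rows, "Province A", date(2020, 12, 30))
latest = latest_by_province(rows)

print([row["cases"] for row in recent_a])
print({name: row["cases"] for name, row in latest.items()})
```

Doing this work on the server means users always receive the current data in the shape they asked for, without downloading and filtering the full CSV files themselves.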

Until recently, the JSON API has existed as a functional but not actively developed part of our group’s offerings. In 2021, this will no longer be the case: the API will be a core product that will be used to power future developments of our products. This begins with bringing the API to parity with our primary, CSV-based datasets on GitHub.

I recently submitted several bug fixes for the API, as well as making our recently added vaccination data available. Further development will probably require a re-write of the code, which you can follow on GitHub. Feel free to make suggestions or code contributions.

Covid19CanadaData

We want our datasets to be as easy to access and manipulate as possible. Covid19CanadaData is a soon-to-be-released R package designed to facilitate access to both the Working Group’s daily dataset and my Archive of COVID-19 Data from Canadian Government Sources (discussed in more detail in the final section of this post). Access to the daily dataset will be powered by the API to allow for easy pre-processing.

You can follow the development of this package on GitHub.

Covid19CanadaTrends

Covid19CanadaTrends grew out of a series of scripts I have used to make tweetable summaries of trends in cases and mortality at the province and health region levels.

(3/42) % change in 7-day rolling average of cases compared to one week ago:
MB: +95.6% 📈 (22.7/day ➡️ 44.4/day)
QC: +57.8% 📈 (409.4/day ➡️ 646.1/day)
ON: +37.2% 📈 (357.6/day ➡️ 490.7/day)
— Jean-Paul R. Soucy (@JPSoucy) September 29, 2020

(8/42) In Ontario, the GTA and #Ottawa are seeing rapid growth.

Toronto: +57.4% 📈 (119.9/day ➡️ 188.7/day)
Ottawa: +37.8% 📈 (49.1/day ➡️ 67.7/day)
York: +35.9% 📈 (31.4/day ➡️ 42.7/day)
Peel: +14.3% 📈 (79.7/day ➡️ 91.1/day)
— Jean-Paul R. Soucy (@JPSoucy) September 29, 2020

Such figures can be easily abused or misinterpreted, a subject I have written about at length before. We must always consider context (e.g., what’s going on with testing, contact tracing, etc.) when interpreting these numbers.

(3/20) A confluence of recent events—the testing backlog, the transition to appointment-based testing, the suspension of contact tracing—have made daily case numbers in the province less reliable, especially in hard-hit areas like Toronto.https://t.co/MvBZKLyWDQ
— Jean-Paul R. Soucy (@JPSoucy) October 10, 2020

Nonetheless, the purpose of this package is to provide a convenient way to summarize and visualize recent trends in the COVID-19 epidemic in Canada.
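The percent-change figures in the tweets above compare the current 7-day rolling average with the one from a week earlier. A minimal sketch of that calculation follows, written in Python for illustration rather than taken from the R package; the function names and sample counts are hypothetical, and the emoji are left out of the formatted line.

```python
from typing import List, Tuple


def rolling_mean_7(daily: List[float], end: int) -> float:
    """Mean of the 7 daily values ending at index `end` (inclusive)."""
    if end < 6 or end >= len(daily):
        raise ValueError("need 7 values ending at a valid index")
    return sum(daily[end - 6 : end + 1]) / 7


def weekly_change(daily: List[float]) -> Tuple[float, float, float]:
    """Return (average one week ago, current average, percent change)."""
    if len(daily) < 14:
        raise ValueError("need at least 14 days of data")
    last = len(daily) - 1
    current = rolling_mean_7(daily, last)
    previous = rolling_mean_7(daily, last - 7)
    if previous == 0:
        raise ValueError("percent change is undefined when the earlier average is 0")
    return previous, current, (current - previous) / previous * 100


def format_trend(name: str, daily: List[float]) -> str:
    """Format one line in the style of the tweet summaries."""
    previous, current, pct = weekly_change(daily)
    return f"{name}: {pct:+.1f}% ({previous:.1f}/day -> {current:.1f}/day)"


# Hypothetical daily case counts: 10 per day for a week, then 15 per day
daily_cases = [10] * 7 + [15] * 7
line = format_trend("Region A", daily_cases)
print(line)
```

As a check against the figures quoted above, going from 22.7/day to 44.4/day gives (44.4 - 22.7) / 22.7, or about +95.6%, which matches the Manitoba line.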

You can follow the development of this package on GitHub.

Covid19CanadaDashboard

One of the most-requested features of our popular R Shiny dashboard is a way to visualize health region-level trends. This will be done by incorporating the features of the Covid19CanadaTrends package.

For some time now, I’ve been promising to publicly release the source code of the dashboard. This began with a ground-up rewrite of the dashboard code to reduce redundant code and to make the platform easier to maintain and extend. After all, this dashboard started as a way for me to learn R Shiny, so the code wasn’t exactly pretty. This rewrite has been basically complete for some time now, so it’s just a matter of tying up loose ends before the code is released publicly and further development happens in the open.

You can follow the public release and development of the R Shiny dashboard on GitHub.

A Canadian COVID-19 data archive for the future

Since late August of 2020, I have been quietly amassing what is almost certainly the largest publicly available collection of Canadian COVID-19 datasets. The Archive of COVID-19 Data from Canadian Government Sources is a collection of daily snapshots of COVID-19 data from various Canadian government sources (and select non-governmental sources). The basis of the project is a Python script that automatically archives nearly 140 datasets (and counting) every day.
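The core idea of a daily snapshot archive can be sketched in a few lines. This is not the archive's actual script: the file naming scheme, the directory layout and the function names below are assumptions, and the download step is left out, so the content of each dataset is passed in as bytes.

```python
import hashlib
import tempfile
from datetime import date
from pathlib import Path


def archive_snapshot(content: bytes, dataset_id: str, snapshot_date: date, root: Path) -> Path:
    """Write one day's copy of a dataset to <root>/<dataset_id>/<dataset_id>_<YYYY-MM-DD>.csv."""
    folder = root / dataset_id
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{dataset_id}_{snapshot_date.isoformat()}.csv"
    path.write_bytes(content)
    return path


def file_digest(path: Path) -> str:
    """SHA-256 of a stored snapshot, useful for spotting days when nothing changed."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


# Hypothetical dataset content for two consecutive days
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = archive_snapshot(
        b"date,cases\n2020-12-30,10\n", "example-cases", date(2020, 12, 30), root
    )
    second = archive_snapshot(
        b"date,cases\n2020-12-30,12\n2020-12-31,8\n", "example-cases", date(2020, 12, 31), root
    )
    snapshot_names = sorted(p.name for p in (root / "example-cases").iterdir())
    changed = file_digest(first) != file_digest(second)

print(snapshot_names, changed)
```

Keeping every day's file, rather than only the latest one, is what makes it possible to see later how a source's numbers were revised over time.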

I had toyed with the idea behind this archive for some time, but the final straw that actually convinced me to start it was this story out of Iowa regarding flawed COVID-19 data collection that led to thousands of cases being backdated, often by several months, distorting percent positivity statistics. The flaw was discovered by those tracking retroactive changes to the daily datasets provided by the Iowa Department of Public Health.

Earlier this month, a huge flaw was discovered in #Iowa’s COVID-19 data which caused new cases to get backdated. This resulted in the positivity rate for testing in Iowa being underestimated in the present. (3/12) https://t.co/LQl1XVBf8K
— Jean-Paul R. Soucy (@JPSoucy) August 26, 2020
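Tracking retroactive changes of this kind comes down to comparing two snapshots of the same dataset and flagging past dates whose values differ. The sketch below shows one way to do that; it is not a description of how the Iowa problem was actually found, and the CSV layout (`date` and `cases` columns) and the numbers are made up for illustration.

```python
import csv
import io
from typing import Dict, Optional, Tuple


def parse_snapshot(text: str) -> Dict[str, int]:
    """Read a CSV snapshot with `date` and `cases` columns into {date: cases}."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["date"]: int(row["cases"]) for row in reader}


def retroactive_changes(
    older: Dict[str, int], newer: Dict[str, int]
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Return {date: (old value, new value)} for dates already present in the older snapshot."""
    changes = {}
    for day, old_value in older.items():
        new_value = newer.get(day)  # None if the date disappeared from the newer snapshot
        if new_value != old_value:
            changes[day] = (old_value, new_value)
    return changes


# Hypothetical snapshots of the same dataset taken on consecutive days
snapshot_day_1 = "date,cases\n2020-08-01,10\n2020-08-02,12\n"
snapshot_day_2 = "date,cases\n2020-08-01,14\n2020-08-02,12\n2020-08-03,9\n"

changes = retroactive_changes(parse_snapshot(snapshot_day_1), parse_snapshot(snapshot_day_2))
print(changes)
```

Newly added dates are ignored on purpose: only values that were already published and later changed count as retroactive revisions. This comparison is only possible if the earlier snapshot was saved.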

I believe this project will be my most enduring contribution to understanding the COVID-19 epidemic in Canada. My goal is to make it the go-to source for re-constructing the historical record when someone asks “What happened during the COVID-19 epidemic in Canada?”

The next steps for this project are to:

Continue to solicit contributions of new datasets and improve metadata for existing datasets.
Make the data accessible and searchable via the Covid19CanadaData package.
Create a website for publishing and exploring the data via Datasette.

You can find the project landing page and data catalogue on GitHub. Please contact me or open an issue or pull request on GitHub if you have suggestions for datasets to add or if you have data to contribute to the archival effort.

See you in 2021

I’d like to wish everyone a happy New Year. We at the COVID-19 Canada Open Data Working Group look forward to keeping everyone abreast of the latest trends in the COVID-19 epidemic in the year ahead. 2021 will start to look a lot rosier come spring, I promise.

You can discuss this post on Twitter.

covid-19 open-data

Jean-Paul R. Soucy

PhD candidate in Epidemiology at the University of Toronto

My research interests include infectious disease epidemiology, health policy, and open data.