Store COVID-19 data in a format that's machine-readable and can be shared between wikis
Open, High, Public

Description

tl;dr

Wikidata is a decent place for storing ontology-type data (i.e. key facts about somewhat notable entities), but we don't have a good place for storing very specific and detailed data (such as time series).

Problem statement

The COVID-19-related English Wikipedia articles contain a variety of very detailed data (mostly, but not exclusively, time series):

  • daily number of total cases / active cases / recovered / dead (globally or per country, sometimes below the country level too)
  • daily number of tests (globally or per country)
  • start/end date of various restrictions like lockdowns
  • number of students affected by school closures (per country)

These also tend to come with very detailed sourcing (i.e. different data points come from different sources, sometimes with contradictory sources or other commentary).

Currently these are handled with free-form, hand-maintained templates, which contain wikitables to be inserted into the relevant articles directly (e.g. current case count by country, daily case counts by country, US states cases, UK region cases, quarantine times, number of students affected by school closures), with further hand-maintained templates to access the same data in a different format (current stats). There are also some machine-updated templates that mirror data from some official source, to be used for visualization (case count maps).

This has some benefits:

  • Everything is on English Wikipedia so lines of authority are clear, and the quality is high since sourcing, data quality and dealing with vandalism and disinformation are a core competency of enwiki.
  • It is reasonably easy, starting from an article, to find how to edit the data.
  • Editing is somewhat sane, with VisualEditor being available for making changes. ("Somewhat" because the tables are large and VE becomes sluggish; but it's still better than trying to find your way in hundreds of lines of raw text, and it's an interface editors are already familiar with.)

But it has lots of disadvantages too:

  • Diffs are not great. Visual diffs are broken completely (that's T211897: Visual Diffs: Improve table diffing, presumably - the calculation times out so VE just shows the table with no changes, even for structurally simple changes which just change a cell value) and text diffs in a huge table are just not terribly helpful (example).
  • The data is not machine-readable (the tables can probably be scraped with some effort, but even that's terribly fragile).
  • The data is not available on other Wikipedias, so they can't easily benefit from all the hard work of enwiki editors, and on many of them the data is significantly outdated.
  • The data cannot be handled by wikitext logic (such as Lua modules), leading to maintenance problems like the difficulty of keeping row/column totals in sync (T247875: Assist with maintaining aggregate values in numerical tables).
  • Turning the data into graphs or charts is an entirely manual effort (see e.g. T249127: Create regularly updated maps of COVID-19 outbreak). That's a significant burden for enwiki editors, has a large opportunity cost since most potential illustrations just never happen due to lack of capacity for automating them or manually creating them, and it's also a further pain point for cross-wiki reuse since the graphs that do get made are usually not translatable.
  • The data is not available outside Wikipedia, e.g. to people who want to build dashboards.

It's worth considering how we can improve this situation, both in the short term for COVID-19-related efforts, and in the long term more generally.

Acceptance criteria

Have at least one significant COVID-19-related data table which is accessible in a machine-readable way, can be accessed on any Wikimedia wiki via some functionality integrated into wikitext (such as Lua), receives regular updates and does not cause distress to the editor community.

Event Timeline

Some potential solutions:

  • Use Wikidata. There has been some experimentation with this already (e.g. Q81068910 has case count time series data). Good: fully machine-readable; good wikitext integration via Scribunto; reusable cross-wiki and has mature and well-tested propagation mechanisms; accessible via a rich set of APIs and has a mature tooling ecosystem to support that; somewhat in scope since it's a project about data. Bad: stresses the software in ways it's not prepared for (thousands of statements on the same item for time series data), unintuitive format (e.g. dates as qualifiers), introduces cross-wiki usability / ownership / administration issues and consequently social issues, Wikidata tends to have a mixed track record of ensuring data is of high quality (like strict sourcing or dealing with vandalism), editing and diffs for Wikidata are not user-friendly; licence incompatibility with Wikipedia could be an issue.
  • Use tabular data on Commons. Good: mostly machine-readable (the meaning of rows/columns is not self-describing like on Wikidata, but that's not much of an issue in practice) with an intuitive format (an especially good match for time series); good wikitext integration via Scribunto; reusable cross-wiki; accessible via the API. Bad: cross-wiki change propagation model is naive; does not support granular sourcing; cross-wiki issues like above, possibly made worse by dealing with data not being a core competency of Commons; editing and diffs are not user-friendly (raw text).
  • Use actual files on Commons (e.g. enable CSV uploads): mentioned for comprehensiveness, but has no advantages over tabular data and a number of disadvantages (editing/diff support is worse, no wikitext integration...)
  • Enable tabular data on English Wikipedia. This would solve the cross-wiki issues (and maybe change propagation? probably not instantly, but it would make it easy to fix). Other disadvantages remain, and it would turn enwiki into a central repository for something, which wasn't done before (might cause social issues).
  • Use some wikitext-based reasonably machine-readable format on English Wikipedia (e.g. Lua tables). Use automated copying to keep other wikis in sync (e.g. T122086: RFC: Sharing templates and modules between wikis - poor man's version (investigation)). Good: easy wikitext reuse, no cross-wiki ownership issues (not too hard to come up with a system of local overrides) and data is maintained by the community that's best at it; full information content could probably be incorporated; sane options for wikitext and graph integration. Bad: poor viewing / editing / diff interface.
  • Keep the current approach of using wiki tables, and parse out the data either internally (using a Lua module) or externally (using a bot) and expose it in a machine-readable and cross-wiki-reusable (or cross-wiki copied) format. (A similar experiment is Covid19DataBot copying data to Commons). Good: like the previous one + least bad option in terms of editing / patrolling user experience (short of building something new). Bad: Parsing wiki tables is fragile (although it seems possible).
  • Use something custom-built, like a database on Toolforge. Could have decent UX and APIs, and can integrate with on-wiki user identities, but would result in storing primary data in a much less robust location, and a ton of work (including reimplementing some of the core competencies of MediaWiki like versioning / change management).

Adding some people who have shown interest in related discussions: @eprodromou, @Amire80, @Daniel_Mietchen, @kaldari.

I agree Wikidata is probably not the best place to store such things. It's a bit too complex and tight in terms of its structures.

A simpler tabular database could be better, but we don't really have such a thing that can be easily configured, used, and shared by the communities. I've never thought of it somehow, and it's good that COVID-19 made us think of it.

Another solution, the technology for which already partly exists, is to use JSON .tab files and insert them into articles using modules. The data is already shareable across wikis (a very neat, but rarely-used feature!), although unfortunately the modules' code is not (T41610, T243931).

In fact, such a table already exists: https://commons.wikimedia.org/wiki/Data:2019–20_coronavirus_outbreak.tab .

An obvious disadvantage is that it's too loose and doesn't enforce any "discipline"—it doesn't have a tight schema, doesn't require references, and can be easily ruined by a mistaken edit, even by a well-meaning editor. But then, using tables on wikis has the same problems, too, and these are fixed by nothing but editors' carefulness and collaboration. But at least it can be shared across wikis, which is already an improvement over managing tables in each wiki separately.
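
To make the "loose schema" point a bit more concrete, here is a minimal Python sketch of how a consumer might read a tabular data page and check it before use. The top-level keys (`license`, `schema.fields`, `data`) follow the JsonConfig tabular data format as I understand it; the field names, the numbers and the helper `rows_as_dicts` are invented for illustration and are not taken from the Commons table linked above.

```python
import json

# Hypothetical sample shaped like a Commons Data:*.tab page; all values are made up.
SAMPLE_TAB = json.loads("""
{
  "license": "CC0-1.0",
  "description": {"en": "Illustrative case counts (made-up numbers)"},
  "schema": {
    "fields": [
      {"name": "date", "type": "string", "title": {"en": "Date"}},
      {"name": "cases", "type": "number", "title": {"en": "Total cases"}}
    ]
  },
  "data": [
    ["2020-04-01", 100],
    ["2020-04-02", 130]
  ]
}
""")

# Mapping from schema type names to Python types (assumption: only these are checked).
PY_TYPES = {"string": str, "number": (int, float), "boolean": bool}


def rows_as_dicts(tab):
    """Turn positional rows into dicts keyed by field name, checking cell types."""
    fields = tab["schema"]["fields"]
    rows = []
    for i, row in enumerate(tab["data"]):
        if len(row) != len(fields):
            raise ValueError(f"row {i}: expected {len(fields)} cells, got {len(row)}")
        record = {}
        for field, cell in zip(fields, row):
            expected = PY_TYPES.get(field["type"])
            # None is treated as an empty cell; unknown type names are not checked.
            if cell is not None and expected and not isinstance(cell, expected):
                raise ValueError(f"row {i}: {field['name']} should be {field['type']}")
            record[field["name"]] = cell
        rows.append(record)
    return rows
```

Note what the sketch cannot check: nothing in the row says where a number came from, which is the missing-references problem mentioned above.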

On a personal, emotional level, I've got to say that this is really an area where we could show the world how with just a bit of effort we can collaborate across countries, cultures, and languages better than the world's governments, which are criticized for imperfect international cooperation.

My personal take on this so far:

  • Community competency and editor convenience beats technical convenience. So anything that would move data editing and patrolling out of English Wikipedia won't work well, and the data should be stored in wiki tables (unless we can find some other on-wiki format that is just as easy to read, edit, annotate with sources and diff).
  • A Scribunto-based parser would read table content and provide it as a Lua data table. Parsing wiki tables is not super hard as long as they keep to a minimum set of functionality (no HTML attributes outside of the header, for example). References and templates can contain pipe characters; templates can be handled by expanding the wikitext first (or just not using them), references are easy to replace with a strip token (much like the wikitext parser does it). Not sure how nicely this would play with Lua resource limits, but simple O(1) regexes seem sane even if the table is huge. Column names / types would probably have to be provided in some machine-readable way, such as a JSON blob in a comment. Those basically never change so they don't need to be editor-friendly.
    • The parser could be used to display a machine-readable version of the table (CSV, or something similar to the JsonConfig tabular data format but with more flexible sourcing), which can be accessed via some low-level API (e.g. the parse API with the expandtemplates option) by bots and external tools.
    • One machine-readable format would be a Lua table, and that could be synced by bots via T122086: RFC: Sharing templates and modules between wikis - poor man's version (investigation) or some similar mechanism to all other wikis which want to use the data. The Lua modules would have to be likewise synced.
    • The same parser could also be reused for providing basic functions like sum of a column / cell range, thus solving T247875: Assist with maintaining aggregate values in numerical tables.
    • The same parser could probably also be reused for graphs / maps, by simply inlining the data into the graph/map specification.
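
A rough sketch of the parsing idea, written in Python only so that it can be run outside a wiki (a real Scribunto module would have to be rewritten with Lua patterns). It assumes a deliberately restricted table: one row per line, cells separated by `||`, no cell attributes, and column names/types given as a JSON blob in an HTML comment. The `columns:` comment convention, the `_sources` key suffix and the function names are all made up for this example.

```python
import json
import re

# <ref>...</ref> and self-closing <ref ... />; \b keeps <references/> from matching.
REF_RE = re.compile(r"<ref\b[^>/]*>.*?</ref>|<ref\b[^>]*/>", re.DOTALL)
# Hypothetical convention: column metadata as JSON inside a comment.
SCHEMA_RE = re.compile(r"<!--\s*columns:\s*(\[.*?\])\s*-->", re.DOTALL)
TOKEN_RE = re.compile("\x7fREF(\\d+)\x7f")


def parse_wikitable(wikitext):
    """Parse a restricted wikitable into a list of dicts, one per row."""
    match = SCHEMA_RE.search(wikitext)
    if match is None:
        raise ValueError("no column metadata comment found")
    columns = json.loads(match.group(1))

    # Replace references with strip tokens so pipes inside them cannot split cells.
    refs = []

    def strip_ref(m):
        refs.append(m.group(0))
        return f"\x7fREF{len(refs) - 1}\x7f"

    text = REF_RE.sub(strip_ref, wikitext)

    rows, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("|}"):
            break
        if line.startswith("|-"):
            current = []
            rows.append(current)
        elif line.startswith("|") and current is not None:
            current.extend(cell.strip() for cell in line[1:].split("||"))

    records = []
    for cells in (r for r in rows if r):
        if len(cells) != len(columns):
            raise ValueError(f"expected {len(columns)} cells, got {len(cells)}")
        record = {}
        for col, cell in zip(columns, cells):
            sources = [refs[int(n)] for n in TOKEN_RE.findall(cell)]
            value = TOKEN_RE.sub("", cell).strip()
            if col["type"] == "number":
                value = int(value.replace(",", ""))
            record[col["name"]] = value
            if sources:
                record[col["name"] + "_sources"] = sources
        records.append(record)
    return records


def column_sum(records, name):
    """Total of a numeric column, e.g. to check a hand-maintained totals row."""
    return sum(r[name] for r in records)


SAMPLE = """\
<!-- columns: [{"name": "country", "type": "string"}, {"name": "cases", "type": "number"}, {"name": "deaths", "type": "number"}] -->
{| class="wikitable"
! Country !! Cases !! Deaths
|-
| Italy || 1,000<ref>{{cite web
|url=https://example.org |title=Bulletin}}</ref> || 10
|-
| Spain || 800 || 5
|}
"""
```

With the made-up sample, `column_sum(parse_wikitable(SAMPLE), "cases")` adds up the two rows, and the Italian case count keeps its reference under `cases_sources`. Templates outside references are not handled; as noted above, those would need expanding first.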

In Russian Wikipedia, we are now trying to switch to an inter-project solution instead of storing data in local templates. And tabular data on Wikimedia Commons is probably the most convenient way to do this.

So anything that would move data editing and patrolling out of English Wikipedia won't work well

I wouldn't give up on this so easily, without even trying :)

Did anyone in the English Wikipedia say that they are totally against moving to something shared in principle?

If millions of images can be shared on commons and used in articles, why not tabular data, at least in some cases?

Abit triaged this task as High priority. Apr 13 2020, 8:29 PM

Did anyone in the English Wikipedia say that they are totally against moving to something shared in principle?

In principle not, in practice yes. There have been plenty of discussions around Wikidata; tabular data would raise the same issues (obscure to editors how the data is retrieved, page protection does not work, governance mechanisms break down if part of the content is on another wiki, different wikis will have conflicting policies and different expectations on how much effort should be put in e.g. verifiability) and then some (no page invalidation, changes are not downstreamed into enwiki watchlists). Plus users would have to relearn a number of things; plus even just the conversation about whether these are viable ideas would take up a lot of time and effort. I was mainly thinking in terms of what we can do right now, and right now trying to move the data is not a realistic option IMO.

In the long term, I think

  • we need something like tabular data (ie. a structured way to store time series and other large datasets)
  • it should support granular (per row, per cell) sources and notes
  • there should be some kind of Wikidata-based way to define what the tables and columns mean (also, i18n)
  • it should have a decent visual editor
  • it should probably have a decent source editor (instead of JSON, something like CSV export / import maybe?)
  • it should have custom diff logic that's better suited to tables
  • the main data hub should be a wiki which specializes in data curation and data quality (maybe enwiki, maybe a more mature Wikidata; maybe something new, although that turns out to be a bad idea most of the time), not Commons which specializes in copyright issues and media editing.

In the even longer term, there should be a proper way to use such data cross-wiki (but that requires us to deal with cross-wiki change propagation, which is a huge task). For now, hacking up a bot to copy edits from a central wiki to the others should work well enough.

Currently these are handled with free-form, hand-maintained templates, which contain wikitables to be inserted into the relevant articles directly (e.g. current case count by country, daily case counts by country, US states cases, UK region cases, quarantine times, number of students affected by school closures), with further hand-maintained templates to access the same data in a different format (current stats).

The larger ones of those are around 200-300 KB. The current Scribunto limits are 50 MB memory, 10 sec CPU time; the post‐expand include size limit (the only relevant parser limit for fetching and processing the raw wikitext of an article, I think) is 20 MB. So as long as an article doesn't invoke dozens of different data tables, this seems OK performance-wise.
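
As a back-of-the-envelope check of the "dozens of tables" estimate, here is the arithmetic in Python. The sizes and limits are simply the figures quoted in this comment, not re-checked against the live configuration, and the in-memory overhead factor of 5 for parsed Lua tables is a pure guess.

```python
KB = 1024
MB = 1024 * KB

TABLE_SIZE = 300 * KB         # upper end of the table sizes quoted above
POST_EXPAND_LIMIT = 20 * MB   # limit as quoted above, not verified here
LUA_MEMORY_LIMIT = 50 * MB    # Scribunto memory limit as quoted above


def max_tables(limit_bytes, table_bytes=TABLE_SIZE, overhead_factor=1):
    """How many tables of the given size fit under a limit.

    overhead_factor is a hypothetical multiplier for how much larger a parsed
    table is in memory than its raw wikitext.
    """
    return limit_bytes // (table_bytes * overhead_factor)


print(max_tables(POST_EXPAND_LIMIT))                   # 68
print(max_tables(LUA_MEMORY_LIMIT, overhead_factor=5))  # 34
```

Either way, the ceiling is in the tens of tables per page under these assumptions; CPU time is not modelled at all.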

I've been scraping historical case data from English Wikipedia into the JSONConfig table format and storing them on archive.org by ISO 3166 code:

https://archive.org/download/wikipedia-covid19-cases

The scraped data now covers US states and Canadian provinces, and outputs in JSON table format as well as CSV.

https://archive.org/download/wikipedia-covid19-cases

You can grab a country's data by using the ISO-3166 two-letter country code and adding ".tab.json" or ".csv" depending on the format you want.

https://archive.org/download/wikipedia-covid19-cases/CA.csv
https://archive.org/download/wikipedia-covid19-cases/CA.tab.json

You can also get a sub-national unit's data by using the ISO-3166-2 code. For US states, for instance, this is just "US-" plus the state's 2-letter postal code.

https://archive.org/download/wikipedia-covid19-cases/US-NC.csv
https://archive.org/download/wikipedia-covid19-cases/US-NC.tab.json
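
For scripted access, the naming scheme above can be wrapped in a small helper. This is only a sketch: `dataset_url` and the `fmt` names are made up here, the code pattern is a loose approximation of ISO 3166-1 alpha-2 and ISO 3166-2 codes, and it does not check that a file for a given code actually exists in the archive.

```python
import re

BASE_URL = "https://archive.org/download/wikipedia-covid19-cases"

# Two-letter country code, optionally followed by "-" and a subdivision part
# of up to three letters or digits (e.g. "CA", "US-NC").
CODE_RE = re.compile(r"^[A-Z]{2}(-[A-Z0-9]{1,3})?$")

# File suffix for each supported output format.
SUFFIXES = {"csv": ".csv", "json": ".tab.json"}


def dataset_url(code, fmt="csv"):
    """Build the download URL for a country or sub-national unit."""
    code = code.upper()
    if not CODE_RE.match(code):
        raise ValueError(f"not an ISO 3166 style code: {code!r}")
    if fmt not in SUFFIXES:
        raise ValueError(f"unknown format: {fmt!r}")
    return f"{BASE_URL}/{code}{SUFFIXES[fmt]}"


print(dataset_url("CA"))             # .../wikipedia-covid19-cases/CA.csv
print(dataset_url("us-nc", "json"))  # .../wikipedia-covid19-cases/US-NC.tab.json
```

The resulting URL can then be passed to any HTTP client, for example `urllib.request.urlopen` from the standard library.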

I'm working on getting a global total, more sub-national data for more countries, and more country data coverage. Patches welcome!

https://github.com/evanp/covid19databot/tree/javascript

One of the hackathon projects was a Lua module to render tabular data as wikitables. Doesn't seem to handle references, though.