Establish a standard set of data domains
Open, Needs Triage, Public

Description

It's common and useful to organize Wikimedia data into high-level categories. Here are some examples.

@TBurmeister's public data introduction uses:

  • Traffic and readership
  • Content
  • Contributions and contributors

The database breakdown for Iceberg tables, which started with a clear plan but has also seen some organic expansion, uses:

  • Contributors
  • Mediawiki
  • Readership
  • Traffic
  • Wikidata
  • Product
  • Dumps (see T347611)
  • Data Ops

The RDS data glossary uses:

  • Content
  • Reader
  • Contributor
  • Diversity

The dataset documentation pages on Wikitech use:

  • Content
  • Edits
  • Events
  • Traffic

There's a lot of similarity, but also a surprising amount of diversity, both in the choice of domains and in the terminology for the same domain (for example, is it "contributions", "editing", or "edits"?). Things would be a lot simpler if we sat down and agreed on a standard set.
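To make the overlap concrete, here is a minimal Python sketch that indexes the four domain sets listed above by their exact labels. The source names used as dictionary keys and the helper `sources_by_term` are illustrative names, not part of any existing tool, and exact string matching is a deliberately naive assumption: it treats "Contributor" and "Contributors" as different terms.

```python
from collections import defaultdict

# Domain labels copied verbatim from the four lists above.
# The dictionary keys are informal names for each source (an assumption).
DOMAIN_SETS = {
    "public data introduction": [
        "Traffic and readership", "Content", "Contributions and contributors",
    ],
    "Iceberg databases": [
        "Contributors", "Mediawiki", "Readership", "Traffic",
        "Wikidata", "Product", "Dumps", "Data Ops",
    ],
    "RDS data glossary": ["Content", "Reader", "Contributor", "Diversity"],
    "Wikitech dataset docs": ["Content", "Edits", "Events", "Traffic"],
}


def sources_by_term(domain_sets):
    """Map each exact domain label to the sorted list of sources using it."""
    index = defaultdict(list)
    for source, terms in domain_sets.items():
        for term in terms:
            index[term].append(source)
    return {term: sorted(sources) for term, sources in index.items()}


index = sources_by_term(DOMAIN_SETS)

# Labels that more than one source uses with identical wording.
shared = {term: srcs for term, srcs in index.items() if len(srcs) > 1}

for term, srcs in sorted(shared.items()):
    print(f"{term}: {', '.join(srcs)}")
```

With exact matching, only "Content" and "Traffic" are shared between sources; every other label is unique to one list, even where the underlying domain is clearly the same.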

Event Timeline

I'm not suggesting we start working on this right away; I mainly wanted to capture the need in a task and to start collecting existing domain sets as food for thought.

This is related to and part of a more comprehensive dataset documentation strategy, which includes metadata for how we describe datasets: T349103.

Beyond what Neil listed in the original task description, these sources also use the concept of data domains, with varying language and semantics:

Wikistats uses:

  • Reading
  • Contributing
  • Content

Data domain categories on Wikitech, which I added based on the Data Lake pages on Wikitech while trying to find a middle ground between those and the public data introduction page:

  • Content data
  • Contribution data
    • Edits data
    • Editors data
  • Traffic data

Directories or static sites where datasets are published (outside of DataHub) often embed domain context in their directory structure or filenames. For example:

https://dumps.wikimedia.org/ embeds some of these domain concepts in its HTML pages. It would be useful to highlight them and perhaps use them to make the content more easily navigable. Much of the data at https://dumps.wikimedia.org/ is assumed to be in the content domain, but the site also provides access to:

  • Analytics datasets (another term that encompasses multiple domains like traffic, readership, etc.)
  • https://dumps.wikimedia.org/other/ , a bunch of miscellaneous datasets that seem to include content, engagement and program data, search data, and more. A list of domains would help a lot to group all these datasets into more usable buckets.