
Define dataset documentation strategy
Open, Stalled, Medium, Public

Description

WMF-published datasets are currently documented in different ways across various platforms.

We should define and implement a holistic strategy for how and where we produce, maintain, and publish dataset documentation.

Definitions of terms used in this task:
Data usage documentation: technical documents for consumers of Wikimedia data. This content helps users understand how to connect their data tasks and research goals to specific datasets.
Dataset documentation: technical content and metadata that describes individual datasets. Dataset documentation informs users about attributes of individual datasets and their relationships to other datasets.
Data consumers: anyone who uses (or could use!) data produced by WMF. Data consumers have differing access to datasets depending on their affiliation.
Data producers: anyone who publishes data for or about wiki projects.

This task is complex because the same data needs different documentation depending on the audience and the stage of the data lifecycle:

Key audiences:

  • Data engineers and SREs (may be both data producers and data consumers)
  • Product analysts and internal WMF Data Platform users (may be both data producers and data consumers). These users have LDAP membership in the wmf or nda group and shell (POSIX) membership in the analytics-privatedata-users group; see https://wikitech.wikimedia.org/wiki/Analytics/Data_access
  • The general public (data consumers who only have access to public data)

Types of dataset documentation, and the guidelines we need to establish as part of a strategy:

Metadata

  • Metadata that is manually applied to datasets, like tags in DataHub.
    • Our dataset documentation guidelines should include a clear and consistent set of practices for which terms are applied to datasets, and how we use the DataHub business glossary. A controlled vocabulary for tags or glossary terms is necessary to facilitate dataset discovery and meaningful associations across datasets.
  • Metadata that is automatically applied to datasets, like lineage, governance, or other metadata in DataHub.
    • We need a clear specification of what is automatically generated and what should be manually added as tags or elsewhere in dataset docs.
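
One way to picture what a controlled vocabulary for tags could look like in practice is a small check that flags tags outside an agreed list. This is only a sketch: the tag names and dataset names below are hypothetical, the vocabulary would really come from whatever guidelines we define, and nothing here calls the DataHub API.

```python
# Hypothetical controlled vocabulary; the real list would come from the
# dataset documentation guidelines once they exist.
CONTROLLED_TAGS = {"pageviews", "webrequest", "public", "private"}


def check_tags(dataset_tags):
    """Return, per dataset, the sorted tags that are not in the controlled vocabulary."""
    problems = {}
    for dataset, tags in dataset_tags.items():
        unknown = sorted(set(tags) - CONTROLLED_TAGS)
        if unknown:
            problems[dataset] = unknown
    return problems


# Hypothetical dataset names and tags, for illustration only.
example = {
    "pageviews_hourly": ["pageviews", "public"],
    "webrequest_raw": ["webrequest", "privat"],  # misspelled tag
}
print(check_tags(example))  # {'webrequest_raw': ['privat']}
```

A check like this could run wherever tags are manually applied, so that misspelled or ad hoc tags are caught before they fragment dataset discovery.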

Query examples

  • These may be dataset-specific, or could span multiple datasets. What is the best way to connect a data consumer to a quick-start set of recipes so they can more quickly see and make use of relevant query patterns? If I'm looking at a dataset, how can I access examples of queries and notebooks that use it? This type of information has the potential to save lots of time and avoid duplicate effort.

Glossary terms and conceptual content

This information is necessary to understand both the elements of individual datasets and the overall data landscape. Because it is both broad and specific, it should be accessible from within the interfaces where people interact with specific datasets, and from the documentation where people learn about our data and how to use it.

  • Glossary terms should be defined solely in DataHub only if they are specific to a single dataset. Terms that are more broadly relevant should be defined elsewhere to avoid duplication.
    • Guidelines needed: where should domain-level glossary terms be defined, and how should term lists be maintained? Related task T193296 covers data usage documentation (broader than dataset documentation).
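
As a rough illustration of the rule above, a term's home could be derived from how many datasets use it. This is a sketch under assumptions: the term and dataset names are hypothetical, and "shared" is a placeholder for the not-yet-decided location of domain-level terms.

```python
def glossary_home(term_usage):
    """Map each term to 'datahub' if exactly one dataset uses it, otherwise 'shared'."""
    return {
        term: "datahub" if len(set(datasets)) == 1 else "shared"
        for term, datasets in term_usage.items()
    }


# Hypothetical terms and the datasets that use them.
usage = {
    "view_count": ["pageviews_hourly"],
    "project": ["pageviews_hourly", "mediawiki_history"],
}
print(glossary_home(usage))  # {'view_count': 'datahub', 'project': 'shared'}
```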

Overall, we need:

  • Clear guidelines for data producers about what should be documented in DataHub vs. on wiki pages or in READMEs.
  • A way to provide cross-references between the platforms where documentation may reside.
  • A process to ensure documentation is updated and/or deprecated as part of the data lifecycle.
  • A way to ensure documentation practices align with data modeling, dataset creation, and other data governance / publishing guidelines.

Event Timeline

TBurmeister changed the task status from Open to In Progress. Oct 17 2023, 2:36 PM
TBurmeister triaged this task as Medium priority.
TBurmeister added a project: Goal.

Status update: I'm in the research and information-gathering phase, building my understanding of this space and meeting with subject matter experts to try to narrow down priority focus areas so that I can scope project work for this and coming quarters.

  • In the past week I had three meetings with people from Data Platform Eng, Product Analytics, and Data Products; next week I have two more meetings scheduled.
  • I read various documents written by data consumers, like this article and this PDF guide written by a Wikimedian in 2012, which, though old, still provides a useful conceptual framework and ideas for how to structure content that introduces data consumers to this topic area.
  • I read many wiki pages and project docs, in an attempt to get up to speed on the current status of APP work and other ongoing projects.
  • I learned about webrequests and how the pageviews public dataset is generated, and I started modeling and auditing the documentation for this dataset and its sources.
  • I learned about the data model behind some of the major tables written by MediaWiki, and started a list of important concepts that the data access docs should cover for those data sources.

Goals for next week:

  • Finalize meetings with project owners / subject area experts
  • Get up to speed on the status of Commons Impact Metrics work and potential areas of documentation impact in that project
  • Identify focus areas for tech docs project work in Q2-Q4 and start scoping specific project tasks and milestones.
  • Learn about our other major public datasets and how they are generated
  • Continue gathering data consumer use cases and examples of analysis tasks to inform future information design work

Met with more stakeholders and product owners / directors; determined that the first project to be prioritized in this area is: https://phabricator.wikimedia.org/T350911

See subtasks for details of currently planned work; more remains to be figured out, but this is the general outline for the next few weeks. The Data Engineering team is working on moving its team-focused docs off Wikitech and onto MediaWiki.org. That will pave the way for revising and restructuring how the data platform(s) are documented on Wikitech, so we can align that technical content more closely with the needs of its various users and focus on the technical topics rather than the team structure.

TBurmeister changed the task status from In Progress to Stalled. Nov 21 2023, 3:32 PM

Marking this as stalled while we work on improving and migrating the Data Platform docs on Wikitech (T350911), which is a prerequisite for working on the dataset documentation (a subset of the overall Data Platform docs).

Related work completed this week: