WMF-published datasets are currently documented in different ways across multiple platforms:
- Wikitech
- DataHub
- Static sites
- READMEs in GitLab/GitHub repositories (especially relevant for dataset pipeline docs, which often end up containing content that data consumers need to know)
- Jupyter notebooks (these often serve as documentation by providing query examples or interactive tutorials)
- mediawiki.org
- Meta-wiki:
- Example: https://meta.wikimedia.org/wiki/Research:Unique_devices
- Blog posts
- probably more!
We should define and implement a holistic strategy for how and where we produce, maintain, and publish dataset documentation.
Definitions of terms used in this task:
Data usage documentation refers to technical documents for consumers of Wikimedia data. This content helps users understand how to connect their data tasks and research goals to specific datasets.
Dataset documentation is technical content and metadata that describes individual datasets. Dataset documentation informs users about attributes of individual datasets and their relationships to other datasets.
Data consumers: anyone who uses (or could use!) data produced by WMF. Data consumers have differing access to datasets depending on their affiliation.
Data producers: anyone who publishes data for or about wiki projects.
This task is complex because the same data needs different documentation depending on the audience and the stage of the data lifecycle:
Key audiences:
- Data engineers and SREs (may be both data producers and data consumers)
- Product analysts and internal WMF Data Platform users, who have membership in the wmf or nda LDAP group and shell (POSIX) membership in the analytics-privatedata-users group (see https://wikitech.wikimedia.org/wiki/Analytics/Data_access); may be both data producers and data consumers
- The general public (data consumers who only have access to public data)
Types of dataset documentation, and the guidelines we need to establish as part of a strategy:
Metadata
- Metadata that is manually applied to datasets, like tags in DataHub.
  - Our dataset documentation guidelines should include a clear and consistent set of practices for which terms are applied to datasets, and how we use the DataHub business glossary. A controlled vocabulary for tags or glossary terms is necessary to facilitate dataset discovery and meaningful associations across datasets.
- Metadata that is automatically applied to datasets, like lineage, governance, or other metadata in DataHub.
  - We need a clear specification of what is automatically generated and what should be manually added as tags or elsewhere in dataset docs.
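As a minimal sketch of what a controlled vocabulary for manually applied tags could look like in practice, the snippet below validates proposed tags against an allow-list before they are applied in DataHub. The vocabulary and tag names here are illustrative assumptions, not an established WMF standard:

```python
# Hypothetical controlled vocabulary for manually applied dataset tags.
# The actual terms would come from the guidelines this task proposes.
CONTROLLED_VOCABULARY = {
    "pii",         # dataset contains personally identifiable information
    "public",      # dataset is published for public consumption
    "deprecated",  # dataset is scheduled for removal
    "event-data",  # dataset originates from the event platform
}

def validate_tags(proposed_tags):
    """Return proposed tags that are not in the controlled vocabulary,
    so they can be flagged for review before being applied."""
    return sorted(set(proposed_tags) - CONTROLLED_VOCABULARY)

# "webrequests" is not in the vocabulary, so it is flagged:
validate_tags(["public", "webrequests", "pii"])  # -> ["webrequests"]
```

A check like this could run in CI for any pipeline that emits tags, keeping manual tagging consistent with the glossary.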
Query examples
- These may be dataset-specific, or could span multiple datasets. What is the best way to connect a data consumer to a quick-start set of recipes so they can quickly find and apply relevant query patterns? If I'm looking at a dataset, how can I access examples of queries and notebooks that use it? This type of information could save significant time and avoid duplicated effort.
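One possible shape for this is a per-dataset index of query recipes that a consumer can look up from whatever interface they are in. The sketch below assumes a simple in-memory mapping; the dataset name (wmf.pageview_hourly) is real, but the recipe structure and titles are illustrative assumptions:

```python
# Hypothetical index mapping dataset names to quick-start query recipes.
QUERY_RECIPES = {
    "wmf.pageview_hourly": [
        {
            "title": "Daily pageviews for one project",
            "sql": (
                "SELECT year, month, day, SUM(view_count) AS views\n"
                "FROM wmf.pageview_hourly\n"
                "WHERE project = 'en.wikipedia'\n"
                "GROUP BY year, month, day"
            ),
        },
    ],
}

def recipes_for(dataset):
    """Return the list of query recipes registered for a dataset."""
    return QUERY_RECIPES.get(dataset, [])
```

In practice such an index might live alongside DataHub metadata or a wiki page, so that each dataset's documentation can link directly to its recipes.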
Glossary terms and conceptual content
- This information is necessary to understand both elements of the data at a dataset level, and the overall data landscape. Because it is both broad and specific, it should be accessible from within the interfaces where people interact with specific datasets, and from the documentation where people learn about our data and how to use it.
- Glossary terms should be defined only in DataHub when they apply to a single dataset. Terms that are more broadly relevant should be defined in a shared location and referenced from DataHub, to avoid duplication.
- Guidelines needed: where should domain-level glossary terms be defined, and how should term lists be maintained? Related task T193296 covers data usage documentation (broader than dataset documentation).
Overall, we need:
- Clear guidelines for data producers about what should be documented in DataHub vs. on wiki pages or in READMEs.
- How to provide cross-references between these platforms where documentation may reside
- How to ensure documentation is updated and/or deprecated as part of the data lifecycle
- How to ensure documentation practices align with data modeling, dataset creation, and other data governance / publishing guidelines
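The cross-referencing and lifecycle requirements above could be enforced mechanically with a lint pass over dataset documentation records. The sketch below assumes a simple record format; the field names (canonical_doc_url, datahub_urn, lifecycle_status) are illustrative assumptions about what the guidelines might require:

```python
# Hypothetical required fields for a dataset documentation record,
# covering cross-referencing and lifecycle status.
REQUIRED_FIELDS = {"canonical_doc_url", "datahub_urn", "lifecycle_status"}

def lint_doc_record(record):
    """Return names of required fields that are missing or empty,
    so gaps can be flagged before documentation is published."""
    return sorted(f for f in REQUIRED_FIELDS if not record.get(f))

record = {
    "datahub_urn": "urn:li:dataset:...",
    "lifecycle_status": "active",
}
lint_doc_record(record)  # -> ["canonical_doc_url"]
```

Running a check like this when datasets are created or deprecated would tie documentation maintenance into the data lifecycle rather than leaving it as a separate manual step.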