This task covers multiple workstreams related to improving data access documentation for WMF-generated data. I will add subtasks as I define the sub-projects and their priorities.
- Data usage documentation refers to technical documents for consumers of Wikimedia data. This content helps users understand how to connect their data tasks and research goals to specific datasets.
- Dataset documentation is technical content and metadata that describes individual datasets. Dataset documentation informs users about attributes of individual datasets and their relationships to other datasets.
- Data consumers: anyone who uses (or could use!) data produced by WMF. Data consumers have differing access to datasets depending on their affiliation.
- Data producers: anyone who publishes data for or about wiki projects. (Documentation work here for the WMF Tech Docs Team primarily involves WMF staff and collaborators, but we may also want to provide guidelines for documenting datasets and analyses that data consumers create using WMF-generated data.)
Key user journeys
Users need data usage documentation and/or dataset documentation at various stages in their journey, depending on their goals, experience level, and other criteria. The primary high-level user journeys I've identified so far for data docs are:
- Explore datasets
- Find datasets for my task
- Decide between datasets
- Work with a specific dataset
- Publish derived datasets and analyses
Project plans and details
More detailed information is in the project doc (a Google Doc for now; the content will move on-wiki once it's more stable).
History of this phab task
The original issue highlighted by this phab task was "make it easier to figure out how to access the various sources of raw data (public and private) about Wikimedia projects and what the policies and procedures around using them are." That is one (big) piece of the puzzle, but the overall picture is larger, so I'm expanding this task to cover that broader scope.
Links referenced in original phab task:
- Main entry point for public data, but very out of date.
- wikitech:Analytics/Data access: Main entry point for internal/private/production cluster data.
- office:Data access guidelines
- Guides written by particular teams:
Proposals and plans referenced in original phab task:
- Continues as the main entry point for public data, with a pointer to the private data entry point.
- meta:Research:Private data: New main entry point for private data. Content moved here from wikitech:Analytics/Data access. Explains what you might use private data for, how you would get access, and why that's hard to do.
- Should the main organizing principle be the topic of the data (e.g. editing patterns or article content) or the access method (e.g. the API or the dumps)?
11/15/23: Replacing the term "data access" with "data usage" because "data access" is too closely associated with permissions, while the scope of "data usage" extends far beyond that.