
Consolidate and improve data usage documentation for WMF-generated data
Open, In Progress, Medium, Public

Description

This task covers multiple workstreams related to improving data access documentation for WMF-generated data. I will add subtasks as I define the sub-projects and their priorities.

Terminology

  • Data usage documentation refers to technical documents for consumers of Wikimedia data. This content helps users understand how to connect their data tasks and research goals to specific datasets.
  • Dataset documentation is technical content and metadata that describes individual datasets. Dataset documentation informs users about attributes of individual datasets and their relationships to other datasets.
  • Data consumers: anyone who uses (or could use!) data produced by WMF. Data consumers have differing access to datasets depending on their affiliation.
  • Data producers: anyone who publishes data for or about wiki projects. (Documentation work here for the WMF Tech Docs Team primarily involves WMF staff and collaborators, but we may also want to provide guidelines for documenting datasets and analyses that data consumers create using WMF-generated data.)

Key user journeys

Users need data usage documentation and/or dataset documentation at various stages in their journey, depending on their goals, experience level, and other criteria. The primary high-level user journeys I've identified so far for data docs are:

  • Explore datasets
  • Find datasets for my task
  • Decide between datasets
  • Work with a specific dataset
  • Publish derived datasets and analyses

Project plans and details

More detailed information is in the project doc (a Google Doc for now; the content will move on-wiki when it's more stable).

History of this phab task

The original issue highlighted by this phab task was "make it easier to figure out how to access the various sources of raw data (public and private) about Wikimedia projects and what the policies and procedures around using them are." That is one (big) piece of the puzzle, but the overall picture is larger, so I'm expanding this task to cover that broader scope.

Links referenced in original phab task:

Proposals and plans referenced in original phab task:

  • meta:Research:Data
    • Continues as the main entry point for public data, with a pointer to the private data entry point.
  • meta:Research:Private data
    • New main entry point for private data. Content moved here from wikitech:Analytics/Data access. Explains what you might use private data for, how you would get access, and why that's hard to do.
  • Should the main organizing principle be the topic of the data (e.g. editing patterns or article content) or the access method (e.g. the API or the dumps)?

11/15/23: Replacing the term "data access" with "data usage" because "data access" is too closely associated with permissions, while the scope of "data usage" extends far beyond that.

Event Timeline

MBinder_WMF lowered the priority of this task from High to Medium.
MBinder_WMF moved this task from Triage to Backlog on the Product-Analytics board.
Vvjjkkii renamed this task from Consolidate data analyst documentation to z1daaaaaaa (Jul 1 2018, 1:13 AM).
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
Mainframe98 renamed this task from z1daaaaaaa to Consolidate data analyst documentation (Jul 1 2018, 9:10 AM).
Mainframe98 lowered the priority of this task from High to Medium.
Mainframe98 updated the task description.
Mainframe98 added a subscriber: Aklapper.
nshahquinn-wmf renamed this task from Consolidate data analyst documentation to Consolidate data analysis documentation (Jul 18 2018, 8:56 AM).
nshahquinn-wmf updated the task description.
nshahquinn-wmf renamed this task from Consolidate data analysis documentation to Consolidate data access documentation (Jul 18 2018, 9:15 AM).
nshahquinn-wmf updated the task description.

I started working on this at the Hackathon and created a draft of an updated public data portal, reorganized around the type of data (e.g. editing metadata) rather than the data source (e.g. EventStreams).

I also connected with @Lea_WMDE and discussed data analysts at Wikimedia Deutschland; there's only one contractor currently, but hopefully there will be more soon, so that would be another opportunity for clear, well-organized documentation to deliver value.

nshahquinn-wmf removed nshahquinn-wmf as the assignee of this task.
nshahquinn-wmf moved this task from Backlog to Doing on the Product-Analytics board.

Just a note that as of October 2022, the Data Engineering team is working on moving content from https://wikitech.wikimedia.org/wiki/Analytics/ subpages to https://wikitech.wikimedia.org/wiki/Data_Engineering

TBurmeister renamed this task from Consolidate data access documentation to Consolidate and improve data access documentation (Jul 31 2023, 4:44 PM).
TBurmeister changed the task status from Open to In Progress.
TBurmeister claimed this task.
TBurmeister renamed this task from Consolidate and improve data access documentation to Consolidate and improve data access documentation for WMF-generated data (Oct 17 2023, 2:19 PM).
TBurmeister removed a project: Privacy Engineering.
TBurmeister updated the task description.

Status update: I'm in the research and information-gathering phase, building my understanding of this space and meeting with subject matter experts to try to narrow down priority focus areas so that I can scope project work for this and coming quarters.

  • In the past week I had 3 meetings with people from Data Platform Eng, Product Analytics and Data Products; next week I have two more meetings scheduled.
  • I read various documents written by data consumers, like this article and this PDF guide written by a Wikimedian in 2012, which, though old, still provides a useful conceptual framework and ideas for how to structure content that introduces data consumers to this topic area.
  • I read many wiki pages and project docs, in an attempt to get up to speed on the current status of APP work and other ongoing projects.
  • I learned about webrequests and how the pageviews public dataset is generated, and I started modeling and auditing the documentation for this dataset and its sources.
  • I learned about the data model behind some of the major tables written by MediaWiki, and started a list of important concepts that the data access docs should cover for those data sources.

Goals for next week:

  • Finalize meetings with project owners / subject area experts
  • Get up to speed on the status of Commons Impact Metrics work and potential areas of documentation impact in that project
  • Identify focus areas for tech docs project work in Q2-Q4 and start scoping specific project tasks and milestones.
  • Learn about our other major public datasets and how they are generated
  • Continue gathering data consumer use cases and examples of analysis tasks to inform future information design work

Work in this area will proceed in collaboration with the Research and Data Platform Engineering teams as we create new content to help researchers navigate our data landscape, while coordinating that work with changes to the underlying data infrastructure and its documentation strategy. Details will be worked out in the coming weeks, but at minimum this will include:

TBurmeister renamed this task from Consolidate and improve data access documentation for WMF-generated data to Consolidate and improve data usage documentation for WMF-generated data (Nov 15 2023, 9:40 PM).
TBurmeister updated the task description.

I've started an on-wiki draft that integrates some of the Research-focused learning goals and data user journeys I identified into an outline. I will continue to build on it as we figure out how to structure the content, i.e. as a set of wiki pages, a revised version of the Research:Data portal, or something else still TBD: https://meta.wikimedia.org/wiki/User:TBurmeister_(WMF)/Sandbox/Research:Data

@KCVelaga also has a draft page for dataset-specific content, but we still need to strategize about what info needs to be presented where before we can get a good sense of the "how to present it": https://meta.wikimedia.org/wiki/User:KCVelaga_(WMF)/Data_sources_sandbox