Create an Introduction to Wikimedia open data
Open, In Progress, Medium, Public


Write a new document, or revise an existing one, to cover the following concepts:

  1. Introduce the major concepts data users should understand when getting started with Wikimedia open data. Define and disambiguate "content" vs "data". Explain the difference between wiki dumps and wiki replicas.
  2. Explain the different types of data we publish:
  • Event data (edits…)
  • Analytics / "Traffic" data
    • Pageviews
    • Unique Devices
    • Clickstream data
    • Revision and user history
    • Data by country
    • Wikidata QRank
  • Other data
  • ORES scores? Liftwing ML outputs?
  3. Map types of content and data that people often care about to the datasets where they live (example-based disambiguation of content vs data, plus decision tree for next steps)
  4. Provide (or link to a page that provides) clear navigation to data sources and access methods. currently does this part pretty well.

Creating this content page will enable further edits and streamlining/simplification (reduce content duplication!) on related pages.

Event Timeline

TBurmeister changed the task status from Open to In Progress. Nov 30 2023, 9:12 PM

@Isaac and @KCVelaga are working on finalizing the content of the data source tables (example) that will serve as important reference content about the datasets. Target completion date: end of 2023. @TBurmeister will investigate options for how we present that tabular data on-wiki, making it maintainable while maximizing its usability.

@TBurmeister will create drafts of new overview pages for each of our major data "families": traffic/readership, content, edits, and editors. This will involve moving content from existing / legacy pages. I will be creating drafts in my sandbox on-wiki, sharing them with Isaac and KC for initial review, then reaching out to WMF and community stakeholders for feedback before finally publishing them. I hope to complete the drafts by end of 2023, but the final versions are unlikely to be done until early 2024. This work may also intersect with in-progress work on Data Platform Engineering docs (T350911).

Part 1 of these docs is focused on understanding the data; that content work is tracked in this task.
Part 2 of these docs is focused on understanding the tools and techniques for working with Wikimedia data and infrastructure; that content work is tracked in T353280.

Status update: Still working on drafts for data overview pages; gathering together various pieces of content about analysis nuances and pitfalls for the different types of datasets. Started and shared a first draft of an attempt to map common research questions and terminology to canonical data sources. Also put time into investigating display/presentation options for semantically complex tabular data on wiki, but no fancy yet easy-to-maintain solution has presented itself so far.

Did some work to add categories to DPE docs on Wikitech to try to make it easier to identify and revise all this related content that exists on multiple wikis, with the end goal of being better able to identify which content should go on Meta and which should go on Wikitech, and create reasonable connections between the two:

@Isaac and @KCVelaga are working on finalizing the content of the data source tables (example) that will serve as important reference content about the datasets.

For archive goodness, this was completed:

I'm not sure if the aforementioned spreadsheet is actually completed yet; that is a task assigned to @Isaac and @KCVelaga, and I'm still working on / workshopping the tab that attempts to map groups of research interests to data families.

That said, the main update for this task this week is that I've started writing the "Intro to Wikimedia Open Data" doc, but it's currently in a very raw form so not shared yet. I've broken the content into sections that I'm working on in separate documents, to try to wrangle all the existing content about the different types of data and analysis, and write new content to contextualize it for readers with no familiarity with the Wikimedia movement. So... currently working on 3 intro docs but hoping to eventually either be able to bring them together into one, or maybe the "Intro to Wikimedia Open Data" page will end up being more of a navigational landing page that provides a clear path into these topic overviews.

@TBurmeister: @KCVelaga_WMF's involvement was limited to the hackathon during the RDS offsite. KC has a full plate supporting Language and Moderator Tools teams with their annual plan projects which are highest priority – this is not something that we have planned for or that he has bandwidth for.

Status update:

  • Made progress on content drafts this week, including starting to finalize a page structure that can be applied consistently across overview docs for the different data domains. Estimated completion status before docs can be shared for initial feedback with SMEs:
    • Intro to Wikimedia Open Data: 60%
    • Overview of traffic/pageviews analysis: 30%
    • Overview of content analysis: 40%
    • Overview of edits/editors analysis: 50%
  • Target date for sharing the above docs for SME feedback is Feb 2. After that initial feedback round and resulting content updates, the plan is that the docs will go on-wiki and I'll coordinate a community feedback process with Kinneret / the Research team.
  • Met with Kinneret and Leila from Research to share about ongoing doc work and discuss plans for potential future work on Research docs on Meta.
  • Did some project management and diagramming work to try to clarify the scope of this work and related work on DPE docs, and separate that from work that needs to happen next fiscal year.
  • Working on identifying how to best align and coordinate this docs work with related efforts like T333895

Updated explanation of this task and how it fits in with other ongoing work:
This overview doc is part of a refresh of Research:Data and related data documentation pages on Meta. Research:Data will continue to function as the primary landing page for researchers to get started with Wikimedia data. However, in my proposed revision, instead of providing a one-page list of many available data access methods, it will provide links to several new pages:

Landing page | Research:Data |
→ Overview | Intro to Wikimedia data for researchers / data scientists | [this task]
→ Overview + getting started | Pages covering key concepts, data sources, and access methods for major data domains: Content, Traffic, Contributing & Contributors | 3 docs in progress, see subtasks
→ Reference | Index of data sources and access methods for all data domains, with quick links to API landing pages, Dumps pages, and available UIs. This will replicate information about the data sources and access methods on each of the data domain landing pages, but that's okay because it serves a different purpose. | draft in spreadsheet

Right now, Research:Data functions partially as a reference doc, and partially as an introduction, but it doesn't provide enough contextual information to help newcomers get started. Together, this new set of pages will improve the reader's experience by providing a more gentle introduction and reducing information overload, while making it possible for a quick reference list of data sources to function more effectively for that use case.

First draft of this doc has been shared with subject matter experts for their review and feedback. After that round of reviews, I'll move to the next stage of publishing on-wiki and planning for larger community feedback cycles.

Draft of this new doc is now published on-wiki!

There are some known loose ends / open TODOs in the page, but feedback and revisions are welcome on any of the other content.

Next steps:

  • Finish the other document drafts for the specific data domains (referenced in the sister tasks of this one)
  • Reconnect with the Research team (especially @KinneretG) about how to invite community feedback and revisions, and how/whether to replace the current Research:Data page with this more comprehensive and up-to-date content.

Status update: community feedback period is ongoing, and I'm working on a reply to address questions / concerns that have already been posted on the Talk page.

Changes I made to the draft page this week include:

Additional feedback received in other channels (copying here as TODO items):

From @Quiddity:

  • Ensure that the final version of the page has an articulated plan for scaling/growth (as content will inevitably get added, so it should clearly slot in somewhere/somehow), but also plan for stability (if/once translated, it's hard to change)
  • Create a maintainer guide (or add content somewhere for page maintainers and future content adders about how to help/collab). Should cover likely questions like "Where exactly should I list my new/missing tool?" For example: From a glance, it appears that every new addition should be added in both (A) one or more sub-sections of "Data domains", and (B) within the "Reference list of data sources" table. A contributor guide would explicitly call out this purposeful duplication that ought to be maintained.