
MediaWiki History Plan: use cases and potential work
Closed, Resolved · Public · 2 Estimated Story Points

Description

User Story: T352787

Description

Detail the current use and potential of the MediaWiki history dataset. Potential covers everything from work to improve the current pipeline to work that would extend it to serve more use cases, or serve current ones better.

Acceptance Criteria

  • A list of current uses is started. It is perfectly OK for this list to include a pointer to a lineage chart that will exist in the future. Duplicating those details in this list is out of scope. Documentation should center on Datahub.
  • A list of potential work, by type, is started. This should only illustrate the kinds of potential; the details should be left to product owners. Also centered on Datahub.

Event Timeline

MediaWiki History is described in detail in the following places:

So the output of this task might be just to link to these and to update/clarify. I could use input from everyone watching this.

A full list of current use cases could only be compiled by reaching out to researchers who download this dataset. Limited to what we know, current use cases are roughly:

  • Wikistats Contributing and Content sections
    • Communications (data for reporters, speeches, etc)
    • Community Relations (context for conversations)
    • Community (context for work)
  • Movement Insights and Product Analytics: monthly reporting and ad-hoc reporting via edit_hourly, Superset, etc.
  • External researchers use dumps for all kinds of things
  • Research team: use this table and the raw wikitext for pretty much everything
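To make the reporting use case above concrete, here is a minimal sketch of the kind of ad-hoc aggregation teams run against this dataset, counting revisions per wiki per month. The field names (event_entity, event_type, event_timestamp, wiki_db) follow the published mediawiki_history schema, but treat them as assumptions for this toy, in-memory version; real queries run in SQL over the full table.

```python
from collections import Counter
from datetime import datetime

def monthly_edit_counts(rows):
    """Count revision-create events per (wiki, month).

    `rows` are dicts shaped like mediawiki_history records; the field
    names are assumed from the published schema for this sketch.
    """
    counts = Counter()
    for row in rows:
        if row["event_entity"] == "revision" and row["event_type"] == "create":
            ts = datetime.fromisoformat(row["event_timestamp"])
            counts[(row["wiki_db"], ts.strftime("%Y-%m"))] += 1
    return dict(counts)

# Tiny illustrative sample, not real data.
sample = [
    {"event_entity": "revision", "event_type": "create",
     "event_timestamp": "2024-01-05T12:00:00", "wiki_db": "enwiki"},
    {"event_entity": "revision", "event_type": "create",
     "event_timestamp": "2024-01-20T08:30:00", "wiki_db": "enwiki"},
    {"event_entity": "page", "event_type": "move",
     "event_timestamp": "2024-01-21T09:00:00", "wiki_db": "enwiki"},
]
print(monthly_edit_counts(sample))  # {('enwiki', '2024-01'): 2}
```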

Potential use cases include:

  • Incremental updates could speed up the metrics that we deliver to C-levels and the Board from over 45 days to a few days (the dumps 2.0 pipeline is proving the technology necessary for this)
  • A queryable version of this loaded in a public cluster could move all intensive QueryPage(s) off of MediaWiki's databases and provide the community with a way to explore more insights in minutes instead of months.
  • An improved algorithm could allow us to add columns routinely to serve changing needs
  • MediaWiki history normalizes data from MediaWiki databases. We could write this back to the MW databases and help to remove the legacy PHP code that has to deal with those old entries. This could lead to significant code reduction and simplification of MW. This means simpler PHP code for MW maintainers. But it also means simpler workflows for communities of content creators and curators.
  • Better data science around this dataset could be a crown jewel of our research community. We can mine the way community behavior shifts year after year, find signal for impact of policy or technical changes, and so much more. We might be able to answer many historical questions from this dataset alone. This would require no change, just a closer look and asking the right questions.
  • And more...
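The incremental-update idea in the first bullet can be sketched as an upsert: instead of recomputing the whole history each month, merge only the newly arrived events into the existing snapshot. This is a toy model under assumed field names (revision_id, sha1), not the real pipeline's schema or merge logic.

```python
def apply_increment(snapshot, new_events, key="revision_id"):
    """Merge newly arrived events into an existing snapshot keyed by id.

    New rows are inserted; rows whose key already exists are overwritten,
    so a corrected event simply replaces the stale one. Field names here
    are illustrative assumptions.
    """
    merged = {row[key]: row for row in snapshot}
    for event in new_events:
        merged[event[key]] = event  # insert or overwrite
    return sorted(merged.values(), key=lambda r: r[key])

snapshot = [{"revision_id": 1, "sha1": "aaa"},
            {"revision_id": 2, "sha1": "bbb"}]
new_events = [{"revision_id": 2, "sha1": "bbb2"},  # correction to row 2
              {"revision_id": 3, "sha1": "ccc"}]   # brand-new row
print(apply_increment(snapshot, new_events))
```

The point of the sketch is only that the work is proportional to the size of the increment, not to the size of the full history, which is what would shrink the 45-day turnaround.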

@WDoranWMF seems like the next step for validating the docs is to have an engineer dogfood them while onboarding / executing T352790: MediaWiki History Plan: use cases and potential work. Does that match your understanding?

@Milimetric thank you for gathering these use cases here. A few follow-up questions:

  1. What would a queryable version of MW History loaded in a public cluster look like? Especially with the public user in mind.
  2. What work would be required to improve the algorithm, and who would need to be involved?
  3. Have you been in conversation with the MediaWiki Group about the prospect of writing normalized MediaWiki history data back to the MW databases? If yes, what was their feedback or response?
  4. As you think about "better data science" around this dataset, what do you envision?
Milimetric set the point value for this task to 2. · Jan 9 2024, 6:13 PM