Store page-links-change data in a database table and make available through a Special page
Open, Needs Triage, Public

Description

While the EventStream is valuable in its current form, and works for the primary identified use cases, it only stores data going back 30 days, making it less useful for community anti-spam efforts and research purposes.

Ideally we'd have a database table in MediaWiki storing this data, which would be searchable through a Special page, in a similar fashion to Special:LinkSearch searching the external links table.

User stories

  • As a researcher, I want to search for the history of certain link patterns so I can understand how the encyclopedia has been built over time with respect to its citations.
  • As a Wikipedia editor, I want to view a list of accounts which have added a specific link, to uncover and block spammers.

Event Timeline

The data in the mediawiki.page-links-change stream comes from exactly the same source as the data in the links and externallinks tables, so this data should already be stored in MW.

"this data should already be stored in MW."

Just to clarify, is data older than 30 days still stored, or would this data storage follow the availability of data on the stream itself?

The tables represent the current state of the page, so a link stays in the table only for as long as it is present on the page. If the link is removed from the page, it is deleted from the table.

Right - so there's no database tracking the individual link additions and removals? That's what I'm interested in with this task - a permanent store of the data coming through the event stream.
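To make the distinction concrete, here is a minimal sketch of a current-state table next to an append-only change log. It uses an in-memory SQLite database purely for illustration; the table names (current_links, link_changes), the columns, and the apply_edit helper are all hypothetical and are not MediaWiki's actual schema or code.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a current-state table and a change log."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE current_links (
            page_id INTEGER, link TEXT, PRIMARY KEY (page_id, link)
        );
        CREATE TABLE link_changes (
            ts TEXT, page_id INTEGER, link TEXT, action TEXT
        );
        """
    )
    return conn


def apply_edit(conn, page_id, new_links, ts):
    """Record an edit: sync the current-state table and append to the log."""
    old = {
        row[0]
        for row in conn.execute(
            "SELECT link FROM current_links WHERE page_id = ?", (page_id,)
        )
    }
    new = set(new_links)
    for link in sorted(new - old):
        conn.execute("INSERT INTO current_links VALUES (?, ?)", (page_id, link))
        conn.execute(
            "INSERT INTO link_changes VALUES (?, ?, ?, 'added')", (ts, page_id, link)
        )
    for link in sorted(old - new):
        # The current-state row disappears, but the log keeps a record.
        conn.execute(
            "DELETE FROM current_links WHERE page_id = ? AND link = ?",
            (page_id, link),
        )
        conn.execute(
            "INSERT INTO link_changes VALUES (?, ?, ?, 'removed')", (ts, page_id, link)
        )
    conn.commit()


conn = make_db()
apply_edit(conn, 1, ["https://example.org/a"], "2021-01-01T00:00:00Z")
apply_edit(conn, 1, [], "2021-02-01T00:00:00Z")
current_count = conn.execute("SELECT COUNT(*) FROM current_links").fetchone()[0]
history = conn.execute(
    "SELECT action FROM link_changes ORDER BY ts"
).fetchall()
```

After the second edit the current-state table is empty, while the log still holds both the addition and the removal. That second table is the kind of permanent store this task asks for.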

I think it's unlikely that this feature will be implemented inside MediaWiki. Storing and serving the full history of link changes in MariaDB probably won't scale.

However, this kind of use case (exporting MW state changes, then transforming and serving them outside of MediaWiki) is what T291120: MediaWiki Event Carried State Transfer - Problem Statement is about (currently going through the tech decision forum). In the long term, the Data Eng team would like to build a platform that supports this kind of thing more easily. See also WIP Shared-Data Platform ideas and this excellent Data Mesh article.

So, we'd love to one day build a platform that allows this, but Data Eng would not own this service specifically.

If you get access to the Analytics Hadoop cluster, you can query this data historically. Or, you can consume the data from EventStreams and maintain your own historical database.
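As a rough sketch of the second option, the snippet below reads server-sent events and appends each link change to a local SQLite table. Several things here are assumptions rather than confirmed details: the stream URL, the event field names (meta.dt, database, page_title, rev_id, performer.user_text, added_links, removed_links, and link/external on each item), the one-JSON-object-per-data-line framing, and the table layout. Check them against the published stream schema before relying on this. It also does no reconnection or resume handling.

```python
import json
import sqlite3
import urllib.request

# Assumed public EventStreams endpoint for this stream.
STREAM_URL = "https://stream.wikimedia.org/v2/stream/mediawiki.page-links-change"


def open_db(path=":memory:"):
    """Open (or create) the local history database. Table layout is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS link_changes (
               dt TEXT, wiki TEXT, page_title TEXT, rev_id INTEGER,
               user_text TEXT, link TEXT, external INTEGER, action TEXT
           )"""
    )
    return conn


def event_rows(event):
    """Flatten one event into rows, one per added or removed link."""
    meta = event.get("meta") or {}
    performer = event.get("performer") or {}
    for action, key in (("added", "added_links"), ("removed", "removed_links")):
        for item in event.get(key) or []:
            yield (
                meta.get("dt"), event.get("database"), event.get("page_title"),
                event.get("rev_id"), performer.get("user_text"),
                item.get("link"), int(bool(item.get("external"))), action,
            )


def store_event(conn, event):
    """Insert all rows for one event and return how many were written."""
    rows = list(event_rows(event))
    conn.executemany("INSERT INTO link_changes VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def sse_events(lines):
    """Yield JSON payloads from SSE lines, assuming one object per data line."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if not payload:
                continue
            try:
                yield json.loads(payload)
            except json.JSONDecodeError:
                continue  # skip partial or malformed payloads


def accounts_adding(conn, pattern):
    """List distinct accounts that added a link matching a SQL LIKE pattern."""
    cur = conn.execute(
        "SELECT DISTINCT user_text FROM link_changes "
        "WHERE action = 'added' AND link LIKE ? ORDER BY user_text",
        (pattern,),
    )
    return [row[0] for row in cur]


def run(path="link_changes.db"):
    """Consume the live stream until interrupted (network access required)."""
    conn = open_db(path)
    req = urllib.request.Request(
        STREAM_URL,
        headers={"Accept": "text/event-stream", "User-Agent": "link-history-sketch"},
    )
    with urllib.request.urlopen(req) as resp:
        for event in sse_events(resp):
            store_event(conn, event)
```

Calling run() starts the consumer. A query such as accounts_adding(conn, "%example.com%") then covers the editor user story above, at least for changes seen since the consumer was started.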