Create a way to filter only WB-related changes from Commons recentchanges
Open, High · Public

Description

Right now, one of the ways we fetch change data, and the only one available outside WMF production servers, is the recentchanges API. However, this API currently has no support for fetching only updates that relate to structured data: all edits to the Commons File: namespace look the same to it, even though most of them aren't related to SDC data and are thus useless for the SDC Query engine to fetch and process. It would be nice to have some mechanism that allows fetching only SDC-related changes. Unfortunately, right now the only thing that distinguishes them is edit comments like /* wbsetlabel-add:1|en */, but these would be expensive to parse and impossible to pre-filter at the API level.
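For illustration, here is a minimal sketch of what the comment-based workaround would look like on the client side. The regular expression is an assumption inferred from the single example above (a `wb`-prefixed action name inside the `/* ... */` auto-summary); the real set of summary formats may well be wider. Note that it can only run after every File: edit has already been fetched, which is exactly the cost this task is trying to avoid.

```python
import re

# Assumed shape of a Wikibase auto-summary, inferred from the one example
# in the description: "/* wbsetlabel-add:1|en */". This is a hypothetical
# pattern, not one taken from the Wikibase source.
WB_SUMMARY = re.compile(r"^/\*\s*(wb[a-z]+)(?:-[a-z-]+)?:")


def wb_action(comment):
    """Return the wb* action name from an edit comment, or None."""
    match = WB_SUMMARY.match(comment)
    return match.group(1) if match else None


def only_wb_changes(changes):
    """Keep recentchanges entries whose comment looks like a WB auto-summary."""
    return [c for c in changes if wb_action(c.get("comment", "")) is not None]
```

Because the match happens after download, this reduces processing work but not the number of changes pulled from the API.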

Maybe if we could tag such changes or select changes by slot, it would solve the issue.

Event Timeline

Restricted Application added a project: Core Platform Team. · Wed, Aug 21, 6:38 AM

This is a special case of being able to filter by MCR slot type, which we'll surely need in the long term.

daniel added a subscriber: daniel.

Maybe if we could tag such changes or select changes by slot, it would solve the issue.

The problem with selecting changes by slot is that changes can affect multiple slots. Not at the moment with SDC, but in principle. I don't see any efficient way to apply this kind of filtering with the way we currently store recentchanges.

Using change tags for all MediaInfo edits might be a solution. Though it may overload the ChangeTags system. Perhaps the recentchanges system should be migrated to Elastic, that would help with a lot of things :)

I'm untagging MCR and CPT for now. Once the Multimedia team figures out what they want to do, they can ping us or the Wikidata team to discuss the technical solution.

Smalyshev added a comment. · Edited · Wed, Aug 21, 3:56 PM

Tags should work, at least for now, I think, if I can filter by tag efficiently. There's not a lot of data edits so far, compared to overall Commons edit volume.
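As a rough sketch of how a tag-based filter could be consumed, the snippet below builds a `list=recentchanges` request restricted to one change tag. The `rctag`, `rcnamespace`, `rcprop`, `rclimit` and `rccontinue` parameters are existing recentchanges API parameters; the tag name `sdc-edit` and the helper `tagged_changes_url` are hypothetical, since no tag has been defined in this task.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://commons.wikimedia.org/w/api.php"

# Hypothetical tag name; no such tag is defined in this task.
SDC_TAG = "sdc-edit"


def tagged_changes_url(tag=SDC_TAG, limit=50, rccontinue=None):
    """Build a list=recentchanges request restricted to one change tag."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcnamespace": 6,  # File: namespace
        "rctag": tag,      # server-side filter by change tag
        "rcprop": "ids|title|timestamp|tags",
        "rclimit": limit,
        "format": "json",
    }
    if rccontinue is not None:
        params["rccontinue"] = rccontinue  # resume from a previous batch
    return API_ENDPOINT + "?" + urlencode(params)
```

Whether this is efficient enough depends on how the tag filter is executed on the server, which is the open question raised above.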

Eventually, I'd like to be able to select all edits that touch a particular slot (even if they also touch other slots), but from what I see in the recentchanges table I am not sure it's possible without changing the table structure. Please correct me if I'm wrong and there's a way.

I also think that while short-term it's an SDC General issue, longer term it is a general Multi-Content-Revisions issue, since tools would want to know which edit changed which slot.

Tgr added a comment. · Wed, Aug 21, 5:15 PM

In theory it would be something like recentchanges JOIN revision ON rc_this_oldid = rev_id JOIN slots ON slot_revision_id = rev_id AND slot_role_id = whatever AND slot_origin = rev_id. Except slot_origin is not used consistently that way; this would miss rollbacks, for example.

Anomie added a subscriber: Anomie. · Tue, Aug 27, 4:25 PM

Perhaps the recentchanges system should be migrated to Elastic, that would help with a lot of things :)

OTOH, that would break it for any MediaWiki wikis not using Elastic...

In theory it would be something like recentchanges JOIN revision ON rc_this_oldid = rev_id JOIN slots ON slot_revision_id = rev_id AND slot_role_id = whatever AND slot_origin = rev_id. Except slot_origin is not used consistently that way; this would miss rollbacks, for example.

We'd also have to watch whether that query runs efficiently enough.

BTW, I think you could simplify that: recentchanges JOIN slots ON slot_revision_id = rc_this_oldid AND slot_role_id = whatever AND slot_origin = rc_this_oldid
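To make the two join variants concrete, here is a toy reproduction in SQLite. The tables are cut down to only the columns named in the queries, the role id `2` is a hypothetical stand-in for "whatever", and the rollback row assumes (following the remark above) that a rollback keeps the `slot_origin` of the content it restores. It is a sketch of the idea, not of the real MediaWiki schema.

```python
import sqlite3

MEDIAINFO_ROLE = 2  # hypothetical role id standing in for "whatever"

SCHEMA = """
CREATE TABLE recentchanges (rc_id INTEGER PRIMARY KEY, rc_this_oldid INTEGER);
CREATE TABLE revision (rev_id INTEGER PRIMARY KEY);
CREATE TABLE slots (slot_revision_id INTEGER, slot_role_id INTEGER, slot_origin INTEGER);
"""

# (revision, role, origin); role 1 = main, role 2 = mediainfo (toy values).
SLOT_ROWS = [
    (1, 1, 1),             # rev 1: page created, main slot only
    (2, 1, 1), (2, 2, 2),  # rev 2: mediainfo slot added
    (3, 1, 3), (3, 2, 2),  # rev 3: main slot edited, mediainfo inherited
    (4, 1, 3), (4, 2, 4),  # rev 4: mediainfo edited
    (5, 1, 3), (5, 2, 2),  # rev 5: rollback of rev 4 (assumed to reuse origin 2)
]

THREE_WAY = """
SELECT rc_this_oldid FROM recentchanges
JOIN revision ON rc_this_oldid = rev_id
JOIN slots ON slot_revision_id = rev_id
  AND slot_role_id = ? AND slot_origin = rev_id
ORDER BY rc_this_oldid
"""

TWO_WAY = """
SELECT rc_this_oldid FROM recentchanges
JOIN slots ON slot_revision_id = rc_this_oldid
  AND slot_role_id = ? AND slot_origin = rc_this_oldid
ORDER BY rc_this_oldid
"""


def build_db():
    """Create an in-memory database holding the toy revisions above."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    revs = sorted({row[0] for row in SLOT_ROWS})
    db.executemany("INSERT INTO revision VALUES (?)", [(r,) for r in revs])
    db.executemany(
        "INSERT INTO recentchanges (rc_this_oldid) VALUES (?)", [(r,) for r in revs]
    )
    db.executemany("INSERT INTO slots VALUES (?, ?, ?)", SLOT_ROWS)
    return db


def changed_revisions(db, query, role=MEDIAINFO_ROLE):
    """Return revision ids that the given query reports for one slot role."""
    return [row[0] for row in db.execute(query, (role,))]
```

With this toy data both queries return revisions 2 and 4 and skip revision 5, the rollback, even though it changed the mediainfo slot relative to revision 4. It says nothing about how either query would perform on the production tables.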

Tgr added a comment. · Tue, Aug 27, 7:37 PM

OTOH, that would break it for any MediaWiki wikis not using Elastic...

IMO we'll have to bite that bullet at some point and change MediaWiki from a PHP-based application to a container-based one, so we can package ElasticSearch, Node.js, BlazeGraph, Lua etc. with it. (I'm hoping to start some conversation about that at TechConf which seems thematically well-aligned.) And yeah, recentchanges is fundamentally poorly suited to being in SQL.

RecentChanges has many flaws (for example, it is not a reliable stream, as timestamps are not sequential and it can't be queried by RC ID; see https://gerrit.wikimedia.org/r/c/mediawiki/core/+/302368), but as far as I understand it is the only way to get a change stream for a wiki without setting up Kafka, etc. So I imagine that until we get containers with all that stuff working, we're stuck with RC as the only publicly available way to get changes.

IMO we'll have to bite that bullet at some point and change MediaWiki from a PHP-based application to a container-based one, so we can package ElasticSearch, Node.js, BlazeGraph, Lua etc. with it. (I'm hoping to start some conversation about that at TechConf which seems thematically well-aligned.) And yeah, recentchanges is fundamentally poorly suited to being in SQL.

That has been proposed and opposed by various people for a long time. Being a container-based application limits you to whatever the container-maker includes, and limits you to environments where you can use a container in the first place. I believe one reason MediaWiki is so popular is because it's easy to get started on any generic LAMP stack (and several variations as well).

Tgr added a comment. · Tue, Aug 27, 9:14 PM

That has been proposed and opposed by various people for a long time.

I think it's worth revisiting, and there was unanimous agreement on that at last TechConf's relevant session. It's probably a conversation better left to another task, though (my bad for derailing) - for now, the filter has to deal with SQL as @Smalyshev says.

I think it's worth revisiting, and there was unanimous agreement on that at last TechConf's relevant session.

When I look at the minutes at that link I see very little discussion of containers at all, and mostly in the context of large farms rather than individual installs.

Smalyshev triaged this task as High priority. · Tue, Aug 27, 9:39 PM
Cparle added a subscriber: Cparle. · Wed, Sep 4, 4:33 PM

We're no longer the Multimedia team (we're "Structured Data" now 😺 )