
page-links-change stream is assigning template propagation events to the wrong edits
Open, High, Public

Description

The EventStream appears to fire events for template propagation and attribute them to the last editor of the page. This implies that the editor made an edit which added or removed links, when all that actually happened was that a template update propagated to the page.

One such example is below:

event: message
id: [{"topic":"eqiad.mediawiki.page-links-change","partition":0,"timestamp":1550573806001},{"offset":-1,"partition":0,"topic":"codfw.mediawiki.page-links-change"}]
data: {"added_links":[{"external":false,"link":"/wiki/Edge_(video_game)"},{"external":false,"link":"/wiki/Talk:Edge_(video_game)/GA1"}],"database":"enwiki","meta":{"domain":"en.wikipedia.org","dt":"2019-02-19T10:56:46+00:00","id":"0c8d2b30-3435-11e9-b0e4-1866da993d2e","request_id":"XGl-pQpAAEIAAA4B8rcAAACJ","schema_uri":"mediawiki/page/links-change/1","topic":"eqiad.mediawiki.page-links-change","uri":"https://en.wikipedia.org/wiki/Talk:Tampon_Run","partition":0,"offset":5705026},"page_id":45403409,"page_is_redirect":false,"page_namespace":1,"page_title":"Talk:Tampon_Run","performer":{"user_edit_count":20912,"user_groups":["abusefilter","sysop","*","user","autoconfirmed"],"user_id":15991542,"user_is_bot":false,"user_registration_dt":"2011-12-29T02:44:39Z","user_text":"Samwalton9"},"removed_links":[{"external":false,"link":"/wiki/Amy_Rose"},{"external":false,"link":"/wiki/Deus_Ex:_Mankind_Divided"},{"external":false,"link":"/wiki/List_of_Sonic_the_Hedgehog_characters"},{"external":false,"link":"/wiki/Talk:Deus_Ex:_Mankind_Divided/GA1"},{"external":false,"link":"/wiki/Talk:List_of_Sonic_the_Hedgehog_characters"}],"rev_id":648195620}

This event relates to a page I created (https://en.wikipedia.org/wiki/Talk:Tampon_Run), but it has received no edits since 2015. The rev_id relates to the last edit I made there (https://en.wikipedia.org/w/index.php?diff=648195620). There are no RecentChangesLinked for that timestamp, and a cursory glance through RecentChanges didn't find anything around that timestamp that would correspond to the edit. I take it, therefore, that this has something to do with the cache of the WikiProject:Video games template on that page. Indeed, some of the data in that event corresponds to a recent edit to Template:WPVG announcements, which is transcluded in the WP:VG banner (https://en.wikipedia.org/w/index.php?title=Template%3AWPVG_announcements&type=revision&diff=883776243&oldid=882982540).
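
One way a consumer could spot this kind of mismatch today is to compare the event time (meta.dt) with the timestamp of the revision named by rev_id. The sketch below assumes the revision timestamp has already been looked up separately (for instance via the MediaWiki API). The function name, the one-hour threshold and the placeholder 2015 timestamp are illustrative assumptions, not part of the stream.

```python
from datetime import datetime, timedelta


def looks_like_propagation(event, rev_timestamp, max_lag=timedelta(hours=1)):
    """Return True if the event fired long after the revision it names.

    rev_timestamp is the ISO 8601 timestamp of event["rev_id"], fetched
    elsewhere; max_lag is an arbitrary threshold chosen for illustration.
    """
    event_time = datetime.fromisoformat(event["meta"]["dt"])
    rev_time = datetime.fromisoformat(rev_timestamp)
    return event_time - rev_time > max_lag


# Trimmed copy of the event from the description.
event = {
    "meta": {"dt": "2019-02-19T10:56:46+00:00"},
    "page_title": "Talk:Tampon_Run",
    "rev_id": 648195620,
}

# Placeholder only: the real timestamp of rev 648195620 was not looked up;
# the task just says the page has had no edits since 2015.
placeholder_rev_timestamp = "2015-12-31T23:59:59+00:00"

print(looks_like_propagation(event, placeholder_rev_timestamp))
```

A real edit would normally produce an event within seconds of its revision timestamp, so a gap of years is a strong hint that something other than an edit triggered the event.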

See my comment below for a more concrete example of a template being updated and then propagating through the encyclopedia.

Event Timeline

Samwalton9 triaged this task as Normal priority.
Samwalton9 updated the task description. Feb 19 2019, 5:23 PM
Samwalton9 updated the task description. Feb 21 2019, 3:58 PM
awight removed a subscriber: awight. Mar 21 2019, 4:06 PM
Restricted Application added a subscriber: Liuxinyu970226. Mar 21 2019, 4:06 PM
bmansurov removed bmansurov as the assignee of this task. Apr 9 2019, 3:01 PM

No bandwidth to work on this task.

Samwalton9 raised the priority of this task from Normal to High. Fri, Jun 7, 12:13 PM (edited)

As I'm working with the event stream more, this is becoming a substantial issue for me. This template changed today (https://en.wikipedia.org/w/index.php?title=Template:Cite_ODNB&action=history), and the stream fired an event for every single page it propagates to, claiming that the cache updates were real edits - but it provides the metadata for the previous edit to the page.

The stream definitely shouldn't be linking template propagation to the page's previous edit - this is a clear bug.

The less clear issue is the stream's behaviour with templates more generally. If someone adds a link to a template and that link then propagates to 100 pages, do we really want 100 events? I don't think so, but that information might be useful to someone. Would it be possible to flag such events, to distinguish them from 'genuine' link changes? Perhaps the best solution is simply not to include the performer data for a template propagation so these edits can easily be filtered.
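
Until the stream carries such a flag, a consumer could approximate one by remembering which revision it last saw for each page, since a propagation event repeats the rev_id of the page's previous edit. The sketch below is a heuristic under that assumption. RevisionTracker is a hypothetical helper, and the first event seen for any page is treated as a real edit even if it is not.

```python
class RevisionTracker:
    """Remember the highest rev_id seen per page (hypothetical helper)."""

    def __init__(self):
        self._latest = {}  # (database, page_id) -> highest rev_id seen so far

    def is_new_revision(self, event):
        """Return True only when the event names a revision not seen before."""
        key = (event["database"], event["page_id"])
        rev_id = event["rev_id"]
        seen = self._latest.get(key)
        if seen is not None and rev_id <= seen:
            return False  # same or older revision again: probably propagation
        self._latest[key] = rev_id
        return True


tracker = RevisionTracker()
first = {"database": "enwiki", "page_id": 45403409, "rev_id": 648195620}
repeat = dict(first)  # a later event that reuses the same rev_id

print(tracker.is_new_revision(first))
print(tracker.is_new_revision(repeat))
```

This only helps a long-running consumer: after a restart the tracker is empty, so the state would need to be persisted or seeded for the heuristic to be useful.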

Restricted Application added a project: Analytics. Fri, Jun 7, 1:20 PM

(my bad, I read the task too quickly)

I expect you to provide us all with cupcakes at Wikimania as an apology. ;-)

Samwalton9 renamed this task from page-links-change stream is firing events related to transcluded template caches to page-links-change stream is assigning template propagation events to the wrong edits. Tue, Jun 11, 9:19 AM
Samwalton9 updated the task description.

Sam emailed me offlist, and we had a small discussion. Here are some excerpts:

Otto: I think the link change is detected per edit, not from a template in any way. The event itself is likely firing when expected. What if someone wanted to know when the links for a particular page or set of pages changed? They wouldn't necessarily know about a template, but do know which pages they want.

I think maybe we just need some extra information in the event about what caused the change, template or editor or otherwise.

Sam: If there's a simple way to flag events as being edits vs template propagation that would at least mean I could simply filter these edits out (I don't need them for my use of the stream). Do you have an idea of how complicated that would be to do?

I don't know how complicated it is, but I agree that it would be good information to have and hopefully solve this problem!

Pchelolo added a subscriber: Pchelolo.

So, the event is based on the LinksUpdate hook. LinksUpdate has a getTriggeredRevision method, which we use to assign a revision ID to the page. If it's null, we use the latest revId of the page. I believe (logically; this needs to be verified) that the LinksUpdate object triggered by a template propagation will have a null triggering rev id, and thus we'd be able to distinguish the events.

Next steps:

  1. Try out what triggering rev id is returned in case of template propagation
  2. Adjust the schema to make rev id optional.
  3. Adjust the code.

Not quite sure who should do it. I can have a look, but I'm quite occupied with other things right now.
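
The fallback described in this comment can be modelled roughly as follows. This is a Python sketch of the logic as stated here, not the extension's actual PHP code. Both function names are hypothetical, and whether the triggering revision really is null for template propagation is exactly what step 1 would need to verify.

```python
from typing import Optional


def current_rev_id(triggering_rev_id: Optional[int], latest_rev_id: int) -> int:
    """Behaviour as described: fall back to the page's latest revision."""
    if triggering_rev_id is not None:
        return triggering_rev_id
    return latest_rev_id


def proposed_rev_id(triggering_rev_id: Optional[int], latest_rev_id: int) -> Optional[int]:
    """Steps 2 and 3: rev_id becomes optional and is left unset without a trigger.

    latest_rev_id is kept in the signature only to mirror current_rev_id.
    """
    return triggering_rev_id


# 648195620 is the page's last edit from the task description.
print(current_rev_id(None, 648195620))   # the old edit's id leaks into the event
print(proposed_rev_id(None, 648195620))  # None marks "no triggering edit"
```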

The event in the header is a template propagation event, and still appears to have a rev id (of the previous edit to the page).

Thanks for the insight!

If it's null - we use the latest revId of the page.

Will this be the rev_id of the page after the links update edit is applied, or the parent revision from before? The code looks like it is attempting to use the rev_id after the links are updated, which makes sense to me. Dunno how MW works here, but is it possible that $title->getLatestRevID() is returning the parent revision because the new revision hasn't been propagated yet (e.g. from master to slave)?

I'm not sure what the rev_id field is meant to represent in the page-links-change event, but likely since we have rev_id in so many other events, it should mean the same thing: the revision that the event represents, e.g. the revision that had links change.

LinksUpdate has getTriggeredRevision

Hm, I see getTriggeredUser and getRevision (which we use), but not getTriggeredRevision. Perhaps we should have a new field on the event: triggering_rev_id or something, which has a different meaning than rev_id? So we'd have:

$revision = $linksUpdate->getRevision();  // May be null if no revision was set on the update.
$triggering_rev_id = $revision ? $revision->getId() : null;
$rev_id = $title->getLatestRevID();  // Or whatever gets the actual revision with the links changes.

So, I've done some experiments here.

When a template is updated and a link is added to it, the LinksUpdate hook is actually executed multiple times: once for the template itself, with the newly created latest revision of the template, and once for each page where the template is transcluded, with the revision equal to the latest revision of that page.

The documentation states in the schema:

rev_id:
  description: The head revision of the page which links has been changed.
  type: integer
  minimum: 0

Thus the code is actually working correctly.

The question of whether to emit the events for template-based links additions/deletions is a separate one. I would think 'why not'?

Being able to filter on whether it's a real edit or not is another option. We can use $linksUpdate->getCauseAction() for that: just add a cause_action string to the schema. Would that help? The issue is that the cause action is page-edit in both situations, but we can make MediaWiki differentiate between the two.

The question of whether to emit the events for template-based links additions/deletions is a separate one. I would think 'why not'?

I think the answer here is that we have no choice, because the template may add/remove links from the transcluded pages.

Being able to filter on whether it's a real edit or not is another option. We can use $linksUpdate->getCauseAction() for that: just add a cause_action string to the schema. Would that help? The issue is that the cause action is page-edit in both situations, but we can make MediaWiki differentiate between the two.

I'd be +1 on adding an action along the lines of template-edit.
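
If such a field were added, filtering on the consumer side could be as small as the sketch below. The cause_action field and the template-edit value are proposals from this discussion and do not exist in the current schema. The helper name and the sample page titles are made up for illustration.

```python
def is_direct_edit(event):
    """True unless the event is explicitly marked as template propagation.

    Events without the hypothetical cause_action field (for example, ones
    produced before a schema change) are treated as direct edits.
    """
    return event.get("cause_action") != "template-edit"


events = [
    {"page_title": "Example_A", "cause_action": "page-edit"},
    {"page_title": "Example_B", "cause_action": "template-edit"},
    {"page_title": "Example_C"},  # older event without the field
]

direct = [e["page_title"] for e in events if is_direct_edit(e)]
print(direct)  # ['Example_A', 'Example_C']
```

Treating a missing field as a direct edit keeps old consumers working, at the cost of still letting unflagged propagation events through.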

I am looking at this from a spam-detection point of view. As I see it,
this may result in a spam link being recorded under my name simply because
a spammer added that link to a template. That would undermine a lot of
statistical spam-detection mechanisms (e.g. mechanisms like xLinkBot).

Although an edit triggers the template update, is it possible to emit
the event without a username when it is due to a template update (because
the change comes from the db, not from the editor who made the edit), and
with a username when the edit itself added the template?

Dirk

Flagging events as being a result of template propagation seems like a sensible solution to me!