Page MenuHomePhabricator

page-links-change stream is assigning template propagation events to the wrong edits
Open, MediumPublic

Description

The EventStream appears to be firing events related to template propagation and assigning them to the last editor to make an edit to the page, implying that an editor made an edit which added or removed links when all that happened was a template update propagated to a page.

One such example is below:

event: message
id: [{"topic":"eqiad.mediawiki.page-links-change","partition":0,"timestamp":1550573806001},{"offset":-1,"partition":0,"topic":"codfw.mediawiki.page-links-change"}]
data: {"added_links":[{"external":false,"link":"/wiki/Edge_(video_game)"},{"external":false,"link":"/wiki/Talk:Edge_(video_game)/GA1"}],"database":"enwiki","meta":{"domain":"en.wikipedia.org","dt":"2019-02-19T10:56:46+00:00","id":"0c8d2b30-3435-11e9-b0e4-1866da993d2e","request_id":"XGl-pQpAAEIAAA4B8rcAAACJ","schema_uri":"mediawiki/page/links-change/1","topic":"eqiad.mediawiki.page-links-change","uri":"https://en.wikipedia.org/wiki/Talk:Tampon_Run","partition":0,"offset":5705026},"page_id":45403409,"page_is_redirect":false,"page_namespace":1,"page_title":"Talk:Tampon_Run","performer":{"user_edit_count":20912,"user_groups":["abusefilter","sysop","*","user","autoconfirmed"],"user_id":15991542,"user_is_bot":false,"user_registration_dt":"2011-12-29T02:44:39Z","user_text":"Samwalton9"},"removed_links":[{"external":false,"link":"/wiki/Amy_Rose"},{"external":false,"link":"/wiki/Deus_Ex:_Mankind_Divided"},{"external":false,"link":"/wiki/List_of_Sonic_the_Hedgehog_characters"},{"external":false,"link":"/wiki/Talk:Deus_Ex:_Mankind_Divided/GA1"},{"external":false,"link":"/wiki/Talk:List_of_Sonic_the_Hedgehog_characters"}],"rev_id":648195620}

This event relates to a page I created (https://en.wikipedia.org/wiki/Talk:Tampon_Run), but it has received no edits since 2015. The rev_id relates to the last edit I made there (https://en.wikipedia.org/w/index.php?diff=648195620). There are no RecentChangesLinked for that timestamp, and a cursory glance through RecentChanges didn't find anything around that timestamp that would correspond to the edit. I take it, therefore, that this has something to do with the cache of the WikiProject:Video games template on that page. Indeed, some of the data in that event corresponds to a recent edit to Template:WPVG announcements, which is transcluded in the WP:VG banner (https://en.wikipedia.org/w/index.php?title=Template%3AWPVG_announcements&type=revision&diff=883776243&oldid=882982540).

See my comment below for a more concrete example of a template being updated and then propagating through the encyclopedia.

There should be some way to understand if an event relates to a 'real' change to the page, or if it is a result of template propagation.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

(my bad, I read the task too quickly)

I expect you to provide us all with cupcakes at Wikimania as an apology. ;-)

Samwalton9-WMF renamed this task from page-links-change stream is firing events related to transcluded template caches to page-links-change stream is assigning template propagation events to the wrong edits.Jun 11 2019, 9:19 AM
Samwalton9-WMF updated the task description. (Show Details)

Sam emailed me offlist, and we had a small discussion. Here are some excerpts

Otto: I think the link change is detected per edit, not from a template in any way. The event itself is likely firing when expected. What if someone wanted to know when the links for a particular page or set of pages changed? They wouldn't necessarily know about a template, but do know which pages they want.

I think maybe we just need some extra information in the event about what caused the change, template or editor or otherwise.

Sam: If there's a simple way to flag events as being edits vs template propagation that would at least mean I could simply filter these edits out (I don't need them for my use of the stream). Do you have an idea of how complicated that would be to do?

I don't know how complicated it is, but I agree that it would be good information to have and hopefully solve this problem!

Pchelolo added a subscriber: Pchelolo.

So, the event is based on LinksUpdate hook. the LinksUpdate has getTriggeredRevision method which we use to assign a revision ID to the page. If it's null - we use the latest revId of the page. I believe that (logically, needs to be verified) the LinksUpdate object triggered by a template propagation will have null triggering rev id and thus we'd be able to distinguish the events.

Next steps:

  1. Try out what triggering rev id is returned in case of template propagation
  2. Adjust the schema to make rev id optional.
  3. Adjust the code.

Not quite sure who should do it. I can have a look, but I'm quite occupied with other things right now.

So, the event is based on LinksUpdate hook. the LinksUpdate has getTriggeredRevision method which we use to assign a revision ID to the page. If it's null - we use the latest revId of the page. I believe that (logically, needs to be verified) the LinksUpdate object triggered by a template propagation will have null triggering rev id and thus we'd be able to distinguish the events.

The event in the header is a template propagation event, and still appears to have a rev id (of the previous edit to the page).

Next steps:

  1. Try out what triggering rev id is returned in case of template propagation
  2. Adjust the schema to make rev id optional.
  3. Adjust the code.

Not quite sure who should do it. I can have a look, but I'm quite occupied with other things right now.

Thanks for the insight!

If it's null - we use the latest revId of the page.

Will this be the rev_id of the page after the links update edit is applied, or the parent revision from before? The code looks like it is attempting to use the rev_id after the links are updated, which makes sense to me. Dunno how MW works here, but it is it possible $title->getLatestRevID() is returning the parent revision because the new revision hasn't been propagated yet (e.g. from master to slave)?

I'm not sure what the rev_id field is meant to represent in the page-links-change event, but likely since we have rev_id in so many other events, it should mean the same thing: the revision that the event represents, e.g. the revision that had links change.

LinksUpdate has getTriggeredRevision

Hm, I see getTriggeredUser and getRevision (which we use), but not getTriggeredRevision. Perhaps we should have a new field on the event: triggering_rev_id or something, which has a different meaning than rev_id? So we'd have:

$triggering_rev_id = $linksUpdate->getRevision()->getId();
$rev_id = $title->getLatestRevID();  // Or whatever get's the actual revision with the links changes.

So, I've done some experiments here.

When a template is updated and a link is added to it, the LinksUpdate hook is actually executed multiple times - for the template itself, with a latest just created revision of the template, and for the pages where the template is transcluded, with the revision equal to the latest revision of the page.

The documentation states in the schema:

rev_id:
  description: The head revision of the page which links has been changed.
  type: integer
  minimum: 0

Thus the code is actually working correctly.

The question of whether to emit the events for template-based links additions/deletions is a separate one. I would think 'why not'?

Being able to filter whether it's a real edit or not is another option, we can utilize $linksUpdate->getCauseAction() for that - just add a cause_action string into the schema. Would that help? The issue here would be that the cause action is page-edit for both situations here, but we can make Mediawiki differentiate between the two.

The question of whether to emit the events for template-based links additions/deletions is a separate one. I would think 'why not'?

I think the answer here is that we have no choice, because the template may add/remove links from the transcluded pages.

Being able to filter whether it's a real edit or not is another option, we can utilize $linksUpdate->getCauseAction() for that - just add a cause_action string into the schema. Would that help? The issue here would be that the cause action is page-edit for both situations here, but we can make Mediawiki differentiate between the two.

I'd be +1 on adding an action of the sorts of template-edit.

I am looking at this from a spam-detection point-of-view. The way I see
this, this may result in records on my name because I add a spamlink
because a spammer added a link to a template. That would disable a lot of
statistical spam-detection mechanisms (and, e.g. mechanisms like xLinkBot).

Although the edit is triggering the template update, is it possibe to have
the event without username (because it is because of the db, not because of
the editor initiating the edit) if it is due to a template update, and with
username if the edit added the template?

Dirk

Flagging events as being a result of template propagation seems like a sensible solution to me!

@Ottomata @Pchelolo Is there any chance someone might have time to implement even the template-edit part of this? This is really the only identified major bug we have with this stream right now and it would be great if it could be fixed; unfortunately the person originally responsible for its implementation has now left the Foundation.

Pchelolo lowered the priority of this task from High to Medium.Jul 26 2019, 2:11 PM

Change 525835 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] Add cause_action to page/properties-change

https://gerrit.wikimedia.org/r/525835

I'll assign this to me, but I can't commit to a timeline! I'll try to do it within the next few months, please feel free to bother me!

Change 525835 abandoned by Ottomata:
Add cause_action to page/properties-change

https://gerrit.wikimedia.org/r/525835

Change 527122 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] Add cause_action to page/links-change and bump to version 1.1.0

https://gerrit.wikimedia.org/r/527122

The issue here would be that the cause action is page-edit for both situations here, but we can make Mediawiki differentiate between the two.

Looking at some refreshLinks job event data, I think this is actually 'edit-page'.

I can add a cause_action field to the page-links-change event, but I'm don't know how to make Mediawiki set something different in the case of a template edit. @Pchelolo can you help here?

Also, should we go ahead and include cause_agent in the event too? Is this different than what is returned by getTriggeringUser(), which is used as the event's performer field?

Change 528925 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/extensions/EventBus@master] Add cause_action to page-links-change event

https://gerrit.wikimedia.org/r/528925

One more: should we add cause_action to page-properties-change too?

Thanks for starting to work on this @Ottomata! Are you waiting on feedback from @Pchelolo?

Change 534526 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] Add cause_action to page/links-change and bump to version 1.1.0

https://gerrit.wikimedia.org/r/534526

Change 527122 merged by Ppchelko:
[mediawiki/event-schemas@master] Add mediawiki/common and mediawiki/page/common

https://gerrit.wikimedia.org/r/527122

hey @Samwalton9, for the MW side changes, we'll add this into the schedule for our next Clinic Duty sprint which should start on Oct 30th. I'll try to get the suggested changes sanity checked before then. Is that timeline ok for you?

I was asked to look at this, but it's not clear to me what the current situation is. It's not even 100% clear to me what the actual ask is, as there are a lot of edge cases in this problem-space.

The update to pagelinks can be triggered by an edit to the page, or an edit to a template used by the page, or a change to a file used in the page, or a purge/reparse of the page.[1] Possibly other things too.

If it's an edit to the template, it could be due to a link changed in the template itself, or it could be due to a change in the way the template uses the passed parameters (with the link being derived from those parameters in some way), or it could be one of the other triggers and it just so happened that the template-edit-triggered recursive reparse of the page went through before the actual cause's reparse happened. For that matter, the last applies to a direct edit of the page too: it might not have been the edit itself, just that the associated reparse picked up some unrelated change too.[2]

[1]: Examples of changes to pagelinks for a purge/reparse not resulting from edits include links that are conditional the current date/time or on #ifexists. There are many other possibilities.
[2]: Editors often make "null edits" or "dummy edits" to specifically trigger this when the unrelated change's update seems to have been lost or overly delayed.

So, the event is based on LinksUpdate hook. the LinksUpdate has getTriggeredRevision method which we use to assign a revision ID to the page. If it's null - we use the latest revId of the page. I believe that (logically, needs to be verified) the LinksUpdate object triggered by a template propagation will have null triggering rev id and thus we'd be able to distinguish the events.

I think this was referring to LinksUpdate::getRevision(), I don't see that there has ever been a getTriggeredRevision method.

Looking through the code, it does appear true that the template edit itself will return a non-null value while the recursive reparses will have a null value.

From this and other comments here it doesn't sound like the state of LinksUpdate::getRevision() is reflected in the stream, though; it's used if present, but if not the stream (incorrectly, for the use case here) fills in the latest revision of the page instead.

Being able to filter whether it's a real edit or not is another option, we can utilize $linksUpdate->getCauseAction() for that - just add a cause_action string into the schema. Would that help? The issue here would be that the cause action is page-edit for both situations here, but we can make Mediawiki differentiate between the two.

I'd be +1 on adding an action of the sorts of template-edit.

Since other things can cause the recursive reparse, template-edit might be inaccurate. recursive-reparse might be more accurate, but might lose too much information. Or we'd need page-edit-recursive, api-purge-recursive, and so on (and be sure to avoid producing "page-edit-recursive-recursive" or similar silliness).

Or we could change only page-edit to template-edit, leaving all other causes unchanged when doing recursive reparses. Although note that it's not necessarily a Template-namespace page being edited that might be triggering the recursive reparse; enwiki articles, for example, sometimes use transclusion or Scribunto's title:getContent() to pull in data from other articles.

Also, should we go ahead and include cause_agent in the event too? Is this different than what is returned by getTriggeringUser(), which is used as the event's performer field?

They're assigned in different places, although at some quick search I can't see any code paths that specifically set them to different values.

hey @Samwalton9, for the MW side changes, we'll add this into the schedule for our next Clinic Duty sprint which should start on Oct 30th. I'll try to get the suggested changes sanity checked before then. Is that timeline ok for you?

Overall I see two options proposed here:

  1. Use LinksUpdate::getRevision() to determine whether the page-edit is a direct edit or is a recursive reparse.
  2. Adjust the getCauseAction() when doing the recursion.

Both seem feasible. As far as MediaWiki goes, #1 wouldn't need any changes to MediaWiki, while #2 would need a small change to LinksUpdate::queueRecursiveJobs(). For #2, we'd need to decide on the specific behavior as mentioned above.

I think the simplest change that satisfies Sam here would be the best choice. It'd be nice if MW could identify a real 'cause action', but it sounds like #1 is way simpler.

Setting rev_id to the latest revision of the page is likely still thing right thing to do, seeing a the description of that field is 'The head revision of the page at the time of this event'.

If we aren't going to modify cause_action, then I think we shouldn't use it. Instead, perhaps we can just add a new field varied based on the return LinksUpdate::getRevision()? Perhaps a boolean? direct_edit? Or if we can think of something more future proof that is not the same thing as cause_action, we could use some string field. @Pchelolo thoughts?

CCicalese_WMF added a subscriber: CCicalese_WMF.

Marking as Resolved as it is in the Done column. Feel free to reopen if there is remaining work.

Is this complete @Ottomata @Pchelolo? My understanding from the above was that discussion was still ongoing.

Change 534526 abandoned by Ottomata:
Add cause_action to page/links-change and bump to version 1.1.0

https://gerrit.wikimedia.org/r/534526

I think this is not done.

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/528925

From Petr:

The problem with this is that we first need to figure out if in core we can set these to the useful (for our purpose) values. Currently the cause-action will be page-edit regardless whether it was a transclusion or it was a real edit. So -1 until we figure that bit out.

Or actually, we were going to do

'Use LinksUpdate::getRevision() to determine whether the page-edit is a direct edit or is a recursive reparse.

which I don't think was done?

Moving out of Done into Next pending a response from @Pchelolo.

I just came across this event:
URL: //www.jstor.org/stable/23601441 (added)
Timestamp: 2020-08-15 06:29:56
Revision ID: 973073751 (https://en.wikipedia.org/w/index.php?title=Iota_Alpha_Pi&diff=973073751&oldid=973000267)

I think this is the same issue, but it's less clear to me - that URL had actually been added about a year prior so it's not clear to me why that edit in particular triggered an event. I don't know if the preceding or following edits to the page also caused an event - I only noticed that edit because this user is one we're tracking as part of TWL. Anyway, another example of an event I'd like to be able to filter out :)

The underlying issue here is that the LinksUpdate object simply doesn't have the information whether it was caused by a direct edit, or an indirect update. This information would have to be injected somehow, either in the constructor or via setCause(). I'd suggest to add an optional $hints array to either, which can be used to carry such information.

Hi - I am also having trouble. In one day, it falsely reported over 2,000 edits on ukwiki as made by InternetArchiveBot - not a small number of false positives.

Krinkle added a subscriber: Krinkle.

See also: T216492: Page-links-change stream doesn't capture duplicated links.

Instrumentation for the mediawiki.page-links-change event lives in the EventBus extension, not WikimediaEvents.