
EventStreams sending same data over and over (page links change)
Open, Low · Public · BUG REPORT

Description

Concerning this API endpoint:
https://stream.wikimedia.org/?doc#/streams/get_v2_stream_mediawiki_page_links_change

An example JSON record, as sent by the stream, is given below. Note that the diff at https://arz.wikipedia.org/w/index.php?diff=5641431 does not match the links in the JSON. This JSON record is sent repeatedly by the stream: on maybe 10 days in a month, 1 to 7 times per day. The diff number and article title change, but the set of links in the JSON stays the same.

Event Timeline

Restricted Application added a subscriber: Aklapper.
{"$schema":"/mediawiki/page/links-change/1.0.0","meta":{"uri":"https://arz.wikipedia.org/wiki/%D8%B1%D9%88%D8%AF%D9%8A%D9%88%D9%85","request_id":"bc403e26b8b72080c369aa66","id":"26739413-d570-4363-af37-af690a94f501","dt":"2021-09-01T23:30:50Z","domain":"arz.wikipedia.org","stream":"mediawiki.page-links-change","topic":"codfw.mediawiki.page-links-change","partition":0,"offset":203083041},"database":"arzwiki","page_id":1389768,"page_title":"روديوم","page_namespace":0,"page_is_redirect":false,"rev_id":5641431,"performer":{"user_text":"InternetArchiveBot","user_groups":["bot","*","user","autoconfirmed"],"user_is_bot":true,"user_id":142851,"user_registration_dt":"2020-12-18T16:05:11Z","user_edit_count":20253},"added_links":[{"link":"/wiki/%25D8%25AA%25D8%25B5%25D9%2586%25D9%258A%25D9%2581:%25D9%2585%25D9%2582%25D8%25A7%25D9%2584%25D8%25A7%25D8%25AA_%25D9%2581%25D9%258A%25D9%2587%25D8%25A7_%25D9%2585%25D8%25B9%25D8%25B1%25D9%2581%25D8%25A7%25D8%25AA_BNF","external":false},{"link":"/wiki/%25D8%25AA%25D8%25B5%25D9%2586%25D9%258A%25D9%2581:%25D9%2585%25D9%2582%25D8%25A7%25D9%2584%25D8%25A7%25D8%25AA_%25D9%2581%25D9%258A%25D9%2587%25D8%25A7_%25D9%2585%25D8%25B9%25D8%25B1%25D9%2581%25D8%25A7%25D8%25AA_GND","external":false},{"link":"/wiki/%25D8%25AA%25D8%25B5%25D9%2586%25D9%258A%25D9%2581:%25D9%2585%25D9%2582%25D8%25A7%25D9%2584%25D8%25A7%25D8%25AA_%25D9%2581%25D9%258A%25D9%2587%25D8%25A7_%25D9%2585%25D8%25B9%25D8%25B1%25D9%2581%25D8%25A7%25D8%25AA_LCCN","external":false},{"link":"/wiki/%25D8%25AA%25D8%25B5%25D9%2586%25D9%258A%25D9%2581:%25D9%2585%25D9%2582%25D8%25A7%25D9%2584%25D8%25A7%25D8%25AA_%25D9%2581%25D9%258A%25D9%2587%25D8%25A7_%25D9%2585%25D8%25B9%25D8%25B1%25D9%2581%25D8%25A7%25D8%25AA_LNB","external":false},{"link":"/wiki/%25D8%25AA%25D8%25B5%25D9%2586%25D9%258A%25D9%2581:%25D9%2585%25D9%2582%25D8%25A7%25D9%2584%25D8%25A7%25D8%25AA_%25D9%2581%25D9%258A%25D9%2587%25D8%25A7_%25D9%2585%25D8%25B9%25D8%25B1%25D9%2581%25D8%25A7%25D8%25AA_NDL","external":false},{"link":"/
wiki/%25D8%25AA%25D8%25B5%25D9%2586%25D9%258A%25D9%2581:CS1_maint:_uses_authors_parameter","external":false},{"link":"/wiki/International_Standard_Book_Number","external":false},{"link":"/wiki/National_Library_of_Latvia","external":false},{"link":"/wiki/Oxford_University_Press","external":false},{"link":"/wiki/%25D9%2585%25D9%2583%25D8%25AA%25D8%25A8%25D8%25A9_%25D8%25A7%25D9%2584%25D9%258A%25D8%25A7%25D8%25A8%25D8%25A7%25D9%2586_%25D8%25A7%25D9%2584%25D9%2588%25D8%25B7%25D9%2586%25D9%258A%25D9%2587","external":false},{"link":"/wiki/%25D9%2585%25D9%2583%25D8%25AA%25D8%25A8%25D8%25A9_%25D9%2581%25D8%25B1%25D9%2586%25D8%25B3%25D8%25A7_%25D8%25A7%25D9%2584%25D9%2588%25D8%25B7%25D9%2586%25D9%258A%25D9%2587","external":false},{"link":"/wiki/%25D9%2585%25D9%2584%25D9%2581_%25D8%25A7%25D8%25B3%25D8%25AA%25D9%2586%25D8%25A7%25D8%25AF%25D9%2589_%25D9%2585%25D8%25AA%25D9%2583%25D8%25A7%25D9%2585%25D9%2584","external":false},{"link":"/wiki/%25D9%2586%25D9%2585%25D8%25B1%25D8%25A9_%25D8%25AA%25D8%25AD%25D9%2583%25D9%2585_%25D9%2585%25D9%2583%25D8%25AA%25D8%25A8%25D8%25A9_%25D8%25A7%25D9%2584%25D9%2583%25D9%2588%25D9%2586%25D8%25AC%25D8%25B1%25D8%25B3","external":false},{"link":"/wiki/Hamish_Hamilton_Ltd","external":false},{"link":"/wiki/%25D9%2585%25D8%25B9%25D8%25B1%25D9%2581_%25D8%25A7%25D9%2584%25D8%25BA%25D8%25B1%25D8%25B6_%25D8%25A7%25D9%2584%25D8%25B1%25D9%2582%25D9%2585%25D9%2589","external":false},{"link":"/wiki/%25D9%2585%25D8%25B3%25D8%25A7%25D8%25B9%25D8%25AF%25D8%25A9:CS1_errors","external":false},{"link":"https://www.wikidata.org/wiki/Q1087","external":true},{"link":"https://commons.wikimedia.org/wiki/Category:Rhodium","external":true},{"link":"https://www.quora.com/topic/Rhodium-1","external":true},{"link":"https://www.google.com/search%3Fkgmid%3D/m/025scm0","external":true},{"link":"https://catalogue.bnf.fr/ark:/12148/cb12218903f","external":true},{"link":"https://academic.microsoft.com/v2/detail/521398313","external":true},{"link":"https://academic.microsoft.com
/v2/detail/2910290644","external":true},{"link":"https://id.loc.gov/authorities/sh85113755","external":true},{"link":"https://kopkatalogs.lv/F/%3Ffunc%3Ddirect%26local_base%3Dlnc10%26doc_number%3D000307942","external":true},{"link":"https://d-nb.info/gnd/4178038-3","external":true},{"link":"https://archive.org/details/naturesbuildingb0000emsl","external":true},{"link":"https://archive.org/details/elementsvisualex0000gray","external":true},{"link":"https://archive.org/details/periodictableits0000scer","external":true},{"link":"//doi.org/10.1351%252Fgoldbook","external":true},{"link":"//doi.org/10.1351%252Fgoldbook","external":true},{"link":"https://data.bnf.fr/ark:/12148/cb12218903f","external":true},{"link":"https://id.loc.gov/authorities/subjects/sh85113755","external":true},{"link":"https://kopkatalogs.lv/F%3Ffunc%3Ddirect%26local_base%3Dlnc10%26doc_number%3D000307942%26P_CON_LNG%3DENG","external":true},{"link":"https://id.ndl.go.jp/auth/ndlna/00569786","external":true}]}

No idea. Feel free to adjust the tags for the right audience; I wasn't sure.

Hmmm very strange! @Pchelolo? Sounds like something is strange with the MW hook.

Some observations off the top of my head:

  • If a link update (more specifically, a RefreshLinksJob) fails, it will be re-scheduled. That would cause the event to be re-sent (but for the same page and revision ID).
  • When a page gets re-parsed because a template was updated (e.g. by adding links to it), that will trigger an event with link updates that have nothing to do with the current revision ID of that page. Attributing the links update to the edit identified by rev_id is very often wrong.
  • Adding links to a template will cause the same links to show up as added to all pages that use that template.
  • The links go to external "authority files" (BNF, GND, LCCN, LNB are all classification systems used by libraries). Templates for external identifiers are often fed from Wikidata, so an edit on Wikidata would cause the page to be re-parsed and a links-update event to be fired. However, these identifiers are usually specific to a single page, so seeing the same update for multiple pages is surprising. Is it really the *exact* same, or does it just look kind of the same, because it's the same set of external identifiers, and most of them stay the same, and just one of them was updated?
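
To answer the "is it really the *exact* same" question, one option is to fingerprint the link set of each event and group events by that fingerprint. The following is only a minimal sketch: the helper names `link_fingerprint` and `group_by_fingerprint` are hypothetical, and it assumes events may also carry a `removed_links` array shaped like `added_links` (only `added_links` appears in the sample record above).

```python
import hashlib
import json
from collections import defaultdict


def link_fingerprint(event):
    """Return a stable hash of the link sets carried by one event."""

    def normalise(key):
        # De-duplicate and sort so that ordering differences do not matter.
        # "removed_links" is an assumed field; a missing key counts as empty.
        return sorted(
            {(item["link"], bool(item.get("external"))) for item in event.get(key, [])}
        )

    payload = json.dumps(
        {"added": normalise("added_links"), "removed": normalise("removed_links")}
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def group_by_fingerprint(events):
    """Map each fingerprint to the (page_title, rev_id) pairs that carried it."""
    groups = defaultdict(list)
    for event in events:
        groups[link_fingerprint(event)].append(
            (event.get("page_title"), event.get("rev_id"))
        )
    return dict(groups)
```

Any fingerprint that maps to more than one page is a byte-for-byte identical link set; near-duplicates (same identifiers, one changed) end up under different fingerprints and can be diffed separately.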

Removing inactive assignee (Platform Engineering: Please unassign tasks of previous team members.)

My workaround is to compare the date of the diff (via the MW API) with the date in the JSON; if they are too far apart, I assume the JSON is buggy data, ignore it, and log it. There is a massive log now.
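
For anyone wanting to reproduce the workaround, here is a rough sketch of the comparison, not the exact code in use. The function names and the one-hour threshold are assumptions chosen for illustration; "too far apart" has no fixed value. The HTTP request itself is left out: `revision_query_url` only builds the MW API query (`action=query`, `prop=revisions`, `rvprop=timestamp`) whose response contains the revision timestamp.

```python
from datetime import datetime, timedelta
from urllib.parse import urlencode


def parse_utc(ts):
    # Timestamps look like "2021-09-01T23:30:50Z"; older Python versions
    # do not accept the trailing "Z" in fromisoformat, so rewrite it.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def revision_query_url(domain, rev_id):
    """Build the MW API URL that returns the timestamp of one revision."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": rev_id,
        "rvprop": "timestamp",
        "format": "json",
    }
    return f"https://{domain}/w/api.php?{urlencode(params)}"


def is_suspect(event, revision_ts, max_gap=timedelta(hours=1)):
    """Return True if the event time and the revision time are too far apart.

    max_gap is an assumed threshold, not a value taken from this task.
    """
    gap = abs(parse_utc(event["meta"]["dt"]) - parse_utc(revision_ts))
    return gap > max_gap
```

A consumer would fetch `revision_query_url(event["meta"]["domain"], event["rev_id"])`, read the revision timestamp from the response, and skip and log the event when `is_suspect` returns True.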

Ottomata lowered the priority of this task from High to Low. · Jul 22 2025, 12:40 PM