
🐼 Build prototype to compare the diff (before and after reparse) (💌)
Open, Needs Triage | Public | Spike

Description

Aim:
Explore the feasibility of comparing the HTML of a Wikipedia page before and after it is reparsed following a Wikidata change.

The prototype compares the reparsed HTML with the HTML from the parser cache (if available) and inserts the RC log entry accordingly: if the diff shows no change, the recent change entry is suppressed; if there is no cached version to compare with, the RC entry is sent anyway.
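The decision rule described above can be sketched in a few lines. This is only an illustration in Python, not the prototype's PHP code; `Decision` and `decide_rc_entry` are hypothetical names, and plain string equality stands in for whatever HTML comparison the prototype really uses.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    INSERT = "insert"      # write the recent-change entry
    SUPPRESS = "suppress"  # skip it: the edit did not change the rendered page


def decide_rc_entry(cached_html: Optional[str], reparsed_html: str) -> Decision:
    """Hypothetical helper: decide whether a Wikidata change is logged on the client."""
    if cached_html is None:
        # No cached rendering to compare against, so "no change" cannot be
        # shown and the entry is sent anyway.
        return Decision.INSERT
    if cached_html == reparsed_html:
        # Assumption: simple equality stands in for the real HTML diff.
        return Decision.SUPPRESS
    return Decision.INSERT
```

Only the "cache hit and identical HTML" case suppresses an entry; everything else behaves as it does today.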

Tradeoffs for the approach:

  • Around a third of pages have no cached version to compare against. For those pages things stay as they are now, with 'some' false positives remaining, so this approach can solve at most about two thirds (roughly 66%) of the problem. If the page is not in the cache, we can't suppress the notification.

[Figure: Screenshot from 2026-04-09 16-43-40.png (1,681×1,017 px, 167 KB)]

  • False positives due to race conditions

Notes for reviewers:

Things to consider:

  • Does this seem likely to adversely affect e.g. page loading in production?
  • There are known code quality issues in the patches, and they would conflict with master. Please don't worry about these for now; this is just a proof-of-concept prototype.

The prototype commits:

Architecture:

  • We created a new hook in core (includes/Hook/RefreshLinksJobBeforeInsertRecentChangeHook.php) and handle it in Wikibase (client/includes/Hooks/RefreshLinksJobBeforeInsertRecentChangeHandler.php).
  • We reused some of the existing logic for diffing and for comparing against the parser cache from includes/JobQueue/Jobs/RefreshLinksJob.php.
  • The hook is fired only when there is a ParserCache hit and the HTML changed after the edit to Wikibase.
  • The hook is then handled in Wikibase, which inserts the related change into the RC table.
  • We removed the previous RC table insertion code from the client/includes/Changes/ChangeHandler.php class and moved it into the hook handler, so that it runs after the diff check is done.
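As a rough picture of how these pieces fit together, here is a minimal Python sketch of the control flow. It is not the prototype's PHP code: `HookContainer`, `refresh_links` and `make_wikibase_handler` are hypothetical stand-ins, the hook name is shortened from the file name above, and the hook context is kept as an opaque dict. The sketch follows the bullets literally, so a cache miss does not fire the hook here.

```python
from typing import Callable, Dict, List

HookHandler = Callable[[str, dict], None]


class HookContainer:
    """Hypothetical stand-in for the hook registry on the core side."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[HookHandler]] = {}

    def register(self, name: str, handler: HookHandler) -> None:
        self._handlers.setdefault(name, []).append(handler)

    def run(self, name: str, title: str, context: dict) -> None:
        for handler in self._handlers.get(name, []):
            handler(title, context)


def refresh_links(hooks: HookContainer, parser_cache: Dict[str, str],
                  title: str, new_html: str, context: dict) -> None:
    """Core side: fire the hook only on a cache hit whose HTML changed."""
    old_html = parser_cache.get(title)
    parser_cache[title] = new_html  # the reparse refreshes the cache
    if old_html is not None and old_html != new_html:
        hooks.run("RefreshLinksJobBeforeInsertRecentChange", title, context)


def make_wikibase_handler(rc_table: List[dict]) -> HookHandler:
    """Client side: the handler is the only place that writes RC rows."""
    def handler(title: str, context: dict) -> None:
        rc_table.append({"title": title, "change": context.get("change")})
    return handler
```

In this sketch the core side never inspects the context it passes along; only the handler knows that it carries a Wikibase change.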

Local testing
The details for local testing are in the comments.

Ticket acceptance criteria:

  • Resolve the issue of recent changes not being written to the database (and/or mention it as an issue for the reviewers)
  • Check that it works locally: with no change to the page, the recent change doesn't get inserted. Ideally, make a video showing it working
  • A one-line change was added by another ticket in the updated master; we need to 'undo' it and leave a comment that it would need to be reconciled with our version
  • Have a list of 'still todos' for the reviewers (including the ones above)

Event Timeline

Change #1261347 had a related patch set uploaded (by Neslihan Turan; author: Neslihan Turan):

[mediawiki/extensions/Wikibase@master] [WIP] Handle hook for def conditioned recent change logs and pass change to job to be inserted lated.

https://gerrit.wikimedia.org/r/1261347

Change #1260687 had a related patch set uploaded (by Neslihan Turan; author: Neslihan Turan):

[mediawiki/core@master] [WIP] add hook for def conditioned recent change logs.

https://gerrit.wikimedia.org/r/1260687

Prep local dev environment:
To continue with this test, you would need to have a two-host setup (Repository and Client). The easiest way to set this up is through docker-dev. If you already have your own environment, feel free to use it too.
Steps:
Clone this repository into a working directory of your choice:

https://gitlab.wikimedia.org/repos/wmde/docker-dev

Then clone mediawiki-core into the docker-dev directory, along with the Vector and MinervaNeue skins:

git clone --depth=1 https://gerrit.wikimedia.org/r/mediawiki/core mediawiki
git clone https://gerrit.wikimedia.org/r/mediawiki/skins/Vector mediawiki/skins/Vector
git clone https://gerrit.wikimedia.org/r/mediawiki/skins/MinervaNeue mediawiki/skins/MinervaNeue

Running:
To start the containers with the two-host setup, run the following command in the docker-dev directory. Depending on which ports are already in use, you might want to change the port number:

MW_DOCKER_PORT=8081 ./bin/use Wikibase-Client

Testing steps after local env is ready:
Here is the video demo of how it works, and below are all the steps done in the video.

  1. Pull the changes for both Wikibase and core.
  2. Make sure the ParserFunctions extension is installed:
cd extensions/
git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/ParserFunctions

Add the following line at the bottom of your LocalSettings.php file:

wfLoadExtension( 'ParserFunctions' );
  3. Create a template page on your client wiki at
http://client.mediawiki.local.wmftest.net:8080/wiki/Template:IsExistTemplate

Content of the template page:

{{#if:{{#property:P2}}
| p2 exist
| p2 doesnt exist
}}
  4. Use your template in an article:
{{IsExistTemplate}}
  5. Create a Wikibase item, and make sure your article is sitelinked to it and that the item has a P2 statement.
  6. Run all jobs on the client and the repo to get a fresh start.
  7. Make a null edit to your article page (click edit and save with no change) to make sure the page is in the parser cache. (Otherwise it will not work; we think this is because the ParserCache is empty. What do you think?)
  8. Edit the value of the P2 property.
  9. Run the jobs for both client and repo:
mw dev mediawiki mwscript runJobs -- --wiki dev
mw dev mediawiki mwscript runJobs -- --wiki client
  10. Go to the RC page on your client and make sure the Wikibase changes filter is on.
  11. You should not see an entry in the RC table for your edit to the P2 value, because the change to P2 did not change anything in your article (it was "p2 exist", and it still is "p2 exist").
  12. Make a null edit to your article on the client again.
  13. Delete P2 from your Wikibase item.
  14. Run the jobs.
  15. Go to the RC page on your client and make sure the Wikibase changes filter is on.
  16. This time you should see an entry for this change, because your edit to P2 affects the article page (it was "p2 exist", now it is "p2 doesnt exist").
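Why the first edit is suppressed and the second is not can be reproduced with a small model of the template. This is a Python sketch, not MediaWiki code; `render_is_exist_template` and `rc_entry_expected` are hypothetical helpers, and it assumes that {{#property:P2}} renders as an empty string when the item has no P2 statement.

```python
from typing import Optional


def render_is_exist_template(p2_value: Optional[str]) -> str:
    """Mimic {{#if:{{#property:P2}} | p2 exist | p2 doesnt exist }}.

    #if picks the first branch when the test string is non-empty after
    trimming whitespace. Assumption: a missing P2 statement renders as "".
    """
    test_string = (p2_value or "").strip()
    return "p2 exist" if test_string else "p2 doesnt exist"


def rc_entry_expected(p2_before: Optional[str], p2_after: Optional[str]) -> bool:
    """An RC entry is expected only if the rendered article text differs."""
    return render_is_exist_template(p2_before) != render_is_exist_template(p2_after)
```

Editing the value (for example `rc_entry_expected("old", "new")`) leaves the rendered text untouched, while deleting the statement (`rc_entry_expected("new", None)`) flips the branch.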

Hi!

I had a quick look at this and got a bit lost at the setup stage; having a client and repo working isn't really in my normal day to day situation!

Some general setup struggles

I tried to get a working client/repo setup with this docker-dev setup from WMDE as described above; it didn't work super smoothly for me, but here are some notes I took for the next person following along:

Although there was an automatically created Q1 on dev linked to the Main Page on client I saw no RC entries (before applying your patch) even when I would expect some. I was also unable to add a sitelink to an item with the error (wikibase-api-no-external-page) coming back from the api.

I suspect you actually need to run ./bin/use Wikibase Wikibase-Client in order to get sites tables working correctly.

I also think setting a custom port as described above might break the wiring between sites. To anyone else trying to reproduce this, I would recommend trying without a custom port. I also noticed unusual behaviour when trying to make items and properties, where the first few attempts would always fail with a "page already exists" error.

I was treating this dev setup a little like a black box though, and didn't really investigate further once I seemed to have RC change propagation working mostly as expected on master.

Trying the prototype

I really struggled to get a reproducible setup with your prototype in a reasonable time, but I also clearly saw some unexpected behaviour: mostly "false negatives".

Actually, after trying these patches (by checking out exactly the attached commits and also moving back to REL1_46 for Vector and ParserFunctions), I found that I now had 0 RC entries injected at all. This could easily be due to my incompetence though.

I saw errors in the output of

./modules/mediawiki/bin/mwscript runJobs.php --wiki client

like

0b9552f7a (id=42,timestamp=20260511171216) t=111 error=UnexpectedValueException: Job parameter 0 is not JSON serializable. in /srv/docker-dev/mediawiki/includes/JobQueue/JobSpecification.php on line 88
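The message itself says that a job parameter could not be serialized to JSON. I have not verified what the patch passes as a parameter, so the following is only a guess at the general shape of the problem, sketched in Python rather than PHP. `EntityChange`, `is_json_serializable` and `to_job_params` are hypothetical names for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EntityChange:
    """Hypothetical stand-in for a change object handed to a job."""
    entity_id: str
    change_type: str


def is_json_serializable(value: object) -> bool:
    """Return True if json.dumps accepts the value as-is."""
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
        return False


def to_job_params(change: EntityChange) -> dict:
    """Flatten the object into plain data before queueing it."""
    return {"change": asdict(change)}
```

Passing the object itself fails the check, while the flattened dict passes; the job would then rebuild the object from the plain data when it runs.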

I was trying to create a clear reproduction of an expected race condition I had in mind (actually I can't claim most of the credit - @Ladsgroup planted the idea) but given the above I wasn't able to clearly demonstrate it.

My worry is that the design for this includes false negatives (in addition to the false positive situation you describe in the task description) in the situation where the ParserCache is repopulated before this job has a chance to run.

Did you consider how you will mitigate this kind of race or are you convinced it can never happen? My assumption is that this can happen and that it would be quite a problem for users because now they wouldn't be certain to get RC entries when they are supposed to.
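To make the worry concrete, here is a toy model of that race in Python. It is not a reproduction against the real patches: `run_scenario` is a hypothetical function, the parser cache is a plain dict, and the rendered strings are borrowed from the test template in the steps above.

```python
from typing import Dict, List


def run_scenario(cache_refreshed_before_job: bool) -> List[str]:
    """Return the RC entries written for one edit that really changes the page."""
    parser_cache: Dict[str, str] = {"Article": "p2 exist"}  # before the edit
    new_html = "p2 doesnt exist"                            # after the edit
    recent_changes: List[str] = []

    if cache_refreshed_before_job:
        # Something else (for example a page view) reparses the page first
        # and stores the new rendering in the cache.
        parser_cache["Article"] = new_html

    # The job compares whatever is cached now with its own fresh parse.
    cached = parser_cache.get("Article")
    if cached is None or cached != new_html:
        recent_changes.append("Wikidata change affecting Article")
    parser_cache["Article"] = new_html
    return recent_changes
```

With `cache_refreshed_before_job=False` the entry is written as expected. With `True` the job sees identical HTML on both sides and writes nothing, even though the page did change for readers.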

General additional thoughts

As I think @Nicholusmuwonge_wmde clearly identified in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1260687/11/includes/Hook/RefreshLinksJobBeforeInsertRecentChangeHook.php#21, you don't want to let any Wikibase leak into core, so I think you want to be careful about how you design this hook so that it's not coupled to Wikibase.

I'm also not sure of how you're hoping to implement this if you were to "do it for real" but I would expect that changing the signature of methods of RefreshLinksJob like runForTitle is going to impact other callers.

I wonder if you considered alternative ways to implement a similar idea of HTML diffing? e.g. did you consider doing it in the ChangeHandler in client?

I also wondered if you managed to get a feeling of the possible lower bound for the impact of this prototype (i.e. a possible lower bound for the red section of your diagram above)?

Lucyfediachambers renamed this task from Build prototype to compare the diff (before and after reparse) ℳ to Build prototype to compare the diff (before and after reparse). Tue, May 19, 8:12 AM

The review correspondence can be found in this document.

Lucyfediachambers renamed this task from Build prototype to compare the diff (before and after reparse) to 🐼 Build prototype to compare the diff (before and after reparse). Thu, Jun 4, 8:22 AM
Lucyfediachambers renamed this task from 🐼 Build prototype to compare the diff (before and after reparse) to 🐼 Build prototype to compare the diff (before and after reparse) (💌). Thu, Jun 4, 4:35 PM

The timebox for this ticket is over, so it will be marked as done and we will move on to the evaluation tickets (T419244).