
Test result cache is seemingly not being invalidated properly
Open, In Progress, High | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What should have happened instead?:
The test should be shown as passing consistently.

Event Timeline

I am also seeing this with tests that should be failing: Z28056 should be failing (see Z28051) but it’s passing according to the UI. Rerunning the tests does not help.

Suspected root cause: DB replication lag in production.

  • Primary DB gets new revision R2 and the wikilambda_ztester_results cache is cleared
  • But replica DB still shows old revision R1 AND still has the old cached result for R1
  • ApiPerformTest reads from the replica: Title::getLatestRevID() returns R1, then findZTesterResult finds the stale (R1, fail) row → false cache hit
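The three steps above can be modelled in a few lines. This is only a toy sketch of the suspected mechanism, not WikiLambda code: the Db class, perform_test, run_test and the placeholder ZID "Z_EXAMPLE" are all hypothetical, and the cache is simplified to a single (ZID, revision) key.

```python
from dataclasses import dataclass, field


@dataclass
class Db:
    """Toy stand-in for one database server (primary or replica)."""
    latest_rev: dict = field(default_factory=dict)      # zid -> latest revision id
    tester_results: dict = field(default_factory=dict)  # (zid, rev) -> passed?


def perform_test(db, zid, run_test):
    """Return (passed, from_cache), reading revision and cache from the same server."""
    rev = db.latest_rev[zid]
    key = (zid, rev)
    if key in db.tester_results:
        return db.tester_results[key], True
    return run_test(rev), False


ZID = "Z_EXAMPLE"  # placeholder, not a real ZID

# Before the edit: revision 1 (R1) failed and that result is cached on both servers.
primary = Db({ZID: 1}, {(ZID, 1): False})
replica = Db({ZID: 1}, {(ZID, 1): False})

# The edit lands on the primary: revision 2 (R2) is saved and the cache is cleared.
# The replica has not caught up yet.
primary.latest_rev[ZID] = 2
primary.tester_results.clear()


def run_test(rev):
    """Pretend that R2 fixes the test."""
    return rev >= 2


stale = perform_test(replica, ZID, run_test)  # sees R1 and the cached failure
fresh = perform_test(primary, ZID, run_test)  # sees R2 and has to re-run the test
print("replica:", stale, "primary:", fresh)
```

In this model the replica read returns the cached failure for R1 as a cache hit, while the same request against the primary re-runs the test on R2.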

Change #1269386 had a related patch set uploaded (by Daphne Smit; author: Daphne Smit):

[mediawiki/extensions/WikiLambda@master] ApiPerformTest: Use READ_LATEST to avoid stale test result cache hits

https://gerrit.wikimedia.org/r/1269386

I suspect that the problem is related to lag: the results from the CacheTesterResultsJob for R1 got stored after the run for R2 had already started (R2 was made only two minutes after R1). The second CacheTesterResultsJob might then have been killed as a duplicate, and so the stale values were left behind? When ZObjectSecondaryDataUpdate ran on the R1 -> R2 edit, the test cache was already empty, so the "7. If appropriate, clear wikilambda_ztester_results for this ZID" step was a no-op, and then the R1 job saved its results.
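The ordering suspected in this comment can be replayed as a short script. This is a sketch of the hypothesis only; the variable names, the clear_cache helper and the placeholder ZID are made up for illustration, and the job deduplication is represented by simply never running a second job.

```python
ZID = "Z_EXAMPLE"          # placeholder, not a real ZID
latest_rev = {ZID: 1}      # the page is at R1
cache = {}                 # (zid, rev) -> passed?; empty when the edit arrives
log = []


def clear_cache(zid):
    """Model of the 'clear wikilambda_ztester_results for this ZID' step."""
    removed = [key for key in cache if key[0] == zid]
    for key in removed:
        del cache[key]
    log.append(f"clear {zid}: removed {len(removed)} row(s)")


# 1. The job for R1 starts and captures the revision it is testing.
job_rev = latest_rev[ZID]

# 2. The R1 -> R2 edit is saved; the clear step runs but finds nothing to delete.
latest_rev[ZID] = 2
clear_cache(ZID)

# 3. Assumption: the job queued for R2 is dropped as a duplicate, so it never runs.

# 4. The R1 job finishes late and stores its result without re-checking the revision.
cache[(ZID, job_rev)] = False

print(log)
print("cached:", cache, "latest:", latest_rev[ZID])
```

At the end the cache holds a row for R1 while the page is at R2, and nothing is scheduled to replace or remove it.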

Jdforrester-WMF changed the task status from Open to In Progress. (Apr 9 2026, 1:51 PM)
Jdforrester-WMF assigned this task to DSmit-WMF.
DSmit-WMF changed the task status from In Progress to Open. (Apr 14 2026, 8:41 AM)
DSmit-WMF removed DSmit-WMF as the assignee of this task.

Change #1269386 abandoned by Daphne Smit:

[mediawiki/extensions/WikiLambda@master] ApiPerformTest: Use READ_LATEST to avoid stale test result cache hits

https://gerrit.wikimedia.org/r/1269386

Change #1270922 had a related patch set uploaded (by Daphne Smit; author: Daphne Smit):

[mediawiki/extensions/WikiLambda@master] Guard tester-result cache writes against stale revision tuples

https://gerrit.wikimedia.org/r/1270922

DSantamaria changed the task status from Open to In Progress. (Apr 14 2026, 1:01 PM)
DSantamaria assigned this task to DSmit-WMF.

Change #1270922 merged by jenkins-bot:

[mediawiki/extensions/WikiLambda@master] Guard tester-result cache writes against stale revision tuples

https://gerrit.wikimedia.org/r/1270922

Seems like this is still happening: Z33838 should be passing but is shown as failing with/due to Z503, a result which should have changed more than ten minutes ago (I have verified this by testing manually).
Edit: I just realized that the test I mentioned in the task description also continues to show the failure.

The patch as written only fixes future cache writes; it doesn't fix cached results that already exist.
We should probably also write something to handle the existing cases.

The merged change (gerrit:1270922) guards insertZTesterResult against writing stale revision tuples going forward, but it doesn't clean up rows that were already stale when it deployed.
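As a rough illustration of what guarding a write against a stale revision tuple can look like, here is a minimal sketch. It is not the code from gerrit:1270922; is_current, insert_if_current, the dictionary layout and the placeholder ZIDs are all assumptions made for the example.

```python
ROLES = ("function", "implementation", "tester")


def is_current(revs, page_latest):
    """True only if every stored (zid, rev) pair still matches the latest revision."""
    return all(page_latest.get(zid) == rev for zid, rev in (revs[r] for r in ROLES))


def insert_if_current(cache, revs, passed, page_latest):
    """Write the result unless any of the three revisions is already outdated."""
    if not is_current(revs, page_latest):
        return False  # stale tuple: skip the write
    cache[tuple(revs[r] for r in ROLES)] = passed
    return True


# Placeholder ZIDs and revision numbers, invented for this example.
page_latest = {"Z_FUNC": 11, "Z_IMPL": 21, "Z_TEST": 31}
cache = {}

late = {"function": ("Z_FUNC", 10), "implementation": ("Z_IMPL", 21), "tester": ("Z_TEST", 31)}
fresh = {"function": ("Z_FUNC", 11), "implementation": ("Z_IMPL", 21), "tester": ("Z_TEST", 31)}

wrote_late = insert_if_current(cache, late, False, page_latest)   # function rev is outdated
wrote_fresh = insert_if_current(cache, fresh, True, page_latest)
print(wrote_late, wrote_fresh, len(cache))
```

A check like this only helps if the latest revisions are read from an up-to-date source at write time, and it does nothing for rows written before the guard existed.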

To address the existing cases, I've written a maintenance script purgeStaleZTesterResults.php that identifies rows in wikilambda_ztester_results where any of the three stored revision IDs (function, implementation, tester) no longer matches page_latest for that ZID, and deletes them. It supports --dryRun to preview the impact first.

Usage:

(docker compose exec mediawiki) php extensions/WikiLambda/maintenance/purgeStaleZTesterResults.php --dryRun
(docker compose exec mediawiki) php extensions/WikiLambda/maintenance/purgeStaleZTesterResults.php

This would log:
NOTE: this is an example output and might not resemble changes on production!

Row 2751:
  function       Z23781  rev 19071  ->  20852  [STALE]
  implementation Z23782  rev 19064  ->  20853  [STALE]
  tester         Z23783  rev 19031  ->  20854  [STALE]
Row 6152:
  function       Z30995  rev 33948  ->  34092  [STALE]
  implementation Z30996  rev 33947  ->  34093  [STALE]
  tester         Z30997  rev 33949  ->  34094  [STALE]
Row 6719:
  function       Z31632  rev 36245  ->  36845  [STALE]
  implementation Z31633  rev 36258  ->  36846  [STALE]
  tester         Z31634  rev 36243  ->  36847  [STALE]
Row 7411:
  function       Z33274  rev 40084  ->  42209  [STALE]
  implementation Z33275  rev 40094  ->  42210  [STALE]
  tester         Z33276  rev 40080  ->  42211  [STALE]
Row 7601:
  function       Z33277  rev 40979  ->  42212  [STALE]
  implementation Z33278  rev 41253  ->  42213  [STALE]
  tester         Z33283  rev 40970  ->  42218  [STALE]
Row 7602:
  function       Z33277  rev 40979  ->  42212  [STALE]
  implementation Z33278  rev 41253  ->  42213  [STALE]
  tester         Z33281  rev 40957  ->  42216  [STALE]

Would delete 6 stale row(s) from wikilambda_ztester_results.

This is a one-time cleanup and can also be re-run if stale rows accumulate again.
I will add it to the deploy notes if approved.
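The staleness rule described above (a row is stale if any of its three stored revision IDs no longer matches page_latest) can be sketched as follows. This is an illustration only, not the contents of purgeStaleZTesterResults.php: find_stale, purge and the in-memory dictionaries are hypothetical. Row 2751 reuses the numbers from the example output above, and row 9001 is an invented up-to-date row.

```python
ROLES = ("function", "implementation", "tester")


def find_stale(rows, page_latest):
    """Return ids of rows where any stored revision differs from page_latest."""
    return [
        row_id
        for row_id, row in rows.items()
        if any(page_latest.get(zid) != rev for zid, rev in (row[r] for r in ROLES))
    ]


def purge(rows, page_latest, dry_run=False):
    """Delete stale rows; with dry_run=True only report what would be deleted."""
    stale = find_stale(rows, page_latest)
    if not dry_run:
        for row_id in stale:
            del rows[row_id]
    return stale


rows = {
    2751: {"function": ("Z23781", 19071), "implementation": ("Z23782", 19064), "tester": ("Z23783", 19031)},
    9001: {"function": ("Z_FUNC", 5), "implementation": ("Z_IMPL", 6), "tester": ("Z_TEST", 7)},
}
page_latest = {
    "Z23781": 20852, "Z23782": 20853, "Z23783": 20854,
    "Z_FUNC": 5, "Z_IMPL": 6, "Z_TEST": 7,
}

would_delete = purge(rows, page_latest, dry_run=True)
print(f"Would delete {len(would_delete)} stale row(s)")
deleted = purge(rows, page_latest)
print(f"Deleted {len(deleted)} stale row(s); remaining rows: {sorted(rows)}")
```

Running the dry run first leaves the data untouched, so the reported row ids can be reviewed before the real deletion.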

Change #1286320 had a related patch set uploaded (by Daphne Smit; author: Daphne Smit):

[mediawiki/extensions/WikiLambda@master] Add one-off maintenance script to purge stale tester result cache rows

https://gerrit.wikimedia.org/r/1286320