growthexperiments-fixlinkrecommendationdata-dryrun 'link-recommendation' is not a link recommendation task type error
Closed, Resolved | Public | PRODUCTION ERROR

Description

Error
trace
cgoubert@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-cron$ kubectl logs -f growthexperiments-fixlinkrecommendationdata-dryrun-2025052kbtqj mediawiki-main-app 
extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php: Start run
extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php: Running on growthexperiments.dblist
abwiki Total number of OK search index entries: 210
abwiki  (results in multiple topics counted multiple times)
abwiki Total number of dangling search-index entries: 0
abwiki Total number of OK db-table entries: 98
abwiki Total number of dangling db-table entries: 0
acewiki Total number of OK search index entries: 253
acewiki  (results in multiple topics counted multiple times)
acewiki Total number of dangling search-index entries: 0
acewiki Total number of OK db-table entries: 128
acewiki Total number of dangling db-table entries: 0
adywiki Total number of OK search index entries: 46
adywiki  (results in multiple topics counted multiple times)
adywiki Total number of dangling search-index entries: 0
adywiki Total number of OK db-table entries: 18
adywiki Total number of dangling db-table entries: 0
afwiki Total number of OK search index entries: 28339
afwiki  (results in multiple topics counted multiple times)
afwiki Total number of dangling search-index entries: 1
afwiki Total number of OK db-table entries: 9438
afwiki Total number of dangling db-table entries: 0
'link-recommendation' is not a link recommendation task type
Impact

Since we exit the foreachwiki loop on error, the run stops. The script is not run for subsequent wikis.

Notes

We can use T395245 ("Add a flag to the mwscript wrapper to set +e when required") to continue running on error, at the risk of obscuring repeated failures, but we'd first like to make sure that this is not a symptom of a more serious failure.

Event Timeline

Clement_Goubert triaged this task as High priority.

The current error message is not particularly helpful. We've merged "refactor: show wiki and type of invalid task type" to make it a bit more meaningful so that we can investigate further.

Since we exit the foreachwiki loop on error, the run stops. The script is not run for subsequent wikis.

That should ideally change. At least for this particular error, we probably want to log a logstash error (+trigger an alert), but only stop the execution for the current wiki, not for all of them.
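A minimal sketch of that behaviour, assuming a plain POSIX shell loop: each wiki's failure is logged and remembered, the loop moves on to the next wiki, and the overall exit status is still non-zero if anything failed, so repeated failures are not silently hidden. `run_for_each_wiki` and the command passed to it are hypothetical names for illustration; this is not the actual foreachwiki or mwscript wrapper.

```sh
# Hypothetical sketch, not the real foreachwiki/mwscript wrapper.
# Usage: run_for_each_wiki CMD WIKI...
# CMD is called once per wiki, with the wiki name as its only argument.
run_for_each_wiki() {
    rfw_cmd=$1
    shift
    rfw_failed=""
    for rfw_wiki in "$@"; do
        # A command tested by `if` does not trigger `set -e`,
        # so a failure here only ends the work for this one wiki.
        if ! "$rfw_cmd" "$rfw_wiki"; then
            echo "ERROR: $rfw_wiki: $rfw_cmd exited non-zero" >&2
            rfw_failed="$rfw_failed $rfw_wiki"
        fi
    done
    # Report all failures once, and fail the whole run if there were any.
    if [ -n "$rfw_failed" ]; then
        echo "Failed wikis:$rfw_failed" >&2
        return 1
    fi
    return 0
}
```

The trade-off is that the job as a whole is still marked as failed, so the existing "job succeeded or not" alerting keeps working, while every wiki after the failing one still gets processed.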

(See also the latest comments on T373176)

Once the changes for T395245 ("Add a flag to the mwscript wrapper to set +e when required") are live, we can make the job continue with the rest of the wikis. The line is logged to logstash, but since maintenance scripts have no structured logging, it is not specifically tagged as an error. This logstash link captures only that message. As part of T395245 we are also going to prepend the wiki name to errors.

SRE Observability, could you help with how to stand up an alert for when that log line shows up? The general alerting system we put in place for the periodic jobs can't do this: a job either succeeds or it doesn't.
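As a stopgap illustration of what such a check would look for, here is a minimal sketch that scans a job's log output for the error message from this task. It assumes the message text stays as quoted in the trace above; `check_task_type_errors` is a hypothetical helper, and this is not how Logstash-based alerting would actually be configured.

```sh
# Hypothetical check: scans a log on stdin for the error message seen in this task.
TASK_TYPE_ERROR="is not a link recommendation task type"

# Prints the number of matching lines.
# Returns 0 if there are none, 1 if at least one was found.
check_task_type_errors() {
    # grep -c exits non-zero when the count is 0, hence the `|| true`.
    ctte_count=$(grep -c -F -- "$TASK_TYPE_ERROR" || true)
    echo "$ctte_count"
    [ "$ctte_count" -eq 0 ]
}
```

It could be fed from the pod logs, for example by piping the output of the `kubectl logs` command shown in the description into `check_task_type_errors`.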

Change #1150711 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw::maintenance::growthexperiment: Ignore foreachwiki errors

https://gerrit.wikimedia.org/r/1150711

Change #1150711 merged by Clément Goubert:

[operations/puppet@production] mw::maintenance::growthexperiment: Ignore foreachwiki errors

https://gerrit.wikimedia.org/r/1150711

The loop for this job now ignores errors and continues, as it was in the old wrapper.

I have deleted the failed jobs to clear alerting.

This job seems to have stopped running altogether? I'm not seeing it anymore at all when doing kubectl get jobs -l team=growth. Should it not be (re-)started every day?

There seems to be no data from this job, neither for the weekend nor for today.

It runs every day at 07:20 UTC; it ran this morning and completed at 09:28. It was broken until yesterday because the loop exited on error, and the reason you don't see any runs is that I deleted the failed jobs to clear the alerting.

You can see the runs logged in the Kubernetes Events dashboard.

We can increase the job's ttlSecondsAfterFinished so more than one run appears in kubectl get jobs, although we do only keep one failed job in the kubectl history.

It seems to be better now:

image.png (320×606 px, 44 KB)

The first, more plausible data is dated the 28th of May (Wednesday), with additional wikis showing up on the 29th. Before that, there was only some data from afwiki, which is also the very first wiki. I assume that something used to go wrong during that first script run.

The run for akwiki fails with this error, which used to stop the run for subsequent wikis due to the error handling.

In particular that error is:

akwiki akwiki: 'null' is not a link recommendation task type

Some notes for us to investigate:

  • akwiki is a locked wiki. Why is the maintenance script running there at all? That is pointless, a waste of resources, and should change.
  • akwiki has GENewcomerTasksLinkRecommendationsEnabled set to false per that being the default. Why is the maintenance script not skipping it based on that?

Change #1154021 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/GrowthExperiments@master] maint: skip wikis where Add Link is disabled in PHP settings

https://gerrit.wikimedia.org/r/1154021

  • akwiki has GENewcomerTasksLinkRecommendationsEnabled set to false per that being the default. Why is the maintenance script not skipping it based on that?

The attached change should do that now.

The run for akwiki fails with this error, which used to stop the run for subsequent wikis due to the error handling.

Once the above change is merged, this particular error should be fixed. There is the overall issue of why the script runs there at all, but that is for someplace else to fix.

What I'm worried about: We did not get any alerts for that error.
While it is good that it goes on to the next wiki, the fact that no alert was triggered is concerning. OTOH, I can also see the issue with us getting ca. 300 individual alerts if something is wrong across wikis. Maybe having the script exit the foreachwiki loop on error is the better solution after all? Not sure. Curious what @Urbanecm_WMF and @Sgs think.
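One possible middle ground between no alert and ca. 300 alerts is to collapse all per-wiki errors from a run into a single summary. The sketch below assumes each error line starts with the wiki name, as in the akwiki line above; `summarize_failures` is a hypothetical helper, not an existing tool.

```sh
# Hypothetical sketch: reads a run's log on stdin and prints ONE summary line.
# Assumes every error line is prefixed with the wiki name (first field).
# Returns 1 if any wiki reported the task type error, 0 otherwise.
summarize_failures() {
    awk '
        /is not a link recommendation task type/ {
            # Count each wiki once, in order of first appearance.
            if (!seen[$1]++) { n++; list = list " " $1 }
        }
        END {
            if (n > 0) {
                printf "ALERT: %d wiki(s) failed:%s\n", n, list
                exit 1
            }
            print "OK: no task type errors"
        }
    '
}
```

With this shape, a run in which many wikis fail for the same reason would still produce exactly one alert-worthy line that lists the affected wikis.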

Maybe SRE Observability can provide a way for you to have an alert from logstash

Change #1154021 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] maint: skip wikis where Add Link is disabled in PHP settings

https://gerrit.wikimedia.org/r/1154021

Etonkovidova subscribed.

To monitor

Looks good,

Last timestamp: Jun 19, 2025 @ 08:45:59.621 - https://logstash.wikimedia.org/goto/21fd294b30f3ae8ee73324a01996e01a

Slowly increasing since 06/18, but the numbers are low (~750 max).