
Migrate GrowthExperiments maintenance jobs to mw-cron
Closed, ResolvedPublic

Description

Migrate GrowthExperiments periodic mediawiki jobs from mwmaint to mw-cron on kubernetes.

  • growthexperiments-deleteOldSurveys
  • growthexperiments-listTaskCounts
  • growthexperiments-fixLinkRecommendationData-dryrun
  • growthexperiments-fixLinkRecommendationData-dryrun-eswiki (job deleted)
  • growthexperiments-fixLinkRecommendationData-dryrun-frwiki (job deleted)
  • growthexperiments-fixLinkRecommendationData-dryrun-cswiki (job deleted)
  • growthexperiments-purgeExpiredMentorStatus
  • growthexperiments-updateMetrics
  • growthexperiments-userImpactUpdateRecentlyRegistered
  • growthexperiments-userImpactUpdateRecentlyEdited
  • growthexperiments-userImpactDelete
  • growthexperiments-updateIsActiveFlagForMentees
  • growthexperiments-refreshPraiseworthyMentees

Sharded jobs:

  • growthexperiments-refreshLinkRecommendations-s{1..7}
  • growthexperiments-updateMenteeData-s{1..7}
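The s{1..7} suffix is standard shell brace expansion, one job per database section; as a quick runnable illustration (bash), the full set of sharded job names expands like this:

```shell
# Print the fourteen sharded job names, one per line (bash brace expansion)
for job in growthexperiments-refreshLinkRecommendations-s{1..7} \
           growthexperiments-updateMenteeData-s{1..7}; do
  echo "$job"
done
```

The loop prints growthexperiments-refreshLinkRecommendations-s1 through -s7, then growthexperiments-updateMenteeData-s1 through -s7.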

Doc on the new platform

ServiceOps will handle migrating the jobs, but would appreciate input from the Growth-Team on:

  • jobs that should be watched more closely
  • jobs that are low criticality and could be migrated first
  • outdated jobs that can be removed
  • any potential gotchas in the way these jobs use MediaWiki

Details

Related Changes in Gerrit:
Repo · Branch · Lines +/-
  • operations/puppet · production · +10 -0
  • operations/puppet · production · +24 -6
  • operations/puppet · production · +10 -9
  • operations/puppet · production · +8 -3
  • operations/puppet · production · +8 -3
  • operations/puppet · production · +1 -0
  • operations/puppet · production · +16 -10
  • operations/puppet · production · +2 -12
  • operations/puppet · production · +23 -9
  • operations/puppet · production · +8 -2
  • operations/puppet · production · +13 -4
  • operations/puppet · production · +8 -2
  • operations/puppet · production · +8 -2
  • operations/puppet · production · +8 -2
  • operations/puppet · production · +24 -0
  • operations/puppet · production · +8 -2
  • operations/puppet · production · +3 -0
  • operations/puppet · production · +8 -2
  • operations/puppet · production · +0 -16
  • operations/puppet · production · +3 -0

Related Objects

Event Timeline


I _think_ some jobs can be removed. I'll double check internally and get back to you.

We can drop the per-wiki fixLinkRecommendationData jobs. I uploaded patches for that (hopefully correct ones).

Also, another question: how would this affect the logs? Not all engineers on the Growth team have deployment privileges, but most have at least restricted access. I'm afraid there could be an equivalent of T378429 for viewing the job's logs (as the default pathway to logs would probably be an appropriate kubectl logs command from the deploy host, which requires deployment rather than restricted access).

EDIT: I just discovered the description links to docs, which point to Logstash. That presumably answers my question.

Change #1120556 merged by RLazarus:

[operations/puppet@production] growthexperiments.pp: Mark unnecessary jobs as absent

https://gerrit.wikimedia.org/r/1120556

Change #1120557 merged by RLazarus:

[operations/puppet@production] growthexperiments.pp: Drop absented jobs

https://gerrit.wikimedia.org/r/1120557

jobs that are low criticality and could be migrated first

Mentioned on the patch. I'd suggest starting with growthexperiments-purgeExpiredMentorStatus (no impact besides useless DB rows being stored), growthexperiments-deleteOldSurveys (enforces the data retention window) or growthexperiments-updateMetrics. I explicitly suggest NOT starting with growthexperiments-fixLinkRecommendationData-dryrun and friends, as they all send data points to the same Grafana chart, and if one of the jobs fails to run, the charts might look incorrect.

Understood, thank you.

jobs that should be watched more closely

profile::mediawiki::maintenance::growthexperiments::refreshlinkrecommendations interacts with other services besides mediawiki (specifically, linkrecommendation). It may run into firewall restrictions and similar.

Permissions (firewall and otherwise) should be the same as mediawiki production, but we may run into the race condition where envoy isn't completely up yet when mediawiki starts calling out. We'll see.

outdated jobs that can be removed

I _think_ some jobs can be removed. I'll double check internally and get back to you.

Thanks for the cleanup! I have another question: can these jobs be re-run if they don't work the first time after migrating (in other words, are they idempotent enough that we can restart them as needed)?

Question: How would this work impact beta? Some of our jobs are fairly important to run there, as we rely on beta for our QA purposes.

The puppet code I changed should only migrate jobs to kubernetes if the realm is production, and keep the systemd timers in beta. If it doesn't then I messed up, and I'll fix it.

Thanks for the cleanup! I have another question: can these jobs be re-run if they don't work the first time after migrating (in other words, are they idempotent enough that we can restart them as needed)?

All of our jobs should be safe to restart at any time. They might do some work again (not realise it was done already), but their end result shouldn't be impacted by the restarts.

A slightly unrelated thought: The growthexperiments-refreshLinkRecommendations-s{1..7} jobs are designed as a continuous job (the hourly timer is there mostly to ensure the job is running, rather than actually do something hourly). Would the Kubernetes migration allow us to actually write it as a proper continuous job? Or would that not be a benefit the migration could give us?

jijiki triaged this task as Medium priority.Feb 19 2025, 4:29 PM
jijiki raised the priority of this task from Medium to Needs Triage.

Change #1117862 abandoned by Clément Goubert:

[operations/puppet@production] mediawiki: Migrate one dry-run job to kubernetes

Reason:

Job dropped, we will use another job for testing

https://gerrit.wikimedia.org/r/1117862

I was looking at setting up T385709: Periodic job alerting, and realized you've already been onboarded to alertmanager and critical alerts for your team are sent to slack. Do you want the same for periodic jobs, or would you rather have phabricator tasks created with your team's PHID?

I was looking at setting up T385709: Periodic job alerting, and realized you've already been onboarded to alertmanager and critical alerts for your team are sent to slack.

In theory yes. In practice, only the infrastructure for that was implemented, but actual usage never materialized. Hoping to change that next month by tackling T386116: Create alerts for "Add (Section-) Image" task being broken for users ("No suggestions available").

Do you want the same for periodic jobs, or would you rather have phabricator tasks created with your team's PHID?

Them going to Logstash would be a start; we have issues with that in the current system: T370560: Make failures from refreshLinkRecommendation job visible in Logstash, but so far we have not had time to investigate.

I'm not sure what alerting you have in mind in particular. If it is not too much work, then starting with sending things to slack could be a good first step, and we can fine-tune things from there?

What I'm envisioning is alerting when the kubernetes Job fails, i.e. the process exits with a code other than 0, the pod gets kicked off the node, or it OOMs. In the first iteration we probably won't retry failing jobs; they would be run again at the next scheduled time.

In that case, you would get a Slack message from Prometheus that the Job has failed, and you would be able to go to Logstash, as everything output by a MediaWiki container in kubernetes is sent there, or use kubectl logs from the deployment host (assuming you have perms) to see the output of your Job.
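The failure condition described here could be expressed as a Prometheus alert rule along these lines. This is only a sketch: the group name, alert name, namespace, and label values are hypothetical, while kube_job_status_failed is the standard kube-state-metrics metric for failed Job pods.

```yaml
groups:
  - name: periodic-jobs-example        # hypothetical group name
    rules:
      - alert: PeriodicJobFailed       # hypothetical alert name
        expr: kube_job_status_failed{namespace="mw-cron"} > 0
        for: 5m
        labels:
          team: growth                 # used by alertmanager routing (e.g. to Slack)
        annotations:
          summary: "Periodic job {{ $labels.job_name }} failed"
```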

What I'm envisioning is alerting when the kubernetes Job fails, i.e. the process exits with a code other than 0, the pod gets kicked off the node, or it OOMs. In the first iteration we probably won't retry failing jobs; they would be run again at the next scheduled time.

In that case, you would get a Slack message from Prometheus that the Job has failed, and you would be able to go to Logstash, as everything output by a MediaWiki container in kubernetes is sent there, or use kubectl logs from the deployment host (assuming you have perms) to see the output of your Job.

Ok, this sounds sensible, let's do that.

I think that none of our existing jobs should routinely end with a non-zero exit status. And I'll just hope that OOMs don't happen too often; if they do, these jobs are hopefully simple enough for us to debug.

Do you need anything from us to move that forward?

No, I will comment on this task when I start migrating your jobs so you're aware they're moving, thank you.

Thanks for the cleanup! I have another question: can these jobs be re-run if they don't work the first time after migrating (in other words, are they idempotent enough that we can restart them as needed)?

All of our jobs should be safe to restart at any time. They might do some work again (not realise it was done already), but their end result shouldn't be impacted by the restarts.

A slightly unrelated thought: The growthexperiments-refreshLinkRecommendations-s{1..7} jobs are designed as a continuous job (the hourly timer is there mostly to ensure the job is running, rather than actually do something hourly). Would the Kubernetes migration allow us to actually write it as a proper continuous job? Or would that not be a benefit the migration could give us?

Realizing I forgot to answer your question. Yes, in a further evolution of the platform we plan to support running continuous tasks, but not in this first iteration.

Change #1132673 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] alertmanager: Route task-level GrowthExperiments alerts

https://gerrit.wikimedia.org/r/1132673

Change #1132674 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw::periodic_jobs: Migrate deleteOldSurveys

https://gerrit.wikimedia.org/r/1132674

Change #1132673 merged by Clément Goubert:

[operations/puppet@production] alertmanager: Route task-level GrowthExperiments alerts

https://gerrit.wikimedia.org/r/1132673

Change #1134696 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] alertmanager: add task receivers for 4 teams

https://gerrit.wikimedia.org/r/1134696

Change #1132674 merged by Clément Goubert:

[operations/puppet@production] mw::periodic_jobs: Migrate deleteOldSurveys

https://gerrit.wikimedia.org/r/1132674

growthexperiments-deleteOldSurveys migrated to k8s. Do you mind if I do a manual run to test functionality, or is it important it runs on its normal schedule?

Change #1135005 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] alertmanager: route T&S tasks to their Slack

https://gerrit.wikimedia.org/r/1135005

Change #1134696 merged by Clément Goubert:

[operations/puppet@production] alertmanager: add task receivers for 4 teams

https://gerrit.wikimedia.org/r/1134696

Change #1135781 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s

https://gerrit.wikimedia.org/r/1135781

Change #1135789 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s

https://gerrit.wikimedia.org/r/1135789

Change #1135781 abandoned by Hnowlan:

[operations/puppet@production] mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s

https://gerrit.wikimedia.org/r/1135781

Change #1135916 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mw::maintenance::growthexperiments: migrate updateMetrics job to k8s

https://gerrit.wikimedia.org/r/1135916

Change #1136770 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mw::maintenance: migrate deleteExpiredUserImpactData to k8s

https://gerrit.wikimedia.org/r/1136770

mwscript-k8s --php_version 8.1 -f -- ./extensions/GrowthExperiments/maintenance/updateMenteeData.php --wiki testwiki --verbose
โณ Starting ./extensions/GrowthExperiments/maintenance/updateMenteeData.php on Kubernetes as job mw-script.eqiad.mom4wyhf ...
โณ Waiting for the container to start...
๐Ÿš€ Job is running.
๐Ÿ“œ Streaming logs:
Profiling data:
  * filtermentees: 1.07 seconds
  * edittimestampinternal: 0.09 seconds
  * usernames: 0.02 seconds
  * reverted: 0.1 seconds
  * questions: 0.11 seconds
  * editcount: 0.02 seconds
  * registration: 0.02 seconds
  * blocks: 0.07 seconds
===============
Done. Took 9 seconds.

That worked, so this script is good to go as well once we're sure T391695: UncachedMenteeOverviewDataProvider query is extremely aggressive causing partial outages is completely resolved

Change #1135789 merged by Hnowlan:

[operations/puppet@production] mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s

https://gerrit.wikimedia.org/r/1135789

Change #1135916 merged by Hnowlan:

[operations/puppet@production] mw::maintenance::growthexperiments: migrate updateMetrics job to k8s

https://gerrit.wikimedia.org/r/1135916

Change #1136770 merged by Hnowlan:

[operations/puppet@production] mw::maintenance: migrate deleteExpiredUserImpactData to k8s

https://gerrit.wikimedia.org/r/1136770

Change #1142563 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mw::maintenance: migrate listTaskCounts to k8s

https://gerrit.wikimedia.org/r/1142563

Change #1142579 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mw::maintenance: move refreshLinkRecommendations job to shared object

https://gerrit.wikimedia.org/r/1142579

Change #1142563 merged by Hnowlan:

[operations/puppet@production] mw::maintenance: migrate listTaskCounts to k8s

https://gerrit.wikimedia.org/r/1142563

listTaskCounts has been migrated to Kubernetes, and the task is exiting 0 which is a success as far as job management goes. However, the job prints no output - are there metrics or other artifacts that we can use to verify that the job has completed its work successfully?

listTaskCounts has been migrated to Kubernetes, and the task is exiting 0 which is a success as far as job management goes. However, the job prints no output - are there metrics or other artifacts that we can use to verify that the job has completed its work successfully?

Note that you have "correctly" migrated the config for the job to run with --output none, which makes the job silent. If you remove it, you'll get the results printed out. The job writes data into a cache which is used on Special:NewcomerTasksInfo; seeing different figures there after a run is a good sign.

Change #1143528 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mw::maintenance: migrate refreshLinkRecommendations s1 shard to k8s

https://gerrit.wikimedia.org/r/1143528

Change #1143529 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mw::maintenance: migrate all refreshLinkRecommendations jobs to k8s

https://gerrit.wikimedia.org/r/1143529

Change #1143590 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mw::maintenance: move updateMenteeData to upper level job

https://gerrit.wikimedia.org/r/1143590

listTaskCounts has been migrated to Kubernetes, and the task is exiting 0 which is a success as far as job management goes. However, the job prints no output - are there metrics or other artifacts that we can use to verify that the job has completed its work successfully?

Note that you have "correctly" migrated the config for the job to run with --output none, which makes the job silent.

Ah, that would do it :)

If you remove it, you'll get the results printed out. The job writes data into a cache which is used on Special:NewcomerTasksInfo; seeing different figures there after a run is a good sign.

Great, thank you! I see changes in this table so I think things look good for that job.

Change #1143590 merged by Hnowlan:

[operations/puppet@production] mw::maintenance: move updateMenteeData to upper level job

https://gerrit.wikimedia.org/r/1143590

Mh, something might have gone awry when migrating that listTaskCounts maintenance script. I'm noticing two issues:

  1. Some wikis seem to have stopped reporting task counts via this script, like plwiki and ptwiki, while others, like frwiki, seem to still be reporting the expected data. This problem seems to have started on May 8th => T394018: Link Recommendation Task pool data missing for some wikis

(screenshot: image.png, 341×601 px)

  2. Also on May 8th, we started to see errors in Logstash from that maintenance script: Error fetching URL "http://localhost:6004/v1/events?hasty=true": (curl error: 7) Couldn't connect to server. Again, not all wikis are affected, and the affected wikis are not the same as in (1) above! => T394019: ListTaskCounts: Error fetching URL "http://localhost:6004/v1/events?hasty=true": (curl error: 7) Couldn't connect to server

(screenshot: image.png, 383×455 px)

Do you have any suspicion about what might be going on?

Change #1145962 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mw::maintenance: migrate all updateMenteeData jobs to k8s

https://gerrit.wikimedia.org/r/1145962

Change #1145962 merged by Hnowlan:

[operations/puppet@production] mw::maintenance: migrate all updateMenteeData jobs to k8s

https://gerrit.wikimedia.org/r/1145962

Change #1142579 merged by Hnowlan:

[operations/puppet@production] mw::maintenance: replace refreshLinkRecommendations define, s1 to k8s

https://gerrit.wikimedia.org/r/1142579

Change #1146010 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mw::periodic_job: add concurrency parameter to k8s jobs

https://gerrit.wikimedia.org/r/1146010

Change #1143528 abandoned by Hnowlan:

[operations/puppet@production] mw::maintenance: migrate refreshLinkRecommendations s1 shard to k8s

Reason:

Already done

https://gerrit.wikimedia.org/r/1143528

Change #1146566 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] mw::maintenance: migrate growthexperiments-updateIsActiveFlagForMentees

https://gerrit.wikimedia.org/r/1146566

Change #1146569 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/puppet@production] mw::maintenance: migrate growthexperiments-refreshPraiseworthyMentees

https://gerrit.wikimedia.org/r/1146569

Regardless, I would like to emphasize that the migration of the refreshLinkRecommendations job should be postponed until we've figured this out. That job is long-running and should not be randomly terminated. If we need to stay below some time limit, that would require more work on our side.

Despite being sharded, refreshLinkRecommendations on some shards (s3 for instance) seems to take a lot longer than its set interval would indicate, with huge variability between the different shards:
s3 has been running for almost 4 hours:

โ— mediawiki_job_growthexperiments-refreshLinkRecommendations-s3.service - MediaWiki periodic job growthexperiments-refreshLinkRecommendations-s3
   Loaded: loaded (/lib/systemd/system/mediawiki_job_growthexperiments-refreshLinkRecommendations-s3.service; static; vendor preset: enabled)
   Active: activating (start) since Fri 2025-05-16 06:25:31 UTC; 3h 38min ago

s5 completes in ~30 minutes:

โ— mediawiki_job_growthexperiments-refreshLinkRecommendations-s5.service - MediaWiki periodic job growthexperiments-refreshLinkRecommendations-s5
   Loaded: loaded (/lib/systemd/system/mediawiki_job_growthexperiments-refreshLinkRecommendations-s5.service; static; vendor preset: enabled)
   Active: inactive (dead) since Fri 2025-05-16 09:36:51 UTC; 37min ago

As we currently do not have a good way to modulate the interval per shard, it either needs concurrencyPolicy: Forbid with a corresponding startingDeadlineSeconds, so that overlapping runs are skipped rather than an in-progress run being interrupted, or a long enough interval for all shards to be able to finish.
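As a sketch of where those two fields sit in a CronJob spec (the field names are the real Kubernetes batch/v1 API; the name, schedule, deadline, and container values here are illustrative, not the production settings):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: growthexperiments-refreshlinkrecommendations-s3  # illustrative
spec:
  schedule: "25 * * * *"       # illustrative hourly schedule
  concurrencyPolicy: Forbid    # skip a new run while the previous one is still going
  startingDeadlineSeconds: 600 # how late a missed run may still be started
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: mediawiki
              image: example/mediawiki:latest  # hypothetical image
```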

We could use listTaskCounts as a proof of concept for T394423: Investigate startingDeadlineSeconds setting for kubernetes CronJobs, using startingDeadlineSeconds and concurrencyPolicy: Forbid.

In any case, the migration of refreshLinkRecommendations is blocked until we've found a solution.

@Clement_Goubert FWIW, the refreshLinkRecommendations job is closer to a daemon/service than to a cron job. Maybe we're actually abusing the job infrastructure for something that should be scheduled as something else.

Change #1146566 merged by Kamila Součková:

[operations/puppet@production] mw::maintenance: migrate growthexperiments-updateIsActiveFlagForMentees

https://gerrit.wikimedia.org/r/1146566

@Clement_Goubert FWIW, the refreshLinkRecommendations job is closer to a daemon/service than to a cron job. Maybe we're actually abusing the job infrastructure for something that should be scheduled as something else.

That's a fair question to ask. Short-term, the schedule for these jobs needs to be reconsidered on an individual basis (rather than setting the same short interval for all shards) given the timings that @Clement_Goubert has pointed out here. The concurrency work is necessary in parallel, but some of these jobs have been executing a lot less often than they are configured to.

We are kind of abusing CronJobs if they are meant to run continuously, but as of now we have not developed a way to run continuous periodic jobs.
This will be coming, most probably in the form of Deployment kubernetes objects whose Pods have a restartPolicy: Always configuration, so that when the script exits it is automatically restarted. This also requires different alerting than CronJobs, and the deployment strategy needs to be discussed. For example, CronJobs pick up the new images and mediawiki code when they restart on schedule, but should service-style Deployments restart on a scap deployment?
For now, we should find a way that ensures jobs run to completion *as CronJobs* in order to complete the migration; then we will test and migrate service-style Jobs to this future system.
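The future continuous-job shape described above might look roughly like this (a sketch only; the name, labels, and image are hypothetical). In a Deployment, restartPolicy: Always is in fact the only value Kubernetes allows, which is exactly what gives the run-then-restart behaviour:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: growthexperiments-refreshlinkrecommendations-s3  # hypothetical
spec:
  replicas: 1
  selector:
    matchLabels: {app: refreshlinkrecommendations-s3}
  template:
    metadata:
      labels: {app: refreshlinkrecommendations-s3}
    spec:
      restartPolicy: Always  # pod restarts whenever the script exits
      containers:
        - name: mediawiki
          image: example/mediawiki:latest  # hypothetical image
```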

Change #1146010 merged by Clément Goubert:

[operations/puppet@production] mw::periodic_job: add concurrency parameter to k8s jobs

https://gerrit.wikimedia.org/r/1146010

Change #1146569 merged by Kamila Součková:

[operations/puppet@production] mw::maintenance: migrate growthexperiments-refreshPraiseworthyMentees

https://gerrit.wikimedia.org/r/1146569

Now that we've largely unblocked the continuously running jobs, the refreshLinkRecommendations shards are ready to be migrated. However, the s3 job tends to take 7+ hours to complete and it can be difficult to find a time where an instance of the job isn't running. Can it be safely stopped? Alternatively we can disable the job after its next execution and then merge changes to migrate it.

Clement_Goubert changed the task status from Open to In Progress.May 20 2025, 3:27 PM
Clement_Goubert triaged this task as High priority.

Now that we've largely unblocked the continuously running jobs, the refreshLinkRecommendations shards are ready to be migrated. However, the s3 job tends to take 7+ hours to complete and it can be difficult to find a time where an instance of the job isn't running. Can it be safely stopped? Alternatively we can disable the job after its next execution and then merge changes to migrate it.

s3 is described on https://noc.wikimedia.org/db.php as:

Any wiki not hosted on the other sections.

So, that's probably a lot of wikis, even if the job for each individual wiki does not take long.

That being said, it should be safe to stop that job, and we can deal with it not running for a (short) while. This job fills a database table with new link-recommendations and many wikis should have a decent backlog.

Change #1143529 merged by Hnowlan:

[operations/puppet@production] mw::maintenance: migrate all refreshLinkRecommendations jobs to k8s

https://gerrit.wikimedia.org/r/1143529

Now that we've largely unblocked the continuously running jobs, the refreshLinkRecommendations shards are ready to be migrated. However, the s3 job tends to take 7+ hours to complete and it can be difficult to find a time where an instance of the job isn't running. Can it be safely stopped? Alternatively we can disable the job after its next execution and then merge changes to migrate it.

s3 is described on https://noc.wikimedia.org/db.php as:

Any wiki not hosted on the other sections.

So, that's probably a lot of wikis, even if the job for each individual wiki does not take long.

That being said, it should be safe to stop that job, and we can deal with it not running for a (short) while. This job fills a database table with new link-recommendations and many wikis should have a decent backlog.

Thanks! One run of the s3 job has been stopped towards the end of its run, but all jobs are now running in Kubernetes. We'll keep an eye on them.

Change #1148914 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mw::maintenance: migrate all remaining growthexperiments jobs

https://gerrit.wikimedia.org/r/1148914

Change #1148914 merged by Hnowlan:

[operations/puppet@production] mw::maintenance: migrate all remaining growthexperiments jobs

https://gerrit.wikimedia.org/r/1148914

Hey @hnowlan or @Clement_Goubert, we are getting an alert about the growthexperiments-refreshlinkrecommendations-s3 job having failed, though from the logs we can't tell the cause.

Also, running kubectl get jobs -l team=growth,cronjob=growthexperiments-refreshlinkrecommendations-s3 --field-selector status.successful=0 on the server as instructed yields curious results:

NAME                                                       COMPLETIONS   DURATION   AGE
growthexperiments-refreshlinkrecommendations-s3-29131167   0/1           10h        10h
growthexperiments-refreshlinkrecommendations-s3-29131587   0/1           3h3m       3h3m

Why are there 2? Also, when running the same command later, both DURATION and AGE increase. If they have failed, and are thus no longer running, shouldn't the DURATION value remain the same regardless of when the command is executed?

If you look at kubectl describe job for these two:

cgoubert@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-cron$ kubectl describe job growthexperiments-refreshlinkrecommendations-s3-29131167 | grep Status
Pods Statuses:    0 Active / 0 Succeeded / 1 Failed
cgoubert@deploy1003:/srv/deployment-charts/helmfile.d/services/mw-cron$ kubectl describe job growthexperiments-refreshlinkrecommendations-s3-29131587 | grep Status
Pods Statuses:    1 Active / 0 Succeeded / 0 Failed

A failed job is considered "not complete", and its duration keeps going up even if it is not running. To reset the alert, the failed job needs to be deleted (which I'll do; we still need to add a wrapper script or something so it's easy for y'all to do).

It looks like one of the runs did fail - the logs can be seen with kubectl logs growthexperiments-refreshlinkrecommendations-s3-29131167-sfgdv mediawiki-main-app:

... lots of output ...
xmfwiki     fetching 500 tasks...
RuntimeException from line 328 of /srv/mediawiki/php-1.45.0-wmf.1/extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php: Search error: Search is currently too busy. Please try again later.
#0 /srv/mediawiki/php-1.45.0-wmf.1/extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php(435): GrowthExperiments\Maintenance\RefreshLinkRecommendations->findArticlesInTopic('engineering')
#1 /srv/mediawiki/php-1.45.0-wmf.1/extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php(148): GrowthExperiments\Maintenance\RefreshLinkRecommendations->refreshViaOresTopics(false)
#2 /srv/mediawiki/php-1.45.0-wmf.1/maintenance/includes/MaintenanceRunner.php(694): GrowthExperiments\Maintenance\RefreshLinkRecommendations->execute()
#3 /srv/mediawiki/php-1.45.0-wmf.1/maintenance/run.php(51): MediaWiki\Maintenance\MaintenanceRunner->run()
#4 /srv/mediawiki/multiversion/MWScript.php(221): require_once('/srv/mediawiki/...')
#5 {main}

Could this error have been happening previously but silently?

The two jobs you are seeing in that output are the presently running job and the previously failed job.

Before I do delete it, here's the error

xmfwiki     fetching 500 tasks...
RuntimeException from line 328 of /srv/mediawiki/php-1.45.0-wmf.1/extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php: Search error: Search is currently too busy. Please try again later.
#0 /srv/mediawiki/php-1.45.0-wmf.1/extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php(435): GrowthExperiments\Maintenance\RefreshLinkRecommendations->findArticlesInTopic('engineering')
#1 /srv/mediawiki/php-1.45.0-wmf.1/extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php(148): GrowthExperiments\Maintenance\RefreshLinkRecommendations->refreshViaOresTopics(false)
#2 /srv/mediawiki/php-1.45.0-wmf.1/maintenance/includes/MaintenanceRunner.php(694): GrowthExperiments\Maintenance\RefreshLinkRecommendations->execute()
#3 /srv/mediawiki/php-1.45.0-wmf.1/maintenance/run.php(51): MediaWiki\Maintenance\MaintenanceRunner->run()
#4 /srv/mediawiki/multiversion/MWScript.php(221): require_once('/srv/mediawiki/...')
#5 {main}
hnowlan claimed this task.
hnowlan updated the task description.