Implement periodic maintenance scripts for mw-on-k8s
Closed, Resolved · Public

Description

Right now our periodic MediaWiki maintenance scripts run as systemd timers on the mwmaint servers.

We need to convert all of them to become Kubernetes cronjobs.

Most of the work needed to run these will probably be shared with the work to allow running one-off jobs in T341553. For cronjobs this should be easier, though: we don't have the problem of managing random Helm releases, as we will probably need just a single one.

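One important detail to consider is what happens on a MediaWiki deployment: would it kill any running cronjob? The answer is no. Per the Kubernetes documentation, changes to a CronJob only apply to new Jobs started after the modification, and Jobs that have already started keep running unchanged. See https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#modifying-a-cronjob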

Details

Related Changes in Gerrit:
Repo                          Branch      Lines +/-
operations/puppet             production  +109 -45
operations/puppet             production  +1 -25
operations/deployment-charts  master      +0 -3
operations/deployment-charts  master      +1 -0
operations/puppet             production  +10 -1
operations/deployment-charts  master      +3 -0
operations/puppet             production  +71 -86
operations/puppet             production  +9 -4
operations/puppet             production  +9 -1
operations/puppet             production  +54 -5
operations/puppet             production  +1 -1
operations/puppet             production  +34 -4
operations/deployment-charts  master      +5 -9
operations/puppet             production  +2 -5
operations/deployment-charts  master      +2 -2
operations/deployment-charts  master      +3 -1
operations/deployment-charts  master      +5 -4
operations/deployment-charts  master      +105 -0
operations/deployment-charts  master      +3 -7
operations/deployment-charts  master      +328 -17
operations/puppet             production  +48 -9

Related Objects

Status     Subtype           Assigned
Resolved                     dancy
Resolved                     Clement_Goubert
Resolved                     Clement_Goubert
Resolved                     Clement_Goubert
Resolved                     Clement_Goubert
Invalid                      Clement_Goubert
Invalid                      Clement_Goubert
Resolved                     Clement_Goubert
Resolved                     Clement_Goubert
Resolved                     Clement_Goubert
Duplicate                    Clement_Goubert
Resolved                     Clement_Goubert
Resolved                     Clement_Goubert
Resolved                     Clement_Goubert
Resolved                     Clement_Goubert
Resolved                     Clement_Goubert
Resolved                     hnowlan
Open       PRODUCTION ERROR  None
Resolved   PRODUCTION ERROR  Michael
Resolved                     Clement_Goubert
Resolved                     Clement_Goubert
Resolved                     hnowlan
Resolved                     Scott_French
Resolved                     Clement_Goubert
Resolved                     Clement_Goubert
Resolved                     hnowlan
Resolved                     Clement_Goubert
Resolved                     Joe
Resolved                     hashar
Open                         None
Resolved                     Clement_Goubert
Resolved                     Scott_French
Resolved                     Clement_Goubert
Duplicate                    Clement_Goubert
Duplicate                    None
Resolved                     None
Resolved                     None
Resolved                     Scott_French
Resolved                     Scott_French
Resolved                     hnowlan
Resolved                     Clement_Goubert
Resolved                     None
Resolved                     hnowlan
Resolved                     hnowlan
Resolved                     hnowlan
Resolved                     Clement_Goubert
Resolved                     None
Resolved                     Clement_Goubert
Resolved                     Scott_French
Resolved                     Clement_Goubert
Resolved                     Clement_Goubert
Resolved                     Scott_French
Resolved                     jijiki
Resolved                     Clement_Goubert
Resolved                     Clement_Goubert
Resolved                     Clement_Goubert
Open                         None

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change #1132630 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Add labels to CronJobs

https://gerrit.wikimedia.org/r/1132630

Change #1132636 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Fix jobConfig scope rewrite

https://gerrit.wikimedia.org/r/1132636

Change #1132630 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Add labels to CronJobs

https://gerrit.wikimedia.org/r/1132630

Change #1132636 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Fix jobConfig scope rewrite

https://gerrit.wikimedia.org/r/1132636

Change #1133864 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw::periodic_jobs: Pass command through untouched

https://gerrit.wikimedia.org/r/1133864

Change #1133865 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Fix mwcron command invocation

https://gerrit.wikimedia.org/r/1133865

Change #1133872 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mwcron: Import all periodic_jobs resources

https://gerrit.wikimedia.org/r/1133872

Change #1133864 merged by Clément Goubert:

[operations/puppet@production] mw::periodic_jobs: Pass command through untouched

https://gerrit.wikimedia.org/r/1133864

Change #1133865 merged by Clément Goubert:

[operations/deployment-charts@master] mediawiki: Fix mwcron command invocation

https://gerrit.wikimedia.org/r/1133865

Change #1133872 merged by Clément Goubert:

[operations/puppet@production] mwcron: Import all periodic_jobs resources

https://gerrit.wikimedia.org/r/1133872

Change #1135002 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes_periodic_job: Lowercase job name

https://gerrit.wikimedia.org/r/1135002

Change #1135002 merged by Clément Goubert:

[operations/puppet@production] kubernetes_periodic_job: Lowercase job name

https://gerrit.wikimedia.org/r/1135002

Change #1135936 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] mw:periodic_jobs: Add mw-cron boilerplate

https://gerrit.wikimedia.org/r/1135936

Change #1135936 merged by Clément Goubert:

[operations/puppet@production] mw:periodic_jobs: Add mw-cron boilerplate

https://gerrit.wikimedia.org/r/1135936

Change #1137227 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] sharded_periodic_jobs: Kubernetes CronJob compat

https://gerrit.wikimedia.org/r/1137227

Change #1137228 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] updatequerypages: Move to sharded_periodic_job

https://gerrit.wikimedia.org/r/1137228

Change #1137261 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] updatequerypages: Move deadendpages-s3 to kubernetes

https://gerrit.wikimedia.org/r/1137261

Change #1137227 merged by Clément Goubert:

[operations/puppet@production] sharded_periodic_jobs: Kubernetes CronJob compat

https://gerrit.wikimedia.org/r/1137227

Change #1137228 merged by Clément Goubert:

[operations/puppet@production] updatequerypages: Move deadendpages to sharded_periodic_job

https://gerrit.wikimedia.org/r/1137228

Change #1137306 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] updatequerypages: Move all to sharded_periodic_job

https://gerrit.wikimedia.org/r/1137306

Change #1137261 merged by Clément Goubert:

[operations/puppet@production] updatequerypages: Move deadendpages-s3 to kubernetes

https://gerrit.wikimedia.org/r/1137261

Change #1137306 merged by Clément Goubert:

[operations/puppet@production] updatequerypages: Move all to sharded_periodic_job

https://gerrit.wikimedia.org/r/1137306

Thanks for the work that people are doing on this!

I just have a comment from a volunteer point of view about the automatic @phaultfinder tasks that get created when a cron-job fails. Speaking personally, these sorts of tasks (e.g. T392441, T392443) are difficult to triage as a volunteer without Logstash access, as there doesn't seem to be any information provided about what might have caused the script in question to fail. E.g., as a volunteer developer, I wouldn't currently know how the two tasks might be able to be fixed/what might be responsible for the script's failures (or even whether those two tasks are duplicates of each other), as the tasks don't appear to contain enough information to allow me to make that judgement.

For components with an active team steward, this might not be much of an issue (as the component’s stewards would be able to investigate what went wrong using Logstash themselves, and refer the issue to another team - e.g. serviceops - where appropriate). However, for components that don’t have a steward assigned - e.g. MediaWiki-Special-pages - this might be more of a problem, as (I believe, correct me if I'm wrong!) there wouldn’t be any people with Logstash access who would be specifically responsible for triaging that component's cron-job failures.

Information/stack traces not being retrieved from Logstash has already been an occasional problem with regards to community-reported production errors (see e.g. the comments in T391206: Propose to add "stack trace requested" column to #wikimedia-production-error), but I guess I just worry that the current way in which these automatic cron-job-failure tasks are created(/the lack of information currently contained within them) has the potential to worsen this issue a bit. (Don't get me wrong, I think it's important to track when scripts are failing/aren't working as they should! However, based on what I can currently see, I believe that the current way in which these problems are reported has the potential to cause an issue for unstewarded components.)


(As a side note, should these cron-job-failure tasks be tagged with Wikimedia-production-error?)

Change #1138352 had a related patch set uploaded (by Kamila Součková; author: Kamila Součková):

[operations/deployment-charts@master] mw-cron: set php.version

https://gerrit.wikimedia.org/r/1138352

Change #1138352 merged by jenkins-bot:

[operations/deployment-charts@master] mw-cron: set php.version

https://gerrit.wikimedia.org/r/1138352

Change #1139004 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mediawiki::periodic_job: allow for use of a migration title for long job names

https://gerrit.wikimedia.org/r/1139004

Change #1139004 merged by Hnowlan:

[operations/puppet@production] mediawiki::periodic_job: allow for use of a migration title for long job names

https://gerrit.wikimedia.org/r/1139004

Change #1143517 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mw-cron: disable mcrouter container

https://gerrit.wikimedia.org/r/1143517

Change #1143520 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/deployment-charts@master] mw-cron: enable monitoring

https://gerrit.wikimedia.org/r/1143520

Change #1143520 abandoned by Effie Mouzeli:

[operations/deployment-charts@master] mw-cron: enable monitoring

Reason:

wrong assumptions

https://gerrit.wikimedia.org/r/1143520

> Thanks for the work that people are doing on this!

Hi, thanks for the feedback.

> I just have a comment from a volunteer point of view about the automatic @phaultfinder tasks that get created when a cron-job fails. Speaking personally, these sorts of tasks (e.g. T392441, T392443) are difficult to triage as a volunteer without Logstash access, as there doesn't seem to be any information provided about what might have caused the script in question to fail. E.g., as a volunteer developer, I wouldn't currently know how the two tasks might be able to be fixed/what might be responsible for the script's failures (or even whether those two tasks are duplicates of each other), as the tasks don't appear to contain enough information to allow me to make that judgement.

We do not have a good way to embed log information in tasks created by @phaultfinder at the moment. The reason is that the task is created through Alertmanager with information from Prometheus, which does not contain (and can't handle) logs.

As far as duplicates go: as long as the task's title isn't changed and the task isn't closed, a single task titled MediaWikiCronJobFailed is created per component, and future firings for jobs in the same component are added to the task description.
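
The matching rule described above can be illustrated with a minimal sketch. This is not the actual @phaultfinder code: the `Task` class, the `record_firing` function and the job names are hypothetical, and the only behaviour modelled is what this comment states (an open task is reused when its title and component match, otherwise a new task is filed).

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical stand-in for a task filed from an alert."""
    title: str
    component: str
    is_open: bool = True
    firings: list = field(default_factory=list)


def record_firing(tasks, component, cronjob, title="MediaWikiCronJobFailed"):
    """Add the firing to a matching open task, or file a new task.

    Assumption: a task matches only if it is open and its title and
    component are both unchanged.
    """
    for task in tasks:
        if task.is_open and task.title == title and task.component == component:
            task.firings.append(cronjob)
            return task
    task = Task(title=title, component=component, firings=[cronjob])
    tasks.append(task)
    return task


tasks = []
record_firing(tasks, "MediaWiki-Special-pages", "job-a")  # files a new task
record_firing(tasks, "MediaWiki-Special-pages", "job-b")  # reuses that task
tasks[0].title = "A more informative title"               # manual rename
record_firing(tasks, "MediaWiki-Special-pages", "job-a")  # title no longer matches
print(len(tasks))
```

Under these assumptions, renaming or closing the existing task means the next firing no longer finds it, so a second task is filed for the same component.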

> For components with an active team steward, this might not be much of an issue (as the component’s stewards would be able to investigate what went wrong using Logstash themselves, and refer the issue to another team - e.g. serviceops - where appropriate). However, for components that don’t have a steward assigned - e.g. MediaWiki-Special-pages - this might be more of a problem, as (I believe, correct me if I'm wrong!) there wouldn’t be any people with Logstash access who would be specifically responsible for triaging that component's cron-job failures.

Did volunteers for components without an active team steward have a way to know these jobs failed in the old system? Was there a way they accessed logs without Logstash? If so (for instance by being members of the restricted shell access group), we may be able to work something out for CLI access to Kubernetes logs.

From what I understand, up until now, these failures only ended up in SRE alerting dashboards as warnings that a systemd job had failed, and were mostly debugged by serviceops before being either assigned to the right component if needed, or directly fixed.

> Information/stack traces not being retrieved from Logstash has already been an occasional problem with regards to community-reported production errors (see e.g. the comments in T391206: Propose to add "stack trace requested" column to #wikimedia-production-error), but I guess I just worry that the current way in which these automatic cron-job-failure tasks are created(/the lack of information currently contained within them) has the potential to worsen this issue a bit. (Don't get me wrong, I think it's important to track when scripts are failing/aren't working as they should! However, based on what I can currently see, I believe that the current way in which these problems are reported has the potential to cause an issue for unstewarded components.)

We are currently working on migrating to the new system and reaching at least functional parity with the old one. We will then work on improving the alerting, although, as mentioned, we are limited in the amount of information we can convey through this system.

One thing we will prioritize is creating tasks indexed on the periodic job rather than on the component, to give more granularity and easier access to that information.

Was a process to request stack traces agreed upon following T391206: Propose to add "stack trace requested" column to #wikimedia-production-error?

> (As a side note, should these cron-job-failure tasks be tagged with Wikimedia-production-error?)

They probably should be: it would increase the chances, for the unstewarded ones, of someone with Logstash access seeing them and adding information, and I think it also makes sense in general.

Thank you for the detailed response @Clement_Goubert!

> As far as duplicates go: as long as the task's title isn't changed and the task isn't closed, a single task titled MediaWikiCronJobFailed is created per component, and future firings for jobs in the same component are added to the task description.

Thanks for the note -- on this occasion, I only realised this after having already changed one of the tasks' titles to be more informative! (xref T392441#10762663)
It'd be slightly nicer from a quality-of-life perspective if the software behind @phaultfinder was smart enough to know/keep track of the tasks that it's filed; but hey, I guess we have to deal with what we've got :)

> Did volunteers for components without an active team steward have a way to know these jobs failed in the old system?

I can't answer this question fully (due to a lack of personal knowledge on the matter); but, from my perspective (& in the context of the MediaWiki-Special-pages maintenance reports), I can imagine that there could theoretically be a task filed by an end-user for the (user-facing) impact of a job having failed -- e.g., something like "Special:BrokenRedirects hasn't updated since 3 months ago", or something like that.

> Was there a way they [volunteers] accessed logs without Logstash?

I'm afraid I'll have to pass on this question — I’m not personally knowledgeable enough to know whether or not this has previously been possible.

> Was a process to request stack traces agreed upon following T391206: Propose to add "stack trace requested" column to #wikimedia-production-error?

No (xref T391206#10717057) - I believe the thinking from folks in that task was that the issue with stack-traces occasionally not being provided/filled-in might be more of a social one, than one that'd necessarily be fixable with more docs/tags.
In my eyes, the most ideal situation would probably be that every ticket that arrives in the Wikimedia-production-error queue is - at some point (& before the logs expire) - looked over by someone with Logstash access, who'd then be able to add a stack trace where one is needed/missing.

>> (As a side note, should these cron-job-failure tasks be tagged with Wikimedia-production-error?)

> They probably should be: it would increase the chances, for the unstewarded ones, of someone with Logstash access seeing them and adding information, and I think it also makes sense in general.

Thanks for the confirmation! :)
Would it be possible to configure the software behind @phaultfinder to automatically add the Wikimedia-production-error tag to these types of tasks (at least, in components that don't have an active steward), so that it doesn't rely on a human noticing the task & doing so themselves?

> Thank you for the detailed response @Clement_Goubert!

>> As far as duplicates go: as long as the task's title isn't changed and the task isn't closed, a single task titled MediaWikiCronJobFailed is created per component, and future firings for jobs in the same component are added to the task description.

> Thanks for the note -- on this occasion, I only realised this after having already changed one of the tasks' titles to be more informative! (xref T392441#10762663)
> It'd be slightly nicer from a quality-of-life perspective if the software behind @phaultfinder was smart enough to know/keep track of the tasks that it's filed; but hey, I guess we have to deal with what we've got :)

I've changed alerting to include the CronJob name in the task title for most alerts.

[...]

>>> (As a side note, should these cron-job-failure tasks be tagged with Wikimedia-production-error?)

>> They probably should be: it would increase the chances, for the unstewarded ones, of someone with Logstash access seeing them and adding information, and I think it also makes sense in general.

> Thanks for the confirmation! :)
> Would it be possible to configure the software behind @phaultfinder to automatically add the Wikimedia-production-error tag to these types of tasks (at least, in components that don't have an active steward), so that it doesn't rely on a human noticing the task & doing so themselves?

That's now done for every unstewarded task alert.

Brilliant, thanks so much! :D

Change #1150594 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] mw::periodic_job: clean up migration_title parameter

https://gerrit.wikimedia.org/r/1150594

Change #1150594 abandoned by Hnowlan:

[operations/puppet@production] mw::periodic_job: clean up migration_title parameter

Reason:

Won't do this while beta metal jobs are required

https://gerrit.wikimedia.org/r/1150594

All jobs have been migrated \o/