Toolforge: add Debian Buster to the grid and eliminate Debian Stretch
Closed, Resolved, Public

Description

Migrate tools/toolsbeta grid to Debian Buster since Stretch is deprecated. This is mostly just adding Buster queues, but then Stretch queues need to go away.

The migration itself will be handled by @komla:

Details

Other Assignee
taavi
Project                               Branch      Lines +/-  Subject
operations/puppet                     production  +0 -13
operations/software/tools-webservice  master      +2 -2
labs/toollabs                         master      +1 -1
operations/puppet                     production  +6 -6
operations/puppet                     production  +2 -0
operations/puppet                     production  +11 -0
labs/toollabs                         master      +15 -15
operations/software/tools-webservice  master      +2 -2
operations/puppet                     production  +47 -0
operations/puppet                     production  +20 -0
operations/puppet                     production  +8 -8
operations/cookbooks                  wmcs        +154 -3
operations/cookbooks                  wmcs        +120 -0
operations/puppet                     production  +8 -9
operations/puppet                     production  +1 -1
operations/puppet                     production  +10 -0
operations/puppet                     production  +299 -140
operations/puppet                     production  +256 -151
operations/puppet                     production  +57 -60
operations/puppet                     production  +11 -11
operations/puppet                     production  +11 -18
operations/puppet                     production  +2 -0
operations/puppet                     production  +0 -3

Related Objects

Status       Subtype  Assigned  Task
Open                  Andrew
Resolved              taavi
Resolved              taavi
Resolved              taavi
Open                  None
In Progress           taavi
Stalled               None
Resolved              aborrero
Resolved              aborrero
Resolved              aborrero
Resolved              aborrero
Resolved              aborrero
Resolved              taavi
Resolved              aborrero
Resolved              taavi
Duplicate             None
Stalled               None
Resolved              taavi
Declined              None
Resolved              aborrero
Stalled               None
Resolved              aborrero
Open                  taavi
Resolved              taavi
Resolved              nskaggs
Open                  taavi

Event Timeline

For this task at least I lean towards 'skip to Bullseye' unless we have specific concerns about breaking components. Since the migration path requires user cooperation we'll keep a lot more tools alive by minimizing the number of steps the users have to take.

Skipping to Bullseye may or may not be more difficult for our users (a greater jump in version numbers for the underlying runtimes). Work has already been required just to get Buster working correctly with the grid (and that may need to be done again for Bullseye). I haven't even checked for gridengine binaries on Bullseye (they are probably there, with zero patches as usual).

I don't see much reason to do that. It is far more likely that we will want to spin up bullseye VMs in addition to buster VMs on a separate ticket. The grid VMs are execution environments, like the k8s images. Any valid execution environment is worth supporting. Skipping a version would be weird if not done for a very good reason. They should be able to co-exist fine as long as development continues to never happen on gridengine (and Debian packages remain available).

I said it back when the grid wasn't upgraded very quickly to buster, and I'll say it again on the notion of bullseye: better to turn off the grid if you want to be thinking about future proofing. Hrm. I think the name of the ticket is confusing. Updating it.

Bstorm renamed this task from "Toolforge: migrate grid to Debian Buster" to "Toolforge: add Debian Buster to the grid and eliminate Debian Stretch". Sep 27 2021, 5:45 PM
Bstorm updated the task description.

The grid was only all-or-nothing when moving from ancient versions of Ubuntu to Debian because of the switch in gridengine versions. It's not actively developed in the open source version, so there's really no danger in returning to the old practice of having multiple release version queues and keeping them there...with a default.
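The practice described above, several per-release queues with one default, can be sketched as follows. This is a minimal illustration, not the jsub or webservice code: the queue names and the `pick_queue` helper are hypothetical. It shows why a job that does not request a release simply follows whatever the default is, which matters later in this task when the srwiki crontab without `-release` stays on Stretch and when tools-webservice 0.84 and jobutils 1.44 switch the default to Buster.

```python
from typing import Dict, Optional

# Hypothetical release -> queue mapping. The real Toolforge queue names,
# and how jsub/webservice pass the release to gridengine, are not shown here.
RELEASE_QUEUES: Dict[str, str] = {
    "stretch": "stretch-queue",
    "buster": "buster-queue",
}


def pick_queue(requested: Optional[str], default: str) -> str:
    """Return the queue for a job, falling back to the default release."""
    release = requested or default
    if release not in RELEASE_QUEUES:
        raise ValueError(f"unsupported release: {release!r}")
    return RELEASE_QUEUES[release]


# A job without an explicit release follows whatever the default is.
print(pick_queue(None, default="stretch"))      # → stretch-queue
print(pick_queue(None, default="buster"))       # → buster-queue
print(pick_queue("stretch", default="buster"))  # → stretch-queue
```

In this sketch, retiring Stretch means deleting its mapping entry; an explicit request for it then raises an error instead of landing on some other queue.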

aborrero changed the task status from Stalled to In Progress. Dec 15 2021, 10:37 AM
aborrero claimed this task.

Change 747503 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P::toolforge::grid: fix tomcat on buster

https://gerrit.wikimedia.org/r/747503

Change 747503 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] P::toolforge::grid: fix tomcat on buster

https://gerrit.wikimedia.org/r/747503

Mentioned in SAL (#wikimedia-cloud) [2021-12-21T11:06:13Z] <arturo> bump quotas, instances from 50 to 55, CPU from 100 to 150, RAM from 200GB to 250GB (T277653)

Change 749221 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/cookbooks@wmcs] wmcs: toolforge: grid: add cookbook to depool a node

https://gerrit.wikimedia.org/r/749221

Change 749222 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/cookbooks@wmcs] wmcs: toolforge: grid: add remove instance cookbook

https://gerrit.wikimedia.org/r/749222

Change 749221 merged by Arturo Borrero Gonzalez:

[operations/cookbooks@wmcs] wmcs: toolforge: grid: add cookbook to depool a node

https://gerrit.wikimedia.org/r/749221

Change 749222 merged by Arturo Borrero Gonzalez:

[operations/cookbooks@wmcs] wmcs: toolforge: grid: add remove instance cookbook

https://gerrit.wikimedia.org/r/749222

Change 753446 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] toolforge: grid: weblight: support debian buster

https://gerrit.wikimedia.org/r/753446

Change 753446 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] toolforge: grid: weblight: support debian buster

https://gerrit.wikimedia.org/r/753446

Mentioned in SAL (#wikimedia-cloud) [2022-01-20T12:56:50Z] <arturo> scaling up the grid with 10 buster exec nodes (T277653)

Mentioned in SAL (#wikimedia-cloud) [2022-01-24T15:22:59Z] <arturo> scaling up the grid with 10 buster exec nodes (T277653)

Mentioned in SAL (#wikimedia-cloud) [2022-01-26T13:55:33Z] <arturo> scaling up the buster web grid with 5 lighttpd and 2 generic nodes (T277653)

Change 757499 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] toolforge: automated-tests: introduce check to verify default grid release

https://gerrit.wikimedia.org/r/757499

Change 757499 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] toolforge: automated-tests: introduce check to verify default grid release

https://gerrit.wikimedia.org/r/757499

aborrero updated the task description.
aborrero updated the task description.

Mentioned in SAL (#wikimedia-cloud) [2022-02-09T18:56:13Z] <wm-bot> pooled 10 grid nodes tools-sgeweblight-10-[1-5],tools-sgewebgen-10-[1,2],tools-sgeexec-10-[1-10] (T277653) - cookbook ran by arturo@nostromo
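The SAL entries here and later (for example `tools-sgeexec-[0901-0909]`) name hosts with a compact bracket notation. A minimal sketch for expanding it into individual host names follows, assuming the brackets hold numeric ranges (`1-5`, zero-padded `0901-0909`) or comma lists (`1,2`) and that top-level commas separate patterns; the helper names are hypothetical and not part of the wmcs cookbooks.

```python
import re
from typing import List

# First bracketed range in a host pattern, e.g. "[1-5]", "[1,2]", "[0901-0909]".
_RANGE = re.compile(r"\[([^\]]+)\]")


def split_patterns(spec: str) -> List[str]:
    """Split on commas that are not inside square brackets."""
    parts: List[str] = []
    depth, current = 0, ""
    for ch in spec:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def expand_hosts(pattern: str) -> List[str]:
    """Expand bracketed numeric ranges and lists into host names."""
    match = _RANGE.search(pattern)
    if match is None:
        return [pattern]
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    names: List[str] = []
    for item in match.group(1).split(","):
        if "-" in item:
            start, end = item.split("-", 1)
            width = len(start)  # preserve zero padding such as "0901"
            for n in range(int(start), int(end) + 1):
                names.append(f"{prefix}{n:0{width}d}{suffix}")
        else:
            names.append(f"{prefix}{item}{suffix}")
    # The suffix may contain further ranges; expand those too.
    return [host for name in names for host in expand_hosts(name)]


spec = "tools-sgeweblight-10-[1-5],tools-sgewebgen-10-[1,2],tools-sgeexec-10-[1-10]"
hosts = [h for p in split_patterns(spec) for h in expand_hosts(p)]
print(len(hosts))  # → 17
```

Under these assumptions the pooling entry above expands to 17 host names, and `tools-sgeexec-[0901-0909]` expands to 9.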

Change 762811 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] toolforge: automated-tests: introduce some TJF checks

https://gerrit.wikimedia.org/r/762811

Change 762811 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] toolforge: automated-tests: introduce some TJF checks

https://gerrit.wikimedia.org/r/762811

On 1st April, emails were sent to individual maintainers who still have projects running on Stretch GridEngine.

Some have started migrating away from Stretch GridEngine.

One user sent a technical enquiry to the Cloud mailing list but the post is currently being held for moderation because of size.
Can this be reviewed?

Apparently somebody has already let the message in.

I am seeing services that have been migrated from the Stretch grid still appearing on the grid-deprecation.toolforge.org portal.
Example services: srwiki and linedwell

Does this mean the maintainers have left these services running on the Stretch grid, or is it an issue with the way the data is fetched?

  • The srwiki tool's crontab does not specify a -release for either job, so it is still running its jobs on the Stretch grid. I don't see any sign of that tool attempting to use the new Kubernetes job service.
  • The linedwell tool's report shows its grid engine webservice job shutting down on 2022-04-03 09:32. The grid-deprecation app has some code (R2043:b18c8d8eff90: app: remove workloads migrated to Kubernetes) that tries to remove workloads that are now seen running on Kubernetes, but apparently the matching logic is not currently working for webservices.
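The grid-deprecation app's matching code is not reproduced in this task. As a hypothetical sketch of one way such matching could work, a grid workload is dropped from the report when a workload with the same key is seen on Kubernetes. The `Workload` type, the key rules, and the example names are illustrative assumptions; in particular, keying webservices by tool alone assumes each tool runs at most one webservice, so a webservice whose name differs between the grid and Kubernetes is still matched.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Workload:
    tool: str  # tool name
    kind: str  # "job" or "webservice"
    name: str  # job or webservice name as reported by the backend


def match_key(w: Workload) -> Tuple[str, ...]:
    # Assumption: a tool runs at most one webservice, so the tool name is
    # enough to identify it; jobs are matched on their exact name.
    if w.kind == "webservice":
        return (w.tool, w.kind)
    return (w.tool, w.kind, w.name)


def still_on_grid(grid: Iterable[Workload], k8s: Iterable[Workload]) -> List[Workload]:
    """Return the grid workloads with no matching workload on Kubernetes."""
    migrated = {match_key(w) for w in k8s}
    return [w for w in grid if match_key(w) not in migrated]


# Hypothetical example data: the webservice is named differently on each backend.
grid_workloads = [
    Workload("example-web", "webservice", "lighttpd-example-web"),
    Workload("example-tool", "job", "example-job"),
]
k8s_workloads = [Workload("example-web", "webservice", "example-web")]

print(still_on_grid(grid_workloads, k8s_workloads))
# → [Workload(tool='example-tool', kind='job', name='example-job')]
```

If webservices were compared on their exact names, the renamed webservice above would stay in the report even though it had moved to Kubernetes.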

If you have other tools like this, I don't think there is any way to know whether it is a tool issue or a reporting issue until someone goes and manually checks each tool.

Thanks!
I will remind maintainers to make sure they shut down all services on the Stretch grid after migration.

Some maintainers are asking for their projects (either abandoned or very old) to be shut down.
Is this something I can do? Or how do I mark such projects for shutdown?

Change 779462 had a related patch set uploaded (by Majavah; author: Majavah):

[labs/toollabs@master] jsub: set buster as default release

https://gerrit.wikimedia.org/r/779462

Mentioned in SAL (#wikimedia-cloud) [2022-04-29T14:22:35Z] <andrewbogott> changing login.toolforge.org, bastion.toolforge.org, and dev.toolforge.org dns entries to refer to the new Buster bastions T277653 https://wikitech.wikimedia.org/wiki/News/Toolforge_Stretch_deprecation#Timeline

Mentioned in SAL (#wikimedia-cloud) [2022-05-30T08:24:26Z] <taavi> depool tools-sgeexec-[0901-0909] (7 nodes total) T277653

Change 801385 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] sonofgridengine: grid_configurator: make the grid master a submit host

https://gerrit.wikimedia.org/r/801385

Change 802470 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/software/tools-webservice@master] gridengine: default to buster

https://gerrit.wikimedia.org/r/802470

Change 802470 merged by jenkins-bot:

[operations/software/tools-webservice@master] gridengine: default to buster

https://gerrit.wikimedia.org/r/802470

Mentioned in SAL (#wikimedia-cloud) [2022-06-02T10:16:57Z] <taavi> publish tools-webservice 0.84 that updates the grid default from stretch to buster T277653

Change 779462 merged by jenkins-bot:

[labs/toollabs@master] jsub: set buster as default release

https://gerrit.wikimedia.org/r/779462

Mentioned in SAL (#wikimedia-cloud) [2022-06-02T11:17:36Z] <taavi> publish jobutils 1.44 that updates the grid default from stretch to buster T277653

Change 807168 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::checker: add buster endpoints

https://gerrit.wikimedia.org/r/807168

Change 807169 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] icinga::monitor::toollabs: replace stretch with buster

https://gerrit.wikimedia.org/r/807169

Change 807170 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::checker: remove stretch endpoints

https://gerrit.wikimedia.org/r/807170

Change 807168 merged by David Caro:

[operations/puppet@production] P:toolforge::checker: add buster endpoints

https://gerrit.wikimedia.org/r/807168

Change 807182 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::checker: add missing endpoint config

https://gerrit.wikimedia.org/r/807182

Change 807184 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/software/tools-webservice@master] Remove stretch support

https://gerrit.wikimedia.org/r/807184

Change 807185 had a related patch set uploaded (by Majavah; author: Majavah):

[labs/toollabs@master] jsub: Remove stretch support

https://gerrit.wikimedia.org/r/807185

Change 807182 merged by David Caro:

[operations/puppet@production] P:toolforge::checker: add missing endpoint config

https://gerrit.wikimedia.org/r/807182

Change 807169 merged by Filippo Giunchedi:

[operations/puppet@production] icinga::monitor::toollabs: replace stretch with buster

https://gerrit.wikimedia.org/r/807169

The following SGE jobs are still running on the Stretch exec grid:

tools.alchimista 8787914 tlgm
tools.ammarbot 9058295 cine_review
tools.avicbot 400742 uaaby5min
tools.avicbot 6444029 avicbotirc
tools.cewbot 3280296 cron-tools.cewbot-IRC
tools.earwigbot 9782320 earwigbot
tools.germancon-mobile 400540 germancon-mobile-parser
tools.hat-collector 3965351 go
tools.listeria 3216144 bot
tools.nokib-bot 1593642 daily
tools.patest 6882116 script_wui
tools.phetools 2742445 ws_ocr_daemon
tools.phetools 2742447 verify_match
tools.phetools 2742450 extract_text_layer
tools.sdzerobot 6303775 stream
tools.signature-checker 2297341 cron-20170515.signature_check.wikinews
tools.signature-checker 2297344 cron-20170515.signature_check.wikisource
tools.signature-checker 2297345 cron-20170515.signature_check.wikiversity
tools.signature-checker 4615899 cron-20170515.signature_check.zh
tools.signature-checker 5086089 cron-20170515.signature_check.simple
tools.signature-checker 5969401 cron-20170515.signature_check.wiktionary
tools.signature-checker 6091396 cron-20170515.signature_check.zh-classical
tools.signature-manquante-bot 4769872 signature-manquante
tools.telegram-wikilinksbot 5699376 wikilinksbot
tools.toc 1826241 cron-20170915.topic_list.ja
tools.toc 3609056 cron-20170915.topic_list.wikinews
tools.toc 3609088 cron-20170915.topic_list.wikisource
tools.toc 3609172 cron-20170915.topic_list.wikiversity
tools.toc 5086111 cron-20170915.topic_list.wiktionary
tools.toc 5287268 cron-20170915.topic_list.moegirl
tools.urbanecmbot 2811129 patrolAfterPatrol
tools.urbanecmbot 6443111 patrolSandbox
tools.wikidata-todo 2840123 wc_all
tools.wikidata-todo 2840124 wc_rc
tools.wikitanvirbot 1366931 doubredi
tools.wikitanvirbot 5512585 sand-wbbn
tools.wikitanvirbot 5512746 corona
tools.wikitanvirbot 5816767 sand-wtbn
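The listing above has three whitespace-separated columns: tool account, job ID, and job name. A small sketch, assuming exactly that layout, groups the jobs per tool account, for example to contact each maintainer once before the jobs are removed; `jobs_by_tool` is a hypothetical helper, not an existing Toolforge script.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def jobs_by_tool(listing: str) -> Dict[str, List[Tuple[int, str]]]:
    """Group "<tool account> <job id> <job name>" lines by tool account."""
    grouped: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        account, job_id, job_name = fields
        grouped[account].append((int(job_id), job_name))
    return dict(grouped)


sample = """\
tools.phetools 2742445 ws_ocr_daemon
tools.phetools 2742447 verify_match
tools.toc 1826241 cron-20170915.topic_list.ja
"""
for account, jobs in jobs_by_tool(sample).items():
    print(account, [name for _, name in jobs])
```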

I'm going to kill those sometime tomorrow (i.e. 2022-06-23).

Mentioned in SAL (#wikimedia-cloud) [2022-06-23T13:59:45Z] <taavi> removing remaining continuous jobs from the stretch grid T277653

Change 807185 merged by jenkins-bot:

[labs/toollabs@master] jsub: Remove stretch support

https://gerrit.wikimedia.org/r/807185

Change 807184 merged by jenkins-bot:

[operations/software/tools-webservice@master] Remove stretch support

https://gerrit.wikimedia.org/r/807184

taavi updated Other Assignee, added: taavi.

Change 807170 merged by David Caro:

[operations/puppet@production] P:toolforge::checker: remove stretch endpoints

https://gerrit.wikimedia.org/r/807170