
Migrate Cumin hosts to Buster
Closed, Resolved (Public)

Description

Tentatively scheduled for next quarter (as it's blocked on the decision whether to proceed with MariaDB 10.4).

High level steps involved:

  • Reimage cumin2001
  • Test workflows and adapt any potential tweaks on the OS/Python side (see the smoke-test sketch below)
  • Tell everyone to move to cumin2001 and reimage cumin1001
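
A minimal smoke test for the second step could look like the following (just a sketch; 'A:cumin' is a hypothetical host alias and not taken from this task):

sudo cumin 'A:cumin' 'uname -r'   # exercises the SSH transport end to end from the reimaged host
python3 -c 'import cumin'         # confirms the Python stack imports cleanly on Buster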

Event Timeline

jbond triaged this task as Medium priority. Feb 13 2020, 11:47 AM
jbond subscribed.

Cumin should, for the most part, be easy to upgrade to 10.4: it only holds the client, not the server, which makes the upgrade simpler (and it should be transparent).

The main issue on the db/backups side is that snapshots are scheduled from cumin, and we should make sure those continue uninterrupted (mostly transfer.py and the MariaDB backup scripts).

I have been testing the 10.4.12 MariaDB client (and the previous 10.4.11) on buster (on db1107) for the last few weeks without encountering any issues.

Noticed in T236576 that cumin is not packaged for buster. Is that going to change soon or should I make a new stretch instance?

@Krenair sure, we'll need to build the buster package anyway for the prod migration. If you have a date in mind for your migration, let me know so that I can plan when to do the buster package. It should be pretty trivial.

Thanks Volans. I'm planning to do it as soon as the package is available.

CI on WMCS uses a cumin master. Krenair already created the Buster instance integration-cumin-02.integration.eqiad.wmflabs. We can use it to experiment :]

I'm working on the cumin release; I need to find a solution for some dependency incompatibilities between the code and the versions in Debian. I'll update the task with more information in the next few days.

Change 603965 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Use the cumin profile in role::cluster::management

https://gerrit.wikimedia.org/r/603965

Change 603965 merged by Muehlenhoff:
[operations/puppet@production] Use the cumin profile in role::cluster::management

https://gerrit.wikimedia.org/r/603965

Change 604023 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] On buster install python3-tqdm from the spicerack component

https://gerrit.wikimedia.org/r/604023

@Krenair @MoritzMuehlenhoff I've finally built and uploaded cumin 4.0.0~rc1 to our APT in wikimedia-buster.
The package is installable and, from some quick tests on a local setup, it should work as expected.
The only problem is that we depend on a specific version of tqdm, so that package must be installed and pinned from the component/spicerack APT component; the version in buster has incompatibility issues. So @Krenair, a patch like [1] will also be needed for the WMCS setup of cumin in puppet.

FYI, I haven't tested the puppetdb or openstack backends yet.

[1] https://gerrit.wikimedia.org/r/c/operations/puppet/+/604023
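
For reference, the pin that patch ends up shipping on the Buster cumin hosts should look roughly like this (a sketch; file names and the pin priority are illustrative, only the component and package name come from this task):

# /etc/apt/sources.list.d/wikimedia-spicerack.list (illustrative path)
deb http://apt.wikimedia.org/wikimedia buster-wikimedia component/spicerack

# /etc/apt/preferences.d/spicerack.pref (illustrative path; priority above 1000 so it wins over the buster version)
Package: python3-tqdm
Pin: release a=buster-wikimedia c=component/spicerack
Pin-Priority: 1002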

Mentioned in SAL (#wikimedia-operations) [2020-06-10T09:35:47Z] <volans> imported 0.0.38-1+deb10u1 into buster-wikimedia APT - T245114

Change 604023 merged by Muehlenhoff:
[operations/puppet@production] On buster install python3-tqdm from the spicerack component

https://gerrit.wikimedia.org/r/604023

Change 604715 had a related patch set uploaded (by Volans; owner: Volans):
[operations/software/homer/deploy@master] Add support for buster in the build process

https://gerrit.wikimedia.org/r/604715

Change 604715 merged by Volans:
[operations/software/homer/deploy@master] Add support for buster in the build process

https://gerrit.wikimedia.org/r/604715

Change 605545 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Readd the spicerack component on Stretch

https://gerrit.wikimedia.org/r/605545

Change 605545 merged by Muehlenhoff:
[operations/puppet@production] Readd the spicerack component on Stretch

https://gerrit.wikimedia.org/r/605545

Change 605558 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] homer: fix initial clone of private repo

https://gerrit.wikimedia.org/r/605558

Mentioned in SAL (#wikimedia-operations) [2020-06-15T10:54:08Z] <moritzm> imported python-phabricator 0.7.0-2~wmf2 to apt.wikimedia.org/buster-wikimedia T245114

Change 605558 merged by Volans:
[operations/puppet@production] homer: fix initial clone of private repo

https://gerrit.wikimedia.org/r/605558

Change 605564 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] homer: enforce resource order

https://gerrit.wikimedia.org/r/605564

Change 605583 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] git::clone: allow to pass environment variables

https://gerrit.wikimedia.org/r/605583

Change 605583 merged by Volans:
[operations/puppet@production] git::clone: allow to pass environment variables

https://gerrit.wikimedia.org/r/605583

Change 605564 merged by Volans:
[operations/puppet@production] homer: set the keyholder env variable

https://gerrit.wikimedia.org/r/605564

Change 605590 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] homer: fix git clone URL

https://gerrit.wikimedia.org/r/605590

Change 605590 merged by Volans:
[operations/puppet@production] homer: fix git clone URL

https://gerrit.wikimedia.org/r/605590

Change 605623 had a related patch set uploaded (by Volans; owner: Volans):
[operations/software/homer/deploy@master] Include pip into the built wheels

https://gerrit.wikimedia.org/r/605623

Would it be possible to save /home directories somewhere so they are available once the host is back?
It is not a lot of data to save:

root@cumin1001:/home# du -shc .
4.9G	.
4.9G	total
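
One way to do that (a sketch; the destination path is made up and it assumes root ssh between the cumin hosts is allowed) is a plain rsync to the other cumin host before the reimage, and the reverse copy afterwards:

root@cumin1001:~# rsync -aHAX --numeric-ids /home/ cumin2001.codfw.wmnet:/srv/home-cumin1001/
# ...reimage cumin1001...
root@cumin1001:~# rsync -aHAX --numeric-ids cumin2001.codfw.wmnet:/srv/home-cumin1001/ /home/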

Change 605623 merged by Volans:
[operations/software/homer/deploy@master] Include pip into the built wheels

https://gerrit.wikimedia.org/r/605623

Change 605827 had a related patch set uploaded (by Volans; owner: Volans):
[operations/software/homer/deploy@master] Name expansion doesn't work, make it explicit

https://gerrit.wikimedia.org/r/605827

Change 605827 merged by Volans:
[operations/software/homer/deploy@master] Name expansion doesn't work, make it explicit

https://gerrit.wikimedia.org/r/605827

Change 605828 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] homer: adapt plugin path to new installation

https://gerrit.wikimedia.org/r/605828

Change 605828 merged by Volans:
[operations/puppet@production] homer: adapt plugin path to new installation

https://gerrit.wikimedia.org/r/605828

Change 605844 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] reimage banner for cumin1001

https://gerrit.wikimedia.org/r/605844

Change 605844 merged by Muehlenhoff:
[operations/puppet@production] reimage banner for cumin1001

https://gerrit.wikimedia.org/r/605844

Mentioned in SAL (#wikimedia-operations) [2020-06-22T08:33:50Z] <moritzm> reimaging cumin1001 to buster T245114

Change 606986 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] cumin: fix python3 version on Buster

https://gerrit.wikimedia.org/r/606986

I have:

  • Created deployment-cumin.deployment-prep.eqiad.wmflabs and integration-cumin.integration.eqiad.wmflabs
  • Cherry-picked a python3.7 fix from https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/606986/2 on each puppet master
  • Updated profile::openstack::eqiad1::cumin::project_masters in Horizon to add the new cumin masters (which in turn allows them in the ssh authorized keys for root)
    • And ran puppet on an instance of each project
  • Armed keyholder

deployment-prep (FIXED)

$ sudo -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh root@deployment-deploy01.deployment-prep.eqiad.wmflabs
root@deployment-deploy01.deployment-prep.eqiad.wmflabs: Permission denied (publickey).

Somehow it rejects the IP, even though it is listed as permitted in the keys file (shown below)?

Jun 22 11:35:49 deployment-deploy01 sshd[8276]: Connection from 172.16.1.151 port 45308 on 172.16.4.18 port 22
Jun 22 11:35:49 deployment-deploy01 sshd[8276]: Authentication tried for root with correct key but not from a permitted host (host=172.16.1.151, ip=172.16.1.151).
Jun 22 11:35:49 deployment-deploy01 sshd[8276]: Failed publickey for root from 172.16.1.151 port 45308 ssh2: ED25519 SHA256:uesq4783AjnMh1XlOwJWGlyMLkxgIlaSNrvUZBhvqQs

On deployment-deploy01 there is:

/etc/ssh/userkeys/root.d/cumin
# Cumin Masters. TODO: use 'restrict' once available across the fleet (> jessie)
from="172.16.1.151,172.16.1.135,172.16.5.1,172.16.6.176,172.16.4.46,172.16.6.133",no-agent-forwarding,no-port-forwarding,no-x11-forwarding,no-user-rc ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAICcav+ECiF6hW2XRuP7R8nqDw4hPlD0OChsGvB6K27jK root@cloudinfra-internal-puppetmaster-02

from="172.16.5.1,172.16.6.176",no-agent-forwarding,no-port-forwarding,no-x11-forwarding,no-user-rc ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIHMS6pXywYSw1kaZQivozB8qUx0vd1gqiAnVqJuS365B root@deployment-cumin

Probably because there are a bunch of hiera keys and I updated the wrong one, I guess.
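
For the record, a quick way to match the rejected fingerprint against the entries above (standard OpenSSH tooling, just a sketch):

ssh-keygen -lf /etc/ssh/userkeys/root.d/cumin
# If SHA256:uesq... turns out to be the root@deployment-cumin entry, its from= list
# (172.16.5.1,172.16.6.176) does not include 172.16.1.151, which would explain the
# "not from a permitted host" rejection despite the key itself being correct.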

Change 606986 merged by Muehlenhoff:
[operations/puppet@production] cumin: fix labs install on Buster

https://gerrit.wikimedia.org/r/606986

On integration I am hitting a wall: the keyholder agent refuses to sign:

integration-cumin:~$ sudo -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh root@integration-agent-docker-1004.integration.eqiad.wmflabs
sign_and_send_pubkey: signing failed: agent refused operation
root@integration-agent-docker-1004.integration.eqiad.wmflabs: Permission denied (publickey).

I can't find the issue :-\

@hashar I've run systemctl restart keyholder-proxy.service; ssh seems to work fine now and I can run cumin too.
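
For future reference, a quick way to tell whether keyholder is the culprit (a sketch; ssh-add is standard OpenSSH, while the keyholder status/arm subcommands are assumed to be the usual WMF wrapper):

SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh-add -l   # "The agent has no identities." means the agent needs (re)arming
sudo keyholder status
sudo keyholder arm                                   # prompts for the key passphrase and reloads the identities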

Change 607009 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Move to integration-cumin (Buster)

https://gerrit.wikimedia.org/r/607009

Change 607009 merged by jenkins-bot:
[integration/config@master] Move to integration-cumin (Buster)

https://gerrit.wikimedia.org/r/607009

It is complete for integration.

On deployment-prep I kept the old instance around just in case, but I guess I will delete it at the end of this week.

MoritzMuehlenhoff claimed this task.

All cumin hosts in production are now running Buster.