Remove or replace deployment-sessionstore04.deployment-prep.eqiad1.wikimedia.cloud (Buster deprecation)
Closed, Resolved (Public)

Description

Debian Buster is well out of upstream support and all Buster VMs need to be replaced.

Event Timeline

Great, can't ssh into my new instance:

$ ssh deployment-sessionstore05.deployment-prep.eqiad1.wikimedia.cloud
Connection closed by UNKNOWN port 65535

Mentioned in SAL (#wikimedia-cloud) [2024-07-20T15:52:31Z] <Southparkfan> add deployment-sessionstore05 (bookworm) - T370461

I'm not sure what went wrong here, but I forced a puppet run and switched this over to the deployment-prep puppetserver, so it will most likely accept your keys now.

Had to delete deployment-sessionstore05 (bookworm) due to T357791; will replace it with a bullseye instance for Cassandra.

Puppet fails to install the Cassandra instance:

Error: 'install -o cassandra -g cassandra -m 750 -d /var/lib/cassandra/data' returned 1 instead of one of [0]
Error: /Stage[main]/Cassandra/Cassandra::Instance[default]/Exec[install-/var/lib/cassandra/data]/returns: change from 'notrun' to ['0'] failed: 'install -o cassandra -g cassandra -m 750 -d /var/lib/cassandra/data' returned 1 instead of one of [0] (corrective)

The user 'cassandra' does not exist. Asked for help in #wikimedia-sre.

I didn't get a response in -sre, but Andrew has provided me with extra information.

GID/UID on sessionstore1004:

uid=114(cassandra) gid=121(cassandra) groups=121(cassandra)

GID/UID on sessionstore04:

uid=115(cassandra) gid=122(cassandra) groups=122(cassandra)

Added user/group manually:

groupadd -g 122 cassandra
useradd cassandra -u 115 -r -s /sbin/nologin -d /var/lib/cassandra -g 122
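
The `id` outputs above can also be compared with a small script before choosing the numeric IDs to pass to `groupadd` and `useradd`. This is only a sketch: the helper name `uid_gid_of` is hypothetical, and it assumes the `id` output format shown above.

```shell
#!/bin/sh
# Extract "UID GID" from one line of `id` output, e.g.
# "uid=115(cassandra) gid=122(cassandra) groups=122(cassandra)".
# uid_gid_of is a hypothetical helper name, not part of any existing tooling.
uid_gid_of() {
    printf '%s\n' "$1" |
        sed -n 's/^uid=\([0-9][0-9]*\)[^ ]* gid=\([0-9][0-9]*\).*/\1 \2/p'
}

# Values copied from the two hosts mentioned above.
ids_1004='uid=114(cassandra) gid=121(cassandra) groups=121(cassandra)'
ids_04='uid=115(cassandra) gid=122(cassandra) groups=122(cassandra)'

# Report a mismatch so the IDs to reuse on the new host are chosen deliberately.
if [ "$(uid_gid_of "$ids_1004")" != "$(uid_gid_of "$ids_04")" ]; then
    echo "UID/GID differ: $(uid_gid_of "$ids_1004") vs $(uid_gid_of "$ids_04")"
fi
```

The commands above reuse the IDs from sessionstore04 (115/122), presumably so the new host matches the instance it replaces.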

It failed to install cassandra 3.11.14:

Error: /Stage[main]/Cassandra/Apt::Package_from_component[cassandra]/Package[cassandra]/ensure: change from 'purged' to '3.11.14' failed: Could not update: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install cassandra=3.11.14' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
W: --force-yes is deprecated, use one of the options starting with --allow instead.
E: Version '3.11.14' for 'cassandra' was not found

Per T313814, we should be using cassandra 4.x. For deployment-prep, this was done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/939750. However, sessionstore was left behind on 3.x. The component cassandra41 is in buster-wikimedia and bullseye-wikimedia, so I'll see if I can get the Buster machine upgraded to cassandra 4.x before migrating to a Bullseye machine, where 4.x is mandatory.

Mentioned in SAL (#wikimedia-cloud) [2024-07-23T15:34:41Z] <Southparkfan> starting kask maintenance - T370461

Couldn't upgrade Buster to 4.x, because there are no packages in buster-wikimedia. Installing Cassandra was a rather interesting process.

The bootstrap failed:

Error: Execution of '/usr/bin/scap deploy-local --repo cassandra/logstash-logback-encoder -D log_json:False' returned 70: 
Error: /Stage[main]/Cassandra::Logging/Scap::Target[cassandra/logstash-logback-encoder]/Package[cassandra/logstash-logback-encoder]/ensure: change from 'absent' to 'present' failed: Execution of '/usr/bin/scap deploy-local --repo cassandra/logstash-logback-encoder -D log_json:False' returned 70: 
[...]
scap.runcmd.FailedCommand: Command 'git remote set-url origin http://deployment-deploy04.deployment-prep.eqiad1.wikimedia.cloud/cassandra/logstash-logback-encoder/.git' failed with exit code 128;
stdout:

stderr:
error: could not lock config file .git/config: Permission denied

Fixed manually by running:

root@deployment-sessionstore06:/srv/deployment# chown -R deploy-service cassandra/
root@deployment-sessionstore06:/srv/deployment# sudo -u deploy-service scap deploy-local --repo cassandra/logstash-logback-encoder -D log_json:False
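
Before re-running the deploy, the ownership of the tree can be checked for leftovers that would trigger the same "Permission denied" error. A minimal sketch, assuming ownership is the only problem; the helper name `not_owned_by` is hypothetical.

```shell
#!/bin/sh
# List entries under a directory that are NOT owned by the given user.
# An empty result means the user owns the whole tree.
# not_owned_by is a hypothetical helper name.
not_owned_by() {
    dir=$1
    owner=$2
    find "$dir" ! -user "$owner" -print
}

# Example, using the path and user from the commands above:
#   cd /srv/deployment && not_owned_by cassandra deploy-service
```

If the function prints nothing after the `chown -R`, `git remote set-url` should be able to lock `.git/config` when scap runs as deploy-service.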

Cassandra failed to start properly because /etc/cassandra/service-enabled was missing; I had to touch this file.

Afterwards, Kask started to complain about keyspaces:

Jul 23 16:41:32 deployment-sessionstore06 docker-mediawiki-services-kask[92037]: {"msg":"error: failed to connect to \"[HostInfo hostname=\\\"172.16.2.> [...] =\\\"v4.1.5\\\" state=UP num_tokens=256]\" due to error: Keyspace 'sessions' does not exist","appname":"sessions","time":"2024-07-23T16:41:32Z","level">

Thought I'd have it fixed by creating the schema:

CQLSH_HOST=172.16.2.225 cqlsh -f cassandra_schema.cql -u cassandra -p cassandra

However, that uses the keyspace kask, not sessions. After a bit of fiddling, I have adjusted the new node's hiera to reflect the new keyspace.
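
A mismatch like this can be spotted before loading the schema by listing the keyspaces the CQL file creates. This is a sketch under assumptions: I am assuming the file contains a `CREATE KEYSPACE` statement whose first words sit on one line, and the helper name `schema_keyspaces` is hypothetical.

```shell
#!/bin/sh
# Print the keyspace names that a CQL file creates, one per line.
# Assumes statements shaped like "CREATE KEYSPACE [IF NOT EXISTS] name ..."
# with those leading words on a single line.
# schema_keyspaces is a hypothetical helper name.
schema_keyspaces() {
    awk 'toupper($1) == "CREATE" && toupper($2) == "KEYSPACE" {
        # Skip the optional "IF NOT EXISTS" clause.
        name = (toupper($3) == "IF") ? $6 : $3
        # Drop quoting and a trailing semicolon, if present.
        gsub(/[";]/, "", name)
        print name
    }' "$1"
}

# Example:
#   schema_keyspaces cassandra_schema.cql
```

Comparing that output with the keyspace named in the Kask error (`sessions`) shows whether the schema file or the node's hiera needs to change.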

And finally: the container has started.

I have run out of time, so I will not be migrating to the new node today.

Mentioned in SAL (#wikimedia-cloud) [2024-07-23T16:55:49Z] <Southparkfan> cancel kask maintenance, not going to perform switchover yet, see https://phabricator.wikimedia.org/T370461

Change #1056513 had a related patch set uploaded (by Southparkfan; author: Southparkfan):

[operations/mediawiki-config@master] LabsServices: convert more services to svc records

https://gerrit.wikimedia.org/r/1056513

Change #1056513 merged by Andrew Bogott:

[operations/mediawiki-config@master] LabsServices: convert more services to svc records

https://gerrit.wikimedia.org/r/1056513

Mentioned in SAL (#wikimedia-cloud) [2024-07-24T16:02:45Z] <Southparkfan> moved sessionstorage/kask from sessionstorage04 to sessionstorage06 T370461

deployment-sessionstore04 is no longer.