
Reimage WMCS db proxies to Bullseye
Closed, Resolved · Public · 3 Estimated Story Points

Description

We are in the process of reinstalling our production dbproxies to Bullseye.

Please re-install dbproxy1018 and dbproxy1019 (wiki-replicas proxies) to Bullseye.

Progress

  • dbproxy1018
  • dbproxy1019

Note: {T284923} is still pending; it would become invalid if this task is completed first.

Event Timeline

Restricted Application added a subscriber: Aklapper.
taavi renamed this task from Reimage WMCS to Bullseye to Reimage WMCS db proxies to Bullseye. · Jan 11 2022, 6:47 AM
Marostegui renamed this task from Reimage WMCS db proxies to Bullseye to Reimage wiki-replicas dbproxy* to Bullseye. · Jan 11 2022, 6:47 AM
taavi renamed this task from Reimage wiki-replicas dbproxy* to Bullseye to Reimage WMCS db proxies to Bullseye. · Jan 11 2022, 6:47 AM
taavi added a project: Data-Services.
Aklapper added a parent task: Restricted Task. · Jan 11 2022, 8:43 AM

Any ETA on this task or {T273278}?

Change 779915 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Use both dbproxy101[89] servers for both wikireplica services

https://gerrit.wikimedia.org/r/779915

If we merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/779915/, which @BTullis and I came up with, we can reimage these hosts as follows.

First, we pool dbproxy1019 into wikireplicas-a:

sudo confctl select service=wikireplicas-a,name=dbproxy1019.eqiad.wmnet set/pooled=yes

Then depool dbproxy1018:

sudo confctl select service=wikireplicas-a,name=dbproxy1018.eqiad.wmnet set/pooled=no

Briefly, dbproxy1019 will serve both the analytics and web loads. At this point we can reimage dbproxy1018 (as I understand it, the server is not running any persistent-storage services, so it only needs a standard sudo cookbook sre.hosts.reimage --os bullseye -t T298940 dbproxy1018).
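As a sanity check before reimaging, a read-back along these lines (selector syntax assumed from the commands above) should show dbproxy1019 pooled in both services:

sudo confctl select name=dbproxy1019.eqiad.wmnet get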

Once that's all done, we do the opposite: pool dbproxy1018 for both services and depool dbproxy1019:

sudo confctl select service=wikireplicas-a,name=dbproxy1018.eqiad.wmnet set/pooled=yes
sudo confctl select service=wikireplicas-a,name=dbproxy1019.eqiad.wmnet set/pooled=no
sudo confctl select service=wikireplicas-b,name=dbproxy1018.eqiad.wmnet set/pooled=yes
sudo confctl select service=wikireplicas-b,name=dbproxy1019.eqiad.wmnet set/pooled=no

Then we reimage dbproxy1019 and repool it into wikireplicas-b:

sudo confctl select service=wikireplicas-b,name=dbproxy1019.eqiad.wmnet set/pooled=yes

and finally depool dbproxy1018 from wikireplicas-b:

sudo confctl select service=wikireplicas-b,name=dbproxy1018.eqiad.wmnet set/pooled=no

and we're back to where we started, with minimal downtime and both hosts reimaged.
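For completeness, a final read-back along these lines (invocation assumed from the commands above) should show each host pooled only in its original service:

sudo confctl select service=wikireplicas-a get
sudo confctl select service=wikireplicas-b get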

Change 779915 merged by Razzi:

[operations/puppet@production] Use both dbproxy101[89] servers for both wikireplica services

https://gerrit.wikimedia.org/r/779915

I merged the related patch, but when I restarted pybal it caused an alert, so I'm waiting for input from the traffic team before proceeding: https://phabricator.wikimedia.org/T308174

Mentioned in SAL (#wikimedia-analytics) [2022-05-12T14:35:31Z] <razzi> razzi@cumin1001:~$ sudo confctl select service=wikireplicas-a,name=dbproxy1019.eqiad.wmnet set/pooled=yes # for T298940

Claiming this ticket as Razzi is leaving the Foundation.

Apologies for the delay in getting this done. I'm hoping to work on this ticket shortly.

BTullis set the point value for this task to 3. · Jul 4 2022, 11:04 AM
BTullis moved this task from Backlog to Sprint 01 on the Data-Engineering-Planning board.
EChetty raised the priority of this task from Medium to High. · Jul 4 2022, 11:38 AM

I'm planning to begin work on this ticket today.

I will reimage dbproxy1018 first, which is the analytics proxy and serves less traffic than dbproxy1019.

To bring dbproxy1018 out of service, we will run the following commands from puppetmaster1001.

sudo -i confctl select cluster=wikireplicas-a,name=dbproxy1019.eqiad.wmnet set/pooled=yes

...wait 30 seconds

sudo -i confctl select cluster=wikireplicas-a,name=dbproxy1018.eqiad.wmnet set/pooled=no
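To confirm that the change has taken effect, a read-back (assuming the same selector syntax used later in this task) would be:

sudo -i confctl select cluster=wikireplicas-a get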

Actually, I think it would be better to make sure that I have access to the wikireplicas via a toolforge account or similar before I go about this work.

I have now joined toolforge and have verified that I have access to the wikireplicas via that interface, so when pooling dbproxy1019 and depooling dbproxy1018 I can verify that the analytics replicas are still working by using the command:

mysql -h commonswiki.analytics.db.svc.wikimedia.cloud -e "SHOW DATABASES"

Proceeding to pool dbproxy1019 for wikireplicas-a now.

Sadly, it didn't work. When I pooled dbproxy1019 for the wikireplicas-a service it didn't respond to queries. My test command showed errors:

btullis@tools-sgebastion-10:~$ mysql -h commonswiki.analytics.db.svc.wikimedia.cloud -e "SHOW DATABASES"
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 11

I'm looking into why now.

Sincere thanks to @taavi who rapidly identified the cause and provided a fix: https://gerrit.wikimedia.org/r/c/operations/puppet/+/819090

I have now successfully depooled dbproxy1018 and I've verified that the analytics wikireplica service is still available from toolforge.

I'll proceed with the reimage of dbproxy1018 now.
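For reference, the reimage itself should just need the standard cookbook invocation mentioned earlier, run from a cumin host:

sudo cookbook sre.hosts.reimage --os bullseye -t T298940 dbproxy1018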

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host dbproxy1018.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host dbproxy1018.eqiad.wmnet with OS bullseye completed:

  • dbproxy1018 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208011541_btullis_1141032_dbproxy1018.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

The reimage of dbproxy1018 is complete and it is back in service. I will now proceed to reimage dbproxy1019.
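(Putting dbproxy1018 back in service on the wikireplicas-a side mirrors the earlier commands, roughly:

sudo -i confctl select cluster=wikireplicas-a,name=dbproxy1018.eqiad.wmnet set/pooled=yes
sudo -i confctl select cluster=wikireplicas-a,name=dbproxy1019.eqiad.wmnet set/pooled=no
)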

These are the confctl operations that I carried out.

Verify the current state of the wikireplicas-b cluster

btullis@puppetmaster1001:~$ sudo -i confctl select cluster=wikireplicas-b get
{"dbproxy1018.eqiad.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=wikireplicas-b,service=wikireplicas-b"}
{"dbproxy1019.eqiad.wmnet": {"weight": 0, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplicas-b,service=wikireplicas-b"}

Pool dbproxy1018 for the wikireplicas-b cluster

btullis@puppetmaster1001:~$ sudo -i confctl select cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet set/pooled=yes
eqiad/wikireplicas-b/wikireplicas-b/dbproxy1018.eqiad.wmnet: pooled changed inactive => yes
WARNING:conftool.announce:conftool action : set/pooled=yes; selector: cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet

Verify the current state again, to check that both servers are pooled

btullis@puppetmaster1001:~$ sudo -i confctl select cluster=wikireplicas-b get
{"dbproxy1018.eqiad.wmnet": {"weight": 0, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplicas-b,service=wikireplicas-b"}
{"dbproxy1019.eqiad.wmnet": {"weight": 0, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplicas-b,service=wikireplicas-b"}

Depool dbproxy1019 from this cluster

btullis@puppetmaster1001:~$ sudo -i confctl select cluster=wikireplicas-b,name=dbproxy1019.eqiad.wmnet set/pooled=no
eqiad/wikireplicas-b/wikireplicas-b/dbproxy1019.eqiad.wmnet: pooled changed yes => no
WARNING:conftool.announce:conftool action : set/pooled=no; selector: cluster=wikireplicas-b,name=dbproxy1019.eqiad.wmnet

Verify the current state a final time.

btullis@puppetmaster1001:~$ sudo -i confctl select cluster=wikireplicas-b get
{"dbproxy1018.eqiad.wmnet": {"weight": 0, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplicas-b,service=wikireplicas-b"}
{"dbproxy1019.eqiad.wmnet": {"weight": 0, "pooled": "no"}, "tags": "dc=eqiad,cluster=wikireplicas-b,service=wikireplicas-b"}

Check from toolforge that the realtime service is still working and is temporarily using the analytics back-end servers.

btullis@tools-sgebastion-10:~$ mysql -h commonswiki.web.db.svc.wikimedia.cloud -e "SELECT @@hostname"
+-------------+
| @@hostname  |
+-------------+
| clouddb1019 |
+-------------+

All fine, so I will now start the reimage of dbproxy1019.
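As before, the reimage should just need the standard cookbook invocation from a cumin host:

sudo cookbook sre.hosts.reimage --os bullseye -t T298940 dbproxy1019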

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host dbproxy1019.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host dbproxy1019.eqiad.wmnet with OS bullseye completed:

  • dbproxy1019 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208020936_btullis_1327955_dbproxy1019.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

BTullis moved this task from In progress to Done on the Data-Engineering-Planning (Sprint 02) board.

The reimage of dbproxy1019 also completed without incident. It has been put back into service and dbproxy1018 has been set back to inactive as before.
Marking this ticket as done.
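For reference, restoring the original wikireplicas-b state was a couple of confctl operations along the same lines as before, roughly:

sudo -i confctl select cluster=wikireplicas-b,name=dbproxy1019.eqiad.wmnet set/pooled=yes
sudo -i confctl select cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet set/pooled=inactive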