
Reimage WMCS db proxies to Bullseye
Closed, Resolved · Public · 3 Estimated Story Points

Description

We are in the process of reinstalling our production dbproxies to Bullseye.

Please re-install dbproxy1018 and dbproxy1019 (wiki-replicas proxies) to Bullseye.

Progress

  • dbproxy1018
  • dbproxy1019

Note: {T284923} is still pending; it would become invalid if this task is completed first.

Event Timeline

Restricted Application added a subscriber: Aklapper.
taavi renamed this task from Reimage WMCS to Bullseye to Reimage WMCS db proxies to Bullseye. · Jan 11 2022, 6:47 AM
Marostegui renamed this task from Reimage WMCS db proxies to Bullseye to Reimage wiki-replicas dbproxy* to Bullseye. · Jan 11 2022, 6:47 AM
taavi renamed this task from Reimage wiki-replicas dbproxy* to Bullseye to Reimage WMCS db proxies to Bullseye. · Jan 11 2022, 6:47 AM
taavi added a project: Data-Services.
Aklapper added a parent task: Restricted Task. · Jan 11 2022, 8:43 AM

Any ETA on this task or {T273278}?

Change 779915 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Use both dbproxy101[89] servers for both wikireplica services

https://gerrit.wikimedia.org/r/779915

If we merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/779915/, which @BTullis and I came up with, we can reimage these hosts as follows.

First, we pool dbproxy1019 into wikireplicas-a:

sudo confctl select service=wikireplicas-a,name=dbproxy1019.eqiad.wmnet set/pooled=yes

Then depool dbproxy1018:

sudo confctl select service=wikireplicas-a,name=dbproxy1018.eqiad.wmnet set/pooled=no

Briefly, dbproxy1019 will serve both the analytics and web loads. At this point we can reimage dbproxy1018 (as I understand it, the server is not running any persistent-storage services, so it only needs a standard sudo cookbook sre.hosts.reimage --os bullseye -t T298940 dbproxy1018).
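As a sanity check before reimaging, a read-back along these lines (selector syntax assumed from the commands above) should show dbproxy1019 pooled in both services:

sudo confctl select name=dbproxy1019.eqiad.wmnet get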

Once that's all done, we do the opposite: pool dbproxy1018 for both services and depool dbproxy1019:

sudo confctl select service=wikireplicas-a,name=dbproxy1018.eqiad.wmnet set/pooled=yes
sudo confctl select service=wikireplicas-a,name=dbproxy1019.eqiad.wmnet set/pooled=no
sudo confctl select service=wikireplicas-b,name=dbproxy1018.eqiad.wmnet set/pooled=yes
sudo confctl select service=wikireplicas-b,name=dbproxy1019.eqiad.wmnet set/pooled=no

Then we reimage dbproxy1019 and repool it into wikireplicas-b:

sudo confctl select service=wikireplicas-b,name=dbproxy1019.eqiad.wmnet set/pooled=yes

and finally depool dbproxy1018 from wikireplicas-b:

sudo confctl select service=wikireplicas-b,name=dbproxy1018.eqiad.wmnet set/pooled=no

and we're back to where we started, with minimal downtime and both hosts reimaged.
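For completeness, a final read-back along these lines (invocation assumed from the commands above) should show each host pooled only in its original service:

sudo confctl select service=wikireplicas-a get
sudo confctl select service=wikireplicas-b get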

Change 779915 merged by Razzi:

[operations/puppet@production] Use both dbproxy101[89] servers for both wikireplica services

https://gerrit.wikimedia.org/r/779915

I merged the related patch, but when I restarted pybal it caused an alert, so I'm waiting for input from the traffic team before proceeding: https://phabricator.wikimedia.org/T308174

Mentioned in SAL (#wikimedia-analytics) [2022-05-12T14:35:31Z] <razzi> razzi@cumin1001:~$ sudo confctl select service=wikireplicas-a,name=dbproxy1019.eqiad.wmnet set/pooled=yes # for T298940

Claiming this ticket as Razzi is leaving the Foundation.

Apologies for the delay in getting this done. I'm hoping to work on this ticket shortly.

BTullis set the point value for this task to 3. · Jul 4 2022, 11:04 AM
BTullis moved this task from Backlog to Sprint 01 on the Data-Engineering-Planning board.
EChetty raised the priority of this task from Medium to High. · Jul 4 2022, 11:38 AM

I'm planning to begin work on this ticket today.

I will reimage dbproxy1018 first, which is the analytics proxy and serves less traffic than dbproxy1019.

To bring dbproxy1018 out of service, we will run the following commands from puppetmaster1001.

sudo -i confctl select cluster=wikireplicas-a,name=dbproxy1019.eqiad.wmnet set/pooled=yes

...wait 30 seconds

sudo -i confctl select cluster=wikireplicas-a,name=dbproxy1018.eqiad.wmnet set/pooled=no
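To confirm that the change has taken effect, a read-back (assuming the same selector syntax used later in this task) would be:

sudo -i confctl select cluster=wikireplicas-a get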

Actually, I think it would be better to make sure that I have access to the wikireplicas via a toolforge account or similar before I go about this work.

I have now joined toolforge and have verified that I have access to the wikireplicas via that interface, so when pooling dbproxy1019 and depooling dbproxy1018 I can verify that the analytics replicas are still working by using the command:

mysql -h commonswiki.analytics.db.svc.wikimedia.cloud -e "SHOW DATABASES"

Proceeding to pool dbproxy1019 for wikireplicas-a now.

Sadly, it didn't work. When I pooled dbproxy1019 for the wikireplicas-a service it didn't respond to queries. My test command showed errors:

btullis@tools-sgebastion-10:~$ mysql -h commonswiki.analytics.db.svc.wikimedia.cloud -e "SHOW DATABASES"
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 11

I'm looking into why now.

Sincere thanks to @taavi who rapidly identified the cause and provided a fix: https://gerrit.wikimedia.org/r/c/operations/puppet/+/819090

I have now successfully depooled dbproxy1018 and I've verified that the analytics wikireplica service is still available from toolforge.

I'll proceed with the reimage of dbproxy1018 now.
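For reference, the reimage itself should just need the standard cookbook invocation mentioned earlier, run from a cumin host:

sudo cookbook sre.hosts.reimage --os bullseye -t T298940 dbproxy1018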

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host dbproxy1018.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host dbproxy1018.eqiad.wmnet with OS bullseye completed:

  • dbproxy1018 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208011541_btullis_1141032_dbproxy1018.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

The reimage of dbproxy1018 is complete and it is back in service. I will now proceed to reimage dbproxy1019.
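(Putting dbproxy1018 back in service on the wikireplicas-a side mirrors the earlier commands, roughly:

sudo -i confctl select cluster=wikireplicas-a,name=dbproxy1018.eqiad.wmnet set/pooled=yes
sudo -i confctl select cluster=wikireplicas-a,name=dbproxy1019.eqiad.wmnet set/pooled=no
)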

These are the confctl operations that I carried out.

Verify the current state of the wikireplicas-b cluster

btullis@puppetmaster1001:~$ sudo -i confctl select cluster=wikireplicas-b get
{"dbproxy1018.eqiad.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=wikireplicas-b,service=wikireplicas-b"}
{"dbproxy1019.eqiad.wmnet": {"weight": 0, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplicas-b,service=wikireplicas-b"}

Pool dbproxy1018 for the wikireplicas-b cluster

btullis@puppetmaster1001:~$ sudo -i confctl select cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet set/pooled=yes
eqiad/wikireplicas-b/wikireplicas-b/dbproxy1018.eqiad.wmnet: pooled changed inactive => yes
WARNING:conftool.announce:conftool action : set/pooled=yes; selector: cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet

Verify the current state again, to check that both servers are pooled

btullis@puppetmaster1001:~$ sudo -i confctl select cluster=wikireplicas-b get
{"dbproxy1018.eqiad.wmnet": {"weight": 0, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplicas-b,service=wikireplicas-b"}
{"dbproxy1019.eqiad.wmnet": {"weight": 0, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplicas-b,service=wikireplicas-b"}

Depool dbproxy1019 from this cluster

btullis@puppetmaster1001:~$ sudo -i confctl select cluster=wikireplicas-b,name=dbproxy1019.eqiad.wmnet set/pooled=no
eqiad/wikireplicas-b/wikireplicas-b/dbproxy1019.eqiad.wmnet: pooled changed yes => no
WARNING:conftool.announce:conftool action : set/pooled=no; selector: cluster=wikireplicas-b,name=dbproxy1019.eqiad.wmnet

Verify the current state a final time.

btullis@puppetmaster1001:~$ sudo -i confctl select cluster=wikireplicas-b get
{"dbproxy1018.eqiad.wmnet": {"weight": 0, "pooled": "yes"}, "tags": "dc=eqiad,cluster=wikireplicas-b,service=wikireplicas-b"}
{"dbproxy1019.eqiad.wmnet": {"weight": 0, "pooled": "no"}, "tags": "dc=eqiad,cluster=wikireplicas-b,service=wikireplicas-b"}

Check from toolforge that the realtime service is still working and is temporarily using the analytics back-end servers.

btullis@tools-sgebastion-10:~$ mysql -h commonswiki.web.db.svc.wikimedia.cloud -e "SELECT @@hostname"
+-------------+
| @@hostname  |
+-------------+
| clouddb1019 |
+-------------+

All fine, so I will now start the reimage of dbproxy1019.
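As before, the reimage should just need the standard cookbook invocation from a cumin host:

sudo cookbook sre.hosts.reimage --os bullseye -t T298940 dbproxy1019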

Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host dbproxy1019.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host dbproxy1019.eqiad.wmnet with OS bullseye completed:

  • dbproxy1019 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208020936_btullis_1327955_dbproxy1019.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

BTullis moved this task from In progress to Done on the Data-Engineering-Planning (Sprint 02) board.

The reimage of dbproxy1019 also completed without incident. It has been put back into service and dbproxy1018 has been set back to inactive as before.
Marking this ticket as done.
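For reference, restoring the original wikireplicas-b state was a couple of confctl operations along the same lines as before, roughly:

sudo -i confctl select cluster=wikireplicas-b,name=dbproxy1019.eqiad.wmnet set/pooled=yes
sudo -i confctl select cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet set/pooled=inactive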