
Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020
Closed, ResolvedPublic

Description

Replace physical hosts labsdb100[45] (ToolsDB) and labsdb100[67] (OpenStreetMaps) with 4 virtualized hosts created manually on cloudvirt10(19|20).

When it came time to refresh these physical hosts, several of us met and decided the most practical option was to order large dedicated cloudvirt hosts and replace the physical hosts with large instances that we could manage more efficiently. The DBA team agreed to continue helping manage these in the virtual context. This means using floating IPs so that tendril can manage these like other databases in production.

Details

Related Gerrit Patches:
[operations/puppet@production] osm: Add a cloud-internal address for the osmdb cluster
[operations/puppet@production] osmdb: Switch the replica to the VM that needs to become the master
[operations/puppet@production] wikilabels: Update toolschecker to monitor the live DB
[operations/puppet@production] osmdb: Switch the replica to the VM that needs to become the master
[operations/puppet@production] osmdb: refactor the password framework to not use the module
[operations/puppet@production] osmdb: stage the roles and profiles for virtualizing the servers
[operations/puppet@production] wikilabels: stage the postgres roles for virtualizing the database
[operations/puppet@production] cloudvirt1020: Network config
[operations/puppet@production] toolsdb: Enable monitoring
[operations/puppet@production] toolsdb: Point tools-db.eqiad.wmflabs to clouddb1001
[operations/puppet@production] toolschecker: Replace labsdb1005 with clouddb1001
[operations/puppet@production] maintain_dbusers: add the new database VM
[operations/puppet@production] cloudvps: refresh FQDN A record for tools.db.svc.eqiad.wmflabs
[operations/puppet@production] wmcs: introduce new toolsdb primary role
[operations/puppet@production] toolsdb: refactoring some of the mariadb things for toolsdb
[operations/puppet@production] hiera: cloudvirt1009: fix interface name
[operations/puppet@production] cloudvirt1019 - Fix network config
[operations/puppet@production] cloudvirt1019/1020 - Reimage with Stretch
[operations/puppet@production] labvirt partman: Move labvirt1019-1022 to the standard labvirt partman recipe

Related Objects

Status | Assigned
Open | None
Open | None
Open | None
Open | None
Open | None
Open | None
Open | aborrero
Open | aborrero
Resolved | chasemp
Resolved | Bstorm
Declined | None
Resolved | Bstorm
Resolved | Cmjohnson
Resolved | Cmjohnson
Resolved | aborrero
Declined | None
Declined | None
Resolved | Jclark-ctr
Resolved | Halfak
Resolved | Halfak
Resolved | Bstorm
Open | None

Event Timeline

There are a very large number of changes, so older changes are hidden.

Mentioned in SAL (#wikimedia-cloud) [2019-02-16T13:43:07Z] <arturo> T193264 create 'clouddb-services-puppetmaster-01' instance

Mentioned in SAL (#wikimedia-cloud) [2019-02-16T13:47:08Z] <arturo> T193264 create 'clouddb-services-puppetmaster' puppet prefix to store puppet/hiera config for this project puppetmaster

Mentioned in SAL (#wikimedia-cloud) [2019-02-16T13:54:54Z] <arturo> T193264 create 'clouddb10' puppet prefix to store puppet/hiera config for database servers in this project

Mentioned in SAL (#wikimedia-cloud) [2019-02-16T13:59:18Z] <arturo> T193264 switched clouddb1001/1004 to the new project local puppetmaster

Change 491003 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: introduce new toolsdb primary role

https://gerrit.wikimedia.org/r/491003

Change 491005 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: refresh FQDN A record for tools.db.svc.eqiad.wmflabs

https://gerrit.wikimedia.org/r/491005

Change 491013 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] maintain_dbusers: add the new database VM

https://gerrit.wikimedia.org/r/491013

Change 491003 merged by Bstorm:
[operations/puppet@production] wmcs: introduce new toolsdb primary role

https://gerrit.wikimedia.org/r/491003

Mentioned in SAL (#wikimedia-cloud) [2019-02-17T18:54:15Z] <arturo> T193264 create VM clouddb-services-01 for PoC of running maintain-dbusers from here

Mentioned in SAL (#wikimedia-cloud) [2019-02-17T19:16:19Z] <arturo> T193264 delete VM clouddb-services-01

I have been talking to @aborrero about the new instance on clouddb1001, and I have been taking a general look.
While comparing the grants, I realised that clouddb1001 is missing a grant for the user s52716 (that grant exists on labsdb1005); it could be a new user. I can easily copy that grant over to clouddb1001, but I want the green light from @Bstorm just in case this has something to do with maintain-dbusers or something :-)

This should actually be a good test of switching maintain-dbusers to managing the new clouddb1001 grants. We should see it automatically detect the missing grant and create it.
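What maintain-dbusers would be checking here boils down to a set difference between the users granted on the old host and on the new one. A minimal illustrative sketch (the user lists are sample data based on this task; the real tool talks to MySQL and manages credentials, not plain sets):

```python
# Hypothetical model of grant reconciliation: find users that have a grant
# on the old host but not yet on the new one. maintain-dbusers does this
# against live MySQL servers; here the comparison is shown with plain sets.

def missing_grants(old_host_users, new_host_users):
    """Return users granted on the old host but absent on the new host."""
    return set(old_host_users) - set(new_host_users)

# Sample data: s52716 exists on labsdb1005 but not on clouddb1001.
labsdb1005_users = {"s52716", "s51234", "u_example"}
clouddb1001_users = {"s51234", "u_example"}

print(missing_grants(labsdb1005_users, clouddb1001_users))  # {'s52716'}
```

In the scenario above, the tool would detect exactly that one missing user and recreate its grant automatically.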

Mentioned in SAL (#wikimedia-cloud) [2019-02-18T17:55:13Z] <arturo> (jaime T193264) set clouddb1001 in read_only=1

Change 491005 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudvps: refresh FQDN A record for tools.db.svc.eqiad.wmflabs

https://gerrit.wikimedia.org/r/491005

Mentioned in SAL (#wikimedia-cloud) [2019-02-18T18:12:05Z] <arturo> T193264 pointing tools.db.svc.eqiad.wmflabs to clouddb1001

Mentioned in SAL (#wikimedia-cloud) [2019-02-18T18:26:22Z] <arturo> (jaime T193264) setting clouddb1001 in read_write mode
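The three SAL entries above follow a strict cutover order: freeze writes on the target (read_only=1), repoint the service name, then re-enable writes. A small sketch modelling that ordering constraint (the step recording is illustrative; the real cutover runs SQL statements and a DNS update):

```python
# Illustrative model of the read-only cutover sequence used above:
#   1) SET GLOBAL read_only=1 on clouddb1001
#   2) repoint tools.db.svc.eqiad.wmflabs at clouddb1001
#   3) SET GLOBAL read_only=0
# Each step just records itself so the ordering rule can be enforced.

class Cutover:
    def __init__(self):
        self.steps = []

    def set_read_only(self):
        self.steps.append("read_only=1")

    def repoint_dns(self, fqdn, target):
        # Moving the name before freezing writes would risk split writes.
        if "read_only=1" not in self.steps:
            raise RuntimeError("must freeze writes before moving the name")
        self.steps.append(f"dns {fqdn} -> {target}")

    def set_read_write(self):
        self.steps.append("read_only=0")

c = Cutover()
c.set_read_only()
c.repoint_dns("tools.db.svc.eqiad.wmflabs", "clouddb1001")
c.set_read_write()
print(c.steps)
```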

Change 491013 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] maintain_dbusers: add the new database VM

https://gerrit.wikimedia.org/r/491013

Change 491290 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] toolschecker: Replace labsdb1005 with clouddb1001

https://gerrit.wikimedia.org/r/491290

Change 491290 merged by GTirloni:
[operations/puppet@production] toolschecker: Replace labsdb1005 with clouddb1001

https://gerrit.wikimedia.org/r/491290

Change 491294 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] toolsdb: Enable monitoring

https://gerrit.wikimedia.org/r/491294

Change 491296 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] toolsdb: Point tools-db.eqiad.wmflabs to clouddb1001

https://gerrit.wikimedia.org/r/491296

Change 491296 merged by GTirloni:
[operations/puppet@production] toolsdb: Point tools-db.eqiad.wmflabs to clouddb1001

https://gerrit.wikimedia.org/r/491296

Change 491294 merged by GTirloni:
[operations/puppet@production] toolsdb: Enable monitoring

https://gerrit.wikimedia.org/r/491294

Change 491825 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] cloudvirt1020: Network config

https://gerrit.wikimedia.org/r/491825

Change 491825 merged by GTirloni:
[operations/puppet@production] cloudvirt1020: Network config

https://gerrit.wikimedia.org/r/491825

cloudvirt1020 has been reimaged with Stretch and RAID configuration contains 2 spares now.

All better, moving on!

Created clouddb1002/3 on cloudvirt1020

Created the data volume on clouddb1002 and /srv/labsdb so that the replica data can be loaded in there.

Bstorm claimed this task. Feb 21 2019, 11:56 PM
Bstorm moved this task from Epics to Doing on the cloud-services-team (Kanban) board.

Ok, it is quite interesting to note that the LVM configurations of labsdb1004 and labsdb1005 are quite dissimilar.
On labsdb1005:

/dev/mapper/tank-data   19T  1.9T   17T  11% /srv

On labsdb1004:

/dev/mapper/labsdb1004--vg-postgres 1008G  258G  700G  27% /srv/postgres
/dev/mapper/labsdb1004--vg-labsdb    2.3T  1.5T  888G  63% /srv/labsdb

clouddb1001 has 3.36 TB. I'd originally provisioned clouddb1002 the same way (as the replacement for labsdb1004), but it looks like that isn't a good match. I'll remove the puppet role and redo the LVM on clouddb1002 to be a bit closer. I'd rather not have wikilabels impacted by toolsdb and vice versa. This does make it seem like wikilabels ought to be on its own server, though.

Bstorm added a subscriber: Andrew. Edited Feb 22 2019, 12:43 AM

Ah, another interesting problem. Since we are now pulling two disks out for spares, this virt doesn't have as much space as it did. The data volume of one of these DB instances was originally slated to be ~3 TB. That might get tight now. @Andrew how much of the disk is copy-on-write for this stuff? I don't want to rely too much on oversubscription around DBs.

I think I need to make a "smalldb" flavor for this that has smaller disks for the OSMdb replacements to make it all work. That only has to be 1.6 TB:
/dev/mapper/labsdb1006--vg-srv 1.6T 1.2T 375G 76% /srv

That should leave enough as long as I *don't* move wikilabels to a different server. I do hate that the replica is so different from the toolsdb primary, though.

Made a new mediumdb flavor for the osmdb servers. clouddb1003 and clouddb1004 should be rebuilt before moving the database using that flavor.

The new VMs are up on the right places with the right flavor.

Ok, clouddb1002 is configured to roughly match labsdb1004 now. I've also added 7.5 GB of swap from LVM for both 1 and 2 in the cluster as a safety measure (and using a similar configuration to the original hardware).

Mentioned in SAL (#wikimedia-operations) [2019-02-27T19:43:36Z] <bstorm_> downtimed labsdb1004 to stop mysql for transferring data for T193264

Mentioned in SAL (#wikimedia-operations) [2019-02-27T19:49:31Z] <bstorm_> stopped slave on labsdb1004 for T193264

Stopped the slave at:

Relay_Master_Log_File: log.167066
Exec_Master_Log_Pos: 104666280

Syncing the data over from labsdb1004 to clouddb1002.
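Recording `Relay_Master_Log_File` and `Exec_Master_Log_Pos` before the copy is what lets replication restart from the same point on the new host. A hedged sketch of pulling those two fields out of `SHOW SLAVE STATUS\G` text output, using the coordinates recorded above as sample input (the parsing helper is illustrative, not part of any WMF tooling):

```python
# Extract the two binlog coordinates needed to restart replication
# from a SHOW SLAVE STATUS\G text dump. The sample uses the values
# recorded in this task; the parsing itself is a generic sketch.

SAMPLE = """
        Relay_Master_Log_File: log.167066
          Exec_Master_Log_Pos: 104666280
"""

def binlog_coordinates(status_text):
    """Return (log_file, log_pos) from SHOW SLAVE STATUS\\G output."""
    coords = {}
    for line in status_text.splitlines():
        line = line.strip()
        if line.startswith(("Relay_Master_Log_File:", "Exec_Master_Log_Pos:")):
            key, _, value = line.partition(":")
            coords[key] = value.strip()
    return coords["Relay_Master_Log_File"], int(coords["Exec_Master_Log_Pos"])

log_file, log_pos = binlog_coordinates(SAMPLE)
print(log_file, log_pos)  # log.167066 104666280
```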

Mentioned in SAL (#wikimedia-operations) [2019-02-28T02:08:48Z] <bstorm_> clouddb1002 is now in place to replace labsdb1004 as replica for toolsdb but not wikilabels postgres yet T193264

Change 493608 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikilabels: stage the postgres roles for virtualizing the database

https://gerrit.wikimedia.org/r/493608

Bstorm added a subscriber: Halfak. Mar 1 2019, 4:33 PM

@Halfak I am going to need to move the wikilabels database soon to the new server (moving from labsdb1004 -> clouddb1002.clouddb-services.eqiad.wmflabs). The server will be prepped once I sort out the puppet stuff I just put up for review. It's not ready just yet, so this is just a heads up that I'll need to sort out a time for the DB to be down for a while to dump it out, transfer it, and stand it back up at the other end. I imagine this might require some changes in wikilabels as well.

Does that only depend on DNS for labsdb1004 directly?

Change 493608 merged by Bstorm:
[operations/puppet@production] wikilabels: stage the postgres roles for virtualizing the database

https://gerrit.wikimedia.org/r/493608

Bstorm added a comment. Mar 1 2019, 9:25 PM

Postgres is now running on clouddb1002 on the correct data volume. It just needs the data moved over (and any changes on the apps etc).

Change 493769 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] osmdb: stage the roles and profiles for virtualizing the servers

https://gerrit.wikimedia.org/r/493769

Change 493769 merged by Bstorm:
[operations/puppet@production] osmdb: stage the roles and profiles for virtualizing the servers

https://gerrit.wikimedia.org/r/493769

Restricted Application added a project: Scoring-platform-team. Mar 6 2019, 3:51 PM
Bstorm added a comment. Mar 6 2019, 3:51 PM

Also adding the wikilabels tag since this is going to need coordination to move that database.

Change 494771 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] osm: Add a cloud-internal address for the osmdb cluster

https://gerrit.wikimedia.org/r/494771

Change 494843 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] osmdb: refactor the password framework to not use the module

https://gerrit.wikimedia.org/r/494843

Change 494843 merged by Bstorm:
[operations/puppet@production] osmdb: refactor the password framework to not use the module

https://gerrit.wikimedia.org/r/494843

Change 495290 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] osmdb: Switch the replica to the VM that needs to become the master

https://gerrit.wikimedia.org/r/495290

@Bstorm what is left to be able to decommission labsdb1004 and labsdb1005? Is it just T217922: Migrate Wikilabels from labsdb1004 to clouddb1002, or is there something else missing?
Just trying to understand what those hosts are still used for :-)

Bstorm added a comment. Edited Mar 12 2019, 3:36 PM

@Marostegui, It's just that and any activity remaining for T216441.

Harej raised the priority of this task from Medium to High. Mar 19 2019, 9:11 PM
Harej moved this task from Untriaged to Monitor on the Scoring-platform-team board.
GTirloni removed a subscriber: GTirloni. Mar 21 2019, 9:06 PM

Got the role working on the replacement osmdb primary. Going to try to establish replication to it.

Change 494771 had a related patch set uploaded (by Andrew Bogott; owner: Bstorm):
[operations/puppet@production] osm: Add a cloud-internal address for the osmdb cluster

https://gerrit.wikimedia.org/r/494771

Change 495290 merged by Bstorm:
[operations/puppet@production] osmdb: Switch the replica to the VM that needs to become the master

https://gerrit.wikimedia.org/r/495290

Reverted that change since I forgot to complete a pg_basebackup first. That is now running.

Mentioned in SAL (#wikimedia-cloud) [2019-03-28T19:32:34Z] <bstorm_> pg_basebackup started on clouddb1003 for osmdb T193264
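pg_basebackup streams a full physical copy of the primary's data directory, which is what seeds the new osmdb replica here. A hedged sketch of assembling such an invocation (the primary hostname, data directory, and role name below are illustrative assumptions; the actual flags used on clouddb1003 are not recorded in this task):

```python
# Build a pg_basebackup command line for seeding a streaming replica.
# Host, data directory, and role are illustrative guesses, not taken
# from the task.

def basebackup_cmd(primary_host, data_dir, user="replication"):
    return [
        "pg_basebackup",
        "-h", primary_host,  # primary to copy from
        "-U", user,          # replication role on the primary
        "-D", data_dir,      # empty target data directory on the replica
        "-X", "stream",      # stream WAL alongside the base copy
        "-P",                # show progress
    ]

cmd = basebackup_cmd("labsdb1006.eqiad.wmnet", "/srv/postgresql/main")
print(" ".join(cmd))
```

Streaming WAL during the copy (`-X stream`) means the backup is consistent on its own, so the replica can start catching up as soon as recovery settings point it back at the primary.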

Change 499910 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikilabels: Update toolschecker to monitor the live DB

https://gerrit.wikimedia.org/r/499910

Change 499910 merged by Bstorm:
[operations/puppet@production] wikilabels: Update toolschecker to monitor the live DB

https://gerrit.wikimedia.org/r/499910

Change 499940 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] osmdb: Switch the replica to the VM that needs to become the master

https://gerrit.wikimedia.org/r/499940

Change 499940 merged by Bstorm:
[operations/puppet@production] osmdb: Switch the replica to the VM that needs to become the master

https://gerrit.wikimedia.org/r/499940

Mentioned in SAL (#wikimedia-cloud) [2019-03-28T23:17:06Z] <bstorm_> clouddb1003 is now a full osmdb replica T193264

Change 494771 merged by Bstorm:
[operations/puppet@production] osm: Add a cloud-internal address for the osmdb cluster

https://gerrit.wikimedia.org/r/494771

Mentioned in SAL (#wikimedia-cloud) [2019-03-29T00:00:41Z] <bstorm_> T193264 Added osm.db.svc.eqiad.wmflabs to cloud DNS

Bstorm added a comment. Edited Apr 4 2019, 6:49 PM

osmdb is now on VMs.

Bstorm closed this task as Resolved. May 22 2019, 5:11 PM

This is done!