
Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020
Closed, ResolvedPublic

Description

Replace physical hosts labsdb100[45] (ToolsDB) and labsdb100[67] (OpenStreetMaps) with 4 virtualized hosts created manually on cloudvirt10(19|20).

When it came time to refresh these physical hosts, several of us met and decided the most practical option was to order large dedicated cloudvirt hosts and replace the physical hosts with large instances that we could manage more efficiently. The DBA team agreed to continue helping manage these in the virtual context. This means using floating IPs so that tendril can manage these like other databases in production.

Details

Related Gerrit Patches:
[operations/puppet@production] osm: Add a cloud-internal address for the osmdb cluster
[operations/puppet@production] osmdb: Switch the replica to the VM that needs to become the master
[operations/puppet@production] wikilabels: Update toolschecker to monitor the live DB
[operations/puppet@production] osmdb: Switch the replica to the VM that needs to become the master
[operations/puppet@production] osmdb: refactor the password framework to not use the module
[operations/puppet@production] osmdb: stage the roles and profiles for virtualizing the servers
[operations/puppet@production] wikilabels: stage the postgres roles for virtualizing the database
[operations/puppet@production] cloudvirt1020: Network config
[operations/puppet@production] toolsdb: Enable monitoring
[operations/puppet@production] toolsdb: Point tools-db.eqiad.wmflabs to clouddb1001
[operations/puppet@production] toolschecker: Replace labsdb1005 with clouddb1001
[operations/puppet@production] maintain_dbusers: add the new database VM
[operations/puppet@production] cloudvps: refresh FQDN A record for tools.db.svc.eqiad.wmflabs
[operations/puppet@production] wmcs: introduce new toolsdb primary role
[operations/puppet@production] toolsdb: refactoring some of the mariadb things for toolsdb
[operations/puppet@production] hiera: cloudvirt1009: fix interface name
[operations/puppet@production] cloudvirt1019 - Fix network config
[operations/puppet@production] cloudvirt1019/1020 - Reimage with Stretch
[operations/puppet@production] labvirt partman: Move labvirt1019-1022 to the standard labvirt partman recipe

Related Objects

Status | Assigned
Open | None
Open | None
Open | None
Open | None
Open | None
Open | None
Open | aborrero
Open | aborrero
Resolved | chasemp
Resolved | Bstorm
Declined | None
Resolved | Bstorm
Resolved | Cmjohnson
Resolved | Cmjohnson
Resolved | aborrero
Declined | None
Declined | None
Resolved | Jclark-ctr
Resolved | Halfak
Resolved | Halfak
Resolved | Bstorm
Open | None

Event Timeline

There are a very large number of changes, so older changes are hidden.

Mentioned in SAL (#wikimedia-cloud) [2019-02-16T13:43:07Z] <arturo> T193264 create 'clouddb-services-puppetmaster-01' instance

Mentioned in SAL (#wikimedia-cloud) [2019-02-16T13:47:08Z] <arturo> T193264 create 'clouddb-services-puppetmaster' puppet prefix to store puppet/hiera config for this project puppetmaster

Mentioned in SAL (#wikimedia-cloud) [2019-02-16T13:54:54Z] <arturo> T193264 create 'clouddb10' puppet prefix to store puppet/hiera config for database servers in this project

Mentioned in SAL (#wikimedia-cloud) [2019-02-16T13:59:18Z] <arturo> T193264 switched clouddb1001/1004 to the new project local puppetmaster

Change 491003 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] wmcs: introduce new toolsdb primary role

https://gerrit.wikimedia.org/r/491003

Change 491005 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudvps: refresh FQDN A record for tools.db.svc.eqiad.wmflabs

https://gerrit.wikimedia.org/r/491005

Change 491013 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] maintain_dbusers: add the new database VM

https://gerrit.wikimedia.org/r/491013

Change 491003 merged by Bstorm:
[operations/puppet@production] wmcs: introduce new toolsdb primary role

https://gerrit.wikimedia.org/r/491003

Mentioned in SAL (#wikimedia-cloud) [2019-02-17T18:54:15Z] <arturo> T193264 create VM clouddb-services-01 for PoC of running maintain-dbusers from here

Mentioned in SAL (#wikimedia-cloud) [2019-02-17T19:16:19Z] <arturo> T193264 delete VM clouddb-services-01

I have been talking to @aborrero about the new instance on clouddb1001, and I have been taking a general look.
While comparing the grants, I realised that clouddb1001 is missing a grant for the user s52716 (that grant exists on labsdb1005); it could be a new user. I can easily copy that grant over to clouddb1001, but I want the green light from @Bstorm just in case this has something to do with maintain-dbusers or something :-)

This should actually be a good test of switching maintain-dbusers to managing the new clouddb1001 grants. We should see it automatically detect the missing grant and create it.
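What maintain-dbusers would be checking here boils down to a set difference between the users granted on the old host and on the new one. A minimal illustrative sketch (the user lists are sample data based on this task; the real tool talks to MySQL and manages credentials, not plain sets):

```python
# Hypothetical model of grant reconciliation: find users that have a grant
# on the old host but not yet on the new one. maintain-dbusers does this
# against live MySQL servers; here the comparison is shown with plain sets.

def missing_grants(old_host_users, new_host_users):
    """Return users granted on the old host but absent on the new host."""
    return set(old_host_users) - set(new_host_users)

# Sample data: s52716 exists on labsdb1005 but not on clouddb1001.
labsdb1005_users = {"s52716", "s51234", "u_example"}
clouddb1001_users = {"s51234", "u_example"}

print(missing_grants(labsdb1005_users, clouddb1001_users))  # {'s52716'}
```

In the scenario above, the tool would detect exactly that one missing user and recreate its grant automatically.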

Mentioned in SAL (#wikimedia-cloud) [2019-02-18T17:55:13Z] <arturo> (jaime T193264) set clouddb1001 in read_only=1

Change 491005 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloudvps: refresh FQDN A record for tools.db.svc.eqiad.wmflabs

https://gerrit.wikimedia.org/r/491005

Mentioned in SAL (#wikimedia-cloud) [2019-02-18T18:12:05Z] <arturo> T193264 pointing tools.db.svc.eqiad.wmflabs to clouddb1001

Mentioned in SAL (#wikimedia-cloud) [2019-02-18T18:26:22Z] <arturo> (jaime T193264) setting clouddb1001 in read_write mode
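The three SAL entries above follow a strict cutover order: freeze writes on the target (read_only=1), repoint the service name, then re-enable writes. A small sketch modelling that ordering constraint (the step recording is illustrative; the real cutover runs SQL statements and a DNS update):

```python
# Illustrative model of the read-only cutover sequence used above:
#   1) SET GLOBAL read_only=1 on clouddb1001
#   2) repoint tools.db.svc.eqiad.wmflabs at clouddb1001
#   3) SET GLOBAL read_only=0
# Each step just records itself so the ordering rule can be enforced.

class Cutover:
    def __init__(self):
        self.steps = []

    def set_read_only(self):
        self.steps.append("read_only=1")

    def repoint_dns(self, fqdn, target):
        # Moving the name before freezing writes would risk split writes.
        if "read_only=1" not in self.steps:
            raise RuntimeError("must freeze writes before moving the name")
        self.steps.append(f"dns {fqdn} -> {target}")

    def set_read_write(self):
        self.steps.append("read_only=0")

c = Cutover()
c.set_read_only()
c.repoint_dns("tools.db.svc.eqiad.wmflabs", "clouddb1001")
c.set_read_write()
print(c.steps)
```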

Change 491013 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] maintain_dbusers: add the new database VM

https://gerrit.wikimedia.org/r/491013

Change 491290 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] toolschecker: Replace labsdb1005 with clouddb1001

https://gerrit.wikimedia.org/r/491290

Change 491290 merged by GTirloni:
[operations/puppet@production] toolschecker: Replace labsdb1005 with clouddb1001

https://gerrit.wikimedia.org/r/491290

Change 491294 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] toolsdb: Enable monitoring

https://gerrit.wikimedia.org/r/491294

Change 491296 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] toolsdb: Point tools-db.eqiad.wmflabs to clouddb1001

https://gerrit.wikimedia.org/r/491296

Change 491296 merged by GTirloni:
[operations/puppet@production] toolsdb: Point tools-db.eqiad.wmflabs to clouddb1001

https://gerrit.wikimedia.org/r/491296

Change 491294 merged by GTirloni:
[operations/puppet@production] toolsdb: Enable monitoring

https://gerrit.wikimedia.org/r/491294

Change 491825 had a related patch set uploaded (by GTirloni; owner: GTirloni):
[operations/puppet@production] cloudvirt1020: Network config

https://gerrit.wikimedia.org/r/491825

Change 491825 merged by GTirloni:
[operations/puppet@production] cloudvirt1020: Network config

https://gerrit.wikimedia.org/r/491825

cloudvirt1020 has been reimaged with Stretch and RAID configuration contains 2 spares now.

All better, moving on!

Created clouddb1002/3 on cloudvirt1020

Created the data volume on clouddb1002 and /srv/labsdb so that the replica data can be loaded in there.

Bstorm claimed this task. Feb 21 2019, 11:56 PM
Bstorm moved this task from Epics to Doing on the cloud-services-team (Kanban) board.

Ok, it is quite interesting to note that the LVM configurations of labsdb1004 and labsdb1005 are quite dissimilar.
On labsdb1005:

/dev/mapper/tank-data   19T  1.9T   17T  11% /srv

On labsdb1004:

/dev/mapper/labsdb1004--vg-postgres 1008G  258G  700G  27% /srv/postgres
/dev/mapper/labsdb1004--vg-labsdb    2.3T  1.5T  888G  63% /srv/labsdb

clouddb1001 has 3.36 TB. I'd originally provisioned clouddb1002 the same way (as the replacement for labsdb1004), but it looks like that isn't a good match. I'll remove the puppet role and redo the LVM on clouddb1002 to be a bit closer. I'd rather not have wikilabels impacted by toolsdb and vice versa. This does make it seem like wikilabels ought to be on its own server, though.

Bstorm added a subscriber: Andrew. Edited Feb 22 2019, 12:43 AM

Ah, another interesting problem. Since we are now pulling two disks out for spares, this virt doesn't have as much space as it did. The data volume of one of these DB instances was originally slated to be ~3 TB. That might get tight now. @Andrew how much of the disk is copy-on-write for this stuff? I don't want to rely too much on oversubscription around DBs.

I think I need to make a "smalldb" flavor for this that has smaller disks for the OSMdb replacements to make it all work. That only has to be 1.6 TB:
/dev/mapper/labsdb1006--vg-srv 1.6T 1.2T 375G 76% /srv

That should leave enough as long as I *don't* move wikilabels to a different server. I do hate that the replica is so different from the toolsdb primary, though.

Made a new mediumdb flavor for the osmdb servers. clouddb1003 and clouddb1004 should be rebuilt before moving the database using that flavor.

The new VMs are up on the right places with the right flavor.

Ok, clouddb1002 is configured to roughly match labsdb1004 now. I've also added 7.5 GB of swap from LVM for both 1 and 2 in the cluster as a safety measure (and using a similar configuration to the original hardware).

Mentioned in SAL (#wikimedia-operations) [2019-02-27T19:43:36Z] <bstorm_> downtimed labsdb1004 to stop mysql for transferring data for T193264

Mentioned in SAL (#wikimedia-operations) [2019-02-27T19:49:31Z] <bstorm_> stopped slave on labsdb1004 for T193264

Stopped the slave at:

Relay_Master_Log_File: log.167066
Exec_Master_Log_Pos: 104666280

Syncing the data over from labsdb1004 to clouddb1002.
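Recording `Relay_Master_Log_File` and `Exec_Master_Log_Pos` before the copy is what lets replication restart from the same point on the new host. A hedged sketch of pulling those two fields out of `SHOW SLAVE STATUS\G` text output, using the coordinates recorded above as sample input (the parsing helper is illustrative, not part of any WMF tooling):

```python
# Extract the two binlog coordinates needed to restart replication
# from a SHOW SLAVE STATUS\G text dump. The sample uses the values
# recorded in this task; the parsing itself is a generic sketch.

SAMPLE = """
        Relay_Master_Log_File: log.167066
          Exec_Master_Log_Pos: 104666280
"""

def binlog_coordinates(status_text):
    """Return (log_file, log_pos) from SHOW SLAVE STATUS\\G output."""
    coords = {}
    for line in status_text.splitlines():
        line = line.strip()
        if line.startswith(("Relay_Master_Log_File:", "Exec_Master_Log_Pos:")):
            key, _, value = line.partition(":")
            coords[key] = value.strip()
    return coords["Relay_Master_Log_File"], int(coords["Exec_Master_Log_Pos"])

log_file, log_pos = binlog_coordinates(SAMPLE)
print(log_file, log_pos)  # log.167066 104666280
```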

Mentioned in SAL (#wikimedia-operations) [2019-02-28T02:08:48Z] <bstorm_> clouddb1002 is now in place to replace labsdb1004 as replica for toolsdb but not wikilabels postgres yet T193264

Change 493608 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikilabels: stage the postgres roles for virtualizing the database

https://gerrit.wikimedia.org/r/493608

Bstorm added a subscriber: Halfak. Mar 1 2019, 4:33 PM

@Halfak I am going to need to move the wikilabels database soon to the new server (moving from labsdb1004 -> clouddb1002.clouddb-services.eqiad.wmflabs). The server will be prepped once I sort out the puppet stuff I just put up for review. It's not ready just yet, so this is just a heads up that I'll need to sort out a time for the DB to be down for a while to dump it out, transfer it, and stand it back up at the other end. I imagine this might require some changes in wikilabels as well.

Does that only depend on DNS for labsdb1004 directly?

Change 493608 merged by Bstorm:
[operations/puppet@production] wikilabels: stage the postgres roles for virtualizing the database

https://gerrit.wikimedia.org/r/493608

Bstorm added a comment. Mar 1 2019, 9:25 PM

Postgres is now running on clouddb1002 on the correct data volume. It just needs the data moved over (and any changes on the apps etc).

Change 493769 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] osmdb: stage the roles and profiles for virtualizing the servers

https://gerrit.wikimedia.org/r/493769

Change 493769 merged by Bstorm:
[operations/puppet@production] osmdb: stage the roles and profiles for virtualizing the servers

https://gerrit.wikimedia.org/r/493769

Restricted Application added a project: Scoring-platform-team. Mar 6 2019, 3:51 PM
Bstorm added a comment. Mar 6 2019, 3:51 PM

Also adding the wikilabels tag since this is going to need coordination to move that database.

Change 494771 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] osm: Add a cloud-internal address for the osmdb cluster

https://gerrit.wikimedia.org/r/494771

Change 494843 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] osmdb: refactor the password framework to not use the module

https://gerrit.wikimedia.org/r/494843

Change 494843 merged by Bstorm:
[operations/puppet@production] osmdb: refactor the password framework to not use the module

https://gerrit.wikimedia.org/r/494843

Change 495290 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] osmdb: Switch the replica to the VM that needs to become the master

https://gerrit.wikimedia.org/r/495290

@Bstorm what is left to be able to decommission labsdb1004 and labsdb1005? Is it just T217922: Migrate Wikilabels from labsdb1004 to clouddb1002, or is there something else missing?
Just trying to understand what those hosts are still used for :-)

Bstorm added a comment. Edited Mar 12 2019, 3:36 PM

@Marostegui, It's just that and any activity remaining for T216441.

Harej raised the priority of this task from Medium to High. Mar 19 2019, 9:11 PM
Harej moved this task from Untriaged to Monitor on the Scoring-platform-team board.
GTirloni removed a subscriber: GTirloni. Mar 21 2019, 9:06 PM

Got the role working on the replacement osmdb primary. Going to try to establish replication to it.

Change 494771 had a related patch set uploaded (by Andrew Bogott; owner: Bstorm):
[operations/puppet@production] osm: Add a cloud-internal address for the osmdb cluster

https://gerrit.wikimedia.org/r/494771

Change 495290 merged by Bstorm:
[operations/puppet@production] osmdb: Switch the replica to the VM that needs to become the master

https://gerrit.wikimedia.org/r/495290

Reverted that change since I forgot to complete a pg_basebackup first. That is now running.

Mentioned in SAL (#wikimedia-cloud) [2019-03-28T19:32:34Z] <bstorm_> pg_basebackup started on clouddb1003 for osmdb T193264
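pg_basebackup streams a full physical copy of the primary's data directory, which is what seeds the new osmdb replica here. A hedged sketch of assembling such an invocation (the primary hostname, data directory, and role name below are illustrative assumptions; the actual flags used on clouddb1003 are not recorded in this task):

```python
# Build a pg_basebackup command line for seeding a streaming replica.
# Host, data directory, and role are illustrative guesses, not taken
# from the task.

def basebackup_cmd(primary_host, data_dir, user="replication"):
    return [
        "pg_basebackup",
        "-h", primary_host,  # primary to copy from
        "-U", user,          # replication role on the primary
        "-D", data_dir,      # empty target data directory on the replica
        "-X", "stream",      # stream WAL alongside the base copy
        "-P",                # show progress
    ]

cmd = basebackup_cmd("labsdb1006.eqiad.wmnet", "/srv/postgresql/main")
print(" ".join(cmd))
```

Streaming WAL during the copy (`-X stream`) means the backup is consistent on its own, so the replica can start catching up as soon as recovery settings point it back at the primary.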

Change 499910 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wikilabels: Update toolschecker to monitor the live DB

https://gerrit.wikimedia.org/r/499910

Change 499910 merged by Bstorm:
[operations/puppet@production] wikilabels: Update toolschecker to monitor the live DB

https://gerrit.wikimedia.org/r/499910

Change 499940 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] osmdb: Switch the replica to the VM that needs to become the master

https://gerrit.wikimedia.org/r/499940

Change 499940 merged by Bstorm:
[operations/puppet@production] osmdb: Switch the replica to the VM that needs to become the master

https://gerrit.wikimedia.org/r/499940

Mentioned in SAL (#wikimedia-cloud) [2019-03-28T23:17:06Z] <bstorm_> clouddb1003 is now a full osmdb replica T193264

Change 494771 merged by Bstorm:
[operations/puppet@production] osm: Add a cloud-internal address for the osmdb cluster

https://gerrit.wikimedia.org/r/494771

Mentioned in SAL (#wikimedia-cloud) [2019-03-29T00:00:41Z] <bstorm_> T193264 Added osm.db.svc.eqiad.wmflabs to cloud DNS

Bstorm added a comment. Edited Apr 4 2019, 6:49 PM

osmdb is now on VMs.

Bstorm closed this task as Resolved. May 22 2019, 5:11 PM

This is done!