
Final migration of osmdb.eqiad.wmnet into Cloud VPS instances
Closed, Resolved (Public)

Description

A new DNS name, osm.db.svc.eqiad.wmflabs, now points at clouddb1003, which is a read replica of osmdb.eqiad.wmnet.

It should already work for read-only loads. Now we need to cut over so that the old hardware can be decommissioned.

This is currently scheduled to take place on 2019-04-04 at 17:00 UTC.

The plan looks like this:

  • Stage DNS change so that osmdb.eqiad.wmnet will be a CNAME pointing at osm.db.svc.eqiad.wmflabs
  • Announce the planned switchover a few days ahead so that most people see it. A couple of tools (notably those owned by @Kolossos and possibly @aude, which have some read-write access) will be affected, because the DNS change temporarily points them at a read replica
  • Switch DNS with a merge and update
  • <wait for TTL, which is 5min>
  • Stop postgres on master, which will no longer be in use.
  • Touch trigger file on clouddb1003, and ensure postgres is now running as the rw primary
  • Announce that the change and its impact are over.
  • Switch the puppet role to make this server the primary for purposes of sync jobs, etc.
  • Start work on bringing up the new replica on clouddb1004 <-- At this point we are ready to decom the old servers
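
The "wait for TTL" step can be checked rather than timed. A minimal sketch, assuming the local resolver is a good enough proxy for what clients see: poll until the old name resolves to the same addresses as the new one. The function names, timeout and polling interval are illustrative and not part of any existing tooling.

```python
import socket
import time
from typing import Callable, Set


def resolve_ips(name: str) -> Set[str]:
    """Return the IPv4 addresses the local resolver gives for `name`."""
    _, _, addresses = socket.gethostbyname_ex(name)
    return set(addresses)


def wait_for_cutover(
    old_name: str,
    new_name: str,
    resolve: Callable[[str], Set[str]] = resolve_ips,
    timeout: float = 600.0,   # assumption: twice the 5 min TTL
    interval: float = 15.0,   # assumption: arbitrary polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until `old_name` resolves to the same addresses as `new_name`.

    Returns True once they match, False if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if resolve(old_name) == resolve(new_name):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    ok = wait_for_cutover("osmdb.eqiad.wmnet", "osm.db.svc.eqiad.wmflabs")
    print("cutover visible" if ok else "still resolving to the old host")
```

This only reflects the cache of the resolver it runs against; clients behind other caches may lag by up to one TTL.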

I will run some quick tests to rule out network issues before proceeding. As of now for read-access osm.db.svc.eqiad.wmflabs should work as a read replica and copy of osmdb.eqiad.wmnet (if folks on the maps project or similar want to test that theory -- @Awjrichards @Chippyy @cmarqu @dschwen @jeremyb @MaxSem @Multichill @Nosy @TheDJ -- since I think that project uses osmdb).
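
For anyone testing, one way to see which role a server currently holds is PostgreSQL's pg_is_in_recovery() function, which returns true on a replica and false on a read-write primary. A minimal sketch, assuming any DB-API 2.0 connection object (for example one from psycopg2, which is an assumption here); the helper name is illustrative.

```python
def server_role(conn) -> str:
    """Return "replica" or "primary" for an open DB-API connection.

    Relies on pg_is_in_recovery(), which is true while the server is
    replaying WAL as a standby and false once it has been promoted.
    """
    cur = conn.cursor()
    try:
        cur.execute("SELECT pg_is_in_recovery()")
        (in_recovery,) = cur.fetchone()
    finally:
        cur.close()
    return "replica" if in_recovery else "primary"


# Example use (assumes psycopg2 is installed; dbname is a placeholder):
#   import psycopg2
#   conn = psycopg2.connect(host="osm.db.svc.eqiad.wmflabs", dbname="gis")
#   print(server_role(conn))
```

The same check is useful after the trigger file is touched: the answer should change from "replica" to "primary".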

Event Timeline

Bstorm created this task. Mar 29 2019, 5:16 PM
Bstorm triaged this task as High priority.

Change 500086 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] osmdb: set the CNAME for osmdb to the new instance in Cloud VPS

https://gerrit.wikimedia.org/r/500086

TheDJ added a comment. Mar 29 2019, 8:25 PM

As of now for read-access osm.db.svc.eqiad.wmflabs should work as a read replica and copy of osmdb.eqiad.wmnet

I can confirm read access from maps-tiles1 eqiad-r

Bstorm added a comment. Apr 1 2019, 7:31 PM

Great! Thanks.

Bstorm added a comment. Apr 1 2019, 8:14 PM

Setting schedule for change on Thursday with announcement going out today, then.

Bstorm updated the task description. Apr 1 2019, 8:15 PM
Bstorm updated the task description. Apr 1 2019, 8:19 PM
MSantos moved this task from All map-related tasks to Tracking on the Maps board. Apr 2 2019, 2:25 PM

Bstorm added a comment. Apr 4 2019, 5:12 PM

Starting on the DNS change

Change 500086 merged by Bstorm:
[operations/dns@master] osmdb: set the CNAME for osmdb to the new instance in Cloud VPS

https://gerrit.wikimedia.org/r/500086

Bstorm added a comment. Apr 4 2019, 5:31 PM

That took a bit because the merge required some local rebase fussing.

Bstorm added a comment. Apr 4 2019, 5:32 PM

Now we wait at least 5 minutes.

Bstorm added a comment. Apr 4 2019, 6:00 PM

@TheDJ and others: sorry, the database crashed at one point because of permissions on a failover file. I'll look into fixing that in puppet and documenting a warning for future failovers. Some maps services may need a restart.

Mentioned in SAL (#wikimedia-cloud) [2019-04-04T18:00:52Z] <bstorm_> T219652 clouddb1003 is now the OSMdb primary

Bstorm added a comment. Apr 4 2019, 6:04 PM

NOTE: the crash was caused by read-only permissions on recovery.conf. When running su postgres -c 'pg_ctl promote -D /srv/postgres/9.6/main', the postgres user could not rename the file to recovery.done as it normally would.

This may be because other opsen have run that as root instead of postgres in the past, but I was following common practice, so I will either change the perms in puppet for our postgres setup or simply add a warning in our WMCS documentation for the failover of this database.
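
Until the permissions are fixed in puppet, a pre-flight check before running pg_ctl promote could catch this. A minimal sketch, assuming the failure mode described in the note: it reports a problem when recovery.conf or the data directory is not owned by, and owner-writable for, the given uid. The function names are illustrative, and the check deliberately ignores group permissions and ACLs.

```python
import os
import stat
from typing import List


def owner_can_write(path: str, uid: int) -> bool:
    """True if `path` is owned by `uid` and has the owner write bit set."""
    st = os.stat(path)
    return st.st_uid == uid and bool(st.st_mode & stat.S_IWUSR)


def promotion_preflight(data_dir: str, uid: int) -> List[str]:
    """Return a list of problems that could break promoting a standby.

    Simplification: only ownership and the owner write bit are checked.
    """
    problems = []
    recovery_conf = os.path.join(data_dir, "recovery.conf")
    if not os.path.exists(recovery_conf):
        problems.append("recovery.conf not found: is this a standby?")
    elif not owner_can_write(recovery_conf, uid):
        problems.append("recovery.conf is not writable by the server user")
    if not owner_can_write(data_dir, uid):
        problems.append("data directory is not writable by the server user")
    return problems


# Example use on the database host:
#   import pwd
#   uid = pwd.getpwnam("postgres").pw_uid
#   for problem in promotion_preflight("/srv/postgres/9.6/main", uid):
#       print("WARNING:", problem)
```

An empty list means the check found nothing; it does not guarantee that the promotion will succeed.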

Change 501371 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] postgresql: set recovery.conf to writeable by postgres user

https://gerrit.wikimedia.org/r/501371

Mentioned in SAL (#wikimedia-cloud) [2019-04-04T18:35:56Z] <bstorm_> T219652 postgresql on clouddb1003 had to be restarted to get it to accept WAL connections

Change 501384 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] postgresql: set max_wal_senders on slave conf

https://gerrit.wikimedia.org/r/501384
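
Background for this patch: in PostgreSQL 9.6 max_wal_senders defaults to 0 and can only be changed with a server restart, so a freshly promoted replica whose config never set it cannot serve pg_basebackup or streaming replicas until it is restarted with a non-zero value. A minimal sketch of a config check, assuming a plain postgresql.conf-style file with no include directives; the helper names are illustrative.

```python
from typing import Optional


def read_int_setting(conf_text: str, name: str) -> Optional[int]:
    """Return the last integer value assigned to `name`, or None if unset.

    Handles "name = value", "name=value" and "name value" lines and drops
    "#" comments. Include directives are not followed (simplification).
    """
    value = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, rest = line.replace("=", " ", 1).partition(" ")
        if key.lower() == name:
            value = int(rest.strip().strip("'"))
    return value


def can_serve_replicas(conf_text: str) -> bool:
    """True if the config allows at least one WAL sender.

    Assumes the PostgreSQL 9.6 default of 0 when the setting is absent.
    """
    return (read_int_setting(conf_text, "max_wal_senders") or 0) > 0


# Example use:
#   with open("/etc/postgresql/9.6/main/postgresql.conf") as f:  # path is an assumption
#       print(can_serve_replicas(f.read()))
```

On a running server, SHOW max_wal_senders gives the value actually in effect, which is the more reliable check after a restart.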

Mentioned in SAL (#wikimedia-cloud) [2019-04-04T18:46:55Z] <bstorm_> T219652 pg_basebackup started on clouddb1004

Change 501371 merged by Bstorm:
[operations/puppet@production] postgresql: set recovery.conf to writeable by postgres user

https://gerrit.wikimedia.org/r/501371

Bstorm closed this task as Resolved. Apr 5 2019, 4:46 PM

Change 501384 merged by Bstorm:
[operations/puppet@production] postgresql: set max_wal_senders on slave conf

https://gerrit.wikimedia.org/r/501384