Maniphest T219652

Final migration of osmdb.eqiad.wmnet into Cloud VPS instances
Closed, Resolved (Public)

Assigned To

Authored By

	• Bstorm
	Mar 29 2019, 5:16 PM

Description

As of now, a new DNS name, osm.db.svc.eqiad.wmflabs, points at clouddb1003, which is a read replica of osmdb.eqiad.wmnet.

It should already work for read-only loads. Now we need to cut over so that the old hardware can be decommissioned.

This is currently scheduled to take place on 2019-04-04 at 17:00 UTC.

The plan looks like this:

1. Stage a DNS change so that osmdb.eqiad.wmnet becomes a CNAME pointing at osm.db.svc.eqiad.wmflabs.
2. Announce the switchover plan a few days ahead so most people see it. A couple of tools (notably those owned by @Kolossos and possibly @aude, which have some read-write access) will be affected, because DNS will temporarily point at a read replica.
3. Switch DNS with a merge and update.
4. <wait for TTL, which is 5min>
5. Stop postgres on the master, which will no longer be in use.
6. Touch the trigger file on clouddb1003, and ensure postgres is now running as the rw primary.
7. Announce that the change with impact is over.
8. Switch the puppet role to make this server the primary for purposes of sync jobs, etc.
9. Start work on bringing up the new replica on clouddb1004 <-- At this point we are ready to decom the old servers
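
The "switch DNS" and "wait for TTL" steps could be checked with a small poller rather than a fixed sleep. The following is only a sketch under stated assumptions: the function names, timeout and interval are hypothetical and not part of any WMCS tooling, and it only reflects what the local resolver returns, so it should be run from a host that clients actually resolve through.

```python
import socket
import time
from typing import Callable, Set


def resolve_ips(hostname: str) -> Set[str]:
    """Return the set of IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None)
    return {info[4][0] for info in infos}


def wait_for_cutover(
    old_name: str,
    new_name: str,
    timeout_s: float = 600.0,   # assumption: twice the 5 min TTL
    interval_s: float = 15.0,   # assumption: arbitrary polling interval
    resolver: Callable[[str], Set[str]] = resolve_ips,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until old_name resolves to the same addresses as new_name.

    Returns True once they match, False if timeout_s elapses first.
    """
    waited = 0.0
    while True:
        try:
            old_ips = resolver(old_name)
            if old_ips and old_ips == resolver(new_name):
                return True
        except OSError:
            pass  # a failed lookup is treated as "not switched yet"
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

For this task the call would be wait_for_cutover("osmdb.eqiad.wmnet", "osm.db.svc.eqiad.wmflabs"). The resolver and sleep parameters are injectable only so the loop can be exercised without real DNS.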

I will try some quick tests to ensure there are no network issues before proceeding. As of now for read-access osm.db.svc.eqiad.wmflabs should work as a read replica and copy of osmdb.eqiad.wmnet (if folks on the maps project or similar want to test that theory -- @Awjrichards @Chippyy @cmarqu @dschwen @jeremyb @MaxSem @Multichill @Nosy @TheDJ -- since I think that project uses osmdb).

Details

Subject	Repo	Branch	Lines +/-
postgresql: set max_wal_senders on slave conf	operations/puppet	production	+8 -4
postgresql: set recovery.conf to writeable by postgres user	operations/puppet	production	+2 -2
osmdb: set the CNAME for osmdb to the new instance in Cloud VPS	operations/dns	master	+1 -1

Related Objects

Status	Assigned	Task
		Restricted Task
Resolved	None	T207536 Move various support services for Cloud VPS currently in prod into their own instances
		Unknown Object (Task)
Resolved	• chasemp	T172538 rack/setup/install labvirt10(19|20).eqiad.wmnet
Resolved	• Bstorm	T216208 ToolsDB overload and cleanup
Declined	None	T216173 labsdb1005/6 - Upgrade to Stretch
Resolved	• Bstorm	T193264 Replace labsdb100[4567] with instances on cloudvirt1019 and cloudvirt1020
Resolved	• Bstorm	T219652 Final migration of osmdb.eqiad.wmnet into Cloud VPS instances

Event Timeline

• Bstorm triaged this task as High priority. Mar 29 2019, 5:16 PM

• Bstorm created this task.

Change 500086 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] osmdb: set the CNAME for osmdb to the new instance in Cloud VPS

https://gerrit.wikimedia.org/r/500086

gerritbot added a project: Patch-For-Review. Mar 29 2019, 5:41 PM

As of now for read-access osm.db.svc.eqiad.wmflabs should work as a read replica and copy of osmdb.eqiad.wmnet

I can confirm read access from maps-tiles1 eqiad-r

Great! Thanks.

Setting schedule for change on Thursday with announcement going out today, then.

• Bstorm updated the task description. Apr 1 2019, 8:15 PM

• Bstorm updated the task description. Apr 1 2019, 8:19 PM

MSantos moved this task from All map-related tasks to Tracking on the Maps board. Apr 2 2019, 2:25 PM

Starting on the DNS change

Change 500086 merged by Bstorm:
[operations/dns@master] osmdb: set the CNAME for osmdb to the new instance in Cloud VPS

https://gerrit.wikimedia.org/r/500086

That took a bit because it required local rebase fussing to merge.

Now we wait 5min at least.

@TheDJ and others, sorry the database crashed at one point because of permissions on a failover file. I'll see about fixing that in puppet and documenting a warning for future failovers. Some maps services may need a restart.

Mentioned in SAL (#wikimedia-cloud) [2019-04-04T18:00:52Z] <bstorm_> T219652 clouddb1003 is now the OSMdb primary

NOTE: the crash was caused by read-only permissions on recovery.conf, so when running su postgres -c 'pg_ctl promote -D /srv/postgres/9.6/main', the postgres user could not rename the file to recovery.done like it normally would.

This may be because other opsen have run that as root instead of postgres in the past, but I was following common practice, so I will either change the perms in puppet for our postgres setup or simply add a warning in our WMCS documentation for the failover of this database.
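
A pre-flight check run before promotion could have caught this. The sketch below is an illustration, not the puppet fix: promotion_preflight is a hypothetical helper, and it assumes recovery.conf sits in the data directory passed to pg_ctl and should be owned by, and writable for, the postgres user.

```python
import os
import stat
from typing import List


def promotion_preflight(recovery_conf: str, expected_uid: int) -> List[str]:
    """Return a list of problems that could block the rename on promotion."""
    problems = []
    try:
        st = os.stat(recovery_conf)
    except FileNotFoundError:
        return [f"{recovery_conf} does not exist"]
    # Assumption: the file should belong to the user that runs pg_ctl promote.
    if st.st_uid != expected_uid:
        problems.append(f"owner uid is {st.st_uid}, expected {expected_uid}")
    # Inspect mode bits directly; os.access() would always succeed for root.
    if not st.st_mode & stat.S_IWUSR:
        problems.append("owner write bit is not set")
    return problems
```

On the replica this might be called as promotion_preflight("/srv/postgres/9.6/main/recovery.conf", pwd.getpwnam("postgres").pw_uid), and the promotion attempted only if the returned list is empty.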

Change 501371 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] postgresql: set recovery.conf to writeable by postgres user

https://gerrit.wikimedia.org/r/501371

Mentioned in SAL (#wikimedia-cloud) [2019-04-04T18:35:56Z] <bstorm_> T219652 postgresql on clouddb1003 had to be restarted to get it to accept WAL connections

Change 501384 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] postgresql: set max_wal_senders on slave conf

https://gerrit.wikimedia.org/r/501384
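
For context, PostgreSQL 9.6 defaults max_wal_senders to 0, and changing it needs a server restart, so a replica whose config never set it cannot feed a new replica right after promotion. A check along these lines could flag that ahead of a failover. It is a simplified sketch with hypothetical helper names: it ignores include directives and quoted values containing "#", and it does not check wal_level or pg_hba.conf.

```python
from typing import Optional


def effective_setting(conf_text: str, name: str) -> Optional[str]:
    """Return the last uncommented value of `name` in postgresql.conf text."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (simplified)
        if not line:
            continue
        key, sep, rest = line.partition("=")
        if not sep:
            parts = line.split(None, 1)  # the "key value" form without "="
            if len(parts) != 2:
                continue
            key, rest = parts
        if key.strip().lower() == name.lower():
            value = rest.strip().strip("'")  # later entries override earlier
    return value


def can_serve_replicas(conf_text: str) -> bool:
    """True if max_wal_senders is explicitly set to a positive integer."""
    value = effective_setting(conf_text, "max_wal_senders")
    # Unset means the 9.6 default of 0, i.e. no WAL sender slots.
    return value is not None and value.isdigit() and int(value) > 0
```

Running can_serve_replicas() against the replica's rendered postgresql.conf before the cutover would show whether a restart is going to be needed after promotion.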

Mentioned in SAL (#wikimedia-cloud) [2019-04-04T18:46:55Z] <bstorm_> T219652 pg_basebackup started on clouddb1004

Change 501371 merged by Bstorm:
[operations/puppet@production] postgresql: set recovery.conf to writeable by postgres user

https://gerrit.wikimedia.org/r/501371

• Bstorm closed this task as Resolved. Apr 5 2019, 4:46 PM

Change 501384 merged by Bstorm:
[operations/puppet@production] postgresql: set max_wal_senders on slave conf