
labsdb1006/1007 (postgresql) maintenance
Closed, Duplicate, Public

Description

Event Timeline


Thanks @MaxSem. AFAIK we can take this down with proper notice (and really we must). My thinking is to send a general one-week notice to labs-announce and post here before doing this maintenance. @jcrespo, how does that sound?

@chasemp yes, a few days of downtime should be OK. I have a cache layer that should serve most of the requests.

We need to take labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet offline to update them from Ubuntu Precise to Debian Jessie on 2017-03-08.

This is in coordination with an operations wide effort to deprecate Ubuntu Precise this quarter: T94790, T143349, and T123525.

Watch for updates on task https://phabricator.wikimedia.org/T157359. This maintenance window will be 48 hours and will begin at 8 AM UTC. The service may be unavailable for several periods during that time.

These are the PostgreSQL ("OSM") servers. We believe this directly impacts at least two projects:

https://wikitech.wikimedia.org/wiki/Nova_Resource:Maps
https://wikitech.wikimedia.org/wiki/Nova_Resource:Maps-team

Come see us on the task or IRC at #wikimedia-labs with questions or concerns.

-- 
Thank you,

The Labs & DBA teams

My idea would be as follows:

  • stop labsdb1006 -> copy /srv/postgres to dbstore1001 -> reimage -> copy the data back.

If that works fine, repeat with labsdb1007. I am not completely sure what the impact is of one of them being down while the other is up. Does anyone have any thoughts on the above idea?

The most important step, and the reason we need to copy the data away in case something goes wrong, is the Postgres upgrade from 9.1 (precise) to 9.4 (jessie). We cannot make a copy in place because we do not have enough space.
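For reference, that copy step would boil down to something like the following (a sketch only; the exact commands were not recorded here, and the host/path names just follow the plan above):

# stop the cluster first so the data directory is consistent
service postgresql stop
# stream /srv/postgres to dbstore1001 as a compressed tarball
tar -C /srv -czf - postgres | ssh dbstore1001.eqiad.wmnet 'cat > /srv/tmp/labsdb1006.tar.gz'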

> My idea would be as follows:
>
>   • stop labsdb1006 -> copy /srv/postgres to dbstore1001 -> reimage -> copy the data back.

Actually that will not work. The Postgres on-disk format is not compatible between versions and needs to be upgraded. The most supported operation is *gasp* dump and reimport: https://www.postgresql.org/docs/9.4/static/upgrading.html. It has been one of the main pain points for ages now and is also mentioned in the now (in?)famous https://eng.uber.com/mysql-migration/. These days there is also pg_upgrade, which with --link mode should in theory help avoid that problem, but I've never tested it in a 9.1 => 9.4 setup and with that amount of data.

> If that works fine, repeat with labsdb1007. I am not completely sure what the impact is of one of them being down while the other is up. Does anyone have any thoughts on the above idea?

Actually I had a different plan in mind. So, labsdb1007 is a read-only slave of labsdb1006. My proposed scheme would be to reinstall the slave as a master, let it resync with OSM (that's the gis database; it should take about a day or so), and then copy over the few (3?) databases that are generated by the admittedly few users via a pg_dump process. Or we could even try the pg_upgrade --link approach as per the manual. I am fine with both approaches, and even feel adventurous enough to try the latter first. Then just switch osmdb.eqiad.wmnet, the canonical DNS name people should be using, to point to labsdb1007 instead. Then labsdb1006 should be reimaged as a slave of labsdb1007. That should allow us to minimize the maintenance downtime to a few seconds/minutes, along with the stress and discomfort to both ops and users.

How does that sound?
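The per-database copy would be little more than piping pg_dump from the old master into the new one, e.g. (a sketch only; the database name is just a placeholder):

# run once per user database; assumes superuser access on both hosts
pg_dump -h labsdb1006.eqiad.wmnet -U postgres --format=custom somedb \
  | pg_restore -h labsdb1007.eqiad.wmnet -U postgres --create --dbname=postgres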

> These days there is also pg_upgrade, which with --link mode should in theory help avoid that problem, but I've never tested it in a 9.1 => 9.4 setup and with that amount of data.

I did my research; people have done it successfully in the past with these specific versions, and it was my first option for avoiding a full reimport. We asked for extended downtime precisely so that we can have unexpected results and not worry about them.
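For the record, the pg_upgrade --link invocation would look roughly like this (untested here; the data directory paths are illustrative, assuming the stock Debian/Ubuntu binary locations):

# run as the postgres user, with both the 9.1 and 9.4 clusters shut down
pg_upgrade \
  --old-bindir=/usr/lib/postgresql/9.1/bin \
  --new-bindir=/usr/lib/postgresql/9.4/bin \
  --old-datadir=/srv/postgres/9.1/main \
  --new-datadir=/srv/postgres/9.4/main \
  --link    # hard-link data files instead of copying them, so no second full copy is needed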

>> These days there is also pg_upgrade, which with --link mode should in theory help avoid that problem, but I've never tested it in a 9.1 => 9.4 setup and with that amount of data.
>
> I did my research; people have done it successfully in the past with these specific versions, and it was my first option for avoiding a full reimport. We asked for extended downtime precisely so that we can have unexpected results and not worry about them.

Great then! Happy to learn this actually works.

Hey, I am not saying it is going to work 100% for sure; I am just suggesting to try it first, and then go the slow route, which is basically what you suggested. We are a bit constrained by size and disk speed, so there may be many reasons why that might not work.

> Actually I had a different plan in mind. So, labsdb1007 is a read-only slave of labsdb1006. My proposed scheme would be to reinstall the slave as a master, let it resync with OSM (that's the gis database; it should take about a day or so), and then copy over the few (3?) databases that are generated by the admittedly few users via a pg_dump process. Or we could even try the pg_upgrade --link approach as per the manual. I am fine with both approaches, and even feel adventurous enough to try the latter first. Then just switch osmdb.eqiad.wmnet, the canonical DNS name people should be using, to point to labsdb1007 instead. Then labsdb1006 should be reimaged as a slave of labsdb1007. That should allow us to minimize the maintenance downtime to a few seconds/minutes, along with the stress and discomfort to both ops and users.

I am happy to go with your approach, as you certainly have A LOT MORE knowledge of Postgres and the context of those two servers than I do. I would be happy if you want to take the lead on this and help us with the migration on Wed/Thu.

> Hey, I am not saying it is going to work 100% for sure; I am just suggesting to try it first, and then go the slow route, which is basically what you suggested. We are a bit constrained by size and disk speed, so there may be many reasons why that might not work.

OK sounds sane enough. +1

>> Actually I had a different plan in mind. So, labsdb1007 is a read-only slave of labsdb1006. My proposed scheme would be to reinstall the slave as a master, let it resync with OSM (that's the gis database; it should take about a day or so), and then copy over the few (3?) databases that are generated by the admittedly few users via a pg_dump process. Or we could even try the pg_upgrade --link approach as per the manual. I am fine with both approaches, and even feel adventurous enough to try the latter first. Then just switch osmdb.eqiad.wmnet, the canonical DNS name people should be using, to point to labsdb1007 instead. Then labsdb1006 should be reimaged as a slave of labsdb1007. That should allow us to minimize the maintenance downtime to a few seconds/minutes, along with the stress and discomfort to both ops and users.
>
> I am happy to go with your approach, as you certainly have A LOT MORE knowledge of Postgres and the context of those two servers than I do. I would be happy if you want to take the lead on this and help us with the migration on Wed/Thu.

Yeah I can do that if it makes you two happier.

> Yeah I can do that if it makes you two happier.

That would be appreciated. Not happier per se, but a lot more confident (speaking for myself now) that the data will not be destroyed in the process and will actually be usable once the migration is done :)

@akosiaris Do you know if labsdb1007 is actively in use? If not, we could start working on it now, ahead of the maintenance window...

It is not. It is a read-only slave not really being used by anyone at the moment, so we are free to start the process on it well ahead of the maintenance window.

Thanks, I will start "breaking" it tomorrow (Tuesday) during the EU morning. Do not worry, I can take care of this; you probably have more urgent things to take care of. I will take you up on your offer of help, however, should something unexpected happen.

Shall I start copying labsdb1007 to dbstore1001, or are you "breaking" it at the moment, @jcrespo?

We can start this now. I was about to do it.

> We can start this now. I was about to do it.

Ah, go ahead if you like then.

I am with labsdb1004 now; please shut down Postgres and copy it.

Mentioned in SAL (#wikimedia-operations) [2017-03-08T10:31:28Z] <marostegui> Shutdown postgresql on labsdb1007 for maintenance - T157359

Transfer started and the file will be located at: dbstore1001:/srv/tmp/labsdb1007.tar.gz

Change 341776 had a related patch set uploaded (by Marostegui):
[operations/puppet] linux-host-entries: No more precise for labsdb1007

https://gerrit.wikimedia.org/r/341776

Change 341776 merged by Marostegui:
[operations/puppet] linux-host-entries: Remove precise: labsdb1006,7

https://gerrit.wikimedia.org/r/341776

labsdb1007:/etc directory has been copied to dbstore1001:/srv/tmp/labsdb1007_etc.tar.gz

Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts:

['labsdb1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703081207_jynus_4522.log.

The installer asked for confirmation before deleting all the partitions; this should be changed in the recipe (or we should use the db one, where this was fixed a long time ago).
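(The patch below presumably does this with the usual preseed knobs, along the lines of the following; the exact keys in our recipe may differ.)

d-i partman/confirm_write_new_label boolean true
d-i partman/choose_partition        select  finish
d-i partman/confirm                 boolean true
d-i partman/confirm_nooverwrite     boolean true
d-i partman-md/confirm              boolean true
d-i partman-lvm/confirm             boolean true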

Change 341786 had a related patch set uploaded (by Jcrespo):
[operations/puppet] partman-labsdb-osm: Make partman auto-confirm partition/RAID/LVM

https://gerrit.wikimedia.org/r/341786

Completed auto-reimage of hosts:

['labsdb1007.eqiad.wmnet']

and were ALL successful.

It installed OK, but /dev/tank/data has only 500 GB. How was that partitioned, @akosiaris?

This is 1006:

PV /dev/md1   VG labsdb1006-vg   lvm2 [1.64 TiB / 412.46 GiB free]
Total: 1 [1.64 TiB] / in use: 1 [1.64 TiB] / in no VG: 0 [0   ]

This is 1007 now:

PV /dev/sda6   VG tank   lvm2 [514.20 GiB / 0    free]
Total: 1 [514.20 GiB] / in use: 1 [514.20 GiB] / in no VG: 0 [0   ]

It's missing the rest of the disks in the md RAID5 (and the RAID1 for /boot) for some reason. Maybe it is some difference between the jessie and precise partman setups (there are a few).

Nope, not that. The actual culprit is rOPUP1b3633e, where labsdb1007 gets matched incorrectly.

Thanks, I can amend that and try again.

Change 341788 had a related patch set uploaded (by Jcrespo):
[operations/puppet] autoinstall: Fix bug on 1b3633e4aa3f

https://gerrit.wikimedia.org/r/341788

Change 341788 merged by Marostegui:
[operations/puppet] autoinstall: Fix bug on 1b3633e4aa3f

https://gerrit.wikimedia.org/r/341788

Change 341786 merged by Marostegui:
[operations/puppet] partman-labsdb-osm: Make partman auto-confirm partition/RAID/LVM

https://gerrit.wikimedia.org/r/341786

Merged both patches from @jcrespo and ran puppet on install1002; going to try to reimage the server again.

Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['labsdb1007.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703081333_marostegui_1716.log.

Looks good now:

root@labsdb1007:~# pvs
  PV         VG            Fmt  Attr PSize PFree
  /dev/md1   labsdb1007-vg lvm2 a--  1.64t    0

Completed auto-reimage of hosts:

['labsdb1007.eqiad.wmnet']

and were ALL successful.

So, for the record, we saw something odd with the logical volumes:

root@labsdb1007:~# lvs
  LV           VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  _placeholder labsdb1007-vg -wi-a-----   1.59t
  root         labsdb1007-vg -wi-ao----  46.56g
  swap         labsdb1007-vg -wi-ao---- 952.00m

Jaime found this:

˜/jynus 16:00> "_placeholder is an LV that fills up the rest of space."
˜/jynus 16:00> "This is a hack to keep the zookeepeer partition from using up all free space."

So we decided to rename it:

root@labsdb1007:~# lvrename /dev/labsdb1007-vg/_placeholder /dev/labsdb1007-vg/srv
  Renamed "_placeholder" to "srv" in volume group "labsdb1007-vg"
root@labsdb1007:~# lvs
  LV   VG            Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root labsdb1007-vg -wi-ao----  46.56g
  srv  labsdb1007-vg -wi-a-----   1.59t
  swap labsdb1007-vg -wi-ao---- 952.00m

Created the ext3 filesystem and mounted it on /srv/:

root@labsdb1007:/srv# cat /etc/fstab | grep srv
/dev/mapper/labsdb1007--vg-srv /srv ext3    errors=remount-ro 0       1
root@labsdb1007:/srv# df -hT /srv/
Filesystem                     Type  Size  Used Avail Use% Mounted on
/dev/mapper/labsdb1007--vg-srv ext3  1.6T   15G  1.5T   1% /srv
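(For completeness, the steps behind that were roughly the following; the exact invocation was not logged.)

mkfs.ext3 /dev/labsdb1007-vg/srv
echo '/dev/mapper/labsdb1007--vg-srv /srv ext3 errors=remount-ro 0 1' >> /etc/fstab
mkdir -p /srv && mount /srv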

> Hey, I am not saying it is going to work 100% for sure; I am just suggesting to try it first, and then go the slow route, which is basically what you suggested. We are a bit constrained by size and disk speed, so there may be many reasons why that might not work.

I can admit it: I failed to upgrade it in place. The old cluster fails to start up and gets stuck in recovery mode; either it is a config issue, the data is considered corrupted, or the recovery will take ages.

9.4 works OK, though, so let's import whatever is necessary instead.

Can we use this opportunity to reimport the data from scratch to get rid of possible accumulated OSM replication errors?

It has actually done some recovering, as the file it is scanning has changed since last night:

postgres  7189  0.0  0.2 203784 139620 ?       Ss   Mar08   0:38 postgres: startup process   recovering 0000000100001B8000000057

However, it doesn't look like it is up yet, so as @jcrespo said, it is probably better to do the dump as @akosiaris suggested.

Not sure from what time this is:

FATAL:  the database system is starting up
FATAL:  terminating walreceiver process due to administrator command
LOG:  invalid record length at 1B7F/347E8C78
LOG:  streaming replication successfully connected to primary
FATAL:  terminating walreceiver process due to administrator command
LOG:  invalid record length at 1B7F/347E9E70
LOG:  streaming replication successfully connected to primary

That is me killing the replication, which will not work anyway. @akosiaris, can you point us to the OSM load process? Do you have it puppetized or scripted somewhere? I need your help.
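(For context, a from-scratch planet load typically looks something like the following, though this is only a generic example and not necessarily what our puppetization does.)

# download the latest planet dump and import it into the gis database
wget https://planet.openstreetmap.org/pbf/planet-latest.osm.pbf
osm2pgsql --create --slim --database gis --host localhost planet-latest.osm.pbf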

Change 341991 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/puppet] Promote labsdb1007 to osm::master.

https://gerrit.wikimedia.org/r/341991

Change 341991 merged by Alexandros Kosiaris:
[operations/puppet] Promote labsdb1007 to osm::master.

https://gerrit.wikimedia.org/r/341991

We've ended up promoting labsdb1007 to master, resyncing it from planet.osm, and copying the various databases/tables with pg_dump/pg_restore. That process is now almost done (a few large tables are still being imported). I'll now switch over the DNS from labsdb1006 to labsdb1007.

Change 342458 had a related patch set uploaded (by Alexandros Kosiaris):
[operations/dns] Switch osmdb.eqiad.wmnet to use labsdb1007

https://gerrit.wikimedia.org/r/342458

Change 342458 merged by Alexandros Kosiaris:
[operations/dns] Switch osmdb.eqiad.wmnet to use labsdb1007

https://gerrit.wikimedia.org/r/342458

And we are done. The rest of the databases/tables have been copied over, the DNS record has been updated, and DNS caches have been cleared. labsdb1007 is now the active host for osmdb.eqiad.wmnet. I've shut down the PostgreSQL database on labsdb1006 on purpose to force any remaining open connections to fail over to labsdb1007. Any application using labsdb1006 directly was configured wrongly and should be updated to use osmdb.eqiad.wmnet.
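(For application maintainers, the check and the change are small; the user and database names below are illustrative.)

# confirm the service name now resolves to labsdb1007
dig +short osmdb.eqiad.wmnet
# connect through the canonical name rather than a specific host
psql -h osmdb.eqiad.wmnet -U osm -d gis -c 'SELECT postgis_full_version();'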

The difficult part of this task is resolved. Let's go for the reimage of labsdb1006 and its setup as a PostgreSQL slave of labsdb1007, which requires no downtime.

@aude @MaxSem @Kolossos Can you verify your applications (e.g. by restarting them) and check that they work as expected, to be 100% sure the maintenance and upgrade did not cause any issues? This is now PostgreSQL 9.4.

> @aude @MaxSem @Kolossos Can you verify your applications (e.g. by restarting them) and check that they work as expected, to be 100% sure the maintenance and upgrade did not cause any issues? This is now PostgreSQL 9.4.

Friendly ping for @aude @MaxSem @Kolossos.

Nope. My stuff fails now with:

ERROR:  function setsrid(box3d, integer) does not exist
LINE 6:             SetSRID('BOX3D(-12681020.832929 5647769.0043352,...
                    ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

After fixing those to ST_SetSRID, my PostGIS query now fails with:

ERROR:  Operation on mixed SRID geometries

That is probably a consequence of the PostGIS upgrade. Can you post or point us to the whole query/relevant code? I am afraid some adjustments may be needed; sorry about that.

Sadly, keeping the old version was not a possibility (and we kept it for as long as we could), because the OS stopped getting security updates. Sorry :-(

This could be the reason, i.e. having to set the SRID for constants: http://gis.stackexchange.com/questions/68711/postgis-geometry-query-returns-error-operation-on-mixed-srid-geometries-only If it is not that, maybe there is some kind of discrepancy between the SRIDs in the database (e.g. from the import).
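In other words, something along these lines (illustrative only; the real table, column, and coordinate values come from your own query):

-- the data column is stored with SRID 3857, so tag the constant bounding box
-- with the same SRID instead of 900913 (they are the same projection anyway)
SELECT osm_id
FROM planet_osm_polygon
WHERE way && ST_SetSRID('BOX3D(-12681020.8 5647769.0, -12625533.5 5685451.7)'::box3d::geometry, 3857);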

Looks like the OSM data uses SRID 3857 and I am comparing it to a bounding box with SRID 900913.

OK, the next issue is that the column the_geom suddenly does not exist anymore in the land_polygons and coastlines tables...

...apparently it is now called geom.
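So the adjusted query just needs the new column name and a matching SRID, for instance (purely illustrative values):

-- the_geom was renamed to geom by the new import
SELECT count(*)
FROM land_polygons
WHERE geom && ST_MakeEnvelope(-12681020.8, 5647769.0, -12625533.5, 5685451.7, 3857);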

Good to know. Feel free to test and report issues; otherwise, we will soon irrevocably delete the previous instance. Thanks for taking the time to adapt to it; this should give you (I assume) several years until the next upgrade.

Change 343285 had a related patch set uploaded (by Jcrespo):
[operations/puppet] osm: Change osm master rsync check with different parameter order

https://gerrit.wikimedia.org/r/343285

Change 343285 merged by Jcrespo:
[operations/puppet] osm: Change osm master rsync check with different parameter order

https://gerrit.wikimedia.org/r/343285

Unless anyone says otherwise, I will reimage the old server on Monday. Last chance to check that the data and functionality work on the new server!

The new server works for me. The upgrade also resolved T145599. Thank you!

Change 343610 had a related patch set uploaded (by Jcrespo):
[operations/puppet] osm: convert labsdb1006 into an osm slave of labsdb1007

https://gerrit.wikimedia.org/r/343610

Change 343610 merged by Jcrespo:
[operations/puppet] osm: convert labsdb1006 into an osm slave of labsdb1007

https://gerrit.wikimedia.org/r/343610

Script wmf_auto_reimage was launched by jynus on neodymium.eqiad.wmnet for hosts:

['labsdb1006.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201703201123_jynus_18598.log.

Completed auto-reimage of hosts:

['labsdb1006.eqiad.wmnet']

Of which those FAILED:

set(['labsdb1006.eqiad.wmnet'])
jcrespo moved this task from Backlog to In progress on the DBA board.

I am kickstarting the replication right now; this required different puppetization of the replication "grants". I will change puppet once I confirm the import worked.
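(Roughly speaking, those "grants" are a replication role plus a pg_hba.conf entry on the master, and the slave is then seeded with pg_basebackup; the sketch below uses placeholder values rather than the actual puppetized ones.)

# on labsdb1007 (master), as the postgres user: create the replication role
psql -c "CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'placeholder';"
# allow it from the slave in pg_hba.conf, then reload postgres
echo 'host replication replicator <labsdb1006-ip>/32 md5' >> /etc/postgresql/9.4/main/pg_hba.conf
# on labsdb1006 (slave): seed the data directory from the master
pg_basebackup -h labsdb1007.eqiad.wmnet -U replicator -D /srv/postgresql/9.4/main -X stream -P
# plus a recovery.conf on the slave with standby_mode = 'on' and the matching primary_conninfo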

I have disabled puppet on labsdb1007 in the meantime.

Change 345847 had a related patch set uploaded (by Jcrespo):
[operations/puppet@production] osm-labs: Fix replication credentials

https://gerrit.wikimedia.org/r/345847

OK, now puppet works, but either the puppet code needs more work or it is failing silently; this needs more research. Replication is not working yet.

> OK, now puppet works, but either the puppet code needs more work or it is failing silently; this needs more research. Replication is not working yet.

Is this still valid or can this ticket be closed?

This is still valid; labsdb1006 is still not set up and labsdb1007 is a single point of failure.

Change 345847 abandoned by Jcrespo:
osm-labs: Fix replication credentials

Reason:
Probably not relevant due to T197246

https://gerrit.wikimedia.org/r/345847