Upgrade/reboot labsdb* servers
Closed, Resolved · Public

Description

Task to coordinate work with DBAs to upgrade/reboot these servers:

  • labsdb1004.eqiad.wmnet (jessie) scheduled 2018-11-20 17:15 UTC without announcement
  • labsdb1005.eqiad.wmnet (jessie) scheduled 2018-11-27 17:30 UTC with announcement
  • labsdb1006.eqiad.wmnet (stretch) scheduled 2018-11-20 17:30 UTC with announcement
  • labsdb1007.eqiad.wmnet (stretch)
  • labsdb1009.eqiad.wmnet (stretch) scheduled 2018-11-28 10:00 UTC without announcement
  • labsdb1010.eqiad.wmnet (stretch) scheduled 2018-11-29 10:00 UTC without announcement
  • labsdb1011.eqiad.wmnet (stretch) scheduled 2018-11-21 13:00 UTC without announcement
GTirloni created this task. Wed, Nov 14, 5:41 PM
GTirloni triaged this task as Normal priority.
Restricted Application added a subscriber: Aklapper. Wed, Nov 14, 5:41 PM

@Halfak labsdb1004/5 would affect wikilabels. We may just do reboots in place like last time due to the tables that don't replicate per: https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#ToolsDB_Backups_and_Replication

  • labsdb1004 is the replica for most tables on 1005, but it is the only server for wikilabels (just so that information is out there).

@akosiaris labsdb1006/7 are involved -- and if we do this with a failover, I want to make sure the additional tables for T201544 are replicated? Can we confirm that?

I'll work with you on this

Take the opportunity to upgrade MySQL too!
Thanks

Banyek moved this task from Triage to In progress on the DBA board. Thu, Nov 15, 11:33 AM

Would next Monday 13:00 UTC 19th Nov work for a couple of reboots? Let's say labsdb1010.eqiad.wmnet and labsdb1011.eqiad.wmnet since these weren't mentioned by @Bstorm to require any special coordination with external people.
(I'm not even sure we should inform our users about the reboots).

@Banyek could you please check MySQL packages beforehand and install/upgrade as required before that date and confirm the date/schedule is good for you?

@akosiaris labsdb1006/7 are involved -- and if we do this with a failover, I want to make sure the additional tables for T201544 are replicated? Can we confirm that?

I don't think it's worth doing the failover dance just for a reboot. It will probably take longer than the master takes to reboot and cause more operational frustration than it's worth. But to answer your question: yes, the tables on labsdb1006 and labsdb1007 are in sync.

akosiaris@labsdb1007:~$ echo '\d' | sudo -u postgres psql -d gis
                     List of relations
 Schema |            Name            |   Type   |  Owner   
--------+----------------------------+----------+----------
 public | coastlines                 | table    | postgres
 public | coastlines_gid_seq         | sequence | postgres
 public | geography_columns          | view     | postgres
 public | geometry_columns           | view     | postgres
 public | land_polygons              | table    | postgres
 public | land_polygons_gid_seq      | sequence | postgres
 public | osmcounts_ecus2012         | table    | osm
 public | osmcounts_ecus2012_id_seq  | sequence | osm
 public | osmcounts_us_cbsa          | table    | osm
 public | osmcounts_us_cbsa_gid_seq  | sequence | osm
 public | osmcounts_us_state         | table    | osm
 public | osmcounts_us_state_gid_seq | sequence | osm
 public | planet_osm_line            | table    | postgres
 public | planet_osm_nodes           | table    | postgres
 public | planet_osm_point           | table    | postgres
 public | planet_osm_polygon         | table    | postgres
 public | planet_osm_rels            | table    | postgres
 public | planet_osm_roads           | table    | postgres
 public | planet_osm_ways            | table    | postgres
 public | raster_columns             | view     | postgres
 public | raster_overviews           | view     | postgres
 public | spatial_ref_sys            | table    | postgres
 public | wiwosm                     | table    | osm
 public | wiwosm_wikidata_languages  | table    | osm
 public | wp_coords_red0             | table    | osm
 public | wp_coords_red2             | table    | osm
 public | wp_coords_red3             | table    | osm
 public | wp_coords_red4             | table    | osm
(28 rows)
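
For anyone repeating this check: the listing above shows which relations exist on labsdb1007 only. Below is a minimal sketch, assuming the `\d` output from both hosts has been captured as text, of comparing the two listings. The function names and the shortened sample listings are hypothetical, and this is not the check that was run in this task. Matching relation names also does not prove the row data is in sync, so replication state would still need to be checked separately.

```python
def parse_relations(psql_output):
    """Return a set of (schema, name, type) tuples from psql `\\d` text output."""
    relations = set()
    for line in psql_output.splitlines():
        parts = [part.strip() for part in line.split("|")]
        # Data rows have exactly four columns; skip the title, header,
        # separator and "(N rows)" lines.
        if len(parts) != 4 or parts[0] == "Schema":
            continue
        schema, name, rel_type, _owner = parts
        relations.add((schema, name, rel_type))
    return relations


def diff_relations(primary_output, replica_output):
    """Return (missing_on_replica, extra_on_replica) as sorted lists."""
    primary = parse_relations(primary_output)
    replica = parse_relations(replica_output)
    return sorted(primary - replica), sorted(replica - primary)


# Shortened, made-up sample listings (not real output from either host).
PRIMARY_LISTING = """
 Schema |      Name       | Type  |  Owner
--------+-----------------+-------+----------
 public | planet_osm_line | table | postgres
 public | wiwosm          | table | osm
(2 rows)
"""

REPLICA_LISTING = """
 Schema |      Name       | Type  |  Owner
--------+-----------------+-------+----------
 public | planet_osm_line | table | postgres
(1 row)
"""

missing, extra = diff_relations(PRIMARY_LISTING, REPLICA_LISTING)
print("missing on replica:", missing)
print("extra on replica:", extra)
```

An empty result for both lists would only mean the two hosts expose the same relation names and types.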

Would next Monday 13:00 UTC 19th Nov work for a couple of reboots? Let's say labsdb1010.eqiad.wmnet and labsdb1011.eqiad.wmnet since these weren't mentioned by @Bstorm to require any special coordination with external people.
(I'm not even sure we should inform our users about the reboots).

@Banyek could you please check MySQL packages beforehand and install/upgrade as required before that date and confirm the date/schedule is good for you?

The time is perfect, but I can only do the upgrades once the hosts are depooled.
If somebody (let's say @Bstorm) prepares the Gerrit patches, I can do the depooling and the MySQL upgrade, and when your tasks are finished I can repool the hosts too. I'd say we should do labsdb1009.eqiad.wmnet too.

@aborrero I'd say it's worthy of notifying users for toolsdb/wikilabels (labsdb1004/5) and possibly osmdb (labsdb1006/7) masters but not the wiki replicas or the secondaries (except wikilabels). The users won't see any significant issue on the replicas.

Thanks @akosiaris for checking. We can just do an in-place.

@Banyek shall we do the replicas on Monday (11/19)? I can throw patches up for the depooling and all that. On at least one of these rounds, the DBA team just kind of did the reboots on those because as long as there is one web and one analytics replica up, we are happy.

We can do labsdb1007 anytime @aborrero -- just make sure postgresql service is down before the upgrades and reboot (I think I'll get that today). For labsdb1006, it's the same with an announcement of brief outage for the reboot--so maybe do it Tues(11/20) with an announce.

For labsdb1004 (secondary), we should make sure that's announced for wikilabels and postgresql is down as well as the mysql upgrades are done with DBAs. labsdb1005 is the toolsdb brief outage (and other readwrite DB outage for things like templatetiger) so that needs an announce. Maybe do labsdb1004 on next Tues(11/20) and 1005 on next Wed (11/21)? @Banyek you can stay up late and work with me, or maybe Arturo wants to work with you during your day on those two? If we like those dates, cloud team can send announcements. It gives better lead time to wait until Thurs, but then we are coming up against holidays.

Bstorm updated the task description. Thu, Nov 15, 4:43 PM

So far it looks like replication is picking up where it left off nicely on labsdb1007 (done).

@Bstorm sure, I can do it all, I can do the OS on those hosts

awight added a subscriber: awight. Thu, Nov 15, 4:58 PM

email announcement has been sent for labsdb1006 reboot next Tuesday 2018-11-20 at 17:30 UTC.

aborrero updated the task description. Thu, Nov 15, 5:03 PM

Due to T209604, the Wikilabels web service needs to be manually restarted whenever labsdb1004 (pgsql.eqiad.wmnet) is rebooted. Please ping me here when the schedule is set, or I'll just keep following the comments.

Thanks @awight. Does 11/20 @ 17:15 UTC sound good? I can work on that reboot while @aborrero does 1006.
@Banyek, will you be around for mysql upgrades or whatever?

We could schedule the non-wikilabels toolsdb master, labsdb1005, for 11/21 @ 17:15 UTC then for consistency. If that one works, I'll get an announcement out.

Thanks @awight. Does 11/20 @ 17:15 UTC sound good? I can work on that reboot while @aborrero does 1006.

That would be great for me, thanks for the note! To be honest, the wikilabels service is non-critical, but we like to keep uptime out of respect for our small band of loyal labelers.

So, speaking about 1009-1011....are you guys sure you want to reboot all the servers the same day?
It wouldn't be the first time we see issues after reboots, so my suggestion would be to spread the reboots out: instead of doing all of them at the same time, leave 24h between reboots. Keep in mind that mysql_upgrade on those hosts takes a long while to finish, and we'd be leaving all the hosts in a cold state at once.
Feel free to ignore me if you think it is not needed :-)

Sure. Thanks for the advice; the problem is that I don't know what they run or do :-)

Will this new schedule work?

  • labsdb1011.eqiad.wmnet scheduled 2018-11-19 13:00 UTC without announcement
  • labsdb1010.eqiad.wmnet scheduled 2018-11-20 13:00 UTC without announcement
  • labsdb1009.eqiad.wmnet scheduled 2018-11-21 13:00 UTC without announcement
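
The 24h spacing suggested earlier can be checked mechanically. A minimal sketch, assuming the schedule is kept as a mapping of host name to UTC start time; the function name `too_close` is hypothetical, and the 24h threshold is the minimum gap mentioned in this thread:

```python
from datetime import datetime, timedelta, timezone


def too_close(schedule, min_gap=timedelta(hours=24)):
    """Return pairs of consecutive hosts whose reboots are less than min_gap apart."""
    ordered = sorted(schedule.items(), key=lambda item: item[1])
    conflicts = []
    # After sorting by time, checking neighbours is enough to find any
    # violation of the minimum gap.
    for (host_a, time_a), (host_b, time_b) in zip(ordered, ordered[1:]):
        if time_b - time_a < min_gap:
            conflicts.append((host_a, host_b))
    return conflicts


def utc(day, hour):
    """Helper for November 2018 dates in UTC."""
    return datetime(2018, 11, day, hour, 0, tzinfo=timezone.utc)


# The staggered schedule proposed above.
staggered = {
    "labsdb1011": utc(19, 13),
    "labsdb1010": utc(20, 13),
    "labsdb1009": utc(21, 13),
}

# A hypothetical schedule with two hosts in the same window.
same_window = {
    "labsdb1011": utc(21, 13),
    "labsdb1010": utc(21, 13),
}

print(too_close(staggered))    # []
print(too_close(same_window))  # [('labsdb1011', 'labsdb1010')]
```

Passing `min_gap=timedelta(hours=48)` would apply a stricter spacing.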

I will let @Banyek decide on the actual day/time :-)

aborrero claimed this task. Fri, Nov 16, 9:00 AM
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

We stopped supporting mariadb on jessie some months ago- I am not sure you will have packages to upgrade to.

Yeah, I was talking about 1009-1011 not toolsdb :-)

We stopped supporting mariadb on jessie some months ago- I am not sure you will have packages to upgrade to.

Well that makes it "easier" to do toolsdb. Hope to get that on stretch soon. In that case, my only concern is having a DBA around in case it goes badly.

@Banyek Just looking to confirm that you will be available during the Toolsdb primary and secondary reboots as support to verify things are working correctly and help if not for 11/20 @ 17:15 for labsdb1004 and 11/21 @ 17:15 for labsdb1005.

I want to announce that asap.

@Bstorm I am available. Sorry for the late answer, I was OOO.

Mentioned in SAL (#wikimedia-operations) [2018-11-19T11:33:45Z] <gtirloni> labsdb1011 upgraded packages on labsdb1011 (pre-work T209517)

Will this new schedule work?

  • labsdb1011.eqiad.wmnet scheduled 2018-11-19 13:00 UTC without announcement
  • labsdb1010.eqiad.wmnet scheduled 2018-11-20 13:00 UTC without announcement
  • labsdb1009.eqiad.wmnet scheduled 2018-11-21 13:00 UTC without announcement

Nobody confirmed this schedule yet. Please @Banyek or @Marostegui comment, confirm or propose a new one :-)

Sorry I missed this. For some reason I had a different time in mind, so this is entirely my fault. What shall we do now?

Can we do labsdb1011.eqiad.wmnet later today? I've put the others in my calendar so I don't miss them.

@Banyek I think as long as it works for you, and they are all on different days, it's fine for the wiki replicas.

For labsdb1004 and 1005, 11/20 @ 17:15 for labsdb1004 work for you?
I just want DBA support around, and need lead time--but 04 is still the secondary. At this point, I'm still happy to do labsdb1004, but I think I'm pushing off labsdb1005 until after the holiday, since that one is the largest outage for our users (lots of folks will likely need to restart their apps), and it needs lead time for people to notice.

I can announce labsdb1005 for 17:30 UTC on Tues Nov 27th if that works for you.

Yes, those dates are good for me; I'll put them in the calendar.
For the missed labsdb1011, I propose 2018-11-21 13:00 UTC then.

This works for me. Could we do both labsdb1011 and labsdb1010 on the same slot?

aborrero updated the task description. Mon, Nov 19, 4:34 PM

same slot = you mean same day?

Yes, same day and same 'window'. I.e, reboot one, check if that was fine, and then reboot the other.
The scheduling we are talking about is 2018-11-21 13:00 UTC for labsdb1010/1011.

Bstorm updated the task description. Mon, Nov 19, 4:48 PM

I would advise against that. These servers are pretty big, we are running multisource replication there, and we will be upgrading to a new version of MySQL. Aside from the fact that running mysql_upgrade there takes a long while, I would like to leave at least 24h (if not 48h) between reboots to make sure everything is OK and to let the buffer pool warm up a bit, so we don't end up with two hosts completely cold on the same day.

Oh, you already mentioned that. Sorry for the noise. We will then do only one, labsdb1011, and schedule the others for other days.

aborrero updated the task description. Mon, Nov 19, 4:53 PM
Marostegui added a comment. Edited Mon, Nov 19, 4:53 PM

Thanks for understanding :-)

Bstorm updated the task description. Mon, Nov 19, 4:57 PM

Change 474751 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wiki replicas: depool labsdb1011

https://gerrit.wikimedia.org/r/474751

Mentioned in SAL (#wikimedia-operations) [2018-11-20T13:06:00Z] <banyek> depooling labsdb1011 (T209517)

Change 474751 merged by Banyek:
[operations/puppet@production] wiki replicas: depool labsdb1011

https://gerrit.wikimedia.org/r/474751

wmf-pt-kill was not able to start on labsdb1011 after reboot, I needed to create a new package for it with DSN F=/dev/null

Mentioned in SAL (#wikimedia-operations) [2018-11-20T15:41:14Z] <banyek> uploaded wmf-pt-kill_2.2.20-1+wmf5 packages to stretch-wikimedia (T209517)

Mentioned in SAL (#wikimedia-operations) [2018-11-20T15:51:42Z] <banyek> repooling labsdb1011 (T209517)

Banyek updated the task description. Tue, Nov 20, 3:54 PM

Mentioned in SAL (#wikimedia-operations) [2018-11-20T17:25:37Z] <bstorm_> rebooting labsdb1004 for upgrades T209517

Mentioned in SAL (#wikimedia-operations) [2018-11-20T18:03:38Z] <bstorm_> rebooting labsdb1006 for upgrades T209517

aborrero updated the task description. Tue, Nov 20, 6:04 PM

Note: labsdb1004's remote serial terminal seems broken. labsdb1006 looked bad, but recovered after reboot.

I also see permissions errors on labsdb1006 for many tables @akosiaris. It predates the reboot and activity. I don't know what is getting blocked. Puppet was broken for a while because of https://gerrit.wikimedia.org/r/#/c/474955/

I don't know if that could have affected this server. The earliest logs that mentioned the issue were from yesterday, but it seems to trim the log pretty quickly. It may be that this is normal :)

Banyek moved this task from Backlog to In progress on the User-Banyek board. Wed, Nov 21, 9:28 AM
Banyek updated the task description. Mon, Nov 26, 5:03 PM

Mentioned in SAL (#wikimedia-operations) [2018-11-27T17:25:19Z] <arturo> T209517 icinga downtime labsdb1005

awight removed a subscriber: awight. Tue, Nov 27, 5:28 PM

Mentioned in SAL (#wikimedia-operations) [2018-11-27T17:30:49Z] <bstorm_> T209517 icinga downtime labsdb1004

Mentioned in SAL (#wikimedia-operations) [2018-11-27T17:40:38Z] <bstorm_> T209517 rebooted labsdb1005 after upgrades

Bstorm updated the task description. Tue, Nov 27, 5:52 PM

Change 476079 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wiki replicas: depool labsdb1009 for upgrades

https://gerrit.wikimedia.org/r/476079

Mentioned in SAL (#wikimedia-operations) [2018-11-28T10:00:04Z] <banyek> depooling labsdb1009 (T209517)

Change 476079 merged by Banyek:
[operations/puppet@production] wiki replicas: depool labsdb1009 for upgrades

https://gerrit.wikimedia.org/r/476079

Banyek updated the task description. Wed, Nov 28, 11:20 AM

Change 476412 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] wiki replicas: depool labsdb1010 for upgrades

https://gerrit.wikimedia.org/r/476412

Mentioned in SAL (#wikimedia-operations) [2018-11-29T10:01:26Z] <banyek> depooling labsdb1010 due of maintenance - T209517

Change 476412 merged by Banyek:
[operations/puppet@production] wiki replicas: depool labsdb1010 for upgrades

https://gerrit.wikimedia.org/r/476412

Mentioned in SAL (#wikimedia-operations) [2018-11-29T12:01:00Z] <banyek> repooling labsdb1010 after upgrades - T209517
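
The SAL timestamps recorded in this task give a rough sense of how long each wiki replica stayed depooled. A minimal sketch that computes the depool-to-repool window from those entries; the helper name `depooled_for` is hypothetical, and the window covers the whole maintenance (package upgrades, reboot, mysql_upgrade, checks), not mysql_upgrade alone. No repool entry for labsdb1009 appears in this thread, so it is left out.

```python
from datetime import datetime

# Timestamp format used in the SAL entries quoted in this task.
SAL_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def depooled_for(depool_ts, repool_ts):
    """Return the time between a depool and a repool SAL timestamp."""
    start = datetime.strptime(depool_ts, SAL_FORMAT)
    end = datetime.strptime(repool_ts, SAL_FORMAT)
    return end - start


# (depool, repool) timestamps copied from the SAL mentions in this task.
windows = {
    "labsdb1011": ("2018-11-20T13:06:00Z", "2018-11-20T15:51:42Z"),
    "labsdb1010": ("2018-11-29T10:01:26Z", "2018-11-29T12:01:00Z"),
}

for host, (depool_ts, repool_ts) in windows.items():
    print(host, depooled_for(depool_ts, repool_ts))
# labsdb1011 2:45:42
# labsdb1010 1:59:34
```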

Banyek updated the task description. Thu, Nov 29, 12:10 PM

All servers are rebooted and upgraded. I think the task can be closed, but I'll leave that honor to others.

Thanks for taking care of this!