
Labsdb* servers need to be rebooted
Closed, Resolved · Public

Description

These labsdb boxes need to be rebooted to upgrade to the new kernels:

  • labsdb1001
  • labsdb1003
  • labsdb1004
  • labsdb1005
  • labsdb1006
  • labsdb1007
  • labsdb1009
  • labsdb1010
  • labsdb1011

Duration: 1-hour maintenance window, with 5 minutes of service unavailability

Event Timeline


Hi @madhuvishy

I can take care of:
labsdb1009
labsdb1010
labsdb1011

However, for labsdb1001 and labsdb1003 you guys are probably in a better position to take care of those as you'd need to let the users know there will be a short downtime.
For labsdb1004 and 1005 (toolsdb) I guess it would be the same thing: you'd need to let the users know it will go read-only while the maintenance happens (unless you do a failover - I haven't touched toolsdb much, so I am probably not the best person to advise here; @jcrespo probably knows better).

For labsdb1006 and 1007 @akosiaris can probably advise.

I don't think labsdb1008 exists

labsdb1004 is used by wikilabels (which is used by ORES). The reboot should be synced with Aaron Halfaker; last time he added a note to notify external consumers of wikilabels of the maintenance.

labsdb100[67] could simply be rebooted without further precautions.

labsdb1004 is also a slave for toolsdb; labsdb1005 connections can be redirected to it while the other reboots. labsdb1001 and 1003 also need a MySQL upgrade, but that will take 30+ minutes. There is also a high chance that they will not come back up after the reboot.

JFTR, fixed kernels/libc/libffi are installed on all of those hosts.

JFTR, fixed kernels/libc/libffi are installed on all of those hosts.

So only reboots pending then?

labsdb1004 is used by wikilabels (which is used by ORES). The reboot should be synced with Aaron Halfaker; last time he added a note to notify external consumers of wikilabels of the maintenance.

@Halfak ^. Downtime should be a few minutes.

labsdb100[67] could simply be rebooted without further precautions.

Yes, that's true these days. It's effectively used only by the maps labs project, and IIRC all the apps in it can survive a few minutes of downtime.

Yes, only needs reboots (and in the case of 1004 some coordination with Aaron)

I do not mind rebooting labsdb1001 and 1003 too myself, but dns changes and announcements have to be handled by the cloud team.

labsdb1006 didn't finish setup - it probably won't happen before stretch, given that it will likely require another rebuild (T157359).

Mentioned in SAL (#wikimedia-operations) [2017-06-22T08:20:47Z] <marostegui> Stop MySQL and reboot labsdb1009 - T168584

Mentioned in SAL (#wikimedia-operations) [2017-06-22T08:44:53Z] <marostegui> Stop MySQL and reboot labsdb1010 - T168584

Mentioned in SAL (#wikimedia-operations) [2017-06-22T08:55:55Z] <marostegui> Stop MySQL and reboot labsdb1011 - T168584

The following hosts were rebooted and are running 4.9.0-0.bpo.3-amd64:

labsdb1009
labsdb1010
labsdb1011
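
(For reference, a minimal way to confirm the running kernel on each of those hosts - a sketch, assuming interactive SSH access to them:)

  # Print the kernel each rebooted host is currently running
  for h in labsdb1009 labsdb1010 labsdb1011; do
    printf '%s: ' "$h"
    ssh "$h" uname -r
  done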
Andrew updated the task description.

Hi all,

So current status is:

  • labsdb1001 and 1003: Cloud team needs to announce user maintenance, and handle the dns switchover during the reboots (I'm not sure what this entails). Jaime can do the reboots. It looks like we are also considering doing mysql upgrades in the same window - this seems to me like combining two things, but if y'all say it's the best time to do it, +1
  • labsdb1004: Cloud team needs to announce user maintenance for toolsdb, and coordinate with Aaron for wikilabels. DBAs will handle the reboot.
  • labsdb1005: Cloud team needs to announce user maintenance for toolsdb, toolsdb connections (will it go read-only?) can be pointed to 1004 during this reboot (cloud team), and DBAs will do the reboot.

@jcrespo I can send out an announcement today and verify with Aaron, do you have a preferred time window? I propose Monday after ops meeting (17:00 UTC).

I propose Monday after ops meeting (17:00 UTC).

I am sorry, but that is outside of my working hours.

@jcrespo no problem! let me know what time works for you :) I can do earlier on Monday too. Would 14:00 UTC work? Feel free to propose a suitable time if not. Thanks so much!

So there are several things here - the dns change and the actual reboot. There should be some time between them. I say you (as in, anyone on your team) change dns on day 1 at the end of your day (but with enough room to monitor that the databases don't break), and then I reboot them at the beginning of mine, after all (or most) connections have been dropped. We should ping Chris so he is available those days because, as I said, the servers may not boot back up.

The following day, or some days after that (it may take more than one day to warm up), we do the same, but with the other host.

A similar thing would happen with labsdb1005, but that should in theory be faster, so it could be done in sync and more quickly. Not all databases are replicated; these 3 databases are unavailable on server switch: T127164.

One heads up: some tool users do not try to reconnect on connection failure. That is a problem with their applications (they should retry the connection once to handle failover), and there is not much we can do about it.

Also, user databases on the replicas are not lost, but cannot be made available during the switch - that is why using toolsdb is preferred.
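
(A rough illustration of the "after all (or most) connections have been dropped" step above - a sketch assuming a shell on the database host with credentials in root's .my.cnf; the threshold is arbitrary:)

  # Poll the number of connected clients and proceed once only a handful remain
  while true; do
    count=$(mysql -N -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'" | awk '{print $2}')
    echo "connected clients: $count"
    [ "$count" -le 5 ] && break   # arbitrary cut-off for "most connections dropped"
    sleep 60
  done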

Hey folks! I've been traveling and just getting caught up. It looks like we ought to make a quick announcement for Wiki labels. I can get a notice out as soon as we settle on timing. I want 1-2 days notice before we reboot to get the notice out. Generally, I don't mind waking up early/staying up late to match timing with whoever is doing the reboot. Just let me know what time works for you.

Thanks for the detailed explanation @jcrespo. For labsdb1001 and 1003, I'll check with Chris and schedule the dns switchover and the reboots to happen this week/early next week.

For labsdb1004 and 5, we discussed this today at our team meeting, and are wondering if it would be easier to just reboot the boxes without switching connections over to the slave. Tools will fail anyway when we switch connections, and we'd have to manually reboot things, and then do it again when we switch back. It may be less impactful for tools to announce an expected toolsdb outage window and just reboot. What do you think?

@Cmjohnson Hi! We are looking at rebooting labsdb1001 and 1003, and it seems like these boxes may not come up automatically on reboot. Jaime recommended that it would be better to have you on site when we do this. The reboots for the boxes would happen on 2 separate days. What days are you in the DC this week and next, so we can schedule these? Let me know if you have any other scheduling constraints too. Thank you :)

it would be easier to just reboot the boxes

I am ok with that if you are ok with that. An announcement should be made, though - during the last upgrade people got upset even though we only had seconds of downtime during the switch. For 1004/5 it should be relatively fast to reboot this time.

@jcrespo Apologies for the delay. Can we start with just labsdb1005 first, and attempt to do it Wednesday July 5, and labsdb1004 on Thursday July 6, provided the first one goes well? I'll let you pick the time, me/folks from cloud team can be available as early as 14 UTC.

We rebooted the labstore (NFS) boxes yesterday - which did not go well (T169289) - and I'm thinking we should get any kinks worked out, just in case, when @MoritzMuehlenhoff is back, before doing labsdb1001 and 1003 - even though the kernel issues that surfaced for the NFS servers may be unrelated.

@madhuvishy I am out all next week and will be back July 11.

@Cmjohnson Okay thanks for letting me know, I'll schedule the labsdb1001 and 1003 reboots (the ciscos), for after you are back then. When are you in the DC (from 11th to 14th)?

@madhuvishy I typically get to the DC around 1400 UTC (10am EST).

I would do labsdb1004 first, which is the slave for toolsdb, and then labsdb1005 - I didn't want to pressure you because I knew you had other concerns. I would say Tuesday and Wednesday at 1400 UTC. Can you announce it?

@jcrespo, okay, I'll do the announcements.

@Halfak We are proposing labsdb1004 reboot (wikilabels db server) for Tuesday 11 July at 1400 UTC. Would that work for you?

Someone announced 60 seconds of downtime, which I do not think is reasonable - fully rebooting a server and all its services takes around 3-5 minutes, and that is assuming everything goes well, and without taking into account the time for services to come up and go down, which is quite long because of the caches writing to disk.

Normally a 1-hour maintenance window is announced, with 5 minutes of service unavailability (we should always announce the worst case scenario, not the best).

Announcements have been updated. Thanks for the note.

Shall we always announce a 1 hour maintenance window for DB maintenance?

It varies from maintenance to maintenance, depending on the work to be done. Some take more, some take less - the "normally" was meant as "normally we should announce a larger maintenance window than the one needed - for example, in this case 1h/5m would be more reasonable".

Halfak updated the task description.

Gotcha. Next time, we should add these details to the task description and I'll pick them up from there when making announcements. :) In this case, I think we're all set.

Mentioned in SAL (#wikimedia-operations) [2017-07-11T14:21:23Z] <jynus> rebooting labsdb1004 for kernel upgrade T168584

Status: labsdb1005 reboot is scheduled for July 12 at 1400 UTC.

We've decided to wait on the labsdb1001 and 1003 reboots for now - these boxes haven't been rebooted in over a year, they are set up with RAID 0, and we have no cushion to handle a disk failure on reboot. The plan is to wait for labsdb1009-11 to be available as functional alternatives for users, and then, once the usage of 1001 and 1003 has dropped somewhat, schedule the reboots for 1001 and 1003. @Marostegui is working on the new servers, and the hope is to be able to do the reboots in the next 4-6 weeks.

How do you guys want to proceed with this in the end? Is it worth the risk?

@Marostegui We talked about this today in our meeting, and think that since we don't have significant user traffic moved over from 1001/3 to the new WikiReplica servers yet, we should hold off on rebooting these servers for longer, given that Moritz mentioned during our last discussion that we can afford to hold off, and the immediate attack vectors have already been plugged.

[14:50:28] <moritzm>	 so, these servers have the immediate attack vectors plugged (glibc ld.so, exim and sudo), the reboot is required to fix the underlying problem on the kernel level
[14:51:13] <madhuvishy>	 right, okay
[14:51:44] <moritzm>	 if there's a risk of some data loss (and raid0 for the data partion sounds a bit like it), we can also hold this back, but it would be good if we setup teh new servers in a way that it allows us to also reboot these servers in the future
[14:52:09] <chasemp>	 moritzm: sounds good, thank you
[14:52:17] <jynus>	 moritzm: the new ones
[14:52:21] <jynus>	 have that
[14:52:40] <jynus>	 we have haproxy in the middle, and it is extremely simple
[14:52:47] <moritzm>	 lets avoid trouble, then and fix this by migrating to the new servers

I vote we close this as "resolved" with a note that 1001/3 have not been rebooted because of the fear of catastrophic hardware failure and their impending decomm.

@Marostegui We talked about this today in our meeting, and think that since we don't have significant user traffic moved over from 1001/3 to the new WikiReplica servers yet, we should hold off on rebooting these servers for longer, given that Moritz mentioned during our last discussion that we can afford to hold off, and the immediate attack vectors have already been plugged.

Sounds good to me!

I vote we close this as "resolved" with a note that 1001/3 have not been rebooted because of the fear of catastrophic hardware failure and their impending decomm.

fine by me!

1001/3 have not been rebooted because of the fear of catastrophic hardware failure and their impending decomm.

JFTR, since I didn't see it mentioned either here or in T142807, how impending is that decomm? Days/weeks/months?

No definite date has been set as we are working on T173511 as the precursor to moving over quarry (and probably PAWS). I think we are talking months as long as we can do it gracefully.

I do not think we should postpone the reboots too much; my proposal would be to:

  0. document access to the new hosts (bare essentials)
  1. announce the upcoming changes and encourage users to test the new hosts
  2. announce the reboot, and use it as an "excuse" to encourage users to test the new hosts. Tell users to back up anything that cannot be lost (which shouldn't be on the replica labsdbs in the first place; see the backup sketch below)
  3. perform the maintenance

That should be doable within a month, even if it doesn't mean we fully and definitively decommission labsdb1001/3.
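
(For item 2 above, a sketch of what a tool maintainer might run to back up one of their user databases before the maintenance; the host alias, database name, and credentials file follow the usual Toolforge conventions and are assumptions here:)

  # Dump a user database to a file using the tool's own credentials
  mysqldump --defaults-file="$HOME/replica.my.cnf" \
    -h tools.labsdb s12345__mydb > s12345__mydb.sql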

@jcrespo's plan sounds like a good one. Working on the announcement of the new cluster was already on my todo list for today, so I'll add foreshadowing of the current cluster reboot. If we do lose a disk on 1001/3 to the power cycle, though, it will be hard to recover, so we should figure out T173511: Implement technical details and process for "datasets_p" on wikireplica hosts before we actually reboot either of them.

Reopening since we are scheduling the labsdb1001 and 1003 reboots over the next couple weeks.

Proposed timing for the 2 reboots:

labsdb1001: Monday Oct 30 2017, 14:30 UTC (16:30 Madrid, 10:30 EST, 07:30 PT)
labsdb1003: Tuesday Nov 07 2017, 14:30 UTC (16:30 Madrid, 10:30 EST, 07:30 PT)

I've verified with @Cmjohnson that he can be around at the DC during these days and times.

Looks good to me! Thanks for getting this arranged

I installed the latest trusty kernels on labsdb1001/1003.

If we do lose a disk on 1001/3 to the power cycle, though, it will be hard to recover, so we should figure out T173511: Implement technical details and process for "datasets_p" on wikireplica hosts before we actually reboot either of them.

Per T173511#3713170, having a long term solution for curated datasets is no longer a blocker for the reboots.

Change 386660 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] labsdb: Switchover dns for labsdb1001 shards to labsdb1003

https://gerrit.wikimedia.org/r/386660

Mentioned in SAL (#wikimedia-operations) [2017-10-30T13:21:00Z] <marostegui> Set innodb_max_dirty_pages_pct = 10 on labsdb1001 so it powers off a bit faster - T168584
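
(For context, the setting in that SAL entry tells InnoDB to flush dirty pages down to roughly 10% of the buffer pool ahead of the shutdown, so the later "stop MySQL" step is quicker - a sketch assuming a root MySQL session on labsdb1001:)

  # Lower the dirty-page ceiling so InnoDB flushes to disk before the shutdown
  mysql -e "SET GLOBAL innodb_max_dirty_pages_pct = 10;"
  # Watch the dirty-page count fall before actually stopping the service
  mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';"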

Change 386660 merged by Madhuvishy:
[operations/puppet@production] labsdb: Switchover dns for labsdb1001 shards to labsdb1003

https://gerrit.wikimedia.org/r/386660

Mentioned in SAL (#wikimedia-operations) [2017-10-30T14:30:10Z] <marostegui> Stop replication and MySQL on labsdb1001 - T168584

Mentioned in SAL (#wikimedia-operations) [2017-10-30T17:08:18Z] <madhuvishy> Revert dns switchover for c1 shards to c3 post labsdb1001 reboot T168584

The 1001 reboot is all done. Notes from my planning etherpad:

labsdb1001 (Planned for Oct 30 2017 14:30 UTC)

  • Pre-reboot
    • One hour before scheduled reboot (13:30 UTC/6:30 PT) - switch dns for 1001 shards over to 1003 [DONE]
      • Patch - https://gerrit.wikimedia.org/r/#/c/386660 [DONE]
      • Updating the indicated file should update the file on labservices1001 and 1002 with a puppet run, and that should restart pdns-recursor to pick up the change
      • foo.labsdb is a stub zone that the pdns-recursors handle so it's all there
      • dig @labs-recursor0.wikimedia.org enwiki.labsdb (test to labservices1001) [looks good]
      • dig @labs-recursor1.wikimedia.org enwiki.labsdb (test to labservices1002) [looks good]
      • dns caches may abound? We may have to clush out to the tools hosts to clear nscd caches at that time (I think it is, as root, nscd --invalidate=hosts); see the condensed sketch after this checklist
      • (marostegui) Set innodb_max_dirty_pages_pct = 10 [DONE]
  • During
    • Make sure Chris is at DC [DONE]
    • Icinga downtime for labsdb1001 [DONE]
    • Announce to mailing list [DONE]
    • Status update on IRC channel [DONE]
  • Reboot
  • Post-reboot
    • Start mysql without replication: /etc/init.d/mysql start --skip-slave-start [DONE]
    • Check databases, tables, and do some selects to make sure data is there [DONE]
    • Make labsdb1001 idempotent again on mysql ( set global slave_exec_mode = IDEMPOTENT; ) [DONE]
    • Start replication: start all slaves; [DONE]
    • Revert dns switchover https://gerrit.wikimedia.org/r/#/c/386660 if things go well. [DONE]
    • Announce to mailing list [DONE]
    • Status update on IRC channel [DONE]
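
(Taking the checklist above together, a condensed sketch of the DNS verification and the post-reboot MySQL steps - the dig and mysql commands come from the checklist itself; the clush target group for the nscd cache flush is an assumption:)

  # Pre-reboot: confirm both labs recursors resolve the shard alias as expected
  dig @labs-recursor0.wikimedia.org enwiki.labsdb +short
  dig @labs-recursor1.wikimedia.org enwiki.labsdb +short

  # If stale answers linger on clients, flush nscd's host cache (as root); group name assumed
  clush -w @tools 'nscd --invalidate=hosts'

  # Post-reboot: start MySQL without replication, sanity-check the data, then resume replication
  /etc/init.d/mysql start --skip-slave-start
  mysql -e "SET GLOBAL slave_exec_mode = IDEMPOTENT;"
  mysql -e "START ALL SLAVES;"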

Please check: T179464
labsdb1001 has crashed and the storage looks totally broken; it's hard to say if it is because of the reboot, but I wouldn't be surprised if it is.

We should consider labsdb1001 broken for good and decommission it - we need to decide whether we want to continue with the plan and reboot labsdb1003. I wouldn't do it, to be honest.

We should consider labsdb1001 broken for good and decommission it - we need to decide whether we want to continue with the plan and reboot labsdb1003. I wouldn't do it, to be honest.

Ack, let's avoid that. It's not unlikely that the same hw error might also strike 1003.

What's the realistic time frame for 1003 to be around until its use cases are replaced by the new-style servers?

Let's just keep 1003 running w/o reboot then.

Let's just keep 1003 running w/o reboot then.

+2. @madhuvishy and I had already decided that rebooting labsdb1003 with only 6 weeks left before decomm was too risky following the failure of labsdb1001.

FYI @Cmjohnson: we are not doing the labsdb1003 reboot on Tuesday Nov 7, due to T179464.