These labsdb boxes need to be rebooted to upgrade to the new kernels:
- labsdb1001
- labsdb1003
- labsdb1004
- labsdb1005
- labsdb1006
- labsdb1007
- labsdb1009
- labsdb1010
- labsdb1011
Duration: 1 hour of maintenance window, with 5 minutes of service unavailability
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| labsdb: Switchover dns for labsdb1001 shards to labsdb1003 | operations/puppet | production | +6 -6 |
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | jcrespo | T140788 Labs databases rearchitecture (tracking) |
| Resolved | | bd808 | T166402 Program 7 Outcome 3: data services |
| Resolved | | bd808 | T142807 Migrate all users to new Wiki Replica cluster and decommission old hardware |
| Duplicate | | None | T168445 Reboots of cloud servers |
| Resolved | | madhuvishy | T168584 Labsdb* servers need to be rebooted |
Hi @madhuvishy
I can take care of:
- labsdb1009
- labsdb1010
- labsdb1011
However, for labsdb1001 and labsdb1003 you guys are probably in a better position to take care of those as you'd need to let the users know there will be a short downtime.
For labsdb1004 and 1005 (toolsdb) I guess it would be the same thing: you'd need to let the users know that they will go read-only while the maintenance happens (unless you do a failover; I haven't touched toolsdb much, so I am probably not the best person to advise here, @jcrespo probably knows better).
For labsdb1006 and 1007 @akosiaris can probably advise.
I don't think labsdb1008 exists.
labsdb1004 is used by wikilabels (which is used by ORES). The reboot should be synced with Aaron Halfaker; last time, he added a note for external consumers of wikilabels to notify them of the maintenance.
labsdb100[67] could simply be rebooted without further precautions.
labsdb1004 is also a slave for toolsdb; labsdb1005 connections can be redirected to it while the other reboots. labsdb1001 and 1003 also need a MySQL upgrade, but that will take 30+ minutes. There is also a high chance that they will not come up after reboot.
@Halfak ^. Downtime should be a few minutes.
> labsdb100[67] could simply be rebooted without further precautions.
Yes, that's true these days. It's effectively used only by the maps labs project, and IIRC all the apps in it can survive a few minutes of downtime.
I do not mind rebooting labsdb1001 and 1003 myself too, but the DNS changes and announcements have to be handled by the cloud team.
labsdb1006 didn't finish setup; that probably won't happen before stretch, given that it will likely require another rebuild (T157359).
Mentioned in SAL (#wikimedia-operations) [2017-06-22T08:20:47Z] <marostegui> Stop MySQL and reboot labsdb1009 - T168584
Mentioned in SAL (#wikimedia-operations) [2017-06-22T08:44:53Z] <marostegui> Stop MySQL and reboot labsdb1010 - T168584
Mentioned in SAL (#wikimedia-operations) [2017-06-22T08:55:55Z] <marostegui> Stop MySQL and reboot labsdb1011 - T168584
The following hosts have been rebooted and are running 4.9.0-0.bpo.3-amd64:
- labsdb1009
- labsdb1010
- labsdb1011
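A quick way to confirm a fleet is on the target kernel is to compare Debian-style release strings. A minimal sketch follows; the helper functions and the host dictionary are illustrative assumptions, and only the version format `4.9.0-0.bpo.3-amd64` comes from the report above:

```python
def parse_kernel(release: str):
    """Extract the numeric version and backport revision from a
    Debian kernel release string like '4.9.0-0.bpo.3-amd64'."""
    base, _, _rest = release.partition("-")
    version = tuple(int(p) for p in base.split("."))
    # Pull the bpo revision if present, default to 0 otherwise.
    bpo = 0
    if ".bpo." in release:
        bpo = int(release.split(".bpo.")[1].split("-")[0])
    return version + (bpo,)

def kernel_at_least(running: str, required: str) -> bool:
    """True if the running kernel is at or beyond the required one."""
    return parse_kernel(running) >= parse_kernel(required)

# Hypothetical per-host results, e.g. collected via `uname -r`.
hosts = {
    "labsdb1009": "4.9.0-0.bpo.3-amd64",
    "labsdb1010": "4.9.0-0.bpo.3-amd64",
    "labsdb1011": "4.9.0-0.bpo.3-amd64",
}
pending = [h for h, k in hosts.items()
           if not kernel_at_least(k, "4.9.0-0.bpo.3-amd64")]
print(pending)  # [] -> every host is on the target kernel
```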
Hi all,
So the current status is:
@jcrespo I can send out an announcement today and verify with Aaron, do you have a preferred time window? I propose Monday after ops meeting (17:00 UTC).
> I propose Monday after ops meeting (17:00 UTC).
I am sorry, but that is outside of my working hours.
@jcrespo no problem! let me know what time works for you :) I can do earlier on Monday too. Would 14:00 UTC work? Feel free to propose a suitable time if not. Thanks so much!
So there are several things here: the DNS change and the actual reboot, and there should be some time between them. I say you (as in, anyone on your team) change DNS on day 1 at the end of your day (but with enough room to monitor that the databases don't break), and then I reboot them at the beginning of mine, after all (or most) connections have been dropped. We should ping Chris so he is available those days because, as I said, the servers may not boot back up.
The following day, or some days after that (it may take more than one day to warm up), we do the same but with the other host.
A similar thing would happen with labsdb1005, but that should in theory be faster, so it could be done in sync and more quickly. Not all databases are replicated; these 3 databases are unavailable on server switch: T127164.
One heads-up: some tool users do not try to reconnect on connection failure. That is a problem with their applications (they should retry the connection once because of the failover) and there is not much we can do about it.
Also, user databases on the replicas are not lost, but they cannot be made available during the switch; that is why using toolsdb is preferred.
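The reconnect-once behaviour described here can be sketched as a small wrapper around whatever connect call a tool uses. Everything below (the wrapper, the exception type, the retry and delay values) is an illustrative assumption, not an existing Wikimedia helper:

```python
import time

class ConnectionFailed(Exception):
    """Stand-in for whatever error a client raises on a dropped connection."""

def connect_with_retry(connect, retries=1, delay=2.0):
    """Call connect(); on failure, retry a limited number of times.
    A single retry after a short pause is usually enough to ride out a
    planned failover, where the old connection is dropped and the
    service comes back under the same name."""
    attempt = 0
    while True:
        try:
            return connect()
        except ConnectionFailed:
            if attempt >= retries:
                raise
            attempt += 1
            time.sleep(delay)

# Simulated failover: the first attempt fails, the second succeeds.
state = {"calls": 0}
def flaky_connect():
    state["calls"] += 1
    if state["calls"] == 1:
        raise ConnectionFailed("server is restarting")
    return "connected"

print(connect_with_retry(flaky_connect, delay=0.01))  # prints: connected
```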
Hey folks! I've been traveling and just getting caught up. It looks like we ought to make a quick announcement for Wiki labels. I can get a notice out as soon as we settle on timing. I want 1-2 days notice before we reboot to get the notice out. Generally, I don't mind waking up early/staying up late to match timing with whoever is doing the reboot. Just let me know what time works for you.
Thanks for the detailed explanation @jcrespo. For labsdb1001 and 1003, I'll check with Chris and schedule the dns switchover and the reboots to happen this week/early next week.
For labsdb1004 and 5, we discussed this today at team meeting, and are wondering if it would be easier to just reboot the boxes without switching connections over to the slave. Tools will fail anyway when we switch connections, and we'd have to manually reboot things, and do it again when we switch back. It may be less impactful for tools to announce an expected toolsdb outage window and just reboot. What do you think?
@Cmjohnson Hi! We are looking at rebooting labsdb1001 and 1003, and it seems like these boxes may not come up automatically on reboot. Jaime recommended that it would be better to have you on site when we do this. The reboots for the boxes would happen on 2 separate days. What days are you in the DC this week and next, so we can schedule these? Let me know if you have any other scheduling constraints too. Thank you :)
> it would be easier to just reboot the boxes
I am OK with that if you are OK with that. An announcement should be made, though; during the last upgrade people got upset even though we had only seconds of downtime during the switch. For labsdb1004/1005 it should be relatively fast to reboot this time.
@jcrespo Apologies for the delay. Can we start with just labsdb1005 first, and attempt to do it Wednesday July 5, and labsdb1004 on Thursday July 6, provided the first one goes well? I'll let you pick the time, me/folks from cloud team can be available as early as 14 UTC.
We rebooted the labstore (NFS) boxes yesterday, which did not go well (T169289), and I'm thinking we get any kinks worked out, just in case, once @MoritzMuehlenhoff is back, before doing labsdb1001 and 1003, even though the kernel issues that surfaced for the NFS servers may be unrelated.
@Cmjohnson Okay thanks for letting me know, I'll schedule the labsdb1001 and 1003 reboots (the ciscos), for after you are back then. When are you in the DC (from 11th to 14th)?
I would do labsdb1004 first, which is the slave for toolsdb, and then labsdb1005. I didn't want to pressure you because I knew you had other concerns. I would say Tuesday and Wednesday at 14:00 UTC. Can you announce it?
Someone announced 60 seconds of downtime, which I do not think is reasonable: fully rebooting a server and all its services takes around 3-5 minutes, and that assumes everything goes well, without counting the time for services to come down and back up, which is quite long because of the caches writing to disk.
Normally a 1-hour maintenance window is announced, with 5 minutes of service unavailability (we should always announce the worst-case scenario, not the best).
Announcements have been updated. Thanks for the note.
Shall we always announce a 1 hour maintenance window for DB maintenance?
It varies from maintenance to maintenance, depending on the work to be done. Some take more, some take less; the "normally" was meant as "normally we should announce a larger maintenance window than the one needed; for example, in this case 1h/5m would be more reasonable".
Gotcha. Next time, we should add these details to the task description and I'll pick them up from there when making the announcement. :) In this case, I think we're all set.
Mentioned in SAL (#wikimedia-operations) [2017-07-11T14:21:23Z] <jynus> rebooting labsdb1004 for kernel upgrade T168584
Status: labsdb1005 reboot is scheduled for July 12 at 1400 UTC.
We've decided to wait on the labsdb1001 and 1003 reboots for now, given that these boxes haven't been rebooted in over a year and are set up with RAID 0, so we have no cushion to handle a disk failure on reboot. The plan is to wait for labsdb1009-11 to be available as functional alternatives for users, and then, once the usage of 1001 and 1003 has dropped somewhat, schedule their reboots. @Marostegui is working on the new servers, and the hope is to be able to do the reboots in the next 4-6 weeks.
@Marostegui We talked about this today in our meeting, and think that since we don't have significant user traffic moved over from 1001/3 to the new WikiReplica servers yet, we should hold off on rebooting these servers for longer, given that Moritz mentioned during our last discussion that we can afford to wait and that the immediate attack vectors have already been plugged.
[14:50:28] <moritzm> so, these servers have the immediate attack vectors plugged (glibc ld.so, exim and sudo), the reboot is required to fix the underlying problem on the kernel level
[14:51:13] <madhuvishy> right, okay
[14:51:44] <moritzm> if there's a risk of some data loss (and raid0 for the data partition sounds a bit like it), we can also hold this back, but it would be good if we set up the new servers in a way that allows us to also reboot these servers in the future
[14:52:09] <chasemp> moritzm: sounds good, thank you
[14:52:17] <jynus> moritzm: the new ones
[14:52:21] <jynus> have that
[14:52:40] <jynus> we have haproxy in the middle, and it is extremely simple
[14:52:47] <moritzm> let's avoid trouble, then, and fix this by migrating to the new servers
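The haproxy layer mentioned in the log above can be as simple as a TCP passthrough with health checks, so a reboot of one backend fails over to the other. A minimal sketch follows; the section name, addresses, and timing values are illustrative, not the actual production config:

```
listen mysql
    bind *:3306
    mode tcp
    timeout connect 5s
    timeout client  8h
    timeout server  8h
    # The primary takes all traffic; the backup is used only
    # while the primary's health check is failing.
    server labsdb-primary 10.0.0.1:3306 check inter 3s
    server labsdb-backup  10.0.0.2:3306 check inter 3s backup
```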
I vote we close this as "resolved" with a note that 1001/3 have not been rebooted because of the fear of catastrophic hardware failure and their impending decomm.
> 1001/3 have not been rebooted because of the fear of catastrophic hardware failure and their impending decomm.
JFTR, since I didn't see it mentioned either here or in T142807: how impending is that decomm? Days/weeks/months?
No definite date has been set as we are working on T173511 as the precursor to moving over quarry (and probably PAWS). I think we are talking months as long as we can do it gracefully.
I do not think we should postpone the reboots too much, my proposal would be to:
0) document access to the new hosts (bare essentials)
That should be doable within a month, even if it doesn't mean we fully and definitively decommission labsdb1001/3.
@jcrespo's plan sounds like a good one. Working on the announcement of the new cluster was already on my todo list for today, so I'll add foreshadowing of the current cluster reboot. If we do lose a disk on 1001/3 to the power cycle, though, it will be hard to recover, so we should figure out T173511: Implement technical details and process for "datasets_p" on wikireplica hosts before we actually reboot either of them.
Reopening since we are scheduling the labsdb1001 and 1003 reboots over the next couple weeks.
Proposed timing for the 2 reboots:
labsdb1001: Monday Oct 30 2017, 14:30 UTC (16:30 Madrid, 10:30 EST, 07:30 PT)
labsdb1003: Tuesday Nov 07 2017, 14:30 UTC (16:30 Madrid, 10:30 EST, 07:30 PT)
I've verified with @Cmjohnson that he can be around at the DC during these days and times.
Thanks @Marostegui.
I've updated the lists, and our wiki here: https://wikitech.wikimedia.org/wiki/Wiki_Replica_c1_and_c3_shutdown
Per T173511#3713170, having a long term solution for curated datasets is no longer a blocker for the reboots.
Change 386660 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] labsdb: Switchover dns for labsdb1001 shards to labsdb1003
Started a planning doc for the reboots here - https://etherpad.wikimedia.org/p/labsdb-reboots
Mentioned in SAL (#wikimedia-operations) [2017-10-30T13:21:00Z] <marostegui> Set innodb_max_dirty_pages_pct = 10 on labsdb1001 so it powers off a bit faster - T168584
Change 386660 merged by Madhuvishy:
[operations/puppet@production] labsdb: Switchover dns for labsdb1001 shards to labsdb1003
Mentioned in SAL (#wikimedia-operations) [2017-10-30T14:30:10Z] <marostegui> Stop replication and MySQL on labsdb1001 - T168584
Mentioned in SAL (#wikimedia-operations) [2017-10-30T14:36:04Z] <marostegui> Reboot labsdb1001 - T168584
Mentioned in SAL (#wikimedia-operations) [2017-10-30T17:08:18Z] <madhuvishy> Revert dns switchover for c1 shards to c3 post labsdb1001 reboot T168584
The 1001 reboot is all done. Notes from my planning etherpad:
labsdb1001 (Planned for Oct 30 2017 14:30 UTC)
Please check: T179464
labsdb1001 has crashed and the storage looks totally broken; hard to say if it is because of the reboot, but I wouldn't be surprised if it is.
We should consider labsdb1001 broken for good and decommission it - we need to decide whether we want to continue with the plan and reboot labsdb1003. I wouldn't do it, to be honest.
Ack, let's avoid that. It's not unlikely that the same hw error might also strike 1003.
What's the realistic time frame for 1003 to be around until its use cases are replaced by the new-style servers?
We are aiming for 13th Dec to retire these two hosts: T142807 and https://wikitech.wikimedia.org/wiki/Wiki_Replica_c1_and_c3_shutdown
+2. @madhuvishy and I had already decided that rebooting labsdb1003 with only 6 weeks left before decomm was too risky following the failure of labsdb1001.
FYI @Cmjohnson: we are not doing the labsdb1003 reboot on Tuesday, Nov 7, due to T179464.