
Reboot caches for kernel 3.19.6 globally
Closed, ResolvedPublic

Description

As we're running trunk kernel packages, kernel updates cause issues for the ipsec deployment until the hosts reboot (the new modules can't be loaded against the running kernel). The reboot process generally takes a while, as we can't take down much of an entire site+cluster at any given time. We probably also want a moratorium on any ongoing reboot process during Apr 28-30-ish for GTT's link outages, which affect cache PoP redundancy/capacity.

See also: T96146

Event Timeline

BBlack raised the priority of this task from to High.
BBlack updated the task description. (Show Details)
BBlack added projects: acl*sre-team, Traffic.
BBlack subscribed.

Is there a list/etherpad/pastebin or something tracking which hosts are done and which still have to be done? Is this a thing where anyone can just do a few on the side? Are there other depool steps needed before typing reboot?

I'd like to get the varnish 3.0.7 package built and ready as well, basically to coalesce the service restart with the reboot (see blocking task). Also, not all of the caches have the new kernel installed yet (I need to pace out some apt-get upgrades there in general).

All of that aside, as a general rule: currently the upload caches should get a depool in puppet to avoid repetitive 5xx spikes if we're doing a lot of them. The others (text, bits, mobile, parsoid, misc) can, I believe, be rebooted without a depool and not cause a serious 5xx issue (though it would be good to re-check on the first few), so long as we don't reboot a bunch from the same cluster and site at the same time.
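
As a rough illustration of that depool step, here is a hypothetical sketch. The host name, the host-list file, and the exact puppet change that performs the depool are all assumptions; the real depool is whatever change removes the host from the varnish backend/director lists.

# Hypothetical depool-before-reboot flow for an upload cache.
# Assumes the depool itself is a puppet change that has already been merged;
# what remains is applying it everywhere before the reboot, then reverting
# and repooling afterwards.

HOST=cp1063.eqiad.wmnet              # example host, chosen arbitrarily

# Apply the (already-merged) depool change on the other upload caches so they
# stop routing backend traffic to $HOST. 'upload-caches' is an assumed file
# listing those hosts.
for peer in $(grep -v "^$HOST$" upload-caches); do
    ssh root@$peer 'puppet agent -t' </dev/null
done

# With the host depooled everywhere, the reboot should not produce a 5xx spike.
ssh root@$HOST reboot

# After the host is back up: revert the depool change and repeat the puppet
# runs to repool it.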

BBlack lowered the priority of this task from High to Low.Apr 23 2015, 2:32 PM
BBlack raised the priority of this task from Low to High.
BBlack moved this task from Backlog to Blocked on Internal on the Traffic board.
BBlack set Security to None.

All* the caches are now caught up on basic "apt-get upgrade" updates through today (including the real jessie release), and have kernel 3.19.3 installed as well. I've rebooted cp3030 (text esams, newest hardware) to give 3.19.3 some bake time with live traffic. I'm on the fence as to whether we should go ahead and jump to kernel 4.0 before these reboots.

This needs some real input on whether 3.19 is likely to move up the Debian kernel stability ladder from experimental to unstable and onwards any sooner than 4.0 (in which case let's stick with what we've already tested extensively and get off "experimental" sooner), or whether 4.0 is effectively replacing it and 3.19 is going nowhere (in which case let's make the jump).

* - well, a few in ulsfo are still running their upgrades as I write this, due to minor ulsfo transport issues dramatically slowing down the package fetches, but they'll complete shortly...
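
For reference, a quick way to check what a given cache is actually running versus what it has installed (cp3030 is used here only as an example host):

# Running kernel:
ssh root@cp3030.esams.wmnet uname -r

# Installed kernel image packages:
ssh root@cp3030.esams.wmnet "dpkg -l 'linux-image-*' | grep '^ii'"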

Had a chat with @MoritzMuehlenhoff about the kernel issues. He's convinced me that we should stick with the 3.19 series for the foreseeable future (with an eye towards eventually adopting the upstream 3.19-ckt series), and that for the moment we should at least build a non-trunk packaging of 3.19.3. I'm holding the reboots until that package is deployed to the caches.

BBlack lowered the priority of this task from High to Medium.May 6 2015, 4:14 PM
BBlack moved this task from Upcoming to Traffic team actively servicing on the Traffic board.
BBlack renamed this task from Reboot caches for kernel 3.19.3 globally to Reboot caches for kernel 3.19.6 globally.May 6 2015, 6:49 PM

We're now basically ready to make progress on this; it just needs some coordination on the reboots. The necessary command to upgrade the kernel prior to reboot is: apt-get -y install linux-meta
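
For example, the package could be pushed out ahead of the reboots with a simple loop like this (a sketch; 'cache-hosts' is an assumed file listing the target hostnames, and the pacing is arbitrary):

# Install the new kernel package on each cache ahead of its reboot.
for h in $(cat cache-hosts); do
    echo "=== $h ==="
    ssh root@$h 'apt-get -y install linux-meta' </dev/null
    sleep 30    # arbitrary pacing, just to avoid hammering the apt mirrors
done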

^ Unblocked from the varnish update, since that's going to take longer to investigate. The kernel has also already been upgraded via apt everywhere; just the reboots themselves remain now.

Various things have been blocking me from getting around to these reboots lately. At this point, all of ulsfo is on the new kernel, as well as cp3030 (in esams) and cp1008 (non-prod test in eqiad), and nothing looks wrong. Will try to reboot most of the rest this week.

For future reference, this is what I'm doing now for the non-upload caches:

$ for h in `cat rebooters`; do hs=${h%.*.wmnet}; echo ======================; echo === $hs @ $(date) ===; ssh root@neon.wikimedia.org "/usr/local/bin/icinga-downtime -h $hs -d 900 -r 'cache kernel reboots - automated - bblack'"; echo = Issuing reboot; ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=10 -o ServerAliveInterval=1 -o ServerAliveCountMax=3 root@$h reboot; echo === DONE, SLEEPING ==; sleep 900; done
======================
=== cp3003 @ Tue May 19 17:07:30 UTC 2015 ===
[1432055256] EXTERNAL COMMAND: SCHEDULE_HOST_DOWNTIME;cp3003;1432055252;1432056152;1;0;900;marvin-bot;cache kernel reboots - automated - bblack
[1432055256] HOST DOWNTIME ALERT: cp3003;STARTED; Host has entered a period of scheduled downtime
Killed by signal 1.
= Issuing reboot
Warning: Permanently added 'cp3003.esams.wmnet' (RSA) to the list of known hosts.
Timeout, server cp3003.esams.wmnet not responding.
Killed by signal 1.
=== DONE, SLEEPING ==
======================
=== cp1058 @ Tue May 19 17:23:07 UTC 2015 ===
[1432056191] EXTERNAL COMMAND: SCHEDULE_HOST_DOWNTIME;cp1058;1432056189;1432057089;1;0;900;marvin-bot;cache kernel reboots - automated - bblack
[1432056191] HOST DOWNTIME ALERT: cp1058;STARTED; Host has entered a period of scheduled downtime
Killed by signal 1.
= Issuing reboot
Warning: Permanently added 'cp1058.eqiad.wmnet' (RSA) to the list of known hosts.
Timeout, server cp1058.eqiad.wmnet not responding.
Killed by signal 1.
=== DONE, SLEEPING ==

The rebooters file is a list of the applicable machines (not ulsfo, not upload), which I shuffled manually so that not too many hosts from the same site and/or cluster appear close together in the list; we're basically rebooting serially with 15 minutes between reboots. ulsfo was done earlier, and upload tends to need depools to avoid 5xx errors because it has more/longer transfers in flight.
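
For readability, here is the same loop written out long-form with comments (functionally equivalent to the one-liner above):

#!/bin/bash
# Serial reboot loop for the non-upload caches.
for h in $(cat rebooters); do
    hs=${h%.*.wmnet}                  # strip the .<site>.wmnet suffix for icinga
    echo ======================
    echo "=== $hs @ $(date) ==="

    # Schedule 15 minutes of icinga downtime so the reboot doesn't alert.
    ssh root@neon.wikimedia.org \
        "/usr/local/bin/icinga-downtime -h $hs -d 900 -r 'cache kernel reboots - automated - bblack'"

    echo "= Issuing reboot"
    # Fire the reboot; the aggressive ssh options keep the loop from hanging
    # while the host goes down.
    ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
        -o BatchMode=yes -o ConnectTimeout=10 \
        -o ServerAliveInterval=1 -o ServerAliveCountMax=3 \
        root@$h reboot

    echo "=== DONE, SLEEPING =="
    sleep 900                         # 15 minutes between reboots
done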

This is now complete for most caches. The remaining ones left to go are the upload caches at esams and eqiad.

I did some testing to re-confirm upload cache behavior on reboots (while getting cp1048-cp1050 rebooted in the process):

A reboot without any depooling, or with the host depooled only in pybal, results in a significant but very brief 503 spike in gdash.
Depooling only in puppet prior to the reboot (and getting that puppet change deployed to all relevant caches) eliminates the spike.

So basically, we can skip the pybal part - pybal is fast enough that it's not a major issue. However, the lack of retry5xx/retry503 behavior in our upload config and/or the long in-flight transfers make depooling the varnish<->varnish part in puppet matter. The spike isn't large or sustained enough to matter in isolation, but in cases like this, where we need to hit all upload caches globally, it's not a great idea to trigger so many such spikes in rapid sequence over a period of hours.

So, yeah, we should push depools through puppet for the remaining 25x upload hosts.
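
Incidentally, one way to watch for the 503 spike directly on a frontend during these tests, rather than waiting on gdash, might be something like the following (a sketch; it assumes the frontend varnish instance is named "frontend" and that varnishncsa's default NCSA-style log format is in use, where the response status is the ninth field):

# Stream frontend requests and show only those answered with a 503.
varnishncsa -n frontend | awk '$9 == "503"'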

Did some further testing and investigation on the above. I'm starting to think this isn't inherent to a difference in upload's traffic characteristics, and is instead all about the configuration differences between it and the other clusters wrt director-level retries and/or VCL-level retry503/retry5xx... I think I may dig into this a bit more and see if it's something we can easily/safely correct before doing more reboots here...

Fixed up director-level retries in T99839, tested another upload cache reboot without depool, still same spike behavior. So that wasn't it...

https://gerrit.wikimedia.org/r/#/c/212788/ added retry503 behavior (a single request restart on a 503 result) in the upload-frontend case only. This killed the spikes on upload cache reboots without depool. There are now 20 such hosts left to reboot (some were already rebooted while testing related things), and I'm scripting those to finish up today on 30-minute intervals (so, the next 10 hours or so).

BBlack claimed this task.
BBlack moved this task from Traffic team actively servicing to Done on the Traffic board.