
Upgrade all HTTP frontends to Debian jessie
Closed, Resolved · Public


Our current HTTP frontend fleet (frontend & backend Varnishes for routing & caching, nginx for SSL termination) runs on Ubuntu 12.04 (precise).

Motivated mainly by HTTPS improvements (newer libssl and nginx) and the IPsec rollout, we need to start moving our fleet to a newer platform. Debian jessie is a good candidate, as it is the next Wikimedia OS. This work is expected to happen by the end of FY Q3 2015.

For this we'll need to:

  • Prepare infrastructure for jessie boxes (already done as part of a separate goal)
  • Upgrade one canary box to jessie (cp1008)
  • Rebuild custom-made packages for jessie, notably Varnish 3 (jessie ships with Varnish 4) and varnishkafka; port upstart service files to systemd units
  • Reinstall one server of each cache role (text, mobile, upload, bits) in production for live testing
  • Make sure that everything works, the new kernel in particular
  • Reinstall all servers across all datacenters
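As an illustration of the upstart-to-systemd porting step above, a ported unit might look roughly like this. This is a hypothetical sketch, not the actual production unit: the unit name, paths, and options are assumptions for illustration only.

```ini
# /etc/systemd/system/varnishkafka.service -- hypothetical sketch of a
# ported unit; the real production unit may differ in paths and options.
[Unit]
Description=varnishkafka log forwarder
After=network.target varnish.service

[Service]
# upstart's "exec ..." stanza becomes ExecStart=
ExecStart=/usr/bin/varnishkafka -S /etc/varnishkafka/varnishkafka.conf
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

The main translation work is mechanical (exec → ExecStart, respawn → Restart), but dependency ordering (`After=`) has to be re-derived per service.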

Relatedly, we'll also need jessie images available in Labs, so that we can perform tests and so that Beta can keep up with production; T75592 (jessie images in Labs) is therefore a blocker.

Event Timeline

faidon assigned this task to BBlack.
faidon raised the priority of this task from to High.
faidon updated the task description.
faidon added projects: ops-core, HTTPS-by-default.
faidon added subscribers: faidon, mark.
yuvipanda set Security to None.

We have now reinstalled one server of each type in eqiad onto the new jessie stack (text -> cp1065, upload -> cp1064, bits -> cp1070, mobile -> cp1060), as well as amssq42 as a text server in esams, for live testing. The only known major issue outstanding from that testing is a VM/kernel bug/tuning issue that affects the upload varnish backends due to their network/disk I/O patterns; I'll open a separate ticket to track that. Once it's resolved, we'll be ready to reinstall all the existing servers with the new jessie-based software stack.
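For context on what a VM/kernel tuning issue of this kind typically involves (the actual bug here is tracked in its own ticket, and the values below are purely illustrative, not the fix applied): heavy streaming writes from a large cache backend can let dirty page cache accumulate until writeback causes multi-second stalls, and a common mitigation is to lower the writeback thresholds:

```ini
# /etc/sysctl.d/varnish-writeback.conf -- illustrative example only,
# not the tuning from this ticket: cap the dirty page cache so bulk
# cache writes are flushed earlier instead of stalling in large bursts.
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
```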

Test installs are now successful and all known issues (e.g. systemd transitions for various daemons, varnishkafka issues) are resolved for all cache types. The next steps are to document the reinstall process on wikitech so others can help, set up a plan for reasonable parallelism, and begin reinstalling all of the clusters. This should be done by roughly the end of next week, although the upload clusters may require more time, as their reinstalls have to be spaced out to avoid performance degradation.
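The per-host workflow described above can be sketched roughly as follows. The depool/reinstall/repool helpers here are placeholders for the real mechanism (depooling happens via config changes such as the cache.pp edits in the Gerrit changes below), so this illustrates only the ordering, not the actual tooling; host names are taken from this ticket.

```shell
#!/bin/sh
# Sketch of the per-host reinstall cycle. These functions are
# placeholders for the real steps; they only echo what they represent.
set -e

depool()    { echo "depool $1"; }     # take the host out of rotation first
reinstall() { echo "reinstall $1"; }  # reimage the host with jessie
repool()    { echo "repool $1"; }     # return it to service afterwards

for host in cp1056 cp1057; do
    depool "$host"
    reinstall "$host"
    repool "$host"
done
```

Parallelism then amounts to how many hosts per cluster may sit in the depooled state at once without degrading hit rates.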

About 15% of the cache endpoints are now converted and reinstalled, and most of the corner-case oddities of the reinstall process are known. Mass reinstalls begin next week; we expect almost everything, with the exception of a few esams upload machines, to be complete by the end of the week. esams upload is a special case that requires longer intervals between hosts, to refill disks over the network.

Change 195573 had a related patch set uploaded (by Dzahn):
depool cp1056 for reinstall

Change 195573 merged by Dzahn:
depool cp1056 for reinstall

Change 195632 had a related patch set uploaded (by Dzahn):
depool cp1057 for reinstall

Change 195632 merged by Dzahn:
depool cp1057 for reinstall

Status update: ~53% of the cluster is now converted to the new config. I think we'll make the end of the week at this point, or at least come very close. Thanks @Dzahn for helping with some of these reinstalls :)


Done. I stopped adding the bug number to commit messages because it would be too spammy.

text-eqiad are 100% complete

At this point, we're 100% converted globally for the text, mobile, and bits clusters. There are still a few hosts left in the eqiad and esams upload clusters, but they should be done by sometime tomorrow.

BBlack closed this task as Resolved. Edited Mar 13 2015, 7:24 PM

100% complete now for all live, pooled, public cache endpoints for text, mobile, bits, upload, and misc-web.

parsoidcache has 1 of 2 hosts converted. The current plan is to put the other off by a week or so, to avoid losing too much parsoidcache data at once (need to coordinate with @GWicke further to confirm). Either way this is only incidentally related to this ticket, as parsoidcache isn't a cpuload/clientperf blocker for HTTPS-by-default.

There are still three hosts (cp1047, cp3011, cp4009) currently down with hardware issues; they will get installed with the new jessie config once those issues are resolved (each has a separate hardware ticket noted in the cache.pp depool comments).

@BBlack, delaying the second Parsoid cache by a week should be fine. If things go well, we should have VE use RESTBase instead for all Wikipedias by the end of next week (possibly even before Wednesday), at which point the issue becomes moot. See T89066: Parsoid performance: Use RESTBase from the MediaWiki Virtual Rest Service on group1/group2 wikis.