Agreed this should be 4xx rather than 5xx
Mon, May 22
acamar hit this again on Sunday, in spite of the (working) acpi_pad blacklist. A simple reboot seems to have cleared it. The next-best advice (based on that old Dell info) would be to blacklist mei. I've rmmod'd it on acamar for now to see if it causes additional issues before we try blacklisting it on all.
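For reference, a minimal sketch of what blacklisting it would look like, assuming we reuse the same modprobe blacklist mechanism as for acpi_pad (the installed path of puppet's blacklist-wmf.conf and the mei_me companion module are assumptions here):

```
# Sketch only: blacklist mei the same way as acpi_pad (paths assumed).
echo "blacklist mei"    >> /etc/modprobe.d/blacklist-wmf.conf
echo "blacklist mei_me" >> /etc/modprobe.d/blacklist-wmf.conf  # mei_me (if present) is the PCI driver that pulls mei in
update-initramfs -u   # so the blacklist also applies in early boot
rmmod mei_me mei      # mirrors what was done by hand on acamar
```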
Sat, May 20
@Jgreen - re: civicrm, it needs to emit the HSTS header on all HTTPS responses.
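Once that's in place, a quick sanity check from outside might look like this (hostname here is just illustrative):

```
# Sketch: verify the HSTS header is emitted on an HTTPS response.
curl -sI https://civicrm.wikimedia.org/ | grep -i '^strict-transport-security'
```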
Fri, May 19
Not resolved, as the purge graphs can attest!
Thu, May 18
So from the above, apache really has 3 different modes of operation:
The tentative and limited plan for now is to deploy 3x misc/infra hosts (meaning all the hosts other than lvs and cp) at each cache site and not use virtualization. We might revisit this at a later date. The basic layout looks like:
I'm probably backtracking into territory that was once known here, but after the long delay I felt I had to go back and re-validate what's going on with the ports and the thinking.
Wed, May 17
Answering for @ema I think this mostly came up as a consequence of trying to map out the data in T150256#3271004 using lldpcli to confirm port connections. That led to an in-depth conversation about how our racks and rows and switches and vlans are set up and how and where redundancy matters, which led to him looking at our caches and their port/rack/row mapping (also in racktables) and how they're not laid out very ideally in eqiad. Basically these are just questions spawned from exploration.
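For anyone following along, the per-host confirmation was just reading lldpd's view of the switch side, roughly:

```
# Run on the cache/LVS host itself; shows the neighbor switch and port
# for each NIC as advertised over LLDP.
lldpcli show neighbors
# Terser variant if you only want the switch name / port per interface:
lldpcli show neighbors summary
```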
I think that almost universally, HT is a win for the host as a whole. There are always more things going on than there are cpu cores. If nothing else, picture it in your head as "puppet agent and stats outputting stuff can run in that extra headroom without impacting the important stuff" or whatever.
Edited the top part, re-ran excluding virtuals.
So, a couple points:
On the reboot issue: I've tested cp4021 and the existing puppetization works fine on reboot (even given the other stuff below).
I wonder if Chrome (which is the dominant browser now, not MSIE as indicated in that nginx source comment) sends the close notify?
Also notable: lvs1009 and lvs1012 connections to row B (eth2) are using 1GbE ports rather than 10GbE?
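Worth double-checking what the NICs actually negotiated before assuming racktables is right; a rough check (host and interface names taken from the comment above):

```
# Sketch: confirm negotiated link speed on the row B interfaces.
for h in lvs1009 lvs1012; do
  echo "== $h =="
  ssh "$h" 'ethtool eth2 | grep -E "Speed|Port"'
done
```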
Also while I'm thinking about it - we should validate that the sysctl setting for fq as default qdisc "sticks" on reboot and isn't affected by some kind of ordering race...
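Something like the following after a test reboot should be enough to tell (assuming eth0 is the primary interface; multi-queue NICs will show an mq root with fq children rather than a bare fq):

```
# Post-reboot sanity check for fq-as-default-qdisc.
sysctl net.core.default_qdisc   # expect: net.core.default_qdisc = fq
tc qdisc show dev eth0          # expect fq (or mq with fq per-queue), not pfifo_fast
```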
There's not a lot of good data on how BBR behaves in datacenter-like networks (high bandwidth, low latency, low loss, etc). It's not really the use case it was designed for, and the reports from others have been mixed. It probably won't turn out completely awful or anything, but I don't know if it would actually fix the port saturation problem or not.
Tue, May 16
Re: ethernet port validation / config, the last table we had in the old ticket is here: T104458#1788478 . The idea was to try our best to ensure that a given vlan/row's LVS connections are FPC-redundant between the primaries and secondaries. e.g. if lvs1007 (primary for high-traffic1) connects to row C in asw-c-eqiad FPC 5, then lvs1010 (secondary for high-traffic1) needs to connect to row C / asw-c-eqiad in some FPC other than 5. Row D connections have probably changed entirely since that last table was made, and there were some pending moves/fixups listed there as well which may or may not have already happened.
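A rough way to spot-check a primary/secondary pair without walking racktables, assuming lldpd is running on the LVS hosts and the switch advertises its xe-FPC/PIC/port names over LLDP:

```
# Sketch: compare which switch + FPC each half of the pair lands on
# (example pair from above; adjust the grep to the actual field names).
for h in lvs1007 lvs1010; do
  echo "== $h =="
  ssh "$h" 'lldpcli show neighbors summary' | grep -E 'SysName|PortID|PortDescr'
done
```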
Mon, May 15
Tue, May 9
Are you sure you want the tiles public-cacheable as well? It takes load off of us, but it also puts the purging/invalidation of them on update out of our control (in the users' caches). We might want to sync up on how we want VCL to handle/mask the header as well (and all related things), maybe on IRC or Hangouts when we all get a chance.
You might want to look at the other side of the nginx proxy as well. Perhaps apache is terminating its connection to the local nginx with RST, and this causes nginx's proxy in turn to RST upstream to the actual client?
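If it helps, something like this on the host should show whether the RST originates on the apache<->nginx leg (assuming that proxying happens over loopback on port 80; adjust to the real interface/port):

```
# Watch for RSTs between nginx's proxy side and apache.
tcpdump -ni lo 'tcp port 80 and tcp[tcpflags] & tcp-rst != 0'
```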
APNIC has a good writeup here (first half is TCP history redux, second half goes into interesting details and new data on BBR): https://blog.apnic.net/2017/05/09/bbr-new-kid-tcp-block/
@elukey - I think the only real analytics fallout here is that the data that is currently feeding to you as webrequest_maps will become data that's mixed into the existing feed of webrequest_upload. They'll still be differentiated on the request hostname (upload.wikimedia.org vs maps.wikimedia.org). At the time of deploy for the final transition commit, the data would move over smoothly from one to the other over a period of several minutes. Is this something that requires some special accommodation on analytics end first? Timeline?
Mon, May 8
Updates from IRC-only work - a significant majority of our ICMP echo volume is coming from a large number of IPs owned by Google. The TODO here is to compile information on this that we can forward to their abuse@.
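A rough way to get those numbers for the report (interface and sample size are arbitrary; the awk field position assumes tcpdump's default one-line output):

```
# Sample inbound echo requests and rank source IPs.
tcpdump -nn -c 100000 -i eth0 'icmp[icmptype] = icmp-echo' 2>/dev/null \
  | awk '{print $3}' | sort | uniq -c | sort -rn | head -20
```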
We could do so as a goal at the end of the process, depending how we arrange things.
Fri, May 5
Hmmm another thing - when we first deployed this OCSP updating method, GlobalSign was giving us 8-hour OCSP validity windows. At present (just checked) we're getting 4-day validity from GlobalSign and 7-day validity from Digicert. Perhaps we should back off the OCSP timing from once an hour to once a day in light of this, and use cron_splay instead of fqdn_rand while we're at it?
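For reference, checking the window we're actually being handed is just a matter of asking the responder and looking at the nextUpdate field; certificate/issuer file names below are placeholders:

```
# Sketch: how long is the OCSP response we fetch valid for?
openssl ocsp -issuer issuer.pem -cert unified.pem \
    -url "$(openssl x509 -in unified.pem -noout -ocsp_uri)" \
    -noverify -no_nonce 2>/dev/null | grep -E 'This Update|Next Update'
```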
Also note from that lengthy post - if we were willing to test the scalability of iptables on cache hosts (which we've avoided for fear that it won't scale over cores like the rest of what we're doing) and it works out, there are iptables hacks where you temporarily block new SYNs or dataless ACKs around the quick reload, which might work with nginx's methods.
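Roughly the shape of that hack, completely untested on our side and assuming the reload is fast enough that clients only see a SYN-retransmit-length delay rather than a reset:

```
# Block new connections for the duration of the quick reload, then unblock.
iptables -I INPUT -p tcp --dport 443 --syn -j DROP
systemctl reload nginx    # or whatever triggers the listener takeover
sleep 1
iptables -D INPUT -p tcp --dport 443 --syn -j DROP
```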
How timely! The subject of how to do completely-seamless reloads (especially for TCP) is quite thorny. I've been pondering it and fighting with the issues for years on the UDP side for gdnsd too. And then just yesterday, an HAProxy blog post popped up that goes into the whole thing in great detail with their own struggles and final solution, which (along with lots of other interesting details) is that nothing will be perfect unless you do SCM_RIGHTS handoff over a unix socket. I think nginx isn't doing that; they're still doing SO_REUSEPORT-based takeover. https://www.haproxy.com/blog/truly-seamless-reloads-with-haproxy-no-more-hacks/
Thu, May 4
Yeah, @faidon has brought up a similar argument before on a slightly different level: that we shouldn't be using nginx-full on most hosts anyways, since we use virtually none of the plugin modules. Somewhere there's an intersection of these ideas that makes life easier.
Interesting data on the topic of BBR under datacenter conditions (low latency 100GbE), possibly supporting the idea that it's not awful to enable it everywhere: https://groups.google.com/forum/#!topic/bbr-dev/U4nlHzS-RFA
Between the cable being in the second port and T164444, we're blocked on getting a successful test install. @RobH is asking smart hands to swap the cable, and I'll proceed once one of those two issues is resolved.
ok I'm installing jessie onto cp4021 now (just to test configuration issues and patch up puppet for the real installs later!). Things I found while trying to boot:
Wed, May 3
Reformatting this a bit for comparison, and using the "new" binning (which splits 0-1K from 1K-16K):
I've re-run the binning analysis, with a few minor changes:
To re-iterate what @Joe is saying a little differently: the point of cross-dc active/active (which is a goal for all services) is to have the ability at any moment in time to handle all traffic in just one DC because we've suddenly lost or depooled the other. We're not sharding data cross-DC, or distributing maximum load capacity cross-DC.
@Gilles - FYI the kernel upgrades that were blocking this are done, and we're tentatively looking at turning on BBR on May 22, so that we have a week of post-switchback stats to compare when looking at NavTiming impact.
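For reference, the knobs involved are just these two sysctls (shown by hand here; the real change would go through puppet, and the fq default-qdisc part is the same setting discussed above):

```
# BBR (at least on these kernels) relies on fq for pacing.
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl net.ipv4.tcp_available_congestion_control   # sanity-check that bbr is listed
```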
Yeah, we should discuss our options a bit here re: minimizing ulsfo downtime; I think we have a few options for how we arrange this. There are some complicating factors with the misc and maps clusters: the new hardware config assumes we've done the software work to fold those into the primary clusters. The reason we didn't block on that is that the backup plan is to simply not have misc and maps endpoints in ulsfo until the software side of the work is done.
Thu, Apr 27
Some general updates on bin-sizing estimates: based on the graph data for available bytes in each bin and comparing how fast they initially reach zero after restarts on live servers, we can make a few tweaks. Bins 1 and 2 seem slightly oversized, while bins 0, 3, and 4 are slightly undersized. Proposed tweak to upload.pp based on the live data would be:
Wed, Apr 26
Apr 24 2017
hmm, no, it is the HTTPS check, not the IdleConnection one. I wonder why it's RST and not regular close?
I think this is "normal", it's from PyBal IdleConnection monitors (which just open a TCP conn and do no traffic, and eventually result in a RST).
Apr 21 2017
Apr 20 2017
Apr 19 2017
To do a soft-ish failover, on lvs2002 we can disable the puppet agent and stop pybal temporarily, wait a few minutes for traffic to settle over to lvs2005, and then re-seat or replace the optic on lvs2002 (and then restart pybal + re-enable puppet to bring lvs2002 back into service).
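In other words, roughly this on lvs2002 (the pybal unit name is assumed; the waiting in the middle is the important part):

```
puppet agent --disable "optic re-seat/replacement"
systemctl stop pybal
# ...wait a few minutes for traffic to settle onto lvs2005, then swap the optic...
systemctl start pybal
puppet agent --enable
```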
Apr 18 2017
We still have no real ETA on the IP addresses. We're attempting to acquire the address space from APNIC. They're (reasonably) requiring proof of our needs, which includes the physical address of the datacenter in Singapore (we're still evaluating multiple RFP responses), invoices for our equipment (which isn't ordered, for the same lack of a shipping address), and lists of our network peers in Singapore (which are, again, blocked on contracting with a datacenter vendor so that we know which peers are available and what physical building we're peering at).
For now we've solved the pragmatic issues in other ways: some general nginx/varnish tuning, kernel TCP params tuning, and using 8x TCP sockets in parallel for the local traffic. I don't see any point in pursuing varnish patches for unix domain sockets at this time, or in the foreseeable Varnish future here.
Going back over some of the unchecked boxes at the top:
@GWicke yeah we should.
Yeah, leave the traffic tag as we'll want to basically revert https://gerrit.wikimedia.org/r/#/c/348456/ once dbtree is ready for it.
I'll close it for now. If we see more strange issues with super-low cpu freqs we can always search these up to correlate I guess.
Apr 17 2017
Yeah the patch I deployed above should have fixed the issue in this ticket. Both of the suggested followups would be ideal, but probably aren't pressing at this time.
I've deployed the change above, which gets all of the basics on track for how we want to operate the real campaign. We'll obviously make appropriate minor wording changes once we have dates set and as percentages increase.
(also, generally speaking errors aren't cached, but in this case the error would be cached, because it's returned with a 200 status code...)
tendril.wikimedia.org is independent of varnish, only dbtree.wikimedia.org (that we're talking about here) goes through the standard varnish stuff (although arguably tendril should be moved there as well someday).
Apr 13 2017
This is the last time I'll respond to trolling on this ticket.
Apr 12 2017
modules/base/files/kernel/blacklist-wmf.conf is probably the place to try disabling this first, FWIW.
FWIW - I did the same depooling (for reinstalls) in codfw this afternoon, and there was no impact in that case. So this seems to also be eqiad-specific (but that could just be an effect of all the real user load being in eqiad - maybe the same problem, whatever it is, happens in codfw but it's a non-issue due to light load).