Mon, Apr 24
hmm, no, it is the HTTPS check, not the IdleConnection one. I wonder why it's an RST and not a regular close?
I think this is "normal": it's from the PyBal IdleConnection monitors (which just open a TCP conn, do no traffic, and eventually result in an RST).
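For reference, a minimal way to reproduce an RST-style close like that (toy code, not PyBal's actual monitor; the host/port are placeholders): setting SO_LINGER to a zero timeout makes close() abort the connection with an RST instead of the normal FIN handshake.

```python
# Toy reproduction of "open a TCP conn, send nothing, abort" -- not
# PyBal's real code. SO_LINGER with a zero timeout makes close() emit an
# RST rather than a graceful FIN. Host/port are placeholders.
import socket
import struct

s = socket.create_connection(("192.0.2.1", 443), timeout=5)  # placeholder VIP
s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
s.close()  # connection torn down with RST, no payload ever sent
```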
Wed, Apr 19
To do a soft-ish failover, on lvs2002 we can disable the puppet agent and stop pybal temporarily, wait a few minutes for traffic to settle over to lvs2005, and then re-seat or replace the optic on lvs2002 (and then restart pybal + re-enable puppet to bring lvs2002 back into service).
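Roughly the sequence I have in mind, as a sketch (the exact command spellings and the settle time are from memory, not verified against lvs2002):

```python
# Rough sketch of the drain/repool sequence; command names and the settle
# time are assumptions, not verified against these hosts.
import subprocess
import time

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["puppet", "agent", "--disable", "optic swap on lvs2002"])
run(["service", "pybal", "stop"])
time.sleep(5 * 60)  # let BGP withdraw and traffic settle onto lvs2005
input("Reseat/replace the optic on lvs2002, then press Enter to repool... ")
run(["service", "pybal", "start"])
run(["puppet", "agent", "--enable"])
```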
Tue, Apr 18
We still have no real ETA on the IP addresses. We're attempting to acquire the address space from APNIC, and they're (reasonably) requiring proof of our needs: the physical address of the datacenter in Singapore (we're still evaluating multiple RFP responses), invoices for our equipment (which isn't ordered, for the same lack of a shipping address), and lists of our network peers in Singapore (which is, again, blocked on contracting with a datacenter vendor, so that we know which peers are available and which physical building we're peering in).
For now we've solved the pragmatic issues in other ways: some general nginx/varnish tuning, kernel TCP param tuning, and using 8x TCP sockets in parallel for the local traffic. I don't see any point in pursuing varnish patches for unix domain sockets at this time, or for the foreseeable future of our Varnish use here.
Going back over some of the unchecked boxes at the top:
@GWicke yeah we should.
Yeah, leave the traffic tag, as we'll want to basically revert https://gerrit.wikimedia.org/r/#/c/348456/ once dbtree is ready for it.
I'll close it for now. If we see more strange issues with super-low cpu freqs, we can always search these tickets up to correlate, I guess.
Mon, Apr 17
Yeah the patch I deployed above should have fixed the issue in this ticket. Both of the suggested followups would be ideal, but probably aren't pressing at this time.
I've deployed the change above, which gets all of the basics on track for how we want to operate the real campaign. We'll obviously make appropriate minor wording changes once we have dates set and as percentages increase.
(also, generally speaking errors aren't cached, but in this case the error would be cached, because it's returned with a 200 status code...)
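To illustrate that parenthetical (toy code only, not varnish's actual decision logic): a cache layer that keys cacheability off the status code will happily store an error body that arrives with a 200.

```python
# Toy illustration, not varnish's real logic: an "error" served with
# status 200 looks perfectly cacheable to the cache layer.
cache = {}

def fetch(url, backend):
    if url in cache:
        return cache[url]
    status, body = backend(url)
    if status == 200:              # cacheable-by-default path
        cache[url] = (status, body)
    return status, body

def broken_backend(url):
    return 200, "database error"   # error text, but a 200 status code

print(fetch("/dbtree", broken_backend))  # error response is now cached
print(fetch("/dbtree", broken_backend))  # served from cache thereafter
```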
tendril.wikimedia.org is independent of varnish, only dbtree.wikimedia.org (that we're talking about here) goes through the standard varnish stuff (although arguably tendril should be moved there as well someday).
Thu, Apr 13
This is the last time I'll respond to trolling on this ticket.
Wed, Apr 12
modules/base/files/kernel/blacklist-wmf.conf is probably the place to try disabling this first, FWIW.
FWIW - I did the same depooling (for reinstalls) in codfw this afternoon, and there was no impact in that case. So this seems to be eqiad-specific as well (though that could just be an effect of all the real user load being in eqiad - maybe the same problem, whatever it is, happens in codfw but is a non-issue due to light load).
Mon, Apr 10
@Papaul Everything looks good with lvs2002 (checked icinga, interfaces on correct vlans, etc).
Switching this to me
The merge above should fix this, at least for this case and any others on our cache terminators of similar magnitude (but not those with >8K of response headers). The example URL from the description WFM now.
Sat, Apr 8
Update: others noticed the serial number didn't change, so the new part is not yet installed, and we're not sure whether the old part recovered spontaneously or due to some local action (e.g. being reseated during inspection).
The last round of bans mentioned above is complete now. If all of our theories and workarounds are completely valid (and there aren't other bugs or behaviors in play), this issue should be resolved now with no remaining examples (or new ones being created).
As best I can tell from looking at a longer section of the cr2-esams logs, it really does look like esams remote hands already swapped in the replacement part and things came up normally (with a brief 503 spike). The part had already been on-site for a day or so according to UPS tracking. The logs are currently spamming some errors about misconfigured BGP peers, but that may well be "normal"; a large number of peers are established and working fine.
The continued reports above were expected, as detailed when the Varnish-level workaround was applied above in T162035#3159658. I've done another of the periodic bans this morning. After giving some time for that impact to settle, I'll start later today on executing and monitoring the more-complete ban on "all objects without X-Original-Content-Type". After that ban, we should be able to get the workaround to take complete effect with one last ban on CT ~ text/html across the fleet. Next week we'll sort out the plans for solving the underlying issue with Swift so that we can eventually revert the Varnish-level hacks and restore our storage keep-time, etc.
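For the record, roughly the shape of that ban, as a sketch (the exact expression and how it gets fanned out across the fleet are assumptions here, not the literal command run). In ban expressions, a header that's absent never matches `~ .`, so `!~ .` selects objects stored without the header:

```python
# Sketch only: the ban expression and fleet fan-out are assumptions, not
# the literal command run. An absent header never matches "~ .", so
# "!~ ." targets objects stored without X-Original-Content-Type.
import subprocess

subprocess.run(
    ["varnishadm", "ban", "obj.http.X-Original-Content-Type", "!~", "."],
    check=True,
)
```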
Fri, Apr 7
This is going to block deploying the edns-client-subnet-enabled recdns packages (they require jessie), which is important for the DC switching stuff. Perhaps we can squeeze this in next week?
Mon, Apr 3
Added to other email aliases in private repo as well: dns-admin, peering, ripe-updates
Depending on the context, I've been flipping between talking about just 3DES and both of the non-FS ciphers, sorry. In current weekly stats, 3DES is around 0.125% and AES128-SHA is around 0.225%, for a total of ~0.35% non-FS ( https://grafana.wikimedia.org/dashboard/db/tls-ciphers ). Probably the bolded redirect notice with the 0.2% number should be removed from the wikitech page, and the synthetic varnish error shown here should use "less than 0.2%" during the 3DES campaign. Once 3DES is disabled we can re-assess how we approach AES128-SHA and fix up various things appropriately.
Fri, Mar 31
We've been stalling on this a bit too long now. I'd like to start kicking off this process and getting in touch with Community as well. I've kinda backtracked on the idea of a redirect to meta-wiki for the initial notice to the user: I think for the initial page replacement, we should stick with synthetic output directly from Varnish itself. That output can in turn contain a link to our existing wikitech page, or a similar but more-detailed page on meta-wiki. I've stolen from our existing Varnish errorpage.html and proposed some HTML for this here: P5175 (I wish pastes could be viewed raw with content-type!). Obviously wording and layout can be worked on a bit (and actual dates inserted for the real thing), but I like that it has our standardized error theme and logo.
Thu, Mar 30
Ok, I was wrong in my initial thinking. Even though we configure proxy_buffering off;, proxy_buffer_size is still a factor. According to the docs it technically only defines a chunk-size for reading the response, but I'm guessing that if nginx can't read all of the headers in that first chunk, it fails. Manual experiments made the logstash URL work with proxy_buffer_size 8k;. This might bloat nginx memory usage if applied in the general case, but I don't think it's enough to really hurt anything.
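A quick way to sanity-check that theory, as a sketch (the host/path are placeholders): measure how many bytes of response headers a given URL produces and compare against the 4k default.

```python
# Rough header-size check (host/path are placeholders): if the status line
# plus headers exceed nginx's default proxy_buffer_size (4k), the proxied
# request fails; bumping the buffer to 8k gives it headroom.
import http.client

conn = http.client.HTTPSConnection("logstash.wikimedia.org")  # placeholder
conn.request("GET", "/")
resp = conn.getresponse()
hdr_bytes = sum(len(k) + len(v) + 4 for k, v in resp.getheaders())  # ": " + CRLF
print(f"~{hdr_bytes} bytes of response headers (4096 is the default buffer)")
```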
The content of the Location header is 3960 bytes.
Wed, Mar 29
Adding Traffic and myself and @ema to this. I don't think we'd been aware of the uselang hack or its mechanics before (why did ?uselang=foo trigger uncacheability in the first place? query params vary the cache by default, so it would've been fine and preferable to leave it cacheable...).
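To spell out that parenthetical (toy code, not varnish's actual vcl_hash): the default cache key already incorporates the query string, so each uselang value would have gotten its own cache object anyway.

```python
# Toy illustration, not varnish's actual hashing: the default cache key
# includes the query string, so ?uselang=fr and ?uselang=de would already
# be distinct cache objects without any do-not-cache hack.
from urllib.parse import urlsplit

def cache_key(url):
    parts = urlsplit(url)
    return (parts.hostname, parts.path, parts.query)

print(cache_key("https://en.wikipedia.org/w/index.php?title=X&uselang=fr"))
print(cache_key("https://en.wikipedia.org/w/index.php?title=X&uselang=de"))
```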
Tue, Mar 28
I think wiping the whole table, even at startup, is probably not ideal (but certainly better than wiping it on shutdown!). What we should really be aiming for is just better state-sync. Pybal should delete unconfigured services on startup, but it shouldn't delete and then recreate ones that remained stable. So basically it needs to read the current state and work out from that the minimal set of actions to bring it into alignment with the configuration. What it does now is more blind/idempotent than that, and it leads to these kinds of issues.
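Something along these lines, purely as a sketch (names are illustrative, not PyBal's real internals):

```python
# Sketch of diff-based state sync; function and variable names are
# illustrative, not PyBal's real internals.
def ipvs_add(svc):
    print("ipvsadm -A", svc)      # stand-in for the real IPVS call

def ipvs_remove(svc):
    print("ipvsadm -D", svc)

def sync(configured: set, current: set):
    for svc in current - configured:
        ipvs_remove(svc)          # drop only services no longer configured
    for svc in configured - current:
        ipvs_add(svc)             # create only what's actually missing
    # services present in both sets are left alone, so stable ones never flap

sync(configured={"10.2.1.1:443", "10.2.1.2:80"},
     current={"10.2.1.2:80", "10.2.1.9:53"})
```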
Thu, Mar 23 2017
I was trying to think of a way to do this that isn't quite as stateful as the current cron_splay, but I haven't thought of a good one yet. If we assume we're trying to just extend the cron_splay mechanism to cover a more general case like this, it would need the entire nodelist the global cron is applied to, some way to notice which nodes are part of a shared cluster (e.g. the name of the applied role class?), and some way to identify the datacenter/site (the current cron_splay uses NNNN from the hostname, which doesn't apply to all clusters).
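Roughly the stateless shape I'm imagining, as a sketch (the function name and inputs are illustrative, not an actual cron_splay interface):

```python
# Stateless splay sketch: derive a stable minute-offset for each host from
# the full nodelist of its (role, site) group. Names and inputs are
# illustrative, not the real cron_splay interface.
def splay_minute(host, nodelist, period_minutes=60):
    peers = sorted(nodelist)            # deterministic order, no stored state
    slot = peers.index(host)
    return (slot * period_minutes) // len(peers)

hosts = ["cp1052.eqiad.wmnet", "cp1053.eqiad.wmnet", "cp1054.eqiad.wmnet"]
for h in hosts:
    print(h, "->", splay_minute(h, hosts))
```

The obvious wart is that adding or removing a node reshuffles everyone's slot, which is exactly the kind of churn the stateful approach avoids, so this is probably only half an answer.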