Mon, Dec 11
It's a pain whichever way we slice this, and I'm not fond of adding new canonical domains outside the known set for individual low-traffic projects. We didn't add new domains for a variety of other public-facing efforts (e.g. wdqs, ORES, maps, etc.).
Mon, Nov 27
Yeah I still use oxygen pretty routinely.
Wed, Nov 22
awesome, thanks :)
Recapping where we're at on all things here, because even I get lost sometimes:
@Cmjohnson - did this ship out?
No, we never made an incident report on this one, and I don't think it would be fair at this time to implicate ORES as a cause. We can't really say that ORES was directly involved at all (or any of the other services investigated here). Because the cause was so unknown at the time, we stared at lots of recently-deployed things, and probably uncovered hints of minor issues in various services incidentally, but none of them were necessarily causative.
These are now non-primary, but still active as backups for now. Will switch to spare role and remove from router configs post-Thanksgiving and then real decom can start.
These are fully in-service now
Sat, Nov 18
Nevermind, I took a quick look at some of the PURGE logs and found some. The tildes are decoded canonically (as we should expect from rawurlencode() and the RFC), so I'll update the canonicalization to decode tildes as well; that will avoid some pointless URL rewrites in the canonical cases (which should be the most common).
Yeah, you're right, I see that now. So it's all using rawurlencode() in practice, which is better! Since literal spaces aren't allowed in Title or File URLs (they become underscores), I think the only real impact here is how canonicalization handles the tilde. Do we have examples of File: URLs -> upload that have tildes, to verify on?
Fri, Nov 17
The above is in effect on all upload caches since about 17:23 UTC (just before this post), and doesn't seem to be causing any adverse effects.
Done here I think as well, thanks everyone :)
No 3DES connections or saved sessions left on the public cache endpoints \o/
Nov 17 2017
On the MW side of (2) above, it appears the SwiftFileBackend code in MW uses PHP's urlencode to transform the filenames into upload URL paths. The urlencode documentation claims that it percent-encodes everything but alphanumerics and -_. (so the set it does not encode is almost the official Unreserved Set, but it's missing the tilde). It also encodes spaces as + rather than %20, because it's meant for query strings rather than paths. PHP's rawurlencode would probably have been more appropriate here: it conforms to the RFC, excludes from encoding exactly the Unreserved Set, and doesn't do the +-for-spaces thing. However, in practice we can deal with the ~ issue, and spaces have already been made into underscores, so the plusses shouldn't ever actually appear.
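To illustrate the difference, here's a minimal reference sketch using just the two stock PHP functions on a made-up filename (nothing MediaWiki-specific, purely illustrative):

<?php
// Illustrative only: a made-up filename with a tilde, spaces, and parens.
$name = 'Foo~bar (baz) 1.jpg';

echo urlencode($name), "\n";
// Foo%7Ebar+%28baz%29+1.jpg  -- tilde escaped, spaces become '+'

echo rawurlencode($name), "\n";
// Foo~bar%20%28baz%29%201.jpg -- RFC 3986 style: tilde kept literal, spaces as %20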
Nov 16 2017
Do we have answers yet about what the right behaviors are, i.e. the questions asked above back in July?
So, my top questions at this point on all things related are:
On the cache_text side for the actual wikis, Varnish does do some normalization, but not complete normalization. Varnish basically just decodes a special handful of %-escapes based on what wfUrlencode does; that code is here: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/varnish/templates/normalize_path.inc.vcl.erb;c817459c34aa7ab815da266496864125b470b04a$40 . It's been a known issue for quite a long while that we could/should be doing better on that normalization, but it hasn't been a priority because there's not much pragmatic fallout other than a slight impact on cache hitrates.
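Roughly what that normalization amounts to, sketched in PHP rather than the actual generated VCL (the decode list below is illustrative only; the real one is derived from wfUrlencode):

<?php
// Whitelist-style decoding: only %-escapes of characters that the canonical
// (wfUrlencode-style) form leaves literal get decoded back to literals.
function normalize_path(string $path): string {
    $escapes  = ['%28', '%29', '%21', '%2A', '%7E'];  // illustrative list, not the real one
    $literals = ['(',   ')',   '!',   '*',   '~'];
    return str_ireplace($escapes, $literals, $path);
}

echo normalize_path('/wiki/Foo_%28bar%29'), "\n"; // /wiki/Foo_(bar)

Anything not in that list is left alone, which is why two request URLs differing only in other escapes still land on separate cache objects.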
Nov 15 2017
Closing for now, assuming no new problems surface. Thanks @RobH :)
Does RL make use of the CP cookie information to use different module-loading strategies for H/1 vs H/2? I remember that being the intent in creating it, but I'm not sure what the status is today.
Nov 14 2017
For now I'm puppetizing it back into the cluster (and ipsec lists), but not repooling yet...
Would it be fair to assume that the URL-encoding normalization rules for the upload.wikimedia.org URLs should be the same as the ones we use for MediaWiki? Anyone know of any reason for them to differ?
Nov 13 2017
I think the hints about parentheses are pointing in a useful direction, but I was thinking about the repercussions incorrectly above. It's not a question of an encoding failure of some kind in the purging pipeline. It's that the fetching pipeline (as in User->Varnish->Swift->(MW|Thumbor|etc?)) accepts multiple possible encodings of a file's URI, without consistent normalization (or rejection) across layers, while the PURGEs are only sent for whatever is considered the canonical encoding of the filename. And I think the pirate uploaders have figured this out and are exploiting it.
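To make that concrete, a hypothetical PHP illustration (made-up filename/path; the canonical encoding is whatever MW actually emits in the PURGE):

<?php
// Hypothetical example of the mismatch: all of these decode to the same name,
// but an un-normalized cache keys them as distinct objects, and the PURGE is
// only ever sent for the canonical form.
$variants = [
    '/wikipedia/commons/a/ab/Foo_%28bar%29.jpg',   // canonical form that MW purges
    '/wikipedia/commons/a/ab/Foo_(bar).jpg',       // literal parens
    '/wikipedia/commons/a/ab/%46oo_%28bar%29.jpg', // over-encoded 'F'
];

foreach ($variants as $v) {
    printf("%-48s decodes to %s\n", $v, rawurldecode($v));
}
// Only the first variant's cache object is hit by the PURGE; the others can
// keep serving a since-deleted file until they naturally expire.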
Err, we should really move the sub-conversation back to T171881 . This ticket is more about general reliability problems and/or race-conditions, not about the WP0 abuse specifically.
Unfortunately, there is a known issue with this version of nginx where it breaks older versions of the popular Android library OkHttp.
What I really need to dig on this further is an easy way to see a list of recent WP0-abuse-related deletions on various wikis. Am I missing some way to use the deletion log search interfaces?
Nov 9 2017
So, looking at all the crash messages we've managed to record since the beginning of this ticket, the CPU# indicated has had a history of: 41, 23, 47, 47, 1, 19 . The way Linux numbers the CPU cores, since they're all odd they're all on the second CPU die (the one with fewer DIMM slots next to it).
I'm pretty sure all of the TCP application-level data flows match up roughly with the expected sequence of TLS HANDSHAKE -> CLIENT HTTP REQ -> SERVER HTTP RESP up until just before the idle period. There's a pair of final data packets that are 30 bytes in length, one for server->client and then one for client->server. I'm almost certain the client->server one is a TLS-layer close-notify. Not sure about the earlier server->client one. This would not be the first time we've suspected some relationship between interesting/bad TCP indicators and the close-notify-related stuff being worked on over in T163674 -> https://gerrit.wikimedia.org/r/#/c/386195/ (which is now deployed on the servers, but configuration hasn't been turned on to try it).
Annotating some basic thoughts on the above (keep in mind with various kinds of offload in play, packetization/MTU/checksum will often look different here in the host tcpdump than on the wire...):
I haven't had time to analyze it deeply/manually, but I managed to capture/filter down tcpdump verbose/stamped outputs for exactly one connection which eventually suffered the ICMP host-unreach. This might put the problem in better context (client IP replaced with 192.0.2.192):
Nov 8 2017
@RobH - also, we should go with stretch from the get-go on these as well (like bast4)
@RobH - the hostnames for these should be dns4001 + dns4002. We won't be running ganeti when we initially bring these into service, so they should have the standard no-virtualization setup.
Re: all of the above about git-ssh: I pushed https://gerrit.wikimedia.org/r/#/c/389871/ and @ayounsi fixed up the router ACLs, so the public entrypoint git-ssh.codfw.wikimedia.org into phab2001-vcs works now. Also pushed the TTL reduction for the real service hostname git-ssh.wikimedia.org in https://gerrit.wikimedia.org/r/#/c/389869/ .
Well, there are two different actions to get through here:
Nov 7 2017
Nov 6 2017
The timeout changes above will offer some insulation, and as time passes we're not seeing evidence of this problem recurring with the do_stream=false patch reverted.
Copying in some commentary I was accidentally putting in the wrong ticket (the private purchasing one) for the new globalsign certs over the past few days:
Nov 5 2017
Related to T131012? The file doesn't appear to have an XML prolog (although, as that ticket notes, one probably shouldn't be required)
Nov 1 2017
I don't think they're currently puppetized for lvs::realserver, but it looks like the machines had such a configuration in the past, and removing it from puppet doesn't remove the effects from the host. Probably need to remove the wikimedia-lvs-realserver package from the host, and/or remove /etc/default/wikimedia-lvs-realserver, and/or manually remove the IP from the loopback?
Oct 31 2017
And the other side of this audit. Before we try to (carefully) switch to nginx-light, we need them all upgraded to the latest version so the dpkg-level replacement works sanely:
bblack@neodymium:~$ sudo cumin 'R:class = "tlsproxy::instance"' 'apt-cache policy nginx-full|egrep "Installed:|Candidate:"' 392 hosts will be targeted: conf[2001-2003].codfw.wmnet,conf[1001-1003].eqiad.wmnet,cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1045,1048-1055,1058,1061-1068,1071-1074,1099].eqiad.wmnet,cp[3007-3008,3010,3030-3049].esams.wmnet,cp[4021-4032].ulsfo.wmnet,cp1008.wikimedia.org,ms-fe[2005-2008].codfw.wmnet,ms-fe[1005-1008].eqiad.wmnet,mw[2017,2097,2099-2117,2120-2147,2150-2151,2153-2245,2247-2258].codfw.wmnet,mw[1180-1195,1197-1216,1218-1235,1238-1258,1261-1290,1293-1306,1308-1317,1319-1328].eqiad.wmnet,mwdebug[1001-1002].eqiad.wmnet Confirm to continue [y/n]? y ===== NODE GROUP ===== (1) ms-fe2005.codfw.wmnet ----- OUTPUT of 'apt-cache policy...led:|Candidate:"' ----- Installed: 1.13.5-1+wmf1 Candidate: 1.13.6-2+wmf1 ===== NODE GROUP ===== (7) ms-fe[2006-2008].codfw.wmnet,ms-fe[1005-1008].eqiad.wmnet ----- OUTPUT of 'apt-cache policy...led:|Candidate:"' ----- Installed: 1.11.10-1+wmf2~stretch1 Candidate: 1.13.6-2+wmf1 ===== NODE GROUP ===== (80) cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1045,1048-1055,1058,1061-1068,1071-1074,1099].eqiad.wmnet,cp[3007-3008,3010,3030-3049].esams.wmnet,cp[4021-4023,4025-4032].ulsfo.wmnet,cp1008.wikimedia.org ----- OUTPUT of 'apt-cache policy...led:|Candidate:"' ----- Installed: 1.13.6-2+wmf1~jessie1 Candidate: 1.13.6-2+wmf1~jessie1 ===== NODE GROUP ===== (53) conf[1001-1003].eqiad.wmnet,mw[2100,2153-2162,2201,2243,2247-2250,2256].codfw.wmnet,mw[1228,1261-1265,1299-1306,1312-1317,1319-1328].eqiad.wmnet,mwdebug[1001-1002].eqiad.wmnet ----- OUTPUT of 'apt-cache policy...led:|Candidate:"' ----- Installed: 1.11.10-1+wmf3 Candidate: 1.13.6-2+wmf1~jessie1 ===== NODE GROUP ===== (4) mw[1308-1311].eqiad.wmnet ----- OUTPUT of 'apt-cache policy...led:|Candidate:"' ----- Installed: 1.13.5-1+wmf1~jessie1 Candidate: 1.13.6-2+wmf1~jessie1 ===== NODE GROUP ===== (246) conf[2001-2003].codfw.wmnet,mw[2017,2097,2099,2101-2117,2120-2147,2150-2151,2163-2200,2202-2242,2244-2245,2251-2255,2257-2258].codfw.wmnet,mw[1180-1195,1197-1216,1218-1227,1229-1235,1238-1258,1266-1290,1293-1298].eqiad.wmnet ----- OUTPUT of 'apt-cache policy...led:|Candidate:"' ----- Installed: 1.11.4-1+wmf14 Candidate: 1.13.6-2+wmf1~jessie1 ================
Auditing production tlsproxy users for the switch to nginx-light in https://gerrit.wikimedia.org/r/#/c/386424/ shows no excess modules in use on any of them, except for the expected lua+ndk on the cache hosts (which are installed explicitly and separately anyway, not as part of nginx-full):
bblack@neodymium:~$ sudo cumin 'R:class = "tlsproxy::instance"' 'grep /usr/lib/nginx/modules/ /proc/$(systemctl show nginx -p MainPID|cut -d= -f2)/maps|cut -d/ -f6-|sort|uniq' 392 hosts will be targeted: conf[2001-2003].codfw.wmnet,conf[1001-1003].eqiad.wmnet,cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1045,1048-1055,1058,1061-1068,1071-1074,1099].eqiad.wmnet,cp[3007-3008,3010,3030-3049].esams.wmnet,cp[4021-4032].ulsfo.wmnet,cp1008.wikimedia.org,ms-fe[2005-2008].codfw.wmnet,ms-fe[1005-1008].eqiad.wmnet,mw[2017,2097,2099-2117,2120-2147,2150-2151,2153-2245,2247-2258].codfw.wmnet,mw[1180-1195,1197-1216,1218-1235,1238-1258,1261-1290,1293-1306,1308-1317,1319-1328].eqiad.wmnet,mwdebug[1001-1002].eqiad.wmnet Confirm to continue [y/n]? y ===== NODE GROUP ===== (80) cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1045,1048-1055,1058,1061-1068,1071-1074,1099].eqiad.wmnet,cp[3007-3008,3010,3030-3049].esams.wmnet,cp[4021-4023,4025-4032].ulsfo.wmnet,cp1008.wikimedia.org ----- OUTPUT of 'grep /usr/lib/ng.../ -f6-|sort|uniq' ----- ndk_http_module.so ngx_http_lua_module.so ===== NODE GROUP ===== (1) cp4024.ulsfo.wmnet ----- OUTPUT of 'grep /usr/lib/ng.../ -f6-|sort|uniq' ----- ssh: connect to host cp4024.ulsfo.wmnet port 22: Connection timed out ================
Ping @RobH - this hardware needs replacing. I guess diagnostics aren't perfect, and neither is the SEL, but clearly the node crashes out even during a fresh install.
Oct 30 2017
Reducing this from UBN->High, because the current best working theory is that this problem is gone so long as we keep the VCL do_stream=false change reverted. Obviously, there are still some related investigations ongoing, and I'm going to write up an Incident_Report about the 503s later today as well.
Trickled-in POST on the client side would be something else. Varnish's timeout_idle, which is set to 5s on our frontends, acts as the limit for receiving all client request headers, but I'm not sure it has a similar limit that applies to client-sent bodies. In any case, this would consume front-edge client connections, but wouldn't trigger anything deeper into the stack. We could/should double-check varnish's behavior there, but that's not what's causing this; this is definitely on the receiving end of responses from the applayer.
Oct 29 2017
Now that I'm digging deeper, it seems there are one or more projects in progress built around Push-like things, in particular T113125 . I don't see any evidence that any of them have been deployed live yet, but maybe I'm missing something. If we do have a live deploy of any kind of push-like functionality through the text cluster, it's a likely candidate for the issues above in the short term.
Does Echo have any kind of push notification going on, even in light testing yet?
Yes, but the work for that is more on the CA end than ours, from a technical perspective. Because of Google's deadlines, in practice virtually all CA vendors will have to automatically embed SCTs in all certs they issue by April 2018. Our vendors are already on top of this, and we expect to have SCTs embedded in our next round of unified certificate renewals from our dual CAs (GlobalSign + Digicert), happening this quarter.
Oct 28 2017
@Whatamidoing-WMF - Kinda? See also the explanation here: T25932#3614933 . TL;DR is that just because we removed the cipher that IE8/XP should use, that doesn't necessarily mean that no IE8/XP clients will ever connect to us.
A while after the above, @hoo started focusing on a different aspect of this that we've been somewhat ignoring as more of a side-symptom: there tend to be a lot of sockets in a strange state on the "target" varnish, to various MW nodes. They look strange on both sides, in that they spend significant time in the CLOSE_WAIT state on the varnish side and FIN_WAIT_2 on the MW side. The two states are consistent with each other, but it's not usually a combination that non-buggy application code spends much time in. In this state, the MW side has sent its FIN, Varnish has seen and ACKed it, but Varnish has not yet sent its own FIN to finish the closing process, and MW is still waiting on it.
For all of the same good reasons pointed out a while back in e.g. https://blog.qualys.com/ssllabs/2016/09/06/is-http-public-key-pinning-dead , Google is putting the last nail in the coffin of [H]PKP, effectively killing it:
Updates from the Varnish side of things today (since I've been bad about getting commits/logs tagged onto this ticket):
Oct 27 2017
So, I noticed again today and figured I should bring it up here: it seems highly fishy that most of the files that end up on the Files to purge lists on the WP0 Reporter's room have parentheses in their titles, either literally or as the %-encoded %28...%29 pair. Has anyone deeply investigated the angle that there's an encoding problem here? E.g. that the actual URL of the file on upload and the URL being PURGEd differ in parentheses-encoding details in some way, or that there's some fault that causes PURGE URL parentheses to be double-encoded, etc.?
So, the reinstall attempt failed with another crash during the installer. I think we have to be looking at bad hardware here:
Since diagnostics and SEL don't turn up anything, I'll recap what we've observed:
Unless anyone objects, I'd like to start with reverting our emergency varnish max_connections changes from https://gerrit.wikimedia.org/r/#/c/386756 . Since the end of the log above, connection counts have returned to normal (~100), which is 1/10th of the normal 1K limit that usually isn't a problem. If we leave the 10K limit in place, it will only serve to mask (for a time) any recurrence of the issue, making it possible to detect early only by watching varnish socket counts on all the text cache machines.
My gut instinct remains what it was at the end of the log above. I think something in the revert of wikidatawiki to wmf.4 fixed this. And given the timing alignment of the "Fix sorting of NullResults" changes + the initial ORES->wikidata fatals, I think those in particular are a strong candidate. I would start by undoing all of the other emergency changes first, leaving the wikidatawiki->wmf.4 bit for last.
Copying this in from etherpad (this is less awful than 6 hours of raw IRC+SAL logs, but still pretty verbose):
# cache servers work ongoing here, ethtool changes that require short depooled downtimes around short ethernet port outages:
17:49 bblack: ulsfo cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause
17:57 bblack@neodymium: conftool action : set/pooled=no; selector: name=cp4024.ulsfo.wmnet
17:59 bblack: codfw cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause
18:00 <+jouncebot> Amir1: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
18:27 bblack: esams cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause
18:41 bblack: eqiad cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause
Oct 26 2017
Powering off for now, less confusing for other software-level maintenance.
It looks like that commentary is still valid for even the most-recent Firefox builds. So, we may find that even modern Firefox doesn't send close_notify in the case we're talking about. Open question whether Safari and/or Chrome do (they don't derive from Gecko).
I left this down because @RobH was due on-site a short while later. He observed no SEL entry while on-site.
Oct 25 2017
cp4024 died randomly today. I've left it alone other than to connect to the console and verify no response there.
21:07 < icinga-wm> PROBLEM - Host cp4024 is DOWN: PING CRITICAL - Packet loss = 100%
The patch above was released with nginx-1.13.6-2+wmf1, so we're capable of experimentation now. The flag isn't turned on anywhere yet.
Seems the bot missed logging this here:
Would it mean that nginx should keep more TCP connections open, hoping for the client to eventually send a close_notify (in response to its previous one)? Are those connections going to be torn down after their overall timeouts expire?