BBlack (Brandon Black)
WMF Operations Engineer

User Details

User Since
Nov 4 2014, 4:29 PM (162 w, 4 d)
Availability
Available
IRC Nick
bblack
LDAP User
BBlack
MediaWiki User
BBlack (WMF)

Recent Activity

Mon, Dec 11

BBlack added a comment to T99531: [Task] move wikiba.se webhosting to wikimedia misc-cluster.

It's a pain any direction we slice this, and I'm not fond of adding new canonical domains outside the known set for individual low-traffic projects. We didn't add new domains for a variety of other public-facing efforts (e.g. wdqs, ORES, maps, etc).

Mon, Dec 11, 4:05 PM · Patch-For-Review, Traffic, wikiba.se, Operations, Wikidata-Sprint-2016-11-08, Wikidata

Mon, Nov 27

BBlack added a comment to T181264: Refresh or replace oxygen.

Yeah I still use oxygen pretty routinely.

Mon, Nov 27, 2:54 PM · hardware-requests, Operations, Analytics

Wed, Nov 22

BBlack added a comment to T166179: singapore caching center: eqiad staging tracking task.

awesome, thanks :)

Wed, Nov 22, 10:55 PM · ops-eqsin, DC-Ops, Operations
BBlack added a comment to T164327: replace ulsfo aging servers.

Recapping where we're at on all things here, because even I get lost sometimes:

Wed, Nov 22, 9:01 PM · Traffic, Operations, ops-ulsfo
BBlack closed T178423: rack/setup/install cp40(29|3[012]).ulsfo.wmnet as Resolved.
Wed, Nov 22, 8:53 PM · Patch-For-Review, Traffic, Operations, ops-ulsfo
BBlack closed T178423: rack/setup/install cp40(29|3[012]).ulsfo.wmnet, a subtask of T164327: replace ulsfo aging servers, as Resolved.
Wed, Nov 22, 8:53 PM · Traffic, Operations, ops-ulsfo
BBlack added a comment to T166179: singapore caching center: eqiad staging tracking task.

@Cmjohnson - did this ship out?

Wed, Nov 22, 8:42 PM · ops-eqsin, DC-Ops, Operations
BBlack added a comment to T179156: 503 spikes and resulting API slowness starting 18:45 October 26.

No, we never made an incident report on this one, and I don't think it would be fair at this time to implicate ORES as a cause. We can't really say that ORES was directly involved at all (or any of the other services investigated here). Because the cause was so unknown at the time, we stared at lots of recently-deployed things, and probably uncovered hints at minor issues in various services incidentally, but none of them may have been causative.

Wed, Nov 22, 7:38 PM · Release-Engineering-Team (Watching / External), Patch-For-Review, Traffic, Wikimedia-Incident, Operations, ORES, Scoring-platform-team, Wikidata
BBlack added a comment to T178535: decommission lvs400[1-4].ulsfo.wmnet.

These are now non-primary, but still active as backups for now. Will switch to spare role and remove from router configs post-Thanksgiving and then real decom can start.

Wed, Nov 22, 9:43 AM · Patch-For-Review, Traffic, Operations, ops-ulsfo
BBlack closed T178436: rack/setup/install lvs400[567].ulsfo.wmnet as Resolved.

These are fully in-service now

Wed, Nov 22, 9:42 AM · Patch-For-Review, Traffic, Operations
BBlack closed T178436: rack/setup/install lvs400[567].ulsfo.wmnet, a subtask of T164327: replace ulsfo aging servers, as Resolved.
Wed, Nov 22, 9:42 AM · Traffic, Operations, ops-ulsfo

Sat, Nov 18

BBlack added a comment to T171881: CL support for Wikipedia Zero piracy problems.

Never mind, I took a quick look at some of the PURGE logs and found some. The tildes are decoded canonically (as we should expect from rawurlencode() and the RFC), so I'll update the canonicalization to decode tildes as well. That will avoid some pointless URL rewrites in the canonical cases (which should be the most common).

Sat, Nov 18, 12:42 AM · Patch-For-Review, Community-Liaisons (Oct-Dec 2017), Zero
BBlack added a comment to T171881: CL support for Wikipedia Zero piracy problems.

Yeah, you're right, I see that now. So it's all using rawurlencode() in practice, which is better! Since literal spaces aren't allowed in Title or File URLs (underscores), I think the only real impact here is how canonicalization works for the tilde. Do we have examples for File: URLs -> upload that have tildes to verify on?

Sat, Nov 18, 12:38 AM · Patch-For-Review, Community-Liaisons (Oct-Dec 2017), Zero

Fri, Nov 17

BBlack added a comment to T171881: CL support for Wikipedia Zero piracy problems.

The above is in effect on all upload caches since about 17:23 UTC (just before this post), and doesn't seem to be causing any adverse effects.

Fri, Nov 17, 5:27 PM · Patch-For-Review, Community-Liaisons (Oct-Dec 2017), Zero
BBlack moved T180792: Remove 3DES patch from OpenSSL builds from Triage to TLS on the Traffic board.
Fri, Nov 17, 5:16 PM · Operations, Traffic
BBlack closed T163251: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members as Resolved.

Done here I think as well, thanks everyone :)

Fri, Nov 17, 3:49 PM · Community-Liaisons (Oct-Dec 2017), Patch-For-Review, User-Johan, Operations, Traffic
BBlack closed T163251: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members, a subtask of T147199: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support), as Resolved.
Fri, Nov 17, 3:49 PM · Browser-Support-Internet-Explorer, User-notice, Patch-For-Review, Operations, Traffic
Dzahn awarded T147199: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) a Goat token.
Fri, Nov 17, 3:07 PM · Browser-Support-Internet-Explorer, User-notice, Patch-For-Review, Operations, Traffic
BBlack created T180792: Remove 3DES patch from OpenSSL builds.
Fri, Nov 17, 3:06 PM · Operations, Traffic
BBlack closed T147199: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) as Resolved.

No 3DES connections or saved sessions left on the public cache endpoints \o/

Fri, Nov 17, 3:05 PM · Browser-Support-Internet-Explorer, User-notice, Patch-For-Review, Operations, Traffic
BBlack closed T147199: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support), a subtask of T118181: Planning for phasing out non-Forward-Secret TLS ciphers, as Resolved.
Fri, Nov 17, 3:05 PM · Patch-For-Review, Operations, Traffic

Nov 17 2017

BBlack added a comment to T171881: CL support for Wikipedia Zero piracy problems.

On the MW side of (2) above, it appears the swiftFileBackend code in MW uses PHP's urlencode to transform the filenames into upload URL paths. urlencode documentation claims that it percent-encodes everything but alphanumerics and -_. (so the set it does not encode is almost the official Unreserved Set, but it's missing the tilde). It also encodes spaces as + rather than %20 because it's meant for query strings rather than paths. PHP's rawurlencode would probably have been more appropriate here as it conforms to the RFC and excludes from encoding exactly the Unreserved Set and doesn't do the +-for-spaces thing. However, in practice, we can deal with the ~ issue and spaces have already been made into underscores, so the plusses shouldn't ever actually appear.

Nov 17 2017, 12:14 AM · Patch-For-Review, Community-Liaisons (Oct-Dec 2017), Zero
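
The difference between the two PHP encoders described in the comment above can be illustrated with a small sketch. This is Python rather than PHP, and the helper names `rawurlencode_like` and `urlencode_like` are hypothetical approximations built on `urllib.parse.quote` (assuming Python 3.7 or later, where `~` is treated as unreserved). It is not the MediaWiki code itself.

```python
from urllib.parse import quote


def rawurlencode_like(name: str) -> str:
    """Approximate PHP rawurlencode(): leave only RFC 3986 unreserved chars."""
    # With safe="", quote() keeps letters, digits and "_.-~" and escapes the rest.
    return quote(name, safe="")


def urlencode_like(name: str) -> str:
    """Approximate PHP urlencode(): also escape "~" and turn spaces into "+"."""
    # A literal "+" is already escaped to %2B above, so the replace is unambiguous.
    return quote(name, safe="").replace("%20", "+").replace("~", "%7E")


# Hypothetical filenames, chosen only to show where the two encoders differ.
for sample in ("Foo~bar.jpg", "Foo bar.jpg", "Foo_bar.jpg"):
    print(sample, rawurlencode_like(sample), urlencode_like(sample))
```

Only the tilde and the space produce different output. Once spaces have been turned into underscores, the tilde is the one remaining difference, which matches the conclusion in the comment.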

Nov 16 2017

BBlack added a subtask for T178173: Renew unified certificates 2017: Unknown Object (Task).
Nov 16 2017, 11:47 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T154026: On mobile, http://wikipedia.org/wiki/Foo redirects to https://www.m.wikipedia.org/wiki/Foo which does not exist.

Do we have answers to the questions asked above in July about what the right behaviors are?

Nov 16 2017, 5:10 PM · Patch-For-Review, Readers-Web-Backlog (Tracking), Operations, Puppet, Wikimedia-Apache-configuration, Mobile
BBlack added a comment to T171881: CL support for Wikipedia Zero piracy problems.

So, my top questions at this point on all things related are:

Nov 16 2017, 2:03 PM · Patch-For-Review, Community-Liaisons (Oct-Dec 2017), Zero
BBlack added a comment to T171881: CL support for Wikipedia Zero piracy problems.

On the cache_text side for the actual wikis, Varnish does do some normalization, but not complete normalization. Varnish basically just decodes a special handful of %-escapes based on what wfUrlencode does, and that code is here: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/varnish/templates/normalize_path.inc.vcl.erb;c817459c34aa7ab815da266496864125b470b04a$40 . It's been a known issue for quite a long while that we could/should be doing better on that normalization, but it hasn't been a priority because there's not much pragmatic fallout other than slight impact on cache hitrates.

Nov 16 2017, 4:51 AM · Patch-For-Review, Community-Liaisons (Oct-Dec 2017), Zero

Nov 15 2017

BBlack closed T174891: cp4024 kernel errors as Resolved.

Closing for now, assuming no new problems surface. Thanks @RobH :)

Nov 15 2017, 10:56 PM · ops-ulsfo, Operations, Traffic
BBlack added a comment to T180407: Change "CP" cookie from subdomain to project level.

Does RL make use of the CP cookie information to use different module-loading strategies for H/1 vs H/2? I remember that being the intent in creating it, but I'm not sure what the status is today.

Nov 15 2017, 8:02 AM · Operations, Traffic

Nov 14 2017

BBlack added a comment to T174891: cp4024 kernel errors.

For now I'm puppetizing it back into the cluster (and ipsec lists), but not repooling yet...

Nov 14 2017, 8:07 PM · ops-ulsfo, Operations, Traffic
BBlack added a comment to T171881: CL support for Wikipedia Zero piracy problems.

Would it be fair to assume that the URL-encoding normalization rules for the upload.wikimedia.org URLs should be the same as the one we use for MediaWiki? Anyone know if there's any reason for it to vary from that?

Nov 14 2017, 5:37 AM · Patch-For-Review, Community-Liaisons (Oct-Dec 2017), Zero
BBlack added a comment to T171881: CL support for Wikipedia Zero piracy problems.

What I really need in order to dig into this further is an easy way to see a list of recent WP0-abuse-related deletions on various wikis. Am I missing some way to use the deletion log search interfaces?

Z591 should be the best list we have.

Nov 14 2017, 5:35 AM · Patch-For-Review, Community-Liaisons (Oct-Dec 2017), Zero
BBlack created T180424: cp3048 crashed.
Nov 14 2017, 5:30 AM · Operations, ops-esams, Traffic

Nov 13 2017

BBlack added a comment to T171881: CL support for Wikipedia Zero piracy problems.

I think the hints about parentheses are pointing in a useful direction, but I think I was thinking about the repercussions incorrectly above. It's not a question of an encoding failure of some kind in the purging pipeline. It's that the fetching pipeline (as in User->Varnish->Swift->(MW|Thumbor|etc?)) accepts multiple possible encodings of a file's URI, without consistent normalization (or rejection) across layers, while the PURGEs are only sent for whatever is considered the canonical encoding of the filename. And I think the pirate uploaders have figured this out and are exploiting it.

Nov 13 2017, 6:00 PM · Patch-For-Review, Community-Liaisons (Oct-Dec 2017), Zero
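
The mismatch described in the comment above (several encodings fetch the same file, but only the canonical one is purged) is easy to see in a sketch. The function below is a hypothetical illustration of RFC 3986 style percent-encoding normalization in Python. It is not the production VCL, and the example paths are made up.

```python
import re
import string

# RFC 3986 unreserved set: escapes of these characters can always be decoded.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def normalize_path(path: str) -> str:
    """Decode escapes of unreserved chars; uppercase the hex of all others."""

    def fix(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()

    return re.sub(r"%([0-9A-Fa-f]{2})", fix, path)


# Three spellings of one hypothetical path collapse to a single cache key.
variants = ["/x/Foo%7Ebar.jpg", "/x/Foo%7ebar.jpg", "/x/Foo~bar.jpg"]
print({normalize_path(v) for v in variants})
```

If every layer applied one such rule before cache lookup, a PURGE for the canonical form would cover all three spellings. Reserved characters such as parentheses are deliberately left alone here. Whether to decode or encode them is a policy choice this sketch does not make.
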
BBlack added a comment to T133821: Content purges are unreliable.

Err, we should really move the sub-conversation back to T171881. This ticket is more about general reliability problems and/or race-conditions, not about the WP0 abuse specifically.

Nov 13 2017, 5:49 PM · Patch-For-Review, Operations, Traffic
BBlack triaged T180269: Wikimedia's recent upgrade to nginx v. 1.13.6 breaks older Android HTTP libraries as Normal priority.

Unfortunately, there is a known issue with this version of nginx where it breaks older versions of the popular Android library OkHttp.

Nov 13 2017, 4:49 PM · Traffic, Wikimedia-General-or-Unknown, HTTPS, Operations
BBlack added a comment to T133821: Content purges are unreliable.

What I really need in order to dig into this further is an easy way to see a list of recent WP0-abuse-related deletions on various wikis. Am I missing some way to use the deletion log search interfaces?

Nov 13 2017, 4:40 PM · Patch-For-Review, Operations, Traffic

Nov 9 2017

BBlack added a comment to T174891: cp4024 kernel errors.

So, looking at all the crash messages we've managed to record since the beginning of this ticket, the CPU numbers indicated have been: 41, 23, 47, 47, 1, 19. They're all odd, and given the way Linux numbers the CPU cores here, that puts them all on the second CPU die (the one with fewer DIMM slots next to it).

Nov 9 2017, 6:59 PM · ops-ulsfo, Operations, Traffic
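
The parity argument in the comment above can be checked with a few lines of Python. This is a sketch that assumes a two-socket host where logical CPUs are interleaved across sockets (even numbers on the first die, odd on the second), as the comment describes. `socket_for_cpu` is a hypothetical helper. On a real host the mapping can be read from `/sys/devices/system/cpu/cpuN/topology/physical_package_id`.

```python
# CPU numbers taken from the crash messages listed in the comment.
crash_cpus = [41, 23, 47, 47, 1, 19]


def socket_for_cpu(cpu: int, sockets: int = 2) -> int:
    """Map a logical CPU number to a socket, assuming interleaved numbering."""
    return cpu % sockets


# Collect the distinct sockets implicated by the recorded crashes.
sockets_hit = {socket_for_cpu(cpu) for cpu in crash_cpus}
print(sockets_hit)
```

Under that assumption every recorded crash maps to socket 1, the second die.
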
BBlack added a comment to T167691: High amount of unexpected ICMP dest unreachable toward esams cache clusters.

I'm pretty sure all of the TCP application-level data flows match up roughly with the expected sequence of TLS HANDSHAKE -> CLIENT HTTP REQ -> SERVER HTTP RESP up until just before the idle period. There's a pair of final data packets that are 30 bytes in length, one for server->client and then one for client->server. I'm almost certain the client->server one is a TLS-layer close-notify. Not sure about the earlier server->client one. This would not be the first time we've suspected some relationship between interesting/bad TCP indicators and the close-notify-related stuff being worked on over in T163674 -> https://gerrit.wikimedia.org/r/#/c/386195/ (which is now deployed on the servers, but configuration hasn't been turned on to try it).

Nov 9 2017, 5:39 PM · netops, Operations, Traffic
BBlack added a comment to T167691: High amount of unexpected ICMP dest unreachable toward esams cache clusters.

Annotating some basic thoughts on the above (keep in mind with various kinds of offload in play, packetization/MTU/checksum will often look different here in the host tcpdump than on the wire...):

Nov 9 2017, 3:57 PM · netops, Operations, Traffic
BBlack added a comment to T167691: High amount of unexpected ICMP dest unreachable toward esams cache clusters.

I haven't had time to analyze it deeply/manually, but I managed to capture/filter down tcpdump verbose/stamped outputs for exactly one connection which eventually suffered the ICMP host-unreach. This might put the problem in better context (client IP replaced with 192.0.2.192):

Nov 9 2017, 2:30 PM · netops, Operations, Traffic
Liuxinyu970226 awarded T156026: Enable Service in Asia Cache DC a Mountain of Wealth token.
Nov 9 2017, 12:46 PM · Operations, Traffic

Nov 8 2017

BBlack added a comment to T179204: setup/deploy dns400[12]/wmf721[56].

@RobH - also, we should go stretch from the get-go on these as well (like bast4)

Nov 8 2017, 7:01 PM · Traffic, Operations, ops-ulsfo
BBlack added a project to T177742: Investigate Chrony as a replacement for ISC ntpd: Traffic.
Nov 8 2017, 6:59 PM · Traffic, Operations
BBlack reassigned T179204: setup/deploy dns400[12]/wmf721[56] from BBlack to RobH.

@RobH - the hostnames for these should be dns4001 + dns4002. We won't be running ganeti when we initially bring these into service, so they should have a standard no-virtualization setup.

Nov 8 2017, 6:58 PM · Traffic, Operations, ops-ulsfo
BBlack updated subscribers of T180069: Pybal should be able to advertise to multiple routers.
Nov 8 2017, 6:55 PM · Patch-For-Review, Pybal, Operations, Traffic
BBlack created T180069: Pybal should be able to advertise to multiple routers.
Nov 8 2017, 6:53 PM · Patch-For-Review, Pybal, Operations, Traffic
BBlack updated subscribers of T137928: Deploy phabricator to phab2001.codfw.wmnet.

Re: all of the above about git-ssh: I pushed https://gerrit.wikimedia.org/r/#/c/389871/ and @ayounsi fixed up the router ACLs, so the public entrypoint git-ssh.codfw.wikimedia.org into phab2001-vcs works now. Also pushed the TTL reduction for the real service hostname git-ssh.wikimedia.org in https://gerrit.wikimedia.org/r/#/c/389869/ .

Nov 8 2017, 6:21 PM · Release-Engineering-Team (Kanban), Patch-For-Review, WorkType-NewFunctionality, Availability, Phabricator
BBlack added a comment to T156256: Allocate address space for Singapore (APNIC).

Status updates?

Nov 8 2017, 3:21 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T164456: Migrate to nginx-light.

Well, there are two different actions to get through here:

Nov 8 2017, 2:05 PM · Traffic, Operations

Nov 7 2017

BBlack renamed T179953: cp3043 disk failure from cp4043 disk failure to cp3043 disk failure.
Nov 7 2017, 4:31 PM · Traffic, ops-esams, Operations

Nov 6 2017

BBlack changed the status of T179156: 503 spikes and resulting API slowness starting 18:45 October 26 from Open to Stalled.

The timeout changes above will offer some insulation, and as time passes we're not seeing evidence of this problem recurring with the do_stream=false patch reverted.

Nov 6 2017, 9:37 PM · Release-Engineering-Team (Watching / External), Patch-For-Review, Traffic, Wikimedia-Incident, Operations, ORES, Scoring-platform-team, Wikidata
BBlack changed the status of T179156: 503 spikes and resulting API slowness starting 18:45 October 26, a subtask of T174361: 1.31.0-wmf.5 deployment blockers, from Open to Stalled.
Nov 6 2017, 9:37 PM · Release-Engineering-Team (Kanban), Train Deployments, Release
BBlack added a comment to T178173: Renew unified certificates 2017.

Copying in some commentary I was accidentally putting in the wrong ticket (the private purchasing one) for the new globalsign certs over the past few days:

In T178831#3731564, @BBlack wrote:

I've issued new certs with new PKs and committed them into the repos (public side here: https://gerrit.wikimedia.org/r/#/c/388378/ ). There's some testing and puppet-work to do before public deployment (especially with the new SCTs), plus waiting a few days after the start date to avoid simpler cases with bad client clocks, before these actually go live (probably sometime next week).

Nov 6 2017, 3:36 PM · Patch-For-Review, Operations, Traffic

Nov 5 2017

BBlack added a comment to T179787: upload.wikimedia.org reports wrong mimetype for svg.

Related to T131012? The file doesn't appear to have an XML prolog (although as that ticket notes, one probably shouldn't be required).

Nov 5 2017, 4:01 PM · Operations, media-storage

Nov 1 2017

BBlack added a comment to T179494: restbase.svc.eqiad.wmnet directs requests to staging if the origin is staging too.

I don't think they're currently puppetized for lvs::realserver, but it looks like the machines had such a configuration in the past, and removing it from puppet doesn't remove the effects from the host. Probably need to remove the wikimedia-lvs-realserver package from the host, and/or remove /etc/default/wikimedia-lvs-realserver, and/or manually remove the IP from the loopback?

Nov 1 2017, 2:53 PM · Services (done), Traffic, Operations
BBlack added a comment to T179050: setup bast4002/WMF7218.

This is currently installed with jessie, but if we set up a new box, let's use stretch from the start?

Nov 1 2017, 1:49 PM · Traffic, Operations, ops-ulsfo

Oct 31 2017

BBlack updated subscribers of T164456: Migrate to nginx-light.

And the other side of this audit. Before we try to (carefully) switch to nginx-light, we need them all upgraded to the latest version so the dpkg-level replacement works sanely:

bblack@neodymium:~$ sudo cumin 'R:class = "tlsproxy::instance"' 'apt-cache policy nginx-full|egrep "Installed:|Candidate:"'
392 hosts will be targeted:
conf[2001-2003].codfw.wmnet,conf[1001-1003].eqiad.wmnet,cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1045,1048-1055,1058,1061-1068,1071-1074,1099].eqiad.wmnet,cp[3007-3008,3010,3030-3049].esams.wmnet,cp[4021-4032].ulsfo.wmnet,cp1008.wikimedia.org,ms-fe[2005-2008].codfw.wmnet,ms-fe[1005-1008].eqiad.wmnet,mw[2017,2097,2099-2117,2120-2147,2150-2151,2153-2245,2247-2258].codfw.wmnet,mw[1180-1195,1197-1216,1218-1235,1238-1258,1261-1290,1293-1306,1308-1317,1319-1328].eqiad.wmnet,mwdebug[1001-1002].eqiad.wmnet
Confirm to continue [y/n]? y                                                                                                                                                                         
===== NODE GROUP =====                                                                                                                                                                         
(1) ms-fe2005.codfw.wmnet                                                                                                                                                                      
----- OUTPUT of 'apt-cache policy...led:|Candidate:"' -----                                                                                                                                    
  Installed: 1.13.5-1+wmf1                                                                                                                                                                     
  Candidate: 1.13.6-2+wmf1                                                                                                                                                                     
===== NODE GROUP =====                                                                                                                                                                         
(7) ms-fe[2006-2008].codfw.wmnet,ms-fe[1005-1008].eqiad.wmnet                                                                                                                                  
----- OUTPUT of 'apt-cache policy...led:|Candidate:"' -----                                                                                                                                    
  Installed: 1.11.10-1+wmf2~stretch1                                                                                                                                                           
  Candidate: 1.13.6-2+wmf1                                                                                                                                                                     
===== NODE GROUP =====                                                                                                                                                                         
(80) cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1045,1048-1055,1058,1061-1068,1071-1074,1099].eqiad.wmnet,cp[3007-3008,3010,3030-3049].esams.wmnet,cp[4021-4023,4025-4032].ulsfo.wmnet,cp1008.wikimedia.org                                                                                                                                                         
----- OUTPUT of 'apt-cache policy...led:|Candidate:"' -----                                                                                                                                    
  Installed: 1.13.6-2+wmf1~jessie1                                                                                                                                                             
  Candidate: 1.13.6-2+wmf1~jessie1                                                                                                                                                             
===== NODE GROUP =====                                                                                                                                                                         
(53) conf[1001-1003].eqiad.wmnet,mw[2100,2153-2162,2201,2243,2247-2250,2256].codfw.wmnet,mw[1228,1261-1265,1299-1306,1312-1317,1319-1328].eqiad.wmnet,mwdebug[1001-1002].eqiad.wmnet           
----- OUTPUT of 'apt-cache policy...led:|Candidate:"' -----                                                                                                                                    
  Installed: 1.11.10-1+wmf3                                                                                                                                                                    
  Candidate: 1.13.6-2+wmf1~jessie1                                                                                                                                                             
===== NODE GROUP =====                                                                                                                                                                         
(4) mw[1308-1311].eqiad.wmnet                                                                                                                                                                  
----- OUTPUT of 'apt-cache policy...led:|Candidate:"' -----                                                                                                                                    
  Installed: 1.13.5-1+wmf1~jessie1                                                                                                                                                             
  Candidate: 1.13.6-2+wmf1~jessie1                                                                                                                                                             
===== NODE GROUP =====                                                                                                                                                                         
(246) conf[2001-2003].codfw.wmnet,mw[2017,2097,2099,2101-2117,2120-2147,2150-2151,2163-2200,2202-2242,2244-2245,2251-2255,2257-2258].codfw.wmnet,mw[1180-1195,1197-1216,1218-1227,1229-1235,1238-1258,1266-1290,1293-1298].eqiad.wmnet                                                                                                                                                        
----- OUTPUT of 'apt-cache policy...led:|Candidate:"' -----                                                                                                                                    
  Installed: 1.11.4-1+wmf14                                                                                                                                                                    
  Candidate: 1.13.6-2+wmf1~jessie1                                                                                                                                                             
================
Oct 31 2017, 2:49 PM · Traffic, Operations
BBlack added a comment to T164456: Migrate to nginx-light.

Auditing production tlsproxy users for the switch to light in https://gerrit.wikimedia.org/r/#/c/386424/ shows no excess modules used on any of them, except for the expected lua+ndk on the cache hosts (which are installed explicitly and separately anyways, not part of full):

bblack@neodymium:~$ sudo cumin 'R:class = "tlsproxy::instance"' 'grep /usr/lib/nginx/modules/ /proc/$(systemctl show nginx -p MainPID|cut -d= -f2)/maps|cut -d/ -f6-|sort|uniq'
392 hosts will be targeted:
conf[2001-2003].codfw.wmnet,conf[1001-1003].eqiad.wmnet,cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1045,1048-1055,1058,1061-1068,1071-1074,1099].eqiad.wmnet,cp[3007-3008,3010,3030-3049].esams.wmnet,cp[4021-4032].ulsfo.wmnet,cp1008.wikimedia.org,ms-fe[2005-2008].codfw.wmnet,ms-fe[1005-1008].eqiad.wmnet,mw[2017,2097,2099-2117,2120-2147,2150-2151,2153-2245,2247-2258].codfw.wmnet,mw[1180-1195,1197-1216,1218-1235,1238-1258,1261-1290,1293-1306,1308-1317,1319-1328].eqiad.wmnet,mwdebug[1001-1002].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                                                                         
(80) cp[2001-2002,2004-2008,2010-2014,2016-2020,2022-2026].codfw.wmnet,cp[1045,1048-1055,1058,1061-1068,1071-1074,1099].eqiad.wmnet,cp[3007-3008,3010,3030-3049].esams.wmnet,cp[4021-4023,4025-4032].ulsfo.wmnet,cp1008.wikimedia.org                                                                                                                                                         
----- OUTPUT of 'grep /usr/lib/ng.../ -f6-|sort|uniq' -----                                                                                                                                    
ndk_http_module.so                                                                                                                                                                             
ngx_http_lua_module.so                                                                                                                                                                         
===== NODE GROUP =====                                                                                                                                                                         
(1) cp4024.ulsfo.wmnet                                                                                                                                                                         
----- OUTPUT of 'grep /usr/lib/ng.../ -f6-|sort|uniq' -----                                                                                                                                    
ssh: connect to host cp4024.ulsfo.wmnet port 22: Connection timed out                                                                                                                          
================
Oct 31 2017, 1:10 PM · Traffic, Operations
BBlack added a comment to T174891: cp4024 kernel errors.

Ping @RobH - this hardware needs replacing. I guess diagnostics aren't perfect, and neither is the SEL, but clearly the node crashes out even during a fresh install.

Oct 31 2017, 12:28 PM · ops-ulsfo, Operations, Traffic

Oct 30 2017

BBlack added a comment to T179156: 503 spikes and resulting API slowness starting 18:45 October 26.
Oct 30 2017, 5:40 PM · Release-Engineering-Team (Watching / External), Patch-For-Review, Traffic, Wikimedia-Incident, Operations, ORES, Scoring-platform-team, Wikidata
BBlack added a comment to T179156: 503 spikes and resulting API slowness starting 18:45 October 26.

We do have an obvious example to observe, though: normal slow chunked uploads of large files to commons.

Oct 30 2017, 5:39 PM · Release-Engineering-Team (Watching / External), Patch-For-Review, Traffic, Wikimedia-Incident, Operations, ORES, Scoring-platform-team, Wikidata
BBlack lowered the priority of T179156: 503 spikes and resulting API slowness starting 18:45 October 26 from Unbreak Now! to High.

Reducing this from UBN->High, because the current best working theory is that this problem is gone so long as we keep the VCL do_stream=false change reverted. Obviously, there are still some related investigations ongoing, and I'm going to write up an Incident_Report about the 503s later today as well.

Oct 30 2017, 4:28 PM · Release-Engineering-Team (Watching / External), Patch-For-Review, Traffic, Wikimedia-Incident, Operations, ORES, Scoring-platform-team, Wikidata
BBlack added a comment to T179156: 503 spikes and resulting API slowness starting 18:45 October 26.

In any case, this would consume front-edge client connections, but wouldn't trigger anything deeper into the stack

That's assuming varnish always caches the entire request, and never "streams" to the backend, even for file uploads. When discussing this with @hoo he told me that this should be the case - but is it? That would make it easy to exhaust RAM on the varnish boxes, no?

Oct 30 2017, 3:54 PM · Release-Engineering-Team (Watching / External), Patch-For-Review, Traffic, Wikimedia-Incident, Operations, ORES, Scoring-platform-team, Wikidata
BBlack added a comment to T179156: 503 spikes and resulting API slowness starting 18:45 October 26.

Trickled-in POST on the client side would be something else. Varnish's timeout_idle, which is set to 5s on our frontends, acts as the limit for receiving all client request headers, but I'm not sure that it has such a limitation that applies to client-sent bodies. In any case, this would consume front-edge client connections, but wouldn't trigger anything deeper into the stack. We could/should double-check varnish's behavior there, but that's not what's causing this; this is definitely on the receiving end of responses from the applayer.

Oct 30 2017, 3:34 PM · Release-Engineering-Team (Watching / External), Patch-For-Review, Traffic, Wikimedia-Incident, Operations, ORES, Scoring-platform-team, Wikidata
BBlack added a comment to T179156: 503 spikes and resulting API slowness starting 18:45 October 26.

There's a timeout limiting the total amount of time varnish is allowed to spend on a single request, send_timeout, defaulting to 10 minutes. Unfortunately there's no counter tracking when the timer kicks in, although a debug line is logged to VSL when that happens. We can identify requests causing the "unreasonable" behavior as follows:

varnishlog -q 'Debug ~ "Hit total send timeout"'
Oct 30 2017, 11:48 AM · Release-Engineering-Team (Watching / External), Patch-For-Review, Traffic, Wikimedia-Incident, Operations, ORES, Scoring-platform-team, Wikidata

Oct 29 2017

BBlack added a comment to T179156: 503 spikes and resulting API slowness starting 18:45 October 26.

Now that I'm digging deeper, it seems there are one or more projects in progress built around Push-like things, in particular T113125. I don't see any evidence that there's been a live deploy of them yet, but maybe I'm missing something. If we have a live deploy of any kind of push-like functionality through the text cluster, it's a likely candidate for the issues above in the short term.

Oct 29 2017, 2:02 PM · Release-Engineering-Team (Watching / External), Patch-For-Review, Traffic, Wikimedia-Incident, Operations, ORES, Scoring-platform-team, Wikidata
BBlack added a comment to T179156: 503 spikes and resulting API slowness starting 18:45 October 26.

Does Echo have any kind of push notification going on, even in light testing yet?

Oct 29 2017, 1:44 PM · Release-Engineering-Team (Watching / External), Patch-For-Review, Traffic, Wikimedia-Incident, Operations, ORES, Scoring-platform-team, Wikidata
BBlack added a comment to T92002: implement Public Key Pinning (HPKP) for Wikimedia domains.

Yes, but the work for that is more on the CA end than ours, from a technical perspective. Because of Google's deadlines, in practice virtually all CA vendors will have to automatically embed SCTs in all certs they issue by April 2018. Our vendors are already on top of this though, and we expect to have SCTs embedded in our next round of unified certificate renewals from our dual CAs (GlobalSign + Digicert), happening this quarter.

Oct 29 2017, 3:57 AM · Operations, Traffic, HTTPS

Oct 28 2017

BBlack added a comment to T97701: Can't visit login page or https wiki pages in IE7/8 on Windows XP (on SauceLabs).

@Whatamidoing-WMF - Kinda? See also the explanation here: T25932#3614933. TL;DR is that just because we removed the cipher that IE8/XP should use, that doesn't necessarily mean that no IE8/XP clients will ever connect to us.

Oct 28 2017, 11:19 PM · MediaWiki-General-or-Unknown
BBlack added a comment to T179156: 503 spikes and resulting API slowness starting 18:45 October 26.

A while after the above, @hoo started focusing on a different aspect of this that we've been somewhat ignoring as more of a side-symptom: there tend to be a lot of sockets in a strange state on the "target" varnish, to various MW nodes. They look strange on both sides, in that they spend significant time in the CLOSE_WAIT state on the varnish side and FIN_WAIT_2 on the MW side. This is a consistent state between the two nodes, but it's not usually one that non-buggy application code spends much time in. In this state, the MW side has sent FIN, Varnish has seen that and ACKed it, but Varnish has not yet decided to send its own FIN to finish its half of the close, and MW is still waiting on it.
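
For anyone wanting to watch for this state from the cache side, here is a minimal sketch (not what was actually run during the incident) that tallies TCP states from text in the Linux /proc/net/tcp format. The function name count_states, the SAMPLE data, and the choice of port 80 as the backend port are illustrative assumptions; the state codes are the kernel's standard ones, spelled the way the comment above spells them.

```python
from collections import Counter

# Kernel TCP state codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT_1", "05": "FIN_WAIT_2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_states(proc_net_tcp_text, remote_port=None):
    """Count sockets per TCP state, optionally only those to one remote port."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        # rem_address looks like HEXADDR:HEXPORT
        rem_port = int(fields[2].rsplit(":", 1)[1], 16)
        if remote_port is not None and rem_port != remote_port:
            continue
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts


# Hypothetical sample: two CLOSE_WAIT sockets and one ESTABLISHED socket
# to remote port 0x0050 (80), plus an unrelated listener.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue
   0: 0100007F:C350 0200007F:0050 08 00000000:00000000
   1: 0100007F:C351 0200007F:0050 08 00000000:00000000
   2: 0100007F:C352 0200007F:0050 01 00000000:00000000
   3: 0100007F:0CEA 00000000:0000 0A 00000000:00000000
"""

if __name__ == "__main__":
    print(dict(count_states(SAMPLE, remote_port=80)))
```

On a real Linux host the input would be something like open("/proc/net/tcp").read() (and /proc/net/tcp6, which uses the same column layout); a CLOSE_WAIT count that stays high is the symptom described above.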

Oct 28 2017, 10:05 PM · Release-Engineering-Team (Watching / External), Patch-For-Review, Traffic, Wikimedia-Incident, Operations, ORES, Scoring-platform-team, Wikidata
BBlack closed T92002: implement Public Key Pinning (HPKP) for Wikimedia domains as Declined.

For all of the same good reasons pointed out a while back in e.g. https://blog.qualys.com/ssllabs/2016/09/06/is-http-public-key-pinning-dead, Google is putting the last nail in the coffin of [H]PKP, effectively killing it:

Oct 28 2017, 9:44 PM · Operations, Traffic, HTTPS
BBlack closed T92002: implement Public Key Pinning (HPKP) for Wikimedia domains, a subtask of T165455: Go from "E" to "A+" on Securityheaders.io, as Declined.
Oct 28 2017, 9:44 PM · Wikimedia-General-or-Unknown, Security
BBlack added a comment to T179156: 503 spikes and resulting API slowness starting 18:45 October 26.

Updates from the Varnish side of things today (since I've been bad about getting commits/logs tagged onto this ticket):

Oct 28 2017, 8:21 PM · Release-Engineering-Team (Watching / External), Patch-For-Review, Traffic, Wikimedia-Incident, Operations, ORES, Scoring-platform-team, Wikidata
BBlack created P6210 wikidata slow query.
Oct 28 2017, 7:00 PM · Traffic

Oct 27 2017

BBlack added a comment to T171881: CL support for Wikipedia Zero piracy problems.

So, I noticed again today and figured I should bring it up here: it seems highly fishy that most of the files that end up on the Files to purge lists in the WP0 Reporter's room have parentheses in their titles, either literally or as the %-encoded %28...%29 pair. Has anyone deeply investigated the angle that there's an encoding problem here? E.g. that the actual URL of the file on upload and the URL being PURGEd differ in parentheses-encoding details in some way, or that some fault causes PURGE URL parentheses to be double-encoded, etc?
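
To make the suspected mismatch concrete, here is a rough sketch of how the three encodings of the same file path could be compared. The path and the helper name canonical_path are hypothetical, and this says nothing about how the upload or PURGE code actually encodes URLs; it only shows that the literal, %-encoded and double-encoded forms are different strings (so a cache keyed on the exact URL would treat them as different objects) until they are normalized.

```python
from urllib.parse import quote, unquote


def canonical_path(path, max_rounds=3):
    """Decode %-escapes until the path stops changing, then encode once.

    This is a comparison aid only: it would also rewrite a file name that
    legitimately contains a literal "%28".
    """
    for _ in range(max_rounds):
        decoded = unquote(path)
        if decoded == path:
            break
        path = decoded
    # quote() leaves letters, digits and "_.-~" alone; "(" and ")" are escaped.
    return quote(path, safe="/")


# Hypothetical upload path in its three possible spellings.
literal = "/wikipedia/commons/a/ab/Example_(1).jpg"
encoded = "/wikipedia/commons/a/ab/Example_%281%29.jpg"
double_encoded = "/wikipedia/commons/a/ab/Example_%25281%2529.jpg"

if __name__ == "__main__":
    for p in (literal, encoded, double_encoded):
        print(p, "->", canonical_path(p))
```

If the uploaded URL and the PURGEd URL only match after this kind of normalization, that would support the encoding theory.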

Oct 27 2017, 8:09 PM · Patch-For-Review, Community-Liaisons (Oct-Dec 2017), Zero
BBlack added a comment to T174891: cp4024 kernel errors.

So, the reinstall attempt failed with another crash during the installer. I think we have to be looking at bad hardware here:

Oct 27 2017, 3:18 PM · ops-ulsfo, Operations, Traffic
BBlack added a comment to T174891: cp4024 kernel errors.

Since diagnostics and SEL don't turn up anything, I'll recap what we've observed:

Oct 27 2017, 2:25 PM · ops-ulsfo, Operations, Traffic
BBlack added a comment to T179156: 503 spikes and resulting API slowness starting 18:45 October 26.

I think I found the root cause now; it seems it's actually related to the WikibaseQualityConstraints extension:

Oct 27 2017, 12:59 PM · Release-Engineering-Team (Watching / External), Patch-For-Review, Traffic, Wikimedia-Incident, Operations, ORES, Scoring-platform-team, Wikidata
BBlack added a comment to T179156: 503 spikes and resulting API slowness starting 18:45 October 26.

Unless anyone objects, I'd like to start with reverting our emergency varnish max_connections changes from https://gerrit.wikimedia.org/r/#/c/386756 . Since the end of the log above, connection counts have returned to normal, which is ~100, or 1/10th of the normal 1K limit that usually isn't a problem. If we leave the 10K limit in place, it will only serve to mask (for a time) any recurrence of the issue, so that the only way to detect it early would be to watch varnish socket counts on all the text cache machines.

Oct 27 2017, 12:47 PM · Release-Engineering-Team (Watching / External), Patch-For-Review, Traffic, Wikimedia-Incident, Operations, ORES, Scoring-platform-team, Wikidata
BBlack added a comment to T179156: 503 spikes and resulting API slowness starting 18:45 October 26.

My gut instinct remains what it was at the end of the log above. I think something in the revert of wikidatawiki to wmf.4 fixed this. And I think the timing alignment of the Fix sorting of NullResults changes + the initial ORES->wikidata fatals makes those in particular a strong candidate. I would start with undoing all of the other emergency changes first, leaving the wikidatawiki->wmf.4 bit for last.

Oct 27 2017, 12:37 PM · Release-Engineering-Team (Watching / External), Patch-For-Review, Traffic, Wikimedia-Incident, Operations, ORES, Scoring-platform-team, Wikidata
BBlack added a comment to T179156: 503 spikes and resulting API slowness starting 18:45 October 26.

Copying this in from etherpad (this is less awful than 6 hours of raw IRC+SAL logs, but still pretty verbose):

# cache servers work ongoing here, ethtool changes that require short depooled downtimes around short ethernet port outages:
17:49 bblack: ulsfo cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause
17:57 bblack@neodymium: conftool action : set/pooled=no; selector: name=cp4024.ulsfo.wmnet
17:59 bblack: codfw cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause
18:00 <+jouncebot> Amir1: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will 
                   be rewarded with a sticker.
18:27 bblack: esams cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause
18:41 bblack: eqiad cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause
Oct 27 2017, 12:34 PM · Release-Engineering-Team (Watching / External), Patch-For-Review, Traffic, Wikimedia-Incident, Operations, ORES, Scoring-platform-team, Wikidata

Oct 26 2017

BBlack added a comment to T174891: cp4024 kernel errors.

Powering off for now, less confusing for other software-level maintenance.

Oct 26 2017, 6:10 PM · ops-ulsfo, Operations, Traffic
BBlack added a comment to T163674: Frequent RST returned by appservers to LVS hosts.

It looks like that commentary is still valid for even the most-recent Firefox builds. So, we may find that even modern Firefox doesn't send close_notify in the case we're talking about. Open question whether Safari and/or Chrome do (they don't derive from Gecko).

Oct 26 2017, 5:19 PM · Patch-For-Review, Pybal, Traffic, Operations, netops
BBlack reassigned T174891: cp4024 kernel errors from BBlack to RobH.
Oct 26 2017, 2:07 PM · ops-ulsfo, Operations, Traffic
BBlack added a comment to T174891: cp4024 kernel errors.

I left this down because @RobH was due on-site a short while later. He observed no SEL entry while on-site.

Oct 26 2017, 2:07 PM · ops-ulsfo, Operations, Traffic

Oct 25 2017

BBlack reopened T174891: cp4024 kernel errors as "Open".

cp4024 died randomly today. I've left it alone other than to connect to the console and verify no response there.
21:07 < icinga-wm> PROBLEM - Host cp4024 is DOWN: PING CRITICAL - Packet loss = 100%

Oct 25 2017, 9:24 PM · ops-ulsfo, Operations, Traffic
BBlack added a comment to T163674: Frequent RST returned by appservers to LVS hosts.

patch above released with nginx-1.13.6-2+wmf1, so we're capable of experimentation now. Flag isn't turned on anywhere yet.

Oct 25 2017, 6:47 PM · Patch-For-Review, Pybal, Traffic, Operations, netops
BBlack moved T179026: LVS IPv6 IPs should all be recorded in DNS from Triage to LoadBalancer on the Traffic board.
Oct 25 2017, 6:31 PM · Operations, Traffic
BBlack moved T179025: LVS hosts should have static-mapped IPv6 on all virtual interfaces from DNS Names to LoadBalancer on the Traffic board.
Oct 25 2017, 6:30 PM · Operations, Traffic
BBlack moved T179025: LVS hosts should have static-mapped IPv6 on all virtual interfaces from Triage to DNS Names on the Traffic board.
Oct 25 2017, 6:30 PM · Operations, Traffic
BBlack moved T158604: Investigate usefulness of SameSite cookies for logged-in accounts from Triage to Caching on the Traffic board.
Oct 25 2017, 6:30 PM · Traffic, Operations, Security-Core, MediaWiki-Authentication-and-authorization
BBlack moved T179027: Puppetize LVS interface IP sets per-DC for easy use in ferm rules from Triage to LoadBalancer on the Traffic board.
Oct 25 2017, 6:30 PM · Operations, Traffic
BBlack created T179027: Puppetize LVS interface IP sets per-DC for easy use in ferm rules.
Oct 25 2017, 6:26 PM · Operations, Traffic
BBlack added a subtask for T179026: LVS IPv6 IPs should all be recorded in DNS: T179025: LVS hosts should have static-mapped IPv6 on all virtual interfaces.
Oct 25 2017, 6:24 PM · Traffic, Operations
BBlack added a parent task for T179025: LVS hosts should have static-mapped IPv6 on all virtual interfaces: T179026: LVS IPv6 IPs should all be recorded in DNS.
Oct 25 2017, 6:24 PM · Operations, Traffic
BBlack created T179026: LVS IPv6 IPs should all be recorded in DNS.
Oct 25 2017, 6:24 PM · Traffic, Operations
BBlack created T179025: LVS hosts should have static-mapped IPv6 on all virtual interfaces.
Oct 25 2017, 6:23 PM · Operations, Traffic
BBlack added a comment to T164456: Migrate to nginx-light.

Seems the bot missed logging this here:

Oct 25 2017, 6:02 PM · Traffic, Operations
BBlack added a comment to T163674: Frequent RST returned by appservers to LVS hosts.

would it mean that nginx should keep more TCP connections open hoping for the client to eventually send a close notify (in response to its previous one)? Are those connections going to be torn down after their overall timeouts expire?

Oct 25 2017, 2:05 PM · Patch-For-Review, Pybal, Traffic, Operations, netops

Oct 24 2017

BBlack created T178954: Wikipedia Portal has insecure link to CC license in footer.
Oct 24 2017, 8:42 PM · Discovery-Portal-Sprint, Easy, Patch-For-Review, Discovery, Wikimedia-Portals