BBlack (Brandon Black)
Engineering Manager, SRE Traffic Team

User Details

User Since: Nov 4 2014, 4:29 PM (214 w, 4 d)
Availability: Available
IRC Nick: bblack
LDAP User: BBlack
MediaWiki User: BBlack (WMF)

Recent Activity

Fri, Dec 14

BBlack created P7914 Internal NXDOMAIN lookups.
Fri, Dec 14, 1:28 PM · Operations, Traffic

Thu, Dec 13

BBlack added a comment to T99531: [Task] move wikiba.se webhosting to wikimedia cluster.

There are still a couple of things that can be done serially at present, one of which is necessary for the cert issuance later:

Thu, Dec 13, 2:40 PM · wikidata-tech-focus, Patch-For-Review, Traffic, wikiba.se, Operations, Wikidata-Sprint-2016-11-08, Wikidata

Wed, Dec 12

BBlack removed projects from T211813: SSL CERTIFICATE_VERIFY_FAILED on generating family file: Operations, Traffic, HTTPS.

Tag edit because all of those are specific to WMF Ops and this ticket isn't!

Wed, Dec 12, 8:25 PM · Pywikibot
BBlack added a comment to T205439: CI jobs for authdns linting need to run on Stretch.

BTW: https://gerrit.wikimedia.org/r/c/operations/dns/+/462693 is a good test job when it's flipped. This fails current linting because of the outdated gdnsd version there, but hypothetically should pass on the new Docker-based stuff with updated software.

Wed, Dec 12, 5:42 PM · Patch-For-Review, User-ArielGlenn, Continuous-Integration-Infrastructure (Slipway), Traffic, Operations
BBlack added a comment to T205439: CI jobs for authdns linting need to run on Stretch.

So I see @Joe has merged up some Dockerfile stuff. What's our next step to flip operations/dns CI checks over to the new operations-dnslint? AFAIK we're ready for this at any time (current repo passes under the new checks and they're ready to use).

Wed, Dec 12, 5:40 PM · Patch-For-Review, User-ArielGlenn, Continuous-Integration-Infrastructure (Slipway), Traffic, Operations
BBlack added a comment to T98006: Anycast (Auth)DNS.

Some interesting stuff here (see also the Mailing Lists link there in the datatracker for discussion): https://datatracker.ietf.org/doc/draft-moura-dnsop-authoritative-recommendations/?include_text=1

Wed, Dec 12, 2:17 PM · Performance-Team (Radar), Patch-For-Review, netops, Operations, Traffic
BBlack added a comment to T205439: CI jobs for authdns linting need to run on Stretch.

^ Fixing it to be self-explanatory! :)

Wed, Dec 12, 1:33 PM · Patch-For-Review, User-ArielGlenn, Continuous-Integration-Infrastructure (Slipway), Traffic, Operations
BBlack added a comment to T205439: CI jobs for authdns linting need to run on Stretch.

Out of curiosity: how do you ship the GeoDNS database? Is that relying on a package available through Debian?

Wed, Dec 12, 1:23 PM · Patch-For-Review, User-ArielGlenn, Continuous-Integration-Infrastructure (Slipway), Traffic, Operations

Tue, Dec 11

BBlack added a comment to T207050: Migrate most standard public TLS certificates to CertCentral issuance.

Done, resolve?

Tue, Dec 11, 5:10 PM · Operations, Traffic
BBlack added a comment to T205439: CI jobs for authdns linting need to run on Stretch.

@hashar - So where we're at now is that we just need our CI switched to a Docker with the following properties (which is probably simple, but non-obvious to me!):

Tue, Dec 11, 3:02 PM · Patch-For-Review, User-ArielGlenn, Continuous-Integration-Infrastructure (Slipway), Traffic, Operations
Krenair awarded T207050: Migrate most standard public TLS certificates to CertCentral issuance a Party Time token.
Tue, Dec 11, 2:45 PM · Operations, Traffic
BBlack added a comment to T205439: CI jobs for authdns linting need to run on Stretch.

@hashar - I'm re-working the tools for the linting checks on operations/dns in the commits linked above, and we should be able to get away from cloning/using operations/puppet completely and just run a few simple commands on a checkout of operations/dns from a Docker image. I'm sure we can add the trivial tab-checking into the main CI run as well.

Tue, Dec 11, 10:11 AM · Patch-For-Review, User-ArielGlenn, Continuous-Integration-Infrastructure (Slipway), Traffic, Operations

Fri, Dec 7

BBlack closed T199675: cp5001 unreachable since 2018-07-14 17:49:21 as Resolved.

No new EDAC errors reported since repooling, all we can do is assume it's ok for now I think.

Fri, Dec 7, 1:24 PM · Operations, ops-eqsin, Traffic
BBlack created P7895 161 error run.
Fri, Dec 7, 12:21 PM

Tue, Dec 4

BBlack closed T206688: SOA serial numbers returned by authoritative nameservers differ as Resolved.

Fixed again. Copying my whole terminal output for posterity. This runs a read-only command that md5sums the zones directory to check whether all servers have exactly the same zone data, then runs the same regeneration command that fixed them before, then confirms the hashes are now aligned:

bblack@cumin1001:~$ sudo cumin 'C:role::authdns::server' 'find /etc/gdnsd/zones -type f -exec md5sum {} \; |sort -k 2|md5sum'
3 hosts will be targeted:
authdns[1001,2001].wikimedia.org,multatuli.wikimedia.org
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                                                        
(1) multatuli.wikimedia.org                                                                                                                                                   
----- OUTPUT of 'find /etc/gdnsd/...sort -k 2|md5sum' -----                                                                                                                   
ab66d08220b2475065d38c7c3bffc311  -                                                                                                                                           
===== NODE GROUP =====                                                                                                                                                        
(2) authdns[1001,2001].wikimedia.org                                                                                                                                          
----- OUTPUT of 'find /etc/gdnsd/...sort -k 2|md5sum' -----                                                                                                                   
749a6448e31706eab82740cfdab0cf5a  -                                                                                                                                           
================                                                                                                                                                              
PASS:  |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (3/3) [00:01<00:00,  2.01hosts/s]     
FAIL:  |                                                                                                                                 |   0% (0/3) [00:01<?, ?hosts/s]     
100.0% (3/3) success ratio (>= 100.0% threshold) for command: 'find /etc/gdnsd/...sort -k 2|md5sum'.
100.0% (3/3) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
bblack@cumin1001:~$ sudo cumin 'C:role::authdns::server' 'authdns-gen-zones -f /srv/authdns/git/templates /etc/gdnsd/zones && gdnsdctl reload-zones'
3 hosts will be targeted:
authdns[1001,2001].wikimedia.org,multatuli.wikimedia.org
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                                                        
(3) authdns[1001,2001].wikimedia.org,multatuli.wikimedia.org                                                                                                                  
----- OUTPUT of 'authdns-gen-zone...ctl reload-zones' -----                                                                                                                   
info: Zone data reloaded                                                                                                                                                      
================                                                                                                                                                              
PASS:  |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (3/3) [00:05<00:00,  1.96s/hosts]     
FAIL:  |                                                                                                                                 |   0% (0/3) [00:05<?, ?hosts/s]     
100.0% (3/3) success ratio (>= 100.0% threshold) for command: 'authdns-gen-zone...ctl reload-zones'.
100.0% (3/3) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
bblack@cumin1001:~$ sudo cumin 'C:role::authdns::server' 'find /etc/gdnsd/zones -type f -exec md5sum {} \; |sort -k 2|md5sum'
3 hosts will be targeted:
authdns[1001,2001].wikimedia.org,multatuli.wikimedia.org
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                                                        
(3) authdns[1001,2001].wikimedia.org,multatuli.wikimedia.org                                                                                                                  
----- OUTPUT of 'find /etc/gdnsd/...sort -k 2|md5sum' -----                                                                                                                   
edb7c18c736c92f6f34fd73850a001b5  -                                                                                                                                           
================                                                                                                                                                              
PASS:  |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (3/3) [00:01<00:00,  2.01hosts/s]     
FAIL:  |                                                                                                                                 |   0% (0/3) [00:01<?, ?hosts/s]     
100.0% (3/3) success ratio (>= 100.0% threshold) for command: 'find /etc/gdnsd/...sort -k 2|md5sum'.
100.0% (3/3) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
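
The check in this transcript comes down to three steps: hash every zone file, hash the sorted list of those hashes, and compare the single resulting digest per host. Below is a minimal Python sketch of the same idea. The names `tree_digest` and `hosts_agree` are hypothetical, and because the sketch hashes paths relative to the root directory, its digest will not match the shell pipeline's output byte for byte. It only illustrates the technique.

```python
import hashlib
from pathlib import Path


def tree_digest(root):
    """Return one MD5 hex digest summarising every regular file under root."""
    root = Path(root)
    lines = []
    # Sort by relative path so the result does not depend on directory walk order.
    for path in sorted(p.relative_to(root) for p in root.rglob("*") if p.is_file()):
        file_md5 = hashlib.md5((root / path).read_bytes()).hexdigest()
        # Same "<hash>  <path>" layout that md5sum prints for each file.
        lines.append(f"{file_md5}  {path.as_posix()}\n")
    return hashlib.md5("".join(lines).encode()).hexdigest()


def hosts_agree(digests):
    """True when every host in the {host: digest} mapping reported the same digest."""
    return len(set(digests.values())) <= 1


# Digests copied from the first and last cumin runs in the transcript.
before = {
    "multatuli": "ab66d08220b2475065d38c7c3bffc311",
    "authdns1001": "749a6448e31706eab82740cfdab0cf5a",
    "authdns2001": "749a6448e31706eab82740cfdab0cf5a",
}
after = dict.fromkeys(before, "edb7c18c736c92f6f34fd73850a001b5")
print(hosts_agree(before), hosts_agree(after))  # prints: False True
```

Comparing one digest per host keeps the fleet-wide check to a single line of output per node group, which is why cumin can collapse the hosts into groups by identical output.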
Tue, Dec 4, 3:57 PM · Domains, Traffic, Operations
BBlack added a comment to T211079: IPv6 ~20ms higher ping than IPv4 to gerrit.

(But note that first hop from Ashburn to Chicago is our routers' choice, so it's possible some of our route engineering is at play here).

Tue, Dec 4, 12:29 PM · Operations, Traffic, netops
BBlack added a comment to T211079: IPv6 ~20ms higher ping than IPv4 to gerrit.

From bast1002 to the endpoints shown in line (2) above over v4 and v6:

bblack@bast1002:~$ mtr -c 10 -r -4 bottomless.aa.net.uk
Start: Tue Dec  4 12:23:35 2018
HOST: bast1002                    Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ae3-1003.cr2-eqiad.wikime  0.0%    10    0.2   0.2   0.2   0.4   0.0
  2.|-- ae0.cr1-eqiad.wikimedia.o  0.0%    10    0.2   0.2   0.2   0.3   0.0
  3.|-- xe-0-0-28-0.a03.asbnva02.  0.0%    10    1.7   2.8   0.6  11.6   3.4
  4.|-- ae-70.r06.asbnva02.us.bb.  0.0%    10   72.3  72.4  72.3  72.6   0.0
  5.|-- ae-2.r22.asbnva02.us.bb.g  0.0%    10    1.5   2.8   0.6  10.0   3.2
  6.|-- ae-5.r25.nycmny01.us.bb.g  0.0%    10    6.1   6.1   6.1   6.4   0.0
  7.|-- ae-1.r24.nycmny01.us.bb.g  0.0%    10    6.7   6.9   6.7   7.6   0.0
  8.|-- ae-9.r24.londen12.uk.bb.g  0.0%    10   73.7  74.6  73.7  79.4   1.7
  9.|-- ae-1.r04.londen05.uk.bb.g  0.0%    10   73.7  73.6  73.5  73.9   0.0
 10.|-- e.aimless.aa.net.uk       50.0%    10   74.6  74.7  74.6  74.7   0.0
 11.|-- bottomless.aa.net.uk       0.0%    10   74.7  74.8  74.7  74.9   0.0
bblack@bast1002:~$ mtr -c 10 -r -6 bottomless.aa.net.uk
Start: Tue Dec  4 12:23:58 2018
HOST: bast1002                    Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ae3-1003.cr2-eqiad.wikime  0.0%    10    0.3   0.3   0.2   0.3   0.0
  2.|-- xe-0-1-5.cr2-eqord.wikime  0.0%    10   66.5  32.2  28.3  66.5  12.0
  3.|-- 10gigabitethernet4-1.core  0.0%    10   25.8  25.1  25.0  25.8   0.0
  4.|-- 100ge16-1.core1.nyc4.he.n 10.0%    10   25.2  25.2  25.1  25.3   0.0
  5.|-- 100ge16-2.core1.lon2.he.n  0.0%    10   92.2  92.4  92.1  93.2   0.0
  6.|-- k.aimless.thn.aa.net.uk    0.0%    10   92.4  92.4  92.3  92.5   0.0
  7.|-- bottomless.aa.net.uk       0.0%    10   91.1  91.2  91.1  91.3   0.0
bblack@bast1002:~$ mtr -c 10 -r -4 a.gormless.thn.aa.net.uk
Start: Tue Dec  4 12:24:41 2018
HOST: bast1002                    Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ae3-1003.cr2-eqiad.wikime  0.0%    10    7.7   1.2   0.3   7.7   2.3
  2.|-- ae0.cr1-eqiad.wikimedia.o  0.0%    10    0.2   0.2   0.2   0.3   0.0
  3.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
bblack@bast1002:~$ mtr -c 10 -r -6 a.gormless.thn.aa.net.uk
Start: Tue Dec  4 12:25:03 2018
HOST: bast1002                    Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ae3-1003.cr2-eqiad.wikime  0.0%    10    0.2   0.3   0.2   0.3   0.0
  2.|-- xe-0-1-5.cr2-eqord.wikime  0.0%    10   28.3  33.5  28.3  79.0  16.0
  3.|-- 10gigabitethernet4-1.core  0.0%    10   25.0  25.1  25.0  25.6   0.0
  4.|-- 100ge16-1.core1.nyc4.he.n  0.0%    10   25.1  25.1  25.1  25.3   0.0
  5.|-- 100ge16-2.core1.lon2.he.n  0.0%    10   92.1  94.1  92.1 111.1   5.9
  6.|-- k.aimless.thn.aa.net.uk   10.0%    10   92.4  92.4  92.2  92.5   0.0
  7.|-- a.gormless.thn.aa.net.uk   0.0%    10   91.4  91.3  91.2  91.4   0.0
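
To put a number on the v4/v6 gap in these reports, the Avg column of the final hop can be pulled out and subtracted. The sketch below is illustrative only: `final_hop_avg` is a hypothetical helper, and it assumes the whitespace-separated column layout of the `mtr -r` output shown in this comment (hop, host, Loss%, Snt, Last, Avg, Best, Wrst, StDev).

```python
def final_hop_avg(report):
    """Return the Avg column (ms) of the last hop in an `mtr -r` style report."""
    hops = [line.split() for line in report.splitlines() if "|--" in line]
    # After splitting: hop, host, Loss%, Snt, Last, Avg, Best, Wrst, StDev
    return float(hops[-1][5])


# Last two hops of each bottomless.aa.net.uk report, copied from the transcript.
V4_REPORT = """\
HOST: bast1002                    Loss%   Snt   Last   Avg  Best  Wrst StDev
 10.|-- e.aimless.aa.net.uk       50.0%    10   74.6  74.7  74.6  74.7   0.0
 11.|-- bottomless.aa.net.uk       0.0%    10   74.7  74.8  74.7  74.9   0.0
"""
V6_REPORT = """\
HOST: bast1002                    Loss%   Snt   Last   Avg  Best  Wrst StDev
  6.|-- k.aimless.thn.aa.net.uk    0.0%    10   92.4  92.4  92.3  92.5   0.0
  7.|-- bottomless.aa.net.uk       0.0%    10   91.1  91.2  91.1  91.3   0.0
"""

gap_ms = final_hop_avg(V6_REPORT) - final_hop_avg(V4_REPORT)
print(f"IPv6 is {gap_ms:.1f} ms slower on average at the final hop")
```

With the figures from this run, the final-hop averages are 74.8 ms over IPv4 and 91.2 ms over IPv6.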
Tue, Dec 4, 12:28 PM · Operations, Traffic, netops

Mon, Dec 3

BBlack closed T210890: Loading full versions of larger images from Commons stucks / repeatedly gets interrupted after a few MBs as Resolved.

I can't reproduce this anymore in my own testing. I'm assuming it's fixed for now, barring further reports of continuing breakage showing up.

Mon, Dec 3, 11:52 PM · Patch-For-Review, Operations, media-storage, Traffic, Wikimedia-General-or-Unknown
BBlack added a comment to T210890: Loading full versions of larger images from Commons stucks / repeatedly gets interrupted after a few MBs.

I think the patch reverted above was at fault. What I can't be sure of is whether the reversion will help immediately, or will take some time. I suspect it will have a positive effect fairly quickly (as each failed ExpKill is going to nuke quite a few objects before it ultimately fails).

Mon, Dec 3, 11:21 PM · Patch-For-Review, Operations, media-storage, Traffic, Wikimedia-General-or-Unknown
BBlack added a comment to T210890: Loading full versions of larger images from Commons stucks / repeatedly gets interrupted after a few MBs.

They seem different, as T190988 is about faulty uploads (which I presume would still look broken when fetched directly from Swift), and this is about ones that are correct on Swift but have issues fetched through Varnish?

Mon, Dec 3, 4:04 PM · Patch-For-Review, Operations, media-storage, Traffic, Wikimedia-General-or-Unknown

Thu, Nov 29

BBlack raised the priority of T210683: lvs1006 down from Normal to High.
Thu, Nov 29, 4:04 PM · netops, ops-eqiad, Traffic, Operations
BBlack added projects to T210683: lvs1006 down: ops-eqiad, netops.
Thu, Nov 29, 2:35 AM · netops, ops-eqiad, Traffic, Operations
BBlack added a comment to T210683: lvs1006 down.

Yeah I got busy and dropped this.

Thu, Nov 29, 2:35 AM · netops, ops-eqiad, Traffic, Operations

Tue, Nov 27

BBlack added a comment to T206861: Power incident in eqsin.

Seems reasonable to close this; the event itself is long over. There are still risks present for a follow-up event, but those go away eventually if we close out all the actionables. Maybe add the incident tag and move T206951 to the follow-up column? (The other is already there.)

Tue, Nov 27, 7:27 PM · Wikimedia-Incident, Operations, Traffic

Mon, Nov 26

BBlack added a comment to T199675: cp5001 unreachable since 2018-07-14 17:49:21.

Update from the SRE meeting today: memtest was successful, and we've been asked to put it back in production and see whether the error happens again. Re-pooling!

Mon, Nov 26, 5:52 PM · Operations, ops-eqsin, Traffic

Sat, Nov 24

Krinkle awarded T144187: Better handling for one-hit-wonder objects an Orange Medal token.
Sat, Nov 24, 1:24 AM · Performance-Team (Radar), Patch-For-Review, Operations, Traffic

Fri, Nov 23

BBlack added a comment to T209515: Renew Digicert Unified in 2019.

Downtimes set, we shouldn't get cert alerts in icinga

Fri, Nov 23, 4:26 PM · Operations, Traffic

Wed, Nov 21

BBlack added a comment to T99531: [Task] move wikiba.se webhosting to wikimedia cluster.

Thanks for the data and the patch! We'll dig into the DNS patch next week and get it merged in so we're serving wikiba.se from our DNS as-is (as in, pointing at your existing server IPs). Then we can do handoff of the domain ownership/registration without causing any interruptions.

Wed, Nov 21, 3:48 PM · wikidata-tech-focus, Patch-For-Review, Traffic, wikiba.se, Operations, Wikidata-Sprint-2016-11-08, Wikidata

Mon, Nov 19

BBlack added a comment to T209785: INMARSAT geolocates to the UK, leading to requests going to esams.

When looking at the latest MaxMind data, it locates this network as being in New Zealand, which we map to ulsfo as first choice, and esams as the last-resort choice. But the destination would've been set by geodns logic, so probably what really mattered was the location of the DNS cache in use. For future debugging, try a DNS lookup on reflect.wikimedia.org, which will show us what DNS cache exit IP our servers see, e.g.:

Mon, Nov 19, 1:46 PM · Operations, Traffic

Sun, Nov 18

BBlack added a comment to T203194: cp1075-90 - bnxt_en transmit hangs.

Yet another! cp1078 crash ticket above merged into here.

Sun, Nov 18, 8:13 PM · ops-eqiad, Traffic, Operations
BBlack merged T209791: cp1078 crash into T203194: cp1075-90 - bnxt_en transmit hangs.
Sun, Nov 18, 8:13 PM · ops-eqiad, Traffic, Operations

Sat, Nov 17

BBlack added a comment to T119366: Disable caching on the main page for anonymous users.

FWIW: I'm of the opinion that date magic words should reduce the Varnish cache TTL to 24 hours at most, maybe six hours.

Sat, Nov 17, 12:17 AM · Traffic, Operations, Wikimedia-General-or-Unknown

Nov 16 2018

BBlack edited P7816 url parsing in nodejs with no scheme.
Nov 16 2018, 1:53 PM
BBlack created P7816 url parsing in nodejs with no scheme.
Nov 16 2018, 1:51 PM

Nov 14 2018

BBlack added a comment to T209515: Renew Digicert Unified in 2019.

Also, we should pre-downtime the unified ssl checks in icinga for cp3NNN and cp5NNN early next week before the US Thanksgiving holidays, so that nobody's pestered by a spam of WARNING alerts, which I believe are set to trigger 60 days out from expiry.

Nov 14 2018, 5:48 PM · Operations, Traffic
BBlack updated the task description for T209515: Renew Digicert Unified in 2019.
Nov 14 2018, 5:37 PM · Operations, Traffic
BBlack triaged T209515: Renew Digicert Unified in 2019 as Normal priority.
Nov 14 2018, 5:35 PM · Operations, Traffic
BBlack closed T206804: Renew GlobalSign Unified in 2018 as Resolved.
Nov 14 2018, 5:20 PM · Patch-For-Review, Operations, Traffic
BBlack closed Unknown Object (Task), a subtask of T206804: Renew GlobalSign Unified in 2018, as Resolved.
Nov 14 2018, 5:19 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T206339: Separate Traffic layer caches for PHP7/HHVM.

From IRC for posterity:

Nov 14 2018, 5:08 PM · Patch-For-Review, Traffic, Operations

Nov 13 2018

BBlack added a comment to T206804: Renew GlobalSign Unified in 2018.

Seems to be testing fine on https://pinkunicorn.wikimedia.org/, and the pre-deployment to all cache hosts and OCSP Stapling looks fine too.

Nov 13 2018, 2:00 PM · Patch-For-Review, Operations, Traffic

Nov 10 2018

BBlack removed projects from T209019: qrpedia.org and qrwp.org are down: Operations, Traffic, Domains.

Removing the ops/traffic/domains tags, as the Foundation doesn't operate anything about these domains (we don't own or operate the DNS, the IPs, or the servers). Whois says they belong to:

Nov 10 2018, 1:02 PM · QRpedia-General

Nov 8 2018

BBlack added a comment to T206804: Renew GlobalSign Unified in 2018.

The dual RSA+ECDSA certs above have:

Not Before: Nov  8 21:37:02 2018 GMT
Not Before: Nov  8 21:21:04 2018 GMT
Nov 8 2018, 10:06 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T206804: Renew GlobalSign Unified in 2018.

Must-Staple didn't turn out to be a realistic option for GlobalSign; we'll look at it again later/elsewhere!

Nov 8 2018, 10:00 PM · Patch-For-Review, Operations, Traffic

Nov 6 2018

BBlack added a comment to T208888: Puppet errors on traffic-recdns-anycast.traffic.eqiad.wmflabs.

Looking at the puppet level of this, the root of this problem is that this VM has no $facts['ipaddress6'], because there's no IPv6 in wmflabs I guess. Gaining some IPv6 might help reduce the prod<->wmcs diff in all things puppet! :) On the other hand, apparently very few other manifests directly reference that fact/global.

Nov 6 2018, 8:04 PM · Cloud-VPS
BBlack closed T163541: cache hosts should auto-repool iff OCSP files are sane as Resolved.

For now I've taken the pragmatic approach, and the merge above seems to do the trick. update-ocsp-all now runs as an ExecStartPre of nginx when tlsproxy does OCSP stapling; nginx won't start if it exits non-zero, and traffic-pool won't repool things without nginx. There's a minor recursive glitch in all of this: when the OCSP updater runs as an nginx ExecStartPre, it also runs its hook that tries to reload nginx, which isn't yet running. This results in log output:

Nov 6 2018, 6:53 PM · Patch-For-Review, Operations, Traffic

Nov 2 2018

BBlack added a subtask for T206804: Renew GlobalSign Unified in 2018: Unknown Object (Task).
Nov 2 2018, 7:00 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T208584: Decommission old eqiad caches.

In case parsing all those regexes gets annoying/confusing:

Nov 2 2018, 2:52 PM · ops-eqiad, decommission, Operations, Traffic
BBlack closed T206394: cp1076 hardware failure as Resolved.

Resolving for now, it's been repooled for a while without issue

Nov 2 2018, 3:45 AM · Operations, ops-eqiad, Traffic

Oct 31 2018

BBlack renamed T203194: cp1075-90 - bnxt_en transmit hangs from cp1076-90 - bnxt_en transmit hangs to cp1075-90 - bnxt_en transmit hangs.
Oct 31 2018, 6:33 PM · ops-eqiad, Traffic, Operations
BBlack added a comment to T203194: cp1075-90 - bnxt_en transmit hangs.

cp1085 hit this today, virtually identical in all respects with the sequence of events and kernel/log outputs, etc. That's 2/16 nodes, a little under 2 months apart. Not an epidemic, but maybe worth looking into. Most likely this is a kernel driver bug or firmware bug rather than bad hardware. A simple reboot brought cp1085 back into service afterwards.

Oct 31 2018, 5:47 PM · ops-eqiad, Traffic, Operations
BBlack renamed T203194: cp1075-90 - bnxt_en transmit hangs from cp1080 - kernel / bnxt_en failures to cp1076-90 - bnxt_en transmit hangs.
Oct 31 2018, 5:45 PM · ops-eqiad, Traffic, Operations

Oct 30 2018

BBlack moved T208242: Investigate using RFC 7838 Alternate Services to better optimize edge connections from Triage to Caching on the Traffic board.
Oct 30 2018, 3:45 PM · Performance-Team (Radar), Traffic, Operations
BBlack moved T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS from Triage to DNS Names on the Traffic board.
Oct 30 2018, 3:45 PM · Operations, Traffic
BBlack added a comment to T204997: certcentral: delay deployment of renewed certs to wait out skewed client clocks.

So, with regard to the potential staging delays in this and T207295, the reason they're not urgent or required for conversion of the existing legacy LE usages is that none of those legacy cases do OCSP stapling, and none of them are for a high-traffic production project domain where we're really worried about supporting awful client clocks.

Oct 30 2018, 12:02 PM · Certcentral, Operations, Traffic

Oct 29 2018

BBlack added a comment to T208244: ntp broken in new region.

So, a few things I can say along those lines:

Oct 29 2018, 8:40 PM · cloud-services-team (Kanban), Patch-For-Review, Operations, netops, Cloud-VPS
BBlack triaged T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS as Normal priority.
Oct 29 2018, 8:01 PM · Operations, Traffic
BBlack added a comment to T208244: ntp broken in new region.

(or alternatively, we could look at this as one of the clear examples where a separate WMCS puppetization would be far simpler).

Oct 29 2018, 6:13 PM · cloud-services-team (Kanban), Patch-For-Review, Operations, netops, Cloud-VPS
BBlack added a comment to T208244: ntp broken in new region.

Yeah I'd agree that's the direction we should go. We don't offer our ntp servers to the globe for good reasons, and we similarly probably shouldn't be offering them to WMCS, plus it's pretty easy to build new NTP servers there. There's a whole bunch of prod-specific stuff in modules/profile/manifests/ntp.pp that needs hiera-izing to override it for WMCS use-case. The structural stuff there is far more complex than the WMCS NTP servers would need, but I think it could be adapted to that case.

Oct 29 2018, 6:12 PM · cloud-services-team (Kanban), Patch-For-Review, Operations, netops, Cloud-VPS
BBlack updated the task description for T208242: Investigate using RFC 7838 Alternate Services to better optimize edge connections.
Oct 29 2018, 5:20 PM · Performance-Team (Radar), Traffic, Operations
BBlack added projects to T208242: Investigate using RFC 7838 Alternate Services to better optimize edge connections: Traffic, Performance-Team.
Oct 29 2018, 5:18 PM · Performance-Team (Radar), Traffic, Operations
BBlack updated the task description for T208242: Investigate using RFC 7838 Alternate Services to better optimize edge connections.
Oct 29 2018, 5:17 PM · Performance-Team (Radar), Traffic, Operations
BBlack triaged T208242: Investigate using RFC 7838 Alternate Services to better optimize edge connections as Normal priority.
Oct 29 2018, 5:14 PM · Performance-Team (Radar), Traffic, Operations
BBlack added a comment to T207615: Varnish won't purge thumbnails of specific file.

Most likely, this is related to URI normalization rules (note %-encoded chars in the relevant titles) and/or the generation of purges at the origins (tracking known thumbnails for purging). Historically, AFAIK we've never found a case where Varnish actually refuses or fails to purge an object. It's all about it being asked to purge the wrong object, and/or there being multiple available path encodings for the same object.
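
To illustrate the "multiple available path encodings" point: several spellings of one path are distinct strings, and therefore distinct cache keys, until they are normalized. The sketch below is an assumption-laden illustration, not the normalization the caches actually apply. `normalize_path` and the example paths are hypothetical, and decode-then-re-encode deliberately ignores the harder cases (for example, an escaped `%2F` becomes a literal slash).

```python
from urllib.parse import quote, unquote


def normalize_path(path):
    """Decode every %-escape, then re-encode with one consistent rule."""
    # Assumption: "/" and ":" stay literal; everything else outside the
    # unreserved set is emitted as uppercase-hex UTF-8 escapes.
    return quote(unquote(path), safe="/:")


# Four spellings of the same hypothetical thumbnail path.
variants = [
    "/wiki/File:Caf%C3%A9.jpg",    # uppercase hex
    "/wiki/File:Caf%c3%a9.jpg",    # lowercase hex
    "/wiki/File%3ACaf%C3%A9.jpg",  # colon escaped as well
    "/wiki/File:Caf\u00e9.jpg",    # raw non-ASCII character
]

print(len(set(variants)))                          # distinct raw strings
print(len({normalize_path(v) for v in variants}))  # distinct normalized keys
```

A purge sent for one raw spelling only matches objects stored under that spelling, which is consistent with the comment's point that the object being purged is the wrong one rather than the purge itself failing.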

Oct 29 2018, 11:44 AM · Operations, Traffic

Oct 19 2018

BBlack added a comment to T205439: CI jobs for authdns linting need to run on Stretch.

Shouldn't the container be able to puppetize from authdns::lint directly, which would provide all the pathways for updating the package/config/geoip/etc? Do the docker containers not get access to a puppetmaster?

Oct 19 2018, 1:47 PM · Patch-For-Review, User-ArielGlenn, Continuous-Integration-Infrastructure (Slipway), Traffic, Operations

Oct 18 2018

BBlack created P7693 CC ssh outputs.
Oct 18 2018, 6:21 PM · Traffic
BBlack added a comment to T207340: Determine cause of upload.wikimedia.org requests routed to text-lb (404 Not Found).

Yeah I think @Bawolff's explanation seems plausible. If there's a DNS hijacking "transparent" proxy which returns the same IP for all hostnames, then this could potentially confuse H/2 coalescing for the UA, depending on how the proxy behaves.

Oct 18 2018, 1:47 AM · Performance-Team (Radar), Traffic, Operations

Oct 17 2018

BBlack created P7687 Ethernet drivers in production.
Oct 17 2018, 8:34 PM · Operations, Traffic
BBlack created T207328: es2017 and es2019 have an idrac ethernet interface in Linux.
Oct 17 2018, 8:25 PM · ops-codfw, Operations

Oct 15 2018

BBlack added a comment to T206105: Optimize networking configuration for WDQS.

Yes, let's look at this today. I think we need better tg3 ethernet card support in interface::rps for one of our authdnses anyways, which you'll need here too.

Oct 15 2018, 4:52 PM · Patch-For-Review, Wikidata, Operations, Discovery-Wikidata-Query-Service-Sprint, Wikidata-Query-Service
BBlack added a parent task for T199711: Deploy a scalable service for ACME (LetsEncrypt) certificate management: T207050: Migrate most standard public TLS certificates to CertCentral issuance.
Oct 15 2018, 4:18 PM · Certcentral, Patch-For-Review, Traffic, Operations, Goal
BBlack added a subtask for T207050: Migrate most standard public TLS certificates to CertCentral issuance: T199711: Deploy a scalable service for ACME (LetsEncrypt) certificate management.
Oct 15 2018, 4:18 PM · Operations, Traffic
BBlack created T207050: Migrate most standard public TLS certificates to CertCentral issuance.
Oct 15 2018, 4:17 PM · Operations, Traffic
BBlack added a parent task for T204208: puppetize http purging for ATS backends: T207048: ATS production-ready as a backend cache layer.
Oct 15 2018, 4:15 PM · Patch-For-Review, Traffic, Operations
BBlack added a subtask for T207048: ATS production-ready as a backend cache layer: T204208: puppetize http purging for ATS backends.
Oct 15 2018, 4:15 PM · Patch-For-Review, Operations, Traffic
BBlack triaged T207048: ATS production-ready as a backend cache layer as Normal priority.
Oct 15 2018, 4:14 PM · Patch-For-Review, Operations, Traffic

Oct 12 2018

BBlack triaged T206876: certcentral: check for SCTs, with optional disable per-account as Normal priority.
Oct 12 2018, 8:13 PM · Operations, Traffic
BBlack created P7670 replace-during-upgrade.
Oct 12 2018, 12:58 PM
BBlack updated the task description for T206804: Renew GlobalSign Unified in 2018.
Oct 12 2018, 11:15 AM · Patch-For-Review, Operations, Traffic
BBlack updated subscribers of T206804: Renew GlobalSign Unified in 2018.

Also remembering there are some stats @Krinkle mentioned in T196248 about clock skews and users, from some Google research. The TL;DR there was that 24 hours gives us 93.3%, and 5 days is the sweet spot giving us 99.6%, after which it takes a lot more time to get significant gains. Will update the minimum/desirable timings above accordingly.
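
As a rough sketch of how those two figures could drive a deployment delay: pick the smallest quoted delay whose coverage meets a target, then add it to the certificate's notBefore time. Everything here beyond the two percentages is an assumption for illustration, including the helper names `min_delay_for` and `earliest_deploy` and the sample timestamp.

```python
from datetime import datetime, timedelta, timezone

# Coverage figures as quoted in the comment: percentage of clients
# covered after waiting the given delay.
SKEW_COVERAGE = [
    (timedelta(hours=24), 93.3),
    (timedelta(days=5), 99.6),
]


def min_delay_for(target_pct):
    """Smallest quoted delay whose coverage meets target_pct, else None."""
    for delay, pct in SKEW_COVERAGE:
        if pct >= target_pct:
            return delay
    return None


def earliest_deploy(not_before, target_pct):
    """Earliest deployment time for a cert, or None if no quoted delay suffices."""
    delay = min_delay_for(target_pct)
    return None if delay is None else not_before + delay


# Illustrative notBefore timestamp, not tied to any particular certificate.
issued = datetime(2018, 11, 8, 21, 37, 2, tzinfo=timezone.utc)
print(earliest_deploy(issued, 99.0))
```

Returning `None` for targets above 99.6% makes it explicit that the quoted data says nothing about longer delays.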

Oct 12 2018, 11:10 AM · Patch-For-Review, Operations, Traffic

Oct 11 2018

BBlack triaged T206804: Renew GlobalSign Unified in 2018 as Normal priority.
Oct 11 2018, 8:42 PM · Patch-For-Review, Operations, Traffic
BBlack closed T178173: Renew unified certificates 2017 as Resolved.

Yes, these certs are long-deployed :)

Oct 11 2018, 8:20 PM · Patch-For-Review, Operations, Traffic

Oct 10 2018

BBlack added a comment to T206688: SOA serial numbers returned by authoritative nameservers differ.

SOA Serial values only have meaning to the administrators of a zone, and to servers with which they authorize legacy zone transfers. The registrar is neither of these parties. We could serve random garbage digits for a serial number that are unique on every request and never match across our servers or across time, and this would have absolutely no bearing on correct operations. Therefore, it's a pretty silly policy for the .IS registry to care about them, or warn about them, or especially to threaten registry suspension over them.

Oct 10 2018, 7:42 PM · Domains, Traffic, Operations
BBlack added a comment to T204931: Re-evaluate use of EV certificates for payments.wm.o?.

The kicker probably wouldn't be the monetary cost. It would be that if you didn't require EV, you could auto-issue certs from LetsEncrypt and get rid of manually worrying about them ever again.

Oct 10 2018, 3:49 PM · Fundraising-Backlog, Traffic, HTTPS, Operations
BBlack created P7658 Zone linting.
Oct 10 2018, 3:05 PM · Traffic
BBlack added a comment to T204931: Re-evaluate use of EV certificates for payments.wm.o?.

@Krenair please, no more DV certs, that's the reason why jawiki, ugwiki, wuuwiki, zhwiki, zh-yuewiki and zhwikinews are SNI RSTed by GFW, because DV and some kinds of OV certs can still provide SNI informations regularly (T205378).

Does that actually have anything to do with whether the cert is DV vs. OV vs. EV?

Oct 10 2018, 2:04 PM · Fundraising-Backlog, Traffic, HTTPS, Operations

Oct 9 2018

BBlack created P7653 eqiad ipv6 per-subnet ping test results.
Oct 9 2018, 6:47 PM · Traffic

Oct 6 2018

BBlack added a comment to T206394: cp1076 hardware failure.

Note to future self on a weekday: we should probably dig further via the nvme-cli commands, as there's lots of queryable hardware errorlog/state/status that might give more insight.

Oct 6 2018, 1:43 PM · Operations, ops-eqiad, Traffic
BBlack updated the task description for T206394: cp1076 hardware failure.
Oct 6 2018, 1:43 PM · Operations, ops-eqiad, Traffic
BBlack triaged T206394: cp1076 hardware failure as Normal priority.
Oct 6 2018, 1:30 PM · Operations, ops-eqiad, Traffic

Oct 5 2018

BBlack edited P7644 Smooth daemon upgrades.
Oct 5 2018, 6:38 PM · Traffic
BBlack created P7644 Smooth daemon upgrades.
Oct 5 2018, 6:38 PM · Traffic
BBlack added a project to T206339: Separate Traffic layer caches for PHP7/HHVM: Traffic.
Oct 5 2018, 4:38 PM · Patch-For-Review, Traffic, Operations
BBlack added a comment to T206339: Separate Traffic layer caches for PHP7/HHVM.

Strawperson proposal from IRC, in pseudocode for cache_text, assuming the magic cookie is f2b31d03ab7:

Oct 5 2018, 4:38 PM · Patch-For-Review, Traffic, Operations

Oct 2 2018

BBlack added a comment to T205988: Simplify comment misc-frontend.inc.vcl.erb.

Probably we need to do more than simplify the comment here, and instead actually fix/refactor the logic so it can work sanely. Either way, we'll need some relatively-bulletproof way to limit the target scope of domains for the sitemaps rewrites.

Oct 2 2018, 6:12 PM · Operations, Traffic

Oct 1 2018

RandomDSdevel awarded T147202: Removing support for AES128-SHA TLS cipher a Grey Medal token.
Oct 1 2018, 12:52 AM · Patch-For-Review, User-notice, Operations, Traffic

Sep 28 2018

herron awarded T102099: Fix IPv6 autoconf issues once and for all, across the fleet. a Like token.
Sep 28 2018, 7:44 PM · Traffic, netops, Operations, IPv6

Sep 25 2018

BBlack added a comment to T200754: Redirects for new Wikimedia Foundation website.

I think we may be pushing things into a strange semantic corner hinging on the definition of the word Policy? Are we trying to say we can't use the word "Policy" publicly without it falling into this official scope with ED/Board direct oversight? There are a wide range of policies at various scopes that definitely don't rise to big-P "Policies" requiring legal review and/or direct ED/Board oversight (other than in the general sense of the chain of managerial command and responsibilities over things), yet definitely are little-p "policies", and some of which are for public reference as well.

Sep 25 2018, 9:21 PM · wikimediafoundation.org, WMF-Communications