
BBlack (Brandon Black)
Engineering Manager, SRE Traffic Team

User Details

User Since
Nov 4 2014, 4:29 PM (232 w, 5 d)
Availability
Available
IRC Nick
bblack
LDAP User
BBlack
MediaWiki User
BBlack (WMF)

Recent Activity

Sat, Apr 20

BBlack added a comment to T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS.

Status update on the experiments above:

Sat, Apr 20, 3:12 PM · Performance-Team (Radar), Patch-For-Review, Operations, Traffic

Fri, Apr 19

BBlack created P8419 strange nxdomains to ns0.
Fri, Apr 19, 3:00 PM · Traffic

Tue, Apr 9

BBlack added a comment to T209707: tagged_interface sometimes exceeds IFNAMSIZ.

It's not ideal, but the part that was stripped was the most-predictable part of the name (the en prefix), so it's not all that confusing.
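To make the trade-off concrete, here is a minimal sketch of the kind of shortening described above. It assumes a tagged interface is named `<parent>.<vlan>` and that the only fallback is dropping a leading `en`; both the naming scheme and the function `tagged_interface_name` are hypothetical illustrations, not the actual puppet code. The one hard fact is the Linux limit: `IFNAMSIZ` is 16 bytes including the trailing NUL, so a name can be at most 15 characters.

```python
IFNAMSIZ = 16  # Linux limit in bytes, including the trailing NUL


def tagged_interface_name(parent: str, vlan_id: int) -> str:
    """Return '<parent>.<vlan_id>', dropping a leading 'en' if it is too long.

    Hypothetical helper: the '<parent>.<vlan_id>' scheme is an assumption
    made for illustration only.
    """
    max_len = IFNAMSIZ - 1  # usable characters, excluding the NUL
    name = f"{parent}.{vlan_id}"
    if len(name) > max_len and parent.startswith("en"):
        # Strip the most predictable part of the name first.
        name = f"{parent[2:]}.{vlan_id}"
    if len(name) > max_len:
        raise ValueError(f"{name!r} still exceeds {max_len} characters")
    return name
```

With these assumptions, a short parent such as `enp5s0f0` is left untouched, while a longer one such as `enp59s0f0np0` with a four-digit VLAN only fits once the `en` prefix is removed.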

Tue, Apr 9, 10:35 AM · Patch-For-Review, Traffic, Operations

Mon, Apr 8

BBlack added a comment to T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS.

The wiktionary CNAME experiment is going out today, and I'm intending to keep it running for at least a week, assuming no issues arise.

Mon, Apr 8, 5:30 PM · Performance-Team (Radar), Patch-For-Review, Operations, Traffic
BBlack updated the task description for T186550: Anycast recdns.
Mon, Apr 8, 2:12 PM · Patch-For-Review, netops, Operations, Traffic
BBlack added a comment to T220383: Evaluate ATS TLS stack.
  • 0100-dynamic-tls-records.patch - I don't think we ever managed to prove a significant benefit from this on initial deploy, but it's just one of those things that seemed like a "good idea" so long as it remained simple to leave it in. I'd be happy with dropping this initially and putting the ideas behind that patch (or even more-generalized than that patch) on the back burner for the future when we have more time.
  • 0660-version-too-low.patch - This was a very nginx-specific thing about not having nginx spam error messages, shouldn't need to port it at all.
Mon, Apr 8, 1:50 PM · Traffic, Operations

Fri, Apr 5

BBlack added a comment to T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS.

We may try the wiktionary patch early next week. The goal with that test is just to see if we get any user complaints about wiktionary.org resolution being broken, so we'll leave it in place for a week or so if we don't get complaints, or revert if we do. Either way it will eventually get reverted, and if it's successful then we'll start patching for the "real" version where everything centralizes into a wikipedia.org hostname, so that's probably still at least a couple weeks out.

Fri, Apr 5, 4:49 PM · Performance-Team (Radar), Patch-For-Review, Operations, Traffic

Fri, Mar 29

BBlack added a comment to T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS.

There are some complexities here that I've been stewing on for a while, mostly noted in the original description, but I like this general direction. Most of the concerns briefly mentioned earlier aren't actually a big deal in practice, but there remains a key issue around CNAME + edns-client-subnet, and the decision between putting the terminal DYNA record in either wikipedia.org or some other domain (preferably one not used by current canonicals at all, e.g. maybe this variant would be a good use for wikimedia.net?). Where I'm at now in thinking on these two paths:

Fri, Mar 29, 3:03 PM · Performance-Team (Radar), Patch-For-Review, Operations, Traffic

Mar 12 2019

BBlack added a comment to T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater.

I think it would be better, from my perspective, to really understand the use-cases better (which I don't). Why do these remote clients need "realtime" (no staleness) fetches of Q items? What I hear is it sounds like all clients expect everything to be perfectly synchronous, but I don't understand why they need to be perfectly synchronous. In the case that led to this ticket, it was a remote client at Orange issuing a very high rate of these uncacheable queries, which seems like a bulk data load/update process, not an "I just edited this thing and need to see my own edits reflected" sort of case.

Mar 12 2019, 5:18 PM · Patch-For-Review, User-Smalyshev, Wikidata, Wikidata-Query-Service, Traffic, Operations

Mar 8 2019

BBlack added a comment to T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater.

Looking at an internal version of the flavor=dump outputs of an entity, related observations:

Mar 8 2019, 2:13 PM · Patch-For-Review, User-Smalyshev, Wikidata, Wikidata-Query-Service, Traffic, Operations

Mar 4 2019

BBlack added a comment to T215987: Verify that hit/miss stats in WebRequest are correct.

The raw data should be accurate. I had thought we were already sending the summarized X-Cache-Status to hadoop as well, but apparently not. It might be useful to get that going in another ticket, because it saves dealing with some of the complexity below. In the meantime:

Mar 4 2019, 7:59 PM · Operations, Traffic, Core Platform Team Backlog (Later), Analytics, Services (blocked), RESTBase

Feb 27 2019

BBlack added a comment to T204281: Stop prioritizing peering over transit.

Circa 2019-02-21, eqsin was depooled to install a new router, and most of the users normally mapped to eqsin had fallen back to ulsfo temporarily, which would distort the stats of "ulsfo users" considerably.

Feb 27 2019, 11:54 AM · Performance-Team (Radar), netops, Operations

Feb 26 2019

BBlack created P8132 Network oddities from AT&T.
Feb 26 2019, 3:03 PM

Feb 25 2019

BBlack added a comment to T212197: Deliver mobile-based version for automatic translations.

The VCL looks good, please give us some notice (~24h would be ideal?) on when you need it actually deployed once you've decided on a date. Any news on the Desktop-denial regression?

Feb 25 2019, 8:17 PM · MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), Patch-For-Review, Traffic, Operations, ExternalGuidance

Feb 21 2019

BBlack added subtasks for T216691: amber light on cp5006/5007: T216716: cp5007 correctable mem errors, T216717: cp5006 correctable mem errors.
Feb 21 2019, 2:22 PM · Traffic, Operations, ops-eqsin
BBlack added a parent task for T216716: cp5007 correctable mem errors: T216691: amber light on cp5006/5007.
Feb 21 2019, 2:22 PM · Operations, ops-eqsin, Traffic
BBlack added a parent task for T216717: cp5006 correctable mem errors: T216691: amber light on cp5006/5007.
Feb 21 2019, 2:22 PM · ops-eqsin, Operations, Traffic
BBlack created T216717: cp5006 correctable mem errors.
Feb 21 2019, 2:21 PM · ops-eqsin, Operations, Traffic
BBlack created T216716: cp5007 correctable mem errors.
Feb 21 2019, 2:21 PM · Operations, ops-eqsin, Traffic
BBlack closed T214274: Degraded RAID on cp5010 as Resolved.

Seems to be working fine after replacement!

Feb 21 2019, 5:48 AM · Traffic, ops-eqsin, Operations

Feb 20 2019

BBlack added a comment to T99531: [Task] move wikiba.se webhosting to wikimedia cluster.

There are different layers of "handing off" DNS management which are being conflated, but to run through them in order:

Feb 20 2019, 5:26 PM · Patch-For-Review, User-Addshore, serviceops, wikidata-tech-focus, Traffic, wikiba.se website, Operations, Wikidata-Sprint-2016-11-08, Wikidata

Feb 15 2019

BBlack added a comment to T215956: Consider stashing data-parsoid for VE .

Correct me if I'm wrong, but I would think all VE traffic would already be uncacheable at the Varnish level anyways, since it happens in the context of a session (although in the future we might fix this with content composition work). As for the rest of this discussion, I don't think I understand the context enough to say anything about its sanity or whether it increases any attack surface in a way that matters.

Feb 15 2019, 9:34 PM · Services (doing), Core Platform Team Kanban (Doing), Core Platform Team (RESTBase Split (CDP2)), User-Eevans, User-mobrovac, Parsoid, VisualEditor, RESTBase

Feb 14 2019

BBlack triaged T216172: Set up basic email infra for w.wiki domain as Normal priority.
Feb 14 2019, 7:44 PM · Operations, Mail
BBlack added a comment to T205897: Netbox: fill network topology.

The medium-term plan is for this data to be entered into Netbox after a server is racked but before it's provisioned or even powered up, and that data to be used by our tooling to configure and execute the provisioning itself (DHCP configuration, switchport, OS install etc.).

So, I don't think we can reasonably expect our on-site techs to look at a box and say "oh this port is enp4s0f0p1" and record it as such :)

Feb 14 2019, 12:51 PM · Operations

Feb 13 2019

BBlack added a comment to T205897: Netbox: fill network topology.
  • How should we name server interfaces? The physical Port 1, Port 2, etc. or the Linux naming (enp5s0f0, enp5s0f1, etc.)? My vote so far would go with #2. Even if it's harder to parse for a human, it should stay consistent.

My vote is for whatever is clearer/simpler for the people who have to physically interact with them, as long as they uniquely identify the parts without ambiguity.

Feb 13 2019, 1:46 PM · Operations

Feb 10 2019

BBlack added a comment to T215071: Merge Wikipedia subdomains into one, to discourage censorship.

According to the article Censorship of Wikipedia, one effect of the switch to https was that it is now not possible to censor individual articles.

Conversely, now when China decides to censor articles about Tiananmen Square, their citizens also lose access to basic health information, STEM education background material and all other content that would probably have far more positive impact on their lives than articles about politics... it's a double-edged sword. Arguably, strictly from the censorship / access angle, HTTPS was a bad trade-off. (There are a number of other reasons why it was absolutely necessary; but for merging subdomains that's probably not the case.)

Feb 10 2019, 3:32 PM · Domains, Traffic, DNS, HTTPS, Operations

Feb 8 2019

BBlack added a comment to T214529: EDAC events not being reported by node-exporter?.

Corrected errors are normal and expected to occur on healthy
hardware. They do not need user's attention until they repeatedly
occurred at a same place.

Apparently, you haven't been on enough maintenance calls, trying to
calm down the customer about the hardware error he sees in his
logs...

Actually, that's why. Reporting all corrected errors make users
worried, call support, and asking to replace healthy hardware...

So it seems possible that nothing is actually 'wrong' here? But I have very little confidence in anything at this point.

Feb 8 2019, 9:23 PM · Patch-For-Review, Operations, monitoring

Feb 7 2019

BBlack added a comment to T207389: Rename the Certcentral project to Acme-chief.

Sounds good to me!

Feb 7 2019, 4:06 PM · Patch-For-Review, Acme-chief
BBlack lowered the priority of T215071: Merge Wikipedia subdomains into one, to discourage censorship from Normal to Low.

Expounding on the lamentations above in a more realistic triage sort of sense:

Feb 7 2019, 1:41 PM · Domains, Traffic, DNS, HTTPS, Operations
BBlack added a comment to T215071: Merge Wikipedia subdomains into one, to discourage censorship.

The linked ESNI ticket is kind of a random user question ticket, and not actually one created for working on it (which is still off in the future, but obviously we'll implement ESNI as soon as we realistically can, as part of planned work).

Feb 7 2019, 1:10 PM · Domains, Traffic, DNS, HTTPS, Operations

Jan 23 2019

BBlack added a comment to T214516: cp4026 correctable dimm error.

See also T178011 for last time. Why didn't the icinga EDAC check catch this?

Jan 23 2019, 9:05 PM · ops-ulsfo, Operations, Traffic
BBlack reassigned T214516: cp4026 correctable dimm error from BBlack to RobH.

https://wikitech.wikimedia.org/wiki/Cache_servers#Depool_and_downtime is correct, it just needs to be depooled (it will auto-depool on shutdown, but a manual depool is preferable).

Jan 23 2019, 9:02 PM · ops-ulsfo, Operations, Traffic
BBlack added a comment to T211254: Free up 185.15.59.0/24.

It may be possible to get more space in various shady ways, but it's not possible by following RIR rules.

Jan 23 2019, 4:22 PM · Patch-For-Review, Traffic, Operations, netops
BBlack added a comment to T211254: Free up 185.15.59.0/24.

It's the same basic rationale as moving WMCS out of 10.68.0.0/16. We could obviously leave them there and just manage our ACLs better with more automation, but it pays some pretty big dividends when address spaces are clearly split on such a big security and functional boundary as Prod-v-WMCS. Humans will always look at IPs as well in various debugging and configuration tasks. Having similar/shared/adjacent numbering for these two realms invites confusion and mistakes.

In a world where there's ample address space (such as 10/8 in our context), yes. In today's world where IPv4 address space is scarce and we can likely not get any more, not so much.

I would personally have preferred that with the renumbering of WMCS they simply acquired new public IPv4 space of their own

That's simply not realistic, they can't "acquire" IPv4 address space of their own. They're part of this organisation, this ASN, and need to use our PI/PA space where we have it available before we collectively can get more.

Jan 23 2019, 2:40 PM · Patch-For-Review, Traffic, Operations, netops
BBlack added a comment to T211254: Free up 185.15.59.0/24.

It's the same basic rationale as moving WMCS out of 10.68.0.0/16. We could obviously leave them there and just manage our ACLs better with more automation, but it pays some pretty big dividends when address spaces are clearly split on such a big security and functional boundary as Prod-v-WMCS. Humans will always look at IPs as well in various debugging and configuration tasks. Having similar/shared/adjacent numbering for these two realms invites confusion and mistakes.

Jan 23 2019, 1:44 PM · Patch-For-Review, Traffic, Operations, netops
BBlack updated subscribers of T214459: Connection problem (Moscow ISP, 4G) with Beeline / Sovintel.
Jan 23 2019, 1:06 PM · Traffic, Operations, netops

Jan 16 2019

BBlack added a comment to T203194: cp1075-90 - bnxt_en transmit hangs.

See also this email thread where Michael Chan (broadcom driver dev) asks for firmware level output, sees the same numbers we have on cp1088, and tells them to upgrade: https://www.spinics.net/lists/netdev/msg519478.html

Jan 16 2019, 1:08 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T203194: cp1075-90 - bnxt_en transmit hangs.

On the Dell community forum there is a post with several people reporting the same issue. The last one suggests that Dell could be replacing hardware to fix the issue.

Jan 16 2019, 1:03 PM · Patch-For-Review, Operations, Traffic

Jan 14 2019

BBlack added a comment to T212197: Deliver mobile-based version for automatic translations.

I don't have any suggestions, no. Develop a straw-patch which at least serves in code terms to document the intent (e.g. the explicit header and URI values matched/transformed, etc) and we'll poke at it!

Jan 14 2019, 10:58 PM · MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), Patch-For-Review, Traffic, Operations, ExternalGuidance

Jan 11 2019

BBlack added a comment to T213475: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response).

It's a confusing set of things going on here, and it's going to need fixups on both the network/data/data.yaml side and the VCL side. Just to recap the historical situation for clarity:

Jan 11 2019, 12:29 AM · Patch-For-Review, Toolforge, Traffic, Operations, Cloud-VPS

Jan 10 2019

BBlack added a comment to T213475: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response).

Right. I'm not up to speed on where all related changes are, but from VCL's point of view its definition of wikimedia_nets was meant to include labs, whereas its nearly identical wikimedia_trust is meant to exclude labs.

Jan 10 2019, 10:46 PM · Patch-For-Review, Toolforge, Traffic, Operations, Cloud-VPS
BBlack added a comment to T203194: cp1075-90 - bnxt_en transmit hangs.

Yeah. It's hard to "prove" whether we have this bug fixed other than running a supposed fix on the bnxt_en cp10 fleet for a while as a statistical test, but probably the sooner we start on that the better.

Jan 10 2019, 5:04 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T203194: cp1075-90 - bnxt_en transmit hangs.

Actually, it is already in the 4.9.y LTS/stable branch, here: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.9.y&id=b2be15bb02b961146177d49204de22df3dddd415

Jan 10 2019, 4:55 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T203194: cp1075-90 - bnxt_en transmit hangs.

I suspect our bug is fixed by:

Jan 10 2019, 4:40 PM · Patch-For-Review, Operations, Traffic

Dec 31 2018

Liuxinyu970226 awarded T97051: adding new languages to DNS langs.tmpl doesn't work until zone template is edited as well an Orange Medal token.
Dec 31 2018, 12:40 AM · Traffic, Operations, Patch-For-Review, DNS

Dec 21 2018

BBlack triaged T212504: Remove OOMScoreAdjust from nrpe unit file? as Low priority.
Dec 21 2018, 3:16 PM · Operations

Dec 20 2018

Dzahn awarded T97051: adding new languages to DNS langs.tmpl doesn't work until zone template is edited as well an Orange Medal token.
Dec 20 2018, 10:57 PM · Traffic, Operations, Patch-For-Review, DNS
BBlack added a comment to T206688: SOA serial numbers returned by authoritative nameservers differ .

Update for the record: with recent changes to authdns CI and deployment scripts, this scenario should no longer be possible and workarounds shouldn't be necessary! (see also related distant past incident T103915)

Dec 20 2018, 9:00 PM · Domains, Traffic, Operations
BBlack closed T97051: adding new languages to DNS langs.tmpl doesn't work until zone template is edited as well as Resolved.

This is fixed now, no workarounds should be needed.

Dec 20 2018, 8:53 PM · Traffic, Operations, Patch-For-Review, DNS
BBlack closed T161148: AuthDNS CM/CI refactor as Resolved.

Resolving this, as recent work has fixed a lot of it (other than discovery issues specifically), and at this point all the text above is woefully outdated and pointing in Wrong directions. We can file some smaller-scope tasks about remaining cleanups in this space.

Dec 20 2018, 8:52 PM · DNS, Traffic, Operations
BBlack closed T182028: DNS repo: add CI checks for obvious configuration errors as Resolved.

We've done all this and gone way past it at this point. We might tag some future improvements here, but I think it's safe to call the basic task Resolved. Thanks @Volans !

Dec 20 2018, 8:50 PM · Traffic, DNS, Patch-For-Review, Operations-Software-Development, Operations
BBlack added a comment to T210484: Only serve debug HTTP headers when x-wikimedia-debug is present.

localssl.erb would probably be more appropriate and is the site file, but it's a generic TLS reverse proxy setup with a few parameters, and there's not yet any "this is for Traffic termination" flag (classparam?) puppetized, but you could make one.

Dec 20 2018, 3:52 PM · Operations, Analytics, Traffic, Performance-Team
BBlack added a comment to T210484: Only serve debug HTTP headers when x-wikimedia-debug is present.

I don't know off-hand if we can live without them all for manual debugging and such, or if nginx is the best place to remove them. There might be other ways to suppress them in Varnish without affecting analytics, but that needs some digging and testing. We can't universally strip them in tlsproxy's nginx.conf like the patch above, though, because that config is also re-used for various applayer TLS termination behind varnish as well, and thus would hide any of those headers that are being sent from the applayer up to varnish itself.

Dec 20 2018, 3:38 PM · Operations, Analytics, Traffic, Performance-Team
BBlack added a comment to T167972: Respect host header in RESTBase, and redirect /rest_v1 to /rest_v1/.

I thought this was in a different ticket somewhere at one point, but in any case I just noticed it during someone's meta-updates, and the description seems slightly off. What we really want here is full alignment of the public and private API URIs: If the public is hitting https://en.wikipedia.org/api/rest_v1/foo, that should be what RB accepts as well (in Host: + URI terms). The description seems to leave out the /api/ part of it. Or in other words, make it unnecessary for Varnish to do the transformation it does here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/varnish/templates/text-backend.inc.vcl.erb#38 .

Oh you mean have RB behave exactly like the app servers, whereby RB should be able to accept by default requests to http://restbase.discovery.wmnet:7231/api/rest_v1/{path} with the Host: {domain} header set and automatically transform that itself to http://restbase.discovery.wmnet:7231/{domain}/v1/{path}? That is doable, but what would be the advantage given that Varnish strips the Host header?
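For readers following the thread, the rewrite under discussion can be sketched as a small pure function. This is only an illustration of the mapping stated above (public `https://{domain}/api/rest_v1/{path}` to internal `http://restbase.discovery.wmnet:7231/{domain}/v1/{path}`); it is neither the Varnish VCL nor RESTBase code, and the name `to_internal_url` is hypothetical.

```python
from urllib.parse import urlsplit

# Internal origin and public prefix as quoted in the discussion above.
RESTBASE_ORIGIN = "http://restbase.discovery.wmnet:7231"
PUBLIC_PREFIX = "/api/rest_v1/"


def to_internal_url(public_url: str) -> str:
    """Map a public REST API URL to the internal RESTBase form.

    Hypothetical helper illustrating the Host + URI transformation only.
    """
    parts = urlsplit(public_url)
    if not parts.hostname or not parts.path.startswith(PUBLIC_PREFIX):
        raise ValueError(f"not a public rest_v1 URL: {public_url!r}")
    rest = parts.path[len(PUBLIC_PREFIX):]
    # The public Host becomes the first path segment internally.
    url = f"{RESTBASE_ORIGIN}/{parts.hostname}/v1/{rest}"
    if parts.query:
        url += "?" + parts.query
    return url
```

Under this mapping, `https://en.wikipedia.org/api/rest_v1/foo` becomes `http://restbase.discovery.wmnet:7231/en.wikipedia.org/v1/foo`. The proposal in the thread is to make this step unnecessary by having RESTBase accept the public form directly.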

Dec 20 2018, 3:20 PM · Core Platform Team Backlog (Later), RESTBase-API, HyperSwitch, Traffic, Operations, Services (next)
BBlack added a comment to T167972: Respect host header in RESTBase, and redirect /rest_v1 to /rest_v1/.

I thought this was in a different ticket somewhere at one point, but in any case I just noticed it during someone's meta-updates, and the description seems slightly off. What we really want here is full alignment of the public and private API URIs: If the public is hitting https://en.wikipedia.org/api/rest_v1/foo, that should be what RB accepts as well (in Host: + URI terms). The description seems to leave out the /api/ part of it. Or in other words, make it unnecessary for Varnish to do the transformation it does here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/varnish/templates/text-backend.inc.vcl.erb#38 .

Dec 20 2018, 12:45 PM · Core Platform Team Backlog (Later), RESTBase-API, HyperSwitch, Traffic, Operations, Services (next)

Dec 18 2018

Dzahn awarded T102099: Fix IPv6 autoconf issues once and for all, across the fleet. a Barnstar token.
Dec 18 2018, 6:36 PM · Patch-For-Review, Traffic, netops, Operations, IPv6

Dec 14 2018

BBlack created P7914 Internal NXDOMAIN lookups.
Dec 14 2018, 1:28 PM · Operations, Traffic

Dec 13 2018

BBlack added a comment to T99531: [Task] move wikiba.se webhosting to wikimedia cluster.

There's still a couple of things that can be done serially at present, one of which is necessary for the cert issuance later:

Dec 13 2018, 2:40 PM · Patch-For-Review, User-Addshore, serviceops, wikidata-tech-focus, Traffic, wikiba.se website, Operations, Wikidata-Sprint-2016-11-08, Wikidata

Dec 12 2018

BBlack removed projects from T211813: SSL CERTIFICATE_VERIFY_FAILED on generating family file: Operations, Traffic, HTTPS.

Tag edit because all of those are specific to WMF Ops and this ticket isn't!

Dec 12 2018, 8:25 PM
BBlack added a comment to T205439: CI jobs for authdns linting need to run on Stretch.

BTW: https://gerrit.wikimedia.org/r/c/operations/dns/+/462693 is a good test job when it's flipped. This fails current linting because of the outdated gdnsd version there, but hypothetically should pass on the new Docker-based stuff with updated software.

Dec 12 2018, 5:42 PM · Patch-For-Review, User-ArielGlenn, Continuous-Integration-Infrastructure (Slipway), Traffic, Operations
BBlack added a comment to T205439: CI jobs for authdns linting need to run on Stretch.

So I see @Joe has merged up some Dockerfile stuff. What's our next step to flip operations/dns CI checks over to the new operations-dnslint? AFAIK we're ready for this at any time (current repo passes under the new checks and they're ready to use).

Dec 12 2018, 5:40 PM · Patch-For-Review, User-ArielGlenn, Continuous-Integration-Infrastructure (Slipway), Traffic, Operations
BBlack added a comment to T98006: Anycast (Auth)DNS.

Some interesting stuff here (see also the Mailing Lists link there in the datatracker for discussion): https://datatracker.ietf.org/doc/draft-moura-dnsop-authoritative-recommendations/?include_text=1

Dec 12 2018, 2:17 PM · Performance-Team (Radar), Patch-For-Review, netops, Operations, Traffic
BBlack added a comment to T205439: CI jobs for authdns linting need to run on Stretch.

^ Fixing it to be self-explanatory! :)

Dec 12 2018, 1:33 PM · Patch-For-Review, User-ArielGlenn, Continuous-Integration-Infrastructure (Slipway), Traffic, Operations
BBlack added a comment to T205439: CI jobs for authdns linting need to run on Stretch.

Out of curiosity: how do you ship the GeoDNS database? Is that relying on a package available through Debian?

Dec 12 2018, 1:23 PM · Patch-For-Review, User-ArielGlenn, Continuous-Integration-Infrastructure (Slipway), Traffic, Operations

Dec 11 2018

BBlack added a comment to T207050: Migrate most standard public TLS certificates to CertCentral issuance.

Done, resolve?

Dec 11 2018, 5:10 PM · Operations, Traffic
BBlack added a comment to T205439: CI jobs for authdns linting need to run on Stretch.

@hashar - So where we're at now is that we just need our CI switched to a Docker with the following properties (which is probably simple, but non-obvious to me!):

Dec 11 2018, 3:02 PM · Patch-For-Review, User-ArielGlenn, Continuous-Integration-Infrastructure (Slipway), Traffic, Operations
Krenair awarded T207050: Migrate most standard public TLS certificates to CertCentral issuance a Party Time token.
Dec 11 2018, 2:45 PM · Operations, Traffic
BBlack added a comment to T205439: CI jobs for authdns linting need to run on Stretch.

@hashar - I'm re-working the tools for the linting checks on operations/dns in the commits linked above, and we should be able to get away from cloning/using operations/puppet completely and just run a few simple commands on a checkout of operations/dns from a Docker image. I'm sure we can add the trivial tab-checking into the main CI run as well.

Dec 11 2018, 10:11 AM · Patch-For-Review, User-ArielGlenn, Continuous-Integration-Infrastructure (Slipway), Traffic, Operations

Dec 7 2018

BBlack closed T199675: cp5001 unreachable since 2018-07-14 17:49:21 as Resolved.

No new EDAC errors reported since repooling, all we can do is assume it's ok for now I think.

Dec 7 2018, 1:24 PM · Operations, ops-eqsin, Traffic
BBlack created P7895 161 error run.
Dec 7 2018, 12:21 PM

Dec 4 2018

BBlack closed T206688: SOA serial numbers returned by authoritative nameservers differ as Resolved.

Fixed again. Copying my whole terminal output for posterity. This runs a readonly command that md5sum's the zones directory to check whether all servers have the same exact zone data, then runs the same regeneration command that fixed them before, then confirms the hashes are aligned now:

bblack@cumin1001:~$ sudo cumin 'C:role::authdns::server' 'find /etc/gdnsd/zones -type f -exec md5sum {} \; |sort -k 2|md5sum'
3 hosts will be targeted:
authdns[1001,2001].wikimedia.org,multatuli.wikimedia.org
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                                                        
(1) multatuli.wikimedia.org                                                                                                                                                   
----- OUTPUT of 'find /etc/gdnsd/...sort -k 2|md5sum' -----                                                                                                                   
ab66d08220b2475065d38c7c3bffc311  -                                                                                                                                           
===== NODE GROUP =====                                                                                                                                                        
(2) authdns[1001,2001].wikimedia.org                                                                                                                                          
----- OUTPUT of 'find /etc/gdnsd/...sort -k 2|md5sum' -----                                                                                                                   
749a6448e31706eab82740cfdab0cf5a  -                                                                                                                                           
================                                                                                                                                                              
PASS:  |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (3/3) [00:01<00:00,  2.01hosts/s]     
FAIL:  |                                                                                                                                 |   0% (0/3) [00:01<?, ?hosts/s]     
100.0% (3/3) success ratio (>= 100.0% threshold) for command: 'find /etc/gdnsd/...sort -k 2|md5sum'.
100.0% (3/3) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
bblack@cumin1001:~$ sudo cumin 'C:role::authdns::server' 'authdns-gen-zones -f /srv/authdns/git/templates /etc/gdnsd/zones && gdnsdctl reload-zones'
3 hosts will be targeted:
authdns[1001,2001].wikimedia.org,multatuli.wikimedia.org
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                                                        
(3) authdns[1001,2001].wikimedia.org,multatuli.wikimedia.org                                                                                                                  
----- OUTPUT of 'authdns-gen-zone...ctl reload-zones' -----                                                                                                                   
info: Zone data reloaded                                                                                                                                                      
================                                                                                                                                                              
PASS:  |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (3/3) [00:05<00:00,  1.96s/hosts]     
FAIL:  |                                                                                                                                 |   0% (0/3) [00:05<?, ?hosts/s]     
100.0% (3/3) success ratio (>= 100.0% threshold) for command: 'authdns-gen-zone...ctl reload-zones'.
100.0% (3/3) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
bblack@cumin1001:~$ sudo cumin 'C:role::authdns::server' 'find /etc/gdnsd/zones -type f -exec md5sum {} \; |sort -k 2|md5sum'
3 hosts will be targeted:
authdns[1001,2001].wikimedia.org,multatuli.wikimedia.org
Confirm to continue [y/n]? y
===== NODE GROUP =====                                                                                                                                                        
(3) authdns[1001,2001].wikimedia.org,multatuli.wikimedia.org                                                                                                                  
----- OUTPUT of 'find /etc/gdnsd/...sort -k 2|md5sum' -----                                                                                                                   
edb7c18c736c92f6f34fd73850a001b5  -                                                                                                                                           
================                                                                                                                                                              
PASS:  |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (3/3) [00:01<00:00,  2.01hosts/s]     
FAIL:  |                                                                                                                                 |   0% (0/3) [00:01<?, ?hosts/s]     
100.0% (3/3) success ratio (>= 100.0% threshold) for command: 'find /etc/gdnsd/...sort -k 2|md5sum'.
100.0% (3/3) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
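
The check above hashes every zone file, sorts the listing by path, and hashes that listing, so one shared digest across all three hosts means their zone trees match. A minimal Python sketch of the same idea follows. The helper names `file_md5` and `tree_md5` are hypothetical, and the sketch sorts by plain string order rather than the locale collation `sort -k 2` uses, so its digest may not match the shell pipeline byte for byte.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex MD5 digest of one file's contents."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_md5(root: str) -> str:
    """Hash each regular file under root, sort by path, then hash the listing."""
    entries = []
    for path in Path(root).rglob("*"):
        # Like `find -type f`, skip directories and symlinks.
        if path.is_file() and not path.is_symlink():
            entries.append((str(path), file_md5(path)))
    entries.sort()  # assumption: plain string ordering, not locale collation
    # Same "digest, two spaces, path" layout that md5sum prints.
    listing = "".join(f"{md5}  {name}\n" for name, md5 in entries)
    return hashlib.md5(listing.encode()).hexdigest()
```

Because the full path is part of the listing, the digests are only comparable when every host uses the same root directory, as with `/etc/gdnsd/zones` here.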
Dec 4 2018, 3:57 PM · Domains, Traffic, Operations
BBlack added a comment to T211079: IPv6 ~20ms higher ping than IPv4 to gerrit.

(But note that the first hop from Ashburn to Chicago is our routers' choice, so it's possible some of our route engineering is at play here.)

Dec 4 2018, 12:29 PM · Operations, Traffic, netops
BBlack added a comment to T211079: IPv6 ~20ms higher ping than IPv4 to gerrit.

From bast1002 to the endpoints shown in line (2) above over v4 and v6:

bblack@bast1002:~$ mtr -c 10 -r -4 bottomless.aa.net.uk
Start: Tue Dec  4 12:23:35 2018
HOST: bast1002                    Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ae3-1003.cr2-eqiad.wikime  0.0%    10    0.2   0.2   0.2   0.4   0.0
  2.|-- ae0.cr1-eqiad.wikimedia.o  0.0%    10    0.2   0.2   0.2   0.3   0.0
  3.|-- xe-0-0-28-0.a03.asbnva02.  0.0%    10    1.7   2.8   0.6  11.6   3.4
  4.|-- ae-70.r06.asbnva02.us.bb.  0.0%    10   72.3  72.4  72.3  72.6   0.0
  5.|-- ae-2.r22.asbnva02.us.bb.g  0.0%    10    1.5   2.8   0.6  10.0   3.2
  6.|-- ae-5.r25.nycmny01.us.bb.g  0.0%    10    6.1   6.1   6.1   6.4   0.0
  7.|-- ae-1.r24.nycmny01.us.bb.g  0.0%    10    6.7   6.9   6.7   7.6   0.0
  8.|-- ae-9.r24.londen12.uk.bb.g  0.0%    10   73.7  74.6  73.7  79.4   1.7
  9.|-- ae-1.r04.londen05.uk.bb.g  0.0%    10   73.7  73.6  73.5  73.9   0.0
 10.|-- e.aimless.aa.net.uk       50.0%    10   74.6  74.7  74.6  74.7   0.0
 11.|-- bottomless.aa.net.uk       0.0%    10   74.7  74.8  74.7  74.9   0.0
bblack@bast1002:~$ mtr -c 10 -r -6 bottomless.aa.net.uk
Start: Tue Dec  4 12:23:58 2018
HOST: bast1002                    Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ae3-1003.cr2-eqiad.wikime  0.0%    10    0.3   0.3   0.2   0.3   0.0
  2.|-- xe-0-1-5.cr2-eqord.wikime  0.0%    10   66.5  32.2  28.3  66.5  12.0
  3.|-- 10gigabitethernet4-1.core  0.0%    10   25.8  25.1  25.0  25.8   0.0
  4.|-- 100ge16-1.core1.nyc4.he.n 10.0%    10   25.2  25.2  25.1  25.3   0.0
  5.|-- 100ge16-2.core1.lon2.he.n  0.0%    10   92.2  92.4  92.1  93.2   0.0
  6.|-- k.aimless.thn.aa.net.uk    0.0%    10   92.4  92.4  92.3  92.5   0.0
  7.|-- bottomless.aa.net.uk       0.0%    10   91.1  91.2  91.1  91.3   0.0
bblack@bast1002:~$ mtr -c 10 -r -4 a.gormless.thn.aa.net.uk
Start: Tue Dec  4 12:24:41 2018
HOST: bast1002                    Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ae3-1003.cr2-eqiad.wikime  0.0%    10    7.7   1.2   0.3   7.7   2.3
  2.|-- ae0.cr1-eqiad.wikimedia.o  0.0%    10    0.2   0.2   0.2   0.3   0.0
  3.|-- ???                       100.0    10    0.0   0.0   0.0   0.0   0.0
bblack@bast1002:~$ mtr -c 10 -r -6 a.gormless.thn.aa.net.uk
Start: Tue Dec  4 12:25:03 2018
HOST: bast1002                    Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ae3-1003.cr2-eqiad.wikime  0.0%    10    0.2   0.3   0.2   0.3   0.0
  2.|-- xe-0-1-5.cr2-eqord.wikime  0.0%    10   28.3  33.5  28.3  79.0  16.0
  3.|-- 10gigabitethernet4-1.core  0.0%    10   25.0  25.1  25.0  25.6   0.0
  4.|-- 100ge16-1.core1.nyc4.he.n  0.0%    10   25.1  25.1  25.1  25.3   0.0
  5.|-- 100ge16-2.core1.lon2.he.n  0.0%    10   92.1  94.1  92.1 111.1   5.9
  6.|-- k.aimless.thn.aa.net.uk   10.0%    10   92.4  92.4  92.2  92.5   0.0
  7.|-- a.gormless.thn.aa.net.uk   0.0%    10   91.4  91.3  91.2  91.4   0.0
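
To compare the traces above numerically, one option is to pull the Avg column of the final hop from each report. This is a rough sketch: the function name `last_hop_avg` and the regular expression are assumptions fitted to the `mtr -r` layout shown here, not part of mtr itself.

```python
import re

# Hop rows look like:
#   "  7.|-- bottomless.aa.net.uk       0.0%    10   91.1  91.2  91.1  91.3   0.0"
# Columns after the host: Loss%, Snt, Last, Avg, Best, Wrst, StDev.
HOP_ROW = re.compile(
    r"^\s*(\d+)\.\|--\s+(\S+)\s+([\d.]+)%?\s+(\d+)"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$"
)


def last_hop_avg(report: str) -> float:
    """Return the Avg latency (ms) of the final hop in an `mtr -r` report."""
    rows = [m for m in map(HOP_ROW.match, report.splitlines()) if m]
    if not rows:
        raise ValueError("no hop rows found")
    last = rows[-1]
    # A final hop with 100% loss means the trace never reached the target.
    if float(last.group(3)) >= 100.0:
        raise ValueError("destination not reached")
    return float(last.group(6))
```

Applied to the two bottomless.aa.net.uk reports, the final-hop averages are 74.8 ms over IPv4 and 91.2 ms over IPv6, a gap of about 16.4 ms. The IPv4 trace to a.gormless.thn.aa.net.uk stops at hop 3 with 100% loss, so the sketch raises an error for it instead of returning a number.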
Dec 4 2018, 12:28 PM · Operations, Traffic, netops

Dec 3 2018

BBlack closed T210890: Loading full versions of larger images from Commons stucks / repeatedly gets interrupted after a few MBs as Resolved.

I can't reproduce this anymore in my own testing. I'm assuming it's fixed for now, barring further reports of continuing breakage showing up.

Dec 3 2018, 11:52 PM · Patch-For-Review, Operations, media-storage, Traffic, Wikimedia-General-or-Unknown
BBlack added a comment to T210890: Loading full versions of larger images from Commons stucks / repeatedly gets interrupted after a few MBs.

I think the patch reverted above was at fault. What I can't be sure of is whether the reversion will help immediately, or will take some time. I suspect it will have a positive effect fairly quickly (as each failed ExpKill is going to nuke quite a few objects before it ultimately fails).

Dec 3 2018, 11:21 PM · Patch-For-Review, Operations, media-storage, Traffic, Wikimedia-General-or-Unknown
BBlack added a comment to T210890: Loading full versions of larger images from Commons stucks / repeatedly gets interrupted after a few MBs.

They seem different, as T190988 is about faulty uploads (which I presume would still look broken when fetched directly from Swift), and this is about ones that are correct on Swift but have issues when fetched through Varnish?

Dec 3 2018, 4:04 PM · Patch-For-Review, Operations, media-storage, Traffic, Wikimedia-General-or-Unknown

Nov 29 2018

BBlack raised the priority of T210683: lvs1006 down from Normal to High.
Nov 29 2018, 4:04 PM · netops, ops-eqiad, Traffic, Operations
BBlack added projects to T210683: lvs1006 down: ops-eqiad, netops.
Nov 29 2018, 2:35 AM · netops, ops-eqiad, Traffic, Operations
BBlack added a comment to T210683: lvs1006 down.

Yeah I got busy and dropped this.

Nov 29 2018, 2:35 AM · netops, ops-eqiad, Traffic, Operations

Nov 27 2018

BBlack added a comment to T206861: Power incident in eqsin.

Seems reasonable to close this; the event itself is long over. There are still risks present for a follow-up event, but those go away eventually if we close out all the actionables. Maybe add the incident tag and move it to the follow-up column for T206951? (The other is already there.)

Nov 27 2018, 7:27 PM · Wikimedia-Incident, Traffic, Operations

Nov 26 2018

BBlack added a comment to T199675: cp5001 unreachable since 2018-07-14 17:49:21.

Update from SRE meeting today - memtest was successful, and we're asked to put it back in production and see if the error happens again or not. Re-pooling!

Nov 26 2018, 5:52 PM · Operations, ops-eqsin, Traffic

Nov 24 2018

Krinkle awarded T144187: Better handling for one-hit-wonder objects an Orange Medal token.
Nov 24 2018, 1:24 AM · Performance-Team (Radar), Patch-For-Review, Operations, Traffic

Nov 23 2018

BBlack added a comment to T209515: Renew Digicert Unified in 2019.

Downtimes are set, so we shouldn't get cert alerts in Icinga.

Nov 23 2018, 4:26 PM · Patch-For-Review, Operations, Traffic

Nov 21 2018

BBlack added a comment to T99531: [Task] move wikiba.se webhosting to wikimedia cluster.

Thanks for the data and the patch! We'll dig into the DNS patch next week and get it merged in so we're serving wikiba.se from our DNS as-is (as in, pointing at your existing server IPs). Then we can do handoff of the domain ownership/registration without causing any interruptions.

Nov 21 2018, 3:48 PM · Patch-For-Review, User-Addshore, serviceops, wikidata-tech-focus, Traffic, wikiba.se website, Operations, Wikidata-Sprint-2016-11-08, Wikidata

Nov 19 2018

BBlack added a comment to T209785: INMARSAT geolocates to the UK, leading to requests going to esams.

When looking at the latest MaxMind data, it locates this network as being in New Zealand, which we map to ulsfo as first choice, and esams as the last-resort choice. But the destination would've been set by geodns logic, so probably what really mattered was the location of the DNS cache in use. For future debugging, try a DNS lookup on reflect.wikimedia.org, which will show us what DNS cache exit IP our servers see, e.g.:

Nov 19 2018, 1:46 PM · Operations, Traffic

Nov 18 2018

BBlack added a comment to T203194: cp1075-90 - bnxt_en transmit hangs.

Yet another! cp1078 crash ticket above merged into here.

Nov 18 2018, 8:13 PM · Patch-For-Review, Operations, Traffic
BBlack merged task T209791: cp1078 crash into T203194: cp1075-90 - bnxt_en transmit hangs.
Nov 18 2018, 8:13 PM · Operations
BBlack merged T209791: cp1078 crash into T203194: cp1075-90 - bnxt_en transmit hangs.
Nov 18 2018, 8:13 PM · Patch-For-Review, Operations, Traffic

Nov 17 2018

BBlack added a comment to T119366: Disable caching on the main page for anonymous users.

FWIW: I'm of the opinion that date magic words should reduce the Varnish cache lifetime to 24 hours at most, maybe six hours.

Nov 17 2018, 12:17 AM · Traffic, Operations, Wikimedia-General-or-Unknown

Nov 16 2018

BBlack edited P7816 url parsing in nodejs with no scheme.
Nov 16 2018, 1:53 PM
BBlack edited P7816 url parsing in nodejs with no scheme.
Nov 16 2018, 1:53 PM
BBlack created P7816 url parsing in nodejs with no scheme.
Nov 16 2018, 1:51 PM

Nov 14 2018

BBlack added a comment to T209515: Renew Digicert Unified in 2019.

Also, we should pre-downtime the unified ssl checks in icinga for cp3NNN and cp5NNN early next week before the US Thanksgiving holidays, so that nobody's pestered by a spam of WARNING alerts, which I believe are set to trigger 60 days out from expiry.

Nov 14 2018, 5:48 PM · Patch-For-Review, Operations, Traffic
BBlack updated the task description for T209515: Renew Digicert Unified in 2019.
Nov 14 2018, 5:37 PM · Patch-For-Review, Operations, Traffic
BBlack triaged T209515: Renew Digicert Unified in 2019 as Normal priority.
Nov 14 2018, 5:35 PM · Patch-For-Review, Operations, Traffic
BBlack closed T206804: Renew GlobalSign Unified in 2018 as Resolved.
Nov 14 2018, 5:20 PM · Patch-For-Review, Operations, Traffic
BBlack closed Unknown Object (Task), a subtask of T206804: Renew GlobalSign Unified in 2018, as Resolved.
Nov 14 2018, 5:19 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T206339: Separate Traffic layer caches for PHP7/HHVM.

From IRC for posterity:

Nov 14 2018, 5:08 PM · Patch-For-Review, Traffic, Operations

Nov 13 2018

BBlack added a comment to T206804: Renew GlobalSign Unified in 2018.

Seems to be testing fine on https://pinkunicorn.wikimedia.org/, and the pre-deployment to all cache hosts and OCSP Stapling looks fine too.

Nov 13 2018, 2:00 PM · Patch-For-Review, Operations, Traffic

Nov 10 2018

BBlack removed projects from T209019: qrpedia.org and qrwp.org are down: Operations, Traffic, Domains.

Removing the ops/traffic/domains tags, as the Foundation doesn't operate anything about these domains (we don't own or operate the DNS, the IPs, or the servers). Whois says they belong to:

Nov 10 2018, 1:02 PM · QRpedia-General