
BBlack (Brandon Black)
Engineering Manager, SRE Traffic Team

Projects (6)

User Details

User Since
Nov 4 2014, 4:29 PM (242 w, 3 h)
Availability
Available
IRC Nick
bblack
LDAP User
BBlack
MediaWiki User
BBlack (WMF) [ Global Accounts ]

Recent Activity

Wed, Jun 19

BBlack added a comment to T226044: Set up a subdomain for Phame to enable caching.

Implementing a blanket redirect to the legacy blog URI for ^/20(0[7-9]|1[0-8])/ should be feasible in VCL or Lua at the edge. Or alternatively, we could also just leave it alone and pick another hostname, too.

Wed, Jun 19, 1:54 PM · Operations, Traffic, Phabricator, Release-Engineering-Team (Kanban)
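
As an illustration of the redirect rule described in the comment above, here is a minimal Python sketch of the path match (the production version would live in VCL or Lua at the edge). The regular expression is the one quoted in the comment; `LEGACY_BLOG_BASE` and `legacy_redirect` are hypothetical names, and the legacy hostname is a placeholder.

```python
import re

# Pattern quoted in the comment: paths that begin with a year from 2007 to 2018.
LEGACY_YEAR_PATH = re.compile(r"^/20(0[7-9]|1[0-8])/")

# Hypothetical legacy host, used only for illustration.
LEGACY_BLOG_BASE = "https://legacy-blog.example.org"


def legacy_redirect(path):
    """Return the redirect target for a legacy blog path, or None if no match."""
    if LEGACY_YEAR_PATH.match(path):
        return LEGACY_BLOG_BASE + path
    return None
```

Under these assumptions, `legacy_redirect("/2012/05/some-post/")` yields a target on the placeholder host, while `/2019/...` paths and anything without a leading year segment fall through and return `None`.
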
Restricted Application added a project to T226044: Set up a subdomain for Phame to enable caching: Operations.
Wed, Jun 19, 1:25 PM · Operations, Traffic, Phabricator, Release-Engineering-Team (Kanban)
BBlack updated the task description for T226044: Set up a subdomain for Phame to enable caching.
Wed, Jun 19, 1:25 PM · Operations, Traffic, Phabricator, Release-Engineering-Team (Kanban)

Sat, Jun 8

BBlack added a comment to T225347: When downloading from git using HTTPS: HTTP 500 / GnuTLS recv error (-110).

The TLS-level error is just complaining that, at the end of the transaction, the connection was aborted abruptly instead of torn down cleanly. It would probably be more-ideal if gerrit's TLS stack would cleanly close on 500s when it can, but the real issue here is probably the 500 error, not the TLS error. At a glance, the GET request headers look identical in the two cases, so I'm at a loss as to what's happening on gerrit's side here. Is there perhaps a request difference in some HTTP-level authentication or cookie stuff that's not shown in the trace?

Sat, Jun 8, 12:31 PM · Traffic, Operations, Gerrit

Thu, Jun 6

BBlack closed T222078: Analyze readers' engagement in countries affected by Singapore Data Center's switch as Resolved.

@leila and @Miriam - Thanks for all the hard work here, it's truly outstanding the depth to which this analysis already goes, and it puts some useful numbers on the impact of expanding our edge network into under-served regions.

Thu, Jun 6, 2:42 PM · Research-consulting, Research

Thu, May 30

BBlack added a comment to T224694: cp3041 - Varnish frontend child restarted icinga alert.

That alert basically means that a varnish frontend daemon crashed (and as usual was auto-restarted by a manager process). These are pretty rare and usually worth some investigation.

Thu, May 30, 7:43 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T223408: Page gets redirected randomly to former blackout page.

We may want to think of a solution the community can employ for these kinds of blackouts that doesn't require a sitemap generation & deployment after the fact. Just a thought.

Thu, May 30, 12:57 PM · Readers-Web-Backlog, Performance-Team (Radar), Wikimedia-Incident

Wed, May 29

BBlack added a comment to T222937: Replace Varnish backends with ATS on cache upload nodes in esams.

The failed reimage was finished up manually (probably not the reimager's fault)

Wed, May 29, 8:14 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T212197: Deliver mobile-based version for automatic translations.

Done. Are we ready to deploy it already or blocked on other MW-level deploys still?

Wed, May 29, 3:49 PM · MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), Operations, Traffic, ExternalGuidance

Tue, May 28

BBlack added a comment to T224511: cr1-codfw linecard failure.

Plan seems reasonable based on the info in the description! Maybe wait longer than 2h after the linecard is restarted? Or do we suspect that any recurrence is much less likely with no traffic?

Tue, May 28, 6:14 PM · Operations, netops

May 24 2019

BBlack added a comment to T223902: cloudcontrol: decide on FQDN for service endpoints.

That cloud rebranding link above also mentions wikimediacloud.org, which is yet another option nobody's exploiting yet. So even without getting into the over-long wikimediacloudservices.org, we have sufficient names to cover all the cases here (feel free to re-arrange, esp the latter two):

May 24 2019, 7:07 PM · Traffic, Operations, Cloud-VPS, cloud-services-team (Kanban)
BBlack updated subscribers of T223902: cloudcontrol: decide on FQDN for service endpoints.

Ok, @aborrero caught me up on all the context on IRC so I can stop asking dumb questions (Thanks!).

May 24 2019, 12:31 PM · Traffic, Operations, Cloud-VPS, cloud-services-team (Kanban)

May 23 2019

BBlack reassigned T224223: decommission lvs100[123456].wikimedia.org from BBlack to ayounsi.

These are reimaged to role(spare::system) now. Over to @ayounsi for getting rid of all the special cases related to these hosts in the eqiad routers and switches (BGP stuff, fw filters, the special public-vlan LVS-balancer port groups, etc), and then we can move this on to dcops-level decom stuff.

May 23 2019, 10:36 PM · Traffic, Operations, ops-eqiad, DC-Ops, decommission
BBlack updated the task description for T224223: decommission lvs100[123456].wikimedia.org.
May 23 2019, 10:33 PM · Traffic, Operations, ops-eqiad, DC-Ops, decommission
BBlack added a comment to T223902: cloudcontrol: decide on FQDN for service endpoints.

Do these belong in wikimedia.org at all? It seems this has already been discussed, but I guess I lack some context.

May 23 2019, 10:07 PM · Traffic, Operations, Cloud-VPS, cloud-services-team (Kanban)
BBlack added a comment to T224033: Fix operations/puppet.git "rebase hell".

One more:

May 23 2019, 3:00 PM · Continuous-Integration-Config, Operations
BBlack added a comment to T224033: Fix operations/puppet.git "rebase hell".

A few thoughts:

May 23 2019, 2:56 PM · Continuous-Integration-Config, Operations
BBlack updated the task description for T224223: decommission lvs100[123456].wikimedia.org.
May 23 2019, 1:39 PM · Traffic, Operations, ops-eqiad, DC-Ops, decommission
BBlack moved T224223: decommission lvs100[123456].wikimedia.org from Triage to LoadBalancer on the Traffic board.
May 23 2019, 1:34 PM · Traffic, Operations, ops-eqiad, DC-Ops, decommission
BBlack added a project to T224223: decommission lvs100[123456].wikimedia.org: Traffic.
May 23 2019, 1:33 PM · Traffic, Operations, ops-eqiad, DC-Ops, decommission
BBlack updated the task description for T224223: decommission lvs100[123456].wikimedia.org.
May 23 2019, 1:31 PM · Traffic, Operations, ops-eqiad, DC-Ops, decommission
Restricted Application added a project to T224223: decommission lvs100[123456].wikimedia.org: Operations.
May 23 2019, 1:30 PM · Traffic, Operations, ops-eqiad, DC-Ops, decommission

May 22 2019

BBlack added a comment to T223921: GSuite Test Domain Verification.

Either is fine. I assume you won't be able to do anything else with this (e.g. make https://gsuite-test.wikimedia.org/ work) without some followup records added on our side.

May 22 2019, 7:30 PM · Operations, DNS, Traffic
BBlack added a comment to T140365: Lower geodns TTLs from 600 (10min) to 300 (5min).

So we've reduced query volume by ~32% in T208263 . Since the last significant updates here, we've also deployed newer versions of our authdns software which perform even better, and refreshed some hardware as well. We're still in the basic scenario that we only have 3x singular authdns hosts in the world, but they're running with plenty of headroom in terms of handling query rate spikes and server outages. There's really two things holding us up on experimenting with lower TTLs for faster failover:

May 22 2019, 5:51 PM · Traffic, Operations
BBlack closed T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS as Resolved.

Scheme has been stable for ~1w now and seems to be working out fine. The net reduction in total authdns requests is ~32%. I suspect the drop in public requests for wiki hostnames is greater, as the total also includes all of our internal/infrastructure lookups as well, but either way we should be seeing far less DNS cache misses out there in the world, especially for longer-tail / less-popular project and language combinations.

May 22 2019, 5:42 PM · Performance-Team (Radar), Operations, Traffic
BBlack added a comment to T223921: GSuite Test Domain Verification.

The above is deployed. I'd wait a full 10 minutes from the time of this comment to re-test, in case they've negative-cached the previous lookup, then try again and let's see what happens.

May 22 2019, 5:35 PM · Operations, DNS, Traffic
BBlack added a comment to T223921: GSuite Test Domain Verification.

The context of the second token is that all of our canonical wiki domains, including wikimedia.org, already have persistent Google Site Verification TXT tokens so that we can manage Google Search stuff for our own domains on a different Google system.

May 22 2019, 5:31 PM · Operations, DNS, Traffic
BBlack added a comment to T223921: GSuite Test Domain Verification.

@HMarcus - The record is live, can you try the validation and let me know how it goes?

May 22 2019, 1:41 PM · Operations, DNS, Traffic

May 21 2019

BBlack added a comment to T222620: cp1083 crashed.

Nevermind, apparently it was already repooled, looking at the wrong thing here...

May 21 2019, 6:46 PM · Operations, ops-eqiad, Traffic
BBlack added a comment to T222620: cp1083 crashed.

It's been up for ~15 days now without incident, but depooled for frontend traffic. Re-pooling it today to see if we can get a recurrence or not.

May 21 2019, 6:42 PM · Operations, ops-eqiad, Traffic
BBlack added a comment to T224027: LVS interface settings from /e/n/i not consistently applied on first boots.

FWIW, lvs1016 came back with correct settings after the single additional reboot above.

May 21 2019, 6:03 PM · Traffic, Operations
BBlack added a comment to T184293: rack/setup/install lvs101[3-6].

Current status of transition:

May 21 2019, 5:58 PM · ops-eqiad, Operations, Traffic
BBlack triaged T224027: LVS interface settings from /e/n/i not consistently applied on first boots as Normal priority.
May 21 2019, 2:29 PM · Traffic, Operations

May 19 2019

BBlack added a comment to T184293: rack/setup/install lvs101[3-6].

Note https://gerrit.wikimedia.org/r/c/operations/puppet/+/511118 - I had to switch the lvs1015 cross-row ports for rows A and B (enp4s0f1 and enp5s0f0) backwards at the software level to match the physical reality shown by lldpcli show neighbors, which was backwards from the documented table of ports at the top of this task. The current config works and we can keep it if we want. Note that I didn't make any other related changes, so if we keep this config, we probably need to edit the software port labels in the switch configurations to match, and possibly any physical labeling in the DC, to avoid future confusion. Alternatively, before we put this machine in service, we could physically swap the cables back to the intended config at the rear of lvs1015, revert the mentioned puppet patch, and reimage the server again. Either way, there's probably some followup to do on this.

May 19 2019, 12:25 AM · ops-eqiad, Operations, Traffic
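
The check described in the comment above (comparing the documented port table against what LLDP reports) can be sketched roughly as follows. This is only an illustration: the peer labels and which interface was documented for which row are assumptions, and `find_mismatches` is a hypothetical helper, not part of any existing tooling.

```python
def find_mismatches(documented, observed):
    """Return {interface: (documented_peer, observed_peer)} where the two differ."""
    return {
        iface: (peer, observed.get(iface))
        for iface, peer in documented.items()
        if observed.get(iface) != peer
    }


# Hypothetical data: the row assignments below are assumed for illustration.
documented = {"enp4s0f1": "row-A-switch", "enp5s0f0": "row-B-switch"}
observed = {"enp4s0f1": "row-B-switch", "enp5s0f0": "row-A-switch"}

mismatches = find_mismatches(documented, observed)
```

With the swapped example data, both interfaces are reported as mismatched; an empty result would mean the documentation and the physical cabling agree.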

May 18 2019

BBlack created P8541 post-reimage bios settings warning.
May 18 2019, 11:59 PM · Traffic

May 16 2019

BBlack raised the priority of T184293: rack/setup/install lvs101[3-6] from Normal to High.

Outside of immediate emergency situations, resolving any blockers to get the remaining two LVSes into service should be a very high priority at this point.

May 16 2019, 2:41 PM · ops-eqiad, Operations, Traffic
BBlack added a comment to T223448: ErrorException from line 1274 of /srv/mediawiki/php-1.34.0-wmf.5/includes/upload/UploadBase.php: PHP Warning: fread() expects parameter 1 to be resource, boolean given.

Any chance this is interrelated with T222994 ?

May 16 2019, 1:33 PM · MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), Patch-For-Review, Multimedia, UploadWizard, Wikimedia-production-error

May 14 2019

BBlack added a comment to T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS.

@kostajh - The OONI article you linked ( https://ooni.torproject.org/post/2019-china-wikipedia-blocking/ ) is accurate, and it's outside of our scope (more with our Legal and Communications teams at a high level) to communicate publicly and officially on that situation, if at all (they are aware!).

May 14 2019, 4:40 PM · Performance-Team (Radar), Operations, Traffic

May 9 2019

BBlack added a comment to T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS.

Our analytics seems to indicate the changes above had the intended effect in restoring normal levels of traffic from CN for affected projects:

May 9 2019, 5:13 PM · Performance-Team (Radar), Operations, Traffic
BBlack added a comment to T213769: Zero VCL removal.

Yeah, it's mostly just blocked on us making some time to deal with it, and time has been in extremely short supply lately, so we tend not to prioritize anything that doesn't have imminent impact. There's some subtleties to backing out that stuff in stages and not breaking things.

May 9 2019, 3:47 PM · Patch-For-Review, Zero, Operations, Traffic
BBlack added a comment to T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS.

@Cwek @lilydjwg - Thanks for the reports! I apologize, this time around the fallout should've been predictable, given what we know from https://ooni.io/post/2019-china-wikipedia-blocking/ about the mechanisms, we just didn't think it through. I've pushed some changes above to move the CNAME target over to a new hostname dyna.wikimedia.org, which should fix things assuming CN's censorship tactics remain otherwise-stable. It will take up to roughly an hour for global DNS caches to catch up with the change and then we can continue investigations from there.

May 9 2019, 2:04 PM · Performance-Team (Radar), Operations, Traffic
Cwek awarded T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS a Dislike token.
May 9 2019, 5:44 AM · Performance-Team (Radar), Operations, Traffic

May 8 2019

BBlack added a comment to T170567: Support TLSv1.3.

Putting this here for lack of a better place, for future reference:

May 8 2019, 5:49 PM · Performance, Goal, Patch-For-Review, Traffic, Operations

May 5 2019

BBlack created P8475 503-causing varnishes from 5xx.log.
May 5 2019, 12:20 AM · Traffic

May 3 2019

BBlack added a comment to T222078: Analyze readers' engagement in countries affected by Singapore Data Center's switch.

@Miriam - sorry for the slowness!

May 3 2019, 6:47 PM · Research-consulting, Research

May 2 2019

BBlack added a comment to T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS.

The current iteration of the proposed broadly-applied production version is in PS3 of the patch @ https://gerrit.wikimedia.org/r/c/operations/dns/+/507399/3 (and then a followup switch from 1H to 1D CNAME TTLs to go out shortly afterwards), will likely shoot for deployment early next week.

May 2 2019, 1:18 PM · Performance-Team (Radar), Operations, Traffic

May 1 2019

BBlack edited P8465 Top AuthDNS address query names.
May 1 2019, 1:40 PM · Traffic
BBlack created P8465 Top AuthDNS address query names.
May 1 2019, 1:38 PM · Traffic

Apr 25 2019

BBlack added a comment to T99531: [Task] move wikiba.se webhosting to wikimedia cluster.

@WMDE-leszek Thanks for looking into it! I believe @CRoslof is who you want to coordinate with on our end, whose last statement on this topic back in January was:

Apr 25 2019, 4:07 PM · Patch-For-Review, User-Addshore, serviceops, wikidata-tech-focus, Traffic, wikiba.se website, Operations, Wikidata-Sprint-2016-11-08, Wikidata
BBlack added a comment to T99531: [Task] move wikiba.se webhosting to wikimedia cluster.

Re: wikibase.org, adding it as a non-canonical redirection to catch confusion from those that manually type URLs is fine, but we should make sure everyone is clear on which domainname is canonical for this project (I assume https://wikiba.se/) and make sure that's the only one that's published, promoted, and used for links we control and such. It's an important notion that one name is canonical!

Apr 25 2019, 2:10 PM · Patch-For-Review, User-Addshore, serviceops, wikidata-tech-focus, Traffic, wikiba.se website, Operations, Wikidata-Sprint-2016-11-08, Wikidata

Apr 24 2019

BBlack added a comment to T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS.

@Cwek - Thanks for the reports! Have you tried other Wikimedia projects (e.g. wikiversity, wikiquote, wiktionary, etc) for SNI testing and/or DNS lookups from within? That may provide some level of insight as well. Currently we suspect the DNS changes here were not related to the new blockage, but obviously we'd like to gather all the data we can. The initial deployment date of the structural change was actually Apr 18th; the changes on the 20th merely extended the TTLs of that scheme from 10 minutes to 4 hours. Our own analytics seems to confirm that the dropoff of CN traffic was actually on the 23rd (same as when the community noticed).

Apr 24 2019, 2:44 PM · Performance-Team (Radar), Operations, Traffic

Apr 20 2019

BBlack added a comment to T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS.

Status update on the experiments above:

Apr 20 2019, 3:12 PM · Performance-Team (Radar), Operations, Traffic

Apr 19 2019

BBlack created P8419 strange nxdomains to ns0.
Apr 19 2019, 3:00 PM · Traffic

Apr 9 2019

BBlack added a comment to T209707: tagged_interface sometimes exceeds IFNAMSIZ.

It's not ideal, but the part that was stripped was the most-predictable part of the name (the en prefix), so it's not all that confusing.

Apr 9 2019, 10:35 AM · Traffic, Operations
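
For context, Linux defines IFNAMSIZ as 16 bytes including the trailing NUL, so an interface name can be at most 15 characters. The sketch below shows the kind of workaround the comment above describes, dropping the `en` prefix when a tagged name would be too long. The `<parent>.<vlan_id>` naming scheme and the function name are assumptions made for illustration, not the actual puppet implementation.

```python
IFNAMSIZ = 16  # Linux constant; includes the trailing NUL byte
MAX_NAME_LEN = IFNAMSIZ - 1


def tagged_interface_name(parent, vlan_id):
    """Build '<parent>.<vlan_id>', dropping a leading 'en' if the name is too long."""
    name = f"{parent}.{vlan_id}"
    if len(name) > MAX_NAME_LEN and parent.startswith("en"):
        # Strip the most predictable part of the name, as described in the comment.
        name = f"{parent[2:]}.{vlan_id}"
    if len(name) > MAX_NAME_LEN:
        raise ValueError(f"{name!r} still exceeds {MAX_NAME_LEN} characters")
    return name
```

A short parent such as `enp4s0f0` is left untouched, while a longer one such as `enp59s0f0np0` with a four-digit VLAN ID only fits once the prefix is removed.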

Apr 8 2019

BBlack added a comment to T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS.

The wiktionary CNAME experiment is going out today, and I'm intending to keep it running for at least a week, assuming no issues arise.

Apr 8 2019, 5:30 PM · Performance-Team (Radar), Operations, Traffic
BBlack updated the task description for T186550: Anycast recdns.
Apr 8 2019, 2:12 PM · Patch-For-Review, netops, Operations, Traffic
BBlack added a comment to T220383: Evaluate ATS TLS stack.
  • 0100-dynamic-tls-records.patch - I don't think we ever managed to prove a significant benefit from this on initial deploy, but it's just one of those things that seemed like a "good idea" so long as it remained simple to leave it in. I'd be happy with dropping this initially and putting the ideas behind that patch (or even more-generalized than that patch) on the back burner for the future when we have more time.
  • 0660-version-too-low.patch - This was a very nginx-specific thing about not having nginx spam error messages, shouldn't need to port it at all.
Apr 8 2019, 1:50 PM · Traffic, Operations

Apr 5 2019

BBlack added a comment to T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS.

We may try the wiktionary patch early next week. The goal with that test is just to see if we get any user complaints about wiktionary.org resolution being broken, so we'll leave it in place for a week or so if we don't get complaints, or revert if we do. Either way it will eventually get reverted, and if it's successful then we'll start patching for the "real" version where everything centralizes into a wikipedia.org hostname, so that's probably still at least a couple weeks out.

Apr 5 2019, 4:49 PM · Performance-Team (Radar), Operations, Traffic

Mar 29 2019

BBlack added a comment to T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS.

There's some complexities here that I've been stewing on for a while, mostly noted in the original description, but I like this general direction. Most of the concerns briefly mentioned earlier aren't actually a big deal in practice, but there remains a key issue around CNAME + edns-client-subnet, and the decision between putting the terminal DYNA record in either wikipedia.org or some other domain (preferably one not used by current canonicals at all, e.g. maybe this variant would be a good use for wikimedia.net?). Where I'm at now in thinking on these two paths:

Mar 29 2019, 3:03 PM · Performance-Team (Radar), Operations, Traffic

Mar 12 2019

BBlack added a comment to T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater.

I think it would be better, from my perspective, to really understand the use-cases better (which I don't). Why do these remote clients need "realtime" (no staleness) fetches of Q items? What I hear is it sounds like all clients expect everything to be perfectly synchronous, but I don't understand why they need to be perfectly synchronous. In the case that led to this ticket, it was a remote client at Orange issuing a very high rate of these uncacheable queries, which seems like a bulk data load/update process, not an "I just edited this thing and need to see my own edits reflected" sort of case.

Mar 12 2019, 5:18 PM · MW-1.34-notes (1.34.0-wmf.3; 2019-04-30), User-Smalyshev, Wikidata, Wikidata-Query-Service, Traffic, Operations

Mar 8 2019

BBlack added a comment to T217897: Reduce / remove the aggessive cache busting behaviour of wdqs-updater.

Looking at an internal version of the flavor=dump outputs of an entity, related observations:

Mar 8 2019, 2:13 PM · MW-1.34-notes (1.34.0-wmf.3; 2019-04-30), User-Smalyshev, Wikidata, Wikidata-Query-Service, Traffic, Operations

Mar 4 2019

BBlack added a comment to T215987: Verify that hit/miss stats in WebRequest are correct.

The raw data should be accurate. I had thought we were already sending the summarized X-Cache-Status to hadoop as well, but apparently not. It might be useful to get that going in another ticket, because it saves dealing with some of the complexity below. In the meantime:

Mar 4 2019, 7:59 PM · Traffic, Operations, Core Platform Team Backlog (Later), Analytics, Services (blocked), RESTBase

Feb 27 2019

BBlack added a comment to T204281: Stop prioritizing peering over transit.

Circa 2019-02-21, eqsin was depooled to install a new router, and most of the users normally mapped to eqsin had fallen back to ulsfo temporarily, which would distort the stats of "ulsfo users" considerably.

Feb 27 2019, 11:54 AM · Performance-Team (Radar), netops, Operations

Feb 26 2019

BBlack created P8132 Network oddities from AT&T.
Feb 26 2019, 3:03 PM

Feb 25 2019

BBlack added a comment to T212197: Deliver mobile-based version for automatic translations.

The VCL looks good, please give us some notice (~24h would be ideal?) on when you need it actually deployed once you've decided on a date. Any news on the Desktop-denial regression?

Feb 25 2019, 8:17 PM · MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), Operations, Traffic, ExternalGuidance

Feb 21 2019

BBlack added subtasks for T216691: amber light on cp5006/5007: T216716: cp5007 correctable mem errors, T216717: cp5006 correctable mem errors.
Feb 21 2019, 2:22 PM · Traffic, ops-eqsin, Operations
BBlack added a parent task for T216716: cp5007 correctable mem errors: T216691: amber light on cp5006/5007.
Feb 21 2019, 2:22 PM · Operations, ops-eqsin, Traffic
BBlack added a parent task for T216717: cp5006 correctable mem errors: T216691: amber light on cp5006/5007.
Feb 21 2019, 2:22 PM · ops-eqsin, Traffic, Operations
BBlack created T216717: cp5006 correctable mem errors.
Feb 21 2019, 2:21 PM · ops-eqsin, Traffic, Operations
BBlack created T216716: cp5007 correctable mem errors.
Feb 21 2019, 2:21 PM · Operations, ops-eqsin, Traffic
BBlack closed T214274: Degraded RAID on cp5010 as Resolved.

Seems to be working fine after replacement!

Feb 21 2019, 5:48 AM · Traffic, ops-eqsin, Operations

Feb 20 2019

BBlack added a comment to T99531: [Task] move wikiba.se webhosting to wikimedia cluster.

There are different layers of "handing off" DNS management which are being conflated, but to run through them in order:

Feb 20 2019, 5:26 PM · Patch-For-Review, User-Addshore, serviceops, wikidata-tech-focus, Traffic, wikiba.se website, Operations, Wikidata-Sprint-2016-11-08, Wikidata

Feb 15 2019

BBlack added a comment to T215956: Consider stashing data-parsoid for VE .

Correct me if I'm wrong, but I would think all VE traffic would already be uncacheable at the Varnish level anyways, since it happens in the context of a session (although in the future we might fix this with content composition work). As for the rest of this discussion, I don't think I understand the context enough to say anything about its sanity or whether it increases any attack surface in a way that matters.

Feb 15 2019, 9:34 PM · User-Ryasmeen, Core Platform Team Kanban (Done with CPT), Services (done), Core Platform Team (RESTBase Split (CDP2)), User-Eevans, User-mobrovac, Parsoid, VisualEditor, RESTBase

Feb 14 2019

BBlack triaged T216172: Set up basic email infra for w.wiki domain as Normal priority.
Feb 14 2019, 7:44 PM · Operations, Mail
BBlack added a comment to T205897: Netbox: fill network topology.

The medium-term plan is for this data to be entered into Netbox after a server is racked but before it's provisioned or even powered up, and that data to be used by our tooling to configure and execute the provisioning itself (DHCP configuration, switchport, OS install etc.).

So, I don't think we can reasonably expect our on-site techs to look at a box and say "oh this port is enp4s0f0p1" and record it as such :)

Feb 14 2019, 12:51 PM · netbox, Operations

Feb 13 2019

BBlack added a comment to T205897: Netbox: fill network topology.
  • How should we name server interfaces? The physical Port 1, Port 2, etc., or the Linux naming (enp5s0f0, enp5s0f1, etc.)? My vote so far would go with #2. Even if it's harder to parse for a human, it should stay consistent.

My vote is for whatever is clearer/simpler for the people that have to physically interact with them as long as they uniquely identify the parts without ambiguity.

Feb 13 2019, 1:46 PM · netbox, Operations

Feb 10 2019

BBlack added a comment to T215071: Merge Wikipedia subdomains into one, to discourage censorship.

According to the article Censorship of Wikipedia, one effect of the switch to https was that it is now not possible to censor individual articles.

Conversely, now when China decides to censor articles about Tienanmen Square, their citizens also lose access to basic health information, STEM education background material and all other content that would probably have far more positive impact on their lives than articles about politics... it's a double edged sword. Arguably, strictly from the censorship / access angle, HTTPS was a bad trade-off. (There are a number of other reasons why it was absolutely necessary; but for merging subdomains that's probably not the case.)

Feb 10 2019, 3:32 PM · Domains, Traffic, DNS, Operations, HTTPS

Feb 8 2019

BBlack added a comment to T214529: EDAC events not being reported by node-exporter?.

Corrected errors are normal and expected to occur on healthy
hardware. They do not need user's attention until they repeatedly
occurred at a same place.

Apparently, you haven't been on enough maintenance calls, trying to
calm down the customer about the hardware error he sees in his
logs...

Actually, that's why. Reporting all corrected errors make users
worried, call support, and asking to replace healthy hardware...

So it seems possible that nothing is actually 'wrong' here? But I have very little confidence in anything at this point.

Feb 8 2019, 9:23 PM · Patch-For-Review, Operations, observability
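
The quoted guidance above (corrected errors matter only once they recur at the same place) could be expressed as a simple per-location count. The sketch below is a hedged illustration: the location strings, the threshold of 3, and the function name are all assumptions, and it does not reflect how node-exporter or the icinga EDAC check actually work.

```python
from collections import Counter


def repeated_locations(events, threshold=3):
    """Return the sorted locations with at least `threshold` corrected errors."""
    counts = Counter(events)
    return sorted(loc for loc, n in counts.items() if n >= threshold)


# Hypothetical event stream: one entry per corrected error, keyed by location.
events = ["mc0/dimm3"] * 4 + ["mc1/dimm0"]
flagged = repeated_locations(events)
```

Here only `mc0/dimm3` is flagged, since the single event on `mc1/dimm0` stays below the assumed threshold.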

Feb 7 2019

BBlack added a comment to T207389: Rename the Certcentral project to Acme-chief.

Sounds good to me!

Feb 7 2019, 4:06 PM · Patch-For-Review, Acme-chief
BBlack lowered the priority of T215071: Merge Wikipedia subdomains into one, to discourage censorship from Normal to Low.

Expounding on the lamentations above in a more realistic triage sort of sense:

Feb 7 2019, 1:41 PM · Domains, Traffic, DNS, Operations, HTTPS
BBlack added a comment to T215071: Merge Wikipedia subdomains into one, to discourage censorship.

The linked ESNI ticket is kind of a random user question ticket, and not actually one created for working on it (which is still off in the future, but obviously we'll implement ESNI as soon as we realistically can, as part of planned work).

Feb 7 2019, 1:10 PM · Domains, Traffic, DNS, Operations, HTTPS

Jan 23 2019

BBlack added a comment to T214516: cp4026 correctable dimm error.

See also T178011 for last time. Why didn't the icinga EDAC check catch this?

Jan 23 2019, 9:05 PM · ops-ulsfo, Operations, Traffic
BBlack reassigned T214516: cp4026 correctable dimm error from BBlack to RobH.

https://wikitech.wikimedia.org/wiki/Cache_servers#Depool_and_downtime is correct, it just needs to be depooled (it will auto-depool on shutdown, but a manual depool is preferable).

Jan 23 2019, 9:02 PM · ops-ulsfo, Operations, Traffic
BBlack added a comment to T211254: Free up 185.15.59.0/24.

It may be possible to get more space in various shady ways, but it's not possible by following RIR rules.

Jan 23 2019, 4:22 PM · Patch-For-Review, Traffic, Operations, netops
BBlack added a comment to T211254: Free up 185.15.59.0/24.

It's the same basic rationale as moving WMCS out of 10.68.0.0/16. We could obviously leave them there and just manage our ACLs better with more automation, but it pays some pretty big dividends when address spaces are clearly split on such a big security and functional boundary as Prod-v-WMCS. Humans will always look at IPs as well in various debugging and configuration tasks. Having similar/shared/adjacent numbering for these two realms invites confusion and mistakes.

In a world where there's ample address space (such as 10/8 in our context), yes. In today's world where IPv4 address space is scarce and we can likely not get any more, not so much.

I would personally have preferred that with the renumbering of WMCS they simply acquired new public IPv4 space of their own

That's simply not realistic, they can't "acquire" IPv4 address space of their own. They're part of this organisation, this ASN, and need to use our PI/PA space where we have it available before we collectively can get more.

Jan 23 2019, 2:40 PM · Patch-For-Review, Traffic, Operations, netops
BBlack added a comment to T211254: Free up 185.15.59.0/24.

It's the same basic rationale as moving WMCS out of 10.68.0.0/16. We could obviously leave them there and just manage our ACLs better with more automation, but it pays some pretty big dividends when address spaces are clearly split on such a big security and functional boundary as Prod-v-WMCS. Humans will always look at IPs as well in various debugging and configuration tasks. Having similar/shared/adjacent numbering for these two realms invites confusion and mistakes.

Jan 23 2019, 1:44 PM · Patch-For-Review, Traffic, Operations, netops
BBlack updated subscribers of T214459: Connection problem (Moscow ISP, 4G) with Beeline / Sovintel.
Jan 23 2019, 1:06 PM · Traffic, netops, Operations

Jan 16 2019

BBlack added a comment to T203194: cp1075-90 - bnxt_en transmit hangs.

See also this email thread where Michael Chan (broadcom driver dev) asks for firmware level output, sees the same numbers we have on cp1088, and tells them to upgrade: https://www.spinics.net/lists/netdev/msg519478.html

Jan 16 2019, 1:08 PM · Patch-For-Review, Traffic, Operations
BBlack added a comment to T203194: cp1075-90 - bnxt_en transmit hangs.

On the Dell community forum there is a post with several people reporting the same issue. The last one suggests that Dell could be replacing hardware to fix the issue.

Jan 16 2019, 1:03 PM · Patch-For-Review, Traffic, Operations

Jan 14 2019

BBlack added a comment to T212197: Deliver mobile-based version for automatic translations.

I don't have any suggestions, no. Develop a straw-patch which at least serves in code terms to document the intent (e.g. the explicit header and URI values matched/transformed, etc) and we'll poke at it!

Jan 14 2019, 10:58 PM · MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), Operations, Traffic, ExternalGuidance

Jan 11 2019

BBlack added a comment to T213475: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response).

It's a confusing set of things going on here, and it's going to need fixups on both the network/data/data.yaml side and the VCL side. Just to recap the historical situation for clarity:

Jan 11 2019, 12:29 AM · Patch-For-Review, Toolforge, Traffic, Operations, Cloud-VPS

Jan 10 2019

BBlack added a comment to T213475: Wikimedia varnish rules no longer exempt all Cloud VPS/Toolforge IPs from rate limits (HTTP 429 response).

Right. I'm not up to speed on where all related changes are, but from VCL's point of view its definition of wikimedia_nets was meant to include labs, whereas its nearly identical wikimedia_trust is meant to exclude labs.

Jan 10 2019, 10:46 PM · Patch-For-Review, Toolforge, Traffic, Operations, Cloud-VPS
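
To make the distinction in the comment above concrete, here is a small Python sketch of two nearly identical ACLs, one including the labs ranges and one excluding them. The ACL names mirror `wikimedia_nets` and `wikimedia_trust` from the comment, but the address ranges are documentation placeholders (RFC 5737), not the real contents of either ACL, and `in_acl` is a hypothetical helper.

```python
import ipaddress

# Placeholder ranges for illustration only; not the real production or labs space.
PROD_NETS = [ipaddress.ip_network("192.0.2.0/24")]
LABS_NETS = [ipaddress.ip_network("198.51.100.0/24")]

# wikimedia_trust is meant to exclude labs; wikimedia_nets is meant to include it.
WIKIMEDIA_TRUST = PROD_NETS
WIKIMEDIA_NETS = PROD_NETS + LABS_NETS


def in_acl(addr, acl):
    """Return True if the address falls inside any network of the ACL."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in acl)
```

In this sketch an address in the placeholder labs range matches `WIKIMEDIA_NETS` but not `WIKIMEDIA_TRUST`, which is the behaviour a rate-limit exemption keyed on the broader ACL would rely on.
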
BBlack added a comment to T203194: cp1075-90 - bnxt_en transmit hangs.

Yeah. It's hard to "prove" whether we have this bug fixed other than running a supposed fix on the bnxt_en cp10 fleet for a while as a statistical test, but probably the sooner we start on that the better.

Jan 10 2019, 5:04 PM · Patch-For-Review, Traffic, Operations
BBlack added a comment to T203194: cp1075-90 - bnxt_en transmit hangs.

Actually, it is already in the 4.9.y LTS/stable branch, here: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.9.y&id=b2be15bb02b961146177d49204de22df3dddd415

Jan 10 2019, 4:55 PM · Patch-For-Review, Traffic, Operations
BBlack added a comment to T203194: cp1075-90 - bnxt_en transmit hangs.

I suspect our bug is fixed by:

Jan 10 2019, 4:40 PM · Patch-For-Review, Traffic, Operations

Dec 31 2018

Liuxinyu970226 awarded T97051: adding new languages to DNS langs.tmpl doesn't work until zone template is edited as well an Orange Medal token.
Dec 31 2018, 12:40 AM · Traffic, Operations, Patch-For-Review, DNS

Dec 21 2018

BBlack triaged T212504: Remove OOMScoreAdjust from nrpe unit file? as Low priority.
Dec 21 2018, 3:16 PM · Operations

Dec 20 2018

Dzahn awarded T97051: adding new languages to DNS langs.tmpl doesn't work until zone template is edited as well an Orange Medal token.
Dec 20 2018, 10:57 PM · Traffic, Operations, Patch-For-Review, DNS
BBlack added a comment to T206688: SOA serial numbers returned by authoritative nameservers differ .

Update for the record: with recent changes to authdns CI and deployment scripts, this scenario should no longer be possible and workarounds shouldn't be necessary! (see also related distant past incident T103915)

Dec 20 2018, 9:00 PM · Domains, Traffic, Operations
BBlack closed T97051: adding new languages to DNS langs.tmpl doesn't work until zone template is edited as well as Resolved.

This is fixed now, no workarounds should be needed.

Dec 20 2018, 8:53 PM · Traffic, Operations, Patch-For-Review, DNS