BBlack (Brandon Black)
Engineering Manager, SRE Traffic Team

Projects (6)

User Details

User Since
Nov 4 2014, 4:29 PM (272 w, 10 h)
Availability
Available
IRC Nick
bblack
LDAP User
BBlack
MediaWiki User
BBlack (WMF) [ Global Accounts ]

Recent Activity

Fri, Jan 17

BBlack added a comment to T242374: Set up git-driven static microsite for wikiworkshop.org.

Most of this has been configured now, the remaining slightly difficult bit is configuring an alternate SNI cert for the domain on our new ats-tls termination.

Fri, Jan 17, 3:08 PM · Patch-For-Review, Research, Operations, Traffic
BBlack updated the task description for T242374: Set up git-driven static microsite for wikiworkshop.org.
Fri, Jan 17, 3:07 PM · Patch-For-Review, Research, Operations, Traffic

Tue, Jan 14

BBlack added a comment to T230638: Move old transparency report pages to historical URLs and setup redirect.

Yes, more or less. The major caveat is some of our caches still have non-redirecting copies of various pages in https://transparency.wikimedia.org/ , but this will sort itself out over the next day or so at the most. To save anyone from trawling through the list of commits above, the changes in effect now are:

Tue, Jan 14, 6:14 PM · Patch-For-Review, serviceops, Operations, WMF-Legal
BBlack added a comment to T190090: Offload pings to dedicated server.

+1 from me, this was one of the many things we made the ganeti clusters for :)

Tue, Jan 14, 1:37 PM · Patch-For-Review, netops, Operations, Traffic

Mon, Jan 13

BBlack added a comment to T242602: Sort out plan for install* servers in edge sites.

Seems like a good plan!

Mon, Jan 13, 1:59 PM · Operations

Thu, Jan 9

BBlack closed T240303: Add wikiworkshop.org to the Foundation's DNS as Resolved.
Thu, Jan 9, 9:32 PM · Research, Traffic, DNS, Operations
BBlack closed T240303: Add wikiworkshop.org to the Foundation's DNS, a subtask of T242374: Set up git-driven static microsite for wikiworkshop.org, as Resolved.
Thu, Jan 9, 9:32 PM · Patch-For-Review, Research, Operations, Traffic
BBlack added a parent task for T240303: Add wikiworkshop.org to the Foundation's DNS: T242374: Set up git-driven static microsite for wikiworkshop.org.
Thu, Jan 9, 9:32 PM · Research, Traffic, DNS, Operations
BBlack added a subtask for T242374: Set up git-driven static microsite for wikiworkshop.org: T240303: Add wikiworkshop.org to the Foundation's DNS.
Thu, Jan 9, 9:31 PM · Patch-For-Review, Research, Operations, Traffic
BBlack triaged T242374: Set up git-driven static microsite for wikiworkshop.org as Medium priority.
Thu, Jan 9, 9:31 PM · Patch-For-Review, Research, Operations, Traffic

Wed, Jan 8

BBlack added a comment to T242200: Docker registry needs cache to vary on Accept header value.

So long as the registry's responses do all the standards-based things correctly (they contain Vary: Accept, and the matching Accept values also match the Content-Type values in the responses), this should Just Work on a functional level.

Wed, Jan 8, 1:15 PM · Traffic, Operations
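
The Vary behaviour described in the comment above can be sketched as a toy cache. This is a minimal illustration, not how the production caches are implemented: `TinyCache`, `vary_cache_key`, the example URL, and the media types in the usage note are all hypothetical. It assumes the response headers are passed as a dict with the exact key `Vary`.

```python
def vary_cache_key(url, request_headers, vary_header):
    """Build a cache key from the URL plus the request headers named in Vary."""
    names = [n.strip().lower() for n in vary_header.split(",") if n.strip()]
    lowered = {k.lower(): v for k, v in request_headers.items()}
    # One (name, value) pair per varied header; a missing header counts as "".
    variant = tuple((n, lowered.get(n, "")) for n in sorted(names))
    return (url, variant)


class TinyCache:
    """Toy in-memory cache that keeps one object per (URL, varied headers)."""

    def __init__(self):
        self._vary = {}     # url -> Vary value from the last stored response
        self._objects = {}  # cache key -> (content_type, body)

    def store(self, url, request_headers, response_headers, body):
        vary = response_headers.get("Vary", "")
        self._vary[url] = vary
        key = vary_cache_key(url, request_headers, vary)
        self._objects[key] = (response_headers.get("Content-Type"), body)

    def lookup(self, url, request_headers):
        if url not in self._vary:
            return None  # nothing cached for this URL yet
        key = vary_cache_key(url, request_headers, self._vary[url])
        return self._objects.get(key)
```

With `Vary: Accept` on the stored response, a request carrying a different `Accept` value misses the cache instead of receiving a body of the wrong type, which is the functional behaviour the comment relies on.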

Dec 18 2019

BBlack added a comment to T240813: HTTPS/Browser Recommendations page on Wikitech is outdated.

The wording issues here are actually a bit tricky. We've done several TLS standards upgrades over time, and there are still a few to go:

Dec 18 2019, 3:59 PM · Operations, Traffic
BBlack added a comment to T240794: /sec-warning page: please add an HTML comment that is more easily visible to API and transport-level inspection/debugging.

Or a patch to template this in. The problem is it's implemented from a standard template for the top 30-40 lines, which isn't specific to this case, in an attempt to standardize our error output templates.

Dec 18 2019, 3:11 PM · Operations, Traffic

Dec 16 2019

BBlack closed T239994: Implement DNS-over-TLS for AuthDNS as Resolved.
Dec 16 2019, 11:38 PM · Operations, Traffic
BBlack added a comment to T239994: Implement DNS-over-TLS for AuthDNS.

External queries now working (note they all return a codfw IP without edns-client-subnet in play, because codfw is closest to my laptop and PROXYv2 is working for sending the "real" client IP from haproxy to gdnsd).

Dec 16 2019, 11:37 PM · Operations, Traffic
BBlack added a comment to T239994: Implement DNS-over-TLS for AuthDNS.

Actually we can't realistically do global monitoring from icinga either, because icinga isn't on Buster and so it doesn't have the right library/tool access to check a TLSv1.3-only service, so we'll have to settle for the per-server NRPE checks for now.

Dec 16 2019, 11:34 PM · Operations, Traffic
BBlack added a comment to T239994: Implement DNS-over-TLS for AuthDNS.

Refactoring the dependencies a little here: Really (2) above's sub-point about shared ticket key rotation won't matter until we're anycasting, so I've made a separate task (+subtask) in T240863 to go look at that stuff later, blocking the anycast work.

Dec 16 2019, 5:09 PM · Operations, Traffic
BBlack created T240866: Create a system for distributed shared secret material to server tmps.
Dec 16 2019, 3:00 PM · Operations, Traffic
BBlack created T240863: Secure shared ticket key rotation for anycast authdns.
Dec 16 2019, 2:53 PM · Operations, Traffic

Dec 13 2019

BBlack added a comment to T240303: Add wikiworkshop.org to the Foundation's DNS.

All of this is irregular and outside of policies we like to adhere to, but I'll push a zonefile to our nameservers which supports the bare minimum (existing Stanford-hosted IPs for the insecure site http://wikiworkshop.org and the same IP for redirects from http://www.wikiworkshop.org , and nothing else ). At some point after the holidays are over, I'd like to find out what the overall intent and/or plan is here so that we can provide some additional guidance and get this onto some kind of more-acceptable path though.

Dec 13 2019, 2:25 AM · Research, Traffic, DNS, Operations
BBlack moved T240303: Add wikiworkshop.org to the Foundation's DNS from Triage to DNS Names on the Traffic board.
Dec 13 2019, 2:17 AM · Research, Traffic, DNS, Operations

Dec 12 2019

BBlack added a comment to T239994: Implement DNS-over-TLS for AuthDNS.

This is now mostly-working, with a hiera flag controlling test deployment (currently only on dns4002, which doesn't have any public authserver IPs routed into it at this time).

Dec 12 2019, 11:10 PM · Operations, Traffic
BBlack added a comment to T239994: Implement DNS-over-TLS for AuthDNS.

P9867 <- First internal test query on a prod dns box :)

Dec 12 2019, 9:27 PM · Operations, Traffic
BBlack created P9867 AuthDNS-over-TLS.
Dec 12 2019, 9:23 PM · Traffic
BBlack triaged T240614: Fix acme-chief DNS validation correctly as High priority.
Dec 12 2019, 8:43 PM · Operations, Traffic
BBlack updated subscribers of T238494: 15% response start regression as of 2019-11-11 (Varnish->ATS).
Dec 12 2019, 3:24 PM · Wikimedia-Incident, Patch-For-Review, Performance-Team, Traffic, Operations
BBlack added a comment to T240497: API Querying for XML/JSON, you might get the Browser Connection Security warning HTML page (which is invalid XML).

I'm not even sure what the task is asking for, but yeah in general we're not going to make the sec-warning mechanism comply with all expected valid outputs from all possible APIs/URIs it's covering. It's designed to break things, in a way that at least provides some level of human info on what's going on if someone digs in and looks. The next step in the transition process after this is that whatever agent they're using which is getting the sec-warning output won't be able to establish a connection to our infrastructure at all, which is way more broken than this.

Dec 12 2019, 1:46 PM · Traffic, Operations
BBlack added a comment to T238038: Start warning and deprecation process for all legacy TLS.

BTW, we no longer have the cipher stats grafana board? Too bad, that one was hella interesting.

Dec 12 2019, 1:34 PM · Operations, Traffic
BBlack added a comment to T240497: API Querying for XML/JSON, you might get the Browser Connection Security warning HTML page (which is invalid XML).

The way it works is that if the connection isn't using TLSv1.2, the user is served a 302 redirect to /sec-warning on the same domain, which in turn returns a cacheable 200 OK with the HTML warning content and the CT header as text/html; charset=utf-8. There are a lot of gory details in the compromises being made by that solution (vs. eg. we could have returned some kind of 4xx error immediately rather than 302->200), but we've learned this is the best pattern to avoid misbehavior of certain bots and scrapers out there in the world which spam-retry [45]xx return codes.

Dec 12 2019, 1:29 PM · Traffic, Operations
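
The 302 -> 200 flow described in the comment above can be sketched as a small decision function. This is an illustration only, not the actual edge configuration: the function name, the `accepted_versions` parameter, and the placeholder HTML are assumptions. The default mirrors the comment's wording (anything other than TLSv1.2 is redirected).

```python
WARNING_HTML = "<html><body>Browser connection security warning</body></html>"  # placeholder body


def sec_warning_response(tls_version, path, accepted_versions=("TLSv1.2",)):
    """Return (status, headers, body) for the warning flow, or None to serve normally.

    accepted_versions is a hypothetical knob; the default follows the comment's
    description that any connection not using TLSv1.2 gets redirected.
    """
    if path == "/sec-warning":
        # The warning page itself is an ordinary, cacheable 200 OK.
        return 200, {"Content-Type": "text/html; charset=utf-8"}, WARNING_HTML
    if tls_version not in accepted_versions:
        # A redirect rather than a 4xx/5xx, to avoid bots that spam-retry on error codes.
        return 302, {"Location": "/sec-warning"}, ""
    return None
```

An API client on a legacy connection therefore ends up with an HTML body whatever format it asked for, which is the behaviour the task is reporting.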
BBlack added a comment to T240495: investigate making 'notrack' the default on our ferm rules.

Yes, it's about that $notrack default. My hypothesis is that setting it to true wouldn't break any traffic, wouldn't change the security situation much, but would eliminate a bunch of potential for conntrack table size issues when various services get overwhelmed. Some thoughts about why that hypothesis might be false:

Dec 12 2019, 12:04 PM · Operations

Dec 10 2019

BBlack added a comment to T239993: Decom LVS recdns.

Status: The actual LVS portion of this is now completely removed globally. The IP addresses themselves are also completely unconfigured and removed from service at all the edge sites, but not the core ones. What remains is that the legacy LVS recdns IPs 208.80.154.254 (eqiad) and 208.80.153.254 (codfw) are still statically-configured to avoid breaking any of the leftover dependencies on these IPs. Sniffer monitoring has shown at least the ircd instance on kraz is still using outdated resolv.conf data and hitting these IPs, several hardware PDUs are using them as well, and there are possibly other such cases which are rarer and thus harder to observe in short samples (I've done up to 1h samples).

Dec 10 2019, 6:10 PM · Patch-For-Review, Traffic, Operations
BBlack created P9846 authdns config.
Dec 10 2019, 3:30 PM
BBlack created P9845 The cacheable misses from v-fe with session cookies....
Dec 10 2019, 3:06 PM · Traffic
BBlack created P9844 Cookie/Vary request side.
Dec 10 2019, 2:22 PM · Traffic
BBlack moved T240285: Clean up DNS server puppetization from Triage to DNS Infra on the Traffic board.
Dec 10 2019, 1:07 PM · Operations, Traffic
BBlack added a comment to T240303: Add wikiworkshop.org to the Foundation's DNS.

I'm assuming that, for now, the hosting of the web service (and email?) is not moving, just the whois ownership and DNS service? We usually need a fair bit more information than this to handle such a case smoothly. At a glance it looks like there's potentially more to this (e.g. they have MX and SPF records, are there also DMARC and such we need to copy?). Also, basic TLS doesn't seem to work on the target site, either. Is there a project-level task or something for whatever transition is happening here?

Dec 10 2019, 3:17 AM · Research, Traffic, DNS, Operations

Dec 9 2019

BBlack added a parent task for T240285: Clean up DNS server puppetization: T98006: Anycast AuthDNS.
Dec 9 2019, 10:27 PM · Operations, Traffic
BBlack added a subtask for T98006: Anycast AuthDNS: T240285: Clean up DNS server puppetization.
Dec 9 2019, 10:27 PM · Performance-Team (Radar), Patch-For-Review, netops, Operations, Traffic
BBlack created T240285: Clean up DNS server puppetization.
Dec 9 2019, 10:26 PM · Operations, Traffic

Dec 6 2019

BBlack added a comment to T239993: Decom LVS recdns.

Dug into the odd cases from install2002 and kraz - the common pattern here is that there are some daemons in the world which both (a) parse /etc/resolv.conf for themselves because they use their own custom DNS client code and (b) don't ever re-read that file if it changes. A few of those are daemons we actually use, and they happen not to have been restarted (nor have their hosts) since our resolv.conf was switched to the new recdns IP a few months ago (~Aug-Sept timeframe, it was rolled out at different times to different places).

Dec 6 2019, 6:16 PM · Patch-For-Review, Traffic, Operations
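
For contrast with the daemons described in the comment above, here is a minimal sketch of a custom client that does notice resolv.conf changes. It is an assumption-laden illustration, not code from any of the daemons involved: `ResolvConfWatcher` and `parse_nameservers` are hypothetical names, and change detection uses a simple stat() comparison on each call.

```python
import os


def parse_nameservers(text):
    """Extract the addresses from 'nameserver' lines, skipping comment lines."""
    servers = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped[0] in "#;":
            continue
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


class ResolvConfWatcher:
    """Re-parse the file whenever its mtime, size, or inode changes."""

    def __init__(self, path="/etc/resolv.conf"):
        self.path = path
        self._stamp = None
        self._servers = []

    def nameservers(self):
        st = os.stat(self.path)
        stamp = (st.st_mtime_ns, st.st_size, st.st_ino)
        if stamp != self._stamp:
            # The file changed since the last read (or was never read): parse it again.
            with open(self.path, encoding="ascii", errors="replace") as f:
                self._servers = parse_nameservers(f.read())
            self._stamp = stamp
        return list(self._servers)
```

A daemon that reads the file once at startup behaves like a watcher that never re-checks the stamp, so it keeps sending queries to the old IP until it is restarted.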
BBlack added a comment to T239993: Decom LVS recdns.

In a sample I just took across all recdns for a little over 15 minutes of sniffer time looking for requests to the legacy LVS-based recdns IPs:

  • ulsfo, eqsin, and esams had no traffic to them at all (yay! and makes basic sense)
  • eqiad had a handful of requests from:
    • ps1-d7-eqiad.mgmt.eqiad.wmnet
    • ps1-d2-eqiad.mgmt.eqiad.wmnet
    • ps1-c1-eqiad.mgmt.eqiad.wmnet
  • codfw had more-interesting traffic from:
    • ps1-a8-codfw.mgmt.codfw.wmnet
    • ps1-22-ulsfo.mgmt.ulsfo.wmnet
    • install2002.wikimedia.org
    • kraz.wikimedia.org
Dec 6 2019, 4:56 PM · Patch-For-Review, Traffic, Operations
BBlack moved T239994: Implement DNS-over-TLS for AuthDNS from Triage to DNS Infra on the Traffic board.
Dec 6 2019, 2:27 PM · Operations, Traffic
BBlack triaged T239994: Implement DNS-over-TLS for AuthDNS as Medium priority.
Dec 6 2019, 2:27 PM · Operations, Traffic
BBlack added a parent task for T98006: Anycast AuthDNS: T81605: Offer AuthDNS service over IPv6.
Dec 6 2019, 2:24 PM · Performance-Team (Radar), Patch-For-Review, netops, Operations, Traffic
BBlack added a subtask for T81605: Offer AuthDNS service over IPv6: T98006: Anycast AuthDNS.
Dec 6 2019, 2:24 PM · Operations, Traffic
BBlack added a comment to T140365: Lower geodns TTLs from 600 (10min) to 300 (5min).

This is still something we want to pursue, but we really need to get past the smooth repooling issue first, so I've added that as a subtask (consider it blocking this one).

Dec 6 2019, 2:23 PM · Operations, Traffic
BBlack removed a parent task for T101525: Set up LVS for current AuthDNS: T140365: Lower geodns TTLs from 600 (10min) to 300 (5min).
Dec 6 2019, 2:23 PM · Operations, Traffic
BBlack added a parent task for T228678: Implement GeoDNS smooth repooling in gdnsd: T140365: Lower geodns TTLs from 600 (10min) to 300 (5min).
Dec 6 2019, 2:22 PM · Operations, Traffic
BBlack edited subtasks for T140365: Lower geodns TTLs from 600 (10min) to 300 (5min), added: T228678: Implement GeoDNS smooth repooling in gdnsd; removed: T101525: Set up LVS for current AuthDNS.
Dec 6 2019, 2:22 PM · Operations, Traffic
BBlack placed T81605: Offer AuthDNS service over IPv6 up for grabs.
Dec 6 2019, 2:21 PM · Operations, Traffic
BBlack renamed T81605: Offer AuthDNS service over IPv6 from No IPv6 addresses on Wikimedia nameservers ns(0-2).wikimedia.org to Offer AuthDNS service over IPv6.
Dec 6 2019, 2:20 PM · Operations, Traffic
BBlack added a comment to T26413: Consider DNSSec.

Since we haven't updated this in two years, I figured I should post again:

Dec 6 2019, 2:16 PM · Operations, Traffic, DNS
BBlack moved T162818: nodejs / restbase services (mobileapps, aqs, recommendation-api, etc?) fail persistently after short windows of DNS unavailability from DNS Infra to Watching on the Traffic board.
Dec 6 2019, 2:14 PM · serviceops, Core Platform Team Legacy (Watching / External), Services (watching), DNS, Operations, Traffic
BBlack added a comment to T162818: nodejs / restbase services (mobileapps, aqs, recommendation-api, etc?) fail persistently after short windows of DNS unavailability.

While we'll work on improvements that make this less-likely in the first place in various DNS infra tickets like T171498 , and we're happy to help debugging this with someone, ultimately this is an applayer problem independent of our DNS infra, so I'm moving it over to the Watching column.

Dec 6 2019, 2:13 PM · serviceops, Core Platform Team Legacy (Watching / External), Services (watching), DNS, Operations, Traffic
BBlack renamed T162818: nodejs / restbase services (mobileapps, aqs, recommendation-api, etc?) fail persistently after short windows of DNS unavailability from icinga alerts on nodejs services when a recdns server is depooled to nodejs / restbase services (mobileapps, aqs, recommendation-api, etc?) fail persistently after short windows of DNS unavailability.
Dec 6 2019, 2:11 PM · serviceops, Core Platform Team Legacy (Watching / External), Services (watching), DNS, Operations, Traffic
BBlack added a project to T162818: nodejs / restbase services (mobileapps, aqs, recommendation-api, etc?) fail persistently after short windows of DNS unavailability: serviceops.
Dec 6 2019, 2:09 PM · serviceops, Core Platform Team Legacy (Watching / External), Services (watching), DNS, Operations, Traffic
BBlack merged task T233660: mobileapps/aqs/recommendation-api (nodejs services) improve resilience against short network outages into T162818: nodejs / restbase services (mobileapps, aqs, recommendation-api, etc?) fail persistently after short windows of DNS unavailability.
Dec 6 2019, 2:08 PM · Services, serviceops
BBlack merged T233660: mobileapps/aqs/recommendation-api (nodejs services) improve resilience against short network outages into T162818: nodejs / restbase services (mobileapps, aqs, recommendation-api, etc?) fail persistently after short windows of DNS unavailability.
Dec 6 2019, 2:08 PM · serviceops, Core Platform Team Legacy (Watching / External), Services (watching), DNS, Operations, Traffic
BBlack closed T238727: Include zone+subnet checks for DNS validation as Declined.

Declined in favor of netbox integration ( T233183 ? ) making this problem go away.

Dec 6 2019, 2:06 PM · Traffic, Operations, DNS, SRE-tools
BBlack added a comment to T239711: Make DNS operations resilient against predictable failures.

Thoughts from the main text of the merged ticket: ------------

Dec 6 2019, 2:04 PM · Traffic, Operations
BBlack merged task T219400: Make authdns-update compatible with local emergency changes into T239711: Make DNS operations resilient against predictable failures.
Dec 6 2019, 2:03 PM · Operations, Traffic
BBlack merged T219400: Make authdns-update compatible with local emergency changes into T239711: Make DNS operations resilient against predictable failures.
Dec 6 2019, 2:03 PM · Traffic, Operations
BBlack added a comment to T219400: Make authdns-update compatible with local emergency changes.

Sorry, I hadn't remembered we had this existing ticket. Will merge into the other newer one since it has patches already and some deeper context, and copy the main text over.

Dec 6 2019, 2:02 PM · Operations, Traffic
BBlack moved T239993: Decom LVS recdns from Triage to DNS Infra on the Traffic board.
Dec 6 2019, 2:01 PM · Patch-For-Review, Traffic, Operations
BBlack closed T211131: DNS recursors TCP retransmits as Declined.

These are still present AFAIK, and we're fairly certain it's just due to pybal healthchecks using blank/broken TCP connections to monitor them. That will be cleaned up in T239993 when we get rid of LVS-based recdns.

Dec 6 2019, 2:01 PM · Pybal, Operations, Traffic
BBlack created T239993: Decom LVS recdns.
Dec 6 2019, 2:00 PM · Patch-For-Review, Traffic, Operations
BBlack added a comment to T171498: Implement machine-local forwarding DNS caches.

In these past couple of weeks we've had a real about-face on this issue, and I think there's a pretty strong consensus and rationale to pursue some kind of host-level caching, but there are details to sort out. Some of the data points to bring this argument up to speed:

Dec 6 2019, 1:51 PM · Traffic, Operations
BBlack closed T125170: Internal DNS resolver responds with NXDOMAIN for localhost AAAA as Resolved.

I'm not sure how long it's been fixed in our infra, but it definitely works correctly now in our new buster 4.1 installs:

Dec 6 2019, 1:23 PM · Traffic, Patch-For-Review, DNS, Operations
BBlack moved T238727: Include zone+subnet checks for DNS validation from Triage to DNS Infra on the Traffic board.
Dec 6 2019, 1:21 PM · Traffic, Operations, DNS, SRE-tools
BBlack moved T239711: Make DNS operations resilient against predictable failures from Triage to DNS Infra on the Traffic board.
Dec 6 2019, 1:21 PM · Traffic, Operations

Dec 5 2019

BBlack closed T239667: Convert DNS servers to Buster, a subtask of T98006: Anycast AuthDNS, as Resolved.
Dec 5 2019, 9:13 PM · Performance-Team (Radar), Patch-For-Review, netops, Operations, Traffic
BBlack closed T239667: Convert DNS servers to Buster as Resolved.
(12) authdns[1001,2001].wikimedia.org,dns[1001-1002,2001-2002,3001-3002,4001-4002,5001-5002].wikimedia.org
----- OUTPUT of 'cat /etc/debian_version' -----
10.2
Dec 5 2019, 9:13 PM · Patch-For-Review, netops, Operations, Traffic
BBlack created P9832 fqdn_rand repro.
Dec 5 2019, 5:11 PM

Dec 4 2019

BBlack added a comment to T239862: unwind the Puppetized /etc/hosts override of statsd.eqiad.wmnet.

In general, usually applayer DNS caching is a Bad Idea unless it's done very carefully (e.g. cap it at something like 5s max, or actually use a full-featured resolver library and get the real TTLs from upstream, or both).

Dec 4 2019, 9:36 PM · Performance-Team (Radar), Operations
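
The "cap it at something like 5s max" option from the comment above can be sketched as follows. This is a hypothetical helper, not code from the service in the task: `CappedDnsCache` is an invented name, and it uses `socket.gethostbyname`, which does not expose upstream TTLs, so a short fixed cap is the only safe choice here.

```python
import socket
import time


class CappedDnsCache:
    """Application-level DNS cache that never serves an answer older than max_age seconds."""

    def __init__(self, max_age=5.0, resolver=socket.gethostbyname, clock=time.monotonic):
        self._max_age = max_age
        self._resolver = resolver  # injectable so the cache can be tested offline
        self._clock = clock
        self._entries = {}         # name -> (address, time stored)

    def lookup(self, name):
        now = self._clock()
        hit = self._entries.get(name)
        if hit is not None and now - hit[1] < self._max_age:
            return hit[0]
        # Missing or too old: ask the resolver again and remember the fresh answer.
        address = self._resolver(name)
        self._entries[name] = (address, now)
        return address
```

With a 5-second cap, a changed record is picked up within roughly 5 seconds, which limits the staleness that makes unbounded applayer caching a bad idea.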
BBlack added a comment to T236216: rack/setup/install ganeti300[123].

With T236479 closed, ganeti3003 is no longer special and everyone can ignore the IMPORTANT NOTE earlier.

Dec 4 2019, 3:38 PM · Operations, ops-esams
BBlack updated the task description for T236216: rack/setup/install ganeti300[123].
Dec 4 2019, 3:36 PM · Operations, ops-esams
BBlack closed T236479: Temporarily use ganeti3003 as ns2 authdns as Resolved.

Our ns2 service address is now re-routed to dns3001, and ganeti3003 is reimaged back to spare::system.

Dec 4 2019, 3:30 PM · Traffic, Operations

Dec 3 2019

BBlack created T239711: Make DNS operations resilient against predictable failures.
Dec 3 2019, 1:30 PM · Traffic, Operations
BBlack created T239675: Add 10G NICs to core site DNS servers (6 servers, 3 per site).
Dec 3 2019, 2:50 AM · hardware-requests, Operations, Traffic

Dec 2 2019

BBlack added a comment to T239667: Convert DNS servers to Buster.

pdns-rec-exporter fixups in: https://gerrit.wikimedia.org/r/#/c/operations/debs/prometheus-pdns-rec-exporter/+/554155/ + https://gerrit.wikimedia.org/r/#/c/operations/debs/prometheus-pdns-rec-exporter/+/554156/
gdnsd rebuilt again using a combo of https://github.com/gdnsd/gdnsd/tree/v3.2.1 + https://github.com/paravoid/gdnsd/tree/experimental + minor debian/systemd hacks, as there is no 3.x debian package outside the WMF yet 📦

Dec 2 2019, 10:23 PM · Patch-For-Review, netops, Operations, Traffic
BBlack created T239667: Convert DNS servers to Buster.
Dec 2 2019, 10:20 PM · Patch-For-Review, netops, Operations, Traffic
BBlack added a comment to T98006: Anycast AuthDNS.

Where we're at now:

  • There are 13x authdns servers participating in authdns-update:
    • The 3 traditional ones (authdns1001, authdns1002, ganeti3003) which are role::authdns
    • The ten dnsbox hosts (role::dnsbox) that also do recdns and ntp (dns[12345]00[12])
  • cumin's A:dns-auth alias targets all 13, whereas A:dns-rec targets only the ten recursors (these aliases currently target underlying profiles, not roles)
  • Public authdns service routing is unchanged, with each of the ns[012] IPv4s routed into their usual 3 traditional role::authdns boxes
  • The recursors (the 10x role::dnsbox) now use their machine-local authdns instance over the loopback to look up our own names, rather than talking to the "real" ns[012] machines.
  • At least some of https://wikitech.wikimedia.org/wiki/DNS has been caught up with reality a bit.
Dec 2 2019, 2:52 PM · Performance-Team (Radar), Patch-For-Review, netops, Operations, Traffic
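
The `dns[12345]00[12]` shorthand in the list above reads like a shell character class. A small sketch of how it expands to the ten dnsbox hosts (the helper name is hypothetical, and this is not how cumin or clush parse host ranges):

```python
import itertools
import re


def expand_char_classes(pattern):
    """Expand bracketed character classes, e.g. 'dns[12]00[12]' -> four names."""
    # re.split with a capturing group alternates literal text and class contents.
    parts = re.split(r"\[([^\]]+)\]", pattern)
    choices = [[p] if i % 2 == 0 else list(p) for i, p in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]


dnsbox_hosts = expand_char_classes("dns[12345]00[12]")
print(len(dnsbox_hosts), dnsbox_hosts[0], dnsbox_hosts[-1])
```

Together with the 3 traditional role::authdns hosts, those ten names make up the 13 servers participating in authdns-update.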

Nov 27 2019

BBlack added a comment to T239334: Python3 style guide.

I will float the opinion that while I may have many opinions on code style, bikeshedding between reasonable options for a shared standard is a waste. If there's a standard upstream/outside-world set of common style rules for python3, we should just adopt them and be done with it.

Nov 27 2019, 5:33 PM · Patch-For-Review, User-ArielGlenn, User-jbond, Operations, Puppet

Nov 26 2019

BBlack added a comment to T98006: Anycast AuthDNS.

Status update: the blended authdns+recdns(+ntp) role is now nearly-complete in role::dnsbox. There's a hieradata flag profile::dnsbox::include_auth which is only set for dns4002 in hieradata, which causes the authdns functionality to be included in the role. dns4002 is running this way now, and is currently the 4th member of our authdns set for authdns-update and similar purposes (including authdns healthchecks and prometheus stats), but only the local recdns daemon on that box is using it for lookups; there's no public service routed into it. authdns-update has been parallelized with clush for now, to ensure it doesn't get really slow as the count of servers expands.

Nov 26 2019, 10:40 PM · Performance-Team (Radar), Patch-For-Review, netops, Operations, Traffic
BBlack added a comment to T237319: ATS serving 502 errors due to malformed responses from wikibase (HTTP 304s with message body content).

I think you ran into a temporary blip in some unrelated DNS work (which is already dealt with), not this bug (502 errors can happen for real infra failure reasons, too!)

Nov 26 2019, 9:13 PM · User-Ladsgroup, Wikidata, Wikidata-Campsite, Operations, Traffic, User-DannyS712
BBlack added a comment to T226444: rack/setup/install ganeti400[123].

I think we'll keep them private-vlan only and no tagging, and for the rare cases of "public" service instances we'll use LVS to route the traffic (same for all the edge-site ganeti).

Nov 26 2019, 2:59 PM · Traffic, Operations
BBlack added a comment to T238825: Create wildcard DNS record for Wikimedia projects.

I could go either way on the subject of explicit langlist vs wildcard, really, so long as we're confident the MediaWiki layer handles all unknown language codes (really, unknown random hostname labels...) sanely, including crazy ones like :ffq384f9q8f9qj9j-/\.wikipedia.org or whatever. It would even make some things simpler at the DNS and Caching layers if we could assume that. I'd have to go do a quick audit of our DNS data for all the canonical domains to see what kind of exceptions there are, though.

Nov 26 2019, 2:21 PM · Traffic, Operations, DNS
BBlack closed T236497: cp3056 hardware issue, a subtask of T235805: ESAMS Refresh/Rebuild (October 2019), as Resolved.
Nov 26 2019, 1:38 PM · Patch-For-Review, DC-Ops, Operations, ops-esams
BBlack closed T236497: cp3056 hardware issue as Resolved.

Seems good so far, has been up a few days and in full service for about a day, without incident. Calling this resolved until anything changes!

Nov 26 2019, 1:38 PM · DC-Ops, ops-esams, Operations, Traffic
BBlack added a comment to T233274: ATS lua script reload doesn't work as expected.

@Vgutierrez - I really think, reading the Lua plugin code, that __reload__ in 8.0.x might not do what you'd sanely expect (although it is undocumented). I think the __reload__ hook is actually more like a destructor hook for any custom destruct actions you want to happen before the reload into a fresh Lua context. And reload also definitely doesn't hit __init__ either, which leaves the whole model seeming a little broken if you've got a global initialized to nil in its declaration and then initialized to a real value in __init__. Clearly we're missing something here...

Nov 26 2019, 4:14 AM · Patch-For-Review, Operations, Traffic

Nov 25 2019

BBlack added a comment to T238305: servers freeze across the caching cluster.

It was observed earlier in the traffic meeting that we're fairly certain that none of our R440 hosts have had this problem more than once, so this may be a "once per server" phenomenon, in which case it's also quite likely this can be pre-empted on the ones that haven't crashed yet by giving them a reboot (e.g. something deep has changed while the servers are live, and they stabilize once they've done a fresh boot with it, possibly a live update of some microcode or firmware?)

Nov 25 2019, 5:43 PM · Operations, Traffic

Nov 22 2019

BBlack added a comment to T236497: cp3056 hardware issue.

So far so good - it has completed all the initial puppetization stuff, which is much further than it got before.

Nov 22 2019, 5:54 PM · DC-Ops, ops-esams, Operations, Traffic
RobH awarded T236497: cp3056 hardware issue a Like token.
Nov 22 2019, 5:21 PM · DC-Ops, ops-esams, Operations, Traffic
BBlack claimed T236497: cp3056 hardware issue.

Attempting reimage (see above). If it fails like before, it won't get very far (certainly not into production use).

Nov 22 2019, 5:19 PM · DC-Ops, ops-esams, Operations, Traffic
BBlack added a comment to T236216: rack/setup/install ganeti300[123].

IMPORTANT NOTE ganeti3003 is temporarily repurposed as a critical authdns server and is in live production use for that role (see also: T236479 ). Do not reimage or touch ganeti3003. The other two (ganeti3001 and ganeti3002) are free to image and set up as a 2-node ganeti cluster, with the third node to join later when its temporary duties are complete.

Nov 22 2019, 2:36 PM · Operations, ops-esams
BBlack updated the task description for T236216: rack/setup/install ganeti300[123].
Nov 22 2019, 2:34 PM · Operations, ops-esams

Nov 20 2019

BBlack added a comment to T238494: 15% response start regression as of 2019-11-11 (Varnish->ATS).

There were two TLS-level changes to the certificate output for esams specifically, each of which bumped the output size (the number of bytes we send the client during the TLS handshake) by a small amount, but either could've pushed us over the boundary for an extra packet (although I wouldn't think we'd reach IW10). They were https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/550463/ and https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/550564/ . They would've taken effect on the servers shortly after merge in each case (within 30 mins or so, anyways), although only for new TLS sessions going forward from that point. The first was ~14:00 UTC and the second was ~22:30 UTC, both on Nov 12.

Nov 20 2019, 5:15 PM · Wikimedia-Incident, Patch-For-Review, Performance-Team, Traffic, Operations
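
The "extra packet" and IW10 reasoning in the comment above comes down to simple arithmetic. A rough sketch, assuming a 1460-byte MSS (1500-byte MTU, IPv4, no TCP options) and ignoring TLS record overhead; the function names are hypothetical and the byte counts in the usage note are illustrative, not measured handshake sizes:

```python
import math


def segments_needed(payload_bytes, mss=1460):
    """Number of full-size TCP segments needed to carry payload_bytes."""
    return math.ceil(payload_bytes / mss)


def fits_initial_window(payload_bytes, mss=1460, initcwnd=10):
    """True if the payload fits in the initial congestion window (IW10 by default)."""
    return segments_needed(payload_bytes, mss) <= initcwnd
```

Under these assumptions, growing a 4,380-byte flight by a single byte moves it from 3 segments to 4, while the IW10 boundary sits at 14,600 bytes. A small increase in certificate size can therefore add a packet long before it adds a round trip.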

Nov 19 2019

BBlack updated subscribers of T236208: interface-rps.py should have a flag to avoid CPU0.

Adding @RLazarus in hopes of nerd-sniping him further on this topic...

Nov 19 2019, 8:33 PM · Operations, Traffic
BBlack added a comment to T236208: interface-rps.py should have a flag to avoid CPU0.

So the patch above adds it to the queue distribution logic in interface-rps, but there's another piece of the puzzle here, which is setting the hardware's queue count for the interface itself, which is where a little bit of a rabbithole develops...

Nov 19 2019, 8:30 PM · Operations, Traffic

Nov 18 2019

BBlack added a comment to T238494: 15% response start regression as of 2019-11-11 (Varnish->ATS).

@Gilles This could also be related to TLS certificate changes that were happening around the same dates, and could be inflating the bytes transferred in handshakes. We have a couple of different ongoing things there (renewals, revocations, vendor changes), and as a result we've seen a handshake bytes-sent increase in the EU for sure (which is temporary, but also unavoidable. Probably in the next week or two we'll see that go back to normal and can then confirm if the stats shift with it again).

Nov 18 2019, 3:50 PM · Wikimedia-Incident, Patch-For-Review, Performance-Team, Traffic, Operations