BBlack (Brandon Black)
Engineering Manager, SRE Traffic Team

User Details

User Since
Nov 4 2014, 4:29 PM (250 w, 1 d)
Availability
Available
IRC Nick
bblack
LDAP User
BBlack
MediaWiki User
BBlack (WMF)

Recent Activity

Yesterday

BBlack created T230955: Configure Layer3 hashing for router ECMP (for anycast DNS).
Wed, Aug 21, 8:15 PM · Operations, Traffic

Tue, Aug 20

BBlack changed the status of T230638: Move old transparency report pages to historical URLs and setup redirect from Open to Stalled.

Just stalling this so that anyone following it doesn't try to pick this up or move with it yet. There's an ongoing email thread about clarifying this task, and we're waiting for at least one person to return from a vacation and provide guidance before we move forward here.

Tue, Aug 20, 3:30 PM · serviceops, Operations, WMF-Legal

Mon, Aug 19

BBlack changed the status of T230687: Decide/document criteria needed to serve acme-chief LE issued unified certificate to end users from Open to Stalled.

There's perhaps a faulty implicit assumption here that we desire to use one cert for the world and that we'd just "switch" everything to LE. We're currently using the Globalsign cert at all edges due to various problems earlier in the year, but what we were doing in the past and would like to continue doing in the future is using two certs simultaneously from unrelated CAs, and making the split on a per-datacenter basis (with the US sites using GlobalSign, and the non-US sites using LE, in this case).

Mon, Aug 19, 5:31 PM · Operations, Traffic, Acme-chief
BBlack added a comment to T230733: Expose pooled status of gdnsd and conftool managed services as metrics.

I'd start with the conftool stuff before moving on to anything that tracks gdnsd's admin_state-driven things. That whole mechanism is likely to be replaced in the next quarter or two on the gdnsd side, and I wouldn't be surprised if we end up driving the new mechanism from conftool by default.

Mon, Aug 19, 4:02 PM · User-CDanis, Operations, observability
BBlack added a comment to T226840: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm).

(Also, is the specific TMH fix actually deployed to all groups yet?)

Mon, Aug 19, 3:47 PM · Core Platform Team, Patch-For-Review, TimedMediaHandler, MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), Wikimedia-Incident, Performance-Team (Radar), Traffic, MediaWiki-extensions-CentralAuth, Operations
BBlack added a comment to T226840: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm).

Is this fixed now?

The specific issue of TMH triggering CentralAuth misbehavior is fixed. The more generic issue of CentralAuth misbehavior being easily triggerable via wrong use of Request or RequestContext is not.

Mon, Aug 19, 3:46 PM · Core Platform Team, Patch-For-Review, TimedMediaHandler, MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), Wikimedia-Incident, Performance-Team (Radar), Traffic, MediaWiki-extensions-CentralAuth, Operations
BBlack placed T230638: Move old transparency report pages to historical URLs and setup redirect up for grabs.

Unassign for now. The actual ask here is unclear in terms of technical details.

Mon, Aug 19, 10:50 AM · serviceops, Operations, WMF-Legal

Thu, Aug 15

BBlack added a comment to T98006: Anycast (Auth)DNS.

General status updates and planning, for this very old ticket which is still on the radar!

Thu, Aug 15, 2:48 PM · Performance-Team (Radar), Patch-For-Review, netops, Operations, Traffic
BBlack added a subtask for T186550: Anycast recdns: T228190: Roll out Anycast RecDNS to more servers.
Thu, Aug 15, 2:26 PM · Patch-For-Review, netops, Operations, Traffic
BBlack added a parent task for T228190: Roll out Anycast RecDNS to more servers: T186550: Anycast recdns.
Thu, Aug 15, 2:26 PM · Patch-For-Review, Operations, Traffic

Wed, Aug 14

BBlack created P8910 bblack NewPP on 9s greece load.
Wed, Aug 14, 4:57 PM
BBlack added a comment to T228190: Roll out Anycast RecDNS to more servers.

I'm not sure if it goes as a subtask here, or of T167841 and/or T227808 - but recording here so we don't forget, from an earlier IRC conversation:

Wed, Aug 14, 4:22 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T99531: [Task] move wikiba.se webhosting to wikimedia cluster.

As noted in T155359 - WMDE has moved the hosting of this to some other platform, including the DNS hosting (and we never had the whois entry). So this task can resolve as Decline I think (or whatever), but we should use it to track down various revert patches first before we close it up (revert the DNS repo stuff and whatever else we've got going on in various other repos supporting the wikiba.se site).

Wed, Aug 14, 3:10 PM · Patch-For-Review, User-Addshore, serviceops, wikidata-tech-focus, Traffic, wikiba.se website, Operations, Wikidata-Sprint-2016-11-08, Wikidata

Tue, Aug 13

BBlack closed T229860: SRE Onboarding for Sukhbir Singh as Resolved.

Looks like it to me :)

Tue, Aug 13, 7:12 PM · SRE-Access-Requests, Traffic, Operations

Mon, Aug 5

BBlack added a comment to T228827: Instability of the Level3 link between cr2-eqiad and cr2-esams.

May as well link in an earlier related ticket from late last year for more backstory, too: https://phabricator.wikimedia.org/T205609

Mon, Aug 5, 7:50 PM · Operations, netops
BBlack added a comment to T228827: Instability of the Level3 link between cr2-eqiad and cr2-esams.

Again today, causing a small spike of esams-specific 503s and icinga alerts:

Mon, Aug 5, 7:49 PM · Operations, netops
BBlack created T229860: SRE Onboarding for Sukhbir Singh.
Mon, Aug 5, 5:19 PM · SRE-Access-Requests, Traffic, Operations
BBlack added a comment to T229621: Icinga check defined from LVS configuration for cloudelastic are borked.

So, yes, cloudelastic is correct in DNS for normal lookups. The issue is that the icinga check defines the virtual host entry for cloudelastic monitoring an explicit IP in its configuration, and that IP ends up being the IP of icinga1001, not of cloudelastic. This probably has to do with the puppet host context in which the resource is evaluated.

Mon, Aug 5, 12:53 PM · Patch-For-Review, Discovery-Search (Current work), Elasticsearch, Traffic, Operations

Fri, Aug 2

BBlack added a comment to T102099: Fix IPv6 autoconf issues once and for all, across the fleet..

Re: transitioning away from SLAAC for the current fleet/setup (which I think is probably a good incremental idea, and could happen ahead of the future netbox work to make that transition easier in the future). Some thoughts on accomplishing that:

Fri, Aug 2, 2:11 PM · Patch-For-Review, Traffic, netops, Operations, IPv6

Thu, Aug 1

BBlack created T229621: Icinga check defined from LVS configuration for cloudelastic are borked.
Thu, Aug 1, 8:57 PM · Patch-For-Review, Discovery-Search (Current work), Elasticsearch, Traffic, Operations
BBlack added a comment to T229586: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099.

These are ready to go for dcops-level work!

Thu, Aug 1, 5:14 PM · ops-eqiad, DC-Ops, decommission, Operations
BBlack closed T221343: puppet fails to run in cp1008 under certain conditions, a subtask of T219803: upgrade facter and puppet across the fleet, as Declined.
Thu, Aug 1, 4:57 PM · Patch-For-Review, Packaging, Puppet, Operations
BBlack closed T221343: puppet fails to run in cp1008 under certain conditions as Declined.

Decom in T229586

Thu, Aug 1, 4:57 PM · Packaging, Puppet, Operations
BBlack created T229586: decommission cp1008, cp1071, cp1072, cp1073, cp1074, cp1099.
Thu, Aug 1, 4:00 PM · ops-eqiad, DC-Ops, decommission, Operations
BBlack closed T202966: Make cp1099 the new pinkunicorn, a subtask of T208734: Decommission asw-c-eqiad, as Declined.
Thu, Aug 1, 3:31 PM · decommission, Operations, ops-eqiad, netops
BBlack closed T202966: Make cp1099 the new pinkunicorn as Declined.

We had a quick discussion and a small informal vote and decided we don't really need this functionality (pinkunicorn) anymore, so we're going to retire it and not replace it.

Thu, Aug 1, 3:31 PM · Traffic, Operations

Wed, Jul 31

BBlack added a comment to T226044: Prepare Phame to support heavy traffic for a Tech Department blog.

Heh, apparently I can't even remember things I read and said before even when they're right above me in the same ticket!

Wed, Jul 31, 6:50 PM · Release-Engineering-Team-TODO (201908), User-greg, Release-Engineering-Team (Development services), Operations, Traffic, Phabricator
BBlack updated the task description for T228190: Roll out Anycast RecDNS to more servers.
Wed, Jul 31, 6:32 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T228190: Roll out Anycast RecDNS to more servers.

Rollout status update: things that are using anycast recdns resolv.conf in production as of 2019-07-31:

  • All hosts in edge DCs (esams, ulsfo, eqsin)
  • All cp edge cache hosts globally
  • All LVS hosts globally
  • Canary Mediawiki API and Appserver hosts in both core DCs
  • Network devices
  • Install-time stuff (as in dhcp settings and Debian installer)
Wed, Jul 31, 6:32 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T226044: Prepare Phame to support heavy traffic for a Tech Department blog.

Replying to myself earlier: apparently they're datestamped URIs beginning with /yyyy/mm/, examples being:

Wed, Jul 31, 6:19 PM · Release-Engineering-Team-TODO (201908), User-greg, Release-Engineering-Team (Development services), Operations, Traffic, Phabricator
BBlack added a comment to T226044: Prepare Phame to support heavy traffic for a Tech Department blog.

I think it should be techblog.wikimedia.org, because even if that introduces a complication around redirect based on time period

Wed, Jul 31, 6:13 PM · Release-Engineering-Team-TODO (201908), User-greg, Release-Engineering-Team (Development services), Operations, Traffic, Phabricator
BBlack added a comment to T226044: Prepare Phame to support heavy traffic for a Tech Department blog.

TODO list here from my POV, as best I understand things:

Wed, Jul 31, 4:47 PM · Release-Engineering-Team-TODO (201908), User-greg, Release-Engineering-Team (Development services), Operations, Traffic, Phabricator
BBlack closed T207340: Determine cause of upload.wikimedia.org requests routed to text-lb (404 Not Found) as Resolved.

The 421 code is deployed and seems to be working correctly, with a fairly small global average rate of somewhere <1 req/sec. This is the most-legitimate thing we can do with these misdirected requests, and it may actually fix some of them if the UA's own confusion is truly at fault, but it may not be able to help if some kind of DNS or HTTPS proxy interference is causing persistent issues. Maybe it will at least reduce error reporting and debugging confusion in such cases, though, as 421 is very specific to this issue (vs generic 404).
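
As a rough illustration of the behaviour described above (not the deployed edge configuration), the status-code decision can be sketched in Python. The function name and the `found` flag are hypothetical; only the upload.wikimedia.org hostname and the 421-versus-404 distinction come from the comment.

```python
from http import HTTPStatus

# Hostname taken from the task title; everything else here is illustrative.
UPLOAD_HOST = "upload.wikimedia.org"


def text_cluster_status(host: str, found: bool) -> int:
    """Pick a status code for a request arriving at the text cluster.

    `found` is a hypothetical stand-in for "the text cluster can serve this".
    """
    # A request for the upload hostname reached the wrong cluster:
    # answer 421 Misdirected Request rather than a generic 404.
    if host.strip().lower() == UPLOAD_HOST:
        return int(HTTPStatus.MISDIRECTED_REQUEST)
    return int(HTTPStatus.OK if found else HTTPStatus.NOT_FOUND)
```

A client that receives 421 is allowed to retry the request over a different connection, which is why it may help where the user agent's own connection reuse is at fault.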

Wed, Jul 31, 4:24 PM · Performance-Team (Radar), Traffic, Operations

Thu, Jul 25

BBlack added a comment to T228190: Roll out Anycast RecDNS to more servers.

All the LVSes are now using the anycasted recdns, which gets rid of the LVS<->recdns dependency loop and simplifies recdns server downtime processes: https://wikitech.wikimedia.org/w/index.php?title=Service_restarts&type=revision&diff=1833705&oldid=1832671

Thu, Jul 25, 10:13 PM · Patch-For-Review, Operations, Traffic

Tue, Jul 23

BBlack created P8785 esams L3 wave recent history.
Tue, Jul 23, 4:41 PM · netops, Traffic

Jul 23 2019

BBlack added a comment to T228730: TLS config issue for nginx on Buster.

If we need this to work ASAP, probably the most-expedient thing to do would be to patch our puppetization to exclude the patched features from config on buster only, and use the vendor package. Traffic is in the process of moving away from nginx, hopefully by EOQ-ish, after which we won't need the problematic custom package, and the stock vendor package should work fine for other uses of the tlsproxy module (but we're not quite ready enough, yet, to mess with our current solution by removing the WMF package from stretch!).

Jul 23 2019, 12:45 PM · Patch-For-Review, Operations, Traffic

Jul 22 2019

BBlack added a comment to T227538: b2-eqiad pdu refresh (Tuesday 10/29 @11am UTC).

cp1079 and cp1080 just need normal depooling process here.

Jul 22 2019, 8:13 PM · DC-Ops, Operations, ops-eqiad
BBlack added a comment to T227542: b7-eqiad pdu refresh (Tuesday 11/5 @11am UTC).

lvs1014 here will need special care: Traffic should stop puppet and pybal and monitor failover to lvs1016 ahead of work, then revert afterwards. cp1081 and cp1082 here can be depooled as normal.

Jul 22 2019, 8:12 PM · DC-Ops, Operations, ops-eqiad
BBlack added a comment to T227143: a7-eqiad pdu refresh.

(task desc edited for correct cp nodes: this rack has 77/78, not 76/77)

Jul 22 2019, 8:09 PM · DC-Ops, Operations, ops-eqiad
BBlack updated the task description for T227143: a7-eqiad pdu refresh.
Jul 22 2019, 8:09 PM · DC-Ops, Operations, ops-eqiad
BBlack added a comment to T227143: a7-eqiad pdu refresh.

The Traffic nodes cp1077 + cp1078 can be depooled the usual way, but lvs1013 needs some special care. Someone from Traffic should handle and monitor that just in case (basically we need to manually disable puppet and stop pybal a few minutes in advance of the work, verify traffic moving correctly to lvs1016, and then put everything back to normal afterwards).

Jul 22 2019, 8:05 PM · DC-Ops, Operations, ops-eqiad
BBlack added a comment to T227141: a5-eqiad pdu refresh.

All the traffic cp and lvs nodes are decoms and not in use: T208584 T208586

Jul 22 2019, 6:14 PM · DC-Ops, Operations, ops-eqiad
BBlack updated the task description for T228678: Implement GeoDNS smooth repooling in gdnsd.
Jul 22 2019, 4:33 PM · Traffic, Operations
BBlack moved T228678: Implement GeoDNS smooth repooling in gdnsd from Triage to DNS Infra on the Traffic board.
Jul 22 2019, 4:26 PM · Traffic, Operations
BBlack merged task T94697: implement better failure-scenario geoip mapping in gdnsd into T228678: Implement GeoDNS smooth repooling in gdnsd.
Jul 22 2019, 4:25 PM · Operations, Traffic
BBlack merged T94697: implement better failure-scenario geoip mapping in gdnsd into T228678: Implement GeoDNS smooth repooling in gdnsd.
Jul 22 2019, 4:25 PM · Traffic, Operations
BBlack triaged T228678: Implement GeoDNS smooth repooling in gdnsd as Normal priority.
Jul 22 2019, 4:10 PM · Traffic, Operations
BBlack updated the task description for T228671: Decommission lvs100[123456].
Jul 22 2019, 2:54 PM · Traffic, DC-Ops, Operations, decommission
BBlack created T228671: Decommission lvs100[123456].
Jul 22 2019, 2:52 PM · Traffic, DC-Ops, Operations, decommission
BBlack added a comment to T227140: a4-eqiad pdu refresh.

cp1076 - Can depool ahead of work and repool later, with the local commands "depool" and "pool"
lvs100[123] - Not in use and should be decommed, but this ticket made me realize we haven't made an lvs1001-6 decom ticket yet (will do shortly!)

Jul 22 2019, 2:46 PM · DC-Ops, Operations, ops-eqiad
BBlack closed T184293: rack/setup/install lvs101[3-6] as Resolved.

These have been in-service for a while now, closing!

Jul 22 2019, 2:43 PM · Operations, Traffic

Jul 19 2019

BBlack triaged T228533: Fix geoip updaters for new MaxMind hashed keys by 2019-08-15 as Normal priority.
Jul 19 2019, 5:22 PM · Analytics, Traffic, Operations

Jul 18 2019

BBlack added a comment to T120085: Serve Main Page of WMF wikis from a consistent URL.

Oh one more thing that should've been (3) on that list:

Jul 18 2019, 12:24 PM · Core Platform Team, Patch-For-Review, Performance-Team, Operations, Traffic, TechCom-RFC, SEO, Wikimedia-Site-requests
BBlack added a comment to T120085: Serve Main Page of WMF wikis from a consistent URL.

I like the end result here, and I don't think it's problematic from the Traffic perspective in the long view, but I think the initial rollout isn't so trivial:

Jul 18 2019, 12:17 PM · Core Platform Team, Patch-For-Review, Performance-Team, Operations, Traffic, TechCom-RFC, SEO, Wikimedia-Site-requests

Jul 17 2019

BBlack moved T228190: Roll out Anycast RecDNS to more servers from Triage to DNS Infra on the Traffic board.
Jul 17 2019, 3:49 PM · Patch-For-Review, Operations, Traffic
BBlack closed T203194: cp1075-90 - bnxt_en transmit hangs as Resolved.

@Vgutierrez The firmware update on the NICs fixed this for good, right? Can we close this task?

Jul 17 2019, 3:48 PM · Patch-For-Review, Operations, Traffic

Jul 16 2019

BBlack added a comment to T186550: Anycast recdns.

It's my understanding that the steps necessary to restart our recursors are now reduced to a simple depool/repool and that the previous, complex approach from
https://wikitech.wikimedia.org/wiki/Service_restarts#DNS_recursors_(in_production_and_labservices) is now obsolete, right?

Jul 16 2019, 4:57 PM · Patch-For-Review, netops, Operations, Traffic

Jul 3 2019

BBlack added a comment to T226840: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm).

I wonder if other uses of new RequestContext() (most of which are intentional) can trigger this bug.

Looks to be exclusively in tests?

Jul 3 2019, 3:56 PM · Core Platform Team, Patch-For-Review, TimedMediaHandler, MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), Wikimedia-Incident, Performance-Team (Radar), Traffic, MediaWiki-extensions-CentralAuth, Operations
BBlack added a comment to T226840: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm).

Thanks for chasing this down! After fixing up any further sources of the extra sessions: do we have to do something about clearing out the excess sessions from storage (redis?), or is this mostly an ephemeral sort of problem?

Jul 3 2019, 1:12 PM · Core Platform Team, Patch-For-Review, TimedMediaHandler, MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), Wikimedia-Incident, Performance-Team (Radar), Traffic, MediaWiki-extensions-CentralAuth, Operations

Jul 2 2019

BBlack updated subscribers of T226840: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm).

@Anomie / @Legoktm - Can you take a look at this? We're out of our depth over here trying to figure out this bug. TL;DR is that some logged-in sessions are getting excess (in the example above, ~50) Set-Cookie headers for auth sessions, with many repeats for the same wiki with different session id numbers, to the point where it's causing us real problems.

Jul 2 2019, 6:00 PM · Core Platform Team, Patch-For-Review, TimedMediaHandler, MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), Wikimedia-Incident, Performance-Team (Radar), Traffic, MediaWiki-extensions-CentralAuth, Operations
BBlack edited P8699 Directors VCL->C.
Jul 2 2019, 3:45 PM · Traffic
BBlack created P8699 Directors VCL->C.
Jul 2 2019, 3:44 PM · Traffic

Jul 1 2019

BBlack raised the priority of T226840: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) from Normal to High.

Re-setting this to at least High for now, given the criticality of the component involved and the production impacts.

Jul 1 2019, 1:42 PM · Core Platform Team, Patch-For-Review, TimedMediaHandler, MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), Wikimedia-Incident, Performance-Team (Radar), Traffic, MediaWiki-extensions-CentralAuth, Operations

Jun 28 2019

BBlack created P8684 Crazy cookies.
Jun 28 2019, 3:45 PM · Traffic

Jun 26 2019

BBlack placed T226444: rack/setup/install ganeti400[123] up for grabs.

I don't think anyone's 100% sure how we're handling this project, but probably Traffic will figure out the setup for these and ask Alex if we need help. We probably won't get around to it very quickly, can leave them in role::spare for now until we get to it.

Jun 26 2019, 6:50 PM · Traffic, Operations

Jun 19 2019

BBlack added a comment to T226044: Prepare Phame to support heavy traffic for a Tech Department blog.

Implementing a blanket redirect to the legacy blog URI for ^/20(0[7-9]|1[0-8])/ should be feasible in VCL or Lua at the edge. Or alternatively, we could also just leave it alone and pick another hostname, too.
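
To make the range covered by that pattern concrete, here is a minimal Python sketch that applies the same regex to a few paths. It only illustrates which paths match; the helper name is hypothetical, and the real redirect would live in VCL or Lua as the comment says.

```python
import re

# Pattern quoted in the comment: paths beginning /2007/ through /2018/.
LEGACY_BLOG_PATH = re.compile(r"^/20(0[7-9]|1[0-8])/")


def is_legacy_blog_path(path: str) -> bool:
    """Return True when the path falls in the datestamped legacy range."""
    return LEGACY_BLOG_PATH.match(path) is not None


# Example paths are made up for illustration.
for path in ("/2007/01/post", "/2018/12/post", "/2019/06/post", "/about/"):
    print(path, is_legacy_blog_path(path))
```

Paths dated 2019 or later fall outside the pattern, so they would not be sent to the legacy blog.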

Jun 19 2019, 1:54 PM · Release-Engineering-Team-TODO (201908), User-greg, Release-Engineering-Team (Development services), Operations, Traffic, Phabricator
Restricted Application added a project to T226044: Prepare Phame to support heavy traffic for a Tech Department blog: Operations.
Jun 19 2019, 1:25 PM · Release-Engineering-Team-TODO (201908), User-greg, Release-Engineering-Team (Development services), Operations, Traffic, Phabricator
BBlack updated the task description for T226044: Prepare Phame to support heavy traffic for a Tech Department blog.
Jun 19 2019, 1:25 PM · Release-Engineering-Team-TODO (201908), User-greg, Release-Engineering-Team (Development services), Operations, Traffic, Phabricator

Jun 8 2019

BBlack added a comment to T225347: When downloading from git using HTTPS: HTTP 500 / GnuTLS recv error (-110).

The TLS-level error is just complaining that, at the end of the transaction, the connection was aborted abruptly instead of torn down cleanly. It would probably be more-ideal if gerrit's TLS stack would cleanly close on 500s when it can, but the real issue here is probably the 500 error, not the TLS error. At a glance, the GET request headers look identical in the two cases, so I'm at a loss as to what's happening on gerrit's side here. Is there perhaps a request difference in some HTTP-level authentication or cookie stuff that's not shown in the trace?

Jun 8 2019, 12:31 PM · Traffic, Operations, Gerrit

Jun 6 2019

BBlack closed T222078: Analyze readers' engagement in countries affected by Singapore Data Center's switch as Resolved.

@leila and @Miriam - Thanks for all the hard work here, it's truly outstanding the depth to which this analysis already goes, and it puts some useful numbers on the impact of expanding our edge network into under-served regions.

Jun 6 2019, 2:42 PM · Research-consulting, Research

May 30 2019

BBlack added a comment to T224694: cp3041 - Varnish frontend child restarted icinga alert.

That alert basically means that a varnish frontend daemon crashed (and as usual was auto-restarted by a manager process). These are pretty rare and usually worth some investigation.

May 30 2019, 7:43 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T223408: Page gets redirected randomly to former blackout page.

We may want to think of a solution the community can employ for these kinds of blackouts that doesn't require a sitemap generation & deployment after the fact. Just a thought.

May 30 2019, 12:57 PM · Readers-Web-Backlog (Tracking), Performance-Team (Radar), Wikimedia-Incident

May 29 2019

BBlack added a comment to T222937: Replace Varnish backends with ATS on cache upload nodes in esams.

The failed reimage was finished up manually (probably not the reimager's fault)

May 29 2019, 8:14 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T212197: Deliver mobile-based version for automatic translations.

Done. Are we ready to deploy it already or blocked on other MW-level deploys still?

May 29 2019, 3:49 PM · MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), Traffic, Operations, ExternalGuidance

May 28 2019

BBlack added a comment to T224511: cr1-codfw linecard failure.

Plan seems reasonable based on the info in the description! Maybe wait longer than 2h after the linecard is restarted? Or do we suspect that any recurrence is much less likely with no traffic?

May 28 2019, 6:14 PM · netops, Operations

May 24 2019

BBlack added a comment to T223902: cloudcontrol: decide on FQDN for service endpoints.

That cloud rebranding link above also mentions wikimediacloud.org, which is yet another option nobody's exploiting yet. So even without getting into the over-long wikimediacloudservices.org, we have sufficient names to cover all the cases here (feel free to re-arrange, esp the latter two):

May 24 2019, 7:07 PM · Traffic, Operations, Cloud-VPS, cloud-services-team (Kanban)
BBlack updated subscribers of T223902: cloudcontrol: decide on FQDN for service endpoints.

Ok, @aborrero caught me up on all the context on IRC so I can stop asking dumb questions (Thanks!).

May 24 2019, 12:31 PM · Traffic, Operations, Cloud-VPS, cloud-services-team (Kanban)

May 23 2019

BBlack reassigned T224223: decommission lvs100[123456].wikimedia.org from BBlack to ayounsi.

These are reimaged to role(spare::system) now. Over to @ayounsi for getting rid of all the special cases related to these hosts in the eqiad routers and switches (BGP stuff, fw filters, the special public-vlan LVS-balancer port groups, etc), and then we can move this on to dcops-level decom stuff.

May 23 2019, 10:36 PM · Traffic, Operations, DC-Ops
BBlack updated the task description for T224223: decommission lvs100[123456].wikimedia.org.
May 23 2019, 10:33 PM · Traffic, Operations, DC-Ops
BBlack added a comment to T223902: cloudcontrol: decide on FQDN for service endpoints.

Do these belong in wikimedia.org at all? It seems this has already been discussed, but I guess I lack some context.

May 23 2019, 10:07 PM · Traffic, Operations, Cloud-VPS, cloud-services-team (Kanban)
BBlack added a comment to T224033: Fix operations/puppet.git "rebase hell".

One more:

May 23 2019, 3:00 PM · Gerrit, Release-Engineering-Team-TODO, Continuous-Integration-Config, Operations
BBlack added a comment to T224033: Fix operations/puppet.git "rebase hell".

A few thoughts:

May 23 2019, 2:56 PM · Gerrit, Release-Engineering-Team-TODO, Continuous-Integration-Config, Operations
BBlack updated the task description for T224223: decommission lvs100[123456].wikimedia.org.
May 23 2019, 1:39 PM · Traffic, Operations, DC-Ops
BBlack moved T224223: decommission lvs100[123456].wikimedia.org from Triage to LoadBalancer on the Traffic board.
May 23 2019, 1:34 PM · Traffic, Operations, DC-Ops
BBlack added a project to T224223: decommission lvs100[123456].wikimedia.org: Traffic.
May 23 2019, 1:33 PM · Traffic, Operations, DC-Ops
BBlack updated the task description for T224223: decommission lvs100[123456].wikimedia.org.
May 23 2019, 1:31 PM · Traffic, Operations, DC-Ops
Restricted Application added a project to T224223: decommission lvs100[123456].wikimedia.org: Operations.
May 23 2019, 1:30 PM · Traffic, Operations, DC-Ops

May 22 2019

BBlack added a comment to T223921: GSuite Test Domain Verification.

Either is fine. I assume you won't be able to do anything else with this (e.g. make https://gsuite-test.wikimedia.org/ work) without some followup records added on our side.

May 22 2019, 7:30 PM · Operations, DNS, Traffic
BBlack added a comment to T140365: Lower geodns TTLs from 600 (10min) to 300 (5min).

So we've reduced query volume by ~32% in T208263. Since the last significant updates here, we've also deployed newer versions of our authdns software which perform even better, and refreshed some hardware as well. We're still in the basic scenario that we only have 3x singular authdns hosts in the world, but they're running with plenty of headroom in terms of handling query rate spikes and server outages. There's really two things holding us up on experimenting with lower TTLs for faster failover:

May 22 2019, 5:51 PM · Operations, Traffic
BBlack closed T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS as Resolved.

Scheme has been stable for ~1w now and seems to be working out fine. The net reduction in total authdns requests is ~32%. I suspect the drop in public requests for wiki hostnames is greater, as the total also includes all of our internal/infrastructure lookups as well, but either way we should be seeing far fewer DNS cache misses out there in the world, especially for longer-tail / less-popular project and language combinations.

May 22 2019, 5:42 PM · Performance-Team (Radar), Operations, Traffic
BBlack added a comment to T223921: GSuite Test Domain Verification.

The above is deployed. I'd wait a full 10 minutes from the time of this comment to re-test, in case they've negative-cached the previous lookup, then try again and let's see what happens.

May 22 2019, 5:35 PM · Operations, DNS, Traffic
BBlack added a comment to T223921: GSuite Test Domain Verification.

The context of the second token is that all of our canonical wiki domains, including wikimedia.org, already have persistent Google Site Verification TXT tokens so that we can manage Google Search stuff for our own domains on a different Google system.

May 22 2019, 5:31 PM · Operations, DNS, Traffic
BBlack added a comment to T223921: GSuite Test Domain Verification.

@HMarcus - The record is live, can you try the validation and let me know how it goes?

May 22 2019, 1:41 PM · Operations, DNS, Traffic

May 21 2019

BBlack added a comment to T222620: cp1083 crashed.

Nevermind, apparently it was already repooled, looking at the wrong thing here...

May 21 2019, 6:46 PM · Operations, ops-eqiad, Traffic
BBlack added a comment to T222620: cp1083 crashed.

It's been up for ~15 days now without incident, but depooled for frontend traffic. Re-pooling it today to see if we can get a recurrence or not.

May 21 2019, 6:42 PM · Operations, ops-eqiad, Traffic
BBlack added a comment to T224027: LVS interface settings from /e/n/i not consistently applied on first boots.

FWIW, lvs1016 came back with correct settings after the single additional reboot above.

May 21 2019, 6:03 PM · Operations, Traffic
BBlack added a comment to T184293: rack/setup/install lvs101[3-6].

Current status of transition:

May 21 2019, 5:58 PM · Operations, Traffic
BBlack triaged T224027: LVS interface settings from /e/n/i not consistently applied on first boots as Normal priority.
May 21 2019, 2:29 PM · Operations, Traffic

May 19 2019

BBlack added a comment to T184293: rack/setup/install lvs101[3-6].

Note https://gerrit.wikimedia.org/r/c/operations/puppet/+/511118 - I had to switch the lvs1015 cross-row ports for rows A and B (enp4s0f1 and enp5s0f0) backwards at the software level to match the physical reality shown by lldpcli show neighbors, which was backwards from the documented table of ports at the top of this task. The current config works and we can keep it if we want. Note that I didn't make any other related changes, so if we keep this config, we probably need to edit the software port labels in the switch configurations to match, and possibly any physical labeling in the DC, to avoid future confusion. Alternatively, before we put this machine in service, we could physically swap the cables back to the intended config at the rear of lvs1015, revert the mentioned puppet patch, and reimage the server again. Either way, there's probably some followup to do on this.

May 19 2019, 12:25 AM · Operations, Traffic

May 18 2019

BBlack created P8541 post-reimage bios settings warning.
May 18 2019, 11:59 PM · Traffic