BBlack (Brandon Black)
Engineering Manager, SRE Traffic Team

User Details

User Since
Nov 4 2014, 4:29 PM (300 w, 5 d)
Availability
Available
IRC Nick
bblack
LDAP User
BBlack
MediaWiki User
BBlack (WMF) [ Global Accounts ]

Recent Activity

Jul 7 2020

BBlack triaged T257326: Consolidate edge dnsbox servers into ganeti as Medium priority.
Jul 7 2020, 2:48 PM · Traffic, Operations
BBlack updated the task description for T257324: Consolidate edge bastion server into ganeti.
Jul 7 2020, 2:48 PM · Traffic, Operations
BBlack created T257326: Consolidate edge dnsbox servers into ganeti.
Jul 7 2020, 2:47 PM · Traffic, Operations
BBlack triaged T257324: Consolidate edge bastion server into ganeti as Medium priority.
Jul 7 2020, 2:43 PM · Traffic, Operations
BBlack created T257324: Consolidate edge bastion server into ganeti.
Jul 7 2020, 2:43 PM · Traffic, Operations
BBlack triaged T257323: Consolidate misc servers at edge sites as Medium priority.
Jul 7 2020, 2:41 PM · Operations, Traffic

Jun 30 2020

BBlack moved T254178: Fix recdns config on various hardware devices from Triage to DNS Infra on the Traffic board.
Jun 30 2020, 4:38 PM · DC-Ops, Traffic, Operations
BBlack moved T256655: Current codfw caches have wrong NVME format from Triage to Caching on the Traffic board.
Jun 30 2020, 4:38 PM · Operations, Traffic

Jun 29 2020

BBlack updated the task description for T256655: Current codfw caches have wrong NVME format.
Jun 29 2020, 4:47 PM · Operations, Traffic
BBlack updated the task description for T256655: Current codfw caches have wrong NVME format.
Jun 29 2020, 3:58 PM · Operations, Traffic
BBlack updated the task description for T256655: Current codfw caches have wrong NVME format.
Jun 29 2020, 3:57 PM · Operations, Traffic
BBlack triaged T256655: Current codfw caches have wrong NVME format as Low priority.
Jun 29 2020, 3:55 PM · Operations, Traffic

Jun 22 2020

BBlack added a comment to T255748: Netbox DNS change not effective in gdns.

+1 on the latest patch, looks like the right fix.

Jun 22 2020, 1:41 PM · DNS, Traffic, Operations, netbox

Jun 9 2020

BBlack added a comment to T254157: esams,ulsfo,eqsin: one VM request each for install_servers.

@ayounsi - Yes, we're going to have some outbound recursive DNS needs from some ganeti-hosted services

Jun 9 2020, 3:54 PM · Patch-For-Review, Operations, vm-requests
BBlack added a comment to T242767: EventStreams drops the connection after 15 minutes, which makes it unreliable.

Looking at that other ticket T250912 - would an in-band service ping or NOP event of some kind address detecting "hung" connections at the application layer? Does SSE have any kind of null requests or null events that could be sent for that purpose?

Jun 9 2020, 2:06 PM · Patch-For-Review, Traffic, Operations, Analytics-Kanban, Analytics, EventStreams
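
On the question above about null events: the Server-Sent Events format treats any line that begins with a colon as a comment, which clients ignore, so a server can emit one periodically as an in-band keepalive. A minimal sketch in Python; the function names and the count-based interval are illustrative assumptions, not anything EventStreams is known to implement:

```python
def sse_comment(text="ping"):
    """Format an SSE comment; clients discard lines starting with ':'."""
    return f": {text}\n\n"


def sse_event(data, event=None):
    """Format a regular SSE event with an optional event name."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # Each line of the payload needs its own "data:" field.
    lines.extend(f"data: {line}" for line in data.split("\n"))
    return "\n".join(lines) + "\n\n"


def with_heartbeats(events, every=3):
    """Yield formatted events, adding a comment after every `every` events.

    Hypothetical policy: a real server would more likely send the comment
    after a period of idle time rather than after a number of events.
    """
    for i, payload in enumerate(events, start=1):
        yield sse_event(payload)
        if i % every == 0:
            yield sse_comment()


if __name__ == "__main__":
    for chunk in with_heartbeats(["a", "b", "c", "d"], every=2):
        print(repr(chunk))
```

Because the comment is part of the stream itself, a client that stops seeing either events or comments for longer than the expected interval can treat the connection as hung and reconnect.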

Jun 8 2020

BBlack added a comment to T242767: EventStreams drops the connection after 15 minutes, which makes it unreliable.

[reordering a little]

What happens right now if someone had to download a large file from I dunno, commons or dumps.wikimedia.org on a slow connection, such that the download takes more than 15 minutes? Would they be disconnected and never be able to download the file?

Jun 8 2020, 5:00 PM · Patch-For-Review, Traffic, Operations, Analytics-Kanban, Analytics, EventStreams

Jun 4 2020

BBlack created P11401 wmflabs NS records....
Jun 4 2020, 3:42 PM

Jun 1 2020

BBlack created T254178: Fix recdns config on various hardware devices.
Jun 1 2020, 4:43 PM · DC-Ops, Traffic, Operations

May 26 2020

BBlack added a comment to T253666: Anycast: consistent routers->servers routing.

Even if we could experimentally verify option A, we probably can't trust it across future firmware differences (between sites, or between two routers in a site). Option B via MEDs sounds like a good path forward for now, though!

May 26 2020, 6:32 PM · Patch-For-Review, Traffic, Operations, netops

May 21 2020

BBlack added a project to T252227: Mobile redirects drop provenance parameters: Analytics.
May 21 2020, 6:31 PM · Analytics-Radar, Readers-Web-Backlog (Tracking), Traffic, Operations
BBlack updated subscribers of T252227: Mobile redirects drop provenance parameters.

Interestingly, the mobile redirect code in varnish doesn't strip any parameters. The problem is that the analytics-side VCL code that consumes the wprov parameter also removes it from the URL (so that only analytics sees it, but it doesn't affect caching or get sent to MediaWiki, etc), so it's not present anymore at the time the redirect happens.

May 21 2020, 6:24 PM · Analytics-Radar, Readers-Web-Backlog (Tracking), Traffic, Operations

May 20 2020

BBlack added a comment to T239993: Decom LVS recdns.

The kraz case is gone now (yay!) and hasn't recurred since the ircd restart above. What's left appears to be all infrastructure stuff: PDUs, switches, firewalls, etc. I've picked up quite a few of them in a few hours, so I'm going to let it run for a full 24h to try to capture them all, and then I'll make some sub-tasks to clean them up.

May 20 2020, 10:12 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T98006: Anycast AuthDNS.

(correction - it's also internet-reachable via ulsfo only for now, in this interim state, just by chance because it's still advertising the whole original /23)

May 20 2020, 6:23 PM · Patch-For-Review, Performance-Team (Radar), netops, Operations, Traffic
BBlack added a comment to T98006: Anycast AuthDNS.

Update: the nsa authdns IP at 198.35.27.27 is live internally everywhere and monitored and working. There's some stuff to finish up later this week for the public side in T253196, and then we can start really digging into the detailed bits about load balancing and hashing with a live testbed.

May 20 2020, 6:21 PM · Patch-For-Review, Performance-Team (Radar), netops, Operations, Traffic

May 19 2020

BBlack added a comment to T98006: Anycast AuthDNS.

Status Update: Worked through a bunch of the software-level complexities today with getting bird::anycast to advertise an authdns IP from all the auth and dnsbox hosts ( https://gerrit.wikimedia.org/r/#/q/topic:anycast-authdns+(status:open+OR+status:merged) ), and merged a DNS patch to set up a single anycast authdns IP in the unused /24 we had adjacent to ulsfo's space in: https://gerrit.wikimedia.org/r/#/c/operations/dns/+/597310/ . There's still one patch outstanding, which turns on the actual adverts and local healthchecks, at: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/597311/ .

May 19 2020, 9:09 PM · Patch-For-Review, Performance-Team (Radar), netops, Operations, Traffic
BBlack closed T241965: Use check_dns_query for anycast DNS checks as Resolved.
May 19 2020, 8:45 PM · Patch-For-Review, observability, Operations
BBlack closed T240863: Secure shared ticket key rotation for anycast authdns, a subtask of T98006: Anycast AuthDNS, as Declined.
May 19 2020, 4:43 PM · Patch-For-Review, Performance-Team (Radar), netops, Operations, Traffic
BBlack closed T240863: Secure shared ticket key rotation for anycast authdns as Declined.

There's not much DoTLS adoption so far, and really our primary HTTPS termination needs this more than AuthDNS does, at which point we can just copy whatever solution emerges there.

May 19 2020, 4:43 PM · Operations, Traffic

May 12 2020

BBlack added a comment to T252577: Maxmind data update issues for DNS (and others?).

Diving a little deeper on the symlink issue:

May 12 2020, 6:33 PM · Operations, Traffic
BBlack moved T252577: Maxmind data update issues for DNS (and others?) from Triage to DNS Infra on the Traffic board.
May 12 2020, 6:18 PM · Operations, Traffic
BBlack triaged T252577: Maxmind data update issues for DNS (and others?) as Medium priority.
May 12 2020, 6:16 PM · Operations, Traffic

May 5 2020

BBlack added a comment to T251726: Certificate *.wikipedia.org valid until 2020-06-20.

Even if we still use non-LE certs in some DCs, I believe this is ok since we should also have other monitoring for the expiration of that cert. We do, right?

May 5 2020, 1:31 PM · Traffic, serviceops, Operations

Apr 30 2020

BBlack added a comment to T133821: Make CDN purges reliable.

I like the root event timestamp info. We could potentially put in future rules to help by ignoring ancient purges, in some cases (e.g. if we can guarantee the cache's contents are no older than 24h, we can also ignore root events older than 24h, which might speed up replaying a backlog of data...).

Apr 30 2020, 1:44 PM · serviceops, Patch-For-Review, Sustainability (MediaWiki-MultiDC), Performance-Team (Radar), Operations, Traffic
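
The rule proposed above can be sketched as a filter over a purge backlog. This is only an illustration under stated assumptions: the `Purge` type, its `root_event_ts` field, and the 24h bound are hypothetical names and values, not the schema that was eventually defined.

```python
from dataclasses import dataclass

# Assumed guarantee: nothing in the cache is older than this many seconds.
MAX_CACHE_AGE_S = 24 * 3600


@dataclass
class Purge:
    url: str
    root_event_ts: float  # hypothetical field: epoch seconds of the root event


def still_relevant(purge, now, max_cache_age_s=MAX_CACHE_AGE_S):
    """A purge matters only if its root event could postdate a cached copy.

    If every cached object was fetched within the last max_cache_age_s
    seconds, a root event older than that predates all of them.
    """
    return now - purge.root_event_ts <= max_cache_age_s


def replay(backlog, now):
    """Return the purges worth replaying, dropping the ancient ones."""
    return [p for p in backlog if still_relevant(p, now)]


if __name__ == "__main__":
    now = 1_000_000.0
    backlog = [
        Purge("/wiki/Recent", now - 3600),       # 1h old: kept
        Purge("/wiki/Ancient", now - 25 * 3600)  # 25h old: skipped
    ]
    print([p.url for p in replay(backlog, now)])
```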

Apr 29 2020

BBlack updated subscribers of T250815: HTTP 400 Error when trying to save an edit on English Wikipedia: Error contacting the Parsoid/RESTBase server.

Yeah, the linked Lua code is, I think, trying to emulate what our VCL has always traditionally done similarly as:

Apr 29 2020, 5:29 PM · Patch-For-Review, Operations, Traffic, Platform Team Workboards (Clinic Duty Team), Parsoid, RESTBase
BBlack added a comment to T133821: Make CDN purges reliable.
  • Define a schema for a "url purge message".
Apr 29 2020, 4:11 PM · serviceops, Patch-For-Review, Sustainability (MediaWiki-MultiDC), Performance-Team (Radar), Operations, Traffic

Apr 17 2020

kolbert awarded T170567: Support TLSv1.3 a Mountain of Wealth token.
Apr 17 2020, 2:32 AM · Performance-Team (Radar), Wikimedia-Incident, Goal, Operations, Traffic

Apr 6 2020

BBlack added a comment to T247783: Remove "Cache-control: no-cache" hack from wmf-config.

I worry that removing a no-cache default might have all kinds of unintended consequences. We should probably do some research into what php outputs are even relying on this default and why.

Apr 6 2020, 4:13 PM · Operations, Traffic, serviceops, Wikimedia-Site-requests, Performance-Team, Technical-Debt

Apr 3 2020

BBlack added a comment to T249346: vhtcpd prometheus metrics broken; prometheus-vhtcpd-stats.py out-of-date with reality.

We're probably going to add multiple purger connections to fan out the per-thread load from T241232 to help with T249325. I've poked at that already, and the likely impact here is that the overall format will stay the same, but the Purger0 labels at the start of the lines will probably change, so don't rely on the PurgerN formatting of that leading label. I'm thinking right now that I'll give the purgers names in the config and then number them for the configured fanout factor, so we're likely to see something more like:

Apr 3 2020, 5:19 PM · Operations, Traffic, observability
BBlack added a comment to T249325: cache_text cluster consistently backlogged on purge requests.

Probably-related: T241232

Apr 3 2020, 3:58 PM · Wikimedia-Incident, Product-Infrastructure-Team-Backlog, Page Content Service, MediaWiki-Cache, Platform Team Workboards (Clinic Duty Team), Operations, Traffic
BBlack added a comment to T249346: vhtcpd prometheus metrics broken; prometheus-vhtcpd-stats.py out-of-date with reality.

I had a chat with the author to make sure we understand the meaning of the fields:

Apr 3 2020, 3:52 PM · Operations, Traffic, observability
BBlack added a comment to T249325: cache_text cluster consistently backlogged on purge requests.

So, the backend purging queues in esams are way behind. On the one node I'm staring at the most, there are currently about 87 million backlogged purge requests, which is probably somewhere in the ballpark of 10 hours of lag time. The backlog is in the local daemon on the esams hosts themselves (so this isn't a network issue with delivering the purges to the hosts over the WAN), so the culprit is likely the ATS daemon consuming them slowly.

Apr 3 2020, 3:16 PM · Wikimedia-Incident, Product-Infrastructure-Team-Backlog, Page Content Service, MediaWiki-Cache, Platform Team Workboards (Clinic Duty Team), Operations, Traffic
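
As a rough check on the figures in the comment above: if 87 million queued purges correspond to about 10 hours of lag, the implied average throughput is roughly 2,400 purges per second. The sketch below does that arithmetic and also estimates how long a backlog takes to drain; the consume and arrival rates in the usage example are made-up inputs, not measurements.

```python
def implied_rate_per_s(backlog, lag_hours):
    """Average throughput implied by a backlog size and its lag time."""
    return backlog / (lag_hours * 3600)


def drain_time_hours(backlog, consume_per_s, arrive_per_s):
    """Hours needed to clear the backlog while new purges keep arriving."""
    net = consume_per_s - arrive_per_s
    if net <= 0:
        return float("inf")  # the queue never drains
    return backlog / net / 3600


if __name__ == "__main__":
    print(round(implied_rate_per_s(87_000_000, 10)))  # about 2417 per second
    # Hypothetical rates, for illustration only:
    print(drain_time_hours(87_000_000, consume_per_s=3000, arrive_per_s=2000))
```

The second function makes the operational point explicit: the backlog only shrinks when the consumer is faster than the arrival rate, and even a comfortable margin can take many hours to work through tens of millions of entries.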
BBlack added a comment to T249325: cache_text cluster consistently backlogged on purge requests.
bblack@cumin1001:~$ sudo cumin A:cp-text 'curl -s https://en.wikipedia.org/static/images/project-logos/cswiki.png|md5sum'
36 hosts will be targeted:
cp[2027,2029,2031,2033,2035,2037,2039,2041].codfw.wmnet,cp[1075,1077,1079,1081,1083,1085,1087,1089].eqiad.wmnet,cp[5007-5012].eqsin.wmnet,cp[3050,3052,3054,3056,3058,3060,3062,3064].esams.wmnet,cp[4027-4032].ulsfo.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(8) cp[3050,3052,3054,3056,3058,3060,3062,3064].esams.wmnet
----- OUTPUT of 'curl -s https://...swiki.png|md5sum' -----
d67c283275f6441b49458077ea9e22ed  -
===== NODE GROUP =====
(28) cp[2027,2029,2031,2033,2035,2037,2039,2041].codfw.wmnet,cp[1075,1077,1079,1081,1083,1085,1087,1089].eqiad.wmnet,cp[5007-5012].eqsin.wmnet,cp[4027-4032].ulsfo.wmnet
----- OUTPUT of 'curl -s https://...swiki.png|md5sum' -----
de8be0dfd83481f739a05e837b054b90  -
Apr 3 2020, 2:47 PM · Wikimedia-Incident, Product-Infrastructure-Team-Backlog, Page Content Service, MediaWiki-Cache, Platform Team Workboards (Clinic Duty Team), Operations, Traffic
BBlack added a comment to T249325: cache_text cluster consistently backlogged on purge requests.

Yeah, the varnish (frontend) code for this is in modules/varnish/templates/text-frontend.inc.vcl.erb:

# normalize all /static to the same hostname for caching
if (req.url ~ "^/static/") { set req.http.host = "<%= @vcl_config.fetch("static_host") %>"; }
Apr 3 2020, 2:39 PM · Wikimedia-Incident, Product-Infrastructure-Team-Backlog, Page Content Service, MediaWiki-Cache, Platform Team Workboards (Clinic Duty Team), Operations, Traffic
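
To see why the normalization in the VCL above matters for purging, here is a small Python sketch of the same idea applied to a cache key. The `STATIC_HOST` value is an assumption standing in for whatever the template's `static_host` setting resolves to, and `cache_key` is a hypothetical helper, not part of Varnish or ATS.

```python
# Assumption: stand-in for the value of the template's static_host setting.
STATIC_HOST = "en.wikipedia.org"


def cache_key(host, url, static_host=STATIC_HOST):
    """Build a (host, url) cache key, folding /static/ onto one hostname."""
    if url.startswith("/static/"):
        host = static_host
    return (host, url)


if __name__ == "__main__":
    logo = "/static/images/project-logos/cswiki.png"
    # Every wiki's request for the same /static/ path shares one cache entry.
    print(cache_key("cs.wikipedia.org", logo) == cache_key("en.wikipedia.org", logo))
    # Other paths stay per-host.
    print(cache_key("cs.wikipedia.org", "/wiki/Foo"))
```

If one cache layer applies this rule and another does not, a purge addressed to the normalized key will miss objects stored under the original hostname in the layer that skipped it.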
BBlack added a comment to T249325: cache_text cluster consistently backlogged on purge requests.

Most likely the cause is that the Varnish rule for normalizing /static/ to the enwiki hostname hasn't been applied to ATS, and thus this isn't effectively purging /static from the ATS backend caches. Will check...

Apr 3 2020, 2:33 PM · Wikimedia-Incident, Product-Infrastructure-Team-Backlog, Page Content Service, MediaWiki-Cache, Platform Team Workboards (Clinic Duty Team), Operations, Traffic

Mar 18 2020

BBlack closed T104681: HTTPS Plans (tracking / high-level info) as Resolved.

Resolving this, since it has become an undead tracker for too long. There are still two trailing issues, but having this over-arching tracking task above them isn't accomplishing anything:

Mar 18 2020, 6:54 PM · Tracking-Neverending, Operations, Traffic, HTTPS
BBlack placed T107236: Switch port 80 to nginx on primary clusters up for grabs.
Mar 18 2020, 6:47 PM · Operations, Traffic
BBlack closed T107236: Switch port 80 to nginx on primary clusters, a subtask of T108827: Investigate TCP Fast Open for tlsproxy, as Declined.
Mar 18 2020, 6:44 PM · Patch-For-Review, Operations, Traffic
BBlack closed T107236: Switch port 80 to nginx on primary clusters as Declined.

We're not using nginx software for this functionality anymore, and everything else related to these parts of the software stack has changed and is still evolving, so this task doesn't make a ton of sense anymore as it stands.

Mar 18 2020, 6:44 PM · Operations, Traffic
BBlack updated the task description for T128559: Enable HSTS on store.wikimedia.org for HTTPS.
Mar 18 2020, 6:34 PM · Operations, Traffic, Wikimedia-Shop, HTTPS
BBlack updated the task description for T128559: Enable HSTS on store.wikimedia.org for HTTPS.
Mar 18 2020, 6:26 PM · Operations, Traffic, Wikimedia-Shop, HTTPS
BBlack added a comment to T128559: Enable HSTS on store.wikimedia.org for HTTPS.

Update: sometime since I last checked, they've changed the header to: strict-transport-security: max-age=31557600 (~1 year, vs ~90 days before). Still missing the other attributes (preload and includeSubDomains)...

Mar 18 2020, 6:25 PM · Operations, Traffic, Wikimedia-Shop, HTTPS
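
The header check described above is easy to automate. A minimal sketch that parses a Strict-Transport-Security value and reports which of the desired attributes are absent; the helper names are hypothetical, and the parsing is deliberately simple rather than a full implementation of the header grammar.

```python
def parse_hsts(value):
    """Parse an HSTS header value into a dict of lower-cased directives."""
    directives = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        # Valueless directives (preload, includeSubDomains) map to True.
        directives[name.strip().lower()] = arg.strip().strip('"') or True
    return directives


def missing_attributes(value, wanted=("includesubdomains", "preload")):
    """List the wanted directives that the header does not carry."""
    present = parse_hsts(value)
    return [name for name in wanted if name not in present]


if __name__ == "__main__":
    header = "max-age=31557600"
    days = int(parse_hsts(header)["max-age"]) / 86400
    print(days)                        # 365.25 days, i.e. about one year
    print(missing_attributes(header))  # both attributes are still missing
```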

Feb 27 2020

BBlack added a comment to T242374: Set up git-driven static microsite for wikiworkshop.org.

Sorry, I missed this update when it flew through on a busy week. Will do today!

Feb 27 2020, 5:10 PM · Research, Operations, Traffic

Feb 18 2020

BBlack added a comment to T156955: Standardizing our partman recipes.

After the recent merge of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/571928/ - I'm having installer failure on dns2001 (we did its sibling dns2002 a few days before the merge without issue). It breaks out to prompting with the installer GUI for netmask stuff and then disk partitioning. I saved the debug stuff over HTTP to cumin1001.eqiad.wmnet:~bblack/dns2001-logs/ ... we can live with it for a little bit, so leaving things as-is for debugging...

Feb 18 2020, 8:15 PM · Patch-For-Review, User-fgiunchedi, Operations
BBlack created P10450 gdnsd strange issue.
Feb 18 2020, 2:29 PM · Traffic

Feb 11 2020

BBlack updated the task description for T242374: Set up git-driven static microsite for wikiworkshop.org.
Feb 11 2020, 2:01 PM · Research, Operations, Traffic
BBlack added a comment to T242374: Set up git-driven static microsite for wikiworkshop.org.

I think all that's left on our side is a DNS switch, which is pretty trivial and quick.

Feb 11 2020, 1:59 PM · Research, Operations, Traffic

Feb 10 2020

BBlack reassigned T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems from BBlack to RobH.
Feb 10 2020, 6:45 PM · DC-Ops, Traffic, ops-esams, Operations
BBlack added a comment to T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems.

@RobH - Yes, let's edit this to include eqiad as well. We've had the same symptoms both places, and they're the same approximate generation of hardware configuration (IIRC, only the NVMe changed to a slightly newer/better model from eqiad to esams, but the base system is otherwise fairly identical).

Feb 10 2020, 6:44 PM · DC-Ops, Traffic, ops-esams, Operations

Feb 7 2020

BBlack added a comment to P10348 (An Untitled Masterwork).

Everything but the first few rows (system/repl stuff) and with all the Sleep rows removed:

Feb 7 2020, 5:05 PM

Feb 6 2020

BBlack created P10324 HTTP vs HTTPS local latency.
Feb 6 2020, 7:24 PM · Traffic
BBlack created P10323 ats-tls latency benchmarking.
Feb 6 2020, 6:38 PM · Traffic
BBlack added a comment to T165765: Refactor pybal/LVS config for shared failover.

Status update: what's missing here is codfw, which will happen when we finish its hardware upgrade switch to lvs2007-10

Feb 6 2020, 4:38 PM · Performance-Team (Radar), Patch-For-Review, Operations, Traffic

Jan 31 2020

BBlack added a comment to T243634: ulsfo varnish-fe vcache processes overflow on FDs.

We also tested depooling just port 80 yesterday, which didn't affect anything (fd leak was still growing), which means this isn't driven by external->:80 traffic. cp4029 was at ~400K fds this morning, so I depooled its frontend services and restarted ats-tls + varnish-fe to reset the clock. I also repooled cp4032 while I was in there.

Jan 31 2020, 6:32 PM · Operations, Traffic

Jan 17 2020

BBlack added a comment to T242374: Set up git-driven static microsite for wikiworkshop.org.

Most of this has been configured now; the remaining slightly difficult bit is configuring an alternate SNI cert for the domain on our new ats-tls termination.

Jan 17 2020, 3:08 PM · Research, Operations, Traffic
BBlack updated the task description for T242374: Set up git-driven static microsite for wikiworkshop.org.
Jan 17 2020, 3:07 PM · Research, Operations, Traffic

Jan 14 2020

BBlack added a comment to T230638: Move old transparency report pages to historical URLs and setup redirect.

Yes, more or less. The major caveat is some of our caches still have non-redirecting copies of various pages in https://transparency.wikimedia.org/, but this will sort itself out over the next day or so at the most. To save anyone from trawling through the list of commits above, the changes in effect now are:

Jan 14 2020, 6:14 PM · Patch-For-Review, serviceops, Operations, WMF-Legal
BBlack added a comment to T190090: Offload pings to dedicated server.

+1 from me, this was one of the many things we made the ganeti clusters for :)

Jan 14 2020, 1:37 PM · Patch-For-Review, netops, Traffic, Operations

Jan 13 2020

BBlack added a comment to T242602: Sort out plan for install* servers in edge sites.

Seems like a good plan!

Jan 13 2020, 1:59 PM · Patch-For-Review, Operations

Jan 9 2020

BBlack closed T240303: Add wikiworkshop.org to the Foundation's DNS as Resolved.
Jan 9 2020, 9:32 PM · Research, Traffic, Operations, DNS
BBlack closed T240303: Add wikiworkshop.org to the Foundation's DNS, a subtask of T242374: Set up git-driven static microsite for wikiworkshop.org, as Resolved.
Jan 9 2020, 9:32 PM · Research, Operations, Traffic
BBlack added a parent task for T240303: Add wikiworkshop.org to the Foundation's DNS: T242374: Set up git-driven static microsite for wikiworkshop.org.
Jan 9 2020, 9:32 PM · Research, Traffic, DNS, Operations
BBlack added a subtask for T242374: Set up git-driven static microsite for wikiworkshop.org: T240303: Add wikiworkshop.org to the Foundation's DNS.
Jan 9 2020, 9:31 PM · Research, Operations, Traffic
BBlack triaged T242374: Set up git-driven static microsite for wikiworkshop.org as Medium priority.
Jan 9 2020, 9:31 PM · Research, Operations, Traffic

Jan 8 2020

BBlack added a comment to T242200: Docker registry needs cache to vary on Accept header value.

So long as the registry's responses do all the standards-based things correctly (they contain Vary: Accept, and the matching Accept values also match the Content-Type values in the responses), this should Just Work on a functional level.

Jan 8 2020, 1:15 PM · Traffic, Operations
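
How a cache honors Vary: Accept can be sketched as follows: the request headers named in the response's Vary header become part of the cache key, so requests with different Accept values get separate entries. The `cache_key` helper is hypothetical and greatly simplified compared to a real HTTP cache; the URL and media types in the usage example are only illustrative inputs.

```python
def cache_key(url, request_headers, vary):
    """Build a cache key from the URL plus the request headers named in Vary."""
    names = [h.strip().lower() for h in vary.split(",") if h.strip()]
    req = {k.lower(): v for k, v in request_headers.items()}
    # A missing request header is treated as an empty value.
    return (url,) + tuple((n, req.get(n, "")) for n in sorted(names))


if __name__ == "__main__":
    url = "/v2/example/manifests/latest"  # illustrative path
    v2 = {"Accept": "application/vnd.docker.distribution.manifest.v2+json"}
    v1 = {"Accept": "application/vnd.docker.distribution.manifest.v1+json"}
    # Different Accept values land in different cache entries.
    print(cache_key(url, v2, "Accept") != cache_key(url, v1, "Accept"))
    # Without Vary, both requests would collide on the same entry.
    print(cache_key(url, v2, "") == cache_key(url, v1, ""))
```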

Dec 18 2019

BBlack added a comment to T240813: HTTPS/Browser Recommendations page on Wikitech is outdated.

The wording issues here are actually a bit tricky. We've done several TLS standards upgrades over time, and there are still a few to go:

Dec 18 2019, 3:59 PM · Operations, Traffic
BBlack added a comment to T240794: /sec-warning page: please add an HTML comment that is more easily visible to API and transport-level inspection/debugging.

Or a patch to template this in. The problem is it's implemented from a standard template for the top 30-40 lines, which isn't specific to this case, in an attempt to standardize our error output templates.

Dec 18 2019, 3:11 PM · Operations, Traffic

Dec 16 2019

BBlack closed T239994: Implement DNS-over-TLS for AuthDNS as Resolved.
Dec 16 2019, 11:38 PM · Operations, Traffic
BBlack added a comment to T239994: Implement DNS-over-TLS for AuthDNS.

External queries now working (note they all return a codfw IP without edns-client-subnet in play, because codfw is closest to my laptop and PROXYv2 is working for sending the "real" client IP from haproxy to gdnsd).

Dec 16 2019, 11:37 PM · Operations, Traffic
BBlack added a comment to T239994: Implement DNS-over-TLS for AuthDNS.

Actually we can't realistically do global monitoring from icinga either, because icinga isn't on Buster and so it doesn't have the right library/tool access to check a TLSv1.3-only service, so we'll have to settle for the per-server NRPE checks for now.

Dec 16 2019, 11:34 PM · Operations, Traffic
BBlack added a comment to T239994: Implement DNS-over-TLS for AuthDNS.

Refactoring the dependencies a little here: Really (2) above's sub-point about shared ticket key rotation won't matter until we're anycasting, so I've made a separate task (+subtask) in T240863 to go look at that stuff later, blocking the anycast work.

Dec 16 2019, 5:09 PM · Operations, Traffic
BBlack created T240866: Create a system for distributed shared secret material to server tmps.
Dec 16 2019, 3:00 PM · Operations, Traffic
BBlack created T240863: Secure shared ticket key rotation for anycast authdns.
Dec 16 2019, 2:53 PM · Operations, Traffic

Dec 13 2019

BBlack added a comment to T240303: Add wikiworkshop.org to the Foundation's DNS.

All of this is irregular and outside of policies we like to adhere to, but I'll push a zonefile to our nameservers which supports the bare minimum (existing Stanford-hosted IPs for the insecure site http://wikiworkshop.org and the same IP for redirects from http://www.wikiworkshop.org, and nothing else). At some point after the holidays are over, I'd like to find out what the overall intent and/or plan is here so that we can provide some additional guidance and get this onto some kind of more-acceptable path though.

Dec 13 2019, 2:25 AM · Research, Traffic, DNS, Operations
BBlack moved T240303: Add wikiworkshop.org to the Foundation's DNS from Triage to DNS Names on the Traffic board.
Dec 13 2019, 2:17 AM · Research, Traffic, DNS, Operations

Dec 12 2019

BBlack added a comment to T239994: Implement DNS-over-TLS for AuthDNS.

This is now mostly-working, with a hiera flag controlling test deployment (currently only on dns4002, which doesn't have any public authserver IPs routed into it at this time).

Dec 12 2019, 11:10 PM · Operations, Traffic
BBlack added a comment to T239994: Implement DNS-over-TLS for AuthDNS.

P9867 <- First internal test query on a prod dns box :)

Dec 12 2019, 9:27 PM · Operations, Traffic
BBlack created P9867 AuthDNS-over-TLS.
Dec 12 2019, 9:23 PM · Traffic
BBlack triaged T240614: Fix acme-chief DNS validation correctly as High priority.
Dec 12 2019, 8:43 PM · Operations, Traffic
BBlack updated subscribers of T238494: 15% response start regression as of 2019-11-11 (Varnish->ATS).
Dec 12 2019, 3:24 PM · Wikimedia-Incident, Patch-For-Review, Performance-Team, Traffic, Operations
BBlack added a comment to T240497: API Querying for XML/JSON, you might get the Browser Connection Security warning HTML page (which is invalid XML).

I'm not even sure what the task is asking for, but yeah in general we're not going to make the sec-warning mechanism comply with all expected valid outputs from all possible APIs/URIs it's covering. It's designed to break things, in a way that at least provides some level of human info on what's going on if someone digs in and looks. The next step in the transition process after this is that whatever agent they're using which is getting the sec-warning output won't be able to establish a connection to our infrastructure at all, which is way more broken than this.

Dec 12 2019, 1:46 PM · Traffic, Operations
BBlack added a comment to T238038: Start warning and deprecation process for all legacy TLS.

BTW, we no longer have the cipher stats grafana board? Too bad, that one was hella interesting.

Dec 12 2019, 1:34 PM · Operations, Traffic
BBlack added a comment to T240497: API Querying for XML/JSON, you might get the Browser Connection Security warning HTML page (which is invalid XML).

The way it works is that if the connection isn't using TLSv1.2, the user is served a 302 redirect to /sec-warning on the same domain, which in turn returns a cacheable 200 OK with the HTML warning content and the CT header as text/html; charset=utf-8. There are a lot of gory details in the compromises being made by that solution (vs., e.g., returning some kind of 4xx error immediately rather than 302->200), but we've learned this is the best pattern to avoid misbehavior of certain bots and scrapers out there in the world which spam-retry [45]xx return codes.

Dec 12 2019, 1:29 PM · Traffic, Operations
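
The 302->200 flow described above can be modeled as a small decision function. This is a sketch under assumptions, not the production VCL: it reads "isn't using TLSv1.2" as "negotiated a version older than 1.2", it serves the warning page to any client that asks for it, and the `respond` helper and its return shape are hypothetical.

```python
SEC_WARNING_PATH = "/sec-warning"


def respond(tls_version, path):
    """Return (status, headers) for the warning flow, or None to proceed.

    tls_version is a (major, minor) tuple such as (1, 0) or (1, 2).
    Assumption: anything older than TLS 1.2 counts as legacy.
    """
    if path == SEC_WARNING_PATH:
        # The warning page itself is a plain 200 with HTML content.
        return 200, {"Content-Type": "text/html; charset=utf-8"}
    if tls_version < (1, 2):
        # Legacy clients are redirected rather than given a 4xx/5xx,
        # which some bots and scrapers would retry aggressively.
        return 302, {"Location": SEC_WARNING_PATH}
    return None  # modern client: handle the request normally


if __name__ == "__main__":
    print(respond((1, 0), "/w/api.php"))    # redirected to the warning
    print(respond((1, 0), "/sec-warning"))  # warning page, 200
    print(respond((1, 2), "/w/api.php"))    # None: normal handling
```

This also shows why an API client expecting XML or JSON ends up with HTML: after following the redirect it receives the warning page's Content-Type, regardless of what the original URL would have returned.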
BBlack added a comment to T240495: investigate making 'notrack' the default on our ferm rules.

Yes, it's about that $notrack default. My hypothesis is that setting it to true wouldn't break any traffic, wouldn't change the security situation much, but would eliminate a bunch of potential for conntrack table size issues when various services get overwhelmed. Some thoughts about why that hypothesis might be false:

Dec 12 2019, 12:04 PM · Operations

Dec 10 2019

BBlack added a comment to T239993: Decom LVS recdns.

Status: The actual LVS portion of this is now completely removed globally. The IP addresses themselves are also completely unconfigured and removed from service at all the edge sites, but not the core ones. What remains is that the legacy LVS recdns IPs 208.80.154.254 (eqiad) and 208.80.153.254 (codfw) are still statically-configured to avoid breaking any of the leftover dependencies on these IPs. Sniffer monitoring has shown at least the ircd instance on kraz is still using outdated resolv.conf data and hitting these IPs, several hardware PDUs are using them as well, and there are possibly other such cases which are rarer and thus harder to observe in short samples (I've done up to 1h samples).

Dec 10 2019, 6:10 PM · Patch-For-Review, Operations, Traffic
BBlack created P9846 authdns config.
Dec 10 2019, 3:30 PM
BBlack created P9845 The cacheable misses from v-fe with session cookies....
Dec 10 2019, 3:06 PM · Traffic
BBlack created P9844 Cookie/Vary request side.
Dec 10 2019, 2:22 PM · Traffic
BBlack moved T240285: Clean up DNS server puppetization from Triage to DNS Infra on the Traffic board.
Dec 10 2019, 1:07 PM · Operations, Traffic
BBlack added a comment to T240303: Add wikiworkshop.org to the Foundation's DNS.

I'm assuming that, for now, the hosting of the web service (and email?) is not moving, just the whois ownership and DNS service? We usually need a fair bit more information than this to handle such a case smoothly. At a glance it looks like there's potentially more to this (e.g. they have MX and SPF records; are there also DMARC and such we need to copy?). Also, basic TLS doesn't seem to work on the target site, either. Is there a project-level task or something for whatever transition is happening here?

Dec 10 2019, 3:17 AM · Research, Traffic, Operations, DNS

Dec 9 2019

BBlack added a parent task for T240285: Clean up DNS server puppetization: T98006: Anycast AuthDNS.
Dec 9 2019, 10:27 PM · Operations, Traffic