BBlack (Brandon Black)
Principal Site Reliability Engineer, SRE Traffic Team

User Details

User Since
Nov 4 2014, 4:29 PM (359 w, 3 d)
Availability
Available
IRC Nick
bblack
LDAP User
BBlack
MediaWiki User
BBlack (WMF) [ Global Accounts ]

Recent Activity

Yesterday

BBlack added a comment to T251732: wikiworkshop.org has Facebook button, external statcounter, https to http redirect.

@Vgutierrez This site was set up by Brandon. Could you maybe ask him about that last question?

Fri, Sep 24, 4:22 PM · Patch-For-Review, Security-Team, Privacy, Research, Privacy Engineering, Traffic, SRE
BBlack closed T283061: Let's Encrypt chain size increase, a subtask of T283164: Let's Encrypt issuance chains update, as Resolved.
Fri, Sep 24, 2:04 PM · SRE, Traffic
BBlack closed T283061: Let's Encrypt chain size increase as Resolved.

We ended up deciding some time back that the size increase here was both within reason (in network packet / IW10 sorts of senses), and that the compatibility level it affords was necessary. Belatedly closing up this task for now!

Fri, Sep 24, 2:04 PM · Traffic, Acme-chief
BBlack added a comment to T283164: Let's Encrypt issuance chains update.

This is also being covered (for the public-facing side of things) in https://wikitech.wikimedia.org/wiki/HTTPS/2021_Let%27s_Encrypt_root_expiry , which Johan has kindly copied out to the upcoming Monday Tech News and to https://meta.wikimedia.org/wiki/HTTPS/2021_Let%27s_Encrypt_root_expiry for potential translations!

Fri, Sep 24, 2:03 PM · SRE, Traffic

Mon, Sep 20

BBlack added a comment to T289536: Deploy durum: check service for Wikidough.

Thanks for the clarity, makes a lot of sense!

Mon, Sep 20, 12:40 PM · Patch-For-Review, SRE, Traffic

Thu, Sep 16

BBlack added a comment to T291148: VarnishTrafficDrop alert false positives due to DCs depooled.

The icinga version of this check solved this by including an additional term in the prometheus query that causes a null result if the absolute traffic level (before the drop) is below 15K rps. The AM version doesn't have this term (perhaps because it can't handle that scheme and needs a real value?). I'm not sure if that's the best solution or if the cutoff is exactly where it should be, but I'm looking into it as I go through AM stuff.
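
As a rough illustration of the guard described in this comment, here is a minimal Python sketch of the logic (not the actual prometheus query, which is not shown in the task excerpt). The 15K rps cutoff comes from the comment; the function names and the 50% drop threshold are hypothetical.

```python
from typing import Optional

MIN_BASELINE_RPS = 15_000  # cutoff quoted in the comment above
DROP_THRESHOLD = 0.5       # hypothetical: alert when traffic falls by half or more


def traffic_drop_ratio(baseline_rps: float, current_rps: float) -> Optional[float]:
    """Return the fractional drop, or None (a "null result") when the
    pre-drop traffic level is too low for the check to be meaningful."""
    if baseline_rps < MIN_BASELINE_RPS:
        return None
    return (baseline_rps - current_rps) / baseline_rps


def should_alert(baseline_rps: float, current_rps: float) -> bool:
    """Alert only when a ratio exists and it meets the threshold."""
    ratio = traffic_drop_ratio(baseline_rps, current_rps)
    return ratio is not None and ratio >= DROP_THRESHOLD
```

With this shape, a depooled DC whose baseline is already far below the cutoff never produces a value to compare, so it cannot trigger a false positive.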

Thu, Sep 16, 12:28 PM · SRE Observability (FY2021/2022-Q2), SRE, Traffic

Wed, Sep 1

BBlack added a comment to T289787: Clean up Traffic tag/workboard.

Traffic-Icebox now exists as a new tag with a process-informative description (click it and read!). I've bulk (+silent) moved all open Traffic tickets which had no activity for >= 6 months over to it as a first easy step, which moved 232 of the 379 open tasks (~61%). The moves contained an automated comment to help reduce any confusion, I hope. See example here: T81605#7327085 .
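
The selection rule described here (open tasks with no activity for at least 6 months) could be sketched roughly as follows. This is only an illustration, not the tooling actually used for the bulk move: the task IDs are hypothetical, and "6 months" is approximated as 182 days.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

STALE_AFTER = timedelta(days=182)  # assumption: "6 months" approximated as 182 days


def split_stale(last_activity: Dict[str, datetime],
                now: datetime) -> Tuple[List[str], List[str]]:
    """Split task IDs into (stale, active) based on their last activity time."""
    stale = {task for task, ts in last_activity.items() if now - ts >= STALE_AFTER}
    active = set(last_activity) - stale
    return sorted(stale), sorted(active)


def percent(part: int, whole: int) -> int:
    """Share of `part` in `whole`, rounded to a whole percent."""
    return round(100 * part / whole)
```

Applied to the figures in the comment, `percent(232, 379)` gives 61, matching the quoted ~61%.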

Wed, Sep 1, 11:40 PM · PM, SRE, Traffic
BBlack edited Description on Traffic-Icebox.
Wed, Sep 1, 10:47 PM
BBlack edited Description on Traffic-Icebox.
Wed, Sep 1, 7:40 PM
BBlack updated BBlack.
Wed, Sep 1, 6:41 PM
BBlack created Traffic-Icebox.
Wed, Sep 1, 6:37 PM

Thu, Aug 26

BBlack added a comment to T289787: Clean up Traffic tag/workboard.

I'm sorry if it sounds stupid or you already considered it but for the sake of being consistent with most other teams. You can have Traffic-team for tracking the ongoing work and Traffic staying as it is (which would make the transition much easier, you probably need to change to project to tag-type though). If you think this doesn't sound good, sorry for spamming here.

Thu, Aug 26, 4:34 PM · PM, SRE, Traffic
Aklapper awarded T289787: Clean up Traffic tag/workboard a Love token.
Thu, Aug 26, 3:56 PM · PM, SRE, Traffic
BBlack triaged T289787: Clean up Traffic tag/workboard as Medium priority.
Thu, Aug 26, 3:45 PM · PM, SRE, Traffic

Aug 11 2021

BBlack closed T281428: "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra as Resolved.
Aug 11 2021, 6:39 PM · Wikimedia Enterprise (Okapi Wikimedia Enterprise), Patch-For-Review, SRE, Traffic

Jul 2 2021

BBlack added a comment to T286032: Switch buffer re-partition - Eqiad Row A.

Traffic-related bits:

  • dns1001 will need a manual depool so that it doesn't have knock-on effects on all of the other clusters/software in eqiad in other rows. Depool instructions are at: https://wikitech.wikimedia.org/wiki/Anycast#How_to_temporarily_depool_a_server
  • The cp servers should be fine (as we have no user traffic flowing in and most other services that would loop through it internally are running in codfw-only now), but they can easily be preemptively depooled to make things smoother and safer. The simplest way to remember how is just to execute "depool" as root on the affected cp nodes themselves before the switch changes, and "pool" after it's complete.
  • lvs1013 should probably be taken offline by disabling puppet and stopping pybal. All the LVSes connect to all services in all rows as routers and to some degree any true impact should be covered by other services dealing with things at their own layer(s), but explicitly depooling it before it loses its primary host interface is probably a smart idea!
Jul 2 2021, 1:19 PM · Traffic, cloud-services-team (Kanban), DBA, Infrastructure-Foundations, SRE, netops

Jun 29 2021

BBlack created P16740 mw2 CPUs.
Jun 29 2021, 2:50 PM

Jun 8 2021

BBlack added a comment to T284555: Consider using BindsTo instead of Requires to declare dependencies between systemd unit.

When we looked into this for the Bird-based anycast stuff, we found that the combination you want for strong service binding is both BindsTo= + After= on the same underlying service (cf https://www.freedesktop.org/software/systemd/man/systemd.unit.html ).

Jun 8 2021, 1:36 PM · SRE, serviceops, VPS-project-Codesearch, Traffic

Jun 4 2021

BBlack added a comment to T278729: cp1087 down with hardware issues.

rsyslogd was down for repeatedly segfaulting on startup. I was able to strace the failure and see that it kept segfaulting while reading one of its own files in /var/spool/rsyslog/ on startup, which was probably corrupted somehow during a prior crash. Deleting the spool files let rsyslog start up properly, but I think at this point we're better off reimaging instead of waiting to find (or never find) some other more-subtle corruption.

Jun 4 2021, 7:20 PM · ops-eqiad, SRE, Traffic

Jun 3 2021

BBlack added a comment to T281428: "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra.

@RBrounley_WMF I think he's waiting on me, sorry! Will sync up with him

Jun 3 2021, 8:50 PM · Wikimedia Enterprise (Okapi Wikimedia Enterprise), Patch-For-Review, SRE, Traffic

May 20 2021

BBlack added a comment to T283165: OpenSSL < 1.1.0 compatibility issues with new LE issuance chain.

bump for testing purposes

May 20 2021, 12:42 PM · Patch-For-Review, Infrastructure-Foundations, SRE, Traffic

May 19 2021

BBlack added a comment to T281428: "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra.

[...] now we're ready to check/peek of what we have at the AWS Route53.

May 19 2021, 4:33 PM · Wikimedia Enterprise (Okapi Wikimedia Enterprise), Patch-For-Review, SRE, Traffic

May 13 2021

BBlack updated the task description for T282806: Port traffic/netops grafana alerts to AlertManager.
May 13 2021, 6:13 PM · Patch-For-Review, Traffic, User-fgiunchedi, observability
BBlack created T282806: Port traffic/netops grafana alerts to AlertManager.
May 13 2021, 6:10 PM · Patch-For-Review, Traffic, User-fgiunchedi, observability
BBlack added a parent task for T282787: Configure dns and puppet repositories for new drmrs datacenter: Unknown Object (Task).
May 13 2021, 2:55 PM · Patch-For-Review, SRE, Traffic
Ladsgroup awarded T282787: Configure dns and puppet repositories for new drmrs datacenter a Love token.
May 13 2021, 2:37 PM · Patch-For-Review, SRE, Traffic
BBlack updated the task description for T282787: Configure dns and puppet repositories for new drmrs datacenter.
May 13 2021, 2:24 PM · Patch-For-Review, SRE, Traffic
BBlack updated the task description for T282787: Configure dns and puppet repositories for new drmrs datacenter.
May 13 2021, 2:22 PM · Patch-For-Review, SRE, Traffic
BBlack triaged T282787: Configure dns and puppet repositories for new drmrs datacenter as Medium priority.
May 13 2021, 2:10 PM · Patch-For-Review, SRE, Traffic

May 11 2021

BBlack added a comment to T281428: "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra.

A lot of what's in that zonefile of course will change for the new DNS setup, or is irrelevant to any smooth transition, etc. The key parts to highlight:

May 11 2021, 4:55 PM · Wikimedia Enterprise (Okapi Wikimedia Enterprise), Patch-For-Review, SRE, Traffic
BBlack added a project to T281428: "wikimedia.com" DNS transfer to Wikimedia Enterprise's AWS infra: Traffic.
May 11 2021, 4:42 PM · Wikimedia Enterprise (Okapi Wikimedia Enterprise), Patch-For-Review, SRE, Traffic

May 6 2021

BBlack updated the task description for T281135: codfw: Relocate servers in 10G racks .
May 6 2021, 7:20 PM · Data-Persistence (Consultation), serviceops, SRE, ops-codfw

May 5 2021

BBlack closed T275046: provision more machines for eqsin caches as Resolved.

These are all pooled now and slowly filling their caches. Optimistically closing this task for now!

May 5 2021, 7:03 PM · SRE, Traffic
BBlack closed T275046: provision more machines for eqsin caches, a subtask of T274888: cp_upload @ eqsin cascading failures, February 2021, as Resolved.
May 5 2021, 7:03 PM · Patch-For-Review, SRE, Traffic
BBlack added a comment to T275046: provision more machines for eqsin caches.

The others were in the same state. All are fixed and rebooted now, icinga downtimes are removed, netbox status is set to Active, and confctl weights are set correctly, but the pooled attribute is still set to inactive.

May 5 2021, 6:34 PM · SRE, Traffic
BBlack added a comment to T275046: provision more machines for eqsin caches.

I checked the BIOS/iDRAC settings on cp5013 against https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_Documentation#Initial_System_Setup (+ the one custom setting we use on these modern cps, which is to disable the unused onboard NICs), and ended up making these 3 changes to bring it into conformance:

May 5 2021, 5:51 PM · SRE, Traffic
BBlack added a comment to T275046: provision more machines for eqsin caches.

These are just about ready and running correct puppetization, but don't pool these yet. I think they may have some bad BIOS settings or something, at least related to power mgmt. cpufreq keeps attempting to reset the governor on every puppet run. Will check tomorrow.

May 5 2021, 4:54 AM · SRE, Traffic

Apr 29 2021

BBlack added a comment to T278182: (Need By: TBD) rack/setup/install cp501[3-6].

Note - https://gerrit.wikimedia.org/r/c/operations/puppet/+/683026 has the production roles and config, but we'll need to reimage them into this rather than just applying it, in order to get the nvme storage and partman set up consistently.

Apr 29 2021, 9:59 PM · SRE, ops-eqsin, DC-Ops

Apr 28 2021

BBlack added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

Continuing the thought above: varnishlog data may imply that most of the perf impact could be restored just by extending grace to something like 5 minutes.

Apr 28 2021, 11:42 AM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic
BBlack added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

Alternatively, can we get identical results just by incrementing grace by keep? (And possibly setting keep to 0 if it isn't doing anything for us w.r.t. If-Modified-Since?)
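
To make the grace/keep question concrete, here is a simplified Python model of how Varnish's documented ttl, grace, and keep windows stack up over an object's age. It is a sketch of the general semantics only, not the production VCL, and the state names and example numbers are hypothetical.

```python
def object_state(age: float, ttl: float, grace: float, keep: float) -> str:
    """Classify a cached object by age (all values in seconds).

    fresh            -> served directly from cache
    stale-servable   -> within grace: may be served stale while a background fetch runs
    revalidate-only  -> within keep: retained only for conditional (If-Modified-Since) fetches
    expired          -> no longer retained
    """
    if age < ttl:
        return "fresh"
    if age < ttl + grace:
        return "stale-servable"
    if age < ttl + grace + keep:
        return "revalidate-only"
    return "expired"
```

In this model, replacing `(grace, keep)` with `(grace + keep, 0)` leaves the total retention time unchanged but turns the whole revalidate-only window into a stale-servable one. For example, with `ttl=60, grace=30, keep=120`, an object aged 100 seconds is revalidate-only, while with `grace=150, keep=0` it is stale-servable.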

Apr 28 2021, 11:38 AM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic

Apr 27 2021

BBlack added a comment to T279457: Multiple host down alerts from rack C2.

Traffic lvs/cp/dns are all repooled, un-downtimed, and green.

Apr 27 2021, 6:37 PM · netops, SRE, ops-codfw
BBlack added a comment to T279457: Multiple host down alerts from rack C2.

Note to our future selves: we forgot to consider the cross-row LVS connections in this downtime: lvs2008 and lvs2010 do not live in row C at all, but had cross-row connections via C2 to reach all the rest of the service hosts in row C!

Apr 27 2021, 4:32 PM · netops, SRE, ops-codfw
BBlack added a comment to T279457: Multiple host down alerts from rack C2.

Traffic stuff (lvs/cp/dns) is depooled, downtimed, and ready for the network fixups.

Apr 27 2021, 2:54 PM · netops, SRE, ops-codfw

Apr 20 2021

BBlack updated the task description for T279457: Multiple host down alerts from rack C2.
Apr 20 2021, 9:56 PM · netops, SRE, ops-codfw

Apr 7 2021

BBlack added a comment to T279034: CentralNotice code to fix the banner bump with “pageview+1 with exceptions for infrequent visitors and as needed”.

Hi all! @BBlack, @bd808, @ema do you think this is a potentially feasible approach from a Varnish perspective? I think it's close to the most basic possible way to try out ESI...? If implemented, probably it should be rolled out gradually while cache load and performance are monitored?

Apr 7 2021, 7:22 PM · Readers-Web-Backlog (Tracking), Product Infrastructure Roadmap, SEO, Patch-For-Review, MediaWiki-extensions-CentralNotice, Fundraising-Backlog

Mar 31 2021

BBlack triaged T278964: Separate ingress IPs and/or infrastructure for large content uploads as Low priority.
Mar 31 2021, 1:43 PM · SRE, Traffic

Mar 30 2021

BBlack closed T278729: cp1087 down with hardware issues as Resolved.
Mar 30 2021, 7:58 PM · ops-eqiad, SRE, Traffic
BBlack added a comment to T278729: cp1087 down with hardware issues.

Seems ok for the ~14h it's been back online so far. I'm going to re-pool this and tentatively resolve the ticket hoping it's a fluke event, but not clear the SEL. If we get a recurrence, we'll re-open and kick this over to dcops.

Mar 30 2021, 7:56 PM · ops-eqiad, SRE, Traffic

Mar 16 2021

BBlack added a comment to T274592: Apple Business Manager: verify ownership of wikimedia.org.

@bcampbell - Updated with the new record, try again?

Mar 16 2021, 8:05 PM · Patch-For-Review, Traffic, DNS, SRE

Mar 15 2021

BBlack closed T276585: Add enterprise subdomain for OKAPI as Resolved.
Mar 15 2021, 2:24 PM · Wikimedia Enterprise, SRE, Traffic, Domains

Mar 12 2021

BBlack closed T274592: Apple Business Manager: verify ownership of wikimedia.org as Resolved.

@bcampbell sorry for the delays, this has repeatedly fallen through the cracks, but it's reviewed + merged now and should verify!

Mar 12 2021, 4:08 PM · Patch-For-Review, Traffic, DNS, SRE

Mar 10 2021

BBlack claimed T276585: Add enterprise subdomain for OKAPI.
Mar 10 2021, 9:13 PM · Wikimedia Enterprise, SRE, Traffic, Domains
BBlack added a comment to T276585: Add enterprise subdomain for OKAPI.

@crusnov - It's been followed up offline from phab in general with some meetings, the output of which aren't (yet) reflected in phab, if you're just looking for whether it's being ignored or not! :)

Mar 10 2021, 8:54 PM · Wikimedia Enterprise, SRE, Traffic, Domains
BBlack added a comment to T276673: Certificate Management for GitLab.

Yeah, I tend to prefer option 2 as well. The other option could work in the short term, although maybe we'd want to evaluate whether it's sufficient in all respects (e.g. how early does it renew, and how does it retry on failures, and does it burn up ratelimits aggressively in a failure scenario, in a way that could impact acme-chief use of the same ratelimits, and how do we monitor for renewal failures, etc, etc). It seems easier to just use our standardized solution where we're solving these problems centrally.

Mar 10 2021, 12:30 PM · GitLab (Initialization)

Mar 8 2021

BBlack added a comment to T274784: CDN cache revalidation on several wikis for desktop improvements deployment pt 2.

^ There was a last-minute change of plans, so we made a last-minute call to expend a little bit of our overcautious-ness budget and do all 7 wikis at once (as opposed to ptwiki separately from the other 6).

Mar 8 2021, 7:49 PM · Readers-Web-Backlog (Kanbanana-FY-2020-21), Bengali-Sites, Serbian-Sites, Turkish-Sites, Performance-Team (Radar), SRE, Traffic, Desktop Improvements

Mar 2 2021

BBlack added a comment to T274784: CDN cache revalidation on several wikis for desktop improvements deployment pt 2.

@ovasileva Yes, that plan seems reasonable!

Mar 2 2021, 6:00 PM · Readers-Web-Backlog (Kanbanana-FY-2020-21), Bengali-Sites, Serbian-Sites, Turkish-Sites, Performance-Team (Radar), SRE, Traffic, Desktop Improvements

Feb 26 2021

BBlack added a comment to T275809: cache_upload cache policy + large_objects_cutoff concerns.

Following up a bit on other paths through this problem:

Feb 26 2021, 3:49 PM · Patch-For-Review, SRE, Traffic

Feb 25 2021

BBlack added a comment to T275809: cache_upload cache policy + large_objects_cutoff concerns.

So, to expand a little bit on the text quoted at the top with some initial insights about cutoff vs nuke-limit tradeoffs and some of my current thinking and/or assumptions:

Feb 25 2021, 7:36 PM · Patch-For-Review, SRE, Traffic
BBlack added a comment to T274888: cp_upload @ eqsin cascading failures, February 2021.

I've spun out T275809 to go into some depth on the #1 part about large_objects_cutoff

Feb 25 2021, 7:07 PM · Patch-For-Review, SRE, Traffic
BBlack created T275809: cache_upload cache policy + large_objects_cutoff concerns.
Feb 25 2021, 7:04 PM · Patch-For-Review, SRE, Traffic
BBlack updated the task description for T274888: cp_upload @ eqsin cascading failures, February 2021.
Feb 25 2021, 7:02 PM · Patch-For-Review, SRE, Traffic
BBlack added a subtask for T275046: provision more machines for eqsin caches: Unknown Object (Task).
Feb 25 2021, 5:53 PM · SRE, Traffic
BBlack updated subscribers of T274888: cp_upload @ eqsin cascading failures, February 2021.

Updates on where we're at on some of the pain points above, in terms of solution analysis:

Feb 25 2021, 4:15 PM · Patch-For-Review, SRE, Traffic
BBlack updated the task description for T274888: cp_upload @ eqsin cascading failures, February 2021.
Feb 25 2021, 3:12 PM · Patch-For-Review, SRE, Traffic

Feb 24 2021

BBlack added a comment to T255568: Envoy should listen on ipv6 and ipv4.

@Joe yeah I'm not sure which layer is causing the logstash appearance there. It's from restbase1019 as a client towards something, maybe parsoid?

Feb 24 2021, 9:10 PM · Patch-For-Review, envoy, User-fgiunchedi, observability, serviceops
BBlack added a comment to T255568: Envoy should listen on ipv6 and ipv4.

Hi serviceops - I've run into some of the effects of this recently and tracked down this ticket, which seems a relevant/recent reference point.

Feb 24 2021, 8:44 PM · Patch-For-Review, envoy, User-fgiunchedi, observability, serviceops

Feb 18 2021

BBlack added a comment to T274784: CDN cache revalidation on several wikis for desktop improvements deployment pt 2.

This seems pretty straightforward operationally; I think we can replicate the techniques used in T256750 for more wikis in general.

Feb 18 2021, 6:18 PM · Readers-Web-Backlog (Kanbanana-FY-2020-21), Bengali-Sites, Serbian-Sites, Turkish-Sites, Performance-Team (Radar), SRE, Traffic, Desktop Improvements

Feb 16 2021

Ladsgroup awarded T144187: Better handling for one-hit-wonder objects a Yellow Medal token.
Feb 16 2021, 3:54 PM · Performance-Team (Radar), Patch-For-Review, Traffic, SRE

Feb 11 2021

BBlack added a comment to T221388: Test dhcp-option 82.

I'm probably not up to date on concrete plans built on top of this, but it seems like having the numeric vlan id might be useful metadata here in addition to the abstract name of the vlan (e.g. scenarios where we might do vlan trunking on the main interface of the host and need to see or match that primary-vlan number in some interface setup scripts?)

Feb 11 2021, 1:41 PM · Infrastructure-Foundations, Patch-For-Review, SRE, netops

Feb 1 2021

BBlack added a comment to T273248: wikireplicas last-minute infra work to discuss / resolve.

The more interesting Netbox question here is what the correct way is to define a new tagged virtual interface that doesn't exist yet (the loop I had with the interface-name dropdown and puppetdb and homer, etc.)

Feb 1 2021, 2:28 PM · Infrastructure-Foundations, SRE, netops, Traffic, Data-Services, cloud-services-team (Kanban)
BBlack added a comment to T273248: wikireplicas last-minute infra work to discuss / resolve.

It looks like the /27 is what I manually created, and then the /32 was probably patched in later from puppetdb after everything was configured and running.

Feb 1 2021, 2:25 PM · Infrastructure-Foundations, SRE, netops, Traffic, Data-Services, cloud-services-team (Kanban)
BBlack added a comment to T271415: Investigate ms-be hosts performance during rebalances.

The interface::rps define is active for broadcom NICs at the moment; I noticed that some HP hosts use the i40e driver instead. AFAICT we don't have interface::rps applied already to any hosts using i40e. I tried testing interface-rps.py on ms-be2056.

The script worked in the sense that there were no errors, although I'd like confirmation from e.g. @BBlack or @faidon perhaps. Namely that things look as they should on ms-be2056 and/or the script's logic needs adjusting for i40e NICs, thanks!

Feb 1 2021, 12:12 PM · Patch-For-Review, User-fgiunchedi, SRE-swift-storage

Jan 29 2021

aborrero awarded T273248: wikireplicas last-minute infra work to discuss / resolve a Like token.
Jan 29 2021, 9:25 AM · Infrastructure-Foundations, SRE, netops, Traffic, Data-Services, cloud-services-team (Kanban)

Jan 28 2021

BBlack created T273248: wikireplicas last-minute infra work to discuss / resolve.
Jan 28 2021, 11:18 PM · Infrastructure-Foundations, SRE, netops, Traffic, Data-Services, cloud-services-team (Kanban)
BBlack added a comment to T272720: Allocate service IPs for new wikireplicas setup.

FTR: Yes it was a netbox thing, and these are now created as:

Jan 28 2021, 10:56 PM · Data-Services, cloud-services-team (Kanban)

Jan 20 2021

BBlack added a comment to T272258: lvs1015 interface errors.

All appears healthy now and downtimes are removed, and librenms isn't showing those errors on the interface anymore, either. Thanks!

Jan 20 2021, 8:02 PM · SRE, Traffic, ops-eqiad
BBlack added a comment to T272258: lvs1015 interface errors.

@Cmjohnson - we'll have it ready then.

Jan 20 2021, 4:31 PM · SRE, Traffic, ops-eqiad
BBlack added a comment to T272258: lvs1015 interface errors.

@Cmjohnson Yes, just let me know a timeframe and we'll get it ready

Jan 20 2021, 4:06 PM · SRE, Traffic, ops-eqiad

Jan 19 2021

BBlack added a comment to T272258: lvs1015 interface errors.

@Cmjohnson - let me know when you're ready to deal with this, and we'll stop service on the node and fail things over to lvs1016.

Jan 19 2021, 2:56 PM · SRE, Traffic, ops-eqiad

Jan 15 2021

BBlack added a comment to T257324: Consolidate edge bastion server into ganeti.

We actually do have some upcoming projects which might necessitate more Ganeti capacity. In general the plan is to move all the non-ganeti DNS boxes into ganeti as well if possible, and to spin up DoH instances in ganeti everywhere as well (which may turn out to need multiple instances and have real scaling issues). But we don't need more capacity there *now* just yet, and so long as they're kept powered up as online spares, we can always deal with the decision to move them into the cluster at a later time.

Jan 15 2021, 4:12 PM · Patch-For-Review, Traffic, SRE

Jan 14 2021

BBlack added a comment to T271087: lvs1016 interface down.

@Cmjohnson - Please do it at your earliest convenience. It's not in the flow of live traffic and doesn't need any "depool" AFAIK (but it is problematic that we don't have it as a reliable backup option!).

Jan 14 2021, 1:08 PM · Traffic, SRE, ops-eqiad

Jan 12 2021

BBlack added a comment to T266746: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001.

There are some anomalies in network graphs on authdns1001 that I hadn't noticed until today, which go all the way back to Oct 26, which is probably around when this started. I'm not sure if they're artificial or not (nothing seems to be wrong), but I'm going to do a precautionary reboot anyway. More likely than not it's something to do with stats reporting itself, which may have become a bit confused when the root disk ran out of space and never truly recovered since we never rebooted.

Jan 12 2021, 10:03 PM · Traffic-Icebox, SRE

Dec 16 2020

BBlack added a comment to T269686: Create three Okapi sub-domains (okapi*.wikimedia.org).

There's probably a lot of context missing here, although we can gather some from https://www.mediawiki.org/wiki/Okapi and https://meta.wikimedia.org/wiki/Okapi . Perhaps we could get a primer on where the project is at, what temporary purpose these names will be put to, where the IPs will be hosted, what kind of software stack is deployed, and processes around deployment and management?

Dec 16 2020, 9:14 PM · DNS, Wikimedia Enterprise, SRE, Traffic
BBlack updated subscribers of T269686: Create three Okapi sub-domains (okapi*.wikimedia.org).
Dec 16 2020, 9:14 PM · DNS, Wikimedia Enterprise, SRE, Traffic

Dec 14 2020

BBlack added a comment to T270034: Send HSTS header on all Wordpress VIP-hosted domains.

We probably should reach out to them and push on this, though. We do have standards that apply ( https://wikitech.wikimedia.org/wiki/HTTPS ), it's just been a while since we've manually audited everything like in https://wikitech.wikimedia.org/wiki/HTTPS/Domains

Dec 14 2020, 7:09 PM · Technical blog, SRE, Traffic, HTTPS, Diff-blog

Dec 7 2020

BBlack added a comment to T263518: dns repository left in a broken state.

(I'm guessing they should probably be updated to the correct file, and also to mention that it has to be in state: production before deploying the DNS mock_etc part of things, but I'm not sure as I didn't change that stuff....)

Dec 7 2020, 10:09 PM · Traffic, DNS, SRE
BBlack added a comment to T263518: dns repository left in a broken state.

There are comments at the top of the DNS repo's utils/mock_etc/discovery-geo-resources and utils/mock_etc/discovery-metafo-resources about avoiding this scenario by updating things in the correct order. I think the comments themselves are outdated now, as they don't know about the monitoring_setup state and they point at a hieradata file that doesn't exist anymore...

Dec 7 2020, 10:06 PM · Traffic, DNS, SRE

Nov 25 2020

BBlack added a comment to T264378: ATS-BE Lua mitigations for cacheable responses w/ Set-Cookie seemingly not working.

(and to throw another dimension into the matrix of possibilities above - also whether the client is sending a session cookie to Vary on in either or both requests)

Nov 25 2020, 2:11 PM · SRE, Traffic
BBlack added a comment to T264378: ATS-BE Lua mitigations for cacheable responses w/ Set-Cookie seemingly not working.

I think especially if you start considering how Vary: Cookie works in all the above (both for MW on the related 200 and 304 outputs, and in the caches and our VCL), it's quite murky to me whether all of this works sanely in this case. For a given URI, I think we can assume (or at least hope) that Vary: Cookie would be consistently either emitted or not-emitted with all outputs for a given URI (even 304s). But whether we're tracing a V:C or non-V:C case probably changes how all the above plays out with the bgfetch as well due to vary-slotting, if the original was supposedly cacheable and the followup response has a Set-Cookie (which is hopefully uncacheable).

Nov 25 2020, 2:09 PM · SRE, Traffic

Nov 24 2020

BBlack added a comment to T238494: 15% response start regression as of 2019-11-11 (Varnish->ATS).

@Gilles - please excuse the extremely long response! :)

Nov 24 2020, 10:25 PM · Wikimedia-Incident, Performance-Team, SRE, Traffic
BBlack added a comment to T258729: netbox DNS Automation Workflow checklist for Commissioning and Decommissioning 2020Q1.

27.35.198.in-addr.arpa
wikimedia.org-global

Nov 24 2020, 4:41 PM · Patch-For-Review, SRE-tools, User-crusnov, netbox

Nov 23 2020

BBlack added a comment to T266746: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001.

Various related gdnsd fixes were deployed to production with version 3.4.1 of upstream.

Nov 23 2020, 2:56 PM · Traffic-Icebox, SRE

Nov 19 2020

BBlack updated subscribers of T268043: MW REST API should be routed to api_appserver MW cluster.

@ema - Reminder to me and you both - Can you take a peek at this Monday please?

Nov 19 2020, 7:11 PM · serviceops, Traffic, SRE, Platform Team Workboards (Green)

Nov 18 2020

BBlack added a comment to T266373: Connection closed while downloading PDF of articles.

No reports of the PDF truncations in NEL for ~8 hours now, which is a significant break from recent trends. Can anyone else still repro this in any way?

Nov 18 2020, 9:18 PM · Traffic, Readers-Web-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog, serviceops, SRE, Desktop Improvements, Wikimedia-production-error
BBlack closed T252577: Maxmind data update issues for DNS (and others?) as Resolved.

This should be fixed now!

Nov 18 2020, 3:41 PM · SRE, Traffic
BBlack added a comment to T266373: Connection closed while downloading PDF of articles.

The proposed changes are live now. It may take a few hours to confirm that via NEL at our current sample rate. At least my own artificial reproductions seem to have gone away though, for whatever that's worth!

Nov 18 2020, 1:48 PM · Traffic, Readers-Web-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog, serviceops, SRE, Desktop Improvements, Wikimedia-production-error
BBlack added a comment to T266373: Connection closed while downloading PDF of articles.

I'm not exactly sure as to why the pattern above emerged, but now I don't think it's relevant at all, just an artifact of the global distribution of various kinds of traffic.

Nov 18 2020, 1:13 PM · Traffic, Readers-Web-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog, serviceops, SRE, Desktop Improvements, Wikimedia-production-error
BBlack added a comment to T266373: Connection closed while downloading PDF of articles.

I haven't been able to repro this on a public endpoint from my own home connection, even using the random-fetcher script, but that would all be against one cache in codfw.

Nov 18 2020, 11:41 AM · Traffic, Readers-Web-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog, serviceops, SRE, Desktop Improvements, Wikimedia-production-error

Nov 8 2020

BBlack added a comment to T266746: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001.
  • for gdnsd-the-software:
Nov 8 2020, 2:53 PM · Traffic-Icebox, SRE

Nov 5 2020

BBlack added a comment to T258405: Deprecate TLSv1.2 weak ciphersuites.

We should probably also update https://wikitech.wikimedia.org/wiki/HTTPS with the new status quo

Nov 5 2020, 2:43 PM · User-notice, Patch-For-Review, SRE, Traffic