
BBlack (Brandon Black)
Engineering Manager, SRE Traffic Team

User Details

User Since
Nov 4 2014, 4:29 PM (323 w, 4 d)
Availability
Available
IRC Nick
bblack
LDAP User
BBlack
MediaWiki User
BBlack (WMF)

Recent Activity

Yesterday

BBlack added a comment to T257324: Consolidate edge bastion server into ganeti.

We actually do have some upcoming projects which might necessitate more Ganeti capacity. In general the plan is to move all the non-ganeti DNS boxes into ganeti as well if possible, and to spin up DoH instances in ganeti everywhere as well (which may turn out to need multiple instances and have real scaling issues). But we don't need more capacity there *now* just yet, and so long as they're kept powered up as online spares, we can always deal with the decision to move them into the cluster at a later time.

Fri, Jan 15, 4:12 PM · Patch-For-Review, Traffic, SRE

Thu, Jan 14

BBlack added a comment to T271087: lvs1016 interface down.

@Cmjohnson - Please do it at your earliest convenience. It's not in the flow of live traffic and doesn't need any "depool" AFAIK (but it is problematic that we don't have it as a reliable backup option!).

Thu, Jan 14, 1:08 PM · Traffic, SRE, ops-eqiad

Tue, Jan 12

BBlack added a comment to T266746: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001.

There are some anomalies in the network graphs on authdns1001 that I hadn't noticed until today. They go all the way back to Oct 26, which is probably around when this started. I'm not sure if they're artificial or not (nothing seems to be wrong), but I'm going to do a precautionary reboot anyway. More likely than not it's something to do with the stats reporting itself, which may have become a bit confused when the root disk ran out of space and never truly recovered, since we never rebooted.

Tue, Jan 12, 10:03 PM · SRE, Traffic

Dec 16 2020

BBlack added a comment to T269686: Create three Okapi sub-domains (okapi*.wikimedia.org).

There's probably a lot of context missing here, although we can gather some from https://www.mediawiki.org/wiki/Okapi and https://meta.wikimedia.org/wiki/Okapi . Perhaps we could get a primer on where the project is at, what temporary purpose these names will be put to, where the IPs will be hosted, what kind of software stack is deployed, and the processes around deployment and management?

Dec 16 2020, 9:14 PM · DNS, Okapi, SRE, Traffic
BBlack updated subscribers of T269686: Create three Okapi sub-domains (okapi*.wikimedia.org).
Dec 16 2020, 9:14 PM · DNS, Okapi, SRE, Traffic

Dec 14 2020

BBlack added a comment to T270034: Send HSTS header on all Wordpress VIP-hosted domains.

We probably should reach out to them and push on this, though. We do have standards that apply ( https://wikitech.wikimedia.org/wiki/HTTPS ), it's just been a while since we've manually audited everything like in https://wikitech.wikimedia.org/wiki/HTTPS/Domains

Dec 14 2020, 7:09 PM · Technical blog, SRE, Traffic, HTTPS, Diff-blog

Dec 7 2020

BBlack added a comment to T263518: dns repository left in a broken state.

(I'm guessing they should probably be updated to the correct file, and also to mention that it has to be in state: production before deploying the DNS mock_etc part of things, but I'm not sure as I didn't change that stuff....)

Dec 7 2020, 10:09 PM · Traffic, DNS, SRE
BBlack added a comment to T263518: dns repository left in a broken state.

There are comments at the top of the DNS repo's utils/mock_etc/discovery-geo-resources and utils/mock_etc/discovery-metafo-resources about avoiding this scenario by updating things in the correct order. I think the comments themselves are outdated now, as they don't know about the monitoring_setup state and they point at a hieradata file that doesn't exist anymore...

Dec 7 2020, 10:06 PM · Traffic, DNS, SRE

Nov 25 2020

BBlack added a comment to T264378: ATS-BE Lua mitigations for cacheable responses w/ Set-Cookie seemingly not working.

(and to throw another dimension into the matrix of possibilities above - also whether the client is sending a session cookie to Vary on in either or both requests)

Nov 25 2020, 2:11 PM · SRE, Traffic
BBlack added a comment to T264378: ATS-BE Lua mitigations for cacheable responses w/ Set-Cookie seemingly not working.

I think especially if you start considering how Vary: Cookie works in all the above (both for MW on the related 200 and 304 outputs, and in the caches and our VCL), it's quite murky to me whether all of this works sanely in this case. For a given URI, I think we can assume (or at least hope) that Vary: Cookie would be consistently either emitted or not-emitted with all outputs for a given URI (even 304s). But whether we're tracing a V:C or non-V:C case probably changes how all the above plays out with the bgfetch as well due to vary-slotting, if the original was supposedly cacheable and the followup response has a Set-Cookie (which is hopefully uncacheable).

Nov 25 2020, 2:09 PM · SRE, Traffic

Nov 24 2020

BBlack added a comment to T238494: 15% response start regression as of 2019-11-11 (Varnish->ATS).

@Gilles - please excuse the extremely long response! :)

Nov 24 2020, 10:25 PM · Wikimedia-Incident, Performance-Team, Traffic, SRE
BBlack added a comment to T258729: netbox DNS Automation Workflow checklist for Commissioning and Decommissioning 2020Q1.

27.35.198.in-addr.arpa
wikimedia.org-global

Nov 24 2020, 4:41 PM · Patch-For-Review, SRE-tools, User-crusnov, netbox

Nov 23 2020

BBlack added a comment to T266746: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001.

Various related gdnsd fixes were deployed to production with version 3.4.1 of upstream.

Nov 23 2020, 2:56 PM · SRE, Traffic

Nov 19 2020

BBlack updated subscribers of T268043: MW REST API should be routed to api_appserver MW cluster.

@ema - Reminder to me and you both - Can you take a peek at this Monday please?

Nov 19 2020, 7:11 PM · serviceops, Traffic, SRE, Platform Team Workboards (Green)

Nov 18 2020

BBlack added a comment to T266373: Connection closed while downloading PDF of articles.

No reports of the PDF truncations in NEL for ~8 hours now, which is a significant break from recent trends. Can anyone else still repro this in any way?

Nov 18 2020, 9:18 PM · Traffic, Readers-Web-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog, serviceops, SRE, Desktop Improvements, Wikimedia-production-error
BBlack closed T252577: Maxmind data update issues for DNS (and others?) as Resolved.

This should be fixed now!

Nov 18 2020, 3:41 PM · SRE, Traffic
BBlack added a comment to T266373: Connection closed while downloading PDF of articles.

The proposed changes are live now. It may take a few hours to confirm that via NEL at our current sample rate. At least my own artificial reproductions seem to have gone away though, for whatever that's worth!

Nov 18 2020, 1:48 PM · Traffic, Readers-Web-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog, serviceops, SRE, Desktop Improvements, Wikimedia-production-error
BBlack added a comment to T266373: Connection closed while downloading PDF of articles.

I'm not exactly sure as to why the pattern above emerged, but now I don't think it's relevant at all, just an artifact of the global distribution of various kinds of traffic.

Nov 18 2020, 1:13 PM · Traffic, Readers-Web-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog, serviceops, SRE, Desktop Improvements, Wikimedia-production-error
BBlack added a comment to T266373: Connection closed while downloading PDF of articles.

I haven't been able to repro this on a public endpoint from my own home connection, even using the random-fetcher script, but that would all be against one cache in codfw.

Nov 18 2020, 11:41 AM · Traffic, Readers-Web-Backlog (Tracking), Proton, Product-Infrastructure-Team-Backlog, serviceops, SRE, Desktop Improvements, Wikimedia-production-error

Nov 8 2020

BBlack added a comment to T266746: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001.
  • for gdnsd-the-software:
Nov 8 2020, 2:53 PM · SRE, Traffic

Nov 5 2020

BBlack added a comment to T258405: Deprecate TLSv1.2 weak ciphersuites.

We should probably also update https://wikitech.wikimedia.org/wiki/HTTPS with the new status quo

Nov 5 2020, 2:43 PM · User-notice, Patch-For-Review, SRE, Traffic

Nov 1 2020

Ladsgroup awarded T137979: Support brotli compression a Love token.
Nov 1 2020, 4:16 AM · Performance-Team (Radar), SRE, Traffic

Oct 29 2020

BBlack added a comment to T266746: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001.

All the authdns are restarted with the infinite limit applied. There's been some IRC discussion about a few possible spinoff tickets here:

Oct 29 2020, 2:19 PM · SRE, Traffic
BBlack added a comment to T266702: Move WDQS UI to microsites.

We can route different URI subspaces differently at the edge layer, based on URI regexes, as shown here for the split of the API namespace of the primary wiki sites:

Oct 29 2020, 1:43 PM · Patch-For-Review, User-Addshore, Wikidata Query UI, SRE, Wikidata
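
The snippet referenced in the T266702 comment above is not included in this feed. As a rough, language-neutral illustration of the general idea (first matching URI regex picks the backend), here is a minimal Python sketch. The patterns and the backend names `api_backend` and `default_backend` are hypothetical placeholders, not the actual edge configuration.

```python
import re

# Hypothetical routing table: (compiled URI regex, backend name).
# Order matters; the first matching pattern wins.
ROUTES = [
    (re.compile(r"^/w/api\.php"), "api_backend"),
    (re.compile(r"^/api/"), "api_backend"),
]

# Hypothetical fallback for everything that matches no pattern.
DEFAULT_BACKEND = "default_backend"


def pick_backend(uri_path):
    """Return the backend name for a request path, based on URI regexes."""
    for pattern, backend in ROUTES:
        if pattern.search(uri_path):
            return backend
    return DEFAULT_BACKEND
```

Under these assumptions, `pick_backend("/w/api.php?action=query")` selects the API backend, while an ordinary page path falls through to the default.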

Oct 26 2020

BBlack added a comment to T266040: Large text objects are randomized to cache backends.

Notes on the large increase in large_objects_cutoff from late last week:

Oct 26 2020, 2:03 PM · Patch-For-Review, SRE, Traffic

Oct 21 2020

BBlack added a comment to T266118: Revisit use of swap and related kernel settings.

Recording from IRC for posterity:

11:07 < bblack> so I was checking out https://gerrit.wikimedia.org/r/c/operations/puppet/+/633704 (which is one of the partman cleanup commits, this one 
                affecting our cacheproxy disk layout), and I've fallen back down the rabbithole of swap space considerations
11:07 < bblack> because the new defaults basically take the partman defaults without any specifics, which is apparently going to create a 1G swap partition 
                (at least, until some future change of defaults?)
11:08 < bblack> in the recent past we didn't have swap partitions on the cache boxes
11:08 < bblack> (and of course, modules/base for better or worse has vm.swappiness = 0, along with some other sometimes-questionable tunables)
11:10 < bblack> I get all the arguments for why disabling swap is probably a dumb idea in most common scenarios
11:11 < bblack> but, I think, if we're taking that angle as a reason to configure swap, we'd probably also relax that swappiness=0 setting as well to make it 
                more useful, and perhaps configure something other than a fixed 1G value for wildly varying workloads and phys memory sizes, too
11:11 < bblack> the current setup (yes, do swap, but at a small fixed size with swappiness=0) seems to be in some less-ideal intersection of competing ideas
11:12 < bblack> but in any case, rewinding to the cacheproxy case in particular
11:14 < bblack> these boxes have: 384GB of RAM, the bulk of which is hopefully a fast ram cache for http objects, and a 1.6TB super-fast nvme that's used as 
                an http object disk cache as well (it's the backing for the earlier ram cache), and then a mere ~300G of standard-issue (slower) SSDs for this 
                rootfs stuff
11:16 < bblack> we really don't have any hope of a substantially-useful swap config (where e.g. some mostly-idle or inefficient meta-daemons related to 
                monitoring or something might swap out significant RAM that matters to us), the disks we'd potentially use for swap are actually-smaller than 
                the real RAM, and we're very sensitive to the fact that we'd rather risk OOM than have the kernel make a dumb 
11:16 < bblack> decision on swap (e.g. algorithmically make the mistake of swapping out some critical varnishd cache memory and then need to swap it back in 
                during a cache hit for users)
11:17 < bblack> but I don't think our new regime of standard configs allows for a swapless setup?
11:23 < bblack> or: we could make some noswap variants so that it's semi-standardized?
11:24 < bblack> I don't want to derail standardization efforts, and I feel like allowing exceptions can turn into a lot of exceptions over time
11:24 < bblack> really we should revisit the swap question in general, but I think that goes well out of the partman-cleanup scope
11:25 < bblack> (but affects partman, too)
11:30 < bblack> for future sorting out of swap questions in general: we could also make the argument to just never configure swap *partitions*, and have some 
                base/standard puppetization create/manage swap *files* on the rootfs instead, which makes tuning and runtime changes simpler, etc.  I doubt 
                there's a perf diff we'd care about in that particular case.
Oct 21 2020, 12:05 PM · User-MoritzMuehlenhoff, SRE
BBlack added a comment to T266040: Large text objects are randomized to cache backends.

That is what I was thinking too, but I'm not sure if the VCL state diagram allows us to see that at the right point in time to make the decision or not. We'd have to store some state with the pass object somehow, at least? The flow in the particular case in question that I was observing is like this:

Oct 21 2020, 10:54 AM · Patch-For-Review, SRE, Traffic

Oct 20 2020

BBlack added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

I stumbled on T266040 while looking at something unrelated, but now I'm remembering that earlier in this ticket, there was some mention of a possible correlation with larger response sizes here in this report as well. The stuff in T266040 would have some statistical negative effect primarily on objects >= 256KB (the hieradata-controlled large_objects_cutoff), and a secondary effect on overall backend caching efficiency for everything else (by wasting space on pointless duplications of content). AFAIK what we're looking at in that ticket isn't something that changed from V5 to V6, though... but I could be wrong, or it could be that some other subtle change in V6 behavior exacerbated the impact.

Oct 20 2020, 4:28 PM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic
BBlack triaged T266040: Large text objects are randomized to cache backends as Medium priority.
Oct 20 2020, 4:05 PM · Patch-For-Review, SRE, Traffic
BBlack created T266040: Large text objects are randomized to cache backends.
Oct 20 2020, 4:05 PM · Patch-For-Review, SRE, Traffic

Oct 19 2020

Ladsgroup awarded T133548: Create a secure redirect service for large count of non-canonical / junk domains a Like token.
Oct 19 2020, 8:38 AM · Goal, Patch-For-Review, HTTPS, Traffic, SRE

Oct 16 2020

BBlack assigned T265729: decommission cp2003, cp2009, cp2015, cp2021 to Papaul.
Oct 16 2020, 3:15 PM · SRE, ops-codfw, decommission-hardware
BBlack updated the task description for T265729: decommission cp2003, cp2009, cp2015, cp2021.
Oct 16 2020, 2:54 PM · SRE, ops-codfw, decommission-hardware
BBlack created T265729: decommission cp2003, cp2009, cp2015, cp2021.
Oct 16 2020, 2:49 PM · SRE, ops-codfw, decommission-hardware

Oct 8 2020

BBlack added a comment to T264888: Review default ferm INPUT policy.

FWIW, I am in general a fan of REJECT over DROP, especially when there's not even a great obscurity argument, as is the case here. It will be a change for us internally on the debugging woes mentioned, but it's more correct overall, and letting things that will eventually fail do so faster seems like it's always a net win :)

Oct 8 2020, 1:44 PM · Patch-For-Review, Security, SRE, netops, User-jbond

Oct 7 2020

BBlack added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

@ema @BBlack before I build it, I want to confirm that some complementary information you're looking for is the ability to break down the RUM response start metric by "hit-front", "hit-local", etc. response type in addition to DC and host.

Oct 7 2020, 4:49 PM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic

Oct 6 2020

BBlack added a comment to T264273: DNS: per prefix zone-file limitation.

For IPv4 you'll have to break up the <24 cases into /24 containers somehow, because the DNS zones themselves are /24 in that case. Perhaps we have to make some new containers in netbox, which represent the DNS-level abstraction, in this case?

Oct 6 2020, 8:53 PM · netbox
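
The /24 container idea in the T264273 comment above can be sketched with Python's standard `ipaddress` module. This is only an illustration, not the netbox implementation. The function name `reverse_zone_containers` is hypothetical, and mapping prefixes longer than /24 to their enclosing /24 is an assumption added here for completeness.

```python
import ipaddress


def reverse_zone_containers(prefix):
    """Return the in-addr.arpa zone names of the /24 containers for an IPv4 prefix.

    Prefixes of length <= 24 are split into their /24 subnets; longer
    prefixes (an assumption of this sketch) map to the enclosing /24.
    """
    net = ipaddress.ip_network(prefix, strict=True)
    if net.version != 4:
        raise ValueError("this sketch only handles IPv4 prefixes")

    if net.prefixlen <= 24:
        containers = list(net.subnets(new_prefix=24))
    else:
        containers = [net.supernet(new_prefix=24)]

    zones = []
    for container in containers:
        # Reverse the first three octets to build the zone name.
        first, second, third, _ = str(container.network_address).split(".")
        zones.append(f"{third}.{second}.{first}.in-addr.arpa")
    return zones
```

For example, a /22 yields four zone names, one per /24 it contains, while a /26 yields the single zone of its enclosing /24.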

Oct 5 2020

BBlack added a comment to T238285: Pages whose title ends with semicolon (;) are intermittently inaccessible.

With the dupe merger, maybe we owe a status update here:

Oct 5 2020, 4:34 PM · Wikimedia-General-or-Unknown, SRE, Traffic, User-DannyS712

Oct 2 2020

BBlack moved T241593: cp1083: ats-tls and varnish-fe crashed due to insufficient memory from Caching to Bug Reports on the Traffic board.
Oct 2 2020, 1:59 PM · Sustainability (Incident Followup), SRE, Traffic
BBlack moved T106517: upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404 from Caching to Bug Reports on the Traffic board.
Oct 2 2020, 1:59 PM · Sustainability (Incident Followup), Traffic, SRE
BBlack moved T128188: Make CI run Varnish VCL tests from Caching to Epic Wishlist on the Traffic board.
Oct 2 2020, 1:58 PM · Varnish, Patch-For-Review, SRE, Continuous-Integration-Infrastructure, Traffic
BBlack moved T122867: Evaluate the feasibility of cache invalidation for the action API from Caching to Epic Ideas on the Traffic board.
Oct 2 2020, 1:58 PM · SRE, Traffic, Varnish, MediaWiki-API
BBlack added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

Eh maybe a few more to think about too:

Oct 2 2020, 1:12 PM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic
BBlack added a comment to T264398: 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1).

Just throwing in some random points/counterpoints to ponder:

Oct 2 2020, 1:01 PM · Patch-For-Review, Performance-Team (Radar), SRE, Traffic

Oct 1 2020

BBlack closed T264227: lvs1016 enp5s0f0 interface errors as Resolved.

@Cmjohnson replaced the SFPs on both ends of this link before my reboot above. Since the reboot, we don't seem to have any abnormal rate of interface failures, nor do we yet observe ProxyFetch failures in pybal's logs.

Oct 1 2020, 5:19 PM · ops-eqiad, netops, Traffic, SRE
BBlack added a comment to T264227: lvs1016 enp5s0f0 interface errors.

The link has gotten worse and begun flapping up and down rapidly since the last update, causing a loss of routing to the row. I've downtimed the whole host now in icinga, disabled puppet on the host, and manually downed the interface to stop the flapping.

Oct 1 2020, 3:55 PM · ops-eqiad, netops, Traffic, SRE

Sep 30 2020

BBlack updated the task description for T264227: lvs1016 enp5s0f0 interface errors.
Sep 30 2020, 6:36 PM · ops-eqiad, netops, Traffic, SRE
BBlack triaged T264227: lvs1016 enp5s0f0 interface errors as High priority.
Sep 30 2020, 6:32 PM · ops-eqiad, netops, Traffic, SRE

Sep 29 2020

BBlack moved T256302: Certain links being rejected by caching if opened in Internet Explorer with a HTTP 400 error from Caching to Bug Reports on the Traffic board.
Sep 29 2020, 9:44 PM · SRE, Traffic
BBlack moved T236754: Discarded VCL files stuck in auto/busy state cause high number of backend probe requests from Caching to Bug Reports on the Traffic board.
Sep 29 2020, 9:44 PM · Patch-For-Review, SRE, Traffic
BBlack moved T263288: experiment with reenabling compression between applayer's TLS terminators and edge caches from Caching to Feature Requests on the Traffic board.
Sep 29 2020, 9:28 PM · netops, Traffic, SRE
BBlack moved T264074: varnishkafka 1.1.0 CPU usage increase from Caching to Bug Reports on the Traffic board.
Sep 29 2020, 9:28 PM · Patch-For-Review, Analytics-Clusters, Traffic, SRE
BBlack moved T263275: Capacity planning for (& optimization of) transport backhaul vs edge egress from Caching to Epic Ideas on the Traffic board.
Sep 29 2020, 9:28 PM · netops, Traffic, SRE, Epic
BBlack closed T133821: Make CDN purges reliable as Resolved.

This should've been closed back when T250781 closed - all purge traffic now goes via kafka queues and multicast purging is no more. We might have more to do on rate reduction separately in T250205 , but I don't think that needs to hold this ancient, epic, somewhat ambiguous task open.

Sep 29 2020, 9:28 PM · Sustainability, serviceops, Performance-Team (Radar), Traffic, SRE
BBlack closed T133821: Make CDN purges reliable, a subtask of T119038: Image cache issue when 'over-writing' an image on commons, as Resolved.
Sep 29 2020, 9:27 PM · Patch-For-Review, Traffic, Multimedia, MediaWiki-File-management, SRE, Commons
BBlack closed T133821: Make CDN purges reliable, a subtask of T109331: Deleted files sometimes remain visible to non-privileged users if permanently linked, as Resolved.
Sep 29 2020, 9:27 PM · Security, SRE-swift-storage, SRE, Vuln-Infoleak, Traffic, Commons
BBlack closed T133821: Make CDN purges reliable, a subtask of T133819: upload-lb.ulsfo.wikimedia.org still allow access to some deleted files, as Resolved.
Sep 29 2020, 9:27 PM · Security, SRE-swift-storage, SRE, Vuln-Infoleak, Traffic, Commons
BBlack closed T128374: Sort out analytics service dependency issues for cp* cache hosts as Declined.

This is too-stale now and a lot of these bits have been replaced over time and are known to have their deps correct.

Sep 29 2020, 9:19 PM · User-Elukey, Varnish, Traffic, Analytics, SRE
BBlack moved T129839: restrict upload cache access for private wikis from Caching to Feature Requests on the Traffic board.
Sep 29 2020, 9:17 PM · Traffic, SRE
BBlack moved T130904: Host rewrite for /static/ not applied to purges from Caching to Bug Reports on the Traffic board.
Sep 29 2020, 9:17 PM · Patch-For-Review, SRE, Traffic
BBlack moved T262428: Cache Accept-language optimisation from Caching to Epic Ideas on the Traffic board.
Sep 29 2020, 9:17 PM · SRE, Traffic
BBlack moved T263291: experiment with a "unified" ATS-BE pool from Caching to Epic Wishlist on the Traffic board.
Sep 29 2020, 9:17 PM · Performance-Team (Radar), Traffic, SRE
BBlack moved T198620: Consider using vmod_var instead of temporary headers in VCL from Caching to Epic Wishlist on the Traffic board.
Sep 29 2020, 9:16 PM · Traffic, SRE
BBlack added a comment to T159412: Convert all of our site.pp/roles to the role/profile paradigm.

Added a subtask for the one Traffic case I can find here, and removing our tag from this.

Sep 29 2020, 9:15 PM · Patch-For-Review, Cloud-Services, Technical-Debt, Puppet, SRE
BBlack removed a project from T159412: Convert all of our site.pp/roles to the role/profile paradigm: Traffic.
Sep 29 2020, 9:14 PM · Patch-For-Review, Cloud-Services, Technical-Debt, Puppet, SRE
BBlack moved T264132: Fix rule violation in the lvs balancer role from Triage to Bug Reports on the Traffic board.
Sep 29 2020, 9:13 PM · Traffic, Technical-Debt, Puppet, SRE
BBlack triaged T264132: Fix rule violation in the lvs balancer role as Low priority.
Sep 29 2020, 9:12 PM · Traffic, Technical-Debt, Puppet, SRE
BBlack created T264132: Fix rule violation in the lvs balancer role.
Sep 29 2020, 9:12 PM · Traffic, Technical-Debt, Puppet, SRE
BBlack moved T159411: Uniform cluster nomenclature across puppet from General to Epic Ideas on the Traffic board.
Sep 29 2020, 9:06 PM · Cloud-Services, Traffic, Technical-Debt, Puppet, SRE
BBlack moved T116132: Consider allowing H2 coalesce for upload.wikimedia.org for images used in wiki articles from General to Epic Ideas on the Traffic board.
Sep 29 2020, 9:05 PM · Performance-Team (Radar), SRE, Traffic
BBlack added a comment to T116132: Consider allowing H2 coalesce for upload.wikimedia.org for images used in wiki articles.

All the perf tradeoffs and relatively-trivial work aside, the major blocker we still face here is the likely problems created by either of the simplest methods of handling this at the tls / varnish-fe layers:

Sep 29 2020, 9:05 PM · Performance-Team (Radar), SRE, Traffic
BBlack moved T129682: Look into solutions for replaying traffic to testing environment(s) from General to Epic Ideas on the Traffic board.
Sep 29 2020, 8:52 PM · Platform Team Legacy (Later), Services (later), SRE, Performance Issue, Traffic
BBlack moved T133178: RESTBase support for www.wikimedia.org missing from General to Bug Reports on the Traffic board.
Sep 29 2020, 8:51 PM · Platform Team Legacy (Later), SRE, Traffic, Services (next), RESTBase-API, RESTBase
BBlack moved T177742: Investigate Chrony as a replacement for ISC ntpd from General to Epic Ideas on the Traffic board.
Sep 29 2020, 8:50 PM · Patch-For-Review, Traffic, SRE
BBlack removed a project from T185239: Puppet hosts with signed certificate present on agent but not master: Traffic.

lvs100[789] don't exist anymore, removing Traffic from this.

Sep 29 2020, 8:49 PM · User-herron, Puppet, SRE
BBlack updated the task description for T185239: Puppet hosts with signed certificate present on agent but not master.
Sep 29 2020, 8:49 PM · User-herron, Puppet, SRE
BBlack moved T191017: Unwanted service startups and their triggers from General to Epic Ideas on the Traffic board.
Sep 29 2020, 8:47 PM · Traffic, SRE
BBlack moved T228533: Fix geoip updaters for new MaxMind hashed keys by 2019-08-15 from General to Feature Requests on the Traffic board.
Sep 29 2020, 8:45 PM · Analytics-Radar, User-jbond, Traffic, SRE
BBlack moved T230075: Setting up static maintenance page on Foundation servers for Foundation website from General to Epic Ideas on the Traffic board.
Sep 29 2020, 8:44 PM · Security, Traffic, wikimediafoundation.org, SRE
BBlack moved T175691: Geoip lookup - Misidentifying country due to travelling from General to Bug Reports on the Traffic board.
Sep 29 2020, 8:42 PM · SRE, Traffic, FR-Q2-FY2019-20-cleanup-list, Fundraising-Backlog, MediaWiki-extensions-CentralNotice
BBlack removed a project from T238803: Retire fixcopyright.wikimedia.org: Traffic.
Sep 29 2020, 8:40 PM · Release-Engineering-Team-TODO, Projects-Cleanup, fixcopyright.wikimedia.org, Wiki-Setup (Delete / Redirect), SRE
BBlack moved T206951: Puppet doesn't restart ferm on failure from General to Bug Reports on the Traffic board.
Sep 29 2020, 8:37 PM · Sustainability (Incident Followup), Traffic, SRE
BBlack moved T209785: INMARSAT geolocates to the UK, leading to requests going to esams from General to Bug Reports on the Traffic board.
Sep 29 2020, 8:37 PM · SRE, Traffic
BBlack moved T215071: Merge Wikipedia subdomains into one, to discourage censorship from General to Epic Ideas on the Traffic board.
Sep 29 2020, 8:37 PM · Domains, DNS, Traffic, SRE, HTTPS
BBlack moved T120085: RFC: Serve Main Page of Wikimedia wikis from a consistent URL from General to Epic Ideas on the Traffic board.
Sep 29 2020, 8:36 PM · Readers-Web-Backlog (Tracking), Fundraising-Backlog, Editing-team, Parsing-Team--ARCHIVED, User-notice, Platform Engineering, Performance-Team, SRE, Traffic, TechCom-RFC, SEO, Wikimedia-Site-requests
BBlack moved T236208: interface-rps.py should have a flag to avoid CPU0 from General to Feature Requests on the Traffic board.
Sep 29 2020, 8:36 PM · SRE, Traffic
BBlack moved T237243: Network unreachable after network-online.target is brought up from General to Bug Reports on the Traffic board.
Sep 29 2020, 8:35 PM · netops, SRE, Traffic
BBlack moved T240866: Create a system for distributed shared secret material to server tmps from General to Epic Ideas on the Traffic board.
Sep 29 2020, 8:34 PM · SRE, Traffic
BBlack moved T246902: switch to irate() instead of rate() for traffic graphs from General to Epic Ideas on the Traffic board.
Sep 29 2020, 8:34 PM · SRE, Traffic
BBlack moved T250251: Audit and harmonize timeouts across the stack from General to Epic Ideas on the Traffic board.
Sep 29 2020, 8:30 PM · DBA, Traffic, serviceops, SRE
BBlack moved T253655: Document and/or improve navigation of the various HTTP frontend Grafana dashboards from General to Epic Ideas on the Traffic board.
Sep 29 2020, 8:29 PM · Sustainability (Incident Followup), Performance-Team (Radar), Traffic, SRE, observability
BBlack moved T257324: Consolidate edge bastion server into ganeti from General to Epic Wishlist on the Traffic board.
Sep 29 2020, 8:28 PM · Patch-For-Review, Traffic, SRE
BBlack moved T257323: Consolidate misc servers at edge sites from General to Epic Wishlist on the Traffic board.
Sep 29 2020, 8:28 PM · SRE, Traffic
BBlack moved T264021: ~1 request/minute to intake-logging.wikimedia.org times out at the traffic/service interface from Triage to Bug Reports on the Traffic board.
Sep 29 2020, 8:27 PM · Analytics-Kanban, Traffic, Analytics, SRE
BBlack moved T238305: Servers freezing across the caching cluster from Hardware to Watching on the Traffic board.
Sep 29 2020, 8:21 PM · SRE, Traffic
BBlack moved T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems from Hardware to Watching on the Traffic board.
Sep 29 2020, 8:16 PM · DC-Ops, Traffic, ops-esams, SRE

Sep 24 2020

BBlack added a comment to T263212: Consider balancing VRRP primaries to cr1/cr2.

Ideally we would take the links state into consideration: If the twin link is down alert at 80%, if it's up alert when the sum is at 80% of the ifSpeed of a single link. Which might be doable with LibreNMS custom SQL alerts.

Sep 24 2020, 12:23 PM · SRE, netops
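
The alerting rule described in the T263212 comment above can be written out as a small predicate. This is a sketch in Python rather than a LibreNMS custom SQL alert. The function and parameter names are hypothetical, and it assumes both links have the same ifSpeed.

```python
def should_alert(link_bps, twin_bps, if_speed_bps, twin_up, threshold=0.8):
    """Decide whether a link in a redundant pair should alert.

    If the twin link is down, alert when this link alone reaches the
    threshold fraction of ifSpeed. If the twin is up, alert when the
    combined traffic of both links reaches that fraction of a single
    link's ifSpeed, since one link could not absorb a failover.
    """
    limit = threshold * if_speed_bps
    if twin_up:
        return link_bps + twin_bps >= limit
    return link_bps >= limit
```

With two 10 Gbps links carrying 5 Gbps and 4 Gbps, the sum (9 Gbps) exceeds 80% of a single link, so the pair alerts even though neither link is busy on its own.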

Sep 23 2020

BBlack closed T235736: cp3032 and cp3040 occasional failed fetches as Declined.

Probably related to the transient memory issues discussed in various tickets: T164768 T165063 T249809 . In any case this is almost a year old with no investigation, at least needs a fresher report at this point.

Sep 23 2020, 5:58 PM · SRE, Traffic
BBlack closed T226776: mobile commons GET dying in Varnish layer(?) under oddly specific conditions as Declined.

Declining for now, as multiple implicated parts of the software stack have changed significantly since this report, and nothing was conclusively found back then. Please file a new one if there's more to look at with current errors!

Sep 23 2020, 5:53 PM · SRE, Traffic
BBlack closed T226375: Investigate esams text varnish backend fetch failures as Resolved.

Long-ago dealt with it looks like, and in any case varnish-be doesn't exist anymore.

Sep 23 2020, 5:52 PM · Patch-For-Review, SRE, Traffic