BBlack (Brandon Black)
WMF Operations Engineer

Projects

User Details

User Since
Nov 4 2014, 4:29 PM (138 w, 1 d)
Availability
Available
IRC Nick
bblack
LDAP User
BBlack
MediaWiki User
BBlack (WMF)

Recent Activity

Today

BBlack added a comment to T155806: Add CAA records to our domains.

ssllabs confirms the expected changes above. The sslmate generator doesn't allow for custom entries to get globalsign.com in early (as a likely guess for when they flip their switch before the impending CA/B deadline).

Thu, Jun 29, 2:41 PM · Patch-For-Review, HTTPS, Traffic, Operations

Tue, Jun 27

BBlack closed T164610: Unprovision cache_misc @ ulsfo as Resolved.

I had to manually fix up salt keys and do final reboots on 4001+4003, all should be sane and consistent now (except for a couple of IPMI temp checks showing UNKNOWN in icinga).

Tue, Jun 27, 11:22 PM · Patch-For-Review, Operations, Traffic
BBlack closed T164610: Unprovision cache_misc @ ulsfo, a subtask of T164327: replace ulsfo aging servers, as Resolved.
Tue, Jun 27, 11:22 PM · Patch-For-Review, Traffic, Operations, ops-ulsfo
BBlack created T169020: Decommission cp400[1-4].
Tue, Jun 27, 11:00 PM · ops-ulsfo, hardware-requests, Operations
BBlack added a comment to T168919: stream.wikimedia.org: remove legacy rcstream/socket.io HTTPS redirect hole punches.

Answering my own timeline question, it looks like it was announced that RCStream goes away July 7th!

Tue, Jun 27, 6:01 PM · Operations, Traffic
BBlack added a parent task for T156919: Port RCStream clients to EventStreams: T168919: stream.wikimedia.org: remove legacy rcstream/socket.io HTTPS redirect hole punches.
Tue, Jun 27, 5:59 PM · Analytics-Kanban, Wikimedia-Stream
BBlack added a subtask for T168919: stream.wikimedia.org: remove legacy rcstream/socket.io HTTPS redirect hole punches: T156919: Port RCStream clients to EventStreams.
Tue, Jun 27, 5:59 PM · Operations, Traffic
BBlack moved T168919: stream.wikimedia.org: remove legacy rcstream/socket.io HTTPS redirect hole punches from Triage to TLS on the Traffic board.
Tue, Jun 27, 5:59 PM · Operations, Traffic
BBlack updated subscribers of T161517: Allow anonymous users to change interface language on Commons with ULS.

Ok I think I was confused as to the state of the uselang hack. It looks like it already works, in uncacheable form, on all wikis? When I try it on enwiki with uselang set to es, I get an uncacheable copy of the article with the UI header stuff in Spanish, the content-language response header still set to en, and of course the article text still in en. On that front, it would still be an improvement if uselang outputs were cacheable, IMHO.

Tue, Jun 27, 2:30 PM · Operations, Traffic, Patch-For-Review, Commons, Wikimedia-Site-requests, I18n

Mon, Jun 26

BBlack added a comment to T168919: stream.wikimedia.org: remove legacy rcstream/socket.io HTTPS redirect hole punches.

(Note also that ori did a soft announce of the HTTPS transition for it about a year ago, but with no target date for disabling plain HTTP: https://lists.gt.net/wiki/wikitech/719999 . This was around the same time RCStream's wikitech docs had their URLs switched to HTTPS as well.)

Mon, Jun 26, 11:52 PM · Operations, Traffic
BBlack updated subscribers of T168919: stream.wikimedia.org: remove legacy rcstream/socket.io HTTPS redirect hole punches.

@Ottomata - Any high level new info about timetables for deprecating and then removing the RCStream stuff in favor of EventStreams ( T130651 )? If it looks like it might drag on a while, we might want to go back to the idea of announcing an HTTPS-only transition ahead of the removal, perhaps. The main issue there was that at least some RCStream clients don't seem to follow redirects to HTTPS, and therefore would need manual updates of their configs to use https:// or wss:// or they get broken.

Mon, Jun 26, 11:50 PM · Operations, Traffic
BBlack updated the task description for T104681: HTTPS Plans (tracking / high-level info).
Mon, Jun 26, 11:27 PM · Tracking, Operations, Traffic, HTTPS
BBlack updated the task description for T104681: HTTPS Plans (tracking / high-level info).
Mon, Jun 26, 11:26 PM · Tracking, Operations, Traffic, HTTPS
BBlack updated the task description for T104681: HTTPS Plans (tracking / high-level info).
Mon, Jun 26, 11:26 PM · Tracking, Operations, Traffic, HTTPS
BBlack updated the task description for T104681: HTTPS Plans (tracking / high-level info).
Mon, Jun 26, 11:13 PM · Tracking, Operations, Traffic, HTTPS
BBlack added a subtask for T92002: implement Public Key Pinning (HPKP) for Wikimedia domains: T148131: Deploy redundant unified certs.
Mon, Jun 26, 11:09 PM · Operations, Traffic, HTTPS
BBlack added a parent task for T148131: Deploy redundant unified certs: T92002: implement Public Key Pinning (HPKP) for Wikimedia domains.
Mon, Jun 26, 11:09 PM · Wikimedia-Incident, Traffic, Operations
BBlack removed a parent task for T153563: Consider switching to HTTPS for Wikidata query service links: T104681: HTTPS Plans (tracking / high-level info).
Mon, Jun 26, 11:08 PM · Operations, Traffic, HTTPS, Wikidata, Discovery, Wikidata-Query-Service
BBlack removed a subtask for T104681: HTTPS Plans (tracking / high-level info): T153563: Consider switching to HTTPS for Wikidata query service links.
Mon, Jun 26, 11:08 PM · Tracking, Operations, Traffic, HTTPS
BBlack removed a parent task for T148131: Deploy redundant unified certs: T92002: implement Public Key Pinning (HPKP) for Wikimedia domains.
Mon, Jun 26, 11:07 PM · Wikimedia-Incident, Traffic, Operations
BBlack removed a subtask for T92002: implement Public Key Pinning (HPKP) for Wikimedia domains: T148131: Deploy redundant unified certs.
Mon, Jun 26, 11:07 PM · Operations, Traffic, HTTPS
BBlack removed a parent task for T92002: implement Public Key Pinning (HPKP) for Wikimedia domains: T104681: HTTPS Plans (tracking / high-level info).
Mon, Jun 26, 11:07 PM · Operations, Traffic, HTTPS
BBlack removed a subtask for T104681: HTTPS Plans (tracking / high-level info): T92002: implement Public Key Pinning (HPKP) for Wikimedia domains.
Mon, Jun 26, 11:07 PM · Tracking, Operations, Traffic, HTTPS
BBlack added a comment to T104681: HTTPS Plans (tracking / high-level info).

The original point of this (now ~2 years old) tracking task was to track the very long tail of known but relatively-minor issues preventing us from reaching a full transition to modern HTTPS-only for all things public-facing that the Foundation has control over, in the wake of the transition for our major public hostnames announced in https://blog.wikimedia.org/2015/06/12/securing-wikimedia-sites-with-https/ . This ticket has a danger of becoming an undead meta-task as more sub-tasks might topically accrete to it over time. It's not intended to replace the idea of a tag or a workboard column, after all, and we have such a thing in the TLS column of the Traffic workboard for tracking all other ongoing TLS improvements. The remaining set of valid open tasks is fairly small at this point and a few of them are near closing, so we're going to try to wind this ticket down completely over the next several months, and I'm going to unlink a few on categorical grounds today.

Mon, Jun 26, 11:06 PM · Tracking, Operations, Traffic, HTTPS
BBlack added a subtask for T104681: HTTPS Plans (tracking / high-level info): T168919: stream.wikimedia.org: remove legacy rcstream/socket.io HTTPS redirect hole punches.
Mon, Jun 26, 11:02 PM · Tracking, Operations, Traffic, HTTPS
BBlack added a parent task for T168919: stream.wikimedia.org: remove legacy rcstream/socket.io HTTPS redirect hole punches: T104681: HTTPS Plans (tracking / high-level info).
Mon, Jun 26, 11:02 PM · Operations, Traffic
BBlack created T168919: stream.wikimedia.org: remove legacy rcstream/socket.io HTTPS redirect hole punches.
Mon, Jun 26, 11:01 PM · Operations, Traffic
BBlack closed T70528: stream.wikimedia.org - redirect http(s) to docs as Resolved.

This has been working for some time, at least for the HTTPS issue at the root as tasked here! The other part about docs probably isn't relevant anymore, as the service is being replaced.

Mon, Jun 26, 10:59 PM · Operations, Traffic, Wikimedia-Stream
BBlack closed T131131: Canonical URL in Store points to HTTP address, should be HTTPS as Resolved.

Currently this looks to be fixed. The relevant snippet on the live store site is now:

<script>
        if (window.location.protocol == "http:") {
                var restOfUrl = window.location.href.substr(5);
                window.location.replace("https:" + restOfUrl);
        }
</script>
<link rel="canonical" href="https://store.wikimedia.org/" />

(and yes, they do seem to properly 301-redirect HTTP to HTTPS, so I'm not sure why the hacky JS protocol redirect is still there, but it doesn't hurt anything)

Mon, Jun 26, 10:27 PM · Operations, HTTPS, Wikimedia-Shop, Traffic
BBlack closed T131131: Canonical URL in Store points to HTTP address, should be HTTPS, a subtask of T128559: store.wikimedia.org HTTPS issues, as Resolved.
Mon, Jun 26, 10:27 PM · Operations, Traffic, Wikimedia-Shop, HTTPS
BBlack added a comment to T128559: store.wikimedia.org HTTPS issues.

@Jseddon @MBeat33 - ping again? The redirect appears to work currently, but still no HSTS header.

Mon, Jun 26, 10:24 PM · Operations, Traffic, Wikimedia-Shop, HTTPS
BBlack added a comment to T137161: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains.

Yeah we can close this task if the sites are gone. We'll want to remove the current IP address mapping for these hostnames from our DNS when this happens (or now, if they're already no longer in use), to complete the removal of the concern. Is it just benefactorevents.wikimedia.org and eventdonations.wikimedia.org that are going away, or also others like benefactors.wikimedia.org?

Mon, Jun 26, 10:07 PM · Traffic, Operations
BBlack added a comment to T166782: wikimediafoundation.org's language selector is confusing to most visitors who don't have accounts there.

I guess that the traffic to wikimediafoundation.org is relatively low, and anonymous selection can be enabled. There's a long discussion about enabling it on Commons, where the traffic is higher and there are concerns about caching issues, but it shouldn't be a problem on wikimediafoundation.org. If ops don't object, it should be done.

Mon, Jun 26, 7:18 PM · Operations, Wikimedia-General-or-Unknown, I18n
BBlack added a comment to T161517: Allow anonymous users to change interface language on Commons with ULS.

This task has gotten a bit confusing. Stepping back a bit from the specific case of Commons (because I think the same issues apply everywhere?)... let me try to recap here a little, and correct me please if I've gotten some of this wrong:

Mon, Jun 26, 7:11 PM · Operations, Traffic, Patch-For-Review, Commons, Wikimedia-Site-requests, I18n

Fri, Jun 23

BBlack added a comment to T164768: Explicitly limit varnishd transient storage.

On text, transient storage usage seems pretty reasonable; we could cap as follows, leaving plenty of room for spikes:

cache_type    layer       cap
text          frontend    5G
text          backend     2G
Fri, Jun 23, 2:46 PM · Patch-For-Review, Operations, Traffic

Wed, Jun 21

BBlack added a comment to T156256: Select or Acquire Address Space for Asia Cache DC.

No updates yet, we're still finalizing DC vendor selection (one of several steps before APNIC will possibly give us new address space). Any firmer timeline on how long after acquisition of the address space the partner update process will take?

Wed, Jun 21, 8:17 PM · Traffic, Operations
BBlack added a comment to T133178: RESTBase support for www.wikimedia.org missing.

I think it's fair to at least put forward arguments for a 3rd point of view as well:

Wed, Jun 21, 4:09 PM · Operations, Traffic, Services (next), RESTBase-API, RESTBase
BBlack added a comment to T168529: Upgrade to Varnish 5.

We've discussed the V5 upgrade a few times in the past (although not much recently or explicitly), so I'll try to recap here some related thoughts:

Wed, Jun 21, 2:57 PM · Performance-Team, Operations, Traffic

Mon, Jun 19

BBlack added a comment to T168033: Json queries fail "Too Many Requests".

But I don't see how it is reasonable to fail requests when some metric is exceeded, vs. delaying responses.

Mon, Jun 19, 2:35 PM · Operations, Wikimedia-General-or-Unknown
BBlack added a comment to T167920: Impending load test.

We've also been tweaking and tuning our ratelimits in general to try to find a happy medium. Both of the API endpoints should now be limiting at the same rate of 1000 reqs per 10s per client IP (as a burstable token bucket filter).
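
As an illustration of what "1000 reqs per 10s per client IP (as a burstable token bucket filter)" can mean, here is a minimal Python sketch. It is not the production implementation (which runs at the Varnish layer); the class names and the dict keyed by client IP are assumptions made for the example. Each client starts with a full bucket of 1000 tokens, so a burst of up to 1000 requests is allowed, and tokens then refill at 100 per second.

```python
import time


class TokenBucket:
    """Burstable token bucket: holds up to `capacity` tokens, refilled continuously."""

    def __init__(self, capacity, period_s, now=None):
        self.capacity = float(capacity)
        self.rate = capacity / period_s      # tokens added per second
        self.tokens = float(capacity)        # start full, so bursts are allowed
        self.stamp = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Consume one token if available; return True if the request may proceed."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.stamp)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.stamp = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerClientLimiter:
    """One bucket per client IP (hypothetical helper, for illustration only)."""

    def __init__(self, capacity=1000, period_s=10.0):
        self.capacity = capacity
        self.period_s = period_s
        self.buckets = {}

    def allow(self, client_ip, now=None):
        bucket = self.buckets.get(client_ip)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.period_s, now)
            self.buckets[client_ip] = bucket
        return bucket.allow(now)
```

A caller would answer with 429 (Too Many Requests) whenever `allow()` returns False. The `now` argument exists only so the behaviour can be checked with a fake clock.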

Mon, Jun 19, 2:28 PM · Traffic, Wikimedia-General-or-Unknown, Operations

Fri, Jun 16

BBlack added a comment to T167400: Disable serving unpatrolled new files to Wikipedia Zero users.

Why restrict this mechanism to Zero, making Zero different from other access? We could instead deny access to unpatrolled files for users that aren't logged-in.

But then Commons would no longer be a wiki.

Fri, Jun 16, 3:01 PM · Traffic, Operations, media-storage, Commons, Multimedia, Zero
BBlack added a comment to T167400: Disable serving unpatrolled new files to Wikipedia Zero users.

Wikipedia Zero traffic is tied to IP addresses, not users. So it definitely could be performant. Have MediaWiki set an unpatrolled header and purge on patrol. Then (somehow) configure Varnish to understand WP0 IP ranges and block if the unpatrolled header is set.

So the idea would be:

  • MediaWiki sets something along the lines of MediaWiki-patrol-status: unpatrolled in File::getContentHeaders()
  • Varnish looks for that header when getting files from swift. If the file is unpatrolled, and (maybe) it's above a certain size, and the IP address is Zero-rated: give a 403. Also make sure that cache varying is set for unpatrolled files based on Zero-ratedness of IP
  • On patrol, MediaWiki makes swift backend remove the header, and sends purge to varnish.

    Downsides: If anyone using a zero-rated connection is a file patroller, they won't be able to see the file.
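
The decision logic in the steps above can be sketched in Python as follows. This is only a model of the proposal, not VCL and not anything deployed: the IP range is a documentation placeholder rather than a real Zero-rated range, the size threshold is invented, and the function names are hypothetical. Only the header name and the 403 come from the proposal itself.

```python
import ipaddress

# Placeholder range (RFC 5737 documentation block), NOT a real Zero-rated range.
ZERO_RATED_NETS = [ipaddress.ip_network("192.0.2.0/24")]
SIZE_LIMIT_BYTES = 1_000_000  # hypothetical "above a certain size" threshold
PATROL_HEADER = "MediaWiki-patrol-status"  # header name taken from the proposal


def is_zero_rated(client_ip):
    """True if the client address falls inside any configured Zero-rated range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ZERO_RATED_NETS)


def is_unpatrolled(headers):
    """True if the backend response carries the unpatrolled marker."""
    return headers.get(PATROL_HEADER) == "unpatrolled"


def response_status(client_ip, headers, size_bytes):
    """403 for large unpatrolled files requested over Zero-rated IPs, else 200."""
    if (is_unpatrolled(headers)
            and size_bytes > SIZE_LIMIT_BYTES
            and is_zero_rated(client_ip)):
        return 403
    return 200


def cache_variant(url, client_ip, headers):
    """Vary the cache key on Zero-ratedness, but only for unpatrolled files."""
    if is_unpatrolled(headers):
        return (url, is_zero_rated(client_ip))
    return (url, None)
```

Once a file is patrolled and purged, the header disappears and every client shares a single cache variant again.
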
Fri, Jun 16, 2:28 PM · Traffic, Operations, media-storage, Commons, Multimedia, Zero

Thu, Jun 15

BBlack added a comment to T167842: Find a new PIM RP IP.

Multicast has its uses in general. Even if we kill HTCP another use may pop up. I did a quick survey to try to find active uses. Filtering for just v4, removing the standard references you'd see to 224.0.0.1 everywhere, and ignoring the expected cp machines' multicast address as documented at https://wikitech.wikimedia.org/wiki/Multicast_HTCP_purging#Multicast_Addressing , these were the surprises:

Thu, Jun 15, 8:16 PM · Operations, netops
BBlack created T167966: Look into feasibility of disabling sha-1 host keys on our ssh daemons.
Thu, Jun 15, 1:37 PM · Operations
BBlack added a comment to T118365: Increase request limits for GETs to /api/rest_v1/.

That top client appears to be CrossRefEventDataBot from https://www.crossref.org/services/event-data/ , running on a hosted server at Hetzner in DE.

Thu, Jun 15, 12:30 PM · Analytics, Operations, Traffic
BBlack added a comment to T167920: Impending load test.

we actually haven't been given the numbers yet, but I expect us to handle a few million requests over the course of a couple of hours. We only make calls to you on cache miss, so all of the traffic won't roll over, but from what we saw last night we tend to cache miss on 3-9% of requests.

Thu, Jun 15, 12:16 AM · Traffic, Wikimedia-General-or-Unknown, Operations

Wed, Jun 14

BBlack added a comment to T167840: Merge AS14907 with AS43281.

What are the real pros and cons on this? We could even go in the other direction and have a unique ASN per region/continent. How does this impact future anycasting? Note https://tools.ietf.org/html/rfc6382 talks about best practice for anycast being to have distinct ASNs per region, but I don't pretend to understand all the finer details and arguments in that RFC.

Wed, Jun 14, 1:22 AM · Operations, netops

Tue, Jun 13

BBlack added a comment to T118557: Replace Analytics XFF/client.ip data with X-Client-IP.

No, I don't think we need it for non-immediate analysis like this. We still have zero, zeronet and proxy in the X-Analytics string too (which is also in webreq data).

Tue, Jun 13, 3:31 PM · Analytics-Kanban, Patch-For-Review, Traffic, Operations

Mon, Jun 12

BBlack added a comment to T167691: High amount of unexpected ICMP dest unreachable toward esams cache clusters.

The cp* should at least occasionally be sending normal ICMP responses correlated with their TCP flows, e.g. "Time Exceeded" and such. The LVSes are configured to schedule inbound ICMPs to the caches as well, IIRC. As for the original problem, the inbound Dest Unreach aren't necessarily in response to ICMP echo or similar. I think they could be the natural result of TCP SYN spoofing towards us (e.g. attacker sends spoofed SYN to us with unreachable source address, we send back SYN+ACK to said unreachable dest address, then some router sends us an ICMP Dest Unreach in response to our SYN+ACK).

Mon, Jun 12, 7:42 PM · netops, Operations, Traffic

Fri, Jun 9

BBlack added a project to T167492: Accessing zh-classical.wikipedia.org on a mobile device does not redirect to zh-classical.m.wikipedia.org: Traffic.

I think this is because our mobile-redirect logic doesn't support dashes, but probably should due to these cases? Patch above proposed, if it makes logical sense to support mobile-redirects on "language" subdomains containing dashes (\w is only alphanumerics and underscore).
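
To make the `\w` point concrete, here is a small Python sketch. The two patterns are hypothetical stand-ins, not the actual redirect rule from the patch; they only show that `\w+` fails on a dashed subdomain while `[\w-]+` accepts it.

```python
import re

# Hypothetical patterns for illustration; the real rule lives in the server config.
WORD_ONLY = re.compile(r"^(\w+)\.wikipedia\.org$", re.ASCII)
WITH_DASH = re.compile(r"^([\w-]+)\.wikipedia\.org$", re.ASCII)


def mobile_host(host, pattern):
    """Return the mobile hostname for `host`, or None if the pattern does not match."""
    match = pattern.match(host)
    if match is None:
        return None
    return f"{match.group(1)}.m.wikipedia.org"


for host in ("en.wikipedia.org", "zh-classical.wikipedia.org"):
    print(host, mobile_host(host, WORD_ONLY), mobile_host(host, WITH_DASH))
```

With `WORD_ONLY`, `zh-classical.wikipedia.org` gets no redirect target because `-` is outside `\w`; with `WITH_DASH` it maps to `zh-classical.m.wikipedia.org`.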

Fri, Jun 9, 2:29 PM · Operations, Traffic, Patch-For-Review, Wikimedia-Apache-configuration, Mobile

Thu, Jun 8

BBlack closed T162132: cp3003 network interface issues as Declined.

cp3003 is decomming for good in T167376

Thu, Jun 8, 3:58 AM · Traffic, Operations, ops-esams
BBlack created T167377: Decommission cp4011, cp4012, cp4019, cp4020 .
Thu, Jun 8, 3:56 AM · ops-ulsfo, hardware-requests, Operations
BBlack created T167376: Decommission cp300[3456].
Thu, Jun 8, 3:55 AM · hardware-requests, Operations, ops-esams
BBlack closed T164608: Merge cache_maps into cache_upload functionally as Resolved.
Thu, Jun 8, 3:51 AM · Patch-For-Review, Traffic, Operations

Wed, Jun 7

BBlack added a comment to T163233: Implement Varnish-level rough ratelimiting.

Stared at hashtable implementation some more, as well as the linux iptables hashlimit one (which I consider a sort of baseline canonical efficient implementation). The linux one is definitely more-configurable, but I think we can work with what vsthrottle gives us in terms of configurability of the bucket itself. The lack of a cost parameter is still an issue, but it's a long-term issue if/when we get to the point where we define a header that applications can send in their response to indicate heavier costs.

Wed, Jun 7, 3:32 PM · Analytics, Patch-For-Review, Traffic, Operations
BBlack added a comment to T163233: Implement Varnish-level rough ratelimiting.

re: vsthrottle, my thoughts after a quick look this morning:

Wed, Jun 7, 2:50 PM · Analytics, Patch-For-Review, Traffic, Operations
BBlack added a project to T167299: Upgrade BIOS/RBSU/etc on lvs1007: ops-eqiad.
Wed, Jun 7, 2:05 PM · ops-eqiad, Traffic, netops, Operations
BBlack created T167299: Upgrade BIOS/RBSU/etc on lvs1007.
Wed, Jun 7, 2:05 PM · ops-eqiad, Traffic, netops, Operations
BBlack reopened T167299: Upgrade BIOS/RBSU/etc on lvs1007, a subtask of T150256: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006, as Open.
Wed, Jun 7, 2:05 PM · Patch-For-Review, Traffic, netops, Operations

Tue, Jun 6

BBlack added a comment to T108435: Add proper expiry headers to kartotherian's responses.

My understanding from @BBlack is that we don't want to add application specific configuration in varnish, my understanding from @MaxSem is that this is something that is already done for multiple applications.

Tue, Jun 6, 10:03 PM · Maps (Kartotherian), Discovery, Interactive-Sprint

Mon, Jun 5

BBlack added a comment to T167046: Map tiles load way slower than before.

Another thought - could we be maxing out parallel connections to the kartotherian machines? We've always had a max_connections of 1000 (per varnish backend, to kartotherian), but the number of varnish backends has multiplied with this change, which could have multiplied the overall limit on connections opened from varnish to kartotherian, and then hit some limit causing further connections to stall?

Mon, Jun 5, 7:57 PM · Operations, Regression, Traffic, Maps (Kartographer), Interactive-Sprint
BBlack added a comment to T167046: Map tiles load way slower than before.

Are you comparing cache hits to cache misses? From where? What was the timing like before?

Mon, Jun 5, 7:46 PM · Operations, Regression, Traffic, Maps (Kartographer), Interactive-Sprint

Fri, Jun 2

BBlack added a comment to T166888: CI for operations/puppet is taking too long.

What setting were we on before we moved to FF-only? Whatever the prior setting was, it tended to create spurious merge commits all over our actual git history which makes a mess of it, like: https://github.com/wikimedia/puppet/commit/1df0dab661e57fc73942defa4ba4a57436b1b240 . Is "Rebase if necessary" new-ish since then or did we just fail to know it was the optimal option?

Fri, Jun 2, 5:09 PM · Release-Engineering-Team (Kanban), Patch-For-Review, Operations, Continuous-Integration-Infrastructure
BBlack added a comment to T150256: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006.

[lvs1011 above just had some minor salt keying issues, fixed+rebooted]

Fri, Jun 2, 4:54 PM · Patch-For-Review, Traffic, netops, Operations

Thu, Jun 1

BBlack created P5528 lvs1007 lspci stuff.
Thu, Jun 1, 4:28 PM · Operations
BBlack added a comment to T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan.

Yeah that was the plan, for XKey to help here by consolidating that down to a single HTCP / PURGE per article touched. It's not useful for the mass-scale case (e.g. template/link references), as it doesn't scale well in that direction. But for the case like "1 article == 7 URLs for different formats/variants/derivatives" it should work great. The varnish module for it is deployed, but we haven't ever found/made the time to loop back to actually using it (defining standards for how to transmit it over the existing HTCP protocol or the new EventBus and pushing developers to make use of it). I think last we talked we were going to move cache-purge traffic over to EventBus before tackling this (with kafka consumers on the cache nodes pulling the purges), but I'm not sure what the relative timelines on all related projects look like anymore.

Thu, Jun 1, 2:32 PM · Performance-Team, Wikidata, MediaWiki-Cache, MediaWiki-JobQueue, Traffic, Operations

Wed, May 31

BBlack added a comment to T145661: varnish backends start returning 503s after ~6 days uptime.

This is what we had before (copied from far above):

Wed, May 31, 2:09 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan.

We can get broader averages by dividing the values seen in the aggregate client status code graphs using eqiad's text cluster (the remote sites would expect fewer due to some of the bursts being more likely to be dropped by the network)

Wed, May 31, 11:14 AM · Performance-Team, Wikidata, MediaWiki-Cache, MediaWiki-JobQueue, Traffic, Operations

Tue, May 30

BBlack added a comment to T155806: Add CAA records to our domains.

We talked a bit on IRC. Probably the first step is to include all the canonical domains (the 14-domain set in our big unified cert):

Tue, May 30, 11:19 PM · Patch-For-Review, HTTPS, Traffic, Operations
BBlack added a comment to T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan.

The lack of graph data from falling off the history is a sad commentary on how long this has remained unresolved :(

Tue, May 30, 11:08 PM · Performance-Team, Wikidata, MediaWiki-Cache, MediaWiki-JobQueue, Traffic, Operations
BBlack added a comment to T155806: Add CAA records to our domains.

(even the non-canonicals, IMHO).

Tue, May 30, 4:48 PM · Patch-For-Review, HTTPS, Traffic, Operations
BBlack added a comment to T155806: Add CAA records to our domains.

Why start with just wikipedia and wikimedia? We could go after our lower-traffic domains first as a test, but since we don't issue individual certs for them there's no real functional testing to happen there. Perhaps we could use a lesser domain just to validate that other tools parse and validate the CAA correctly from the public DNS view, though. Once we turn on the big two, we may as well turn on all the others as well (even the non-canonicals, IMHO).

Tue, May 30, 4:47 PM · Patch-For-Review, HTTPS, Traffic, Operations

May 30 2017

BBlack closed T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+) as Resolved.

There is an apparent performance improvement that coincides in timing, but on a simulated slow internet connection:

T166373: Investigate apparent performance improvement around 2017-05-24

The improvement is only experienced on the large articles + slow connection combo. Could that be it?

May 30 2017, 1:55 PM · Patch-For-Review, Performance-Team, Traffic, Operations

May 26 2017

BBlack created T166397: Cumin fails on huge nodelists emitted by its own outputs.
May 26 2017, 5:16 PM · Operations-Software-Development

May 25 2017

BBlack added a comment to T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+).

So I've stared at NavTiming graphs, and honestly it's hard to read any notable difference in the tea leaves.

May 25 2017, 4:00 PM · Patch-For-Review, Performance-Team, Traffic, Operations
BBlack added a comment to T118557: Replace Analytics XFF/client.ip data with X-Client-IP.

From my perspective, where we last stalled out is waiting for Analytics to say it's ok to merge https://gerrit.wikimedia.org/r/#/c/253474 (which removes XFF data from the webrequest stream). I'm not sure what blockers/deps/validation may be pending over in the Analytics side before that.

May 25 2017, 1:10 PM · Analytics-Kanban, Patch-For-Review, Traffic, Operations

May 24 2017

BBlack added a comment to T166229: Mediawiki replies with 500 on wrongly formatted CSP report.

Agreed this should be 4xx rather than 5xx

May 24 2017, 5:12 PM · MW-1.30-release-notes (WMF-deploy-2017-06-06_(1.30.0-wmf.4)), User-fgiunchedi, Operations, Traffic, Security-Team, MediaWiki-General-or-Unknown

May 22 2017

BBlack added a comment to T162850: CPU throttling on DELL PowerEdge R320.

acamar hit this again on Sunday, in spite of the (working) acpi_pad blacklist. A simple reboot seems to have cleared it. The next-best advice (based on that old Dell info) would be to blacklist mei. I've rmmod'd it on acamar for now to see if it causes additional issues before we try blacklisting it on all.

May 22 2017, 6:47 PM · Patch-For-Review, Operations

May 20 2017

BBlack added a comment to T137161: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains.

@Jgreen - re: civicrm, it needs to emit the HSTS header on all HTTPS responses.

May 20 2017, 3:19 AM · Traffic, Operations

May 19 2017

BBlack reopened T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan, a subtask of T133821: Content purges are unreliable, as Open.
May 19 2017, 3:26 PM · Operations, Traffic
BBlack reopened T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan as "Open".

Not resolved, as the purge graphs can attest!

May 19 2017, 3:26 PM · Performance-Team, Wikidata, MediaWiki-Cache, MediaWiki-JobQueue, Traffic, Operations
BBlack updated the task description for T165765: Refactor pybal/LVS config for shared failover.
May 19 2017, 2:21 PM · Patch-For-Review, Operations, Traffic
BBlack created T165765: Refactor pybal/LVS config for shared failover.
May 19 2017, 2:20 PM · Patch-For-Review, Operations, Traffic
BBlack created T165764: Fully-redundant LVS clusters using Pybal per-service MED feature.
May 19 2017, 2:04 PM · Pybal, Operations, Traffic

May 18 2017

BBlack added a comment to T163674: Frequent RST returned by appservers to LVS hosts.

So from the above, apache really has 3 different modes of operation:

May 18 2017, 9:20 PM · Pybal, Traffic, Operations, netops
BBlack added a comment to T96852: Define 3-host infra cluster for traffic pops.

The tentative and limited plan for now is to deploy 3x misc/infra hosts (meaning all the hosts other than lvs and cp) at each cache site and not use virtualization. We might revisit this at a later date. The basic layout looks like:

May 18 2017, 2:45 PM · Operations, Traffic
BBlack added a comment to T150256: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006.

I'm probably backtracking into territory that was once known here, but after the long delay I felt I had to go back and re-validate what's going on with the ports and the thinking.

May 18 2017, 1:49 PM · Patch-For-Review, Traffic, netops, Operations

May 17 2017

BBlack added a comment to T165614: LLDP on cache hosts.

Answering for @ema I think this mostly came up as a consequence of trying to map out the data in T150256#3271004 using lldpcli to confirm port connections. That led to an in-depth conversation about how our racks and rows and switches and vlans are set up and how and where redundancy matters, which led to him looking at our caches and their port/rack/row mapping (also in racktables) and how they're not laid out very ideally in eqiad. Basically these are just questions spawned from exploration.

May 17 2017, 5:40 PM · Traffic, netops, Operations
BBlack added a comment to T165618: Audit / document reasons for not enabling HT?.

I think that almost universally, HT is a win for the host as a whole. There's always more things going on than there are cpu cores. If nothing else, picture it in your head as "puppet agent and stats outputting stuff can run in that extra headroom without impacting the important stuff" or whatever.

May 17 2017, 5:28 PM · Operations
BBlack added a comment to T165618: Audit / document reasons for not enabling HT?.

Edited the top part, re-ran excluding virtuals.

May 17 2017, 5:12 PM · Operations
BBlack updated the task description for T165618: Audit / document reasons for not enabling HT?.
May 17 2017, 5:12 PM · Operations
BBlack created T165618: Audit / document reasons for not enabling HT?.
May 17 2017, 5:01 PM · Operations
BBlack added a comment to T165614: LLDP on cache hosts.

So, a couple points:

May 17 2017, 4:51 PM · Traffic, netops, Operations
BBlack added a comment to T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+).

On the reboot issue: I've tested cp4021 and the existing puppetization works fine on reboot (even given the other stuff below).

May 17 2017, 4:46 PM · Patch-For-Review, Performance-Team, Traffic, Operations
BBlack added a comment to T163674: Frequent RST returned by appservers to LVS hosts.

I wonder if Chrome (which is the dominant browser now, not MSIE as indicated in that nginx source comment) sends the close notify?

May 17 2017, 4:21 PM · Pybal, Traffic, Operations, netops
BBlack added a comment to T150256: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006.

Also notable: lvs1009 and lvs1012 connections to row B (eth2) are using 1GbE ports rather than 10GbE?

May 17 2017, 1:35 PM · Patch-For-Review, Traffic, netops, Operations
BBlack added a comment to T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+).

Also while I'm thinking about it - we should validate that the sysctl setting for fq as default qdisc "sticks" on reboot and isn't affected by some kind of ordering race...

May 17 2017, 12:49 PM · Patch-For-Review, Performance-Team, Traffic, Operations
BBlack added a comment to T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+).

There's not a lot of good data on how BBR behaves in datacenter-like networks (high bandwidth, low latency, low loss, etc). It's not really the use case it was designed for, and the reports from others have been mixed. It probably won't turn out completely awful or anything, but I don't know if it would actually fix the port saturation problem or not.

May 17 2017, 12:48 PM · Patch-For-Review, Performance-Team, Traffic, Operations

May 16 2017

BBlack added a comment to T150256: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006.

Re: ethernet port validation / config, the last table we had in the old ticket is here: T104458#1788478 . The idea was to try our best to ensure that a given vlan/row's LVS connections are FPC-redundant between the primaries and secondaries. e.g. if lvs1007 (primary for high-traffic1) connects to row C in asw-c-eqiad FPC 5, then lvs1010 (secondary for high-traffic1) needs to connect to row C / asw-c-eqiad in some FPC other than 5. Row D connections have probably changed entirely since that last table was made, and there were some pending moves/fixups listed there as well which may or may not have already happened.
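
The FPC-redundancy rule described above can be expressed as a simple check. The sketch below is an assumption-laden illustration: the dict layout (row name mapped to FPC number) and the function name are invented, and the port values are made up apart from the "row C, FPC 5" example from the text.

```python
def fpc_conflicts(primary_ports, secondary_ports):
    """Return the rows where primary and secondary LVS land on the same FPC.

    Each argument maps a row name (e.g. "C") to the FPC number of the switch
    port that host uses in that row. An empty result means the pair is
    FPC-redundant for every row they share.
    """
    return sorted(
        row
        for row, fpc in primary_ports.items()
        if secondary_ports.get(row) == fpc
    )


# Hypothetical data: lvs1007 (primary) on row C FPC 5, as in the example above.
lvs1007 = {"C": 5, "D": 2}
lvs1010_bad = {"C": 5, "D": 4}   # same FPC in row C: not redundant
lvs1010_ok = {"C": 3, "D": 4}    # different FPC in every shared row

print(fpc_conflicts(lvs1007, lvs1010_bad))  # rows that need a port move
print(fpc_conflicts(lvs1007, lvs1010_ok))
```

Running this against real port data (for example, as collected with lldpcli) would flag any row where a single FPC failure could take out both the primary and the secondary.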

May 16 2017, 1:53 PM · Patch-For-Review, Traffic, netops, Operations

May 15 2017

BBlack created T165252: cp1053 possible hardware issues.
May 15 2017, 1:28 AM · ops-eqiad, Traffic, Operations