BBlack (Brandon Black)
WMF Operations Engineer

Projects (6)

User Details

User Since: Nov 4 2014, 4:29 PM (193 w, 2 d)
Availability: Available
IRC Nick: bblack
LDAP User: BBlack
MediaWiki User: BBlack (WMF) [ Global Accounts ]

Recent Activity

Today

BBlack added a comment to T199816: Sunset Watchmouse's status.wikimedia.org.

IIRC, there's a bunch of crazy config supporting it on the same rackspace server that hosts wikitech-static. Some hacky stuff I threw together to proxy from watchmouse because they didn't support HTTPS. So, we should probably ssh over to there and kill that stuff, too, after the patches above are done.

Thu, Jul 19, 1:14 PM · monitoring, Patch-For-Review, Operations

Yesterday

BBlack added a comment to T198922: Setup wikimediafoundation.org domain for July 30 launch of new site.

I guess me!

Wed, Jul 18, 6:04 PM · Patch-For-Review, Traffic, Operations
BBlack added a comment to T198922: Setup wikimediafoundation.org domain for July 30 launch of new site.

Thanks! Since that's also the IP they use for policy.wikimedia.org, we can at least have some confidence in the basics of the TLS config, from our auditing on that other hostname.

Wed, Jul 18, 4:47 PM · Patch-For-Review, Traffic, Operations
BBlack added a comment to T188776: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches.

I should mention a few other technical issues that have been crossing my mind as we try to wade through this:

Wed, Jul 18, 1:13 PM · MW-1.32-release-notes (WMF-deploy-2018-07-24 (1.32.0-wmf.14)), Patch-For-Review, Wiki-Setup (Rename), Release-Engineering-Team, Traffic, DNS, Operations, WMF-Communications
BBlack added a comment to T188776: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches.

Sorry if the above was not clear - the corp site's primary URL should NOT be redirecting to the old site's new URL.

So to be clear, wikimediafoundation.org should NOT redirect to foundation.wikimedia.org.

Wed, Jul 18, 12:58 PM · MW-1.32-release-notes (WMF-deploy-2018-07-24 (1.32.0-wmf.14)), Patch-For-Review, Wiki-Setup (Rename), Release-Engineering-Team, Traffic, DNS, Operations, WMF-Communications

Tue, Jul 17

BBlack added a comment to T188776: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches.

Human status update, amidst the flurry of gerritbot notes:

Tue, Jul 17, 7:40 PM · MW-1.32-release-notes (WMF-deploy-2018-07-24 (1.32.0-wmf.14)), Patch-For-Review, Wiki-Setup (Rename), Release-Engineering-Team, Traffic, DNS, Operations, WMF-Communications
BBlack added a comment to T188776: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches.

So, trying to sum a bit of process on the above as I see it:

Tue, Jul 17, 2:50 PM · MW-1.32-release-notes (WMF-deploy-2018-07-24 (1.32.0-wmf.14)), Patch-For-Review, Wiki-Setup (Rename), Release-Engineering-Team, Traffic, DNS, Operations, WMF-Communications
BBlack added a comment to T188776: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches.

I don't think redirects are really an acceptable solution for the policy pages, as we'd still be sending users through a third party under different policies in order to read the policies for the site they're using... The links need to actually be changed on the production wikis (e.g. to foundation.wikimedia.org). This should happen in time for cached copies of the links to expire out (a week would be ideal), which implies the foundation.wikimedia.org site needs to be working and canonical for itself by the end of this week.

Tue, Jul 17, 2:32 PM · MW-1.32-release-notes (WMF-deploy-2018-07-24 (1.32.0-wmf.14)), Patch-For-Review, Wiki-Setup (Rename), Release-Engineering-Team, Traffic, DNS, Operations, WMF-Communications
BBlack added a comment to T198922: Setup wikimediafoundation.org domain for July 30 launch of new site.

The key thing we're missing here on our end, by the actual transition date, is an IP address from Automattic to put in our DNS for this domain.

Tue, Jul 17, 12:51 PM · Patch-For-Review, Traffic, Operations
BBlack added a comment to T188776: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches.

It's pretty unclear to me (perhaps I'm failing at reading!) exactly what needs to happen here at various levels of detail. What's clear is "move the foundation wiki from wikimediafoundation.org to foundation.wikimedia.org", but what's unclear even in the abstract is what date that should take effect (ASAP?) and whether there's intended to be any transition period (redirects?).

Tue, Jul 17, 12:22 PM · MW-1.32-release-notes (WMF-deploy-2018-07-24 (1.32.0-wmf.14)), Patch-For-Review, Wiki-Setup (Rename), Release-Engineering-Team, Traffic, DNS, Operations, WMF-Communications

Mon, Jul 16

BBlack added a comment to T198922: Setup wikimediafoundation.org domain for July 30 launch of new site.

Certs: yes, they should use Letsencrypt, which we'll authorize via CAA records in our DNS.
HSTS: yes, with a 1-year lifetime and preloading enabled. This and other HTTPS policy details are covered (at least lightly to basic minimums) here, if you want to point Automattic at it: https://wikitech.wikimedia.org/wiki/HTTPS#For_all_public-facing_HTTP[S]_sites_and_services_under_Wikimedia_control
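
For illustration only, here is a minimal Python sketch of what those two settings could look like once rendered. The helper names are hypothetical, and `includeSubDomains` is an assumption that is not stated in the comment above (the browser preload list requires it); the real values should come from the linked policy page.

```python
# Hypothetical helpers; the values follow the comment above
# (Let's Encrypt authorized via CAA, HSTS for one year with preloading).
ONE_YEAR_SECONDS = 365 * 24 * 60 * 60  # 31536000


def hsts_header(max_age=ONE_YEAR_SECONDS, include_subdomains=True, preload=True):
    """Build a Strict-Transport-Security header value."""
    parts = [f"max-age={max_age}"]
    if include_subdomains:  # assumption: needed for preload-list eligibility
        parts.append("includeSubDomains")
    if preload:
        parts.append("preload")
    return "; ".join(parts)


def caa_record(domain, issuer="letsencrypt.org"):
    """Render a zone-file style CAA record authorizing a single issuing CA."""
    return f'{domain}. IN CAA 0 issue "{issuer}"'


print(hsts_header())
print(caa_record("wikimediafoundation.org"))
```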

Mon, Jul 16, 6:18 PM · Patch-For-Review, Traffic, Operations
BBlack moved T199720: Deploy initial ATS test clusters in core DCs from Triage to Caching on the Traffic board.
Mon, Jul 16, 6:14 PM · Traffic, Operations
BBlack moved T199711: Deploy a scalable service for ACME (LetsEncrypt) certificate management from Triage to TLS on the Traffic board.
Mon, Jul 16, 6:14 PM · Traffic, Operations, Goal
BBlack moved T199717: Pick up a suitable ACME library for certcentral from Triage to TLS on the Traffic board.
Mon, Jul 16, 6:14 PM · Patch-For-Review, Traffic, Operations
BBlack added a parent task for T96853: Evaluate Apache Traffic Server: T199720: Deploy initial ATS test clusters in core DCs.
Mon, Jul 16, 3:55 PM · Operations, Traffic
BBlack added a subtask for T199720: Deploy initial ATS test clusters in core DCs: T96853: Evaluate Apache Traffic Server.
Mon, Jul 16, 3:55 PM · Traffic, Operations
BBlack created T199720: Deploy initial ATS test clusters in core DCs.
Mon, Jul 16, 3:54 PM · Traffic, Operations
BBlack raised the priority of T187157: cp5006 unresponsive from Normal to High.

Turning priority to "high" for this and the 5001 ticket (T199675), as between the two of them they leave the upload@eqsin at its design limit of 4 reliable nodes.

Mon, Jul 16, 3:35 PM · Traffic, Operations, ops-eqsin
BBlack raised the priority of T199675: cp5001 unreachable since 2018-07-14 17:49:21 from Normal to High.

Turning priority to "high" for this and the 5006 ticket, as between the two of them they leave the upload@eqsin at its design limit of 4 reliable nodes.

Mon, Jul 16, 3:33 PM · Operations, ops-eqsin, Traffic

Wed, Jul 11

BBlack added a comment to T199219: WDQS should use internal endpoint to communicate to Wikidata.

It's a complicated topic I think, on our end. There are ways to make it work today, but when I try to write down generic steps any internal service could take to talk to any other (esp MW or RB), it bogs down in complications that are probably less than ideal in various language/platform contexts.

Wed, Jul 11, 6:41 PM · Wikidata, Wikidata-Query-Service

Tue, Jul 10

BBlack added a comment to T199252: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest.

Was the "temporary" JS redirect a 301 perhaps?

Tue, Jul 10, 7:42 PM · Performance-Team, Operations, Traffic, Wikimedia-General-or-Unknown
BBlack added a comment to T199252: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest.

wgCacheEpoch is probably about the parser cache, which is separate from Traffic's Varnish caching. Either one could be an issue here, or both.

Tue, Jul 10, 5:11 PM · Performance-Team, Operations, Traffic, Wikimedia-General-or-Unknown
BBlack added a comment to T190941: Our HTTPS private keys seem to be obtainable under non-US surveillance laws, creating a worldwide MITM attack vector.

+1, I don't think there's anything too sensitive in here, although I might have chosen different wording in places if this were a highly visible blog post or something :)

Tue, Jul 10, 1:13 PM · Security

Mon, Jul 9

BBlack added a comment to T199146: "Blocked" response when trying to access constraintsrdf action from production host.

This raises some questions that are probably unrelated to the problem at hand, but might affect things indirectly:

  • Why is an internal service (wdqs) querying a public endpoint? It should probably use private internal endpoints like appservers.svc or api.svc, but there may be arguments about desirability of [Varnish] caching. This is something we're grappling with in general in the longer-term (trying to understand and/or eliminate private internal service<->service traffic routing through the public edge unnecessarily).
  • Why is it using webproxy to access it? It should be able to reach www.wikidata.org without any kind of proxy.
Mon, Jul 9, 7:45 PM · Wikidata-Campsite, MW-1.32-release-notes (WMF-deploy-2018-07-10 (1.32.0-wmf.12)), Patch-For-Review, Wikibase-Quality-Constraints, Wikidata, Wikidata-Query-Service
BBlack assigned T187157: cp5006 unresponsive to RobH.
Mon, Jul 9, 4:02 PM · Traffic, Operations, ops-eqsin

Thu, Jun 28

BBlack added a comment to T192559: Establish timeline and methodology for upcoming deprecation of non-forward-secret ciphers and TLSv1.0.

Going a bit beyond the explicit scope of this ticket, there are really a few different legacy-support risks we'd like to get rid of in our TLS configuration ASAP (but ASAP might be a year or more out in some cases!), so it helps to think about them all in concert. Roughly in expected order of removal based on expected feasibility:

Thu, Jun 28, 1:04 PM · Traffic, Operations, Goal

Jun 14 2018

BBlack added a comment to T196371: Provide a multi-language user-faced warning regarding AES128-SHA deprecation.

We've got some overlapping timelines on these long-form posts :)

Jun 14 2018, 3:03 PM · User-notice, User-Johan, Operations, Traffic
BBlack added a comment to T196371: Provide a multi-language user-faced warning regarding AES128-SHA deprecation.

Isn't there a way for the wiki server to autodetect those browsers that are still using the legacy TLS implementation and add some flag whose value can be used to conditionally display the warning?

Jun 14 2018, 1:51 PM · User-notice, User-Johan, Operations, Traffic

Jun 12 2018

BBlack moved T196946: switch port configuration for lvs200[7-10] from Triage to LoadBalancer on the Traffic board.
Jun 12 2018, 4:46 PM · netops, ops-codfw, Traffic, Operations

Jun 11 2018

BBlack moved T196432: Configure interface damping on primary links from Triage to Network on the Traffic board.
Jun 11 2018, 5:35 PM · Operations, Traffic, netops
BBlack moved T196493: rack/setup/install dns200[12].wikimedia.org from Triage to Hardware on the Traffic board.
Jun 11 2018, 5:35 PM · Traffic, DNS, Operations
BBlack moved T196560: rack/setup/install LVS200[7-10] from Triage to Hardware on the Traffic board.
Jun 11 2018, 5:35 PM · Patch-For-Review, ops-codfw, Traffic, Operations
BBlack moved T196691: rack/setup/install dns100[12].wikimedia.org from Triage to Hardware on the Traffic board.
Jun 11 2018, 5:35 PM · DNS, ops-eqiad, Traffic, Operations
BBlack moved T196693: rack/setup/install authdns1001.wikimedia.org from Triage to Hardware on the Traffic board.
Jun 11 2018, 5:35 PM · Traffic, DNS, ops-eqiad, Operations

Jun 6 2018

BBlack added a comment to T196560: rack/setup/install LVS200[7-10].

This plan looks good, thanks! We should be able to do the decoms for cable re-use you're recommending as well. We might need to leave lvs2007 for last after the others are brought online, to make it easier by first switching one of lvs200[1-3]'s work over to one of the new LVSes.

Jun 6 2018, 3:12 PM · Patch-For-Review, ops-codfw, Traffic, Operations

Jun 5 2018

BBlack added a comment to T196371: Provide a multi-language user-faced warning regarding AES128-SHA deprecation.

We faced this issue last time and went with "Wikipedia" as well, because:

Jun 5 2018, 11:12 PM · User-notice, User-Johan, Operations, Traffic
BBlack added a comment to T196371: Provide a multi-language user-faced warning regarding AES128-SHA deprecation.

Yes, but on a shorter timescale. The 3DES one, in retrospect, dragged on a bit longer than it had to, and in this case (a) We're starting at a lower percentage of real end-user UAs affected (b) there's a timing correlation with the PCI-DSS cutoff disabling most of these same devices for a lot of commercial sites as well, for slightly-different reasons. We're looking at ~6 weeks total process: 3 weeks ramping out the message percentage, 3 weeks of total blockage with the message, then disabling at the protocol level.

Jun 5 2018, 7:29 PM · User-notice, User-Johan, Operations, Traffic

Jun 4 2018

BBlack updated the task description for T195923: rack/setup/install cp1075-cp1090.
Jun 4 2018, 10:21 PM · Patch-For-Review, ops-eqiad, Traffic, Operations
BBlack added a comment to T196248: TLS certificates renewal process.

As for the rest, especially with the one-offs using LetsEncrypt scripting today, we definitely don't have this kind of resiliency, or any kind of deployment lead time built in. We should at least fix the deployment lead-time clock skew problem in the new solution somehow (delay deployment by N days after issuance, unless it's a brand-new cert with only a self-signed/nothing preceding it for replacement).
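
A rough sketch of that rule in Python, purely as an illustration: the function name, its parameters, and the one-day default are hypothetical stand-ins for the unspecified "N days", not part of any existing tooling.

```python
from datetime import datetime, timedelta, timezone


def ready_to_deploy(issued_at, now, replaces_real_cert, min_age=timedelta(days=1)):
    """Decide whether a freshly issued certificate may be deployed yet.

    A cert that replaces a real, already-deployed cert is held back until it
    is at least `min_age` old, so clients with skewed clocks do not see it as
    not-yet-valid. A brand-new cert (nothing, or only a self-signed
    placeholder, precedes it) is deployed immediately.
    """
    if not replaces_real_cert:
        return True
    return now - issued_at >= min_age


issued = datetime(2018, 6, 4, 20, 30, tzinfo=timezone.utc)
print(ready_to_deploy(issued, issued + timedelta(hours=2), replaces_real_cert=True))
print(ready_to_deploy(issued, issued + timedelta(hours=2), replaces_real_cert=False))
```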

Jun 4 2018, 8:30 PM · Performance-Team (Radar), HTTPS, Traffic, Operations
BBlack added a comment to T196248: TLS certificates renewal process.

Speaking for the big unified certs we get from commercial vendors: we generally do wait ~24h (usually longer?) between the issue date of new major certs and their deployment, but it's more unspoken general best practices than a documented policy.

Jun 4 2018, 8:25 PM · Performance-Team (Radar), HTTPS, Traffic, Operations
elukey awarded T164609: Merge cache_misc into cache_text functionally a Love token.
Jun 4 2018, 4:33 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T91820: Create HTTP verb and sticky cookie DC routing in VCL.

Well, a potential lesser goal that involves fewer moving parts would just be to loadbalance non-sessioned readonly requests (basically cache misses for GET/HEADs with no session token) across the sides, where we basically don't care if content is stale in the replication sense, and leave POST-like methods and all sessioned traffic master-only.
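
To make the split concrete, here is a small Python sketch of that decision. It is not the VCL that would actually be deployed; `choose_backend`, its arguments, and the datacenter names used in the example are illustrative assumptions.

```python
import random

# Methods treated as read-only for routing purposes, per the comment above.
READ_ONLY_METHODS = {"GET", "HEAD"}


def choose_backend(method, has_session, datacenters, primary, pick=random.choice):
    """Pick the application datacenter for a cache miss.

    Non-sessioned read-only requests may go to any datacenter, since slightly
    stale content is acceptable there. POST-like methods and all sessioned
    traffic stay on the primary (master) side.
    """
    if method.upper() in READ_ONLY_METHODS and not has_session:
        return pick(datacenters)
    return primary


dcs = ["eqiad", "codfw"]
print(choose_backend("POST", False, dcs, primary="eqiad"))  # always the primary
print(choose_backend("GET", True, dcs, primary="eqiad"))    # sessioned: primary
print(choose_backend("GET", False, dcs, primary="eqiad"))   # either side
```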

Jun 4 2018, 4:21 PM · Services (watching), Wikimania-Hackathon-2018, Availability (MediaWiki-MultiDC), Operations, Traffic
BBlack added a comment to T196371: Provide a multi-language user-faced warning regarding AES128-SHA deprecation.

English grammar nits: it would be forward secret ciphers (meaning "ciphers which have the property of forward secrecy"). But these terms "forward secret" and "cipher" belong to fairly small-scope technical fields, so they probably don't translate easily (if at all, in some cases). With English being the normal language of technical matters on the Internet (for better or worse!), it may make sense to split this up into a simpler translatable message, followed by a more-accurate English-only explanation to be placed at the bottom that's intended for the more-technical (in some cases, what will get copied to some forum or tech support person to ask what it's about).

Jun 4 2018, 2:38 PM · User-notice, User-Johan, Operations, Traffic
BBlack removed projects from T196368: Wikimedia Hungary's website should use HTTPS: Operations, Wikimedia-Site-requests, HTTPS, Traffic.

Correct, neither the domain itself (Registration, DNS) nor the hosting of any resources for it, belongs to the WMF. http://www.domain.hu/domain/domainsearch/ can give details to look up the owners.

Jun 4 2018, 1:34 PM

May 31 2018

BBlack added a comment to T181569: rack/setup scs-eqsin.mgmt.eqsin.wmnet.

Last update was just missing console for atlas, as a non-blocker. I think this has probably been in a closeable state for a while?

May 31 2018, 5:33 PM · Traffic, Operations, ops-eqsin

May 25 2018

BBlack added a comment to T194962: Create and deploy a centralized letsencrypt service.

Anyway, as part of my initial code I made the "oh, it's not issued yet, let's use a self-signed cert so we can get a web server up and start responding to challenges" code on the client end - obviously this is a problem for using the puppet file protocol. Is it okay to have the central API just send a self-signed cert until it has a proper one? How should we handle this case?

May 25 2018, 3:01 AM · Patch-For-Review, Wikimedia-Hackathon-2018, Traffic, Operations

May 24 2018

BBlack added a comment to T195365: cp intermittent IPsec MTU issue.

Raising the MTU above standard everywhere is indeed another can of worms and out of scope here.
With careful testing, raising it on some hosts (with well identified flows) might be more doable, especially for internal flows where we expect the MSS to be respected on TCP (/sessions based), and UDP (/similar) to be below 1500 by convention (and sometimes configurable).
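
As a reminder of the arithmetic behind the MSS point, the sketch below derives the TCP MSS from an interface MTU using the minimum IPv4, IPv6, and TCP header sizes. The `tunnel_overhead` parameter and the 60-byte figure in the example are placeholders, since real IPsec overhead depends on mode and ciphers and is not given in this thread.

```python
IPV4_HEADER = 20  # bytes, without options
IPV6_HEADER = 40  # bytes, fixed header only
TCP_HEADER = 20   # bytes, without options


def tcp_mss(mtu, ipv6=False, tunnel_overhead=0):
    """Largest TCP payload that fits in one packet for a given MTU.

    `tunnel_overhead` is whatever encapsulation (for example IPsec) adds per
    packet; it is a caller-supplied assumption, not a measured value.
    """
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - tunnel_overhead - ip_header - TCP_HEADER


print(tcp_mss(1500))                      # 1460
print(tcp_mss(1500, ipv6=True))           # 1440
print(tcp_mss(1500, tunnel_overhead=60))  # 1400, with a hypothetical overhead
```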

May 24 2018, 11:13 AM · netops, Traffic, Operations

May 23 2018

BBlack added a comment to T194962: Create and deploy a centralized letsencrypt service.

So if we implement this API, how are we going to point clients at it? Seeing as it won't be a puppetmaster...

May 23 2018, 2:59 PM · Patch-For-Review, Wikimedia-Hackathon-2018, Traffic, Operations
BBlack added a comment to T195365: cp intermittent IPsec MTU issue.

I don't have complete thoughts, but keep in mind in general it's complicated to go changing our actual host interface MTUs to anything larger than 1500 ("jumbo frames"), for a few reasons:

May 23 2018, 2:21 PM · netops, Traffic, Operations

May 19 2018

BBlack added a comment to T91820: Create HTTP verb and sticky cookie DC routing in VCL.

Right, I forgot, that was discussed as an optimization (vs having ChronologyProtector just timeout -> stale invisibly in a case where we may have some operational issue causing persistent, large replication lag).

May 19 2018, 3:57 PM · Services (watching), Wikimania-Hackathon-2018, Availability (MediaWiki-MultiDC), Operations, Traffic
BBlack added a project to T91820: Create HTTP verb and sticky cookie DC routing in VCL: Wikimania-Hackathon-2018.
May 19 2018, 3:39 PM · Services (watching), Wikimania-Hackathon-2018, Availability (MediaWiki-MultiDC), Operations, Traffic
BBlack added a comment to T91820: Create HTTP verb and sticky cookie DC routing in VCL.

Right. Just to re-state for clarity, the sort of logic we should be implementing in VCL (in the cache layers) will look like this pseudo-code:

May 19 2018, 3:38 PM · Services (watching), Wikimania-Hackathon-2018, Availability (MediaWiki-MultiDC), Operations, Traffic
BBlack closed T127485: Enable VCL applayer datacenter-switch via confd as Declined.

At this point, we're pushing off commit-free DC switching to post-ATS (sometime during the latter part of next FY, probably).

May 19 2018, 3:29 PM · codfw-rollout, Traffic, Operations
BBlack added a comment to T194962: Create and deploy a centralized letsencrypt service.

Also, naming bikeshedding stuff: we should name/implement this as a generic ACME tool rather than LE-specific, and just make LE be the default issuing CA.

May 19 2018, 3:25 PM · Patch-For-Review, Wikimedia-Hackathon-2018, Traffic, Operations
BBlack added a comment to T194031: Setup a new PKI software as an alternative to the puppet CA for managing services certificates.

@Joe - So we're looking at doing something just for the LetsEncrypt (ACME) use-case over in T194962. The idea is this will manage puppetized issue/renewal/distribution of public/private keypairs from LetsEncrypt's ACME CA and support all the LE use-cases we have (e.g. same cert on N endpoint hosts, etc). I think it will work well and basically-solve all our issues for public-facing certs well enough. We can/should add revocation support as well (using the saved privkeys to revoke->reissue chosen certs via ACME, under admin command, if we think a privkey was compromised), but that might come later.

May 19 2018, 3:23 PM · Traffic, Operations
BBlack added a comment to T194962: Create and deploy a centralized letsencrypt service.
  • We should assume by default we want all certificates to be dual-issued as ECDSA+RSA variants and served to clients in both forms (I think this basically requires doing the same thing twice over with different private key types).
  • We should look at what attributes we can/should set in the request to get any optional goodies by default, like embedding SCTs for transparency.
May 19 2018, 3:19 PM · Patch-For-Review, Wikimedia-Hackathon-2018, Traffic, Operations
BBlack added a comment to T194962: Create and deploy a centralized letsencrypt service.

Some after-thoughts on design issues and such (I haven't looked at any code!):

May 19 2018, 3:15 PM · Patch-For-Review, Wikimedia-Hackathon-2018, Traffic, Operations

May 18 2018

BBlack triaged T194965: gdnsd plugin support for ACME DNS challenges as Normal priority.
May 18 2018, 3:59 PM · Traffic, Operations
BBlack added a comment to T133548: Create a secure redirect service for large count of non-canonical / junk domains.

Note added parent above, T194962. We're going to try to develop the generic solution for LE first, and then make the secredir service a client of the central LE service.

May 18 2018, 3:57 PM · Patch-For-Review, HTTPS, Operations, Traffic
BBlack added a parent task for T133548: Create a secure redirect service for large count of non-canonical / junk domains: T194962: Create and deploy a centralized letsencrypt service.
May 18 2018, 3:56 PM · Patch-For-Review, HTTPS, Operations, Traffic
BBlack added a subtask for T194962: Create and deploy a centralized letsencrypt service: T133548: Create a secure redirect service for large count of non-canonical / junk domains.
May 18 2018, 3:56 PM · Patch-For-Review, Wikimedia-Hackathon-2018, Traffic, Operations
BBlack triaged T194962: Create and deploy a centralized letsencrypt service as Normal priority.
May 18 2018, 3:55 PM · Patch-For-Review, Wikimedia-Hackathon-2018, Traffic, Operations

May 15 2018

BBlack closed T189252: Define turn-up process and scope for eqsin service to regional countries as Resolved.

Closing this as well, we're through the basic turn-up process. Trailing work on network engineering and Zero is tracked elsewhere.

May 15 2018, 4:11 PM · Patch-For-Review, Performance-Team (Radar), Traffic, Operations
BBlack closed T189252: Define turn-up process and scope for eqsin service to regional countries, a subtask of T156026: Enable Service in Asia Cache DC, as Resolved.
May 15 2018, 4:11 PM · Operations, Traffic
BBlack closed T156026: Enable Service in Asia Cache DC as Resolved.

Closing this (a bit late), as service has been online for a while now. Trailing remaining tasks re: Zero and/or further network engineering aren't really a blocker for saying the site is online :)

May 15 2018, 4:10 PM · Operations, Traffic
BBlack closed T194766: https://tendril.wikimedia.org/ IPv6 doesn't work as Resolved.

Works now, thanks!

May 15 2018, 3:39 PM · Operations, DBA
BBlack triaged T194766: https://tendril.wikimedia.org/ IPv6 doesn't work as Normal priority.
May 15 2018, 2:35 PM · Operations, DBA
BBlack reassigned T182993: TLS security review of the Kafka stack from Ottomata to Vgutierrez.

Done :)

May 15 2018, 2:22 PM · Patch-For-Review, Traffic, User-Elukey, Analytics-Kanban, Operations, Analytics-Cluster
BBlack added a comment to T188561: SSL cert for links.email.wikimedia.org.

Sorry, I missed the above scan link earlier. The "weak DH" issue isn't mentioned explicitly on our policy page, but is definitely an issue. I think maybe the policy page wording is defective, but basically if ssllabs.com isn't showing an A+, it's a fail. I'll amend it a bit to be more explicit about that.

May 15 2018, 1:29 PM · Operations, Traffic, fundraising-tech-ops, Fundraising-Backlog

May 14 2018

BBlack added a comment to T182993: TLS security review of the Kafka stack.

I think this ended up being an Analytics Q4 goal? It's not on our goals list, but we agreed to allot some time to it in this Q to do our part (which is most of it, confusingly!), and have discussed it internally.

May 14 2018, 8:25 PM · Patch-For-Review, Traffic, User-Elukey, Analytics-Kanban, Operations, Analytics-Cluster
BBlack added a comment to T194380: Identify bots using AES128-SHA maintainers running on toolforge.

It's more likely that DotNetWikiBot just needs to be built against a newer .NET version, or needs .NET configuration tweaks, to support better encryption (or perhaps already is capable and some bots are behind on releases of it or some of its dependencies). I'm not familiar enough with DotNetWikiBot or .NET in general to know, though.

May 14 2018, 1:15 PM · Operations, Traffic

May 8 2018

BBlack added a comment to T184677: Measure impact of Singapore data center on Wikimedia usage.

Singapore itself, for nonsensical reasons related to the wild world of network peering, doesn't tend to be our best comparison point anyways, even though it's the first one we turned on. Probably our best bet would be to look at the aggregate data for all of the involved countries marked "Done" in T189252.

May 8 2018, 4:27 PM · Product-Analytics, Discovery-Analysis (Current work)

May 5 2018

BBlack added a comment to T184293: rack/setup/install lvs101[3-6].

Yeah, lvs200x are all HPs as well, so they're "different" in many respects for better or worse, and getting replaced this quarter with something more-like lvs1016.

May 5 2018, 10:59 AM · Patch-For-Review, ops-eqiad, Operations, Traffic
BBlack added a comment to T184293: rack/setup/install lvs101[3-6].

I don't think the firmware versions you see there in ethtool and/or dmesg are the whole story anyways. The ones visible from bios-level setup have like 6 different inter-related version numbers for different things, and even the "main" firmware version number was very different between the two cards. We should probably have @Cmjohnson flash them with latest (or at least, consistent) firmwares.

May 5 2018, 10:41 AM · Patch-For-Review, ops-eqiad, Operations, Traffic

May 4 2018

BBlack added a comment to T184293: rack/setup/install lvs101[3-6].

The key to the ethtool difference is this in the lspci stuff:
Capabilities: [a0] MSI-X: Enable+ Count=17 Masked-
vs
Capabilities: [a0] MSI-X: Enable+ Count=32 Masked-
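
If this comparison needs repeating across hosts, a small Python sketch like the one below could pull the MSI-X vector count out of each capability line. The regex and function name are illustrative and only cover the line format quoted above.

```python
import re

# Matches capability lines such as:
#   Capabilities: [a0] MSI-X: Enable+ Count=17 Masked-
MSIX_RE = re.compile(r"MSI-X: Enable([+-]) Count=(\d+)")


def msix_info(lspci_line):
    """Return (enabled, vector_count) for an MSI-X capability line, else None."""
    match = MSIX_RE.search(lspci_line)
    if match is None:
        return None
    return match.group(1) == "+", int(match.group(2))


for line in (
    "Capabilities: [a0] MSI-X: Enable+ Count=17 Masked-",
    "Capabilities: [a0] MSI-X: Enable+ Count=32 Masked-",
):
    print(msix_info(line))
```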

May 4 2018, 8:11 PM · Patch-For-Review, ops-eqiad, Operations, Traffic
BBlack updated the task description for T147199: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support).
May 4 2018, 6:05 PM · Browser-Support-Internet-Explorer, User-notice, Patch-For-Review, Operations, Traffic

May 3 2018

BBlack added a comment to T184293: rack/setup/install lvs101[3-6].

[I still bet if you undo the already-done cable swap, and then switch the two cards' cables (leaving port1/2 ordering the same), this will all magically come out right]

May 3 2018, 5:08 PM · Patch-For-Review, ops-eqiad, Operations, Traffic
BBlack added a comment to T184293: rack/setup/install lvs101[3-6].

I don't think it was a flip of the two ports on the same card that was needed, but instead switching all the cables between the two cards (order of cards, not order of ports in each card).

May 3 2018, 4:44 PM · Patch-For-Review, ops-eqiad, Operations, Traffic

May 2 2018

BBlack added a comment to T192206: Remove wildcard vhost for *.wikimedia.org.

@Jgreen yeah if you don't have any special purpose for them, then they're basically the same as the text-lb ones (we like having those in DNS so we don't have to remember IPs when doing certain kinds of manual debugging against specific sites, etc, but they're not functionally-used in any public way).

May 2 2018, 4:19 PM · Patch-For-Review, Operations, Wikimedia-Apache-configuration, Traffic
BBlack updated subscribers of T192206: Remove wildcard vhost for *.wikimedia.org.

FWIW on my end, the following hostnames are definitely non-functional:

text-lb
text-lb.codfw
text-lb.eqiad
text-lb.eqsin
text-lb.esams
text-lb.ulsfo
May 2 2018, 1:40 PM · Patch-For-Review, Operations, Wikimedia-Apache-configuration, Traffic

May 1 2018

BBlack assigned T193521: Consider adding expect-CT: header to enforce certificate transparency to Vgutierrez.
May 1 2018, 7:04 PM · Traffic, Operations

Apr 26 2018

BBlack added a comment to T189552: Rack/cable/configure ulsfo MX204.

Note this will involve a planned ulsfo site outage, with its traffic falling back to codfw. If things go well the outage should be brief, the 5h estimate above is worst-case with complications. We should avoid taking other significant risks while this is ongoing (esp anything re: codfw-vs-eqiad redundancy, or risks to eqsin).

Apr 26 2018, 4:10 PM · Patch-For-Review, Operations, ops-ulsfo, netops, Traffic

Apr 23 2018

BBlack added a comment to T192551: atop on stretch overloading a host.

+1, probably the conservative approach for now would be to have puppet disable the systemd unit and remove the cron.daily file (on jessie as well? seems a waste if we expect it's not used).

Apr 23 2018, 4:06 PM · Upstream, Patch-For-Review, monitoring, Operations
BBlack added a comment to T184293: rack/setup/install lvs101[3-6].

Are we good for handoff to Traffic for OS-level install/config now on lvs1016?

Apr 23 2018, 4:00 PM · Patch-For-Review, ops-eqiad, Operations, Traffic
BBlack edited P7029 hostname sanity in generic VCL.
Apr 23 2018, 1:52 PM · Traffic
BBlack created P7029 hostname sanity in generic VCL.
Apr 23 2018, 1:51 PM · Traffic

Apr 19 2018

BBlack added a comment to T192610: prometheus on bast3002 misbehaving.

Crash recovery appears to have completed at about 23:10:33 and things came back online. We'll see if it remains stable. Leaving the downtimes in place to avoid more spamming of IRC.

Apr 19 2018, 11:13 PM · Patch-For-Review, User-fgiunchedi, Operations, monitoring
BBlack added a comment to T192610: prometheus on bast3002 misbehaving.

[also, I've downtimed all the esams-specific prometheus-based alerts in icinga for 24h now (varnish child-counting checks and pybal bgp checks)]

Apr 19 2018, 11:09 PM · Patch-For-Review, User-fgiunchedi, Operations, monitoring
BBlack added a comment to T192610: prometheus on bast3002 misbehaving.

It seems to be having problems coming up cleanly too, so more spam. First chunk of startup logs:

Apr 19 2018, 11:08 PM · Patch-For-Review, User-fgiunchedi, Operations, monitoring
BBlack added a comment to T192610: prometheus on bast3002 misbehaving.

And that was followed by this, by the time it finally stopped itself ~5 minutes later:

Apr 19 22:55:47 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:47Z" level=info msg="Done checkpointing in-memory metrics and chunks in 3
Apr 19 22:55:47 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:47Z" level=info msg="Checkpointing fingerprint mappings..." source="persi
Apr 19 22:55:48 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:48Z" level=info msg="Done checkpointing fingerprint mappings in 690.40272
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 05d932c8af845af9 (met
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 92fa66fb4cf5b4ac (met
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint b3bc5c659c5701f2 (met
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 67cc3bbc9996f80a (met
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 6c00f4c49f4d7f44 (met
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=error msg="The storage is now inconsistent. Restart Prometheus
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 4b2269a7a2bc0903 (met
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 77b10b7d10ad866b (met
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 3f98cf3aa324de5b (met
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=error msg="Error unarchiving fingerprint 31f74eca61e77fc7 (met
Apr 19 22:55:49 bast3002 prometheus@ops[9406]: time="2018-04-19T22:55:49Z" level=info msg="Local storage stopped." source="storage.go:484"
Apr 19 22:55:50 bast3002 systemd[1]: Stopped prometheus server (instance ops).
-- Subject: Unit prometheus@ops.service has finished shutting down
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit prometheus@ops.service has finished shutting down.
Apr 19 2018, 10:57 PM · Patch-For-Review, User-fgiunchedi, Operations, monitoring
BBlack added a comment to T192610: prometheus on bast3002 misbehaving.

It did keep spamming by the time I got done writing the above. Attempting to stop it now, but the basic daemon "stop" operation via systemctl is taking quite a long time (over 3 minutes and counting, even though parts of it have logged that they're gone):

Apr 19 2018, 10:55 PM · Patch-For-Review, User-fgiunchedi, Operations, monitoring
BBlack added a project to T192610: prometheus on bast3002 misbehaving: Operations.
Apr 19 2018, 10:49 PM · Patch-For-Review, User-fgiunchedi, Operations, monitoring
BBlack triaged T192610: prometheus on bast3002 misbehaving as High priority.
Apr 19 2018, 10:49 PM · Patch-For-Review, User-fgiunchedi, Operations, monitoring
BBlack added a comment to T184144: Investigation: Who Wrote That revision search tool.

My random $0.02 as a bystander:

Apr 19 2018, 3:51 PM · Community-Tech
BBlack added a comment to T192551: atop on stretch overloading a host.

When I look at our LVS hosts (which are mixed jessie+stretch currently), the jessie ones show atop processes like:

root     26337     1  0 00:00 ?        00:00:04 /usr/bin/atop -a -w /var/log/atop/atop_20180419 600
Apr 19 2018, 2:14 PM · Upstream, Patch-For-Review, monitoring, Operations
BBlack added a comment to T191996: db1114 connection issues.

For the record, the irq for eno1 is balanced across CPUs, so I don't think it is the bottleneck here:

root@db1114:/srv/tmp# for i in `cat /proc/interrupts | grep eno1 | awk -F ":" '{print $1}'`; do cat /proc/irq/$i/smp_affinity_list; done
0,2,4,6,8,10,12,14
0,2,4,6,8,10,12,14
0,2,4,6,8,10,12,14
0,2,4,6,8,10,12,14
0,2,4,6,8,10,12,14
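
The same check can be wrapped in a small function that labels each affinity list with its IRQ number. This is only a sketch of the one-liner above, not something taken from the task: the function name `irq_affinity` and the `PROC_ROOT` override are hypothetical, and it assumes the usual `/proc/interrupts` layout in which the first column is the IRQ number followed by a colon.

```shell
#!/bin/sh
# Print "<irq>: <cpu list>" for every IRQ whose /proc/interrupts line
# mentions the given interface name.
# PROC_ROOT is a hypothetical override (default: /proc) so the function
# can be pointed at a copy of the files instead of the live system.
irq_affinity() {
    iface="$1"
    proc_root="${PROC_ROOT:-/proc}"
    # Field 1 of each line is "<irq>:"; keep matching lines and strip the colon.
    awk -v pat="$iface" '$0 ~ pat { sub(":", "", $1); print $1 }' \
        "$proc_root/interrupts" |
    while read -r irq; do
        printf '%s: %s\n' "$irq" "$(cat "$proc_root/irq/$irq/smp_affinity_list")"
    done
}
```

Calling `irq_affinity eno1` on the host would list one line per eno1 queue, which makes it easier to see whether any single IRQ is pinned differently from the others.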
Apr 19 2018, 12:40 PM · ops-eqiad, Patch-For-Review, netops, Operations, DBA

Apr 13 2018

BBlack added a comment to T187014: Proxies information gone from Zero portal. Opera mini pageviews geolocating to wrong country.

Yeah, ema and I discussed this after the meeting the other day. I'm not sure whether or how we can look into the history on the Zero side, but I don't see Zero at this point having a need or desire to restore that data through the previous infrastructure or add any new data, so we're planning to just stop pulling that empty data from them, and replace it with a private file that's puppet-managed/deployed by the SRE team and includes the newly-acquired OperaMini data.

Apr 13 2018, 3:26 PM · Zero, Patch-For-Review, Analytics-Data-Quality, Analytics-Kanban, Operations, Traffic, Analytics, Readers-Web-Backlog (Tracking), Mobile, New-Readers

Apr 10 2018

BBlack closed T191229: cp2022 memory replacement as Resolved.

all green in icinga now and repooled, closing!

Apr 10 2018, 6:38 PM · Traffic, ops-codfw, Operations
BBlack closed T191229: cp2022 memory replacement, a subtask of T190540: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error, as Resolved.
Apr 10 2018, 6:38 PM · Traffic, ops-codfw, Operations

Apr 9 2018

BBlack closed T191227: cp2017 memory replacement as Resolved.

cp2017 repooled into service

Apr 9 2018, 7:58 PM · Traffic, ops-codfw, Operations