BBlack (Brandon Black)
WMF Operations Engineer

User Details

User Since
Nov 4 2014, 4:29 PM (154 w, 3 d)
Availability
Available
IRC Nick
bblack
LDAP User
BBlack
MediaWiki User
BBlack (WMF)

Recent Activity

Today

BBlack added a project to T178567: Server error (500) while trying to download files from Commons from PAWS: media-storage.
Fri, Oct 20, 8:06 PM · Patch-For-Review, media-storage, Traffic, Operations, Pywikibot-Commons, PAWS
BBlack added a comment to T178567: Server error (500) while trying to download files from Commons from PAWS.

So, I did some varnishlog tracing on the frontend @zhuyifei1999 was hitting with a reproduction of this. I caught one of the 500s there, and the relevant headers looked like:

[...]
-   ReqMethod      GET
-   ReqURL         /wikipedia/commons/1/16/Constitui%C3%A7%C3%A3o_da_Rep%C3%BAblica_dos_Estados_Unidos_do_Brasil_de_1937_p._34.jpg
-   ReqProtocol    HTTP/1.1
-   ReqHeader      Connection: close
-   ReqHeader      Host: upload.wikimedia.org
[...]
-   RespProtocol   HTTP/1.1
-   RespStatus     500
-   RespReason     Internal Error
-   RespHeader     Content-Type: text/plain
-   RespHeader     X-Trans-Id: tx5f7ada82081944279b1bb-0059ea4e96
-   RespHeader     Date: Fri, 20 Oct 2017 19:29:26 GMT
-   RespHeader     Content-Encoding: gzip
-   RespHeader     Vary: Accept-Encoding
-   RespHeader     X-Cache: cp1064 miss, cp1074 pass
[...]
Fri, Oct 20, 8:04 PM · Patch-For-Review, media-storage, Traffic, Operations, Pywikibot-Commons, PAWS
BBlack added a comment to T178567: Server error (500) while trying to download files from Commons from PAWS.

Most likely the error is just inconsistent over the time domain at the backend (swift -> (MW || Thumbor)). If Varnish manages to see a 200 happen it will cache it, but it won't cache the 500 errors.
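The caching behavior described above can be illustrated with a toy model. This is only a sketch, not Varnish's actual logic: the `StatusAwareCache` class and the `make_flaky_backend` helper are hypothetical names invented for illustration, and the backend is simulated as a fixed sequence of status codes.

```python
from typing import Callable, Dict, List, Tuple

Response = Tuple[int, bytes]  # (status code, body)


class StatusAwareCache:
    """Toy cache that stores 200 responses and never stores errors."""

    def __init__(self, fetch: Callable[[str], Response]) -> None:
        self._fetch = fetch  # hypothetical backend fetch function
        self._store: Dict[str, Response] = {}

    def get(self, url: str) -> Response:
        if url in self._store:
            return self._store[url]  # cache hit: backend is not consulted
        status, body = self._fetch(url)
        if status == 200:
            self._store[url] = (status, body)  # only successes are cached
        return status, body


def make_flaky_backend(statuses: List[int]) -> Callable[[str], Response]:
    """Return a fetch function that yields the given statuses in order."""
    remaining = list(statuses)

    def fetch(url: str) -> Response:
        return remaining.pop(0), b"payload"

    return fetch


# Backend that fails twice and then succeeds once (assumed for the demo).
cache = StatusAwareCache(make_flaky_backend([500, 500, 200]))
results = [cache.get("/some/file.jpg")[0] for _ in range(5)]
print(results)
```

Once a single 200 gets through, later requests are served from the cache, which is why the same URL can look broken for one client and fine for another depending on timing.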

Fri, Oct 20, 5:46 PM · Patch-For-Review, media-storage, Traffic, Operations, Pywikibot-Commons, PAWS

Yesterday

BBlack added a comment to T167840: Merge AS14907 with AS43821.

+1 LGTM!

Thu, Oct 19, 10:33 PM · Performance-Team (Radar), Performance-Team-notice, Patch-For-Review, Operations, netops
BBlack closed T176386: upload@ulsfo strange ethernet / power / switch issues, etc... as Resolved.

Nothing really to do here, except remember it if new power issues arise with the new hosts...

Thu, Oct 19, 6:25 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T156256: Allocate address space for Singapore (APNIC).

We can do revdns and basic puppet address space commits here or in T156027 as appropriate I think (maybe most of the puppet-level stuff over there). One thing it would be nice to sort out early is the public LVS subnets to forward to zero. If we follow the examples of other DCs, they'd be:

Thu, Oct 19, 4:13 PM · Patch-For-Review, Traffic, Operations

Wed, Oct 18

BBlack merged task T178443: wmfusercontent.org SSL cert expires 2017-11-22 into T178173: Renew unified certificates 2017.
Wed, Oct 18, 3:33 AM · Phabricator, procurement, Traffic, HTTPS, Operations
BBlack merged task T178444: *.planet.wikimedia.org SSL cert expires 2017-11-22 into T178173: Renew unified certificates 2017.
Wed, Oct 18, 3:33 AM · Traffic, Wikimedia-Planet, procurement, Operations
BBlack merged tasks T178444: *.planet.wikimedia.org SSL cert expires 2017-11-22, T178443: wmfusercontent.org SSL cert expires 2017-11-22 into T178173: Renew unified certificates 2017.
Wed, Oct 18, 3:33 AM · Operations, Traffic

Tue, Oct 17

BBlack created P6142 acme_tiny diffs.
Tue, Oct 17, 9:26 PM

Mon, Oct 16

BBlack moved T178173: Renew unified certificates 2017 from Triage to TLS on the Traffic board.
Mon, Oct 16, 4:50 PM · Operations, Traffic

Sun, Oct 15

BBlack added a comment to T102367: Migrate tools.wmflabs.org to https only (and set HSTS).

Be careful with preload. Its only purpose is to signal to the Chromium list maintainers that it's ok to preload you (making HSTS more-or-less permanent), and once you're emitting it anyone can submit your domain for preload inclusion.

Sun, Oct 15, 11:31 PM · Operations, Traffic, Cloud-Services, HTTPS, Toolforge

Fri, Oct 13

BBlack created T178173: Renew unified certificates 2017.
Fri, Oct 13, 3:06 PM · Operations, Traffic

Thu, Oct 12

BBlack added a comment to T168529: Upgrade to Varnish 5.

So, thinking ahead past cache_misc and assuming that's successful, probably the next target should be cache_upload (lower complexity than text, and in more need of potential eviction improvements). With both of the other clusters, but especially upload, we'll face some load issues with the vslp->shard shift as we work through a DC (and the backend misses to remote DCs). With the current MISS2PASS behaviors this is somewhat minimized for cross-DC, though. Probably a sane-ish plan for the bigger clusters is to work from the bottom up first in the codfw direction, and then in the eqiad direction (so: codfw, ulsfo, eqiad, esams). Within each DC, we'll probably want to move through the nodes as quickly as reasonably allowed by load to avoid the worst of the effective backend storage reductions due to cross-chashing.

Thu, Oct 12, 1:30 PM · Patch-For-Review, Performance-Team (Radar), Operations, Traffic

Wed, Oct 11

BBlack added a comment to T178011: cp4026 memory error.

This will self-depool if you do a clean shutdown from software. We just need to verify + repool manually afterwards.

Wed, Oct 11, 10:39 PM · Traffic, Operations, ops-ulsfo
BBlack added a comment to T167299: Upgrade BIOS/RBSU/etc on lvs1007.

Got arrow keys working in Ctrl-S (thanks @fgiunchedi !) by re-setting the local terminal. There is no "HP Shared Memory Features" prompt in the current NIC firmware to disable. Went ahead and disabled SR-IOV on both cards and tried another netboot, still fails to bring up the interface in the stretch installer as before.

Wed, Oct 11, 4:27 PM · ops-eqiad, Traffic, netops, Operations
BBlack added a comment to T177961: Upgrade LVS servers to stretch.

One significant thing to keep in mind is the interface naming changes. We'll be going from e.g. eth[0-3] to something like eno[1-2], ens1f[0-1], and we'll have to work that into all the magic that's configuring our per-vlan interfaces and doing all the custom network setup stuff, etc. Some of it could be easy to miss (breaks "silently", at least to some level of testing/scrutiny).
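One way to reduce the risk of silent breakage is to scan configuration text for leftover legacy names before the upgrade. The following is a minimal sketch under stated assumptions: `find_legacy_interfaces` is a hypothetical helper, and the sample configuration is made up for illustration rather than taken from any real LVS host.

```python
import re

# Matches legacy kernel-style names such as eth0 or eth3, but not words
# like "ethtool" (a digit must follow "eth").
LEGACY_IFACE = re.compile(r"\beth[0-9]+\b")


def find_legacy_interfaces(text):
    """Return (line_number, interface_name) pairs for legacy ethN names."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in LEGACY_IFACE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


# Made-up configuration fragment, for illustration only.
sample = """\
auto eth0
iface eth0 inet static
up ip link add link eno1 name vlan100 type vlan id 100
post-up ethtool -K eth1 gro off
"""

hits = find_legacy_interfaces(sample)
for lineno, name in hits:
    print(f"line {lineno}: {name}")
```

A scan like this only catches literal names; interface names assembled at runtime (for example by string concatenation in a template) would still need manual review.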

Wed, Oct 11, 4:27 PM · Patch-For-Review, Traffic, Operations, Pybal
BBlack added a comment to T167299: Upgrade BIOS/RBSU/etc on lvs1007.

I figured as a next minimal testing step on lvs1009, I should just go into the ethernet firmware (Ctrl+S) and try disabling SR-IOV and/or HP Shared Memory Features, without any upgrades, if possible. However, the firmware/bios levels currently on this host don't work well enough with VSP to do that remotely (it doesn't process arrow keys correctly, you can only drill into the first item of each menu and toggle whatever's there...).

Wed, Oct 11, 2:30 PM · ops-eqiad, Traffic, netops, Operations
BBlack added a comment to T167299: Upgrade BIOS/RBSU/etc on lvs1007.

I gave in and tried a stretch network install on lvs1009 for comparison. I didn't make any bios/firmware changes there, just used RBSU console to onetimeboot netdev1, power reset, vsp and watched the installer. Hardware DHCP->PXE worked and loaded the installer, but the installer failed on network stuff (like jessie). Installer dmesg shows multiple driver panics on bnx2x. I just copied the end of one crashdump and start of the next for the metadata here:

Wed, Oct 11, 2:17 PM · ops-eqiad, Traffic, netops, Operations
BBlack added a comment to T167299: Upgrade BIOS/RBSU/etc on lvs1007.

Still says 101-I/O ROM Error twice on every boot attempt, new NIC card has older firmware. PXE boot still doesn't work (tried setting Boot Strap Type to int19h in the ethernet card's firmware menu for eth0 as well just in case it was some BBS-specific failure, no dice).

Wed, Oct 11, 1:38 PM · ops-eqiad, Traffic, netops, Operations

Tue, Oct 10

jcrespo awarded T174932: Recurrent 'mailbox lag' critical alerts and 500s a Like token.
Tue, Oct 10, 3:59 PM · Patch-For-Review, Operations, Traffic
BBlack closed T175803: Text eqiad varnish 503 spikes as Resolved.

^ The above seems to have resolved the esams-specific 503s. Closing this up!

Tue, Oct 10, 3:56 PM · Patch-For-Review, Traffic, Operations
BBlack closed T174932: Recurrent 'mailbox lag' critical alerts and 500s as Resolved.
Tue, Oct 10, 3:56 PM · Patch-For-Review, Operations, Traffic
BBlack closed T174932: Recurrent 'mailbox lag' critical alerts and 500s, a subtask of T145661: varnish backends start returning 503s after ~6 days uptime, as Resolved.
Tue, Oct 10, 3:56 PM · Patch-For-Review, Operations, Traffic
BBlack closed T145661: varnish backends start returning 503s after ~6 days uptime as Resolved.

The cache admission policy change seems to have gotten us over this for now. We should probably wait for the Varnish5 upgrade ( T168529 ) before trying to undo any of the other related work (e.g. weekly restart crons or storage splitting).

Tue, Oct 10, 3:55 PM · Patch-For-Review, Operations, Traffic
BBlack moved T177228: Multiple systems in esams OE10 showing PSU failures from Triage to Caching on the Traffic board.
Tue, Oct 10, 3:47 PM · Traffic, ops-esams, DC-Ops, Operations
BBlack added a comment to T177742: Investigate Chrony as a replacement for ISC ntpd.

We could run one of our NTP servers based on Chrony parallel to the existing ones to see whether it meets our needs (which are fairly limited in terms of NTP features)

Tue, Oct 10, 1:06 PM · Operations

Mon, Oct 9

Liuxinyu970226 awarded T147199: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) a Doubloon token.
Mon, Oct 9, 2:32 PM · User-notice, Patch-For-Review, Operations, Traffic

Fri, Oct 6

BBlack added a comment to T168529: Upgrade to Varnish 5.

We're moving to 5.1.3 with this upgrade. 5.2.0 is a little too bleeding-edge for now :)

Fri, Oct 6, 3:01 PM · Patch-For-Review, Performance-Team (Radar), Operations, Traffic
BBlack added a comment to T171881: CL support for Wikipedia Zero piracy problems.

@kaldari - I'm not sure what "reasonable levels" is, but T173710#3646384 was showing commons queue backlogs in the low millions as recently as a week ago, and there still seem to be unresolved questions there about how to address the overall event rate. I've re-run those queries myself just now:

bblack@terbium:~$ /usr/local/bin/foreachwikiindblist /srv/mediawiki/dblists/group1.dblist showJobs.php --group | awk '{if ($3 > 10000) print $_}'
commonswiki:  refreshLinks: 1628957 queued; 1762 claimed (4 active, 1758 abandoned); 0 delayed
commonswiki:  htmlCacheUpdate: 1482335 queued; 722 claimed (1 active, 721 abandoned); 0 delayed
[...? I stopped waiting here]
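For reference, the awk filter above (print lines whose third field exceeds 10000) could be sketched in Python roughly as follows. The `large_queues` function is a hypothetical helper, the line format is inferred only from the two output lines shown, and the third sample line is made up to show the threshold at work.

```python
import re

# Format inferred from the output above: "<wiki>:  <jobtype>: <N> queued; ..."
LINE = re.compile(r"^(?P<wiki>\S+):\s+(?P<job>\S+):\s+(?P<queued>\d+) queued;")


def large_queues(lines, threshold=10000):
    """Return (wiki, job type, queued count) for queues above the threshold."""
    result = []
    for line in lines:
        m = LINE.match(line)
        if m and int(m.group("queued")) > threshold:
            result.append((m.group("wiki"), m.group("job"), int(m.group("queued"))))
    return result


sample_output = [
    "commonswiki:  refreshLinks: 1628957 queued; 1762 claimed (4 active, 1758 abandoned); 0 delayed",
    "commonswiki:  htmlCacheUpdate: 1482335 queued; 722 claimed (1 active, 721 abandoned); 0 delayed",
    # Made-up line below the threshold, for illustration only.
    "examplewiki:  refreshLinks: 12 queued; 0 claimed (0 active, 0 abandoned); 0 delayed",
]

backlog = large_queues(sample_output)
for wiki, job, queued in backlog:
    print(f"{wiki} {job}: {queued}")
```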
Fri, Oct 6, 2:59 PM · Community-Liaisons (Oct-Dec 2017), Zero

Wed, Oct 4

BBlack lowered the priority of T175803: Text eqiad varnish 503 spikes from High to Normal.
Wed, Oct 4, 10:47 PM · Patch-For-Review, Traffic, Operations
BBlack added a comment to T175803: Text eqiad varnish 503 spikes.

There's still some overlap and/or confusion between the 503 issues in this ticket, T174932 and T145661, and there are still some lesser recurrent 503s in esams that we don't have a good explanation for, yet. We're still looking at those, and kind of stalling on resolving the related tickets until we're sure it's all sorted out.

Wed, Oct 4, 10:46 PM · Patch-For-Review, Traffic, Operations
BBlack added a project to T177228: Multiple systems in esams OE10 showing PSU failures: Traffic.
Wed, Oct 4, 3:06 PM · Traffic, ops-esams, DC-Ops, Operations
BBlack created T177403: esams rack OE10 power redundancy issues? (cp3030-9).
Wed, Oct 4, 2:38 PM · Traffic, ops-esams, Operations

Tue, Oct 3

BBlack moved T177233: Upgrade cache_misc to Varnish 5 from Triage to Caching on the Traffic board.
Tue, Oct 3, 2:39 PM · Patch-For-Review, Performance-Team (Radar), Traffic, Operations
BBlack merged task T148983: cp3021 failed disk sdb into T130883: decom cp3011-22 (12 machines).
Tue, Oct 3, 2:13 PM · ops-esams, Operations, Traffic
BBlack merged T148983: cp3021 failed disk sdb into T130883: decom cp3011-22 (12 machines).
Tue, Oct 3, 2:13 PM · ops-esams, Operations, hardware-requests
BBlack closed T166758: cp3032 ethernet link down (bnx2x dump in the dmesg) as Resolved.

Hasn't recurred AFAIK. Note this is similar to bnx2x dmesg we managed to induce on a bunch of upload@ulsfo machines via bad NUMA tuning. It's probably not a hardware issue.

Tue, Oct 3, 2:13 PM · Operations, Traffic

Mon, Oct 2

BBlack added a comment to T168529: Upgrade to Varnish 5.

Definitely not looking at Hitch presently. Just swapping out Varnish4 for Varnish5 in the existing software stack for both the frontend and backend cache processes. Forward-looking, some of our next steps in this area might be to replace the varnish backend processes and/or nginx with ATS, but even if we make quick progress on both fronts, we'll likely still have a Varnish frontend cache daemon in the middle for quite some time in the future, which is enough to justify not falling behind on the Varnish revs.

Mon, Oct 2, 6:56 PM · Patch-For-Review, Performance-Team (Radar), Operations, Traffic
BBlack created T177233: Upgrade cache_misc to Varnish 5.
Mon, Oct 2, 4:23 PM · Patch-For-Review, Performance-Team (Radar), Traffic, Operations
BBlack added a comment to T168529: Upgrade to Varnish 5.

This task hasn't been updated for various IRC/Hangouts discussions since. We did decide to move forward with V5 upgrades. Arzhel has built preliminary packages, and we have a goal this quarter to upgrade at least one production cache cluster to V5.

Mon, Oct 2, 4:23 PM · Patch-For-Review, Performance-Team (Radar), Operations, Traffic
BBlack lowered the priority of T175636: prometheus -> grafana stats for per-numa-node meminfo from Normal to Low.
Mon, Oct 2, 3:52 PM · Patch-For-Review, monitoring, Traffic, Operations

Fri, Sep 29

BBlack added a comment to T176366: Decom cp4005-8,13-16 (8 nodes).

@RobH - these are good to go for decom now. They're still booted, but have been depooled, removed from confd/lvs/etc, re-roled in puppet to spare::system, and all their cp-specific daemons stopped.

Fri, Sep 29, 8:05 PM · Patch-For-Review, hardware-requests, ops-ulsfo, Operations, Traffic
BBlack reassigned T176366: Decom cp4005-8,13-16 (8 nodes) from BBlack to RobH.
Fri, Sep 29, 8:04 PM · Patch-For-Review, hardware-requests, ops-ulsfo, Operations, Traffic
BBlack added a comment to T61115: Implement RPKI (Resource Public Key Infrastructure).

RFC 8205 (BGPSec) got published this week, which will use RPKI to secure against bad route announcements by signing UPDATE messages - https://tools.ietf.org/html/rfc8205

Fri, Sep 29, 7:08 PM · Operations, netops
BBlack added a comment to T48947: Vector: Horizontal nav elements should be flipped with CSS instead of in HTML.

We can probably do something like that, but it's not generally simple. The meaning of "generated" might need clarification. Generated as in "Varnish fetched it from MW", or generated as in "When the parsercache entry was created"?

Fri, Sep 29, 11:49 AM · Patch-For-Review, MW-1.31-release-notes (WMF-deploy-2017-10-03 (1.31.0-wmf.2)), Readers-Web-Backlog (Tracking), Technical-Debt (RW-Tech-Debt), Vector

Thu, Sep 28

BBlack added a comment to T174891: cp4024 kernel errors.

Oh sorry, my comment was redundant to your edit :)

Thu, Sep 28, 4:10 PM · ops-ulsfo, Operations, Traffic
BBlack added a comment to T174891: cp4024 kernel errors.

Dell.com has a page on these here: http://www.dell.com/support/manuals/us/en/19/poweredge-vrtx/servers_tsg/psaepsa-diagnostics-error-codes?guid=guid-9afeed67-a47c-4afd-83d8-04301ebf3523&lang=en-us

Thu, Sep 28, 3:54 PM · ops-ulsfo, Operations, Traffic

Wed, Sep 27

BBlack added a comment to T174891: cp4024 kernel errors.

@RobH any updates here on diags?

Wed, Sep 27, 3:26 PM · ops-ulsfo, Operations, Traffic

Tue, Sep 26

BBlack closed T156028: Name Asia Cache DC site as Resolved.

eqsin is the site name (Vendor: Equinix, Airport: SIN )

Tue, Sep 26, 4:24 PM · Operations, Traffic
BBlack closed T156028: Name Asia Cache DC site, a subtask of T156027: Configuration for Asia Cache DC hosts, as Resolved.
Tue, Sep 26, 4:24 PM · Operations, Traffic
BBlack closed T156028: Name Asia Cache DC site, a subtask of T156031: Turn up network links for Asia Cache DC, as Resolved.
Tue, Sep 26, 4:24 PM · Operations, Traffic
BBlack closed T156028: Name Asia Cache DC site, a subtask of T162684: Network hardware configuration for Asia Cache DC, as Resolved.
Tue, Sep 26, 4:24 PM · Operations, Traffic
BBlack closed T156030: Select site vendor for Asia Cache Datacenter as Resolved.

Equinix was selected by the process, and we've negotiated and signed and sent in the specific order at this point.

Tue, Sep 26, 4:23 PM · Traffic, Operations
BBlack closed T156030: Select site vendor for Asia Cache Datacenter, a subtask of T156028: Name Asia Cache DC site, as Resolved.
Tue, Sep 26, 4:23 PM · Operations, Traffic
BBlack closed T156030: Select site vendor for Asia Cache Datacenter, a subtask of T162683: Network hardware purchasing for Asia Cache DC, as Resolved.
Tue, Sep 26, 4:23 PM · Operations, Traffic
BBlack closed T156030: Select site vendor for Asia Cache Datacenter, a subtask of T156033: Server hardware purchasing for Asia Cache DC, as Resolved.
Tue, Sep 26, 4:23 PM · Operations, Traffic

Fri, Sep 22

BBlack updated the task description for T128559: store.wikimedia.org HTTPS issues.
Fri, Sep 22, 1:09 PM · Operations, Traffic, Wikimedia-Shop, HTTPS
BBlack added a comment to T128559: store.wikimedia.org HTTPS issues.

Thanks for the updates! Even a 90d HSTS without the preload/includeSub flags is better than nothing. If we can get the time extended out to 1y that's even better. Of the two missing attributes, preload is the more important. I suspect Shopify will be getting increasing pressure about all of these things from customers over time, so hopefully the situation will continue to improve.
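To make the options concrete, here is a small sketch of how the `Strict-Transport-Security` header value differs between the 90-day case and a 1-year value with both flags. The `hsts_header` function is a hypothetical helper written for illustration; it is not anything Shopify or Wikimedia is known to use.

```python
SECONDS_PER_DAY = 86400


def hsts_header(max_age_days, include_subdomains=False, preload=False):
    """Build a Strict-Transport-Security header value."""
    parts = [f"max-age={max_age_days * SECONDS_PER_DAY}"]
    if include_subdomains:
        parts.append("includeSubDomains")
    if preload:
        parts.append("preload")
    return "; ".join(parts)


# 90 days, no flags (the "better than nothing" case).
minimal = hsts_header(90)
# 1 year with both attributes set.
full = hsts_header(365, include_subdomains=True, preload=True)

print(minimal)
print(full)
```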

Fri, Sep 22, 1:07 PM · Operations, Traffic, Wikimedia-Shop, HTTPS

Thu, Sep 21

BBlack renamed T176386: upload@ulsfo strange ethernet / power / switch issues, etc... from cp4026 strange ethernet issue to upload@ulsfo strange ethernet / power / switch issues, etc....
Thu, Sep 21, 4:34 AM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T176386: upload@ulsfo strange ethernet / power / switch issues, etc....

Recoveries of whatever the hell is happening in ulsfo:

04:26 <+icinga-wm> RECOVERY - Juniper alarms on asw-ulsfo is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms
04:28 <+icinga-wm> RECOVERY - Host cp4007 is UP: PING OK - Packet loss = 0%, RTA = 78.60 ms
04:30 <+icinga-wm> RECOVERY - Host ripe-atlas-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 78.69 ms
04:30 <+icinga-wm> RECOVERY - Host ripe-atlas-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 78.67 ms
04:30 <+icinga-wm> RECOVERY - Host cp4007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.19 ms
Thu, Sep 21, 4:32 AM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T176386: upload@ulsfo strange ethernet / power / switch issues, etc....

... and now we've lost the cr1-eqiad <-> cr1-codfw link ... ?

cr1-eqiad xe-4/2/0: down -> Core: cr1-codfw:xe-5/2/1
Thu, Sep 21, 4:13 AM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T176386: upload@ulsfo strange ethernet / power / switch issues, etc....

asw-ulsfo has some other alerts going on, aside from the expected link loss to various flapping or supposedly-down hosts, e.g.:

Thu, Sep 21, 4:06 AM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T176386: upload@ulsfo strange ethernet / power / switch issues, etc....

So, the same basic issue appears to have happened for almost all of upload@ulsfo (cp402[12356]) at about the same time. cp4021 was the lone exception. cp402[78] in text@ulsfo unaffected. Inbound network traffic to all the upload@ulsfo nodes was ramping up to unusual values ahead of the netdev watchdog -> link issues -> meltdown. It's possible this was caused by an actual external traffic burst?

Thu, Sep 21, 4:05 AM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T176386: upload@ulsfo strange ethernet / power / switch issues, etc....

Actually, seeing the same on several cp402x. Depooling ulsfo, maybe switch issue?

Thu, Sep 21, 3:43 AM · Patch-For-Review, Operations, Traffic
BBlack created T176386: upload@ulsfo strange ethernet / power / switch issues, etc....
Thu, Sep 21, 3:35 AM · Patch-For-Review, Operations, Traffic

Sep 20 2017

BBlack added a comment to T176366: Decom cp4005-8,13-16 (8 nodes).

[just pre-creating the task, we're not quite ready to take action yet. These systems are now depooled, but we'll wait a few days before un-configuring in case a reason to repool them arises...]

Sep 20 2017, 9:15 PM · Patch-For-Review, hardware-requests, ops-ulsfo, Operations, Traffic
BBlack created T176366: Decom cp4005-8,13-16 (8 nodes).
Sep 20 2017, 9:14 PM · Patch-For-Review, hardware-requests, ops-ulsfo, Operations, Traffic

Sep 19 2017

BBlack added a comment to T175319: cp1066 unexplained 503 spikes.

Going to repool this today on the assumption it was genuinely part of T175803

Sep 19 2017, 9:21 PM · Traffic, Operations
BBlack added a comment to T167299: Upgrade BIOS/RBSU/etc on lvs1007.

I think @Cmjohnson said before that they're at different revs because they're different pieces of hardware (onboard vs card), and those are the latest revs for each, respectively.

Sep 19 2017, 4:38 PM · ops-eqiad, Traffic, netops, Operations
BBlack added a comment to T167299: Upgrade BIOS/RBSU/etc on lvs1007.

I don't know, I hadn't tried re-enabling the memory sharing stuff. All I really know is the sequence of events last week was approximately:

Sep 19 2017, 3:15 PM · ops-eqiad, Traffic, netops, Operations

Sep 18 2017

BBlack added a comment to T167299: Upgrade BIOS/RBSU/etc on lvs1007.

I did the NIC card bios check last week when I first found the PXE booting problem. It is enabled there. My guess is either something else in BIOS settings got changed that affects this, or somehow disabling "HP Shared Memory Feature" with the new NIC firmware also kills the PXE functionality indirectly (which would be pretty awful, and might mean we have no solution but to move to stretch installs).

Sep 18 2017, 6:34 PM · ops-eqiad, Traffic, netops, Operations
BBlack merged T175319: cp1066 unexplained 503 spikes into T175803: Text eqiad varnish 503 spikes.
Sep 18 2017, 4:07 PM · Patch-For-Review, Traffic, Operations
BBlack merged task T175319: cp1066 unexplained 503 spikes into T175803: Text eqiad varnish 503 spikes.
Sep 18 2017, 4:07 PM · Traffic, Operations
ema awarded T175636: prometheus -> grafana stats for per-numa-node meminfo a Goat token.
Sep 18 2017, 3:38 PM · Patch-For-Review, monitoring, Traffic, Operations
BBlack added a comment to T175636: prometheus -> grafana stats for per-numa-node meminfo.

yeah, put it somewhere useful in grafana :)

Sep 18 2017, 3:38 PM · Patch-For-Review, monitoring, Traffic, Operations
BBlack added a comment to T25932: Enable, whitelist, and incorporate semantic HTML5 elements.

Re: T147199 , you'll probably want to re-evaluate UA stats after Nov 17 to see what the true final fallout is. It *should* significantly reduce the population of certain ancient UAs (notably, IE7-8/XP), but there are a few ways such UAs can stick around as well, and we can't readily predict what the numbers will look like:

Sep 18 2017, 3:09 PM · Epic, Accessibility, MediaWiki-Parser

Sep 14 2017

BBlack added a comment to T147202: Removing support for AES128-SHA TLS cipher.

Another note-to-self for the future: https://gerrit.wikimedia.org/r/#/c/301817/ is where we removed the fairly-similar AES128-SHA256 and AES128-GCM-SHA256, throwing all such clients into the AES128-SHA bucket discussed here. It probably doesn't make sense to re-split these as we head towards deprecation, but it's a possibility worth considering if we want to split up the impacts over time a bit.

Sep 14 2017, 3:35 PM · Operations, Traffic

Sep 13 2017

BBlack closed T170598: Extending our HSTS value beyond ~1y as Resolved.

no substantive counter-arguments in 2 months, resolving

Sep 13 2017, 12:20 PM · Operations, Traffic

Sep 12 2017

BBlack added a comment to T167299: Upgrade BIOS/RBSU/etc on lvs1007.

Also, the NIC firmware update only applied to ports 2+3, but not ports 0+1. I don't suspect NIC firmware level was a leading candidate for the fix anyways, but having the ports at different firmware levels sounds particularly problematic on its own.

Sep 12 2017, 4:44 PM · ops-eqiad, Traffic, netops, Operations
BBlack added a comment to T164768: Explicitly limit varnishd transient storage.

We're still missing caps for the upload cluster, right? (well and misc, but that case isn't all that important here). I'm a little concerned about the interplay of unbounded transient mem spikes and NUMA in the new cp4's, although I think so long as they're happening on backends we're probably fine (even better than the non-NUMA case). A big upload frontend transient spike will likely oomkill the NUMA-isolation nodes much easier than before...

Sep 12 2017, 1:36 PM · Patch-For-Review, Traffic, Operations

Sep 11 2017

BBlack created T175636: prometheus -> grafana stats for per-numa-node meminfo.
Sep 11 2017, 9:51 PM · Patch-For-Review, monitoring, Traffic, Operations
BBlack added a project to T174891: cp4024 kernel errors: ops-ulsfo.
Sep 11 2017, 4:07 PM · ops-ulsfo, Operations, Traffic
BBlack added a comment to T174891: cp4024 kernel errors.

So far, other nodes are testing ok on this front. This is likely a node-specific early hardware failure.

Sep 11 2017, 4:07 PM · ops-ulsfo, Operations, Traffic
BBlack moved T175585: cp4021 memory hardware issue - DIMM B1 from Triage to Caching on the Traffic board.
Sep 11 2017, 4:06 PM · ops-ulsfo, Operations, Traffic
BBlack added a comment to T175588: Server overloaded .. can't save (only remove or cancel).

Can you explain in more detail? Is the subject of this ticket what was shown as an error in your browser window? I doubt this is related to varnish and/or "mailbox lag".

Sep 11 2017, 3:38 PM · TestMe, Wikidata
BBlack removed a subtask for T174932: Recurrent 'mailbox lag' critical alerts and 500s: T175588: Server overloaded .. can't save (only remove or cancel).
Sep 11 2017, 3:38 PM · Patch-For-Review, Operations, Traffic
BBlack removed parent tasks for T175588: Server overloaded .. can't save (only remove or cancel): T174932: Recurrent 'mailbox lag' critical alerts and 500s, T175473: Multiple 503 Errors.
Sep 11 2017, 3:38 PM · TestMe, Wikidata
BBlack removed a subtask for T175473: Multiple 503 Errors: T175588: Server overloaded .. can't save (only remove or cancel).
Sep 11 2017, 3:38 PM · Traffic, Operations
BBlack updated the task description for T175585: cp4021 memory hardware issue - DIMM B1.
Sep 11 2017, 3:14 PM · ops-ulsfo, Operations, Traffic
BBlack created T175585: cp4021 memory hardware issue - DIMM B1.
Sep 11 2017, 3:14 PM · ops-ulsfo, Operations, Traffic

Sep 8 2017

BBlack added a comment to T163251: Communicate dropping IE8-on-XP support (a security change) to affected editors and other community members.

Thanks! So far, I haven't heard of any huge community pushback, which is awesome :)

Sep 8 2017, 4:06 PM · Community-Liaisons (Oct-Dec 2017), Patch-For-Review, User-Johan, Operations, Traffic
BBlack created T175319: cp1066 unexplained 503 spikes.
Sep 8 2017, 12:01 AM · Operations, Traffic

Sep 7 2017

BBlack added a comment to T147202: Removing support for AES128-SHA TLS cipher.

I've re-done some of the sampled/informal AES128-SHA analysis from before, since it hasn't been done in about a year, and the past results were never recorded in detail. This is informational, to remind ourselves of the scope/impact in the future, whenever we get around to planning this one. Note below that the "BlueCoat" label is a stand-in for any kind of TLS-downgrading proxy, BlueCoat just happens to be the brand name of one of the most popular ones. We can identify these requests by a few key attributes:

  • Sometimes the UA string is literally ProxySG Appliance
  • By the origin IPs belonging to BlueCoat's cloud proxy service
  • Because the browser's UA string is far more modern and shouldn't be using ancient crypto like AES128-SHA, leaving a downgrading proxy as the only explanation
  • The combination of the legacy AES128-SHA cipher with the TLSv1.2 protocol choice. Legitimate ancient UAs do not implement TLSv1.2, whereas these proxies do implement it, but intentionally chose weak/ancient ciphers.
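The attributes listed above could be combined into a rough classifier along these lines. This is a sketch under stated assumptions, not the analysis actually used: the `RequestSample` fields, the `MODERN_UA_MARKERS` tuple, and the `from_known_proxy_ip` flag (standing in for an IP-range lookup) are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical substrings treated as evidence of a modern browser.
MODERN_UA_MARKERS = ("Chrome/6", "Firefox/5")


@dataclass
class RequestSample:
    user_agent: str
    cipher: str
    tls_version: str
    from_known_proxy_ip: bool = False  # assumed result of an IP-range lookup


def downgrade_proxy_signals(sample):
    """Return the list of signals suggesting a TLS-downgrading proxy."""
    signals = []
    if sample.user_agent.strip() == "ProxySG Appliance":
        signals.append("proxysg-ua")
    if sample.from_known_proxy_ip:
        signals.append("proxy-ip")
    if sample.cipher == "AES128-SHA":
        if any(marker in sample.user_agent for marker in MODERN_UA_MARKERS):
            signals.append("modern-ua-weak-cipher")
        if sample.tls_version == "TLSv1.2":
            signals.append("tls12-weak-cipher")
    return signals


# Illustrative samples; the UA strings are abbreviated, made-up examples.
proxy = RequestSample("ProxySG Appliance", "AES128-SHA", "TLSv1.2")
modern = RequestSample("Mozilla/5.0 Chrome/61.0", "AES128-SHA", "TLSv1.2")
legacy = RequestSample("Mozilla/5.0 (Linux; U; Android 2.3.6)", "AES128-SHA", "TLSv1")

for s in (proxy, modern, legacy):
    print(s.user_agent, "->", downgrade_proxy_signals(s))
```

A request with an empty signal list is treated here as a plausibly legitimate legacy client; in practice the signals are heuristics and would need checking against real traffic samples.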
Sep 7 2017, 8:14 PM · Operations, Traffic
BBlack added a comment to T174640: Invalid "wikimedia" family in unique devices data due to misplaced WMF-Last-Access-Global cookie .

The model's a bit different in the wikimedia.org case, I'm not even sure there's a rational answer here. Can we get some clear (e.g. pseudo-code level?) guidance on what the desired behavior would be in the wikimedia.org case?

Sep 7 2017, 2:44 PM · Patch-For-Review, Analytics-Kanban, Traffic, Operations
BBlack added a comment to T140365: Lower geodns TTLs from 600 to 300.

We're probably fine on existing capacity to handle failover at 600, and even at 300. We've had authdns server outages before, and the stats are pretty simple to interpret in general. It is, of course, best to test those assumptions! :)

Sep 7 2017, 12:21 AM · Traffic, Operations

Sep 6 2017

BBlack added a subtask for T175203: Implement stateless TCP balancing in our LVS servers: T86651: Fix LVS "sh" shortcomings.
Sep 6 2017, 7:36 PM · Operations, Pybal, Traffic
BBlack added a parent task for T86651: Fix LVS "sh" shortcomings: T175203: Implement stateless TCP balancing in our LVS servers.
Sep 6 2017, 7:36 PM · Operations
BBlack triaged T175203: Implement stateless TCP balancing in our LVS servers as High priority.
Sep 6 2017, 7:36 PM · Operations, Pybal, Traffic
BBlack created T175203: Implement stateless TCP balancing in our LVS servers.
Sep 6 2017, 7:36 PM · Operations, Pybal, Traffic