BBlack (Brandon Black)
WMF Operations Engineer

User Details

User Since
Nov 4 2014, 4:29 PM (133 w, 1 d)
Availability
Available
IRC Nick
bblack
LDAP User
BBlack
MediaWiki User
BBlack (WMF)

Recent Activity

Today

BBlack added a comment to T166229: Mediawiki replies with 500 on wrongly formatted CSP report.

Agreed this should be 4xx rather than 5xx

Wed, May 24, 5:12 PM · Operations, Traffic, Security-Team, MediaWiki-General-or-Unknown

Mon, May 22

BBlack added a comment to T162850: acpi_pad issues.

acamar hit this again on Sunday, in spite of the (working) acpi_pad blacklist. A simple reboot seems to have cleared it. The next-best advice (based on that old Dell info) would be to blacklist mei. I've rmmod'd it on acamar for now to see if it causes additional issues before we try blacklisting it on all.

Mon, May 22, 6:47 PM · Patch-For-Review, Operations

Sat, May 20

BBlack added a comment to T137161: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains.

@Jgreen - re: civicrm, it needs to emit the HSTS header on all HTTPS responses.

Sat, May 20, 3:19 AM · Traffic, Operations, fundraising-tech-ops

Fri, May 19

BBlack reopened T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan, a subtask of T133821: Content purges are unreliable, as "Open".
Fri, May 19, 3:26 PM · Operations, Traffic
BBlack reopened T124418: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan as "Open".

Not resolved, as the purge graphs can attest!

Fri, May 19, 3:26 PM · Performance-Team, Wikidata, MediaWiki-Cache, MediaWiki-JobQueue, Traffic, Operations
BBlack edited the description of T165765: Refactor pybal/LVS config for shared failover.
Fri, May 19, 2:21 PM · Traffic, Operations
BBlack created T165765: Refactor pybal/LVS config for shared failover.
Fri, May 19, 2:20 PM · Traffic, Operations
BBlack created T165764: Fully-redundant LVS clusters using Pybal per-service MED feature.
Fri, May 19, 2:04 PM · Pybal, Operations, Traffic

Thu, May 18

BBlack added a comment to T163674: Frequent RST returned by appservers to LVS hosts.

So from the above, apache really has 3 different modes of operation:

Thu, May 18, 9:20 PM · Pybal, Traffic, Operations, netops
BBlack added a comment to T96852: Define 3-host infra cluster for traffic pops.

The tentative and limited plan for now is to deploy 3x misc/infra hosts (meaning all the hosts other than lvs and cp) at each cache site and not use virtualization. We might revisit this at a later date. The basic layout looks like:

Thu, May 18, 2:45 PM · Operations, Traffic
BBlack added a comment to T150256: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006.

I'm probably backtracking into territory that was once known here, but after the long delay I felt I had to go back and re-validate what's going on with the ports and the thinking.

Thu, May 18, 1:49 PM · Traffic, netops, Operations

Wed, May 17

BBlack added a comment to T165614: LLDP on cache hosts.

Answering for @ema: I think this mostly came up as a consequence of trying to map out the data in T150256#3271004 using lldpcli to confirm port connections. That led to an in-depth conversation about how our racks, rows, switches, and vlans are set up and how and where redundancy matters, which led to him looking at our caches and their port/rack/row mapping (also in racktables) and how they're not laid out very ideally in eqiad. Basically these are just questions spawned from exploration.

Wed, May 17, 5:40 PM · Traffic, Operations, netops
BBlack added a comment to T165618: Audit / document reasons for not enabling HT?.

I think that almost universally, HT is a win for the host as a whole. There are always more things going on than there are cpu cores. If nothing else, picture it in your head as "puppet agent and stats outputting stuff can run in that extra headroom without impacting the important stuff" or whatever.

Wed, May 17, 5:28 PM · Operations
BBlack added a comment to T165618: Audit / document reasons for not enabling HT?.

Edited the top part, re-ran excluding virtuals.

Wed, May 17, 5:12 PM · Operations
BBlack edited the description of T165618: Audit / document reasons for not enabling HT?.
Wed, May 17, 5:12 PM · Operations
BBlack created T165618: Audit / document reasons for not enabling HT?.
Wed, May 17, 5:01 PM · Operations
BBlack added a comment to T165614: LLDP on cache hosts.

So, a couple points:

Wed, May 17, 4:51 PM · Traffic, Operations, netops
BBlack added a comment to T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+).

On the reboot issue: I've tested cp4021 and the existing puppetization works fine on reboot (even given the other stuff below).

Wed, May 17, 4:46 PM · Patch-For-Review, Performance-Team, Traffic, Operations
BBlack added a comment to T163674: Frequent RST returned by appservers to LVS hosts.

I wonder if Chrome (which is the dominant browser now, not MSIE as indicated in that nginx source comment) sends the close notify?

Wed, May 17, 4:21 PM · Pybal, Traffic, Operations, netops
BBlack added a comment to T150256: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006.

Also notable: lvs1009 and lvs1012 connections to row B (eth2) are using 1GbE ports rather than 10GbE?

Wed, May 17, 1:35 PM · Traffic, netops, Operations
BBlack added a comment to T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+).

Also while I'm thinking about it - we should validate that the sysctl setting for fq as default qdisc "sticks" on reboot and isn't affected by some kind of ordering race...
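
A check of this kind could be run after a reboot by reading the live values back from /proc/sys. This is a minimal sketch, not the puppetized check: the function names and the `proc_root` parameter are hypothetical (the latter exists only so the check can be pointed at a test directory), and the expected value `fq` for `net.core.default_qdisc` is taken from the comment above.

```python
import os


def read_sysctl(key, proc_root="/proc/sys"):
    """Read one sysctl value, e.g. 'net.core.default_qdisc', from procfs."""
    # Dotted sysctl keys map onto path components under /proc/sys.
    path = os.path.join(proc_root, *key.split("."))
    with open(path) as f:
        return f.read().strip()


def sysctl_mismatches(expected, proc_root="/proc/sys"):
    """Return {key: actual} for every key whose live value differs.

    A key that cannot be read is reported with an actual value of None.
    """
    mismatches = {}
    for key, want in expected.items():
        try:
            got = read_sysctl(key, proc_root)
        except OSError:
            got = None
        if got != want:
            mismatches[key] = got
    return mismatches


if __name__ == "__main__":
    # Assumed desired state: fq as the default qdisc.
    print(sysctl_mismatches({"net.core.default_qdisc": "fq"}))
```

This only confirms the sysctl itself. Whether an interface that was brought up before the sysctl was applied actually got an fq qdisc would need a separate look at the interface (for example with `tc qdisc show`).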

Wed, May 17, 12:49 PM · Patch-For-Review, Performance-Team, Traffic, Operations
BBlack added a comment to T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+).

There's not a lot of good data on how BBR behaves in datacenter-like networks (high bandwidth, low latency, low loss, etc). It's not really the use case it was designed for, and the reports from others have been mixed. It probably won't turn out completely awful or anything, but I don't know if it would actually fix the port saturation problem or not.

Wed, May 17, 12:48 PM · Patch-For-Review, Performance-Team, Traffic, Operations

Tue, May 16

BBlack added a comment to T150256: Re-setup lvs1007-lvs1012, replace lvs1001-lvs1006.

Re: ethernet port validation / config, the last table we had in the old ticket is here: T104458#1788478 . The idea was to try our best to ensure that a given vlan/row's LVS connections are FPC-redundant between the primaries and secondaries. e.g. if lvs1007 (primary for high-traffic1) connects to row C in asw-c-eqiad FPC 5, then lvs1010 (secondary for high-traffic1) needs to connect to row C / asw-c-eqiad in some FPC other than 5. Row D connections have probably changed entirely since that last table was made, and there were some pending moves/fixups listed there as well which may or may not have already happened.
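
The FPC-redundancy rule described above can be expressed as a small check. This is a sketch under stated assumptions: the function name and data layout are hypothetical, and the sample data is invented for illustration apart from the lvs1007/lvs1010, row C, FPC 5 example given in the comment. It does not reflect the real port table.

```python
def fpc_conflicts(pairs, connections):
    """Find rows where a primary and its secondary share an FPC.

    pairs:       list of (primary_host, secondary_host)
    connections: dict mapping (host, row) -> FPC number on that row's switch
    Returns a list of (row, primary, secondary, fpc) tuples that violate
    the rule "the secondary must use a different FPC in the same row".
    """
    conflicts = []
    for primary, secondary in pairs:
        rows = sorted({row for (host, row) in connections if host == primary})
        for row in rows:
            primary_fpc = connections[(primary, row)]
            secondary_fpc = connections.get((secondary, row))
            if secondary_fpc is not None and secondary_fpc == primary_fpc:
                conflicts.append((row, primary, secondary, primary_fpc))
    return conflicts


if __name__ == "__main__":
    # Invented sample data: row C violates the rule, row D satisfies it.
    sample = {
        ("lvs1007", "C"): 5,
        ("lvs1010", "C"): 5,
        ("lvs1007", "D"): 2,
        ("lvs1010", "D"): 4,
    }
    for conflict in fpc_conflicts([("lvs1007", "lvs1010")], sample):
        print("not FPC-redundant:", conflict)
```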

Tue, May 16, 1:53 PM · Traffic, netops, Operations

Mon, May 15

BBlack created T165252: cp1053 possible hardware issues.
Mon, May 15, 1:28 AM · ops-eqiad, Traffic, Operations

Tue, May 9

BBlack added a comment to T108435: Add proper expiry headers to kartotherian's responses.

Are you sure you want the tiles public-cacheable as well? It takes load off of us, but it also puts the purging/invalidation of them on update out of our control (in the users' caches). We might want to sync up on how we want VCL to handle/mask the header as well (and all related things), maybe on IRC or Hangouts when we all get a chance.

Tue, May 9, 8:13 PM · Maps (Kartotherian), Discovery, Interactive-Sprint, Easy
BBlack added a comment to T163674: Frequent RST returned by appservers to LVS hosts.

You might want to look at the other side of the nginx proxy as well. Perhaps apache is terminating its connection to the local nginx with RST, and this causes nginx's proxy in turn to RST upstream to the actual client?

Tue, May 9, 7:07 PM · Pybal, Traffic, Operations, netops
BBlack added a comment to T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+).

APNIC has a good writeup here (first half is TCP history redux, second half goes into interesting details and new data on BBR): https://blog.apnic.net/2017/05/09/bbr-new-kid-tcp-block/

Tue, May 9, 4:33 PM · Patch-For-Review, Performance-Team, Traffic, Operations
BBlack updated subscribers of T164608: Merge cache_maps into cache_upload functionally.

@elukey - I think the only real analytics fallout here is that the data that is currently feeding to you as webrequest_maps will become data that's mixed into the existing feed of webrequest_upload. They'll still be differentiated on the request hostname (upload.wikimedia.org vs maps.wikimedia.org). At the time of deploy for the final transition commit, the data would move over smoothly from one to the other over a period of several minutes. Is this something that requires some special accommodation on the analytics end first? Timeline?

Tue, May 9, 2:48 PM · Patch-For-Review, Operations, Traffic
BBlack moved T164173: Cache invalidations coming from the JobQueue are causing lag on several wikis from Triage to Watching on the Traffic board.
Tue, May 9, 12:50 PM · Wikidata, Traffic, DBA, Performance-Team, Operations
BBlack moved T164376: [Discuss] Split ORES scores in datacenters based on wiki from Triage to Watching on the Traffic board.
Tue, May 9, 12:49 PM · Traffic, Scoring-platform-team-Backlog, ORES, ChangeProp, Operations
BBlack moved T164460: Use DNS discovery record for deployment CNAME from Triage to Watching on the Traffic board.
Tue, May 9, 12:49 PM · DNS, Traffic, User-fgiunchedi, Operations
BBlack moved T164327: replace ulsfo aging servers from Triage to Caching on the Traffic board.
Tue, May 9, 12:47 PM · Patch-For-Review, Traffic, Operations, ops-ulsfo

Mon, May 8

BBlack created T164768: Explicitly limit varnishd transient storage.
Mon, May 8, 5:02 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T163312: lvs2001: intermittent packet loss from Icinga checks.

Updates from IRC-only work - a significant majority of our ICMP echo volume is coming from a large number of IPs owned by Google. The TODO here is to compile information on that which we can forward to their abuse@.

Mon, May 8, 4:46 PM · Patch-For-Review, netops, Traffic, Operations
BBlack updated subscribers of T164610: Unprovision cache_misc @ ulsfo.

We could do so as a goal at the end of the process, depending how we arrange things.

Mon, May 8, 2:00 PM · Operations, Traffic

Fri, May 5

BBlack added a subtask for T164327: replace ulsfo aging servers: T164610: Unprovision cache_misc @ ulsfo.
Fri, May 5, 6:00 PM · Patch-For-Review, Traffic, Operations, ops-ulsfo
BBlack added a parent task for T164610: Unprovision cache_misc @ ulsfo: T164327: replace ulsfo aging servers.
Fri, May 5, 6:00 PM · Operations, Traffic
BBlack created T164610: Unprovision cache_misc @ ulsfo.
Fri, May 5, 6:00 PM · Operations, Traffic
BBlack created T164609: Merge cache_misc into cache_text functionally.
Fri, May 5, 5:57 PM · Operations, Traffic
BBlack created T164608: Merge cache_maps into cache_upload functionally.
Fri, May 5, 5:56 PM · Patch-For-Review, Operations, Traffic
BBlack created T164587: cumin could use randomization/splay options.
Fri, May 5, 2:39 PM · Operations, Operations-Software-Development
BBlack added a comment to T164579: Investigate nginx reload behavior.

Hmmm another thing - when we first deployed this OCSP updating method, GlobalSign was giving us 8-hour OCSP validity windows. At present (just checked) we're getting 4-day validity from GlobalSign and 7-day validity from Digicert. Perhaps we should back off the OCSP timing from once an hour to once a day in light of this, and use cron_splay instead of fqdn_rand while we're at it?
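
To illustrate the once-a-day idea, the sketch below derives a stable per-host time of day from the hostname, so that hosts do not all refresh at the same minute. It is not the implementation of cron_splay or fqdn_rand. The function names, the seed string, and the hostname and command in the example are hypothetical, and a hash-based offset like this does not guarantee an even spread across a small fleet.

```python
import hashlib

MINUTES_PER_DAY = 24 * 60


def daily_slot(fqdn, seed="ocsp-update"):
    """Map a hostname to a stable (hour, minute) within the day."""
    digest = hashlib.sha256(f"{seed}:{fqdn}".encode()).digest()
    offset = int.from_bytes(digest[:8], "big") % MINUTES_PER_DAY
    return divmod(offset, 60)


def cron_line(fqdn, command):
    """Build a once-a-day crontab entry at the host's stable slot."""
    hour, minute = daily_slot(fqdn)
    return f"{minute} {hour} * * * {command}"


if __name__ == "__main__":
    # Hypothetical host and command, for illustration only.
    print(cron_line("cp9999.example.org", "/usr/local/sbin/update-ocsp"))
```

With the 4-day and 7-day validity windows mentioned above, a daily refresh would still leave a few further daily attempts before a stapled response expires if one run fails.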

Fri, May 5, 1:24 PM · Patch-For-Review, Traffic, Operations
BBlack added a comment to T164579: Investigate nginx reload behavior.

Also note from that lengthy post - if we were willing to test the scalability of iptables on cache hosts (which we've avoided for fear that it won't scale over cores like the rest of what we're doing) and it works out, there are iptables hacks around this where you temporarily block new SYNs or dataless ACKs around the quick reload, which might work with nginx's methods.

Fri, May 5, 12:57 PM · Patch-For-Review, Traffic, Operations
BBlack added a comment to T164579: Investigate nginx reload behavior.

How timely! The subject of how to do completely-seamless reloads (especially for TCP) is quite thorny. I've been pondering it and fighting with the issues for years on the UDP side for gdnsd too. And then just yesterday, an HAProxy blog post popped up that goes into the whole thing in great detail with their own struggles and final solution, which (along with lots of other interesting details) is that nothing will be perfect unless you do SCM_RIGHTS handoff over a unix socket. I think nginx isn't doing that, they're still doing SO_REUSEPORT-based takeover. https://www.haproxy.com/blog/truly-seamless-reloads-with-haproxy-no-more-hacks/
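
For readers unfamiliar with the mechanism, an SCM_RIGHTS handoff passes an already-open listening socket from the old process to the new one as ancillary data on a unix domain socket, so the listen queue is never torn down. The sketch below shows only that core step, using Python's standard `sendmsg`/`recvmsg`. It is not how HAProxy or nginx implement reloads, the function names are hypothetical, and the demo runs both ends inside one process over a `socketpair` for brevity.

```python
import array
import socket


def send_listener(channel, listener):
    """Old process: pass the listening socket's fd over a unix socket."""
    fds = array.array("i", [listener.fileno()])
    # One byte of ordinary data must accompany the ancillary message.
    channel.sendmsg([b"L"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_listener(channel):
    """New process: receive the fd and wrap it in a socket object."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = channel.recvmsg(
        1, socket.CMSG_LEN(fds.itemsize)
    )
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any truncated trailing bytes.
            usable = len(data) - (len(data) % fds.itemsize)
            fds.frombytes(data[:usable])
    if not fds:
        raise RuntimeError("no file descriptor received")
    return socket.socket(fileno=fds[0])


if __name__ == "__main__":
    old_end, new_end = socket.socketpair()
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(8)

    send_listener(old_end, listener)
    inherited = recv_listener(new_end)
    listener.close()  # the "old process" goes away; the queue survives
    print("still listening on", inherited.getsockname())
```

Because the received descriptor refers to the same underlying socket, connections that arrive during the handoff stay queued on it rather than being reset.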

Fri, May 5, 12:54 PM · Patch-For-Review, Traffic, Operations

Thu, May 4

BBlack added a project to T164456: Build nginx without image filter support: Traffic.

Yeah, @faidon has brought up a similar argument before on a slightly different level: that we shouldn't be using nginx-full on most hosts anyways, since we use virtually none of the plugin modules. Somewhere there's an intersection of these ideas that makes life easier.

Thu, May 4, 6:05 PM · Traffic, Operations
BBlack added a comment to T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+).

Interesting data on the topic of BBR under datacenter conditions (low latency 100GbE), possibly supporting the idea that it's not awful to enable it everywhere: https://groups.google.com/forum/#!topic/bbr-dev/U4nlHzS-RFA

Thu, May 4, 1:47 PM · Patch-For-Review, Performance-Team, Traffic, Operations
BBlack added a comment to T164327: replace ulsfo aging servers.

With the cable in the second port + T164444 we're blocked on getting a successful test install. @RobH is asking smart hands to swap the cable, and I'll proceed once one of those two issues is resolved.

Thu, May 4, 12:47 AM · Patch-For-Review, Traffic, Operations, ops-ulsfo
BBlack added a comment to T164327: replace ulsfo aging servers.

ok I'm installing jessie onto cp4021 now (just to test configuration issues and patch up puppet for the real installs later!). Things I found while trying to boot:

Thu, May 4, 12:11 AM · Patch-For-Review, Traffic, Operations, ops-ulsfo

Wed, May 3

BBlack added a comment to T145661: varnish backends start returning 503s after ~6 days uptime.

Reformatting this a bit for comparison, and using the "new" binning (which splits 0-1K from 1K-16K):

Wed, May 3, 11:15 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T145661: varnish backends start returning 503s after ~6 days uptime.

I've re-run the binning analysis, with a few minor changes:

Wed, May 3, 10:55 PM · Patch-For-Review, Operations, Traffic
BBlack added a comment to T164376: [Discuss] Split ORES scores in datacenters based on wiki.

To re-iterate what @Joe is saying a little differently: the point of cross-dc active/active (which is a goal for all services) is to have the ability at any moment in time to handle all traffic in just one DC because we've suddenly lost or depooled the other. We're not sharding data cross-DC, or distributing maximum load capacity cross-DC.

Wed, May 3, 3:11 PM · Traffic, Scoring-platform-team-Backlog, ORES, ChangeProp, Operations
BBlack added a comment to T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+).

@Gilles - FYI the kernel upgrades that were blocking this are done, and we're tentatively looking at turning on BBR on May 22, so that we have a week of post-switchback stats to compare when looking at NavTiming impact.

Wed, May 3, 2:21 PM · Patch-For-Review, Performance-Team, Traffic, Operations
BBlack added a comment to T164327: replace ulsfo aging servers.

Yeah we should discuss our options a bit here re: minimizing ulsfo downtime; I think we have a few options for how we arrange this. There are some complicating factors with the misc and maps clusters: the new hardware config assumes we've done the software work to fold those into the primary clusters. The reason we didn't block is that the backup plan is to simply not have misc and maps endpoints in ulsfo until the software side of the work is done.

Wed, May 3, 2:08 PM · Patch-For-Review, Traffic, Operations, ops-ulsfo

Thu, Apr 27

BBlack added a comment to T145661: varnish backends start returning 503s after ~6 days uptime.

Some general updates on bin-sizing estimates: Based on the graph data for available bytes in each bin and comparing how fast they initially reach zero after restarts on live servers, we can make a few tweaks. Bins 1 and 2 seem slightly-oversized, while bins 0, 3, and 4 are slightly undersized. Proposed tweak to upload.pp based on the live data would be:

Thu, Apr 27, 12:34 PM · Patch-For-Review, Operations, Traffic

Wed, Apr 26

BBlack merged task T100690: Enable add_ip6_mapped functionality on all hosts into T102099: Fix IPv6 autoconf issues once and for all, across the fleet..
Wed, Apr 26, 7:38 PM · Operations
BBlack merged T100690: Enable add_ip6_mapped functionality on all hosts into T102099: Fix IPv6 autoconf issues once and for all, across the fleet..
Wed, Apr 26, 7:38 PM · Operations, IPv6

Apr 24 2017

BBlack reopened T163674: Frequent RST returned by appservers to LVS hosts as "Open".

hmm, no, it is the HTTPS check, not the IdleConnection one. I wonder why it's RST and not regular close?

Apr 24 2017, 3:40 PM · Pybal, Traffic, Operations, netops
BBlack closed T163674: Frequent RST returned by appservers to LVS hosts as "Resolved".

I think this is "normal", it's from PyBal IdleConnection monitors (which just open a TCP conn and do no traffic, and eventually result in a RST).

Apr 24 2017, 3:38 PM · Pybal, Traffic, Operations, netops
BBlack moved T163251: Communicate this security change to affected editors and other community members from Triage to TLS on the Traffic board.
Apr 24 2017, 3:00 PM · Community-Liaisons (Jul-Sep 2017), Operations, Traffic

Apr 21 2017

Elitre awarded T163251: Communicate this security change to affected editors and other community members a Like token.
Apr 21 2017, 10:36 AM · Community-Liaisons (Jul-Sep 2017), Operations, Traffic

Apr 20 2017

BBlack added a subtask for T156033: Server hardware purchasing for Asia Cache DC: Unknown Object (Task).
Apr 20 2017, 8:17 PM · Operations, Traffic

Apr 19 2017

BBlack added a comment to T163323: Interface errors on asw-c-codfw:xe-7/0/46.

To do a soft-ish failover, on lvs2002 we can disable the puppet agent and stop pybal temporarily, wait a few minutes for traffic to settle over to lvs2005, and then re-seat or replace the optic on lvs2002 (and then restart pybal + re-enable puppet to bring lvs2002 back into service).

Apr 19 2017, 1:13 PM · Patch-For-Review, DC-Ops, Traffic, Operations, netops

Apr 18 2017

BBlack added a comment to T156256: Select or Acquire Address Space for Asia Cache DC.

We still have no real ETA on the IP addresses. We're attempting to acquire the address space from APNIC. They're (reasonably) requiring proof of our needs, which includes the physical address of the datacenter in Singapore (we're still evaluating multiple RFP responses), invoices for our equipment (which isn't ordered yet, for the same lack of a shipping address), and lists of our network peers in Singapore (which are, again, blocked on contracting with a datacenter vendor so we know which peers are available and what physical building we're peering at).

Apr 18 2017, 11:16 PM · Traffic, Operations
BBlack moved T145661: varnish backends start returning 503s after ~6 days uptime from Varnish v4 to Caching on the Traffic board.
Apr 18 2017, 6:30 PM · Patch-For-Review, Operations, Traffic
BBlack closed T138084: unix domain socket listening for varnish4 as "Resolved".

For now we've solved the pragmatic issues in other ways: some general nginx/varnish tuning, kernel TCP params tuning, and using 8x TCP sockets in parallel for the local traffic. I don't see any point in pursuing varnish patches for unix domain sockets at this time, or in the foreseeable Varnish future here.

Apr 18 2017, 6:30 PM · Traffic, Operations
BBlack moved T163233: Implement Varnish-level rough ratelimiting from Triage to Caching on the Traffic board.
Apr 18 2017, 6:28 PM · Traffic, Operations
BBlack closed T126206: Upgrade to Varnish 4: things to remember as "Resolved".

Going back over some of the unchecked boxes at the top:

Apr 18 2017, 6:28 PM · Varnish, Patch-For-Review, Traffic, Operations
BBlack closed T126206: Upgrade to Varnish 4: things to remember, a subtask of T131499: Upgrade all cache clusters to Varnish 4, as "Resolved".
Apr 18 2017, 6:27 PM · Patch-For-Review, Operations, Varnish, Traffic
BBlack added a parent task for T118365: Increase request limits for GETs to /api/rest_v1/: T163233: Implement Varnish-level rough ratelimiting.
Apr 18 2017, 6:26 PM · Operations, Traffic
BBlack added a subtask for T163233: Implement Varnish-level rough ratelimiting: T118365: Increase request limits for GETs to /api/rest_v1/.
Apr 18 2017, 6:26 PM · Traffic, Operations
BBlack added a parent task for T154704: Rate-limit browsers without referers: T163233: Implement Varnish-level rough ratelimiting.
Apr 18 2017, 6:25 PM · Operations, Traffic, Interactive-Sprint, Discovery, Maps
BBlack added a subtask for T163233: Implement Varnish-level rough ratelimiting: T154704: Rate-limit browsers without referers.
Apr 18 2017, 6:25 PM · Traffic, Operations
BBlack triaged T163233: Implement Varnish-level rough ratelimiting as "Normal" priority.
Apr 18 2017, 6:24 PM · Traffic, Operations
BBlack created T163233: Implement Varnish-level rough ratelimiting.
Apr 18 2017, 6:24 PM · Traffic, Operations
BBlack moved T162818: icinga alerts on nodejs services when a recdns server is depooled from Triage to DNS Infra on the Traffic board.
Apr 18 2017, 6:19 PM · Services (next), DNS, Traffic, Operations
BBlack changed the status of T162818: icinga alerts on nodejs services when a recdns server is depooled from "Open" to "Stalled".

@GWicke yeah we should.

Apr 18 2017, 6:19 PM · Services (next), DNS, Traffic, Operations
BBlack added a comment to T163141: dbtree: make wasat a working backend and become active-active.

Yeah, leave the traffic tag as we'll want to basically revert https://gerrit.wikimedia.org/r/#/c/348456/ once dbtree is ready for it.

Apr 18 2017, 6:15 PM · Traffic, DBA, Operations
BBlack moved T163141: dbtree: make wasat a working backend and become active-active from Triage to Caching on the Traffic board.
Apr 18 2017, 6:14 PM · Traffic, DBA, Operations
BBlack closed T159870: baham (ns1) CPU-related issues as "Resolved".

I'll close it for now. If we see more strange issues with super-low cpu freqs we can always search these up to correlate I guess.

Apr 18 2017, 2:30 AM · Traffic, Operations, ops-codfw
BBlack closed T159870: baham (ns1) CPU-related issues, a subtask of T162850: acpi_pad issues, as "Resolved".
Apr 18 2017, 2:30 AM · Patch-For-Review, Operations

Apr 17 2017

BBlack added a comment to T162976: dbtree broken (for some users?).

Yeah the patch I deployed above should have fixed the issue in this ticket. Both of the suggested followups would be ideal, but probably aren't pressing at this time.

Apr 17 2017, 7:48 PM · Patch-For-Review, Traffic, DBA, Operations
BBlack claimed T147199: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support).

I've deployed the change above, which gets all of the basics on track for how we want to operate the real campaign. We'll obviously make appropriate minor wording changes once we have dates set and as percentages increase.

Apr 17 2017, 6:47 PM · Patch-For-Review, Operations, Traffic
BBlack edited the description of T147199: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support).
Apr 17 2017, 3:58 PM · Patch-For-Review, Operations, Traffic
BBlack edited the description of T147199: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support).
Apr 17 2017, 2:55 PM · Patch-For-Review, Operations, Traffic
BBlack edited P5175 Proposed Browser Connection Security Synthetic Page.
Apr 17 2017, 2:53 PM · HTTPS, Traffic
BBlack added a comment to T162976: dbtree broken (for some users?).

(also, generally speaking errors aren't cached, but in this case the error would be cached, because it's returned with a 200 status code...)

Apr 17 2017, 1:52 PM · Patch-For-Review, Traffic, DBA, Operations
BBlack added a comment to T162976: dbtree broken (for some users?).

tendril.wikimedia.org is independent of varnish, only dbtree.wikimedia.org (that we're talking about here) goes through the standard varnish stuff (although arguably tendril should be moved there as well someday).

Apr 17 2017, 1:51 PM · Patch-For-Review, Traffic, DBA, Operations
BBlack added a subtask for T162683: Network hardware purchasing for Asia Cache DC: Unknown Object (Task).
Apr 17 2017, 1:43 PM · Operations, Traffic

Apr 13 2017

BBlack added a comment to T156029: Select location for Asia Cache DC.

This is the last time I'll respond to trolling on this ticket.

Apr 13 2017, 2:17 PM · Operations, Traffic
BBlack closed T155411: Reimage achernar and acamar to jessie as "Resolved".
Apr 13 2017, 1:59 AM · Patch-For-Review, Operations

Apr 12 2017

BBlack added a comment to T162850: acpi_pad issues.

modules/base/files/kernel/blacklist-wmf.conf is probably the place to try disabling this first, FWIW.

Apr 12 2017, 11:38 PM · Patch-For-Review, Operations
BBlack triaged T162850: acpi_pad issues as "High" priority.
Apr 12 2017, 11:36 PM · Patch-For-Review, Operations
BBlack created T162850: acpi_pad issues.
Apr 12 2017, 11:36 PM · Patch-For-Review, Operations
BBlack added a comment to T162818: icinga alerts on nodejs services when a recdns server is depooled.

FWIW - I did the same depooling (for reinstalls) in codfw this afternoon, and there was no impact in that case. So this seems to also be eqiad-specific (but that could be just an effect of all the real user load being in eqiad - maybe the same problem, whatever it is, happens in codfw but it's a non-issue due to light load).

Apr 12 2017, 11:30 PM · Services (next), DNS, Traffic, Operations
BBlack renamed T155411: Reimage achernar and acamar to jessie from "Reimage achernar and amacar to jessie" to "Reimage achernar and acamar to jessie".
Apr 12 2017, 8:46 PM · Patch-For-Review, Operations
BBlack created T162818: icinga alerts on nodejs services when a recdns server is depooled.
Apr 12 2017, 5:09 PM · Services (next), DNS, Traffic, Operations

Apr 11 2017

BBlack triaged T162683: Network hardware purchasing for Asia Cache DC as "Normal" priority.
Apr 11 2017, 12:46 PM · Operations, Traffic
BBlack triaged T162684: Network hardware configuration for Asia Cache DC as "Normal" priority.
Apr 11 2017, 12:45 PM · Traffic, Operations
BBlack moved T162684: Network hardware configuration for Asia Cache DC from Triage to Asia Cache DC on the Traffic board.
Apr 11 2017, 12:45 PM · Traffic, Operations
BBlack moved T162683: Network hardware purchasing for Asia Cache DC from Triage to Asia Cache DC on the Traffic board.
Apr 11 2017, 12:45 PM · Operations, Traffic