BBlack (Brandon Black)
WMF Operations Engineer

User Details

User Since
Nov 4 2014, 4:29 PM (129 w, 1 d)
Availability
Available
IRC Nick
bblack
LDAP User
BBlack
MediaWiki User
BBlack (WMF)

Recent Activity

Yesterday

BBlack merged task T100690: Enable add_ip6_mapped functionality on all hosts into T102099: Fix IPv6 autoconf issues once and for all, across the fleet..
Wed, Apr 26, 7:38 PM · Operations
BBlack merged T100690: Enable add_ip6_mapped functionality on all hosts into T102099: Fix IPv6 autoconf issues once and for all, across the fleet..
Wed, Apr 26, 7:38 PM · Operations, IPv6

Mon, Apr 24

BBlack reopened T163674: Frequent RST returned by appservers to LVS hosts as "Open".

hmm, no, it is the HTTPS check, not the IdleConnection one. I wonder why it's RST and not regular close?

Mon, Apr 24, 3:40 PM · Pybal, Traffic, Operations, netops
BBlack closed T163674: Frequent RST returned by appservers to LVS hosts as "Resolved".

I think this is "normal", it's from PyBal IdleConnection monitors (which just open a TCP conn and do no traffic, and eventually result in a RST).

Mon, Apr 24, 3:38 PM · Pybal, Traffic, Operations, netops
BBlack moved T163251: Communicate this security change to affected editors and other community members from Triage to TLS on the Traffic board.
Mon, Apr 24, 3:00 PM · Community-Liaisons (Jul-Sep 2017), Operations, Traffic

Fri, Apr 21

Elitre awarded T163251: Communicate this security change to affected editors and other community members a Like token.
Fri, Apr 21, 10:36 AM · Community-Liaisons (Jul-Sep 2017), Operations, Traffic

Thu, Apr 20

BBlack added a subtask for T156033: Server hardware purchasing for Asia Cache DC: Unknown Object (Task).
Thu, Apr 20, 8:17 PM · Operations, Traffic

Wed, Apr 19

BBlack added a comment to T163323: Interface errors on asw-c-codfw:xe-7/0/46.

To do a soft-ish failover, on lvs2002 we can disable the puppet agent and stop pybal temporarily, wait a few minutes for traffic to settle over to lvs2005, and then re-seat or replace the optic on lvs2002 (and then restart pybal + re-enable puppet to bring lvs2002 back into service).

Wed, Apr 19, 1:13 PM · Patch-For-Review, DC-Ops, Traffic, netops, Operations

Tue, Apr 18

BBlack added a comment to T156256: Select or Acquire Address Space for Asia Cache DC.

We still have no real ETA on the IP addresses. We're attempting to acquire the address space from APNIC. They're (reasonably) requiring proof of our needs, which includes the physical address of the datacenter in Singapore (we're still evaluating multiple RFP responses), invoices for our equipment (which isn't ordered for the same lack of a shipping address), lists of our network peers in Singapore (which is, again, blocked on contracting with a datacenter vendor so we know which peers are available and what physical building we're peering at).

Tue, Apr 18, 11:16 PM · Traffic, Operations
BBlack moved T145661: varnish backends start returning 503s after ~6 days uptime from Varnish v4 to Caching on the Traffic board.
Tue, Apr 18, 6:30 PM · Patch-For-Review, Operations, Traffic
BBlack closed T138084: unix domain socket listening for varnish4 as "Resolved".

For now we've solved the pragmatic issues in other ways: some general nginx/varnish tuning, kernel TCP params tuning, and using 8x TCP sockets in parallel for the local traffic. I don't see any point in pursuing varnish patches for unix domain sockets at this time, or in the foreseeable Varnish future here.

Tue, Apr 18, 6:30 PM · Traffic, Operations
BBlack moved T163233: Implement Varnish-level rough ratelimiting from Triage to Caching on the Traffic board.
Tue, Apr 18, 6:28 PM · Operations, Traffic
BBlack closed T126206: Upgrade to Varnish 4: things to remember as "Resolved".

Going back over some of the unchecked boxes at the top:

Tue, Apr 18, 6:28 PM · Varnish, Patch-For-Review, Traffic, Operations
BBlack closed T126206: Upgrade to Varnish 4: things to remember, a subtask of T131499: Upgrade all cache clusters to Varnish 4, as "Resolved".
Tue, Apr 18, 6:27 PM · Patch-For-Review, Operations, Varnish, Traffic
BBlack added a parent task for T118365: Increase request limits for GETs to /api/rest_v1/: T163233: Implement Varnish-level rough ratelimiting.
Tue, Apr 18, 6:26 PM · Operations, Traffic
BBlack added a subtask for T163233: Implement Varnish-level rough ratelimiting: T118365: Increase request limits for GETs to /api/rest_v1/.
Tue, Apr 18, 6:26 PM · Operations, Traffic
BBlack added a parent task for T154704: Rate-limit browsers without referers: T163233: Implement Varnish-level rough ratelimiting.
Tue, Apr 18, 6:25 PM · Traffic, Operations, Interactive-Sprint, Discovery, Maps
BBlack added a subtask for T163233: Implement Varnish-level rough ratelimiting: T154704: Rate-limit browsers without referers.
Tue, Apr 18, 6:25 PM · Operations, Traffic
BBlack triaged T163233: Implement Varnish-level rough ratelimiting as "Normal" priority.
Tue, Apr 18, 6:24 PM · Operations, Traffic
BBlack created T163233: Implement Varnish-level rough ratelimiting.
Tue, Apr 18, 6:24 PM · Operations, Traffic
BBlack moved T162818: icinga alerts on nodejs services when a recdns server is depooled from Triage to DNS Infra on the Traffic board.
Tue, Apr 18, 6:19 PM · Services (next), DNS, Traffic, Operations
BBlack changed the status of T162818: icinga alerts on nodejs services when a recdns server is depooled from "Open" to "Stalled".

@GWicke yeah we should.

Tue, Apr 18, 6:19 PM · Services (next), DNS, Traffic, Operations
BBlack added a comment to T163141: dbtree: make wasat a working backend and become active-active.

Yeah, leave the traffic tag as we'll want to basically revert https://gerrit.wikimedia.org/r/#/c/348456/ once dbtree is ready for it.

Tue, Apr 18, 6:15 PM · Traffic, DBA, Operations
BBlack moved T163141: dbtree: make wasat a working backend and become active-active from Triage to Caching on the Traffic board.
Tue, Apr 18, 6:14 PM · Traffic, DBA, Operations
BBlack closed T159870: baham (ns1) CPU-related issues as "Resolved".

I'll close it for now. If we see more strange issues with super-low cpu freqs we can always search these up to correlate I guess.

Tue, Apr 18, 2:30 AM · Traffic, Operations, ops-codfw
BBlack closed T159870: baham (ns1) CPU-related issues, a subtask of T162850: acpi_pad issues, as "Resolved".
Tue, Apr 18, 2:30 AM · Patch-For-Review, Operations

Mon, Apr 17

BBlack added a comment to T162976: dbtree broken (for some users?).

Yeah the patch I deployed above should have fixed the issue in this ticket. Both of the suggested followups would be ideal, but probably aren't pressing at this time.

Mon, Apr 17, 7:48 PM · Patch-For-Review, Traffic, DBA, Operations
BBlack claimed T147199: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support).

I've deployed the change above, which gets all of the basics on track for how we want to operate the real campaign. We'll obviously make appropriate minor wording changes once we have dates set and as percentages increase.

Mon, Apr 17, 6:47 PM · Patch-For-Review, Operations, Traffic
BBlack edited the description of T147199: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support).
Mon, Apr 17, 3:58 PM · Patch-For-Review, Operations, Traffic
BBlack edited the description of T147199: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support).
Mon, Apr 17, 2:55 PM · Patch-For-Review, Operations, Traffic
BBlack edited P5175 Proposed Browser Connection Security Synthetic Page.
Mon, Apr 17, 2:53 PM · HTTPS, Traffic
BBlack added a comment to T162976: dbtree broken (for some users?).

(also, generally speaking errors aren't cached, but in this case the error would be cached, because it's returned with a 200 status code...)

Mon, Apr 17, 1:52 PM · Patch-For-Review, Traffic, DBA, Operations
BBlack added a comment to T162976: dbtree broken (for some users?).

tendril.wikimedia.org is independent of varnish, only dbtree.wikimedia.org (that we're talking about here) goes through the standard varnish stuff (although arguably tendril should be moved there as well someday).

Mon, Apr 17, 1:51 PM · Patch-For-Review, Traffic, DBA, Operations
BBlack added a subtask for T162683: Network hardware purchasing for Asia Cache DC: Unknown Object (Task).
Mon, Apr 17, 1:43 PM · Operations, Traffic

Thu, Apr 13

BBlack added a comment to T156029: Select location for Asia Cache DC.

This is the last time I'll respond to trolling on this ticket.

Thu, Apr 13, 2:17 PM · Traffic, Operations
BBlack closed T155411: Reimage achernar and acamar to jessie as "Resolved".
Thu, Apr 13, 1:59 AM · Patch-For-Review, Operations

Wed, Apr 12

BBlack added a comment to T162850: acpi_pad issues.

modules/base/files/kernel/blacklist-wmf.conf is probably the place to try disabling this first, FWIW.

Wed, Apr 12, 11:38 PM · Patch-For-Review, Operations
BBlack triaged T162850: acpi_pad issues as "High" priority.
Wed, Apr 12, 11:36 PM · Patch-For-Review, Operations
BBlack created T162850: acpi_pad issues.
Wed, Apr 12, 11:36 PM · Patch-For-Review, Operations
BBlack added a comment to T162818: icinga alerts on nodejs services when a recdns server is depooled.

FWIW - I did the same depooling (for reinstalls) in codfw this afternoon, and there was no impact in that case. So this seems to also be eqiad-specific (but that could be just an effect of all the real user load being in eqiad - maybe the same problem, whatever it is, happens in codfw but it's a non-issue due to light load)

Wed, Apr 12, 11:30 PM · Services (next), DNS, Traffic, Operations
BBlack renamed T155411: Reimage achernar and acamar to jessie from "Reimage achernar and amacar to jessie" to "Reimage achernar and acamar to jessie".
Wed, Apr 12, 8:46 PM · Patch-For-Review, Operations
BBlack created T162818: icinga alerts on nodejs services when a recdns server is depooled.
Wed, Apr 12, 5:09 PM · Services (next), DNS, Traffic, Operations

Tue, Apr 11

BBlack triaged T162683: Network hardware purchasing for Asia Cache DC as "Normal" priority.
Tue, Apr 11, 12:46 PM · Operations, Traffic
BBlack triaged T162684: Network hardware configuration for Asia Cache DC as "Normal" priority.
Tue, Apr 11, 12:45 PM · Traffic, Operations
BBlack moved T162684: Network hardware configuration for Asia Cache DC from Triage to Asia Cache DC on the Traffic board.
Tue, Apr 11, 12:45 PM · Traffic, Operations
BBlack moved T162683: Network hardware purchasing for Asia Cache DC from Triage to Asia Cache DC on the Traffic board.
Tue, Apr 11, 12:45 PM · Operations, Traffic
BBlack added subtasks for T162684: Network hardware configuration for Asia Cache DC: T156256: Select or Acquire Address Space for Asia Cache DC, T156028: Name Asia Cache DC site.
Tue, Apr 11, 12:44 PM · Traffic, Operations
BBlack added a parent task for T156256: Select or Acquire Address Space for Asia Cache DC: T162684: Network hardware configuration for Asia Cache DC.
Tue, Apr 11, 12:44 PM · Traffic, Operations
BBlack added a parent task for T156028: Name Asia Cache DC site: T162684: Network hardware configuration for Asia Cache DC.
Tue, Apr 11, 12:44 PM · Operations, Traffic
BBlack removed a parent task for T156029: Select location for Asia Cache DC: T156028: Name Asia Cache DC site.
Tue, Apr 11, 12:43 PM · Traffic, Operations
BBlack removed a subtask for T156028: Name Asia Cache DC site: T156029: Select location for Asia Cache DC.
Tue, Apr 11, 12:43 PM · Operations, Traffic
BBlack removed a parent task for T156029: Select location for Asia Cache DC: T156031: Turn up network links for Asia Cache DC.
Tue, Apr 11, 12:42 PM · Traffic, Operations
BBlack removed a parent task for T156030: Select site vendor for Asia Cache Datacenter: T156031: Turn up network links for Asia Cache DC.
Tue, Apr 11, 12:42 PM · Traffic, Operations
BBlack removed subtasks for T156031: Turn up network links for Asia Cache DC: T156256: Select or Acquire Address Space for Asia Cache DC, T156029: Select location for Asia Cache DC, T156030: Select site vendor for Asia Cache Datacenter.
Tue, Apr 11, 12:42 PM · Operations, Traffic
BBlack removed a parent task for T156256: Select or Acquire Address Space for Asia Cache DC: T156031: Turn up network links for Asia Cache DC.
Tue, Apr 11, 12:42 PM · Traffic, Operations
BBlack added a subtask for T162683: Network hardware purchasing for Asia Cache DC: T156030: Select site vendor for Asia Cache Datacenter.
Tue, Apr 11, 12:42 PM · Operations, Traffic
BBlack added a parent task for T156030: Select site vendor for Asia Cache Datacenter: T162683: Network hardware purchasing for Asia Cache DC.
Tue, Apr 11, 12:42 PM · Traffic, Operations
BBlack removed a subtask for T156031: Turn up network links for Asia Cache DC: T162683: Network hardware purchasing for Asia Cache DC.
Tue, Apr 11, 12:41 PM · Operations, Traffic
BBlack removed a parent task for T162683: Network hardware purchasing for Asia Cache DC: T156031: Turn up network links for Asia Cache DC.
Tue, Apr 11, 12:41 PM · Operations, Traffic
BBlack added a subtask for T162684: Network hardware configuration for Asia Cache DC: T162683: Network hardware purchasing for Asia Cache DC.
Tue, Apr 11, 12:41 PM · Traffic, Operations
BBlack added a parent task for T162683: Network hardware purchasing for Asia Cache DC: T162684: Network hardware configuration for Asia Cache DC.
Tue, Apr 11, 12:41 PM · Operations, Traffic
BBlack added subtasks for T156031: Turn up network links for Asia Cache DC: T162684: Network hardware configuration for Asia Cache DC, T162683: Network hardware purchasing for Asia Cache DC, T156028: Name Asia Cache DC site.
Tue, Apr 11, 12:40 PM · Operations, Traffic
BBlack added a parent task for T156028: Name Asia Cache DC site: T156031: Turn up network links for Asia Cache DC.
Tue, Apr 11, 12:40 PM · Operations, Traffic
BBlack added a parent task for T162683: Network hardware purchasing for Asia Cache DC: T156031: Turn up network links for Asia Cache DC.
Tue, Apr 11, 12:40 PM · Operations, Traffic
BBlack added a parent task for T162684: Network hardware configuration for Asia Cache DC: T156031: Turn up network links for Asia Cache DC.
Tue, Apr 11, 12:40 PM · Traffic, Operations
BBlack created T162684: Network hardware configuration for Asia Cache DC.
Tue, Apr 11, 12:39 PM · Traffic, Operations
BBlack created T162683: Network hardware purchasing for Asia Cache DC.
Tue, Apr 11, 12:39 PM · Operations, Traffic
BBlack renamed T156032: Server hardware installation for Asia Cache DC from "Hardware installation for Asia Cache DC" to "Server hardware installation for Asia Cache DC".
Tue, Apr 11, 12:37 PM · Traffic, Operations
BBlack renamed T156033: Server hardware purchasing for Asia Cache DC from "Hardware purchasing for Asia Cache DC" to "Server hardware purchasing for Asia Cache DC".
Tue, Apr 11, 12:37 PM · Operations, Traffic

Mon, Apr 10

BBlack reassigned T162099: lvs2002 random shut down from BBlack to ayounsi.

@Papaul Everything looks good with lvs2002 (checked icinga, interfaces on correct vlans, etc).

Mon, Apr 10, 6:11 PM · ops-codfw, Traffic, Operations
BBlack claimed T162099: lvs2002 random shut down.

Switching this to me

Mon, Apr 10, 5:32 PM · ops-codfw, Traffic, Operations
BBlack closed T161819: Investigate 502 errors from nginx when backend returns 302 as "Resolved".

Merge above should fix this, at least for this case and any others on our cache terminators of similar magnitude (but not those with >8K of response header). Example URL from the description WFM now.

Mon, Apr 10, 12:44 PM · Patch-For-Review, Wikimedia-Logstash, Operations, Traffic

Sat, Apr 8

BBlack added a comment to T162239: cr2-esams FPC 0 is dead.

Update: others noticed the serial number didn't change. So, the new part is not yet installed, and we're not sure whether the old part recovered spontaneously, or due to some local action (e.g. reseated while inspecting, etc)

Sat, Apr 8, 6:35 PM · netops, Operations, ops-esams
BBlack added a comment to T162035: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser.

The last round of bans mentioned above is complete now. If all of our theories and workarounds are completely valid (and there aren't other bugs or behaviors in play), this issue should be resolved now with no remaining examples (or new ones being created).

Sat, Apr 8, 4:03 PM · Patch-For-Review, Traffic, Operations, media-storage, User-Urbanecm
BBlack added a comment to T162239: cr2-esams FPC 0 is dead.

As best as I can tell from looking at a longer section of the cr2-esams logs, it really does look like esams remote hands already swapped in the replacement part and things came up normally (with a brief 503 spike). The part was already on-site a day or so before this according to UPS tracking. The logs are currently spamming some errors about misconfigured BGP peers, but that may well be "normal". A large number of peers are established and working fine.

Sat, Apr 8, 1:09 PM · netops, Operations, ops-esams
BBlack added a comment to T162035: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser.

The continued reports above were expected, as detailed when the Varnish-level workaround was applied above in T162035#3159658. I've done another of the periodic bans this morning. After giving some time for that impact to settle, I'll start later today on executing and monitoring the more-complete ban on "all objects without X-Original-Content-Type". After that ban, we should be able to get the workaround to take complete effect with one last ban on CT ~ text/html across the fleet. Next week we'll sort out the plans for solving the underlying issue with Swift so that we can eventually revert the Varnish-level hacks and restore our storage keep-time, etc.

Sat, Apr 8, 12:53 PM · Patch-For-Review, Traffic, Operations, media-storage, User-Urbanecm
BBlack merged T162483: Missing thumbnail image on Commons into T162035: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser.
Sat, Apr 8, 12:48 PM · Patch-For-Review, Traffic, Operations, media-storage, User-Urbanecm
BBlack merged task T162483: Missing thumbnail image on Commons into T162035: Some PNG thumbnails and JPEG originals delivered as [text/html] content-type and hence not rendered in browser.
Sat, Apr 8, 12:48 PM

Fri, Apr 7

BBlack added a comment to T155411: Reimage achernar and acamar to jessie.

What's the status of T154759? The last time we had a DNS recursor down, this led to various problems (I don't remember all the details, though).

Fri, Apr 7, 1:52 PM · Patch-For-Review, Operations
BBlack raised the priority of T155411: Reimage achernar and acamar to jessie from "Normal" to "High".

This is going to block deploying edns-client-subnet-enabled recdns packages (requires jessie), which is important for the DC switching stuff. Perhaps we can squeeze this in next week?

Fri, Apr 7, 12:55 PM · Patch-For-Review, Operations

Tue, Apr 4

BBlack edited the description of T162073: Ops Onboarding for Arzhel Younsi.
Tue, Apr 4, 9:54 PM · Patch-For-Review, Traffic, Operations, Ops-Access-Requests

Mon, Apr 3

BBlack edited the description of T162073: Ops Onboarding for Arzhel Younsi.
Mon, Apr 3, 6:48 PM · Patch-For-Review, Traffic, Operations, Ops-Access-Requests
BBlack added a comment to T162073: Ops Onboarding for Arzhel Younsi.

Added to other email aliases in private repo as well: dns-admin, peering, ripe-updates

Mon, Apr 3, 6:22 PM · Patch-For-Review, Traffic, Operations, Ops-Access-Requests
BBlack edited the description of T162073: Ops Onboarding for Arzhel Younsi.
Mon, Apr 3, 6:07 PM · Patch-For-Review, Traffic, Operations, Ops-Access-Requests
BBlack edited the description of T162073: Ops Onboarding for Arzhel Younsi.
Mon, Apr 3, 6:03 PM · Patch-For-Review, Traffic, Operations, Ops-Access-Requests
BBlack edited the description of T162073: Ops Onboarding for Arzhel Younsi.
Mon, Apr 3, 6:03 PM · Patch-For-Review, Traffic, Operations, Ops-Access-Requests
BBlack edited projects for T162073: Ops Onboarding for Arzhel Younsi, added: Traffic; removed Patch-For-Review.
Mon, Apr 3, 5:58 PM · Patch-For-Review, Traffic, Operations, Ops-Access-Requests
BBlack created T162073: Ops Onboarding for Arzhel Younsi.
Mon, Apr 3, 5:56 PM · Patch-For-Review, Traffic, Operations, Ops-Access-Requests
BBlack added a comment to T147199: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support).

Depending on the context I've been flipping between whether we're talking about just 3DES or both of the non-FS ciphers, sorry. In current weekly stats, 3DES is around 0.125% and AES128-SHA is around 0.225% for a total of ~0.35% non-FS ( https://grafana.wikimedia.org/dashboard/db/tls-ciphers ). Probably the bolded redirect notice with the 0.2% number should be removed from the wikitech page, and the synthetic varnish error shown here should use "less than 0.2%" during the 3DES campaign. Once 3DES is disabled we can re-assess how we approach AES128-SHA and fix up various things appropriately.

Mon, Apr 3, 2:26 PM · Patch-For-Review, Operations, Traffic
BBlack created P5187 ms-fe1008 diff.
Mon, Apr 3, 1:44 PM

Fri, Mar 31

BBlack added a comment to T147199: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support).

We've been stalling on this a bit too long now. I'd like to start kicking off this process and getting in touch with Community as well. I've kinda backtracked on the idea of a redirect to meta-wiki for the initial notice to the user. I think for the initial page replacement, we should stick with a synthetic output directly from Varnish itself. That output can in turn contain a link to our existing wikitech page, or a similar page to that one on meta-wiki that's more-detailed. I've stolen from our existing Varnish errorpage.html and proposed some HTML for this here: P5175 (I wish pastes could be viewed raw with content-type!). Obviously wording and layout can be worked on a bit (and actual dates inserted for the real thing), but I like it having our standardized error theme and logo.

Fri, Mar 31, 4:04 PM · Patch-For-Review, Operations, Traffic
BBlack created P5175 Proposed Browser Connection Security Synthetic Page.
Fri, Mar 31, 3:58 PM · HTTPS, Traffic

Thu, Mar 30

BBlack added a comment to T161819: Investigate 502 errors from nginx when backend returns 302.

Ok, I was wrong in my initial thinking. Even though we configure proxy_buffering off;, proxy_buffer_size is still a factor. Technically it only defines a chunk-size for reading the response according to the docs, but I'm guessing if it can't read all of the headers in the first chunk it fails. Manual experiments made the logstash url work with proxy_buffer_size 8k;. This might bloat nginx memory usage if applied in the general case, but I don't think it's enough to really hurt anything.

Thu, Mar 30, 9:16 PM · Patch-For-Review, Wikimedia-Logstash, Traffic, Operations
BBlack added a comment to T161819: Investigate 502 errors from nginx when backend returns 302.

The content of the location header is 3960 bytes

Thu, Mar 30, 7:46 PM · Patch-For-Review, Wikimedia-Logstash, Traffic, Operations
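
The two T161819 comments above can be sanity-checked with rough arithmetic. The sketch below is only an illustration, not nginx's actual parsing logic: it adds up the bytes of a response head whose Location value is 3960 bytes (the size quoted in the ticket) and compares the total against a 4096-byte and an 8192-byte buffer. The 4096 figure assumes a one-page default for proxy_buffer_size (the nginx documentation gives the default as 4k or 8k depending on platform), and every header other than the Location length is a hypothetical placeholder.

```python
def response_head_size(status_line, headers):
    """Bytes in the status line, all headers, and the terminating blank line (CRLF endings)."""
    size = len(status_line.encode()) + 2
    for name, value in headers:
        size += len(f"{name}: {value}".encode()) + 2
    return size + 2  # final CRLF that ends the header block


def fits_in_buffer(head_size, buffer_size):
    """True if the whole response head could be read in a single buffer of this size."""
    return head_size <= buffer_size


# Only the 3960-byte Location length comes from the ticket; the rest is made up.
headers = [
    ("Location", "x" * 3960),
    ("Content-Type", "text/html; charset=utf-8"),
    ("Content-Length", "0"),
    ("Date", "Thu, 30 Mar 2017 19:46:00 GMT"),
    ("Cache-Control", "no-cache"),
]

head = response_head_size("HTTP/1.1 302 Found", headers)
for buffer_size in (4096, 8192):
    print(buffer_size, fits_in_buffer(head, buffer_size))
```

With these placeholder headers the head is slightly over 4096 bytes, which is consistent with the guess in the comment that the first chunk could not hold all of the headers, and with the later note that responses with more than 8K of header would still fail.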

Wed, Mar 29

BBlack added a project to T161517: Allow anonymous users to change interface language on Commons with ULS: Traffic.

Adding Traffic and myself and @ema to this. I don't think we've been aware of the uselang hack or its mechanics before (why did ?uselang=foo trigger uncacheability in the first place? query params vary the cache by default, it would've been fine and preferable to leave it cacheable...).

Wed, Mar 29, 3:16 PM · Operations, Traffic, Patch-For-Review, Commons, Wikimedia-Site-requests, I18n

Tue, Mar 28

BBlack added a comment to T114104: pybal doesn't fully manage LVS table leaving stale services (on IP change).

I think wiping the whole table, even at startup, is probably not ideal (but certainly better than wiping it on shutdown!). What we should really be aiming for is just better state-sync. Pybal should delete unconfigured services on startup, but it shouldn't delete and then recreate ones that remained stable. So basically it needs to read the current state and model from that what the minimal actions are to bring it into alignment with configuration. What it seems to do now is more blind/idempotent than that, but it leads to these kinds of issues.

Tue, Mar 28, 10:26 PM · Traffic, Operations, Pybal
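
The state-sync idea in the T114104 comment above (read the current state, then compute the minimal actions that align it with configuration) can be sketched as a plain dictionary diff. This is a hypothetical illustration, not PyBal code: the `plan_actions` name, the `(protocol, ip, port)` service key, and the `scheduler` setting are all assumptions, and the addresses are documentation-range placeholders.

```python
def plan_actions(current, desired):
    """Return the minimal list of (operation, service_key) steps turning current into desired.

    Both arguments map a service key, here (protocol, ip, port), to its settings.
    Services present and identical on both sides produce no action at all.
    """
    actions = []
    for key in sorted(current.keys() - desired.keys()):
        actions.append(("delete", key))  # stale: in the table, not in config
    for key in sorted(desired.keys() - current.keys()):
        actions.append(("add", key))  # configured, but missing from the table
    for key in sorted(current.keys() & desired.keys()):
        if current[key] != desired[key]:
            actions.append(("edit", key))  # same service, changed settings
    return actions


# A service moved from .10 to .20; the one on 198.51.100.5 stayed stable.
current = {
    ("tcp", "192.0.2.10", 80): {"scheduler": "wrr"},
    ("tcp", "198.51.100.5", 443): {"scheduler": "sh"},
}
desired = {
    ("tcp", "192.0.2.20", 80): {"scheduler": "wrr"},
    ("tcp", "198.51.100.5", 443): {"scheduler": "sh"},
}

actions = plan_actions(current, desired)
print(actions)
```

In this example the stale service on the old IP is deleted and the new one is added, while the stable service is never touched, which is the behavior the comment argues for over wiping the whole table.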

Mar 24 2017

BBlack moved T161148: AuthDNS CM/CI refactor from Triage to DNS Infra on the Traffic board.
Mar 24 2017, 12:19 PM · DNS, Traffic, Operations

Mar 23 2017

BBlack added a comment to T161145: Fix the general problem of randomly-bad puppet agent cron timings within redundant clusters.

I was trying to think of a way to do this that isn't quite as stateful as current cron_splay, but I haven't thought of a good one yet. If we assume we're trying to just extend the cron_splay mechanism to cover a more-general case like this, it would need the entire nodelist the global cron is applied to, as well as some way to notice which nodes are part of a shared cluster (e.g. name of applied role class?), and some way to identify datacenter/site (current cron_splay uses NNNN from hostname, which doesn't apply to all clusters).

Mar 23 2017, 5:53 PM · Operations
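
One way to picture the requirement in the T161145 comment above (the full node list, plus a way to tell which nodes share a cluster) is a deterministic assignment that spreads each cluster's nodes evenly across the cron period. The sketch below is an assumption-laden illustration and does not reflect how the existing cron_splay works internally: the `splay_offsets` name, the 30-minute period, the hostnames, and the cluster labels are all hypothetical. A site or datacenter label could be folded into the grouping key in the same way.

```python
from collections import defaultdict


def splay_offsets(nodes, period_minutes=30):
    """Map hostname -> minute offset so nodes in the same cluster never share a slot.

    nodes is an iterable of (hostname, cluster) pairs. Hostnames are sorted within
    each cluster, so the result depends only on the node list, not on its order.
    """
    by_cluster = defaultdict(list)
    for hostname, cluster in nodes:
        by_cluster[cluster].append(hostname)

    offsets = {}
    for members in by_cluster.values():
        members.sort()
        for index, hostname in enumerate(members):
            # Evenly spaced slots across the period for this cluster.
            offsets[hostname] = (index * period_minutes) // len(members)
    return offsets


nodes = [
    ("cache-b", "cache"),
    ("cache-a", "cache"),
    ("cache-c", "cache"),
    ("dns-a", "recdns"),
    ("dns-b", "recdns"),
]
offsets = splay_offsets(nodes)
print(offsets)
```

The trade-off the comment points at is visible here: the assignment is stateful in the sense that adding or removing a node shifts the offsets of its cluster peers.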
BBlack updated subscribers of T161145: Fix the general problem of randomly-bad puppet agent cron timings within redundant clusters.
Mar 23 2017, 5:15 PM · Operations
BBlack edited the description of T161148: AuthDNS CM/CI refactor.
Mar 23 2017, 3:34 PM · DNS, Operations, Traffic