aqs-http-gateway services at risk due to inaccessible cassandra hosts
Open, Medium, Public

Description

It was discovered earlier today that one pod of editor-analytics was crash looping due to (initial) liveness check timeouts, with logs reporting gocql client timeouts when communicating with aqs1012 - e.g.,

2026/04/13 16:39:35 gocql: unable to dial control conn 10.64.32.145:9042: dial tcp 10.64.32.145:9042: i/o timeout

Initially, I suspected this was the scenario we've seen previously in T366851: gocql startup times have increased between v1.2.0 and v1.6.0, where, for certain versions of the gocql client, even one unresponsive host incurs a timeout at startup that is greater than the total liveness probe budget of check period (10s) x attempts (3) = 30s.

It turned out that it was indeed that, but not in the way I initially thought: the i/o timeout errors for aqs1012 were a red herring, and the real problem was timeouts connecting to the newly added aqs1023 and aqs1024 hosts upon discovery, due to a pending external-services network policy change (T423168#11816149).

Action items:

  • Apply pending external-services network policy changes (risk mitigated)
  • Remove aqs1012 from the initial host list https://gerrit.wikimedia.org/r/1270496 (resolve confusing timeout)
  • Update Cassandra host turnup procedure to include external-services update step - i.e., after they become visible to this puppet query, but before clients will discover them [0]. Note that this is needed regardless of whether the client is configured to behave in a sensible way at startup in the presence of unreachable hosts.
  • Configure gocql clients to not block startup in the presence of unreachable hosts. See discussion in T366851 (e.g., one possible option being to shorten the timeout when attempting to connect to discovered peers). Alternatively, or in addition, decouple client init from service liveness, and instead only reflect init completion in readiness.
  • (defense in depth) Consider setting initialDelaySeconds on all liveness checks once again. Done in https://gerrit.wikimedia.org/r/1270980.

[0] Deploying this is identical to the procedure in https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d/admin_ng. You should also be able to add --selector name=external-services to the helmfile commands to narrow the scope just to the external-services policies (if there are other pending diffs you do not recognize).
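To illustrate the fourth action item, here is a minimal sketch of bounding the time spent on unreachable hosts: every address is dialed concurrently with a short per-host timeout, so the total wait is roughly one timeout rather than the sum of all of them. This uses only the Go standard library as a stand-in and does not show gocql's own configuration; `probeHosts`, `localListener`, and the 500 ms timeout are hypothetical names and values chosen for the example.

```go
package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

// probeHosts (hypothetical helper) dials every address concurrently, each with
// its own timeout, so the total wait is bounded by roughly one timeout rather
// than the sum over all hosts. It reports which addresses accepted a connection.
func probeHosts(addrs []string, timeoutMillis int) map[string]bool {
	timeout := time.Duration(timeoutMillis) * time.Millisecond
	results := make(map[string]bool, len(addrs))
	var mu sync.Mutex
	var wg sync.WaitGroup
	for _, addr := range addrs {
		wg.Add(1)
		go func(addr string) {
			defer wg.Done()
			conn, err := net.DialTimeout("tcp", addr, timeout)
			if err == nil {
				conn.Close()
			}
			mu.Lock()
			results[addr] = err == nil
			mu.Unlock()
		}(addr)
	}
	wg.Wait()
	return results
}

// localListener (hypothetical helper) opens a throwaway TCP listener on
// loopback so the demo does not depend on any real Cassandra host.
func localListener() (addr string, stop func()) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	return ln.Addr().String(), func() { ln.Close() }
}

func main() {
	up, stopUp := localListener()
	defer stopUp()
	down, stopDown := localListener()
	stopDown() // closed immediately, so dials to this address should fail

	for addr, ok := range probeHosts([]string{up, down}, 500) {
		fmt.Printf("%s reachable=%v\n", addr, ok)
	}
}
```

A service could log the unreachable addresses and continue starting with the reachable ones, which also avoids the timeout passing silently.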

Event Timeline

I've verified that manually deleting an editor-analytics pod in staging will trigger crash looping, and then setting initialDelaySeconds on the liveness probe (in this case 40s) will resolve it.
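The arithmetic behind this can be sketched as follows. It is a rough approximation that follows the task description's "check period x attempts" framing and simply adds the initial delay; it ignores per-probe timeouts and exact probe scheduling. The `probe` type and the 35s startup time are hypothetical, chosen only to show a startup that exceeds 30s but fits once a 40s delay is added.

```go
package main

import "fmt"

// probe holds the liveness-probe settings relevant here, in seconds.
// The type and field names are illustrative, not a Kubernetes API type.
type probe struct {
	initialDelaySeconds int
	periodSeconds       int
	failureThreshold    int
}

// budgetSeconds approximates how long a container may remain unresponsive
// after start before it is restarted: initial delay + period x attempts.
func (p probe) budgetSeconds() int {
	return p.initialDelaySeconds + p.periodSeconds*p.failureThreshold
}

// survives reports whether a startup of the given length fits in the budget.
func (p probe) survives(startupSeconds int) bool {
	return startupSeconds < p.budgetSeconds()
}

func main() {
	const startupSeconds = 35 // hypothetical startup time, longer than 30s

	noDelay := probe{initialDelaySeconds: 0, periodSeconds: 10, failureThreshold: 3}
	withDelay := probe{initialDelaySeconds: 40, periodSeconds: 10, failureThreshold: 3}

	fmt.Println(noDelay.budgetSeconds(), noDelay.survives(startupSeconds))     // 30 false
	fmt.Println(withDelay.budgetSeconds(), withDelay.survives(startupSeconds)) // 70 true
}
```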

Change #1270496 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] aqs2-common: Remove decommed aqs1012

https://gerrit.wikimedia.org/r/1270496

Change #1270496 merged by jenkins-bot:

[operations/deployment-charts@master] aqs2-common: Remove decommed aqs1012

https://gerrit.wikimedia.org/r/1270496

Plot twist:

Deploying https://gerrit.wikimedia.org/r/1270496 to editor-analytics staging failed, again with an (initial) liveness check timeout, but without the i/o timeout error. So, what gives?

It turns out there was also a latent external-services network policy change, to allow egress to aqs1023 and aqs1024, which were added to the cluster last week. Once that was deployed to staging, I was able to cleanly apply https://gerrit.wikimedia.org/r/1270496.

So, we need to:

  1. Apply the latent network policy change to other clusters.
  2. Apply https://gerrit.wikimedia.org/r/1270496 to the remaining services x clusters.

Mentioned in SAL (#wikimedia-operations) [2026-04-13T17:40:19Z] <swfrench-wmf> applied latent external-services network policy changes for aqs{1023,1024} - T423168

So, once the external-services network policy changes were applied, the crash-looping pod in editor-analytics was able to start successfully.

That means the i/o timeout errors emitted during startup due to the continued presence of aqs1012 in the initial list were a red herring. In reality, it was the missing egress network policy entries for the discovered cluster hosts that were leading to the timeouts, and this was happening silently within the allowed liveness time limit (i.e., presumably a dial timeout that's longer than the one used to connect to the initial set), similar to what we saw in T366851.

Scott_French renamed this task from aqs-http-gateway services at risk from defunct hosts in cassandra_hosts to aqs-http-gateway services at risk due to inaccessible cassandra hosts. Apr 13 2026, 5:51 PM
Scott_French updated the task description.

Mentioned in SAL (#wikimedia-operations) [2026-04-13T19:14:52Z] <swfrench-wmf> applied aqs cassandra host list changes from https://gerrit.wikimedia.org/r/1270496 - T423168

Scott_French lowered the priority of this task from High to Medium. Apr 13 2026, 7:32 PM
Scott_French added a project: Cassandra.

@Eevans - Could I ask you to pick up the documentation change for Cassandra host turn-up?

Basically, once the new host reaches the point where it's returned by the PQL query linked in the task description, it will appear in the intended network policy, but that still needs to be applied manually. Reaching out to ServiceOps to do that step is totally fine.

Dropping this to medium now that the immediate issue is remediated.

These almost always occur in batches (e.g., hardware refreshes, expansions, etc.), usually on the order of 3 to 9 hosts at a time. The new hosts go up over a period of days, or even weeks, one-by-one. That's a lot of ServiceOps pings, no?

Also, I think the proper solution is for the driver to not block startup when one or more cluster nodes are down. That's not a stop-the-world condition, so it shouldn't block liveness IMO (and in fact, it doesn't for other language drivers, nor previous (and possibly subsequent) versions of this one). If this is the case, your list of action items above should probably include this (even if as a link to another issue), and it probably (hopefully) changes your 3rd and 4th action items to stopgap status, no?

[...]
These almost always occur in batches (e.g., hardware refreshes, expansions, etc.), usually on the order of 3 to 9 hosts at a time. The new hosts go up over a period of days, or even weeks, one-by-one. That's a lot of ServiceOps pings, no?

The point here is that you need the external-services network policy changes deployed in order for the new hosts to be usable by dependent services. This is true regardless of whether the client is taught to behave properly at startup in the event of an unreachable host (see below).

ServiceOps can do this for you (largely because cluster-wide changes like this have a wider blast radius), but you're also welcome to do so yourself rather than ping us.

Also, I think the proper solution is for the driver to not block startup when one or more cluster nodes are down. That's not a stop-the-world condition, so it shouldn't block liveness IMO (and in fact, it doesn't for other language drivers, nor previous (and possibly subsequent) versions of this one). If this is the case, your list of action items above should probably include this (even if as a link to another issue), and it probably (hopefully) changes your 3rd and 4th action items to stopgap status, no?

Indeed, that's what I meant by the 2nd clause in the 4th item. However, I can see how the way I phrased it over-indexes on how the client behaves now rather than prescribing how it should behave. I'll update that now to make it clearer.

Also, to emphasize, item 3 is required regardless of client behavior (see above).

Change #1270980 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] Set initialDelaySeconds on aqs-http-gateway direct Cassandra clients

https://gerrit.wikimedia.org/r/1270980

Change #1270980 merged by jenkins-bot:

[operations/deployment-charts@master] Set initialDelaySeconds on aqs-http-gateway direct Cassandra clients

https://gerrit.wikimedia.org/r/1270980

Alright, for now we're in a holding pattern until we decide how to approach the remaining item in the task description - i.e., improving the interaction between unreachable Cassandra hosts, gocql client session initialization, and service initialization (i.e., liveness and readiness).

Change #1277172 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/deployment-charts@master] linked-artifacts: upgrade to hoarde v1.1.3

https://gerrit.wikimedia.org/r/1277172

Change #1277172 merged by jenkins-bot:

[operations/deployment-charts@master] linked-artifacts: upgrade to hoarde v1.1.3

https://gerrit.wikimedia.org/r/1277172

See gitlab.wikimedia.org/repos/sre/hoarde/-/commit/a78889e for an example of application initialization that separates liveness from readiness. A solution like this should be broadly applicable to other Cassandra-enabled Go services.
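For readers unfamiliar with the pattern, here is a minimal sketch of separating liveness from readiness, using only the Go standard library. It is not a reproduction of the hoarde commit; the `app` type, the `initialize` and `statusOf` helpers, and the `/healthz` and `/readyz` paths are all assumptions made for illustration. The liveness endpoint answers as soon as the process serves HTTP, while the readiness endpoint returns 503 until client initialization has completed.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// app tracks whether slow dependencies (for example, a Cassandra session)
// have finished initializing. All names here are hypothetical.
type app struct {
	ready atomic.Bool
}

// initialize runs the supplied connect function and marks the app ready only
// if it succeeds. A real service would likely call this from a goroutine and
// retry on failure, so that liveness is never blocked on it.
func (a *app) initialize(connect func() error) error {
	if err := connect(); err != nil {
		return err
	}
	a.ready.Store(true)
	return nil
}

// handler exposes liveness and readiness on separate (hypothetical) paths.
func (a *app) handler() http.Handler {
	mux := http.NewServeMux()
	// Liveness: the process is up and serving, regardless of dependencies.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: only succeed once initialization has completed.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if !a.ready.Load() {
			http.Error(w, "initializing", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// statusOf issues an in-memory GET against the handler and returns the status.
func statusOf(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	a := &app{}
	h := a.handler()
	fmt.Println(statusOf(h, "/healthz"), statusOf(h, "/readyz")) // 200 503

	// Stand-in for session creation succeeding.
	if err := a.initialize(func() error { return nil }); err != nil {
		panic(err)
	}
	fmt.Println(statusOf(h, "/healthz"), statusOf(h, "/readyz")) // 200 200
}
```

With this split, a slow or blocked session initialization keeps the pod out of rotation via readiness without triggering liveness restarts.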