Earlier today, one editor-analytics pod was found crash looping due to (initial) liveness check timeouts, with logs reporting gocql client timeouts when communicating with aqs1012, e.g.:
2026/04/13 16:39:35 gocql: unable to dial control conn 10.64.32.145:9042: dial tcp 10.64.32.145:9042: i/o timeout
Initially, I suspected this was the scenario we've seen previously in T366851 (gocql startup times have increased between v1.2.0 and v1.6.0), where, for certain versions of the gocql client, even one unresponsive host incurs a timeout at startup that is greater than the total liveness probe budget: check period (10s) × attempts (3), i.e. 30s.
It turned out that it was indeed that, but not in the way I initially thought: the i/o timeout errors for aqs1012 were a red herring. The actual problem was timeouts connecting to the newly added aqs1023 and aqs1024 hosts upon discovery, due to a pending external-services network policy change (T423168#11816149).
Action items:
- Apply pending external-services network policy changes (risk mitigated)
- Remove aqs1012 from the initial host list https://gerrit.wikimedia.org/r/1270496 (resolve confusing timeout)
- Update the Cassandra host turnup procedure to include an external-services update step, i.e., after the new hosts become visible to this puppet query, but before clients will discover them [0]. Note that this is needed regardless of whether the client is configured to behave sensibly at startup in the presence of unreachable hosts.
- Configure gocql clients to not block startup in the presence of unreachable hosts. See the discussion in T366851 (e.g., one possible option is to shorten the timeout when attempting to connect to discovered peers). Alternatively, or together with that, decouple client init from service liveness, and instead only reflect init completion in readiness.
- (defense in depth) Consider setting initialDelaySeconds on all liveness checks once again. Done in https://gerrit.wikimedia.org/r/1270980.
[0] Deploying this is identical to the procedure in https://wikitech.wikimedia.org/wiki/Kubernetes/Add_a_new_service#Deploy_changes_to_helmfile.d/admin_ng. You should also be able to add --selector name=external-services to the helmfile commands to narrow the scope just to the external-services policies (if there are other pending diffs you do not recognize).
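As an illustration of the "decouple client init from service liveness" action item, the following is a minimal Go sketch using only the standard library. The endpoint paths (/healthz, /readyz), the initDone flag, and connectDatastore (a stand-in for the blocking gocql session creation) are all hypothetical names, not taken from editor-analytics. The demo drives the handlers through httptest so that it terminates; a real service would pass the mux to http.ListenAndServe instead.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// initDone flips to true once the datastore client has finished initializing.
var initDone atomic.Bool

// newMux wires up separate liveness and readiness endpoints.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	// Liveness: the process is up and serving HTTP. Deliberately independent
	// of datastore client initialization.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: report 503 until client init has completed.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if !initDone.Load() {
			http.Error(w, "initializing", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// connectDatastore is a hypothetical placeholder for slow, blocking client
// setup (e.g. creating a Cassandra session while some hosts are unreachable).
func connectDatastore() error {
	time.Sleep(50 * time.Millisecond)
	return nil
}

// probe sends an in-memory GET to the mux and returns the HTTP status code.
func probe(mux *http.ServeMux, path string) int {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newMux()
	fmt.Println("before init:", probe(mux, "/healthz"), probe(mux, "/readyz"))

	// Initialize in the background so the HTTP handlers stay responsive.
	done := make(chan struct{})
	go func() {
		defer close(done)
		if err := connectDatastore(); err != nil {
			fmt.Println("datastore init failed:", err)
			return
		}
		initDone.Store(true)
	}()

	<-done
	fmt.Println("after init:", probe(mux, "/healthz"), probe(mux, "/readyz"))
}
```

With this shape, a slow or stalled init keeps the pod out of rotation through readiness rather than causing it to be restarted through liveness. Whether a failed init should eventually fail liveness as well is a design choice this sketch leaves open.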