Status | Assigned | Task
---|---|---
Open | None | T317340 Incident: 2022-09-08 codfw appservers degradation
Resolved | Clement_Goubert | T317402 Page on etcdmirror critical status
Open | None | T317403 Add etcdmirror status check to scap
Open | None | T317405 Add failure rate triggered rollback to scap
Open | None | T317535 Add etcdmirror connection retry on etcd-tls-proxy unavailability
Open | None | T317537 Update Etcd/Main cluster#Replication documentation with safe restart conditions and information
Event Timeline
@Clement_Goubert I've added more detail to the incident doc at https://wikitech.wikimedia.org/wiki/Incidents/2022-09-08_codfw_appservers_degradation. In particular, I'm interested in understanding Scap's handling and what we believe the reasons for the depool and the degraded latencies were.
I went through Scap messages in Logstash to find when it started and how it failed.
What I don't yet know is:
- Why did Scap's depool attempts fail? It seems like Etcd wasn't down but degraded (based on the alert I saw about reduced availability, and given that MW didn't go down despite having a hard dependency on Etcd data every 10-60s).
- Why did PyBal depool appservers? I don't see a single php-fpm error about EtcdConfig.php being unable to fetch data from etcd. It seems like Scap went ahead with the restart without depooling, or perhaps didn't even restart, but either way the servers would be healthy. I'm probably missing something and asking the wrong question, correct me :)
- What do we believe was the source of the appserver latency increase? Was it basically the queries to etcd from MW every 10s taking longer to respond, and thus latency increasing through that? It surprises me that the php-apcu cache misses would be common enough for p75 latency to be affected. (Again, I'm probably asking the wrong question.)
If I remember correctly, Scap's depool attempts failed because of the discrepancy between the servers' status in the canonical etcd cluster (queried by Scap) and their status in the etcd server that PyBal targets. That desync was caused by the etcdmirror crash. (@Joe can probably correct me there, as well as on the following questions.)
- Why did PyBal depool appservers? I don't see a single php-fpm error about EtcdConfig.php being unable to fetch data from etcd. It seems like Scap went ahead with the restart without depooling, or perhaps didn't even restart, but either way the servers would be healthy. I'm probably missing something and asking the wrong question, correct me :)
The appservers' status in the "canonical" etcd cluster was depooled. When etcdmirror was restarted, this status was synchronized to the etcd server that PyBal targets, and PyBal then started depooling servers.
- What do we believe was the source of the appserver latency increase? Was it basically the queries to etcd from MW every 10s taking longer to respond, and thus latency increasing through that? It surprises me that the php-apcu cache misses would be common enough for p75 latency to be affected. (Again, I'm probably asking the wrong question.)
I think it was mostly caused by the increased load on the remaining appservers.
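To make the sequence in these answers easier to follow, here is a toy model of it in Python. It is only a sketch of the reasoning, not how etcd, etcdmirror, Scap, or PyBal are implemented. Every name in it (`ToyMirror`, `simulate`, the `app-*` hosts) is hypothetical, and the "stores" are plain dictionaries. It assumes a deploy tool writes pool state to a canonical store, a load balancer reads only a replica, and a mirror process copies writes from one to the other.

```python
class ToyMirror:
    """Hypothetical stand-in for a mirror process: replays a write log
    from a canonical store into a replica store."""

    def __init__(self, canonical_log, replica):
        self.canonical_log = canonical_log  # ordered list of (key, value) writes
        self.replica = replica              # dict read by the load balancer
        self.position = 0                   # index of the next write to replay
        self.running = True

    def sync(self):
        """Replay pending writes; return how many were applied."""
        if not self.running:
            return 0  # a crashed mirror replays nothing
        pending = self.canonical_log[self.position:]
        for key, value in pending:
            self.replica[key] = value
        self.position = len(self.canonical_log)
        return len(pending)


def simulate():
    # Hypothetical host names, for illustration only.
    servers = ["app-a", "app-b", "app-c", "app-d"]
    canonical = {s: "pooled" for s in servers}
    replica = dict(canonical)
    log = []
    mirror = ToyMirror(log, replica)

    def write(server, state):
        # Writes go to the canonical store and are logged for the mirror.
        canonical[server] = state
        log.append((server, state))

    # 1. The mirror crashes; canonical and replica can now drift apart.
    mirror.running = False

    # 2. The deploy tool depools two servers in the canonical store.
    write("app-a", "depooled")
    write("app-b", "depooled")
    mirror.sync()  # no-op while the mirror is down
    pooled_during_crash = sorted(s for s in servers if replica[s] == "pooled")

    # 3. The mirror restarts and replays the backlog in one go.
    mirror.running = True
    replayed = mirror.sync()
    pooled_after_restart = sorted(s for s in servers if replica[s] == "pooled")

    # 4. Same traffic over fewer servers means more load per server.
    load_factor = len(servers) / len(pooled_after_restart)
    return pooled_during_crash, replayed, pooled_after_restart, load_factor


if __name__ == "__main__":
    print(simulate())
```

In this model the replica still lists all four servers as pooled while the mirror is down, so the depool requests have no visible effect on the load balancer's side. Once the mirror restarts, both depools land at once and the two remaining servers each take twice their previous share of traffic.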