Incident: 2022-09-08 codfw appservers degradation
Open, Needs Triage, Public

Assigned To

None

Authored By

	Clement_Goubert
	Sep 8 2022, 4:16 PM

Related Objects

Status	Assigned	Task
Open	None	T317340 Incident: 2022-09-08 codfw appservers degradation
Resolved	Clement_Goubert	T317402 Page on etcdmirror critical status
Open	None	T317403 Add etcdmirror status check to scap
Open	None	T317405 Add failure rate triggered rollback to scap
Open	None	T317535 Add etcdmirror connection retry on etcd-tls-proxy unavailability
Open	None	T317537 Update Etcd/Main cluster#Replication documentation with safe restart conditions and information

Event Timeline

Clement_Goubert created this task. Sep 8 2022, 4:16 PM

Restricted Application added a subscriber: Aklapper. Sep 8 2022, 4:16 PM

https://wikitech.wikimedia.org/wiki/Incidents/2022-09-08_codfw_api-https_api_appserver_appserver_parsoid_degradation

+Wikimedia-Incident

Clement_Goubert renamed this task from "Incident: 2022-09-08 codfw api-https api appserver appserver parsoid degradation" to "Incident: 2022-09-08 codfw appservers degradation". Sep 9 2022, 2:45 PM

RhinosF1 subscribed. Sep 9 2022, 2:57 PM

Clement_Goubert closed subtask T317402: Page on etcdmirror critical status as Resolved. Sep 13 2022, 9:02 AM

jijiki moved this task from Incoming 🐫 to 🙈🙉🙊Backlog on the serviceops board. Sep 28 2022, 2:17 PM

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident. Oct 24 2022, 5:18 PM

@Clement_Goubert I've added more detail to the incident doc at https://wikitech.wikimedia.org/wiki/Incidents/2022-09-08_codfw_appservers_degradation. In particular, I'm interested in understanding Scap's handling and what we believe the reasons for the depool and the degraded latencies were.

I went through Scap messages in Logstash to find when it started and how it failed.

What I don't yet know is:

Why did Scap's depool attempts fail? It seems like Etcd wasn't down but degraded (based on the alert I saw about reduced availability, and given that MW didn't go down despite having a hard dependency on Etcd data every 10-60s).
Why did PyBal depool appservers? I don't see a single php-fpm error about EtcdConfig.php being unable to fetch data from etcd. It seems like Scap went ahead with the restart without a depool, or perhaps didn't even restart, but either way the servers would be healthy. I'm probably missing something and asking the wrong question, correct me :)
What do we believe was the source of the appserver latency increase? Was it basically the queries to etcd from MW every 10s taking longer to respond, and thus latency increasing through that? It surprises me that php-apcu cache misses would be common enough for p75 latency to be affected. (Again, I'm probably asking the wrong question.)

Krinkle added a project: Performance-Team (Radar). Oct 24 2022, 6:16 PM

In T317340#8339447, @Krinkle wrote:

Why did Scap's depool attempts fail? It seems like Etcd wasn't down but degraded (based on the alert I saw about reduced availability, and given that MW didn't go down despite having a hard dependency on Etcd data every 10-60s).

If I remember correctly, they failed because of the discrepancy between the server's status in the canonical etcd cluster (queried by Scap) and its status in the etcd server that PyBal targets. That desync was caused by the etcdmirror crash. (@Joe can probably correct me there, as well as on the following questions.)

Why did PyBal depool appservers? I don't see a single php-fpm error about EtcdConfig.php being unable to fetch data from etcd. It seems like Scap went ahead with the restart without a depool, or perhaps didn't even restart, but either way the servers would be healthy. I'm probably missing something and asking the wrong question, correct me :)

The appservers' status in the "canonical" etcd cluster was depooled. When etcdmirror was restarted, this status was synchronized to the etcd server that PyBal targets, and PyBal then started depooling servers.

What do we believe was the source of the appserver latency increase? Was it basically the queries to etcd from MW every 10s taking longer to respond, and thus latency increasing through that? It surprises me that php-apcu cache misses would be common enough for p75 latency to be affected. (Again, I'm probably asking the wrong question.)

I think it was mostly caused by the increased load on the remaining appservers.
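
To make the desync described above easier to follow, here is a toy model of it. This is only a sketch under stated assumptions, not etcdmirror's, Scap's, or PyBal's actual code: `Store`, `Mirror`, `write`, `simulate`, and the host name `appserver-1` are all hypothetical, and the real replication works on etcd events rather than an in-memory list. It only illustrates the sequence: writes land in the canonical cluster, a crashed mirror leaves the replica stale, and a restart replays the backlog in one go.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Toy key-value store standing in for one etcd cluster (hypothetical)."""
    data: dict = field(default_factory=dict)


@dataclass
class Mirror:
    """Toy one-way replicator: replays the source's write log into a replica."""
    source_log: list
    replica: Store
    running: bool = True
    applied: int = 0  # index of the next log entry to replay

    def sync(self):
        # A crashed mirror replays nothing, so the replica goes stale.
        if not self.running:
            return
        for key, value in self.source_log[self.applied:]:
            self.replica.data[key] = value
        self.applied = len(self.source_log)


def write(primary, log, key, value):
    """Write to the canonical store and record the change for replication."""
    primary.data[key] = value
    log.append((key, value))


def simulate():
    primary, replica, log = Store(), Store(), []
    mirror = Mirror(log, replica)

    # Normal operation: both sides agree that the server is pooled.
    write(primary, log, "appserver-1", "pooled")
    mirror.sync()

    # The mirror crashes; the deployment tool depools in the canonical store.
    mirror.running = False
    write(primary, log, "appserver-1", "depooled")
    mirror.sync()
    during = (primary.data["appserver-1"], replica.data["appserver-1"])

    # The mirror restarts and replays the backlog; the replica's reader
    # (the load balancer in this story) now sees the depool all at once.
    mirror.running = True
    mirror.sync()
    after = (primary.data["appserver-1"], replica.data["appserver-1"])
    return during, after


if __name__ == "__main__":
    print(simulate())
```

In this model, `simulate()` returns `("depooled", "pooled")` for the crash window and `("depooled", "depooled")` after the restart, which mirrors the sequence in the answers above: the deployment tool and the load balancer disagree while the mirror is down, and the accumulated depools reach the load balancer together once it comes back.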

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board. Nov 4 2022, 12:45 AM

Krinkle awarded a token.

TheresNoTime removed a subscriber: RhinosF1. Dec 15 2022, 11:35 PM

Krinkle unsubscribed. Feb 8 2023, 11:20 PM

Is there anything specific about this task that is actionable?

Krinkle removed a project: Performance-Team (Radar). Aug 6 2023, 10:35 PM
