Investigate intermittent uptime health check failures
Closed, Resolved · Public · BUG REPORT

Assigned To

Authored By

	Tarrow
	Jun 25 2024, 8:18 AM

Description

Event Timeline

Tarrow created this task. · Jun 25 2024, 8:18 AM

Restricted Application added a subscriber: Aklapper. · Jun 25 2024, 8:18 AM

Tarrow claimed this task. · Jun 25 2024, 8:18 AM

CPU usage by production MW pods:

I'm also suspicious of the number of errors about Redis storage failing because it is too slow.

It's a very uneducated guess, but a cheap thing to try might be to increase the replicaCount of these web pods. I put up https://github.com/wmde/wbaas-deploy/pull/1659

Anton.Kokh edited projects, added Wikibase Cloud (Kanban Board Q3 2024); removed Wikibase Cloud (Kanban Board Q2 2024). · Jul 1 2024, 7:31 AM

Anton.Kokh moved this task from To do to Doing on the Wikibase Cloud (Kanban Board Q3 2024) board. · Jul 1 2024, 7:31 AM

This has improved but is perhaps not fully resolved. There have still been multiple restarts in the last 2 days.

Addshore subscribed. · Jul 12 2024, 2:06 PM

Hard to make any guesses without seeing more logs.
What's the actual content of some of the Redis logs? Which keys are being accessed, and which are slow?
What do the CPU and memory usage of Redis look like? And the logs and uptime of Redis itself?

Tarrow moved this task from Doing to In Review on the Wikibase Cloud (Kanban Board Q3 2024) board. · Jul 31 2024, 8:01 AM

Deniz_WMDE moved this task from In Review to Waiting for Deploy to Staging on the Wikibase Cloud (Kanban Board Q3 2024) board. · Aug 1 2024, 8:03 AM

Tarrow moved this task from Waiting for Deploy to Staging to Waiting for Deploy to Production on the Wikibase Cloud (Kanban Board Q3 2024) board. · Aug 8 2024, 7:59 AM

Tarrow moved this task from Waiting for Deploy to Production to Done on the Wikibase Cloud (Kanban Board Q3 2024) board. · Aug 15 2024, 7:57 AM

Tarrow closed this task as Resolved. · Sep 19 2024, 1:19 PM

	F55891478: image.png
	Jun 26 2024, 9:41 AM

	F55891400: image.png
	Jun 26 2024, 9:12 AM
