It looks like this could be related to some downtime. See T181538#3794219 for the originating discussion.
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Disable ORES redis persistence for queue | operations/puppet | production | +1 -0
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | None | | T181538 ORES overload incident, 2017-11-28
Resolved | | Halfak | T181563 Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200*
Event Timeline
[15:27:23] <akosiaris> I was looking at https://github.com/antirez/redis/issues/1019 but it has been "fixed" since 2.6
[15:27:25] <akosiaris> we run 2.8
[15:28:30] <akosiaris> btw that issue only seems to hit the queue redis
[15:28:33] <akosiaris> not the cache redis
The eqiad redis queue for ORES is no longer persisting to disk, since 21:39 UTC today, in an effort to address this. Results are still inconclusive.
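(The task does not record exactly how persistence was turned off on the live instance; below is a minimal sketch, assuming a runtime CONFIG SET via redis-py against the queue instance. The host and port are illustrative placeholders, not taken from this task.)

```python
import redis

# Illustrative connection details; the real oresrdb host/port are not given here.
queue = redis.Redis(host="oresrdb-queue.example", port=6379)

# Inspect the current persistence settings on the queue instance.
print(queue.config_get("appendonly"))  # e.g. {'appendonly': 'yes'}
print(queue.config_get("save"))        # RDB snapshot schedule, if any

# Disable AOF persistence (the mechanism behind the "Asynchronous AOF fsync is
# taking too long" warnings) and clear RDB snapshots, so nothing hits the disk.
queue.config_set("appendonly", "no")
queue.config_set("save", "")
```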
Re-opening per the following:
After a brief discussion in #wikimedia-ai at ~01:00 UTC, we've decided to make the removal of "persist to disk" for the ORES redis queue permanent.
I'll upload a puppet patch for that.
Change 394022 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Disable ORES redis persistence for queue
Change 394022 merged by Alexandros Kosiaris:
[operations/puppet@production] Disable ORES redis persistence for queue
Mentioned in SAL (#wikimedia-operations) [2017-11-29T10:15:22Z] <akosiaris> disable puppet on oresrdb* for merging https://gerrit.wikimedia.org/r/#/c/394022/. T181563
Re-resolving. This has been deployed. The deploy in codfw did not cause any pain, just a few overload errors. The deploy in eqiad wasn't so great: memory/CPU usage on the lower-powered scb boxes spiked, thankfully without the kernel OOM killer showing up. But the CPU spike probably caused issues for service-runner applications, whose child processes have been seen in the past to miss their heartbeat to the master and then have to be reaped and restarted.
In an effort to make things better I've lowered the concurrency for celery on scb1001 and scb1002 per T181538#3795990. This actively lowers capacity, but should alleviate the problems caused for other services.
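(For context, lowering celery concurrency means fewer worker processes per host, trading scoring throughput for less CPU/memory pressure on the scb boxes. A hedged sketch of what such a setting looks like in a celery app config follows; the app name, broker URL, and value are illustrative, not the actual ORES configuration.)

```python
from celery import Celery

# Hypothetical app wiring; ORES configures celery through its own config files.
app = Celery("ores_queue", broker="redis://oresrdb-queue.example:6379/0")

# Fewer worker processes per host: this lowers capacity but reduces the
# CPU/memory spikes that were starving co-hosted services on the scb boxes.
app.conf.worker_concurrency = 8  # illustrative value, lowered from a higher default
```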