
Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200*
Closed, Resolved · Public

Description

It looks like this could be related to some downtime. See T181538#3794219 for the originating discussion.

Event Timeline

[15:27:23] <akosiaris> I was looking at https://github.com/antirez/redis/issues/1019 but it has been "fixed" since 2.6
[15:27:25] <akosiaris> we run 2.8
[15:28:30] <akosiaris> btw that issue only seems to hit the queue redis
[15:28:33] <akosiaris> not the cache redis
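For context, Redis emits "Asynchronous AOF fsync is taking too long (disk is busy?)" when, under appendfsync everysec, the background fsync has not finished before the next AOF write, which can stall the main thread. A minimal sketch of inspecting the relevant settings with redis-py (host/port are placeholders, not the oresrdb configuration):

```python
import redis

# Placeholder connection details; the real queue instance runs on the oresrdb200* hosts.
r = redis.Redis(host='localhost', port=6379, decode_responses=True)

# AOF settings that govern the fsync behaviour behind the warning.
print(r.config_get('appendonly'))    # e.g. {'appendonly': 'yes'}
print(r.config_get('appendfsync'))   # e.g. {'appendfsync': 'everysec'}

# INFO persistence shows whether background fsync jobs are piling up.
info = r.info('persistence')
print(info.get('aof_enabled'), info.get('aof_pending_bio_fsync'))
```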

The eqiad redis queue for ORES is no longer persisting to disk as of 21:39 UTC today, in an effort to address this. Results are still inconclusive.
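Disabling persistence on a running instance can be done live with CONFIG SET, without a restart. A hedged sketch of what such a temporary change looks like (connection details are placeholders):

```python
import redis

# Placeholder connection; in production this would target the ORES queue instance.
r = redis.Redis(host='localhost', port=6379, decode_responses=True)

# Turn off AOF persistence and RDB snapshotting at runtime.
r.config_set('appendonly', 'no')
r.config_set('save', '')

# CONFIG SET only affects the running instance; a restart reverts to whatever
# the (puppet-managed) redis.conf specifies.
```

That last point is why the change later had to be made permanent in puppet.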

Halfak claimed this task.

This looks done to me. Feel free to re-open

Re-opening per the following:

After a brief discussion in #wikimedia-ai at ~01:00 UTC, we've decided to make the "persist to disk" removal for the redis ORES queue permanent.

I'll upload a puppet patch for that.

Change 394022 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Disable ORES redis persistence for queue

https://gerrit.wikimedia.org/r/394022

Change 394022 merged by Alexandros Kosiaris:
[operations/puppet@production] Disable ORES redis persistence for queue

https://gerrit.wikimedia.org/r/394022
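The actual change is in the Gerrit patch above; a simple way to verify that a deployed instance really is running without persistence (a sketch, assuming a reachable queue instance) would be:

```python
import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

info = r.info('persistence')
assert info['aof_enabled'] == 0, 'AOF is still enabled'
assert r.config_get('save')['save'] == '', 'RDB snapshotting is still configured'
print('persistence disabled')
```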

Re-resolving. This has been deployed. The deploy in codfw did not cause any pain, just a few overload errors. The deploy in eqiad wasn't so great: memory/CPU usage on the lower-powered scb boxes spiked, thankfully without the kernel OOM killer showing up. But the CPU spike probably caused issues for service-runner applications, where child processes have been seen in the past to miss their heartbeat to the master and have to be reaped and restarted.

In an effort to make things better, I've lowered the concurrency for celery on scb1001 and scb1002 per T181538#3795990. This actively lowers capacity but should alleviate the problems caused for other services.
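Lowering celery concurrency reduces the number of worker processes, and therefore peak CPU/memory, at the cost of throughput. A minimal sketch of that knob (the app name, broker URL and value are placeholders, not the ORES configuration):

```python
from celery import Celery

# Placeholder app/broker; the real ORES celery configuration is managed elsewhere.
app = Celery('ores_queue', broker='redis://oresrdb:6379/0')

# Fewer worker processes -> less CPU/memory pressure on the scb hosts,
# but also less scoring capacity. (Older Celery versions spell this CELERYD_CONCURRENCY.)
app.conf.worker_concurrency = 4
```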