|Status|Assignee|Task|
|---|---|---|
|Resolved|None|T181538 ORES overload incident, 2017-11-28|
|Resolved|Halfak|T181563 Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200*|
[15:27:23] <akosiaris> I was looking at https://github.com/antirez/redis/issues/1019 but it has been "fixed" since 2.6
[15:27:25] <akosiaris> we run 2.8
[15:28:30] <akosiaris> btw that issue only seems to hit the queue redis
[15:28:33] <akosiaris> not the cache redis
Re-opening per the following:
After a brief discussion in #wikimedia-ai at ~01:00 UTC, we've decided to make the "persist to disk" removal for the redis ORES queue permanent.
I'll upload a puppet patch for that.
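For context, a minimal sketch of what removing persistence on the queue instance could look like at the redis.conf level (the exact directives touched, and whether RDB snapshots are dropped along with AOF, are assumptions; the real change would live in the puppet module that templates this file):

```
# Sketch only -- assumed settings, not the actual puppet patch.
appendonly no    # disable AOF, avoiding the "fsync is taking too long" stalls
save ""          # drop RDB snapshots too, if persistence is removed entirely
```

The queue holds only transient celery tasks, so losing it on a restart costs at most the in-flight scoring work, not the score cache.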
Re-resolving. This has been deployed. The deploy in codfw did not cause any pain, just a few overload errors. The deploy in eqiad wasn't so great: memory/CPU usage on the lower-powered scb boxes spiked, thankfully without the kernel OOM killer showing up. But the CPU spike probably caused issues for service-runner applications, whose child processes have in the past been seen to miss their heartbeat to the master and have to be reaped and restarted.
In an effort to make things better, I've lowered the concurrency for celery on scb1001 and scb1002 per T181538#3795990. This actively lowers capacity but should alleviate the problems caused for other services.
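For illustration, lowering celery's worker concurrency is a one-line settings change. `CELERYD_CONCURRENCY` is the real celery 3.x-era setting name, but the number below is an assumption, not the value actually deployed:

```python
# Celery 3.x-era config; the value is illustrative, not what was deployed on scb100x.
CELERYD_CONCURRENCY = 40  # fewer worker processes -> lower peak CPU/memory, lower scoring capacity
```

Fewer workers means fewer concurrent scoring requests, which is exactly the capacity trade-off described above.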