It looks like this could be related to some downtime. See T181538#3794219 for the originating discussion.
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
Disable ORES redis persistence for queue | operations/puppet | production | +1 -0
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | None | | T181538 ORES overload incident, 2017-11-28
Resolved | | Halfak | T181563 Investigate "Asynchronous AOF fsync is taking too long" on oresrdb200*
Event Timeline
[15:27:23] <akosiaris> I was looking at https://github.com/antirez/redis/issues/1019 but it has been "fixed" since 2.6
[15:27:25] <akosiaris> we run 2.8
[15:28:30] <akosiaris> btw that issue only seems to hit the queue redis
[15:28:33] <akosiaris> not the cache redis
The eqiad redis queue for ORES is no longer persisting to disk, since 21:39 UTC today, in an effort to address this. Results are still inconclusive.
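(The task does not record exactly how persistence was turned off on the live instance; below is a minimal sketch, assuming a runtime CONFIG SET via redis-py against the queue instance. The host and port are illustrative placeholders, not taken from this task.)

```python
import redis

# Illustrative connection details; the real oresrdb host/port are not given here.
queue = redis.Redis(host="oresrdb-queue.example", port=6379)

# Inspect the current persistence settings on the queue instance.
print(queue.config_get("appendonly"))  # e.g. {'appendonly': 'yes'}
print(queue.config_get("save"))        # RDB snapshot schedule, if any

# Disable AOF persistence (the mechanism behind the "Asynchronous AOF fsync is
# taking too long" warnings) and clear RDB snapshots, so nothing hits the disk.
queue.config_set("appendonly", "no")
queue.config_set("save", "")
```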
Re-opening per the following:
After a brief discussion in #wikimedia-ai at ~01:00 UTC, we've decided to make the removal of "persist to disk" for the ORES redis queue permanent.
I'll upload a puppet patch for that.
Change 394022 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] Disable ORES redis persistence for queue
Change 394022 merged by Alexandros Kosiaris:
[operations/puppet@production] Disable ORES redis persistence for queue
Mentioned in SAL (#wikimedia-operations) [2017-11-29T10:15:22Z] <akosiaris> disable puppet on oresrdb* for merging https://gerrit.wikimedia.org/r/#/c/394022/. T181563
Re-resolving. This has been deployed. The deploy in codfw did not cause any pain, just a few overload errors. The deploy in eqiad wasn't so great: memory/CPU usage on the lower-powered scb boxes spiked, thankfully without the kernel OOM killer showing up. But the CPU spike probably caused issues for service-runner applications, whose child processes have been seen in the past to miss their heartbeat to the master and then have to be reaped and restarted.
In an effort to make things better I've lowered the concurrency for celery on scb1001 and scb1002 per T181538#3795990. This actively lowers capacity, but should alleviate the problems caused for other services.
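(For context, lowering celery concurrency means fewer worker processes per host, trading scoring throughput for less CPU/memory pressure on the scb boxes. A hedged sketch of what such a setting looks like in a celery app config follows; the app name, broker URL, and value are illustrative, not the actual ORES configuration.)

```python
from celery import Celery

# Hypothetical app wiring; ORES configures celery through its own config files.
app = Celery("ores_queue", broker="redis://oresrdb-queue.example:6379/0")

# Fewer worker processes per host: this lowers capacity but reduces the
# CPU/memory spikes that were starving co-hosted services on the scb boxes.
app.conf.worker_concurrency = 8  # illustrative value, lowered from a higher default
```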