
Investigate RDB snapshot issue on ORES
Closed, Resolved · Public

Description

I got this error for every request this morning.

MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

It looks like the issue happens when we're saving to the cache. I checked the RDB files on ores-redis-01 and found that the cache RDB file was exactly 1000MB, even though the config sets maxmemory to 3GB. I restarted Redis and the service recovered. The 1000MB RDB file grew to 1001MB within a minute, and then the same error started happening again.

So I ran config set stop-writes-on-bgsave-error no in redis-cli against the 6380 instance. That seems to have recovered ORES, but it means we're probably not persisting to disk anymore.
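
For reference, the sequence was roughly the following. The RDB path is an assumption, only port 6380 is confirmed above, and the last command is a workaround rather than a fix, since it only re-enables writes while the background saves keep failing:

ls -lh /var/lib/redis/                                         # inspect RDB file sizes (path assumed)
redis-cli -p 6380 config get maxmemory                         # confirm the configured cap
redis-cli -p 6380 config set stop-writes-on-bgsave-error no    # allow writes despite failed background saves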

Event Timeline

Halfak assigned this task to yuvipanda.
Halfak raised the priority of this task from to Unbreak Now!.
Halfak updated the task description.
Halfak moved this task to Parked on the Machine-Learning-Team (Active Tasks) board.
Halfak subscribed.

Just checked the logs. Here's what I'm seeing:

[11924] 30 Dec 18:33:15.078 * 1 changes in 900 seconds. Saving...
[11924] 30 Dec 18:33:15.078 # Can't save in background: fork: Cannot allocate memory
[11924] 30 Dec 18:33:21.098 * 1 changes in 900 seconds. Saving...
[11924] 30 Dec 18:33:21.098 # Can't save in background: fork: Cannot allocate memory
[11924] 30 Dec 18:33:27.026 * 1 changes in 900 seconds. Saving...
[11924] 30 Dec 18:33:27.026 # Can't save in background: fork: Cannot allocate memory
[11924] 30 Dec 18:33:33.055 * 1 changes in 900 seconds. Saving...
[11924] 30 Dec 18:33:33.055 # Can't save in background: fork: Cannot allocate memory
[11924] 30 Dec 18:33:39.087 * 1 changes in 900 seconds. Saving...
[11924] 30 Dec 18:33:39.088 # Can't save in background: fork: Cannot allocate memory

So it looks like it might be a memory issue: the background save forks the Redis process, and that fork can't allocate memory. Maybe we should cut the maxmemory for the cache server.
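
If we do lower maxmemory, a sketch of the change (the 2GB value is purely illustrative, and the same value would also need to land in the Redis config managed in ops/puppet):

redis-cli -p 6380 config set maxmemory 2gb    # runtime change only; value illustrative
redis-cli -p 6380 config get maxmemory        # verify it took effect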

@yuvipanda, where are we on this? Didn't you get another changeset merged that resolved the problem? I seem to remember doing some manual restarts of the uwsgi processes so that we could restart Redis.

Yup, I enabled memory overcommit for all redises and it's all good. Was good we caught it here, would've hit other redises later...
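
For the record, memory overcommit here refers to the kernel setting Redis itself warns about at startup when it may be unable to fork for background saves. A minimal sketch of the direct sysctl approach (the sysctl.d filename is hypothetical; in practice this would go through ops/puppet rather than be set by hand):

sudo sysctl -w vm.overcommit_memory=1                                    # apply immediately
echo 'vm.overcommit_memory = 1' | sudo tee /etc/sysctl.d/60-redis.conf   # persist across reboots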

@yuvipanda, great! Thanks. Please help us manage our progress report and move tasks to the "Done" column before resolving. I'll move this one.

Halfak closed this task as Resolved.
Halfak set Security to None.
Halfak moved this task from Backlog to Completed on the Machine-Learning-Team (Active Tasks) board.

What about https://phabricator.wikimedia.org/T122666? That one is still open, but the ticket here is closed.

They are the same ticket, no?

Sorry, wrong paste. I meant to ask about https://gerrit.wikimedia.org/r/#/c/261642/, since it links to this ticket and is still open. I found it when looking in Gerrit for open changes to the ops/puppet repo.

Looks like that can be abandoned. I'll do that now.