
Production warning: [RedisBagOStuff] Rejected set() for X due to snapshot lag (late regeneration).
Open, Needs Triage, Public

Description

We're seeing warnings from rejected cache sets. These have been happening continuously since we got some usage on the platform.

This link shows errors on the mediawiki-api pod for the last 30 days: https://cloudlogging.app.goo.gl/seCth1MjwtBJRnmE7

In particular, there are big clusters of these warnings around the 24th of May, but they were still being seen up until yesterday (7th June, ~17:00), when we deployed an increase of resources for the secondary SQL pod.

[warning] [RedisBagOStuff] Rejected set() for global:NameTableSqlStore:wbt_type:mwdb_wbstack_X-mwt_X_ due to snapshot lag (late regeneration).

After this deployment the pod restarted and the warnings seem to have stopped for now, so we probably need to monitor this page for any new warnings.

Event Timeline

toan updated the task description.

So sadly this hasn't fully gone away, although the number of warnings has drastically reduced and they seem to happen at times of heavy load.

Since the database got more memory and is no longer falling over, we've seen these happen less frequently, and they are no longer constantly spammed in the logs.

Last 30 days

https://cloudlogging.app.goo.gl/yAC8aR7BuHPZzdAb8

There are a few on the morning of the 13th at 9 am that don't make much sense to me, and then a bigger cluster that seems to have occurred around the same time the migration of batch C was started.

The redis-replica restarted many times (https://cloudlogging.app.goo.gl/w7UsAxeADx68iKjp8) when the migration happened, and it seems to have been OOMKilled by k8s at least once.

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Mon, 13 Jun 2022 17:35:30 +0200
  Finished:     Mon, 13 Jun 2022 17:35:47 +0200
Ready:          True
Restart Count:  23
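
For reference, exit code 137 fits the OOMKilled reason: by the usual shell convention an exit code above 128 means the process was terminated by signal number (code - 128), and 137 - 128 = 9, which is SIGKILL. A minimal sketch of that arithmetic (the helper name is made up for illustration, and it assumes a POSIX system):

```python
import signal


def signal_from_exit_code(exit_code: int):
    """Return the name of the terminating signal, or None for a normal exit.

    Assumes the common convention that exit codes above 128 encode
    128 + signal number. Hypothetical helper, POSIX only.
    """
    if exit_code > 128:
        return signal.Signals(exit_code - 128).name
    return None


print(signal_from_exit_code(137))
```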

image.png (274×1 px, 51 KB)

This also put an increased load on the secondary-sql, which led to more than 35 seconds of replication lag.

image.png (224×918 px, 75 KB)

So, another update. In the last 7 days this occurred 24 times in total, and it has not happened since June 24th.

We are seeing nowhere near the number of errors that were initially reported.

I propose we sit on this a bit longer until it happens again, and then maybe try correlating it in the same way as https://phabricator.wikimedia.org/T310597#8023571, or just close it if it doesn't happen.
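
If it does come back, a rough way to see which days to correlate against is to count the warnings per day from exported log lines. This is only a sketch: it assumes each exported line starts with an ISO 8601 timestamp (YYYY-MM-DD...), which may not match the real export format, and the function name and sample lines are made up for illustration.

```python
from collections import Counter

# Substrings taken from the warning text quoted in this task.
REJECTED_MARKER = "Rejected set() for"
REASON_MARKER = "snapshot lag"


def count_rejections_per_day(lines):
    """Count rejected-set warnings per calendar day.

    Assumes every line begins with an ISO 8601 timestamp, so the first
    10 characters are the date. Hypothetical helper, not an existing tool.
    """
    counts = Counter()
    for line in lines:
        if REJECTED_MARKER in line and REASON_MARKER in line:
            counts[line[:10]] += 1
    return dict(sorted(counts.items()))


# Made-up sample lines in the assumed format.
sample = [
    "2022-06-13T09:01:12Z [warning] [RedisBagOStuff] Rejected set() for X due to snapshot lag (late regeneration).",
    "2022-06-13T17:35:40Z [warning] [RedisBagOStuff] Rejected set() for X due to snapshot lag (late regeneration).",
    "2022-06-13T17:36:02Z [info] unrelated message",
    "2022-06-24T11:20:05Z [warning] [RedisBagOStuff] Rejected set() for X due to snapshot lag (late regeneration).",
]
print(count_rejections_per_day(sample))
```

Days with a spike can then be lined up against deployments, migrations, or replication lag graphs.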

We don't know the solution, but we suspect that at least sometimes we see these errors because generating the IPSet for blocking people takes longer than the 7 seconds allowed when adding something to the WANCache (see: mediawiki/dist/includes/libs/objectcache/wancache/WANObjectCache.php:831). We should therefore revisit this ticket after solving T313215 to see if it still occurs with some regularity.
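
To illustrate the suspected mechanism, here is a simplified model of a cache that refuses to store a value when regenerating it took longer than the allowed window. This is not the actual WANObjectCache logic, only a sketch of the idea; the 7 second limit is the figure mentioned above, and all names here are hypothetical.

```python
import time

# Limit mentioned in this task; treated as a plain constant for the sketch.
MAX_SNAPSHOT_LAG_SECONDS = 7.0


def is_fresh_enough(snapshot_time, now, max_lag=MAX_SNAPSHOT_LAG_SECONDS):
    """True if the data snapshot is recent enough for the value to be cached."""
    return (now - snapshot_time) <= max_lag


def regenerate_and_set(cache, key, regenerate, clock=time.monotonic):
    """Regenerate a value and store it only if regeneration was quick enough.

    Returns True if the value was stored, False if the set was rejected.
    Simplified model: the real code considers more than elapsed time.
    """
    snapshot_time = clock()  # moment the source data was read
    value = regenerate()     # e.g. a slow IPSet build
    if not is_fresh_enough(snapshot_time, clock()):
        return False         # rejected: value may already be stale
    cache[key] = value
    return True
```

In this model a regeneration that takes, say, 9 seconds is always rejected, so the value is never cached and every request repeats the slow work, which would match warnings clustering under heavy load.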