Production warning: [RedisBagOStuff] Rejected set() for X due to snapshot lag (late regeneration).
Open, Needs TriagePublic

Assigned To

None

Authored By

	• toan
	Jun 8 2022, 9:11 AM

Description

We're seeing warnings from rejected cache sets that have been happening continuously since we got some usage on the platform.

This link shows errors on the mediawiki-api pod for the last 30 days https://cloudlogging.app.goo.gl/seCth1MjwtBJRnmE7

In particular, there are big clusters of these warnings around the 24th of May, but they were seen up until yesterday (7th June, ~17:00), when we deployed an increase in resources for the secondary SQL pod.

[warning] [RedisBagOStuff] Rejected set() for global:NameTableSqlStore:wbt_type:mwdb_wbstack_X-mwt_X_ due to snapshot lag (late regeneration).

After this deployment the pod restarted and the warnings seem to have stopped for now, so we probably need to monitor this page for any new warnings.
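To monitor which cache keys are affected, the warning line quoted above can be parsed and counted per key. The following is a minimal sketch, assuming log lines follow exactly the format shown in this description; the function names `parse_rejected_set` and `count_by_key` are hypothetical helpers, not part of any existing tooling.

```python
import re
from collections import Counter

# Assumes the message format quoted in this task; other formats are ignored.
WARNING_RE = re.compile(
    r"\[RedisBagOStuff\] Rejected set\(\) for (?P<key>\S+) "
    r"due to (?P<reason>.+?)\.?$"
)


def parse_rejected_set(line):
    """Return (cache_key, reason) for a rejected-set warning, else None."""
    match = WARNING_RE.search(line.strip())
    if match is None:
        return None
    return match.group("key"), match.group("reason")


def count_by_key(lines):
    """Count rejected-set warnings per cache key."""
    counts = Counter()
    for line in lines:
        parsed = parse_rejected_set(line)
        if parsed is not None:
            counts[parsed[0]] += 1
    return counts
```

Feeding an exported log file through `count_by_key` would show whether the warnings are concentrated on a few keys (such as the `NameTableSqlStore` one above) or spread across many.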

Related Objects

Mentioned In: T313215: [24hrs] Investigate how to set up StopForumSpam correctly
T312801: smartmeta is missing updates to ElasticSearch
T309070: [Timebox: 18hrs] Frequent 502 responses when submitting edits
T310597: Production warning: Aborted connection 38950 to db: 'mwdb_wbstack_X' user: 'mwu_X' host: '10.108.4.7' (Got an error reading communication packets)
T310066: Production error: sql-backup failed to start due to no database available
Mentioned Here: T313215: [24hrs] Investigate how to set up StopForumSpam correctly
T310597: Production warning: Aborted connection 38950 to db: 'mwdb_wbstack_X' user: 'mwu_X' host: '10.108.4.7' (Got an error reading communication packets)

Event Timeline

• toan created this task.Jun 8 2022, 9:11 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 8 2022, 9:11 AM

• toan updated the task description. (Show Details)Jun 8 2022, 9:14 AM

• toan updated the task description. (Show Details)

• toan added subscribers: WMDE-leszek, conny-kawohl_WMDE, Tarrow, Addshore.Jun 8 2022, 9:17 AM

• toan mentioned this in T310066: Production error: sql-backup failed to start due to no database available.Jun 8 2022, 9:50 AM

• toan claimed this task.Jun 8 2022, 10:50 AM

• toan moved this task from Backlog to Blocked/Stalled on the Wikibase Cloud (Launch Migration Kanban (2022)) board.

• toan moved this task from Blocked/Stalled to Doing on the Wikibase Cloud (Launch Migration Kanban (2022)) board.Jun 14 2022, 11:46 AM

Sadly, this hasn't fully gone away, although the number of warnings has drastically reduced and they seem to happen at times of heavy load.

Since the database got more memory and stopped falling over, we've seen these happen less frequently, and the logs are no longer constantly spammed with them.

Last 30 days

https://cloudlogging.app.goo.gl/yAC8aR7BuHPZzdAb8

There are a few on the morning of the 13th at 9 am that don't make much sense to me, and then a bigger cluster that seems to have occurred around the same time the migration of batch C was started.

The redis-replica restarted many times (https://cloudlogging.app.goo.gl/w7UsAxeADx68iKjp8) when the migration happened, and it seems to have been OOMKilled by k8s at least once.

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Mon, 13 Jun 2022 17:35:30 +0200
  Finished:     Mon, 13 Jun 2022 17:35:47 +0200
Ready:          True
Restart Count:  23

This also put an increased load on the secondary-sql, which led to more than 35 seconds of replication lag.

• toan moved this task from Doing to Blocked/Stalled on the Wikibase Cloud (Launch Migration Kanban (2022)) board.Jun 14 2022, 1:13 PM

• toan removed • toan as the assignee of this task.Jun 15 2022, 10:18 AM

• toan mentioned this in T310597: Production warning: Aborted connection 38950 to db: 'mwdb_wbstack_X' user: 'mwu_X' host: '10.108.4.7' (Got an error reading communication packets) .Jun 16 2022, 10:58 AM

• toan claimed this task.Jun 28 2022, 7:50 AM

So, another update. In the last 7 days this occurred 24 times in total, and it has not happened since June 24th.

We are seeing nowhere near the number of errors that were initially reported.

I propose we sit on this a bit longer until it happens again, and then maybe try correlating it in the same way as https://phabricator.wikimedia.org/T310597#8023571, or just close it if it doesn't happen.

• toan mentioned this in T309070: [Timebox: 18hrs] Frequent 502 responses when submitting edits.Jul 1 2022, 9:30 AM

• toan removed • toan as the assignee of this task.Jul 1 2022, 12:10 PM

Tarrow mentioned this in T312801: smartmeta is missing updates to ElasticSearch.Jul 13 2022, 9:30 AM

Evelien_WMDE moved this task from Launch Migration Kanban (2022) to Engineering prioritised backlog on the Wikibase Cloud board.Jul 18 2022, 2:03 PM

Evelien_WMDE edited projects, added Wikibase Cloud; removed Wikibase Cloud (Launch Migration Kanban (2022)).

Evelien_WMDE mentioned this in T313215: [24hrs] Investigate how to set up StopForumSpam correctly .Jul 26 2022, 1:42 PM

We don't know the solution, but we suspect that at least sometimes we see these errors because generating the IPSet for blocking people takes longer than the 7 seconds allowed when adding something to the WANCache (see: mediawiki/dist/includes/libs/objectcache/wancache/WANObjectCache.php:831). We should therefore revisit this ticket after solving T313215 to see if it still occurs with some regularity.
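The suspected mechanism can be illustrated with a simplified sketch: a cache helper that times the regeneration callback and refuses to store the value if it took longer than the limit. This is not the actual WANObjectCache logic, which is PHP and takes more factors into account; the 7-second constant is taken from the comment above, and `get_with_set_or_reject` and its parameters are hypothetical names used only for illustration.

```python
import time

# Limit mentioned in the comment above; assumed here, not read from MediaWiki.
MAX_REGEN_SECONDS = 7.0


def get_with_set_or_reject(cache, key, regenerate, clock=time.monotonic):
    """Return (value, outcome) for a key, regenerating the value on a miss.

    The regenerated value is stored only if regeneration finished within
    MAX_REGEN_SECONDS; otherwise the set is rejected and the value is
    returned to the caller without being cached.
    """
    if key in cache:
        return cache[key], "hit"
    started = clock()
    value = regenerate()
    elapsed = clock() - started
    if elapsed > MAX_REGEN_SECONDS:
        # The value may already be stale by the time it would be stored.
        return value, "rejected"
    cache[key] = value
    return value, "stored"
```

Under this model, a slow callback is never cached, so every request regenerates the value again and logs another rejection, which would match the repeated warnings seen under load.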

Tarrow moved this task from Engineering prioritised backlog to Backlog (incoming) on the Wikibase Cloud board.Aug 1 2022, 11:55 AM

	F35239448: image.png
	Jun 14 2022, 12:01 PM

	F35239458: image.png
	Jun 14 2022, 12:01 PM
