T388290: EventStreams frequently replaying edits from beginning of queue causes the channels to be spammed with useless notifications. We should prevent StewardBot and SULWatcher from reporting events that happened in the past.
Description
Details
| Subject | Repo | Branch | Lines +/- | |
|---|---|---|---|---|
| Restart EventStream if it is replaying old events | labs/tools/stewardbots | master | +23 -1 |
| Status | Subtype | Assigned | Task | ||
|---|---|---|---|---|---|
| Open | BUG REPORT | None | T388290 EventStreams frequently replaying edits from beginning of queue | ||
| Resolved | AntiCompositeNumber | T388292 Mitigate EventStreams replay issues in stewardbots |
Event Timeline
I temporarily added some logging to StewardBot to confirm that EventStreams is sending the correct timestamps when replaying old events, and it is.
The first mitigation I tried was to use
lid = f'[{{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":-1}},{{"topic":"codfw.mediawiki.recentchange","partition":0,"timestamp":{int(time.time())}}}]' # TEMP: Testing ACN for event in EventSource(stream, last_id=lid): ...
to try to force EventStreams to only send current events, but it appeared to have no effect.
I am currently testing the following in StewardBot:
# TEMP: Testing ACN if (time.time() - change["timestamp"]) > 300: # drop events more than 300 seconds old logger.debug(f"Dropping change with ts {change['meta']['dt']}, {change['timestamp']}") continue
So far, no events have been dropped and no replayed events have been seen in the channel. So we'll have to wait and see. I was expecting at least a categorization change to be dropped so far because of T388294.
It's probably a better idea to restart the EventSource instead of just dropping old messages so we don't have to wait for it to catch up. We should ignore categorize events for that though (again because of T388294).
Change #1125575 had a related patch set uploaded (by AntiCompositeNumber; author: AntiCompositeNumber):
[labs/tools/stewardbots@master] Restart EventStream if it is replaying old events
Change #1125575 merged by jenkins-bot:
[labs/tools/stewardbots@master] Restart EventStream if it is replaying old events
+2'd — only follow-up I might suggest @AntiCompositeNumber would be making that threshold (300) a variable, if there's a chance it might be adjusted in the future
Mentioned in SAL (#wikimedia-cloud) [2025-03-10T19:51:50Z] <wmbot~anticomposite@tools-bastion-13> Remove uncommitted changes, deploy b8352a8 (T388292)
I'm hoping to not need to. Delays in EventStream propagation are theoretically possible but unusual right now, and if we start regularly getting delays >5 minutes we'll probably have to do more work than just tweaking the threshold.
2025-03-10T20:22:47Z stewardbot INFO: <#wikimedia-stewards/[self]> 03XXBlackburnXx locked account ClemmieBenton55 with the following comment: 07Spam-only account: spambot 2025-03-10T20:31:56Z stewardbot INFO: Replayed change detected, restarting EventStream: 2025-03-03T18:54:13Z, https://meta.wikimedia.org/w/index.php?diff=28336156&oldid=22933912&rcid=34501122 2025-03-10T20:31:56Z stewardbot INFO: Starting EventStream listener
SULWatcher has only gotten a 429 so far.
2025-03-11T00:58:38Z SULWatcher EventsteamsThread INFO: Replayed change detected, restarting EventStream: 2025-03-03T18:52:22Z, None 2025-03-11T01:36:17Z SULWatcher EventsteamsThread INFO: Replayed change detected, restarting EventStream: 2025-03-03T18:52:22Z, None 2025-03-11T01:39:51Z SULWatcher EventsteamsThread INFO: Replayed change detected, restarting EventStream: 2025-03-03T18:52:22Z, None 2025-03-11T02:23:22Z SULWatcher EventsteamsThread INFO: Replayed change detected, restarting EventStream: 2025-03-03T18:52:22Z, None 2025-03-11T02:56:52Z SULWatcher EventsteamsThread INFO: Replayed change detected, restarting EventStream: 2025-03-03T18:52:22Z, None
log entries don't have a notify_url. Oh well. The actual mitigation is working, so this task is resolved for now.