Page MenuHomePhabricator

Mitigate EventStreams replay issues in stewardbots
Closed, ResolvedPublic

Description

T388290: EventStreams frequently replaying edits from beginning of queue causes the channels to be spammed with useless notifications. We should prevent StewardBot and SULWatcher from reporting events that happened in the past.

Event Timeline

I temporarily added some logging to StewardBot to confirm that EventStreams is sending the correct timestamps when replaying old events, and it is.

The first mitigation I tried was to use

lid = f'[{{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":-1}},{{"topic":"codfw.mediawiki.recentchange","partition":0,"timestamp":{int(time.time())}}}]'  # TEMP: Testing ACN
for event in EventSource(stream, last_id=lid):
    ...

to try to force EventStreams to only send current events, but it appeared to have no effect.

I am currently testing the following in StewardBot:

# TEMP: Testing ACN
if (time.time() - change["timestamp"]) > 300:
    #  drop events more than 300 seconds old
    logger.debug(f"Dropping change with ts {change['meta']['dt']}, {change['timestamp']}")
    continue

So far, no events have been dropped and no replayed events have been seen in the channel. So we'll have to wait and see. I was expecting at least a categorization change to be dropped so far because of T388294.
It's probably a better idea to restart the EventSource instead of just dropping old messages so we don't have to wait for it to catch up. We should ignore categorize events for that though (again because of T388294).

Change #1125575 had a related patch set uploaded (by AntiCompositeNumber; author: AntiCompositeNumber):

[labs/tools/stewardbots@master] Restart EventStream if it is replaying old events

https://gerrit.wikimedia.org/r/1125575

Change #1125575 merged by jenkins-bot:

[labs/tools/stewardbots@master] Restart EventStream if it is replaying old events

https://gerrit.wikimedia.org/r/1125575

+2'd — only follow-up I might suggest @AntiCompositeNumber would be making that threshold (300) a variable, if there's a chance it might be adjusted in the future

Mentioned in SAL (#wikimedia-cloud) [2025-03-10T19:51:50Z] <wmbot~anticomposite@tools-bastion-13> Remove uncommitted changes, deploy b8352a8 (T388292)

+2'd — only follow-up I might suggest @AntiCompositeNumber would be making that threshold (300) a variable, if there's a chance it might be adjusted in the future

I'm hoping to not need to. Delays in EventStream propagation are theoretically possible but unusual right now, and if we start regularly getting delays >5 minutes we'll probably have to do more work than just tweaking the threshold.

2025-03-10T20:22:47Z stewardbot INFO: <#wikimedia-stewards/[self]> 03X​XBlackburnXx locked account ClemmieBenton55 with the following comment: 07Spam-only account: spambot
2025-03-10T20:31:56Z stewardbot INFO: Replayed change detected, restarting EventStream: 2025-03-03T18:54:13Z, https://meta.wikimedia.org/w/index.php?diff=28336156&oldid=22933912&rcid=34501122
2025-03-10T20:31:56Z stewardbot INFO: Starting EventStream listener

SULWatcher has only gotten a 429 so far.

2025-03-11T00:58:38Z SULWatcher EventsteamsThread INFO: Replayed change detected, restarting EventStream: 2025-03-03T18:52:22Z, None
2025-03-11T01:36:17Z SULWatcher EventsteamsThread INFO: Replayed change detected, restarting EventStream: 2025-03-03T18:52:22Z, None
2025-03-11T01:39:51Z SULWatcher EventsteamsThread INFO: Replayed change detected, restarting EventStream: 2025-03-03T18:52:22Z, None
2025-03-11T02:23:22Z SULWatcher EventsteamsThread INFO: Replayed change detected, restarting EventStream: 2025-03-03T18:52:22Z, None
2025-03-11T02:56:52Z SULWatcher EventsteamsThread INFO: Replayed change detected, restarting EventStream: 2025-03-03T18:52:22Z, None

log entries don't have a notify_url. Oh well. The actual mitigation is working, so this task is resolved for now.