Page MenuHomePhabricator

Old events in the stream
Closed, ResolvedPublicBUG REPORT

Description

Hello. Five days ago (06.10.22) something happened to streams. Now stream sometimes send very old events.

Simple Python code to demonstrate this bug:

from sseclient import SSEClient as EventSource
from datetime import datetime,timezone
import json


for event in EventSource("https://stream.wikimedia.org/v2/stream/mediawiki.revision-create"):
    if event.event == 'message':
        try:
            e = json.loads(event.data)
        except ValueError:
            pass
        else:
            now = datetime.now(timezone.utc).strftime("%Y-%m-%d")  # today
            if not e["meta"]["dt"].startswith(now):  # if not today
                # write to logs
                f = open("bug.log", "a")
                f.write("Now: {0}; Meta:dt: {1}; rev_timestamp: {2})\n".format(now, e["meta"]["dt"], e["rev_timestamp"]))
                f.close()

After ~30 minutes:

Now: 2022-10-11; Meta:dt: 2022-10-04T13:11:16Z; rev_timestamp: 2022-10-04T13:11:16Z)
Now: 2022-10-11; Meta:dt: 2022-10-04T13:11:15Z; rev_timestamp: 2022-10-04T13:11:15Z)
Now: 2022-10-11; Meta:dt: 2022-10-04T13:11:09Z; rev_timestamp: 2022-10-04T13:11:06Z)
Now: 2022-10-11; Meta:dt: 2022-10-04T13:11:16Z; rev_timestamp: 2022-10-04T13:11:16Z)
Now: 2022-10-11; Meta:dt: 2022-10-04T13:11:16Z; rev_timestamp: 2022-10-04T13:11:16Z)
Now: 2022-10-11; Meta:dt: 2022-10-04T13:11:16Z; rev_timestamp: 2022-10-04T13:11:15Z)
Now: 2022-10-11; Meta:dt: 2022-10-04T13:11:16Z; rev_timestamp: 2022-10-04T13:11:16Z)

…8k+ lines.

This does not happen immediately after startup. Maybe it related with reconnections or something like that.
This problem didn't exist before.

Event Timeline

Ottomata added subscribers: gmodena, tchin.

Interesting! Adding some tags and folks. Would like to look into this.

Indeed, if it happens after 15 or 30 minutes, something weird must be going on with the reconnect Last-Event-ID header. It looks like it is causing the client to reconnect consuming from the earliest offset in the stream (about 7 days ago).

(BTW, thank you for the very good bug report @Iluvatar :) )

I tried to hardcode EventSource(url, last_id = None), but without effect.
I use streams since 2017, but this bug appeared only 6 days ago (afternoon / in the evening of October 6). Almost all my bot are broken.

Maybe something changed / updated on server side in this time period?

earliest offset in the stream (about 7 days ago)

Usually exactly 7 days. But not always. Sometimes up to 13-15 days.
Sorry for my English. :)

JavaScript (Browser): https://swviewer.toolforge.org/test.html (see console):

Simple code from Wikitech example:

<script>
const url = 'https://stream.wikimedia.org/v2/stream/revision-create';
const eventSource = new EventSource(url);

eventSource.onmessage = (event) => {
    const data = JSON.parse(event.data);

if (!data.meta.dt.startsWith('2022-10-12') && !data.meta.dt.startsWith('2022-10-13'))  // today is 2022-10-12
    console.log(data.meta.dt + " " + data.rev_timestamp);
};
</script>
2022-10-05T05:30:30Z 2022-10-05T05:30:30Z
2022-10-05T05:30:30Z 2022-10-05T05:30:28Z
2022-10-05T05:30:30Z 2022-10-05T05:30:30Z
2022-10-05T05:30:29Z 2022-10-05T05:30:29Z
2022-10-05T05:30:30Z 2022-10-05T05:30:30Z

Any updates? It's a very important (breaking) bug.

The easiest way to reproduce:

  1. Open browser's console on the link.
  2. Disable and re-enable network controller (simulating a disconnect).

Gif (click on image).

Simple code from Wikitech example:

I just pasted this code (with ! dt.startsWith filter changed to today) into the developer tools console in a chromium based browser (Brave) while on https://stream.wikimedia.org/v2/ui/#/.

I waited 15 minutes, and saw the first reconnect request. In the request headers on the reconnect, I see:

last-event-id: [{"topic":"eqiad.mediawiki.revision-create","partition":0,"timestamp":1666105122001},{"topic":"codfw.mediawiki.revision-create","partition":0,"offset":-1}]

Some time later, I see another reconnect request, with a new last-event-id:

last-event-id: [{"topic":"codfw.mediawiki.revision-create","partition":0,"offset":-1},{"topic":"eqiad.mediawiki.revision-create","partition":0,"timestamp":1666105413001}]

Left this running for a couple of hours, had several reconnect events, all with proper last-event-ids with appropriate timestamps to reconnect at.

I did the same with at your https://swviewer.toolforge.org/test.html example, and saw the same behavoir.

The first reconnect got me:

last-event-id: [{"topic":"eqiad.mediawiki.revision-create","partition":0,"timestamp":1666106718001},{"topic":"codfw.mediawiki.revision-create","partition":0,"offset":-1}]

And after several reconnects and hours later:

last-event-id: [{"topic":"codfw.mediawiki.revision-create","partition":0,"offset":-1},{"topic":"eqiad.mediawiki.revision-create","partition":0,"timestamp":1666110008001}]

All reconnects had timestamps later than now, so the reconnect looks like it is working properly.

I never saw any events with meta.dt != today printed on the console either though, so that indicates to me that there aren't any events with inconsistent meta.dt vs Kafka timestamp values.

So far, I haven't been able to reproduce!

Is it possible some reconnect logic has changed in the EventSource implementation you are using? I'd guess not, since it looks like you've experienced this in both a Python client and in a browser. What browser are you using?

Thanks. Today this bug does not exist. Even when reconnect through disable/enable network controller. You fixed it. :) I will test for a few more days.

What browser are you using?

  • Chrome (desktop; windows), last version.
  • Python 3.8 on home PC, Python 3.9 on toolforge.
  • nodejs v12 on toolforge.
Ottomata claimed this task.

Haha, "I" fixed it! :p

Interesting. I wonder if there were indeed somehow some old events at later offsets in the stream. That could happen, but shouldn't. That could explain how you might see events even older than 7 days. If that were the case, then those events are likely passed retention time and removed from the stream.

I suppose let's resolve this then, but please reopen if it happens again.