Symptom
Between 8 and 20% of logged Popups events contain either duplicated link interaction tokens or are identical events [0]. Furthermore, very nearly all of these events (~98% per T161769#3195922) are logged by Firefox, including the latest version. This was first written up in T161769: Schema:Popups sends extraneous link interaction events in control condition. This task attempts to summarise what we know we know and what we know we don't know so that we can work to isolate the source of the event duplication.
Hypotheses
1. Page Previews (PP) is logging the duplicate events
The most likely source of the duplicate events is PP. We tested this by instrumenting the piece of code that receives events for logging from the central state container, [the eventLogging change listener](https://github.com/wikimedia/mediawiki-extensions-Popups/blob/master/src/changeListeners/eventLogging.js) (see T163198: Track instances of duplicate Popups events being logged). We began discarding both types of duplicate event when rEPOPc44fddf8cdfd: eventLogging: Discard events with duplicate tokens was deployed on Thursday, June 1st.
While he was reviewing those changes, @Jhernandez submitted rEPOP293d7ebe8dd8: Clear interaction after an event for it is logged in EL, which, when deployed on June 1st, all but eradicated the duplicate events generated by PP [1][2].
It's also worth noting that this kind of duplication has already been observed in T142667: Investigate duplicate EventLogging rows. The MediaViewer, WikimediaBlogVisit, and NavigationTiming schemas, as well as a previous revision of the Popups schema, are all affected differently. I've also observed this kind of duplication in the ReadingDepth schema (see T161769#3270707).
2. PP is logging events too frequently
This still needs to be tested.
Some time ago, I created the TestEventDuplication schema and an associated script that would try to provoke event duplication. The script can be run directly from the browser's console on any MediaWiki instance that has $wgEventLoggingBaseUri config variable set to a production beacon URI, e.g. "https://en.wikipedia.org/beacon/event".
3. The UA is at fault
We mentioned above that very nearly all of these events are logged by Firefox, including the latest version. This is in line with earlier informal observations about the behaviour of Firefox (see T142667#2891207).
In T166115#3285770, I observed that the majority of the ReadingDepth schema's duplicate events are logged by Chrome and Chrome Mobile. If the level of duplication were constant, then we could write this off as a bug in the implementation. It isn't, however, and the implementation is relatively straightforward. This doesn't strictly invalidate the above, it simply highlights that all browsers are capable of logging duplicate events.
4. The OS is at fault
This was invalidated early on in T161769: Schema:Popups sends extraneous link interaction events in control condition:
5. The duplicates are being generated by Varnish/varnishkafka
After a thorough explanation of the server-side EventLogging pipeline from @Ottomata, I'm fairly confident that this hypothesis can be invalidated. As I understand it, every entry in the webrequest table has the hostname of the Varnish server and that instance's internal sequence ID for that request. If the duplicate events were being duplicated by anything after Varnish in the pipeline, then the sequence IDs should be duplicated.
In T161769#3194893, for example, the HTTP requests are all served by the same Varnish server and have different sequence IDs. Querying the hostname and sequence of all Popups events logged in June yields the same result [4].
6. The events are being generated by a Kafka consumer
In T161769#3194893 and T161769#3247604 we see the same duplication in the log.Popups_16364296 MySQL and webrequest Hadoop tables, which are written to by different Kafka consumer that subscribe to different topics.
It might be worth checking for discrepancies between the log.Popups_16364296 MySQL table, which has messages from the eventlogging-valid-mixed topic written to it, and those in the corresponding Hadoop table, which has messages from the eventlogging_Popups topic written to it. However, this wouldn't validate this hypothesis given the above.
7. The events are being duplicated because the code is being loaded more than once
It's possible a gadget or piece of code in a project/user's Common.js is causing Popups JS code to be loaded again leading to multiple events being loaded.
It would be worth checking if the duplicate events only appear on certain projects.
8. A bot/person is replaying the events with malicious intent/accidentally to drive Tilman insane.
It's possible to replay any XHR request either using the JS console or programatically by accident. How does EventLogging deal with these?
+----------+--------+-------+-------+ | ymd | nDupes | n | % | +----------+--------+-------+-------+ | 20170525 | 11028 | 58367 | 18.89 | | 20170526 | 7910 | 49686 | 15.92 | | 20170527 | 7257 | 38524 | 18.84 | | 20170528 | 7160 | 43665 | 16.40 | | 20170529 | 7157 | 60953 | 11.74 | | 20170530 | 7419 | 59861 | 12.39 | | 20170531 | 5518 | 58975 | 9.36 | | 20170601 | 14204 | 64431 | 22.05 | | 20170602 | 13068 | 59266 | 22.05 | | 20170603 | 5052 | 43240 | 11.68 | | 20170604 | 7249 | 52642 | 13.77 | | 20170605 | 4902 | 59737 | 8.21 | | 20170606 | 4837 | 58618 | 8.25 | | 20170607 | 7060 | 57580 | 12.26 | | 20170608 | 761 | 12960 | 5.87 | +----------+--------+-------+-------+ select dupesPerDay.ymd, nDupes, n, round( 100 * nDupes / n, 2) as '%' from ( select ymd, sum(nDupes) AS nDupes from ( select left(timestamp, 8) as ymd, count(*) AS nDupes from log.Popups_16364296 where date(timestamp) >= '2017-05-25' and date(timestamp) <= '2017-06-08' and event_linkInteractionToken is not null group by ymd, event_linkInteractionToken having nDupes > 1 ) AS dupesPerDayInt group by ymd ) as dupesPerDay join ( select left(timestamp, 8) as ymd, count(*) AS n from log.Popups_16364296 where date(timestamp) >= '2017-05-25' and date(timestamp) <= '2017-06-08' and event_linkInteractionToken is not null group by ymd ) as nPerDay on dupesPerDay.ymd = nPerDay.ymd order by ymd;
[1] https://grafana.wikimedia.org/dashboard/db/reading-web-page-previews?refresh=1m&orgId=1&from=1496188800000&to=1496275200000
[2] https://grafana.wikimedia.org/dashboard/db/reading-web-page-previews?refresh=1m&orgId=1&from=1496880000000&to=1496966400000
[4] {P5558,lines=14}