Page MenuHomePhabricator

[Spike] Isolate the source of Popups event duplication
Closed, ResolvedPublic

Description

Symptom

Between 8 and 20% of logged Popups events contain either duplicated link interaction tokens or are identical events [0]. Furthermore, very nearly all of these events (~98% per T161769#3195922) are logged by Firefox, including the latest version. This was first written up in T161769: Schema:Popups sends extraneous link interaction events in control condition. This task attempts to summarise what we know we know and what we know we don't know so that we can work to isolate the source of the event duplication.

Hypotheses

1. Page Previews (PP) is logging the duplicate events

NOTE: This was invalidated by our implementation of the client-side deduplicator, the instrumentation for which is available on the Page Previews Perf dashboard.

The most likely source of the duplicate events is PP. We tested this by instrumenting the piece of code that receives events for logging from the central state container, [the eventLogging change listener](https://github.com/wikimedia/mediawiki-extensions-Popups/blob/master/src/changeListeners/eventLogging.js) (see T163198: Track instances of duplicate Popups events being logged). We began discarding both types of duplicate event when rEPOPc44fddf8cdfd: eventLogging: Discard events with duplicate tokens was deployed on Thursday, June 1st.

While he was reviewing those changes, @Jhernandez submitted rEPOP293d7ebe8dd8: Clear interaction after an event for it is logged in EL, which, when deployed on June 1st, all but eradicated the duplicate events generated by PP [1][2].

It's also worth noting that this kind of duplication has already been observed in T142667: Investigate duplicate EventLogging rows. The MediaViewer, WikimediaBlogVisit, and NavigationTiming schemas, as well as a previous revision of the Popups schema, are all affected differently. I've also observed this kind of duplication in the ReadingDepth schema (see T161769#3270707).

2. PP is logging events too frequently

This still needs to be tested.

Some time ago, I created the TestEventDuplication schema and an associated script that would try to provoke event duplication. The script can be run directly from the browser's console on any MediaWiki instance that has $wgEventLoggingBaseUri config variable set to a production beacon URI, e.g. "https://en.wikipedia.org/beacon/event".

3. The UA is at fault

We mentioned above that very nearly all of these events are logged by Firefox, including the latest version. This is in line with earlier informal observations about the behaviour of Firefox (see T142667#2891207).

In T166115#3285770, I observed that the majority of the ReadingDepth schema's duplicate events are logged by Chrome and Chrome Mobile. If the level of duplication were constant, then we could write this off as a bug in the implementation. It isn't, however, and the implementation is relatively straightforward. This doesn't strictly invalidate the above, it simply highlights that all browsers are capable of logging duplicate events.

4. The OS is at fault

This was invalidated early on in T161769: Schema:Popups sends extraneous link interaction events in control condition:

… the distribution by OS doesn't seem to deviate too much from the general distribution.

5. The duplicates are being generated by Varnish/varnishkafka

After a thorough explanation of the server-side EventLogging pipeline from @Ottomata, I'm fairly confident that this hypothesis can be invalidated. As I understand it, every entry in the webrequest table has the hostname of the Varnish server and that instance's internal sequence ID for that request. If the duplicate events were being duplicated by anything after Varnish in the pipeline, then the sequence IDs should be duplicated.

In T161769#3194893, for example, the HTTP requests are all served by the same Varnish server and have different sequence IDs. Querying the hostname and sequence of all Popups events logged in June yields the same result [4].

6. The events are being generated by a Kafka consumer

NOTE: Invalidated per T167391#3402475.

In T161769#3194893 and T161769#3247604 we see the same duplication in the log.Popups_16364296 MySQL and webrequest Hadoop tables, which are written to by different Kafka consumer that subscribe to different topics.

It might be worth checking for discrepancies between the log.Popups_16364296 MySQL table, which has messages from the eventlogging-valid-mixed topic written to it, and those in the corresponding Hadoop table, which has messages from the eventlogging_Popups topic written to it. However, this wouldn't validate this hypothesis given the above.

7. The events are being duplicated because the code is being loaded more than once

NOTE: Invalidated per T167391#3384593 and T167391#3402980 (as well as other informal observations on related tasks).

It's possible a gadget or piece of code in a project/user's Common.js is causing Popups JS code to be loaded again leading to multiple events being loaded.

It would be worth checking if the duplicate events only appear on certain projects.

8. A bot/person is replaying the events with malicious intent/accidentally to drive Tilman insane.

It's possible to replay any XHR request either using the JS console or programatically by accident. How does EventLogging deal with these?


[0]
+----------+--------+-------+-------+
| ymd      | nDupes | n     | %     |
+----------+--------+-------+-------+
| 20170525 |  11028 | 58367 | 18.89 |
| 20170526 |   7910 | 49686 | 15.92 |
| 20170527 |   7257 | 38524 | 18.84 |
| 20170528 |   7160 | 43665 | 16.40 |
| 20170529 |   7157 | 60953 | 11.74 |
| 20170530 |   7419 | 59861 | 12.39 |
| 20170531 |   5518 | 58975 |  9.36 |
| 20170601 |  14204 | 64431 | 22.05 |
| 20170602 |  13068 | 59266 | 22.05 |
| 20170603 |   5052 | 43240 | 11.68 |
| 20170604 |   7249 | 52642 | 13.77 |
| 20170605 |   4902 | 59737 |  8.21 |
| 20170606 |   4837 | 58618 |  8.25 |
| 20170607 |   7060 | 57580 | 12.26 |
| 20170608 |    761 | 12960 |  5.87 |
+----------+--------+-------+-------+

select
  dupesPerDay.ymd,
  nDupes,
  n,
  round( 100 * nDupes / n, 2) as '%'
from
  (
    select ymd, sum(nDupes) AS nDupes from
    (
      select
        left(timestamp, 8) as ymd,
        count(*) AS nDupes 
      from
        log.Popups_16364296
      where
        date(timestamp) >= '2017-05-25'
        and date(timestamp) <= '2017-06-08'
        and event_linkInteractionToken is not null
      group by
        ymd,
        event_linkInteractionToken
      having nDupes > 1
    ) AS dupesPerDayInt
    group by ymd
  ) as dupesPerDay
  join
  (
    select
      left(timestamp, 8) as ymd,
      count(*) AS n 
    from
      log.Popups_16364296
    where
      date(timestamp) >= '2017-05-25'
      and date(timestamp) <= '2017-06-08'
      and event_linkInteractionToken is not null
    group by
      ymd
  ) as nPerDay
  on dupesPerDay.ymd = nPerDay.ymd
order by ymd;

[1] https://grafana.wikimedia.org/dashboard/db/reading-web-page-previews?refresh=1m&orgId=1&from=1496188800000&to=1496275200000
[2] https://grafana.wikimedia.org/dashboard/db/reading-web-page-previews?refresh=1m&orgId=1&from=1496880000000&to=1496966400000

[4] {P5558,lines=14}

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
phuedx updated the task description. (Show Details)

If after the changes from https://phabricator.wikimedia.org/T167236 are deployed and duplication is still only seen in firefox, we should reach out to them

I think it would be wise to wait for the outcome of T167236 and see if the problem has been fixed before attempting to find the cause again. @ovasileva what do you say?

I haven't been able to reproduce the issue locally yet.

@bmansurov - I think T167236: userSessionToken in RelatedArticles schema does not seem to survive beyond one pageview might only affect some of the hypotheses, perhaps it's still worth looking at the other ones? I think expecting it to fix this issue is more of a distant hope.

I thought I found the reason why duplicate events were being logged, but it doesn't look like so. Here's what I got.

I'm on the commit 9c2a4b143f1c179397d3930b806fb2f1028fbd1f (origin/wmf/1.30.0-wmf.5) and my setup looks like this:

Browser: Chromium Version 59.0.3071.104 (Developer Build) Fedora Project (64-bit)
OS: Fedora 26
Page: Франкофоны (imported from ruwiki -- According to stat1003 duplicate events are seen on this page)
Blue links: Франция, Бельгия, and Швейцария (all three imported from ruwiki)
User: logged in
EventLogging base URL ($wgEventLoggingBaseUri): 'http://en.wikipedia.org/beacon/event' (notice the 'http', my development machine is also on 'http').

So I'm on my local machine and visiting /wiki/Франкофоны. Upon page load I see the following event being sent to http://en.wikipedia.org/beacon/event:

{"event":{"pageTitle":"Франкофоны","namespaceId":0,"skin":"vector","isAnon":false,"pageToken":"6964cb486ee4680614985081664130775c855dbfaacde","sessionToken":"d23c654ab3ba6ffd","action":"pageLoaded","domInteractiveTime":1414,"firstPaintTime":1567},"revision":16325045,"schema":"ReadingDepth","webHost":"mw.loc","wiki":"wiki"};:

Since Wikipedia is configured to use https the above data is re-sent to https://en.wikipedia.org/beacon/event.

That's why I thought two events were being sent, but apparently, only one is.


According to https://tools.wmflabs.org/versions/ the group 2 wikis are on 1.30.0-wmf.6. I was hoping not to see duplicate events because the patches mentioned in the task description are already live. Here is the query I ran in stat1003:

SELECT *, COUNT(id) c FROM log.Popups_16364296 WHERE timestamp > 20170626201500 AND event_linkInteractionToken IS NOT NULL GROUP BY event_linkInteractionToken HAVING c > 1 ORDER BY event_linkInteractionToken LIMIT 100;

Unfortunately, I still see duplicate events.

@bmansurov, @phuedx - are we still only seeing them in Firefox? Perhaps it's time to contact Mozilla.

Debunking hypotheses 1 and 2

The following setup had duplicate events in the DB, so I've tried re-creating the exact same setup.

Browser: Firefox 54
OS: Windows 10
Page: https://ru.wikipedia.org/wiki/Платиновые_монеты

In the DB, for page interaction token "474d1e235bfa1893" I see two page loaded events. ID, UUID, and timestamp for these events are different, but the rest of the data is identical. In the browser I see only one event being sent. Other events such as "dwelledButAbandoned" and "dismissed" are also being sent once, and being recorded in the DB once.

Refreshed the same page and got a new page interaction token, which was "cfc01b1bc6bdcecb". This time though, in the DB, I see only one "pageLoaded" event. Interacting with links on the page also resulted in unique events.

To conclude, the bug's happening randomly (at least when I'm testing). Events are not being sent twice by the browser.

In support of hypothesis 3

Event count for today grouped by browser:

SELECT SUBSTRING( userAgent, INSTR(userAgent,'"browser_family": "')+19, INSTR(SUBSTRING(userAgent, INSTR(userAgent,'"browser_family": "')+19), '"' )-1 ) AS browser_family, COUNT(*) AS event_count FROM log.Popups_16364296 WHERE timestamp LIKE '20170627%' GROUP BY browser_family ORDER BY event_count DESC;

+-----------------------------+-------------+
| browser_family              | event_count |
+-----------------------------+-------------+
| Chrome                      |       29130 |
| Firefox                     |       22128 |
| Yandex Browser              |        4139 |
| Opera                       |        3274 |
| Edge                        |        1527 |
| Mail.ru Chromium Browser    |         139 |
| Chromium                    |         120 |
| Iron                        |          22 |
| Vivaldi                     |          22 |
| Samsung Internet            |          19 |
| Pale Moon (Firefox Variant) |          18 |
| Maxthon                     |          14 |
| Chrome Mobile               |          11 |
| QQ Browser                  |           8 |
| Other                       |           6 |
| K-Meleon                    |           1 |
| Safari                      |           1 |
| Firefox Mobile              |           1 |
+-----------------------------+-------------+
18 rows in set (1.36 sec)
About 36% is Firefox.

Duplicate events for today grouped by browser:


SELECT SUBSTRING( userAgent, INSTR(userAgent,'"browser_family": "')+19, INSTR(SUBSTRING(userAgent, INSTR(userAgent,'"browser_family": "')+19), '"' )-1 ) AS browser_family, COUNT(*) AS tokens_with_dupes FROM (SELECT userAgent, event_linkInteractionToken, COUNT(*) AS events  FROM log.Popups_16364296 WHERE timestamp LIKE '20170627%' GROUP BY event_linkInteractionToken  HAVING events>1) AS dupetokenlist GROUP BY browser_family ORDER BY tokens_with_dupes DESC;
+----------------+-------------------+
| browser_family | tokens_with_dupes |
+----------------+-------------------+
| Firefox        |              2243 |
| Chrome         |                74 |
| Yandex Browser |                 1 |
+----------------+-------------------+
3 rows in set (1.87 sec)
About 96% is Firefox.

About 10% of all Firefox events are duplicates. @ovasileva we may want to contact Firefox.

Debunking the hypothesis 6

In the MySQL database I see 6 results for the following query:

SELECT * FROM log.Popups_16364296 WHERE timestamp LIKE '20170627%' AND event_linkInteractionToken='02bc8b50120ee106' ORDER BY timestamp;

And in Hive I also see 6 results for the following query:

SELECT * FROM wmf.webrequest WHERE year=2017 AND month=06 AND day=27 AND instr(uri_query, '02bc8b50120ee106') != 0;

So there is no discrepancy between the data from Hadoop and MySQL.

Debunking the hypothesis 7

We're seeing duplicate events in multiple wikis.

SELECT wiki, COUNT(id) c FROM log.Popups_16364296 WHERE timestamp LIKE '20170627%' AND event_linkInteractionToken IS NOT NULL GROUP BY wiki HAVING c > 1 ORDER BY event_linkInteractionToken LIMIT 100;
+--------+-------+
| wiki   | c     |
+--------+-------+
| huwiki |  1327 |
| fawiki |     5 |
| srwiki |   289 |
| jawiki |     2 |
| ruwiki | 32665 |
| trwiki |    65 |
| cawiki |   283 |
| cswiki |     7 |
| elwiki |   735 |
| zhwiki |    10 |
| ptwiki |    37 |
| enwiki |    83 |
| hewiki |  1108 |
| dewiki |    13 |
| nlwiki |     2 |
| itwiki | 16970 |
| eswiki |     5 |
+--------+-------+

Debunking the hypothesis 6

In the MySQL database I see 6 results for the following query:

SELECT * FROM log.Popups_16364296 WHERE timestamp LIKE '20170627%' AND event_linkInteractionToken='02bc8b50120ee106' ORDER BY timestamp;

And in Hive I also see 6 results for the following query:

SELECT * FROM wmf.webrequest WHERE year=2017 AND month=06 AND day=27 AND instr(uri_query, '02bc8b50120ee106') != 0;

How is that different from the comparisons we ran earlier (see the links in the task description)? I thought the question posed in hypothesis 6 was asking about a different Hadoop table.

In support of hypothesis 3

...

Duplicate events for today grouped by browser:


SELECT SUBSTRING( userAgent, INSTR(userAgent,'"browser_family": "')+19, INSTR(SUBSTRING(userAgent, INSTR(userAgent,'"browser_family": "')+19), '"' )-1 ) AS browser_family, COUNT(*) AS tokens_with_dupes FROM (SELECT userAgent, event_linkInteractionToken, COUNT(*) AS events  FROM log.Popups_16364296 WHERE timestamp LIKE '20170627%' GROUP BY event_linkInteractionToken  HAVING events>1) AS dupetokenlist GROUP BY browser_family ORDER BY tokens_with_dupes DESC;
+----------------+-------------------+
| browser_family | tokens_with_dupes |
+----------------+-------------------+
| Firefox        |              2243 |
| Chrome         |                74 |
| Yandex Browser |                 1 |
+----------------+-------------------+
3 rows in set (1.87 sec)
About 96% is Firefox.

About 10% of all Firefox events are duplicates. @ovasileva we may want to contact Firefox.

To clarify, this is the query from April (T161769#3195922, already mentioned in the second sentence of the task description), and the result looks very similar (back then it was 98% instead of 96% now). What about the other points @phuedx raised above under hypothesis 3?

How is that different from the comparisons we ran earlier (see the links in the task description)? I thought the question posed in hypothesis 6 was asking about a different Hadoop table.

@Tbayer do you know what different Hadoop table the description is referring to?

@bmansurov For the data about hypothesis 3 (browsers), is there any way we could identify if the duplicate events are sent by a few clients, or many of them? Any anon client token we could group by?

The idea being to see if it is a few clients with a repeated malicious behavior or if it is a widespread issue. Few clients could mean a specific extension or setting in the browser, widespread usage may mean a browser issue.

Thanks for all your work on this!

About 10% of all Firefox events are duplicates. @ovasileva we may want to contact Firefox.

Agreed. @bmansurov, @phuedx - are you on the mozilla-wikimedia-discuss group? Do you think we can put together a list of questions for them and send it out later today?

To clarify, this is the query from April (T161769#3195922, already mentioned in the second sentence of the task description), and the result looks very similar (back then it was 98% instead of 96% now). What about the other points @phuedx raised above under hypothesis 3?

In T166115#3285770 I noticed that the query is counting duplicates the wrong way. Since we know that a single pageToken is used for two actions -- pageLoaded and pageUnloaded -- we should not consider these events as duplicates. We should only consider events that have the same action and same pageToken as duplicates. Below is my attempt at finding duplicates.

More analysis for Hypothesis 3

This time we're looking at the ReadingDepth schema.

All events logged by all browsers since 06/27.

SELECT SUBSTRING( userAgent, INSTR(userAgent,'"browser_family": "')+19, INSTR(SUBSTRING(userAgent, INSTR(userAgent,'"browser_family": "')+19), '"' )-1 ) AS browser_family, COUNT(*) AS event_count FROM log.ReadingDepth_16325045 WHERE timestamp LIKE '20170627%' GROUP BY browser_family ORDER BY event_count DESC;
+-----------------------------+-------------+
| browser_family              | event_count |
+-----------------------------+-------------+
| Chrome                      |      100599 |
| Chrome Mobile               |       60023 |
| Firefox                     |       50941 |
| Samsung Internet            |        9368 |
| Edge                        |        8312 |
| Opera                       |        2452 |
| Firefox Mobile              |        1965 |
| Yandex Browser              |        1627 |
| Opera Mobile                |         476 |
| Edge Mobile                 |         465 |
| Amazon Silk                 |         286 |
| Mobile Safari               |         252 |
| Facebook                    |         204 |
| Chromium                    |         128 |
| Vivaldi                     |          93 |
| QQ Browser                  |          86 |
| Pale Moon (Firefox Variant) |          83 |
| Maxthon                     |          76 |
| YandexSearch                |          71 |
| Mail.ru Chromium Browser    |          69 |
| Sleipnir                    |          66 |
| Sogou Explorer              |          64 |
| Android                     |          53 |
| UC Browser                  |          53 |
| SeaMonkey                   |          50 |
| Crosswalk                   |          42 |
| Puffin                      |          37 |
| QQ Browser Mobile           |          17 |
| Iron                        |          16 |
| Iceweasel                   |          14 |
| IE                          |          13 |
| K-Meleon                    |          11 |
| Other                       |          11 |
| Safari                      |           8 |
| Pinterest                   |           3 |
| Conkeror                    |           1 |
+-----------------------------+-------------+
Chrome is about 42%.
Firefox is about 21%.

Duplicate events logged by all browsers since 06/27. An event is considered a duplicate of another if it has the same pageToken and action.

SELECT SUBSTRING( userAgent, INSTR(userAgent,'"browser_family": "')+19, INSTR(SUBSTRING(userAgent, INSTR(userAgent,'"browser_family": "')+19), '"' )-1 ) AS browser_family, COUNT(*) AS tokens_with_dupes FROM (SELECT userAgent, COUNT(*) AS events  FROM log.ReadingDepth_16325045 WHERE timestamp LIKE '20170627%' GROUP by event_action, event_pageToken HAVING events>1) AS dupetokenlist GROUP BY browser_family ORDER BY tokens_with_dupes DESC;
+------------------+-------------------+
| browser_family   | tokens_with_dupes |
+------------------+-------------------+
| Chrome Mobile    |               447 |
| Chrome           |               311 |
| Firefox          |               165 |
| Mobile Safari    |               106 |
| Samsung Internet |                57 |
| Edge             |                31 |
| SeaMonkey        |                11 |
| Opera            |                 6 |
| Firefox Mobile   |                 6 |
| Opera Mobile     |                 5 |
| Yandex Browser   |                 4 |
| Edge Mobile      |                 3 |
| IE               |                 2 |
| Facebook         |                 2 |
| Vivaldi          |                 2 |
| Amazon Silk      |                 2 |
| UC Browser       |                 1 |
| K-Meleon         |                 1 |
| YandexSearch     |                 1 |
| Chromium         |                 1 |
+------------------+-------------------+
Chrome is about 26%.
Firefox is about 14%.

About 0.3% of all Firefox events are duplicate. Same is true fro Chrome.

I think in the case of the ReadingDepth schema there is no clear winner browser that logs more duplicate events than others. And this is different from what we found in T167391#3384529.

@bmansurov For the data about hypothesis 3 (browsers), is there any way we could identify if the duplicate events are sent by a few clients, or many of them? Any anon client token we could group by?

It doesn't look like the problem is specific to an OS or a browser. It's happening everywhere but on some browsers it's happening more. See T167391#3384529.

SELECT SUBSTRING( userAgent, INSTR(userAgent,'"os_family": "')+14, INSTR(SUBSTRING(userAgent, INSTR(userAgent,'"os_family": "')+14), '"' )-1 ) AS os_family, SUBSTRING( userAgent, INSTR(userAgent,'"browser_family": "')+19, INSTR(SUBSTRING(userAgent, INSTR(userAgent,'"browser_family": "')+19), '"' )-1 ) AS browser_family, CONCAT( SUBSTRING( userAgent, INSTR(userAgent,'"browser_major": "')+18, INSTR(SUBSTRING(userAgent, INSTR(userAgent,'"browser_major": "')+18), '"' )-1 ), '-', SUBSTRING( userAgent, INSTR(userAgent,'"browser_minor": "')+18, INSTR(SUBSTRING(userAgent, INSTR(userAgent,'"browser_minor": "')+18), '"' )-1 ) ) AS browser_version, COUNT(*) AS tokens_with_dupes FROM (SELECT userAgent, COUNT(*) AS events  FROM log.ReadingDepth_16325045 WHERE timestamp LIKE '20170627%' GROUP by event_action, event_pageToken HAVING events>1) AS dupetokenlist GROUP BY os_family, browser_family, browser_version ORDER BY tokens_with_dupes DESC;
+---------------+------------------+-----------------+-------------------+
| os_family     | browser_family   | browser_version | tokens_with_dupes |
+---------------+------------------+-----------------+-------------------+
| Android       | Chrome Mobile    | 58-0            |               289 |
| Windows 7     | Chrome           | 58-0            |               111 |
| iOS           | Mobile Safari    | 10-0            |               102 |
| Windows 10    | Chrome           | 58-0            |                67 |
| Windows 7     | Firefox          | 54-0            |                43 |
| Windows 10    | Firefox          | 54-0            |                37 |
| Android       | Samsung Internet | 5-4             |                29 |
| Windows 7     | Chrome           | 59-0            |                24 |
| Android       | Chrome Mobile    | 55-0            |                22 |
| Android       | Samsung Internet | 4-0             |                21 |
| Windows 10    | Edge             | 14-14393        |                21 |
| Android       | Chrome Mobile    | 40-0            |                19 |
| Android       | Chrome Mobile    | 50-0            |                16 |
| Windows 8.1   | Chrome           | 58-0            |                15 |
| Android       | Chrome Mobile    | 46-0            |                15 |
| Android       | Chrome Mobile    | 57-0            |                14 |
| Mac OS X      | Chrome           | 58-0            |                13 |
| Android       | Chrome Mobile    | 56-0            |                13 |
| Windows XP    | Firefox          | 52-0            |                13 |
| Android       | Chrome Mobile    | 43-0            |                12 |
| Windows XP    | Chrome           | 49-0            |                11 |
| Android       | Chrome           | 58-0            |                11 |
| Windows 10    | Chrome           | 59-0            |                11 |
| Windows 10    | Edge             | 15-15063        |                10 |
| Windows 7     | Firefox          | 53-0            |                 9 |
| Mac OS X      | Firefox          | 54-0            |                 8 |
| Android       | Chrome Mobile    | 59-0            |                 8 |
| Android       | Chrome Mobile    | 52-0            |                 8 |
| Windows 8.1   | Firefox          | 54-0            |                 8 |
| Android       | Chrome Mobile    | 49-0            |                 6 |
| Android       | Chrome Mobile    | 51-0            |                 6 |
| Windows 7     | Firefox          | 52-0            |                 5 |
| Windows 7     | SeaMonkey        | 2-40            |                 5 |
| Windows 8.1   | Chrome           | 59-0            |                 4 |
| Windows Vista | Firefox          | 52-0            |                 4 |
| Windows 8     | Chrome           | 58-0            |                 4 |
| Windows 7     | Firefox          | 35-0            |                 4 |
| Windows 10    | Firefox          | 41-0            |                 4 |
| Mac OS X      | Firefox          | 48-0            |                 4 |
| Android       | Opera Mobile     | 42-7            |                 4 |
| Android       | Chrome Mobile    | 39-0            |                 3 |
| Android       | Chrome Mobile    | 45-0            |                 3 |
| Windows 7     | SeaMonkey        | 2-46            |                 3 |
| Android       | Firefox Mobile   | 54-0            |                 3 |
| Android       | Chrome Mobile    | 54-0            |                 3 |
| Android       | Samsung Internet | 4-2             |                 3 |
| Windows 10    | SeaMonkey        | 2-46            |                 3 |
| Android       | Chrome Mobile    | 53-0            |                 3 |
| Windows 7     | Chrome           | 56-0            |                 3 |
| iOS           | Mobile Safari    | 7-0             |                 2 |
| Windows 7     | Chrome           | 43-0            |                 2 |
| iOS           | Mobile Safari    | 9-0             |                 2 |
| Windows 7     | IE               | 11-0            |                 2 |
| Android       | Chrome           | 51-0            |                 2 |
| Windows Phone | Edge Mobile      | 15-15063        |                 2 |
| Windows 7     | Chrome           | 55-0            |                 2 |
| Windows 7     | Opera            | 45-0            |                 2 |
| Mac OS X      | Chrome           | 59-0            |                 2 |
| Android       | Samsung Internet | 5-0             |                 2 |
| Windows 10    | Firefox          | 48-0            |                 2 |
| Android       | Facebook         | 129-0           |                 2 |
| Linux         | Samsung Internet | 1-1             |                 2 |
| Windows 7     | Chrome           | 46-0            |                 2 |
| Android       | Chrome Mobile    | 44-0            |                 2 |
| Windows 8.1   | Vivaldi          | 1-91            |                 2 |
| Windows 7     | Firefox          | 45-0            |                 2 |
| Windows 8.1   | Firefox          | 53-0            |                 2 |
| Linux         | Chrome           | 58-0            |                 2 |
| Windows 7     | Chrome           | 45-0            |                 2 |
| Windows 7     | Chrome           | 57-0            |                 2 |
| Windows XP    | Firefox          | 34-0            |                 2 |
| Android       | Chrome           | 54-0            |                 2 |
| Windows 7     | Chrome           | 47-0            |                 1 |
| Windows 7     | Firefox          | 31-0            |                 1 |
| Windows 8.1   | Opera            | 39-0            |                 1 |
| Windows 7     | Opera            | 46-0            |                 1 |
| Windows 7     | Firefox          | 48-0            |                 1 |
| Android       | Chrome           | 57-0            |                 1 |
| Windows 8.1   | Yandex Browser   | 17-6            |                 1 |
| Ubuntu        | Firefox          | 54-0            |                 1 |
| Windows 10    | Firefox          | 51-0            |                 1 |
| Chrome OS     | Chrome           | 59-0            |                 1 |
| Android       | Chrome           | 43-0            |                 1 |
| Windows 7     | Chrome           | 53-0            |                 1 |
| Android       | UC Browser       | 10-7            |                 1 |
| Linux         | Chrome           | 52-0            |                 1 |
| Android       | Chrome Mobile    | 42-0            |                 1 |
| Windows Vista | Opera            | 36-0            |                 1 |
| Android       | Chrome Mobile    | 37-0            |                 1 |
| Windows 10    | Opera            | 45-0            |                 1 |
| Windows 7     | Firefox          | 56-0            |                 1 |
| Windows 8.1   | Chrome           | 51-0            |                 1 |
| Windows 8     | Firefox          | 40-0            |                 1 |
| Linux         | Chrome           | 59-0            |                 1 |
| Windows Phone | Edge Mobile      | 14-14393        |                 1 |
| Android       | Chrome Mobile    | 38-0            |                 1 |
| Linux         | Chrome           | 47-0            |                 1 |
| Android       | Amazon Silk      | 58-3            |                 1 |
| Windows 10    | Chrome           | 49-0            |                 1 |
| Windows 7     | Firefox          | 55-0            |                 1 |
| Windows Vista | Chrome           | 46-0            |                 1 |
| Android       | Chrome Mobile    | 18-0            |                 1 |
| Android       | Chrome           | 39-0            |                 1 |
| Windows 10    | Yandex Browser   | 17-6            |                 1 |
| Windows 8     | Chrome           | 59-0            |                 1 |
| Mac OS X      | Firefox          | 53-0            |                 1 |
| Windows 7     | Chrome           | 58-2            |                 1 |
| Windows XP    | K-Meleon         | 75-0            |                 1 |
| Android       | YandexSearch     | 6-45            |                 1 |
| Ubuntu        | Firefox          | 34-0            |                 1 |
| Windows 7     | Firefox          | 42-0            |                 1 |
| Windows XP    | Firefox          | 43-0            |                 1 |
| Mac OS X      | Chrome           | 49-0            |                 1 |
| Ubuntu        | Chromium         | 58-0            |                 1 |
| Windows 7     | Yandex Browser   | 17-6            |                 1 |
| Windows 7     | Chrome           | 42-0            |                 1 |
| Android       | Amazon Silk      | 58-2            |                 1 |
| Fedora        | Firefox          | 54-0            |                 1 |
| Windows 7     | Firefox          | 49-0            |                 1 |
| Android       | Firefox Mobile   | 50-0            |                 1 |
| Windows 8.1   | Firefox          | 51-0            |                 1 |
| Windows XP    | Firefox          | 51-0            |                 1 |
| Android       | Firefox Mobile   | 53-0            |                 1 |
| Windows 7     | Firefox          | 47-0            |                 1 |
| Android       | Chrome           | 49-0            |                 1 |
| Windows 7     | Firefox          | 43-0            |                 1 |
| Windows XP    | Chrome           | 41-0            |                 1 |
| Android       | Chrome Mobile    | 47-0            |                 1 |
| Windows 7     | Chrome           | 49-0            |                 1 |
| Android       | Firefox Mobile   | 40-0            |                 1 |
| Android       | Yandex Browser   | 17-4            |                 1 |
| Android       | Opera Mobile     | 20-0            |                 1 |
| Windows Vista | Firefox          | 47-0            |                 1 |
+---------------+------------------+-----------------+-------------------+

How is that different from the comparisons we ran earlier (see the links in the task description)? I thought the question posed in hypothesis 6 was asking about a different Hadoop table.

I didn't provide enough detail in the description:

The eventlogging-processor processes produce two Kafka topics: eventlogging-valid-mixed and eventlogging_<Schema> – in this case, eventlogging_Popups. The latter are consumed directly by Camus and stored in HDFS.

It might be worth checking for discrepancies between the log.Popups_16364296 MySQL table, which has messages from the eventlogging-valid-mixed topic written to it, and those in the corresponding Hadoop table, which has messages from the eventlogging_Popups topic written to it. However, this wouldn't validate this hypothesis given the above.

Since both of the topics are produced by the eventlogging-processors, it seems unlikely that one would have more data than the other – especially given a symptom of the problem is that the duplication is pronounced for one browser.


See https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Architecture for more detail about the overall EL architecture.

To clarify, this is the query from April (T161769#3195922, already mentioned in the second sentence of the task description), and the result looks very similar (back then it was 98% instead of 96% now). What about the other points @phuedx raised above under hypothesis 3?

In T166115#3285770 I noticed that the query is counting duplicates the wrong way. Since we know that a single pageToken is used for two actions -- pageLoaded and pageUnloaded -- we should not consider these events as duplicates. We should only consider events that have the same action and same pageToken as duplicates.

That's true, but unless I'm missing something, Sam's query in T166115#3285770 actually avoided that problem already, by focusing on the case event_action = 'pageLoaded'.

How is that different from the comparisons we ran earlier (see the links in the task description)? I thought the question posed in hypothesis 6 was asking about a different Hadoop table.

I didn't provide enough detail in the description:

The eventlogging-processor processes produce two Kafka topics: eventlogging-valid-mixed and eventlogging_<Schema> – in this case, eventlogging_Popups. The latter are consumed directly by Camus and stored in HDFS.

It might be worth checking for discrepancies between the log.Popups_16364296 MySQL table, which has messages from the eventlogging-valid-mixed topic written to it, and those in the corresponding Hadoop table, which has messages from the eventlogging_Popups topic written to it. However, this wouldn't validate this hypothesis given the above.

Since both of the topics are produced by the eventlogging-processors, it seems unlikely that one would have more data than the other – especially given a symptom of the problem is that the duplication is pronounced for one browser.

OK, but how would one go about comparing the data directly as suggested in the task? (Given that we still haven't identified the root cause, it seems worth checking the hypotheses independently.)


See https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Architecture for more detail about the overall EL architecture.

Yup (I think I already included that link back in May in our Google Doc draft)

@bmansurov For the data about hypothesis 3 (browsers), is there any way we could identify if the duplicate events are sent by a few clients, or many of them? Any anon client token we could group by?

It doesn't look like the problem is specific to an OS or a browser. It's happening everywhere but on some browsers it's happening more. See T167391#3384529.

I think @Jhernandez may have meant "client" in the sense of a browser on an individual machine. We could group by sessionToken (I already looked at one individual one earlier).

How is that different from the comparisons we ran earlier (see the links in the task description)? I thought the question posed in hypothesis 6 was asking about a different Hadoop table.

I didn't provide enough detail in the description:

The eventlogging-processor processes produce two Kafka topics: eventlogging-valid-mixed and eventlogging_<Schema> – in this case, eventlogging_Popups. The latter are consumed directly by Camus and stored in HDFS.

It might be worth checking for discrepancies between the log.Popups_16364296 MySQL table, which has messages from the eventlogging-valid-mixed topic written to it, and those in the corresponding Hadoop table, which has messages from the eventlogging_Popups topic written to it. However, this wouldn't validate this hypothesis given the above.

Since both of the topics are produced by the eventlogging-processors, it seems unlikely that one would have more data than the other – especially given a symptom of the problem is that the duplication is pronounced for one browser.

OK, but how would one go about comparing the data directly as suggested in the task? (Given that we still haven't identified the root cause, it seems worth checking the hypotheses independently.)

For the record: In today's meeting we clarified that this was referring to the hourly EL imports described at https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging#Hadoop . It's conceivable to compare the duplicates by accessing the Hadoop copy (for a particular hour) as described there, but that doesn't seem worthwhile right now considering that the duplication appears to happen further upstream (at least in the one example examined at T161769#3194893).

Gonna debug this a little bit more on Wednesday...

Debunking scenario 7
I couldn't find any evidence to suggest Common.js on the main offenders - Italian and Russian wiki were at fault.

Debunking scenario 8
It's interesting to identify to pick one of the duplicate token [1] and then actually analyse what happened there. I'd need to look at more than 1 case of these to rule out scenario 8 but they seem too random to be done programatically:

On one particular one this is what I see:

  • I see 5 opened events at the same time
  • 2 seconds later 5 dwelled but abandoned events followed by 1 opened and 1 dwelledButAbandoned.
  • 12 seconds later a dwelled but abandoned + opened happen at the same time
  • 4 seconds later a dwelled but abandoned + opened happen at the same time
  • 28 seconds later a dwelled but abandoned + opened happen at the same time
  • 14 seconds later a dwelled but abandoned + opened happen at the same time
  • 25 seconds later a dwelled but abandoned + opened happen at the same time
  • 13 seconds later a dwelled but abandoned + opened happen at the same time
  • 7 seconds later a dwelled but abandoned + opened happen at the same time
  • 17 seconds later a dwelled but abandoned + opened happen at the same time

*7 seconds later a dwelled but abandoned + opened happen at the same time

link interaction, page token and session token are exactly the same for all of these

Musing about scenario 3
I'm not convinced this is a Firefox issue. Maybe Firefox on Windows? [2]

[1]

SELECT wiki, event_action, event_linkInteractionToken, COUNT(*) AS events 
	FROM log.Popups_16364296
	WHERE event_isAnon = 1 
	AND event_popupEnabled = 0
	AND timestamp LIKE '2017041%'
	AND event_linkInteractionToken IS NOT NULL
	GROUP BY event_linkInteractionToken # userAgent should be identical if the token is
	HAVING events>15
SELECT wiki, event_action, userAgent, wiki, COUNT(*) AS tokens_with_dupes FROM (
    SELECT wiki, userAgent, event_action, event_linkInteractionToken, COUNT(*) AS events 
	FROM log.Popups_16364296
	WHERE event_isAnon = 1 
	AND event_popupEnabled = 0
	AND timestamp LIKE '2017041%'
	AND event_linkInteractionToken IS NOT NULL
	GROUP BY event_linkInteractionToken # userAgent should be identical if the token is
	HAVING events>1) AS dupetokenlist
GROUP BY userAgent, wiki
ORDER BY tokens_with_dupes DESC;

Debunking scenario 8
It's interesting to identify to pick one of the duplicate token [1] and then actually analyse what happened there. I'd need to look at more than 1 case of these to rule out scenario 8 but they seem too random to be done programatically:

You might want to follow along with the subtasks that @Tbayer just opened: T169581: Histogram for the number of duplicated events per browser session and T169582: Examine several sessions which contained duplicate events. Those tasks focus on analyzing the behavior of clients that are sending duplicate events in more detail with a view to (in)validating #2 and #3.

Musing about scenario 3
I'm not convinced this is a Firefox issue. Maybe Firefox on Windows? [2]

If it's a Firefox on Windows issue, then it's still a Firefox issue ;)

Joking aside, @Tbayer, @bmansurov, and I have all run the queries for duplicates by OS and by browser family/version (T161769#3195922) but not together. I don't see any harm in combining the two.

Following on from T167391#3403699:

This behavior is not limited to Firefox on Windows 10. Firefox remains the main offender across a variety of platforms. The distribution of the platforms appears to follow the general distribution.

[0]
+---------------+----------------+---------------+-------------------+
| os_family     | browser_family | browser_major | tokens_with_dupes |
+---------------+----------------+---------------+-------------------+
| Windows 7     | Firefox        | 54            |              1549 |
| Windows 10    | Firefox        | 54            |               967 |
| Windows XP    | Firefox        | 52            |               484 |
| Windows 8.1   | Firefox        | 54            |               266 |
| Mac OS X      | Firefox        | 54            |               239 |
| Ubuntu        | Firefox        | 54            |                94 |
| Windows 7     | Firefox        | 52            |                86 |
| Windows 7     | Firefox        | 53            |                78 |
| Windows Vista | Firefox        | 52            |                25 |
| Windows 8     | Firefox        | 54            |                15 |
| Windows 7     | Firefox        | 51            |                15 |
| Windows 10    | Firefox        | 52            |                14 |
| Windows 8.1   | Firefox        | 53            |                13 |
| Linux         | Firefox        | 52            |                11 |
| Windows 8.1   | Chrome         | 57            |                 9 |
| Windows 10    | Firefox        | 53            |                 9 |
| Windows 10    | Firefox        | 47            |                 9 |
| Mac OS X      | Chrome         | 58            |                 6 |
| Windows 7     | Chrome         | 59            |                 5 |
| Fedora        | Firefox        | 54            |                 5 |
| Windows 8.1   | Firefox        | 47            |                 4 |
| Windows 10    | Chrome         | 59            |                 1 |
| Windows 7     | Yandex Browser | 17            |                 1 |
| Linux         | Chrome         | 58            |                 1 |
+---------------+----------------+---------------+-------------------+

select
  substring(userAgent, instr(userAgent, '"os_family": "') + 14,
    instr(substring(userAgent, instr(userAgent, '"os_family": "') + 14), '"') -1) as os_family,
  substring(userAgent, instr(userAgent, '"browser_family": "') + 19, instr(substring(userAgent, instr(userAgent, '"browser_family": "') + 19), '"') - 1) as browser_family,
  substring(userAgent, instr(userAgent, '"browser_major": "') + 18, instr(substring(userAgent, instr(userAgent, '"browser_major": "') + 18), '"') -1) as browser_major,
  count(*) as tokens_with_dupes
from (
  select
    userAgent,
    event_linkInteractionToken,
    count(*) as events
  from
    log.Popups_16364296
  where
    month(timestamp) = 7
    and event_linkInteractionToken is not null
  group by
    event_linkInteractionToken
  having
    events > 1
) as dupetokenlist
group by
  os_family,
  browser_family,
 browser_major
order by
  tokens_with_dupes desc;

Gonna debug this a little bit more on Wednesday...

Thanks! It's great that you are able to chip in (every additional pair of eyes is valuable). But as @phuedx already indicated, we should coordinate about who does what, to make sure your time is not spent in vain. E.g. the theory that this happens only on a particular OS was indeed already ruled out back in April.

Debunking scenario 8
It's interesting to identify to pick one of the duplicate token [1] and then actually analyse what happened there.

It looks [1] was excerpted from the query I posted three months ago at T161769#3195922 . I would recommend changing the date from April 1 to more recent one, because quite a few changes and fixes have been made since then. In particular, the "dwelled but abandoned + opened happen at the same time" behavior in your example below should no longer happen since T162924: PagePreviews sending "dwelledButAbandoned" after "opened".

I'd need to look at more than 1 case of these to rule out scenario 8 but they seem too random to be done programatically:

I agree. Note that we already looked at another example back in May that likewise was clearly inconsistent with hypothesis 8: T161769#3250284 (all events for one client / session in one pageview)

On one particular one this is what I see:

  • I see 5 opened events at the same time
  • 2 seconds later 5 dwelled but abandoned events followed by 1 opened and 1 dwelledButAbandoned.
  • 12 seconds later a dwelled but abandoned + opened happen at the same time
  • 4 seconds later a dwelled but abandoned + opened happen at the same time
  • 28 seconds later a dwelled but abandoned + opened happen at the same time
  • 14 seconds later a dwelled but abandoned + opened happen at the same time
  • 25 seconds later a dwelled but abandoned + opened happen at the same time
  • 13 seconds later a dwelled but abandoned + opened happen at the same time
  • 7 seconds later a dwelled but abandoned + opened happen at the same time
  • 17 seconds later a dwelled but abandoned + opened happen at the same time

*7 seconds later a dwelled but abandoned + opened happen at the same time

link interaction, page token and session token are exactly the same for all of these

if you filter by event_pageTitleSource="Proving a point to Tilman" you'll see a bunch of duplicate events under there.. i manufactured those and it's thus definitely very easy to do #8 but to be convincing would require a little bit more malice.

One interesting thing I'm not sure if you've noticed:
event_namespaceIdSource is not always 0. A lot of the duplicate events happen in namespace 10 for huwiki. That's the template namespace...

That's a pretty advanced namespace to be operating in. It makes me wonder if maybe a power user has a gadget or there is a bug in that namespace. To me that would be worth exploring some more.

Can we version our events to ensure which version of our code is generating events? Maybe based on current commit is? Seems like that would give more confidence of where the events are originating...

Okay I'm pretty damn sure I've got to the bottom of this... > T170018

giphy.gif (180×300 px, 251 KB)

I'm hoping we can verify this was the problem and sign this off after a fix for T170018 has been merged and deployed so have moved to sign off column... Fingers crossed this does it...

if not, we'll back to square one and we should reach out to Mozilla and those in charge of the EventLogging stack.

When we've fixed this it would make sense for Tilman to sign this off (although that will be a while)

I re-ran the query in T170018#3429243 prior to resolving the task. I noted that there were 3 instances of duplication in the last 48 hours.

Each of those instances is a distinct browser session:

Timestamp (YMDH)Session IDnDupes% of nEventsBrowser FamilyPreviews Enabled?
20170711157e2e9800de9ae77f12~Yandex BrowserNo
20170712195f2b46c45534c9a61220.28 [0]FirefoxYes
20170713054525a900c85c7a8b40.02 [0]ChromeYes
[0]
+----------+--------+-------+------+
| ymd      | nDupes | n     | %    |
+----------+--------+-------+------+
| 20170711 |   3164 | 44274 | 7.15 |
| 20170712 |    122 | 43331 | 0.28 |
| 20170713 |      4 | 16687 | 0.02 |
+----------+--------+-------+------+

select
  nPerDay.ymd,
  ifnull(nDupes, 0) as nDupes,
  n,
  ifnull(round(100 * nDupes / n, 2), 0) as '%'
from (
  select ymd, sum(nDupesPerDayInt.n) as nDupes
  from (
    select left(timestamp, 8) as ymd, count(*) as n
    from log.Popups_16364296
    where
      year(timestamp) = 2017
      and month(timestamp) = 7
      and day(timestamp) >= 11
      and event_linkInteractionToken is not null
    group by ymd, event_linkInteractionToken
    having n > 1
  ) as nDupesPerDayInt
  group by ymd
) as nDupesPerDay
right outer join (
  select left(timestamp, 8) as ymd, count(*) as n
  from log.Popups_16364296
  where
    year(timestamp) = 2017
    and month(timestamp) = 7
    and day(timestamp) >= 11
    and event_linkInteractionToken is not null
  group by ymd
) as nPerDay
on
  nDupesPerDay.ymd = nPerDay.ymd
order by
  nPerDay.ymd asc;

closing this. If we notice something as a part of T158172, we can open a separate bug. @Tbayer - feel free to re-open if you think we're missing something.