Page MenuHomePhabricator

Duplicate events sent in Firefox after back button press
Closed, ResolvedPublic5 Estimated Story Points

Description

This is caused by an upstream bug that has been confirmed in Firefox:
https://bugzilla.mozilla.org/show_bug.cgi?id=1379762#c5

Background

After some debugging, I noticed that certain pages were appearing more often than others with regards to duplicate tokens.

Some examples:

Looking at these pages I noticed they had one common problem - links clustered together that could be accidentally clicked.

My theory is that people are accidentally navigating to links and then quickly hitting back. IT seems this behaviour confuses Firefox and it's possibly reloading the Popups code (I'm currently exploring why more)

Replication steps

Note: This can be replicated locally and is not specific to this page

  1. Visit https://hu.wikipedia.org/wiki/Sablon:A_Sz%C3%ADnh%C3%A1z-_%C3%A9s_Filmm%C5%B1v%C3%A9szeti_Egyetem_2001-ben_oklevelet_szerzett_hallgat%C3%B3i
  2. Bucket yourself: mw.storage.session.set('mwuser-sessionId', 1059 ) (you can get a token using this script https://gist.github.com/jdlrobson/999ad762eead45fd27a4d6e29aebe9f8 )
  3. Refresh page
  4. Filter events in network tab by "beacon"
  5. Hover over a link and hover off:
  6. See single beacon event.
  7. Click a link "accidentally".
  8. As soon as url changes to new page press keyboard shortcut for back e.g. cmd + left
  9. notice duplicated events

Visual of replication steps:

dupetokens.gif (704×1 px, 3 MB)

SQL queries

Identify problematic pages:

SELECT wiki, event_pageTitleSource, event_namespaceIdSource, event_action, event_linkInteractionToken, COUNT(*) AS events 
	FROM log.Popups_16364296
	WHERE  timestamp LIKE '201707%'
	AND event_linkInteractionToken IS NOT NULL
	GROUP BY event_pageTitleSource, event_linkInteractionToken # userAgent should be identical if the token is
	HAVING events>1

How to get a stack trace

In console please execute:

var oldLog = mw.eventLog.logEvent;
mw.eventLog.logEvent = function(name, data) { console.trace(); oldLog(name, data); }

Stacktrace:

mw.eventLog.logEvent debugger eval code:1:47
	handleTrackedEvent/< https://hu.wikipedia.org/w/extensions/EventLogging/modules/ext.eventLogging.subscriber.js:37:4
	fire https://hu.wikipedia.org/w/load.php:3148:10
	fireWith https://hu.wikipedia.org/w/load.php:3260:7
	Deferred/</deferred[tuple[0]] https://hu.wikipedia.org/w/load.php:3350:5
	using https://hu.wikipedia.org/w/load.php:12348:7
	handleTrackedEvent https://hu.wikipedia.org/w/extensions/EventLogging/modules/ext.eventLogging.subscriber.js:36:3
	handler https://hu.wikipedia.org/w/load.php:10965:7
	fire https://hu.wikipedia.org/w/load.php:3148:10
	fireWith https://hu.wikipedia.org/w/load.php:3260:7
	fire https://hu.wikipedia.org/w/load.php:3267:5
	track https://hu.wikipedia.org/w/load.php:10941:4
	./src/changeListeners/eventLogging.js/e.exports/< https://hu.wikipedia.org/w/extensions/Popups/resources/dist/index.js:2:10962
	./src/changeListener.js/e.exports/< https://hu.wikipedia.org/w/extensions/Popups/resources/dist/index.js:2:10418
	p https://hu.wikipedia.org/w/extensions/Popups/resources/dist/index.js:2:2827
	dispatch https://hu.wikipedia.org/wiki/Sablon:A_Sz%C3%ADnh%C3%A1z-_%C3%A9s_Filmm%C5%B1v%C3%A9szeti_Egyetem_2001-ben_oklevelet_szerzett_hallgat%C3%B3i:2:31753
	n/</</< https://hu.wikipedia.org/w/extensions/Popups/resources/dist/index.js:2:1017
	dispatch https://hu.wikipedia.org/w/extensions/Popups/resources/dist/index.js:2:4469
	./src/actions.js/s.abandon/</< https://hu.wikipedia.org/w/extensions/Popups/resources/dist/index.js:2:9712
	then/</</< https://hu.wikipedia.org/w/load.php:3305:30
	fire https://hu.wikipedia.org/w/load.php:3148:10
	fireWith https://hu.wikipedia.org/w/load.php:3260:7
	Deferred/</deferred[tuple[0]] https://hu.wikipedia.org/w/load.php:3350:5
	./src/wait.js/e.exports/< https://hu.wikipedia.org/w/extensions/Popups/resources/dist/index.js:2:35245

mw.eventLog.logEvent() is called three times, but Redux shows only one state change. (EVENT_LOGGED state is triggered only once)

Possible reasons

  1. Bug in Mozilla Firefox that leads to two load events being fired
  2. There is a bug in jquery that leads to two load events being fired
  3. There is a bug in jquery migrate that leads to two load events being fired
  4. ad block or other extension is interfering Ruled out - both Sam and piotr do not have any extensions installed
  5. The way we load JS from localStorage leads to double execution of code under certain circumstances

Event Timeline

Change 363957 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/EventLogging@master] Only subscribe EventLogging once

https://gerrit.wikimedia.org/r/363957

Jdlrobson raised the priority of this task from High to Unbreak Now!.Jul 7 2017, 10:25 PM
Jdlrobson added subscribers: Catrope, Krinkle.

I have isolated the issue to EventLogging. The fix is simple. The code assumes incorrectly that a window load event will only be fired once. This explains why we are seeing duplicate events in many of our schemas.

cc @Krinkle @Catrope

@Jdlrobson - wait. Firefox sends only a second load event (without reloading document) when the user quickly clicks back button?

@Jdlrobson - wait. Firefox sends only a second load event (without reloading document) when the user quickly clicks back button?

I find this curious as well.

Last I checked, all modern browsers agree that load can only fire once in a given browsing context. The "bfcache" (backward-forward cache) is capable of serialising such a context and restoring in certain conditions. When this happens, it is essentially like you never left. No load event would fire.

I find it hard to believe browsers would have a bug in this code or intentionally deviate from this given that there would be a lot more breakage if that was the case. All kinds of initialisation starts from the load event in different modules and libraries on the web. If those run a second time, I'd expect to find more bug reports about user-visible issues.

It wouldn't surprise me if there was a bug here (perhaps a recent regression, or limited to a beta channel), in which case we'll need to detect when this happens, but without solid confirmation of such bug I fear that merely switching to one() will do more harm than good.

https://stackoverflow.com/a/2218733/319266
https://webkit.org/blog/516/webkit-page-cache-ii-the-unload-event/
https://developer.mozilla.org/en-US/docs/Working_with_BFCache

@pmiazga yup. Interestingly the event triggers again and the code is still present. I think this is an upstream bug but it's definitely happening... I can reproduce it 100% of the time.

@Krinkle this is happening in latest Firefox. It's definitely firing a load event on the same JS session more than once. I plan to contact Mozilla Monday with a smaller MTC.

What are your concerns around using one? Why do we need to register this more than once?

Just a quick note that I tried to reproduce this myself using the given steps, and haven't gotten lucky so far. (For reference, I mined my own session token using Jon's trick: 3911.) Maybe I didn't hit the back key quickly enough or used the wrong Firefox - I might try again later.

But in any case, this is undeniably a breakthrough regarding T167391, as it marks the first time anyone has been able to generate duplicates deliberately.

(To view the duplicate events from Jon's experiment, use e.g.:

SELECT timestamp, event_pageToken, event_pageTitleSource, event_linkInteractionToken, event_action AS action FROM log.Popups_16364296 WHERE timestamp LIKE '20170707%' AND event_sessionToken = '1059' ORDER BY timestamp;

)

BTW, what's an MTC?

But in any case, this is undeniably a breakthrough regarding T167391, as it marks the first time anyone has been able to generate duplicates deliberately.

Seconded. Excellent work!

BTW, what's an MTC?

I read it as Minimal Test Case, but I might be wrong.

Bucket yourself: mw.storage.session.set('mwuser-sessionId', 1059 ) (you can get a token using this script https://gist.github.com/jdlrobson/999ad762eead45fd27a4d6e29aebe9f8 )

We wrote the same script! https://gist.github.com/phuedx/d580f01c501d207398828b717bf9870b

@Jdlrobson - wait. Firefox sends only a second load event (without reloading document) when the user quickly clicks back button?

I find this curious as well.

Thirded! This seems to be the opposite of the behaviour described in the links that you've provided (and other SO questions about window.onload and the BFCache).

https://stackoverflow.com/a/2218733/319266

While I was trying to reproduce this bug – thanks @Jdlrobson for the written and visual instructions – I did note that the body of the ext.eventLogging.subscriber.js code isn't re-evaluated when the page is shown from the BFCache but the load event handler was being invoked (see rEEVLcb7a432588c7: WIP for my excellent debugging code…).

Ignoring the answers to the SO question that suggest disabling the BFCache by adding an unload event handler to the window object is the correct solution… There's a suggestion to rely on the jQuery ready event and, funnily enough, changing

ext.eventLogging.subscriber.js
$( window ).on( 'load' function () {
  /* ... */
} );

to

ext.eventLogging.subscriber.js
$( document ).ready( function () {
  /* ... */
} );

also stops duplicate events from being logged.

What would the impact on performance be if we were to start logging events when the document was in the ready state? The Page Previews and ReadingDepth Reading Web instrumentation requires the browser support the Beacon API but I imagine there's plenty of other instrumentation relying on the fallback method of using img tags.

@Tbayer: AFAICT this bug shouldn't impact the ReadingDepth instrumentation as we add an unload event handler, which disables the BFCache in Firefox.

  1. As soon as url changes to new page press keyboard shortcut for back e.g. cmd + left

When I got my local setup into a state where Firefox could use the BFCache – per T170018#3420291, I had to disable the ReadingDepth instrumentation – I found that I didn't have to navigate back ASAP. Indeed, I don't think the previous page fell out of the BFCache throughout my testing.

I second @Krinkle 's concerns. Has anyone been able to repro this with a clean reinstal of FF (besides @Jdlrobson )?

Sorry, edit, i see that @phuedx was able to repro too

@Nuria I'm able to reproduce this on Firefox 54.0.1 (64-bit) build for archlinux - 1.0

I second @Krinkle 's concerns. Has anyone been able to repro this with a clean reinstal of FF (besides @Jdlrobson )?

Sorry, edit, i see that @phuedx was able to repro too

To clarify T170018#3420286 and T170018#3420304, I can reproduce this bug on Firefox (54.0.1) on macOS Sierra (10.12.5).

@pmiazga @phuedx so, to be clear, you se a 2nd load event with no realoading of resources? as in there is nothing on your nextwork tab? Cause if the browser is not using the pagecache a load should fire (that makes total sense) but also a normal rebuild of the page should fire too (although this being a second request everything is pulled from http cache)

MBinder_WMF set the point value for this task to 5.Jul 10 2017, 5:14 PM
MBinder_WMF subscribed.

@Jdlrobson Is this Unplanned-Sprint-Work ? I can't tell from the history, I'm just going by the priority.

OK. I couldn't create a screencast in any format that was small enough for Phab to accept. I've uploaded it to YouTube under my wikimedia.org account.

@pmiazga @phuedx so, to be clear, you se a 2nd load event with no realoading of resources? as in there is nothing on your nextwork tab?

Consider https://youtu.be/eKu4RDWFpfQ?t=12s: when I ⌘+⬅, I see the initial page resumed (not loaded) from Firefox's Back-Forward cache (BFCache). There are no HTTP requests made during this process, so the network tab is emptied. Immediately after this a beacon request and its duplicate are sent.

Our contention is that the duplicate request is caused by this code running when the page is resumed from the BFCache, which it should not be. Since that code has already run once prior to the page being put into the BFCache, the event. topic is subscribed to twice.

mmm just to understand this better: is your instrumenting code (the one from which sendbeacon requests are originated) only run upon onload event? Cause the onload issue might be a red herring right? You are seeing a sendBeacon request being sent when you click back on the browser, whether that is unexpected or not depends on the instrumenting code sending that request.

This appears to be an upstream issue relating to img's that have a srcset attribute and I'm filing an upstream bug.

MTC:

GIF demonstrating issue:

mozillabug.gif (704×648 px, 290 KB)

The upstream bug is here: https://bugzilla.mozilla.org/show_bug.cgi?id=1379762 and has been confirmed by Mozilla devs

Now.. how do we want to work around this issue? @Krinkle are you willing to revisit your -1 on Only subscribe EventLogging once - I've updated the description to explain why we want to be defensive with a link to the upstream bug.

@Krinkle this is happening in latest Firefox. It's definitely firing a load event on the same JS session more than once. I plan to contact Mozilla Monday with a smaller MTC.

What are your concerns around using one? Why do we need to register this more than once?

The problem is that if this bug is real and causes load to fire multiple times within a single browsing context, we've got a lot more and much bigger problems that are user-visible, which calls for an adequate solution across the board. (Not specific to EventLogging.)

@Krinkle the bug is definitely real please see activity on the upstream ticket specifically this comment has a lot of context and has confirmed where and how this happens.

To be clear, right now we have a serious problem with EventLogging which has been blocking us launching the page previews feature for the last 2 months now. We've been investigating this for some time (see the very detailed T167391) and this is the culmination of those efforts. Yes there may be other problems and we need to investigate those, but this is the one that is a big and very real problem to reading web right now.

Only subscribe EventLogging once or switching to document.ready both seem to be uncontroversial changes here while we know this bug to exist and guidance would be appreciated on which to use (or if neither - a counter proposal).

Fixing EventLogging and launching page previews is our major priority here and this should be fixed in some way by the end of the week. Happy to jump into a hangout with related people if this speeds things up and alleviates any confusion.

@Krinkle the bug is definitely real please see activity on the upstream ticket specifically this comment.

[..] Yes there may be other problems and we need to investigate those, but this is the one that is a big and very real problem to reading web right now.

I've traced it to a regression in Firefox 51 (January 2017). This is a pretty significant and severe violation of web standards expectations. And yes, we will most likely need to patch other areas in similar ways, given the ramifications. I'd expect Web Platform tests to have detected such issue much sooner (e.g. before their stable release). Or at least more bug reports.

However, I suppose for the most part we use document-ready in our code bases. It's fairly old-school to use onload. EventLogging's use case happens to be a natural match given we want to defer handling of events to after loading of subresources completed.

Only subscribe EventLogging once or switching to document.ready both seem to be uncontroversial changes here while we know this bug to exist and guidance would be appreciated on which to use (or if neither - a counter proposal).

Switching to document-ready works, but has the downside of starting up the EventLogging pipeline ahead of any downloads for images, secondary styles and async JavaScript. It also means that on Safari (which still doesn't support sendBeacon), if the async JavaScript has arrived and the user starts interacting with the page – both before the load event fires – then any EventLogging image requests triggered by those interactions will hold back the load event, - and by extend the browser's page load indicator. We know that this affects perceived performance for readers. Although given this only happens if the user decided to go ahead and interact with the page, perhaps not as much of an issue.

It's unfortunate that we still use Event-based "onload" instead of a more modern promise-based approach (see https://github.com/whatwg/html/issues/127). This naturally avoids the problem by using a primitive (Promise) that has the desired semantics. A second trigger is naturally ignored given a Promise can only be resolved once. Given jQuery doesn't account for this bug, and we're still in the midst of a jQuery 3 upgrade, I suppose local use of .one() will do for now (with an explicit reference back to this bug report).

Change 364389 had a related patch set uploaded (by Phuedx; owner: Jdlrobson):
[mediawiki/extensions/EventLogging@wmf/1.30.0-wmf.7] Don't subscribe EventLogging twice if window.onload fires twice

https://gerrit.wikimedia.org/r/364389

Change 363957 merged by jenkins-bot:
[mediawiki/extensions/EventLogging@master] Don't subscribe EventLogging twice if window.onload fires twice

https://gerrit.wikimedia.org/r/363957

mmm just to understand this better: is your instrumenting code (the one from which sendbeacon requests are originated) only run upon onload event? Cause the onload issue might be a red herring right? You are seeing a sendBeacon request being sent when you click back on the browser, whether that is unexpected or not depends on the instrumenting code sending that request.

Just so this question doesn't go unanswered: while I verified @Jdlrobson's fix (independently of @pmiazga and @Krinkle), I did check to see whether the Page Previews codebase was being reevaluated when the page was resumed from Firefox's BFCache. It isn't. Neither is the body of the EventLogging subscriber.

The Page Previews extension, like some other extensions, rely on the in-memory message queue provided by mw.track and mw.trackSubscribe and the EventLogging subscriber to log events. This allows us to have a soft dependency on EventLogging. Page Previews boots when the wikipage.content hook runs and not when the load event is triggered. However, [the EventLogging subscriber is subscribed when the loadevent is triggered](https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/30255332829f5867e219b35df43574260814dc87/modules/ext.eventLogging.subscriber.js#L90-L92). When this happens, it'll pick up any messages in the message queue with the event.* topic and immediately log them using mw.eventLog.logEvent.

Firefox's BFCache triggering a load event when the page is resumed from the BFCache means that the event.* topic will be subscribed to twice by the EventLogging subscriber.

You're right to highlight that there's a beacon request being made when the page is resumed from the BFCache. This appears to be a bug in the Page Previews codebase and will be filed separately.

Change 364389 merged by jenkins-bot:
[mediawiki/extensions/EventLogging@wmf/1.30.0-wmf.7] Don't subscribe EventLogging twice if window.onload fires twice

https://gerrit.wikimedia.org/r/364389

Mentioned in SAL (#wikimedia-operations) [2017-07-11T13:44:12Z] <dereckson@tin> Synchronized php-1.30.0-wmf.7/extensions/EventLogging/modules/ext.eventLogging.subscriber.js: Don't subscribe EventLogging twice if window.onload fires twice (T170018) (duration: 00m 42s)

To be clear, right now we have a serious problem with EventLogging which has been blocking us launching the page previews feature for the last 2 months now. We've been investigating this for some time (see the very detailed T167391) and this is the culmination of those >efforts. Yes there may be other problems and we need to investigate those, but this is the one that is a big and very real problem to reading web right now.

If the duplicate issue affects the majority of your data (and not just firefox) it will be worth digging out whether you have a different problem, the majority of our users in desktop use chrome so your percentage of data w/o issues should be A LOT higher than your data with duplicate problems.

If the duplicate issue affects the majority of your data (and not just firefox) it will be worth digging out whether you have a different problem, the majority of our users in desktop use chrome so your percentage of data w/o issues should be A LOT higher than your data with duplicate problems.

From the second sentence of the first paragraph of T167391: [Spike] Isolate the source of Popups event duplication:

Furthermore, very nearly all of these events (~98% per T161769#3195922) are logged by Firefox, including the latest version.

Soon after yesterday's European Mid-day SWAT deployment (11th July 2017, 1-2 PM UTC), the number of duplicate events arriving the server per hour has dropped to 0 regardless of browser family, OS, or whether Page Previews is enabled [0].

Edit

I re-ran the query today at ~10:20 AM UTC+1. There are now three instances of duplication in the last 48 hours. I'll investigate these one by one momentarily.

[0]
+------------+--------+------+-------+
| ymdh       | nDupes | n    | %     |
+------------+--------+------+-------+
| 2017071100 |      8 |  272 |  2.94 |
| 2017071101 |     57 |  338 | 16.86 |
| 2017071102 |      0 |  359 |  0.00 |
| 2017071103 |     30 |  345 |  8.70 |
| 2017071104 |    803 | 1525 | 52.66 |
| 2017071105 |     68 | 1085 |  6.27 |
| 2017071106 |     28 | 1841 |  1.52 |
| 2017071107 |     22 | 2004 |  1.10 |
| 2017071108 |    775 | 3633 | 21.33 |
| 2017071109 |     98 | 2744 |  3.57 |
| 2017071110 |    208 | 3037 |  6.85 |
| 2017071111 |    265 | 3054 |  8.68 |
| 2017071112 |    320 | 3131 | 10.22 |
| 2017071113 |     46 | 2742 |  1.68 |
| 2017071114 |      0 | 3473 |  0.00 |
| 2017071115 |     12 | 2198 |  0.55 |
| 2017071116 |      0 | 2233 |  0.00 |
| 2017071117 |      0 | 2272 |  0.00 |
| 2017071118 |      0 | 1990 |  0.00 |
| 2017071119 |      0 | 1701 |  0.00 |
| 2017071120 |      0 | 1854 |  0.00 |
| 2017071121 |      0 | 1277 |  0.00 |
| 2017071122 |      0 |  699 |  0.00 |
| 2017071123 |      0 |  467 |  0.00 |
| 2017071200 |      0 |  382 |  0.00 |
| 2017071201 |      0 |  267 |  0.00 |
| 2017071202 |      0 |  303 |  0.00 |
| 2017071203 |      0 |  683 |  0.00 |
| 2017071204 |      0 |  527 |  0.00 |
| 2017071205 |      0 |  889 |  0.00 |
| 2017071206 |      0 | 1414 |  0.00 |
| 2017071207 |      0 | 2475 |  0.00 |
| 2017071208 |      0 | 2975 |  0.00 |
| 2017071209 |      0 | 3129 |  0.00 |
| 2017071210 |      0 | 3423 |  0.00 |
| 2017071211 |      0 | 3296 |  0.00 |
| 2017071212 |      0 | 2933 |  0.00 |
| 2017071213 |      0 | 2981 |  0.00 |
| 2017071214 |      0 | 2900 |  0.00 |
| 2017071215 |      0 | 2280 |  0.00 |
| 2017071216 |      0 | 2215 |  0.00 |
| 2017071217 |      0 | 2029 |  0.00 |
| 2017071218 |      0 | 2003 |  0.00 |
| 2017071219 |    122 | 2065 |  5.91 |
| 2017071220 |      0 | 1817 |  0.00 |
| 2017071221 |      0 | 1252 |  0.00 |
| 2017071222 |      0 |  630 |  0.00 |
| 2017071223 |      0 |  463 |  0.00 |
| 2017071300 |      0 |  383 |  0.00 |
| 2017071301 |      0 |  403 |  0.00 |
| 2017071302 |      0 |  315 |  0.00 |
| 2017071303 |      0 |  649 |  0.00 |
| 2017071304 |      0 |  701 |  0.00 |
| 2017071305 |      4 | 1045 |  0.38 |
| 2017071306 |      0 | 1556 |  0.00 |
| 2017071307 |      0 | 2341 |  0.00 |
| 2017071308 |      0 | 2832 |  0.00 |
| 2017071309 |      0 |  969 |  0.00 |
+------------+--------+------+-------+

select
  nPerHour.ymdh,
  ifnull(nDupes, 0) as nDupes,
  n,
  ifnull(round(100 * nDupes / n, 2), 0) as '%'
from (
  select ymdh, sum(nDupesPerHourInt.n) as nDupes
  from (
    select left(timestamp, 10) as ymdh, count(*) as n
    from log.Popups_16364296
    where
      year(timestamp) = 2017
      and month(timestamp) = 7
      and day(timestamp) >= 11
      and event_linkInteractionToken is not null
    group by ymdh, event_linkInteractionToken
    having n > 1
  ) as nDupesPerHourInt
  group by ymdh
) as nDupesPerHour
right outer join (
  select left(timestamp, 10) as ymdh, count(*) as n
  from log.Popups_16364296
  where
    year(timestamp) = 2017
    and month(timestamp) = 7
    and day(timestamp) >= 11
    and event_linkInteractionToken is not null
  group by ymdh
) as nPerHour
on
  nDupesPerHour.ymdh = nPerHour.ymdh
order by
  nPerHour.ymdh asc;

I can be responsible for signing this off.

I can no longer reproduce this bug.

You're right to highlight that there's a beacon request being made when the page is resumed from the BFCache. This appears to be a bug in the Page Previews codebase and will be filed separately.

I no longer see this beacon request being made 💥💥💥