The Page Previews instrumentation revolves around the notion of a link interaction. Every time the user hovers over a link with their mouse or focuses on a link with their keyboard, a new interaction begins. Every link interaction has a unique identifier (herein, a “token”).
The token is a 64-bit integer, formatted as a hexadecimal string, that's generated using [[ https://www.w3.org/TR/WebCryptoAPI/#Crypto-method-getRandomValues | crypto#getRandomValues ]], which the specification says should use a “well-established cryptographic PRNG seeded with high-quality entropy”. If that's the case, then by the birthday problem the probability of a token collision should only reach 50% once roughly 2^32 (over 4 billion) tokens have been generated.
On Monday, 27th March 2017, @Tbayer reported an unusually high number of events with duplicate tokens ("duplicate events") being inserted into the EventLogging MySQL table. Naturally, we assumed that this was being caused by one or more bugs in the instrumentation. During the following two sprints we tracked down and fixed what we thought were all of them. We even went so far as to instrument the instrumentation so that we could be confident that the fixes we deployed weren't causing more duplicate events to be logged. However, while our instrumentation was assuring us that Page Previews wasn't generating duplicate events, the number of events with duplicate tokens wasn't affected.
While we were tracking down and fixing bugs, @Tbayer investigated further and reported that Firefox v51 and v52 were sending circa 98% of the duplicate events. He also noted that the distribution of operating systems sending duplicate events didn't deviate from the general distribution for pageviews. The Readers Web engineers tried to reproduce the issue consistently in Firefox, but to no avail.
@Jdlrobson's aha! moment came when he noticed that many of the duplicate events were being logged on pages with dense clusters of links, which a user might click accidentally before immediately navigating back. After trying exactly that, he saw duplicate events being logged by Firefox v54 and raised a thorough, and startling, bug report.
The EventLogging protocol subscribes to the event.* topic after the document and its sub-resources have loaded, so that logging events doesn't consume resources before or during rendering of the page. So, when Firefox restored a page from the BFCache, the load event fired a second time, the subscriber was registered again, and every subsequent event was logged twice, producing exact duplicates.
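The failure mode can be sketched with a minimal pub/sub stand-in. The names below are illustrative, not EventLogging's actual API: because the load handler registers a subscriber each time it runs, a second load event after a BFCache restore adds an identical subscriber, and every event thereafter is logged twice.

```javascript
// Minimal pub/sub stand-in; names are illustrative.
const subscribers = [];
const logged = [];

function subscribe( handler ) {
	subscribers.push( handler );
}

function publish( event ) {
	subscribers.forEach( ( handler ) => handler( event ) );
}

// The load handler registers a subscriber each time it runs.
function onLoad() {
	subscribe( ( event ) => logged.push( event ) );
}

onLoad(); // initial page load
publish( 'pagePreviewSeen' ); // logged once

onLoad(); // simulated BFCache restore re-fires load
publish( 'pagePreviewClicked' ); // logged twice: an exact duplicate
```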
The strict conditions a page must meet to enter the BFCache, combined with the user having to navigate backward or forward to a cached page to trigger the issue, neatly explain the wild peaks and troughs we were seeing in the daily percentage of Popups events with duplicate tokens.
@Jdlrobson and I each suggested a workaround.
While I was trying to reproduce the bug and understand how it was affecting the EventLogging codebase, I noticed that the DOMContentLoaded event was behaving correctly and only firing once. I suggested that EventLogging register the subscriber when DOMContentLoaded fires instead. However, the subscriber is meant to be registered as late as possible so as not to impact page load time on resource-constrained devices.
@Jdlrobson's suggestion, on the other hand, was delightfully simple: make EventLogging register the subscriber exactly once. The workaround itself was a one-character change, which, as such fixes often do, required a much larger comment to provide much-needed context.
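The shape of that workaround can be sketched as follows. This is an illustrative register-exactly-once guard, not the actual patch:

```javascript
// Illustrative sketch: register the subscriber at most once, even if the
// load event fires again after a BFCache restore.
const subscribers = [];
let registered = false;

function registerOnce( handler ) {
	if ( registered ) {
		return; // already subscribed; ignore the repeated load event
	}
	registered = true;
	subscribers.push( handler );
}

const logged = [];
registerOnce( ( event ) => logged.push( event ) ); // initial load
registerOnce( ( event ) => logged.push( event ) ); // BFCache restore: ignored

subscribers.forEach( ( handler ) => handler( 'pagePreviewSeen' ) );
// logged now contains a single entry
```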
Just under an hour after the change was deployed, the number of duplicate events per day dropped from 10–30% to roughly 0.08%. This new rate is consistent with the background level of duplication in our other, much simpler instrumentations, e.g. ReadingDepth and RelatedArticles.
What We Learned
This issue took us a little over three months to track down and fix, so we had /a lot/ of time to reflect on what we could've done better.
How Are We Doing?
The Page Previews instrumentation is complex. Its complexity is proportional to the number of general questions we're asking about the feature, and it meant that implementing, testing, and QA'ing the instrumentation all took considerable time. Since our initial hypothesis was that the issue lay in the instrumentation itself, we also spent a comparable amount of time re-verifying that the instrumentation was working correctly.
Had we answered only one or two questions at a time, i.e. collected less data, we might have saved ourselves some time, as it would've been simpler to test our hypotheses. As software engineers, we regularly sacrifice velocity for confidence in our implementation; I don't see how this is any different.
QA Without A Test Plan
The Readers Web kanban board has a Needs QA column. For better or worse, technical tasks, like implementing the instrumentation, tend to skip this column, as QA tends to be done at the browser level. Moreover, the Readers Web engineers hadn't set an expectation that QA would be done as part of code review, and when it was, no test plan was created before or after merging the code.
This situation has since greatly improved in response to a variety of problems. We've agreed that all tasks should move from the Needs Code Review column to Needs QA after all of the associated changes have been merged into the codebase; if a task is an exception, then that must be documented. We've also agreed that before a task can be moved into the Needs QA column, it must have a test plan in its description.
We've yet to talk about setting expectations around planning, executing, and documenting QA as part of code review as a team. When we do, I'm sure that we'll write about it.
One of the goals of the Page Previews architecture and a driving force behind the design of new changes is testability. The extension is remarkably well covered by its unit tests.
The instrumentation is no exception and is 100% covered by unit tests. However, the focus of the unit tests was, and still is, the correctness of the properties of the events produced by the system, which is a result of misrepresenting events as POJOs rather than value objects or types. So while the vanity metric of coverage is maximized, we lack proof that key invariants hold.
Focusing on higher-level integration tests, fuzzing, and mutation testing to prove the instrumentation correct would have allowed us to reject our initial hypothesis, that the bug was in the instrumentation, immediately. On the other hand, the existing suite of unit tests will give us confidence when refactoring the system to allow for these changes.