TypeError: Cannot read properties of undefined (reading 'isElementCloseToViewport')
Closed, Resolved · Public · PRODUCTION ERROR

Description

Following this week's train, a sudden error spike was observed, mostly on da.m.wikipedia.org but also on other wikis, e.g. Japanese Wikipedia. It seems to be limited to iOS (Safari and Chrome).

It quickly resolved itself. This week's train removed the mediawiki.viewport library from the code path, which suggests that old RelatedArticles code was loaded with the new list of module dependencies. This might point to a ResourceLoader issue.

Screen Shot 2021-11-04 at 1.35.45 PM.png (606×2 px, 128 KB)

https://logstash.wikimedia.org/app/dashboards#/doc/logstash-*/logstash-2021.11.04?id=1vN_7HwB1RMACsoSHaqJ

Error
normalized_message: TypeError: Cannot read properties of undefined (reading 'isElementCloseToViewport')

Event Timeline

This bug looks like it was present in production for 5 minutes. I can't see anything problematic in the RelatedArticles extension or da.m.wikipedia.org's site scripts/gadgets that might have caused this.

The patch that introduced it is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/RelatedArticles/+/734734/4/extension.json

This bug suggests that changes in dependencies are not being cached together with scripts, which would be concerning if true.

Krinkle triaged this task as Medium priority.

This was most likely the result of a rare but not impossible deployment race condition. I believe this is inevitable at our scale, and our handling actually avoids worse errors that would be wider in reach and happen more frequently.

When a user views a page whilst we are mid-deployment, imagine their first request (e.g. startup module, vendor bundle, or other init script) happens to have just expired from the CDN cache and got served by a backend with the new code. As such, this script will refer to something like wikimediaevents_newVersion.js. Then their browser requests that second script fast enough to still be within the 50 seconds it takes to sync our servers, and happens to be served by one of the (shrinking in numbers) servers with the old code.

The one option the server doesn't have is to simply respond A-OK with the requested script. It doesn't have any of the new code. In some cases, we will then serve what we do have and instruct all layers (web server, CDN, browser cache, localStorage) to treat it as temporary.

This might sound like a flaw that could cause breakage on a regular basis. If we were a static site without something like ResourceLoader's counter-measures, that would probably be true. In actuality, this might be the first time in seven years that we've seen console errors via this (never mind impact that is more user-noticeable, on more wikis, or for longer than a few minutes with automatic self-correction).

As part of drafting a blogpost, I'll try to explain this as best I can here first. Feel free to ignore it, but if you want to understand why I'm considering this "normal", feel free to read along or ask questions. (Normal, so long as it's short-lived, only on some wikis, and less than once a year.)


The race condition (in detail):

  • when a deployment is in progress,
  • and the deployment is removing a hard-dependency from a JS bundle,
  • and right now some servers still run a snapshot of old code only,
  • and the page or script that refers to your bundle has just expired from the CDN cache,
  • and nobody requested the script between it expiring and the deployment starting,
  • and the request for the HTML page happens to be load-balanced to a server with the new code,
  • and the user's browser and network are fast enough, and our deployment takes long enough, for there to still be servers with the old code by the time they have rendered the page and started to request the changed JS bundle,
  • and the request for the changed bundle happens to be load-balanced to a different server that is one of the (shrinking in numbers) "old" servers.

This is not specific to MediaWiki or even dynamic web apps. This race condition is known internally and explicitly designed against in ResourceLoader (relevant server code, relevant client code).

Example race condition with WMF details:

  • We are running version A.
  • We push a deployment for version B. Behind the scenes, this will take a few seconds to propagate globally to servers in all high-availability clusters and data centers. For easy reference, let us say we only have 10 servers, and it takes 10 seconds to sync them, and we'll simulate a person's journey in slow motion at one bullet point per second.
  • Second 1: Alice's browser requests a page on your website. The page HTML is served from Server 1, which just received version B. This means it will contain information like "mything: depends on X, has version B" and plans to make a request like /load/mything.js?v=B.
  • Second 2: The browser downloads this HTML response from Server 1. The browser starts parsing and rendering content. Our deployment has reached Server 2 meanwhile.
  • Second 3: The browser requests /load/mything.js?v=B. It is served from Server 8. There is only one mything.js file on disk and it is version A here. The query parameter is just for cache busting.
  • Second 4: The browser receives the response to /load/mything.js?v=B, which contains version A code.
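
The timeline above can be replayed as a small simulation. This is only a sketch of the simplified example (10 servers, one server synced per second); `server_version` and `request` are made-up helper names, not anything from Scap or ResourceLoader.

```python
from dataclasses import dataclass

NUM_SERVERS = 10  # the simplified fleet from the example above


def server_version(server: int, second: int) -> str:
    """Version on `server` (1-based) at `second`.

    Assumption: the deployment reaches server N during second N.
    """
    return "B" if second >= server else "A"


@dataclass
class Response:
    second: int
    server: int
    requested: str  # version the client asked for
    served: str     # version the server actually has on disk


def request(second: int, server: int, requested: str) -> Response:
    # A server can only serve the code it currently has.
    return Response(second, server, requested, server_version(server, second))


# Second 1: page HTML comes from Server 1, which already has version B.
html = request(second=1, server=1, requested="B")

# Second 3: the script the HTML asked for (v=B) is routed to Server 8.
script = request(second=3, server=8, requested=html.served)

print(html)
print(script)
print("version mismatch:", script.requested != script.served)
```

Swapping Server 8 for any server numbered 3 or lower makes the mismatch disappear, which is why the last condition in the list depends on load-balancing luck.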

Why it's unlikely:

  • The deployment has to remove a hard dependency from a JS module.
  • The CDN cache entry for the startup module has to have just expired.
  • Nobody has requested the startup module on this domain between the CDN expiring it and the start of your deployment. The startup module is cached for 10min in total. Last week load.php averaged a cache hit ratio of 99.92%.
  • Someone's browser now requests the startup module, and it reaches our data center just after the "canary and sync" steps of Scap have begun, but before these steps are finished. These steps take ~50s, or 1m30s including canary servers.
  • Their browser and network connection are fast enough to render the page and request the affected module before the Scap deploy finishes. Lazy-loaded features are not affected, such as Search, Popups, VisualEditor, ULS, etc. These are unlikely to be used early enough in the page life cycle to hit the race condition. By the time they load, the deployment will have finished.
  • The Scap deploy continues to reach more of the production servers.
  • The same browser now requests the affected page module (WikimediaEvents), and is additionally unlucky enough to be routed to one of the last servers waiting for the sync.

Why it's unavoidable:

We can't serve what we don't have. The server has to do something.

To avoid it, one would need a higher-level coordination measure at the CDN level that keeps track of HTTPS connections or end-user IP addresses and buckets backend cache-miss requests, such that during a deploy subsequent requests from the same IP only go to a server with the same or newer version. That kind of precision seems non-trivial and I'm not aware of this being done in any major app platform for static content. We make it very unlikely (per above) and we handle it quite well (per below), so it's hard to justify the complexity investment required to orchestrate this. Having said that, Kubernetes will make this relatively easy (though at a cost) through a forward-rolling deployment: tell it to spin up a number of temporary extra pods, route new traffic to these, upgrade most other pods, and delete any excess pods.
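
To give a feel for what such coordination would involve, here is a minimal sketch of a router that remembers the newest version each client has been served and never sends that client to an older server. Everything in it (`VersionAwareRouter`, the client key, integer versions) is hypothetical; it is not how our CDN or load balancers work today.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class VersionAwareRouter:
    # server name -> deployed version number (higher is newer)
    servers: Dict[str, int]
    # client key (e.g. IP address) -> newest version already served to it
    seen: Dict[str, int] = field(default_factory=dict)

    def route(self, client: str, preferred: str) -> str:
        """Pick a server for `client`.

        `preferred` is what plain load balancing would have chosen. It is
        overridden only if it runs code older than this client has seen.
        """
        floor = self.seen.get(client, 0)
        if self.servers[preferred] >= floor:
            chosen = preferred
        else:
            # Any server at or above the floor will do; take the first by name.
            chosen = min(s for s, v in self.servers.items() if v >= floor)
        self.seen[client] = max(floor, self.servers[chosen])
        return chosen


router = VersionAwareRouter({"server1": 2, "server8": 1})
first = router.route("alice", preferred="server1")   # HTML from new code
second = router.route("alice", preferred="server8")  # would have been old code
print(first, second)
```

Even this toy version needs per-client state at the routing layer for the duration of every deploy, which is the complexity cost referred to above.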

Why it's okay:

Imagine a typical bundler and statically generated app with files like mything_v123.js, served by a web host or cloud provider. On a high-traffic site, the above scenario could result in a 404 Not Found error every time the race condition is hit. This 404 would then be cached for however long the CDN caches those (e.g. 15 min), after which things self-correct. Not terrible, but also not great that it causes such a hard failure every time the race condition is met.

By comparison, MediaWiki isn't affected by cache misses for the HTML and logged-in sessions, because our page HTML doesn't hardcode links to script or style versions directly. Our "startup" module eliminates a significant surface area of this race condition because it is consistently cached for 5-10 minutes, thus likely glossing over the deployment event. The startup module means that, even on a long-lived browser tab with lazy-loaded code and deployments happening in between, you will generally continue to enjoy an internally consistent snapshot of the codebase. Even after a deployment, you'll still know the version that goes together with what you already have, and can find it in localStorage, browser cache or the CDN. We never proactively purge any load.php URLs. The startup module also avoids large cache invalidation events by allowing a dependency to change without all consumers needing to be invalidated (Blogpost by philipwalton, Docs).
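
As a rough illustration of the "consistent snapshot" idea: a tab that fetched its manifest before a deployment keeps asking for the versions named in that manifest, and those URLs are still cached because nothing purges them. The manifest shape, the module names and the URL scheme below are simplified assumptions, not the real startup module format.

```python
from typing import Dict

# Hypothetical manifest shape: module name -> version.
Manifest = Dict[str, str]


def script_url(manifest: Manifest, module: str) -> str:
    # The version comes from the manifest the tab already holds,
    # not from whatever is currently deployed.
    return f"/load/{module}.js?v={manifest[module]}"


# CDN cache keyed by full URL; old entries are not purged on deploy.
cdn_cache = {
    "/load/mything.js?v=A": "mything code, version A",
    "/load/lazyfeature.js?v=A": "lazyfeature code, version A",
}

# Manifest held by a long-lived tab that was opened before the deployment.
tab_manifest: Manifest = {"mything": "A", "lazyfeature": "A"}

# After the deployment, new page views receive a different manifest.
new_manifest: Manifest = {"mything": "B", "lazyfeature": "B"}

# The old tab lazy-loads a feature later and still resolves a version A URL.
url = script_url(tab_manifest, "lazyfeature")
print(url, "->", cdn_cache.get(url))
```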

Since we handle /load/mything.js?v=123 dynamically in PHP, we can fall back to serving something else. 99% of the time when a change gets deployed, it isn't a complete refactor with new or removed hard dependencies that weren't loaded by anything else on the page already. So falling back in this way means stuff generally "just" works.

On top of that, we actually do enforce version validation. Unlike with static files, for ResourceLoader version= isn't merely garbage to bust the cache. We actually validate it (relevant server code, relevant client code), and then shorten the CDN cache and disable local caching to prevent "cache poisoning" of new code under an old storage key, and vice versa.
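
The validation step might be sketched roughly like this. It is a simplified model, not the actual ResourceLoader code: `respond`, `ScriptResponse` and both max-age values are assumptions chosen for illustration, and the real logic is split between server and client.

```python
from dataclasses import dataclass

# Hypothetical cache lifetimes, for illustration only.
LONG_MAX_AGE = 30 * 24 * 3600  # requested version matches what we have
SHORT_MAX_AGE = 5 * 60         # mismatch: treat the response as temporary


@dataclass
class ScriptResponse:
    body: str
    served_version: str
    max_age: int             # how long the CDN and browser may cache it
    allow_local_store: bool  # whether the client may keep it in localStorage


def respond(requested_version: str, current_version: str,
            current_code: str) -> ScriptResponse:
    """Serve what this server has, and mark it temporary on a mismatch."""
    matches = requested_version == current_version
    return ScriptResponse(
        body=current_code,  # we can't serve what we don't have
        served_version=current_version,
        max_age=LONG_MAX_AGE if matches else SHORT_MAX_AGE,
        allow_local_store=matches,
    )


ok = respond("A", "A", "code A")
stale = respond("B", "A", "code A")  # new URL reaches a server with old code
print(ok)
print(stale)
```

The point of the short lifetime is that version A code stored under a v=B URL disappears from every cache layer soon after the deployment finishes.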

Lastly, due to the way module execution is isolated in ResourceLoader (despite being downloaded in a batch), we not only avoid a 404 Error for lots of unrelated modules in the same batch, we also still get to load and initialise those other features. For this task, the dependency error is caught early and basically means the module is treated as if it couldn't be loaded at all, similar to what might happen if your network connection was spotty or if a server somewhere timed out for one of the JS files. This is among many reasons why I describe the web (and our platform) as a layered architecture. Modern (Grade A) is a layer on top of Basic. They are not separate experiences. Whenever possible, start with an HTML payload that is resilient and works regardless of the many ways in which JS payloads may arrive partly, very late, or not at all.
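
A minimal sketch of that isolation, assuming a made-up `run_batch` helper rather than the real client: each module in a batch is initialised inside its own error boundary, so one module failing on a missing dependency does not stop the rest.

```python
from typing import Callable, Dict, List, Tuple

Env = Dict[str, object]


def run_batch(modules: Dict[str, Callable[[Env], None]],
              env: Env) -> Tuple[List[str], Dict[str, str]]:
    """Initialise each module separately; collect failures instead of aborting."""
    loaded: List[str] = []
    failed: Dict[str, str] = {}
    for name, init in modules.items():
        try:
            init(env)
            loaded.append(name)
        except Exception as exc:
            # Treated as if this one module could not be loaded at all.
            failed[name] = repr(exc)
    return loaded, failed


def related_articles(env: Env) -> None:
    # Old code that still expects a viewport helper the new
    # dependency list no longer loads.
    env["viewport"]["isElementCloseToViewport"]  # type: ignore[index]


def search(env: Env) -> None:
    env["search_ready"] = True


env: Env = {}  # no "viewport" entry: the dependency was never loaded
loaded, failed = run_batch(
    {"related_articles": related_articles, "search": search}, env
)
print("loaded:", loaded)
print("failed:", failed)
```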

Thanks for this informative write-up.

Please consider leaving a link to this comment / task on https://www.mediawiki.org/wiki/ResourceLoader (or another page), as I can imagine it's going to be useful to refer back to this on future Wikimedia-production-error tasks.