
EventLogging schema modules take >1s to build (max: 22s)
Closed, Resolved · Public

Event Timeline

Wow. Gilles asked in an email if someone from Analytics could help. I can!

I don't know much about ResourceLoader, but I did just skim a little on mediawiki.org. Are these graphs the time it takes for browsers running EventLogging JS to pull in schemas from meta?

No, it's the time it takes to generate the JS on the backend, which happens more often than one might think. But at the end of the day a user request triggers it, and someone is waiting that long for those files to be generated before being able to download them.

Ok, I looked in the extension to see where it might reference those schemas, but I don't see any. How/why is the job deciding to download these schemas?

Well, since EventLogging users can create any number of schemas on the fly, I'm not surprised that those specific schemas aren't referenced in the code. I imagine that the EL extension accepts any schema name requested by ResourceLoader, attempts to locate the current schema data for that schema (wherever it is stored) if it exists, and generates the JS based on that.

Here's an example ResourceLoader request for the NavigationTiming schema's JS:

https://en.wikipedia.org/w/load.php?modules=schema.NavigationTiming (real requests have more parameters)

This should give you the following content:

mw.loader.implement("schema.NavigationTiming@15a5axw",function($,jQuery,require,module){mediaWiki.eventLog.declareSchema("NavigationTiming",{"schema":{"title":"NavigationTiming","properties":{"action":{"type":"string"},"lazyLoadImages":{"type":"string","enum":["A","B","C"]},"firstImage":{"type":"integer"},"connectEnd":{"type":"integer"},"connectStart":{"type":"integer"},"dnsLookup":{"type":"integer"},"domContentLoadedEventStart":{"type":"integer"},"domContentLoadedEventEnd":{"type":"integer"},"domComplete":{"type":"integer"},"domInteractive":{"type":"integer"},"domLoading":{"type":"integer"},"domainLookupStart":{"type":"integer"},"domainLookupEnd":{"type":"integer"},"fetchStart":{"type":"integer"},"isAnon":{"required":!0,"type":"boolean"},"isHttp2":{"type":"boolean"},"isHiDPI":{"type":"boolean"},"loadEventEnd":{"type":"integer"},"loadEventStart":{"type":"integer"},"mobileMode":{"enum":["alpha","beta","stable"],"type":"string"},"mediaWikiVersion":{"type":"string"},"originCountry":{
"type":"string"},"originRegion":{"type":"string"},"pageId":{"type":"integer"},"namespaceId":{"type":"integer"},"redirectCount":{"type":"integer"},"redirecting":{"type":"integer"},"redirectStart":{"type":"integer"},"redirectEnd":{"type":"integer"},"requestStart":{"type":"integer"},"responseEnd":{"type":"integer"},"responseStart":{"type":"integer"},"secureConnectionStart":{"type":"integer"},"unloadEventStart":{"type":"integer"},"unloadEventEnd":{"type":"integer"},"mediaWikiLoadComplete":{"type":"integer"},"mediaWikiLoadStart":{"type":"integer"},"mediaWikiLoadEnd":{"type":"integer"},"revId":{"type":"integer"},"firstPaint":{"type":"integer"}}},"revision":15485142});});

That is the JS representation the PHP side of the EL extension has generated, to be consumed by the JS side of the EL extension.

It's based on the contents of the schema as defined on meta:

https://meta.wikimedia.org/wiki/Schema:NavigationTiming

I'm not familiar with the EL code, so I don't know exactly how the extension goes from the schema definition to the generated JS for ResourceLoader, but it's that process that can sometimes take very long. And it shouldn't, because as you can see, turning the simple schema definition seen here into that small blurb of JS should never take 20 seconds.
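For illustration, here is a minimal sketch of what that PHP-side generation could look like. It is not the real extension code: the class and method names are hypothetical, and only the output format (the mediaWiki.eventLog.declareSchema call) is taken from the response above. It also shows why the string building itself should be cheap; the expensive part has to be obtaining the schema definition in the first place.

<?php
// Hypothetical sketch, not the real EventLogging code. Only the output
// format (mediaWiki.eventLog.declareSchema) mirrors the real response above.
class SchemaModuleSketch {
    private $schemaName; // e.g. "NavigationTiming"
    private $revisionId; // e.g. 15485142

    public function __construct( $schemaName, $revisionId ) {
        $this->schemaName = $schemaName;
        $this->revisionId = $revisionId;
    }

    // Build the JS payload that ResourceLoader wraps in mw.loader.implement().
    // $schema is the decoded JSON schema definition fetched from meta.
    public function getScript( array $schema ) {
        return 'mediaWiki.eventLog.declareSchema(' .
            json_encode( $this->schemaName ) . ',' .
            json_encode( [ 'schema' => $schema, 'revision' => (int)$this->revisionId ] ) .
            ');';
    }
}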

Oh, ok. I think I'm getting it. So a user requests a page (say, after a recent deploy) that includes EventLogging JS, and the backend ResourceLoader packages it all up and caches it in an efficient way for future loads of the same page? And in order to do so, it needs to grab any EventLogging JSON schemas that the particular page might use?

These schemas have to be requested from meta. Is it possible meta is just slow?

Yes, that's the chain of events. Anything's possible, but the issue is probably in the mechanism used to get the schema definition. Maybe it's going to the wiki externally via an API instead of reading the content from the database, that sort of thing. In that case the channel being used could indeed be slow at times.
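As a rough illustration of what "going to the wiki externally via an API" would look like, here is a plain-PHP sketch. This is an assumption, not the extension's actual code: the real extension presumably goes through MediaWiki's HTTP client and adds its own caching, and the action=jsonschema endpoint is the one that comes up later in this thread.

<?php
// Plain-PHP sketch of fetching a schema definition over HTTP from meta.
// Not the extension's actual code; this only shows the general shape of
// the external round trip.
function fetchSchemaOverHttp( $revId ) {
    $url = 'https://meta.wikimedia.org/w/api.php?action=jsonschema' .
        '&revid=' . rawurlencode( $revId ) . '&formatversion=2';
    $context = stream_context_create( [ 'http' => [ 'timeout' => 5 ] ] );
    $body = file_get_contents( $url, false, $context );
    if ( $body === false ) {
        return false; // network failure or timeout
    }
    return json_decode( $body, true );
}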

There is some very suspicious locking code with a 20s timeout in RemoteSchema.php.

My hunch would be that this optimization to avoid stampedes is backfiring.

I don't know if the meta database is accessible from other wikis, but not going over HTTP, as it seems to be doing, would probably be ideal.

But those really high timeouts are probably worth looking into before making drastic changes.

I just read over the code too, and I agree that locking code looks suspicious. If a bunch of requests come in at once and the schema is not cached, the first one to reach $this->lock() on line 87 will attempt the HTTP request to get the schema, and the others will all immediately return false. The return false will cause jsonSerialize to return new stdClass() as the schema. (I'm not sure what that will do to clients, but perhaps they are smart enough to request again?) I don't know if this would solve the problem, but perhaps the httpGet() function should delete the lock before returning?
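To make the suspected failure mode concrete, here is a simplified model of that code path. This is not the actual RemoteSchema.php code: $cache is a stand-in for memcached, the key names are invented, and fetchSchemaOverHttp() is the sketch from earlier in this thread. The commented-out delete shows the kind of fix suggested above.

<?php
// Simplified model of the suspected lock/stampede path. Not the actual
// RemoteSchema.php code: $cache stands in for memcached, keys are made up.
function getSchemaWithLock( $cache, $name, $revId ) {
    $key = "schema:$name:$revId";
    $cached = $cache->get( $key );
    if ( $cached !== false ) {
        return $cached;
    }

    // Only one request at a time is allowed to fetch the schema from meta.
    $lockKey = "$key:lock";
    if ( !$cache->add( $lockKey, 1, 20 ) ) { // 20 stands in for the 20s timeout noted above
        // Another request holds the lock: give up immediately.
        // The caller's jsonSerialize() then falls back to an empty object.
        return false;
    }

    $schema = fetchSchemaOverHttp( $revId ); // HTTP fetch sketched earlier
    if ( $schema !== false ) {
        $cache->set( $key, $schema );
    }
    // Suggested fix from the comment above: release the lock here instead of
    // letting it sit until it expires, e.g. $cache->delete( $lockKey );
    return $schema;
}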

In any case, if a schema is not cached, it will need to be requested via HTTP from meta wiki. I don't think adding a direct DB schema lookup would be wise. The schemas are stored as revision content. I don't know what DB these ResourceLoader jobs are using when they run, but yeah, it'd have to access metawiki DB content somehow, which seems a little weird.

Even if the schema is not cached in memcached, the HTTP URI it is fetched from is cacheable by Varnish. Bypassing the HTTP request to get the schema would mean losing that Varnish cache.

curl -I 'https://meta.wikimedia.org/w/api.php?action=jsonschema&revid=15496417&formatversion=2'
...
X-Cache: cp1067 hit/1, cp1066 miss
X-Cache-Status: hit

Internally, a miss returned to me in 0m0.053s, and externally, a miss returned in 0m0.399s. Just an anecdote, but unless there is some occasional slowdown on the meta app servers / DBs, I don't see how this could result in ~20s slowdowns.

They no longer happen for NavigationTiming or SaveTiming, but they are still visible for UniversalLanguageSelector.

The schema modules are all logically identical; the only difference is the plain JSON value representing the schema. So any function calls made in PHP to build the module will be identical.

The size of the schema also cannot be the issue since the Navigation Timing schema (at 6KB) is actually larger than the ULS schema (at 4KB).

The only remaining difference I see is the build rate. The Navigation Timing and Save Timing schema modules are "built" (i.e. a cache miss from Varnish) less than 1x per second (NavTiming: 0.1x per second; SaveTiming: 0.05x per second). The ULS schema, however, is rebuilt 5-15x per second. This seems rather suspicious, but I'll leave that for a separate task. Most likely the latency problem here relates to the general responsiveness of MediaWiki. At 5-15 builds per second, the ULS module is built roughly 300-900 times per minute, so 10-20 slow builds per minute is only a small fraction of them. Whether it's the load balancer, app servers, api servers, or databases, something somewhere causes that small fraction of these requests to consistently take over 1s at least 10-20 times every minute of every day.