
EventLogging schema modules take >1s to build (max: 22s)
Closed, Resolved · Public

Event Timeline

Wow. Gilles asked in an email if someone from Analytics could help. I can!

I don't know much about ResourceLoader, but I did just skim a little on mediawiki.org. Are these graphs the time it takes for browsers running EventLogging JS to pull in schemas from meta?

No, it's the time it takes to generate the JS on the backend, which happens more often than one might think. But at the end of the day a user request triggers it, and someone is waiting that long for those files to be generated before being able to download them.

Ok, I looked in the extension to see where it might reference those schemas, but I don't see any. How/why is the job deciding to download these schemas?

Well, since EventLogging users can create any number of schemas on the fly, I'm not surprised that those specific schemas aren't referenced in the code. I imagine that the EL extension accepts any schema name requested by ResourceLoader, attempts to locate the current schema data for that schema (wherever it is stored) if it exists, and generates the JS based on that.

Here's an example ResourceLoader request for the NavigationTiming schema's JS:

https://en.wikipedia.org/w/load.php?modules=schema.NavigationTiming (real requests have more parameters)

This should give you the following content:

mw.loader.implement("schema.NavigationTiming@15a5axw",function($,jQuery,require,module){mediaWiki.eventLog.declareSchema("NavigationTiming",{"schema":{"title":"NavigationTiming","properties":{"action":{"type":"string"},"lazyLoadImages":{"type":"string","enum":["A","B","C"]},"firstImage":{"type":"integer"},"connectEnd":{"type":"integer"},"connectStart":{"type":"integer"},"dnsLookup":{"type":"integer"},"domContentLoadedEventStart":{"type":"integer"},"domContentLoadedEventEnd":{"type":"integer"},"domComplete":{"type":"integer"},"domInteractive":{"type":"integer"},"domLoading":{"type":"integer"},"domainLookupStart":{"type":"integer"},"domainLookupEnd":{"type":"integer"},"fetchStart":{"type":"integer"},"isAnon":{"required":!0,"type":"boolean"},"isHttp2":{"type":"boolean"},"isHiDPI":{"type":"boolean"},"loadEventEnd":{"type":"integer"},"loadEventStart":{"type":"integer"},"mobileMode":{"enum":["alpha","beta","stable"],"type":"string"},"mediaWikiVersion":{"type":"string"},"originCountry":{
"type":"string"},"originRegion":{"type":"string"},"pageId":{"type":"integer"},"namespaceId":{"type":"integer"},"redirectCount":{"type":"integer"},"redirecting":{"type":"integer"},"redirectStart":{"type":"integer"},"redirectEnd":{"type":"integer"},"requestStart":{"type":"integer"},"responseEnd":{"type":"integer"},"responseStart":{"type":"integer"},"secureConnectionStart":{"type":"integer"},"unloadEventStart":{"type":"integer"},"unloadEventEnd":{"type":"integer"},"mediaWikiLoadComplete":{"type":"integer"},"mediaWikiLoadStart":{"type":"integer"},"mediaWikiLoadEnd":{"type":"integer"},"revId":{"type":"integer"},"firstPaint":{"type":"integer"}}},"revision":15485142});});

That is the JS representation the PHP side of the EL extension has generated, to be consumed by the JS side of the EL extension.

It's based on the contents of the schema as defined on meta:

https://meta.wikimedia.org/wiki/Schema:NavigationTiming

I'm not familiar with the EL code, so I don't know exactly how the extension goes from the schema definition to the generated JS for ResourceLoader, but it's that process that can sometimes take very long. And it shouldn't, because as you can see, turning the simple schema definition seen here into that small blurb of JS should never take 20 seconds.
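For illustration, here is a minimal sketch of what that PHP-side generation could look like. It is not the real extension code: the class and method names are hypothetical, and only the output format (the mediaWiki.eventLog.declareSchema call) is taken from the response above. It also shows why the string building itself should be cheap; the expensive part has to be obtaining the schema definition in the first place.

<?php
// Hypothetical sketch, not the real EventLogging code. Only the output
// format (mediaWiki.eventLog.declareSchema) mirrors the real response above.
class SchemaModuleSketch {
    private $schemaName; // e.g. "NavigationTiming"
    private $revisionId; // e.g. 15485142

    public function __construct( $schemaName, $revisionId ) {
        $this->schemaName = $schemaName;
        $this->revisionId = $revisionId;
    }

    // Build the JS payload that ResourceLoader wraps in mw.loader.implement().
    // $schema is the decoded JSON schema definition fetched from meta.
    public function getScript( array $schema ) {
        return 'mediaWiki.eventLog.declareSchema(' .
            json_encode( $this->schemaName ) . ',' .
            json_encode( [ 'schema' => $schema, 'revision' => (int)$this->revisionId ] ) .
            ');';
    }
}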

Oh, ok. I think I'm getting it. So a user requests a page (say, after a recent deploy) that includes EventLogging JS, and the backend ResourceLoader packages it all up and caches it in an efficient way for future loads of the same page? And in order to do so, it needs to grab any EventLogging JSON schemas that the particular page might use?

These schemas have to be requested from meta. Is it possible meta is just slow?

Yes, that's the chain of events. Anything's possible, but the issue is probably in the mechanism used to get the schema definition. Maybe it's going to the wiki externally via an API instead of reading the content from the database, that sort of thing. In that case the channel being used could indeed be slow at times.
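As a rough illustration of what "going to the wiki externally via an API" would look like, here is a plain-PHP sketch. This is an assumption, not the extension's actual code: the real extension presumably goes through MediaWiki's HTTP client and adds its own caching, and the action=jsonschema endpoint is the one that comes up later in this thread.

<?php
// Plain-PHP sketch of fetching a schema definition over HTTP from meta.
// Not the extension's actual code; this only shows the general shape of
// the external round trip.
function fetchSchemaOverHttp( $revId ) {
    $url = 'https://meta.wikimedia.org/w/api.php?action=jsonschema' .
        '&revid=' . rawurlencode( $revId ) . '&formatversion=2';
    $context = stream_context_create( [ 'http' => [ 'timeout' => 5 ] ] );
    $body = file_get_contents( $url, false, $context );
    if ( $body === false ) {
        return false; // network failure or timeout
    }
    return json_decode( $body, true );
}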

There is some very suspicious locking code with a 20s timeout in RemoteSchema.php.

My hunch would be that this optimization to avoid stampedes is backfiring.

I don't know if the meta database is accessible from other wikis, but not going over HTTP, as it seems to be doing, would probably be ideal.

But those really high timeouts are probably worth looking into before making drastic changes.

I just read over the code too, and I agree that locking code looks suspicious. If a bunch of requests come in at once and the schema is not cached, the first one to reach $this->lock() on line 87 will attempt the HTTP request to get the schema, and the others will all immediately return false. The return false will cause jsonSerialize to return new stdClass() as the schema. (I'm not sure what that will do to clients, but perhaps they are smart enough to request again?) I don't know if this would solve the problem, but perhaps the httpGet() function should delete the lock before returning?
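To make the suspected failure mode concrete, here is a simplified model of that code path. This is not the actual RemoteSchema.php code: $cache is a stand-in for memcached, the key names are invented, and fetchSchemaOverHttp() is the sketch from earlier in this thread. The commented-out delete shows the kind of fix suggested above.

<?php
// Simplified model of the suspected lock/stampede path. Not the actual
// RemoteSchema.php code: $cache stands in for memcached, keys are made up.
function getSchemaWithLock( $cache, $name, $revId ) {
    $key = "schema:$name:$revId";
    $cached = $cache->get( $key );
    if ( $cached !== false ) {
        return $cached;
    }

    // Only one request at a time is allowed to fetch the schema from meta.
    $lockKey = "$key:lock";
    if ( !$cache->add( $lockKey, 1, 20 ) ) { // 20 stands in for the 20s timeout noted above
        // Another request holds the lock: give up immediately.
        // The caller's jsonSerialize() then falls back to an empty object.
        return false;
    }

    $schema = fetchSchemaOverHttp( $revId ); // HTTP fetch sketched earlier
    if ( $schema !== false ) {
        $cache->set( $key, $schema );
    }
    // Suggested fix from the comment above: release the lock here instead of
    // letting it sit until it expires, e.g. $cache->delete( $lockKey );
    return $schema;
}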

In any case, if a schema is not cached, it will need to be requested via HTTP from meta wiki. I don't think adding a direct DB schema lookup would be wise. The schemas are stored as revision content. I don't know what DB these ResourceLoader jobs are using when they run, but yeah, it'd have to access metawiki DB content somehow, which seems a little weird.

Even if the schema is not cached in memcached, the HTTP URI it is fetched from is cacheable by Varnish. Bypassing the HTTP request to get the schema would mean losing that Varnish cache.

curl -I 'https://meta.wikimedia.org/w/api.php?action=jsonschema&revid=15496417&formatversion=2'
...
X-Cache: cp1067 hit/1, cp1066 miss
X-Cache-Status: hit

Internally, a miss returned to me in 0m0.053s, and externally, a miss returned in 0m0.399s. Just an anecdote, but unless there is some occasional slowdown on the meta app servers / DBs, I don't see how this could result in ~20s slowdowns.

They no longer happen for NavigationTiming or SaveTiming, but they are still visible for UniversalLanguageSelector.

The schema modules are all logically identical; the only difference is the plain JSON value representing the schema. So any function calls made in PHP to build the module will be identical.

The size of the schema also cannot be the issue since the Navigation Timing schema (at 6KB) is actually larger than the ULS schema (at 4KB).

The only remaining difference I see is the build rate. The Navigation Timing and Save Timing schema modules are "built" (i.e. a cache miss from Varnish) less than 1x per second (NavTiming: 0.1x per second; SaveTiming: 0.05x per second). The ULS schema, however, is rebuilt 5-15x per second. This seems rather suspicious, but I'll leave that for a separate task. Most likely the latency problem here relates to the general responsiveness of MediaWiki. At 5-15 builds per second, the ULS module is built roughly 300-900 times per minute, so 10-20 slow builds per minute is only a small fraction of them. Whether it's the load balancer, app servers, api servers, or databases, something somewhere causes that small fraction of these requests to consistently take over 1s at least 10-20 times every minute of every day.