Page MenuHomePhabricator

Investigate loadEventEnd / domComplete regression as of 2015-10-13
Closed, ResolvedPublic

Description

These metrics got suddenly worse on Tuesday, October 13:

Event Timeline

ori raised the priority of this task from to High.
ori updated the task description. (Show Details)
ori added a project: Performance-Team.
ori added a subscriber: ori.
ori raised the priority of this task from High to Unbreak Now!.Oct 21 2015, 7:34 AM
ori added a project: Release-Engineering-Team.
ori set Security to None.

@greg, could someone from your team help us isolate this by going over code changes that rolled out during this period? From graphite it looks like the regression rolled out sometime between 23:20 UTC on October 13.

Checked WebPageTest for that period, the data in Graphite is dead but when I compared between different runs in our WPT instance, there's one extra javascript that has beed added in that period (use the slider to compare the different waterfall graphs):
http://wpt.wmftest.org/video/compare.php?tests=151014_GC_8,151013_CK_40

That one:
https://en.wikipedia.org/w/load.php?debug=false&lang=en&modules=jquery.jStorage%7Cschema.TestSearchSatisfaction2&skin=vector&version=9e6e869d512b

Does that match what we have in Git?

Else I've seen some strange numbers comparing page sizes between runs (the total is larger but the individual changes doesn't add up, I'll add an issue for that at Github).

+ @EBernhardson who deployed some potentially related changes to Search


From production SAL https://tools.wmflabs.org/sal/production?d=2015-10-13

23:30	<ebernhardson@tin>	Synchronized wmf-config/InitialiseSettings.php: Set $wgUploadNavigationUrl to use uselang=$lang for commonsuploads wikis by default (duration: 01m 14s)
23:27	<ebernhardson@tin>	Synchronized php-1.27.0-wmf.2/extensions/WikimediaEvents/: Turn on cirrus common terms test (duration: 01m 15s)
23:25	<ebernhardson@tin>	Synchronized php-1.27.0-wmf.2/extensions/CirrusSearch/: Add information for common terms ab test (duration: 01m 15s)
23:19	<ebernhardson@tin>	Synchronized php-1.27.0-wmf.2/extensions/Flow: Make flow board descriptions editable again (duration: 01m 16s)
23:13	<ebernhardson@tin>	Synchronized php-1.27.0-wmf.3/extensions/Flow: Bump flow submodule in 1.27.0-wmf.3 (duration: 01m 15s)
23:10	<ebernhardson@tin>	Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/: Update WME for common terms AB test (duration: 01m 14s)
23:08	<ebernhardson@tin>	Synchronized php-1.27.0-wmf.3/extensions/CirrusSearch/: Update cirrus for common terms AB tes (duration: 01m 15s)
23:02	<ebernhardson@tin>	Synchronized wmf-config/InitialiseSettings.php: Turn on logging of the "warning" channel (duration: 01m 13s)

2015-10-13 23:00–00:00 UTC was an evening SWAT deployment slot: https://wikitech.wikimedia.org/wiki/Deployments/Archive/2015/10#Tuesday.2C.C2.A0October.C2.A013

Roan (RoanKattouw)

[wmf.2] [wmf.3] Gerrit 246079 Gerrit 246076 Make Flow board descriptions editable again
[wmf.3] Gerrit 246080 Fix missing summarize step when resolving Flow topics
[wmf.3] Gerrit 246124 Un-disappear the categories list in Flow

Erik B (ebernhardson)

[config] Gerrit 244832 - Log messages to the 'warning' channel

[wmf.2] [wmf.3] CirrusSearch Gerrit 246108 Gerrit 246109 Add info about common terms to output for A/B test
[wmf.2] [wmf.3] WikimediaEvents Gerrit 246119 Gerrit 246120 Add more information to the TestSearchSatisfaction schema
[wmf.2] [wmf.3] WikimediaEvents Gerrit 246123 Gerrit 246122 Add common terms A/B test to satisfaction schema

Alex (Krenair)

[config] Gerrit 245945 Set $wgUploadNavigationUrl to use uselang=$lang for commonsuploads wikis by default


The WikimediaEvents changes from CirrusSearch and WikimediaEvents in master branches read as:

CirrusSearch rECIRab19311a7fc8: Report common terms usage to javascript

Report common terms usage to javascript

The eventlogging code in the frontend needs to know if the
common terms query it turned on was actually used. This is
a bit of a hack, but should be fine for a few weeks. It
should be reverted after the test is complete.

Injects a JS variable wgCirrusCommonTermsApplicable:

$wgOut->addJsConfigVars( 'wgCirrusCommonTermsApplicable', true );

WikimediaEvents rEWMVc5b6de098f16: Add new fields to TestSearchSatisfaction

Add new fields to TestSearchSatisfaction

Depends on Ia0e8a95 in mediawiki/core for access to
the exact query thta was run, and for the new wprov=
attached to all search result links.
Bug: T113291

Use a new version of EventLogging schema schema.TestSearchSatisfaction2 and some scary (to me) JavaScript https://gerrit.wikimedia.org/r/#/c/240924/10/modules/ext.wikimediaEvents.search.js

That is for T113291: Add new fields to searchsatisfaction2 schema.


WikimediaEvents rEWMV4d3bf0b16691: Add common terms A/B test to search satisfaction

Add common terms A/B test to search satisfaction

Bug: T115024

Follow up previous patch and further bump EventLogging schema.TestSearchSatisfaction2. JavaScript is at https://gerrit.wikimedia.org/r/#/c/245627/10/modules/ext.wikimediaEvents.search.js

That is for T115024: Code A/B test for AND operator relaxation test.

TL;DR above:

Both WikimediaEvents make use of $.jStorage and are related to the EventLogging schema TestSearchSatisfication2. So that is inline with @Peter found out and the addition of the script:

https://en.wikipedia.org/w/load.php?debug=false&lang=en&modules=jquery.jStorage%7Cschema.TestSearchSatisfaction2&skin=vector&version=9e6e869d512b

So maybe adding that new .js slightly delay the load time?

The root cause would be both of:

I am not sure how it works, but if that script is only used for search queries, maybe it can be loaded later on or only when the search input get focus. (random thought with no idea how all the JS/Frontend works).

rECIRab19311a7fc8: Report common terms usage to javascript

unlikely to be the cause, the change here is just too small. it amounts to a couple dozen bytes embedded into the page.

rEWMVc5b6de098f16: Add new fields to TestSearchSatisfaction

This does load a bit more code than before, but IIUC jquery.jStorage in particular shouldn't be the culprit since the extra scripts are brought in by mw.loader.load which shouldn't (again, IIUC. i've been wrong before) cause changes to the time to initial domComplete. I can refactor this bit of code to push any potential data loading after a $( document ).ready() to guarantee that though. Additionally i could refactor out the code that runs after mw.loader.using() such that it is also loaded later (when necessary) rather than for all users.

But i don't really know beyond guessing if this is the culprit.
Unfortunately the satisfaction schema doesn't just measure usage of the search inputs, it measures how much the user interacts with our sites after clicking through a search result to an article page.

Reverted wmf.2 and wmf.3 back to before rEWMVc5b6de098f16: Add new fields to TestSearchSatisfaction
with https://gerrit.wikimedia.org/r/#/c/247853/ and https://gerrit.wikimedia.org/r/#/c/247853/ and 8:50 am PDT.

Post-deploy the timing dipped, came back, and dipped again. Will look again this afternoon

The sheer payload size of adding a new script should in itself not cause a regression of this sort, so I suspect is it not so much the adding of the script but rather what the script is doing.

Either:

  • The new version of the ext.wikimediaEvents.search.js script may contain code (or triggers code, such as jStorage) that is synchronous and expensive.
    • The use of $( window ).scrollTop(); seems suspicious and could be a notable cost, one that doesn't belong in the top scope.
  • or; jStorage itself may have unconditional initialisation code that is expensive and should be deferred.

On https://grafana.wikimedia.org/dashboard/db/performance-metrics?panelId=8&fullscreen&from=now-14d the increase can be seen as almost half a second:

Krinkle renamed this task from Investigate loadEventEnd / domComplete regression to Investigate loadEventEnd / domComplete regression as of 2015-10-13.Oct 21 2015, 5:00 PM

Looks like rEWMVc5b6de was indeed the cause of the regression. Thanks for the help, @hashar, and thanks @EBernhardson for the quick response. Erik, could you please solicit a review of this change from someone on the performance team before re-deploying?

@ori wrote:

Thanks for the help, @hashar,

You are most welcome. It is been quite fun to dig in the log while you guys were asleep, knowing you will wake up with some digging result :-}

Kudos on shaving some 100ms.

I've asked our two new frontend engineers in discovery, @JGirault and @Jdrewniak for one of them to pick this up and try and work out what caused the perf regression. I'll make sure to ping you when we've figured something out.

This has been redeployed in wed morning SWAT (~16:45 UTC), i looked at the last 12 hours of data[1] 4 hours after deployment and it looks reasonable, but please double check.

[1] http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1445445736.446&from=-12hours&target=movingAverage(timeShift(coal.domComplete%2C%221d%22)%2C10)&target=movingAverage(coal.domComplete%2C10)

https://grafana.wikimedia.org/dashboard/db/performance-metrics?panelId=8&fullscreen&from=now-12h#domComplete

It was deployed around 16:45 which is normally the low point of the day (compare to yellow and blue which are the timeshifts), it seems this did cause a notable 150ms spike in the hour that followed, but it went back in into normal numbers after that. Probably just a small burst from cache invalidation.