
Investigate loadEventEnd / domComplete regression as of 2015-10-13
Closed, Resolved · Public


These metrics got suddenly worse on Tuesday, October 13:

Event Timeline

ori raised the priority of this task from to High.
ori updated the task description. (Show Details)
ori added a project: Performance-Team.
ori added a subscriber: ori.
ori raised the priority of this task from High to Unbreak Now!. Oct 21 2015, 7:34 AM
ori added a project: Release-Engineering-Team.
ori set Security to None.

@greg, could someone from your team help us isolate this by going over code changes that rolled out during this period? From Graphite it looks like the regression rolled out sometime around 23:20 UTC on October 13.

Checked WebPageTest for that period. The data in Graphite is dead, but when I compared different runs in our WPT instance, there's one extra JavaScript file that was added in that period (use the slider to compare the different waterfall graphs): 151013_CK_40

That one:

Does that match what we have in Git?

Also, I've seen some strange numbers comparing page sizes between runs (the total is larger but the individual changes don't add up); I'll file an issue for that on GitHub.

+ @EBernhardson, who deployed some potentially related changes to Search

From production SAL

23:30	<ebernhardson@tin>	Synchronized wmf-config/InitialiseSettings.php: Set $wgUploadNavigationUrl to use uselang=$lang for commonsuploads wikis by default (duration: 01m 14s)
23:27	<ebernhardson@tin>	Synchronized php-1.27.0-wmf.2/extensions/WikimediaEvents/: Turn on cirrus common terms test (duration: 01m 15s)
23:25	<ebernhardson@tin>	Synchronized php-1.27.0-wmf.2/extensions/CirrusSearch/: Add information for common terms ab test (duration: 01m 15s)
23:19	<ebernhardson@tin>	Synchronized php-1.27.0-wmf.2/extensions/Flow: Make flow board descriptions editable again (duration: 01m 16s)
23:13	<ebernhardson@tin>	Synchronized php-1.27.0-wmf.3/extensions/Flow: Bump flow submodule in 1.27.0-wmf.3 (duration: 01m 15s)
23:10	<ebernhardson@tin>	Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/: Update WME for common terms AB test (duration: 01m 14s)
23:08	<ebernhardson@tin>	Synchronized php-1.27.0-wmf.3/extensions/CirrusSearch/: Update cirrus for common terms AB tes (duration: 01m 15s)
23:02	<ebernhardson@tin>	Synchronized wmf-config/InitialiseSettings.php: Turn on logging of the "warning" channel (duration: 01m 13s)

2015-10-13 23:00–00:00 UTC was an evening SWAT deployment slot:

Roan (RoanKattouw)

[wmf.2] [wmf.3] Gerrit 246079 Gerrit 246076 Make Flow board descriptions editable again
[wmf.3] Gerrit 246080 Fix missing summarize step when resolving Flow topics
[wmf.3] Gerrit 246124 Un-disappear the categories list in Flow

Erik B (ebernhardson)

[config] Gerrit 244832 - Log messages to the 'warning' channel

[wmf.2] [wmf.3] CirrusSearch Gerrit 246108 Gerrit 246109 Add info about common terms to output for A/B test
[wmf.2] [wmf.3] WikimediaEvents Gerrit 246119 Gerrit 246120 Add more information to the TestSearchSatisfaction schema
[wmf.2] [wmf.3] WikimediaEvents Gerrit 246123 Gerrit 246122 Add common terms A/B test to satisfaction schema

Alex (Krenair)

[config] Gerrit 245945 Set $wgUploadNavigationUrl to use uselang=$lang for commonsuploads wikis by default

The corresponding changes in the CirrusSearch and WikimediaEvents master branches read as:

CirrusSearch rECIRab19311a7fc8: Report common terms usage to javascript

Report common terms usage to javascript

The eventlogging code in the frontend needs to know if the
common terms query it turned on was actually used. This is
a bit of a hack, but should be fine for a few weeks. It
should be reverted after the test is complete.

Injects a JS variable wgCirrusCommonTermsApplicable:

$wgOut->addJsConfigVars( 'wgCirrusCommonTermsApplicable', true );
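On the frontend, that injected var would typically be read through `mw.config`. A self-contained sketch (the `mw.config` stub below is mine, so the example runs standalone; in MediaWiki it is provided by ResourceLoader):

```javascript
// Sketch: how a frontend module might read the injected config var.
// The mw.config object is stubbed here so the example is self-contained.
const mw = {
  config: {
    values: { wgCirrusCommonTermsApplicable: true },
    get(key, fallback) {
      return Object.prototype.hasOwnProperty.call(this.values, key)
        ? this.values[key]
        : fallback;
    }
  }
};

const applicable = mw.config.get('wgCirrusCommonTermsApplicable', false);
console.log(applicable); // true
```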

WikimediaEvents rEWMVc5b6de098f16: Add new fields to TestSearchSatisfaction

Add new fields to TestSearchSatisfaction

Depends on Ia0e8a95 in mediawiki/core for access to
the exact query that was run, and for the new wprov=
attached to all search result links.
Bug: T113291

Uses a new version of the EventLogging schema, TestSearchSatisfaction2, and some scary (to me) JavaScript.

That is for T113291: Add new fields to searchsatisfaction2 schema.

WikimediaEvents rEWMV4d3bf0b16691: Add common terms A/B test to search satisfaction

Add common terms A/B test to search satisfaction

Bug: T115024

Follows up the previous patch and further bumps the EventLogging schema TestSearchSatisfaction2.

That is for T115024: Code A/B test for AND operator relaxation test.

TL;DR above:

Both WikimediaEvents changes make use of $.jStorage and are related to the EventLogging schema TestSearchSatisfaction2. So that is in line with what @Peter found out and the addition of the script:

So maybe adding that new .js slightly delays the load time?

The root cause would be both of:

I am not sure how it works, but if that script is only used for search queries, maybe it can be loaded later on, or only when the search input gets focus. (Random thought, with no idea how all the JS/frontend works.)
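The idea above could be sketched as attaching the module loader to the search input's first focus event instead of loading it on every page view. Everything here is hypothetical: the element is stubbed so the example is self-contained, and in MediaWiki the callback would call something like `mw.loader.load` rather than incrementing a counter.

```javascript
// Sketch: load the instrumentation only on first focus of the search box.
function loadOnFirstFocus(element, loadModule) {
  element.addEventListener('focus', loadModule, { once: true });
}

// Minimal stand-in for a DOM element that honours { once: true }:
function makeElement() {
  const listeners = [];
  return {
    addEventListener(type, fn, opts) { listeners.push({ type, fn, opts }); },
    dispatch(type) {
      for (let i = listeners.length - 1; i >= 0; i--) {
        const l = listeners[i];
        if (l.type === type) {
          l.fn();
          if (l.opts && l.opts.once) listeners.splice(i, 1);
        }
      }
    }
  };
}

let loads = 0;
const input = makeElement();
loadOnFirstFocus(input, () => { loads++; });
input.dispatch('focus');
input.dispatch('focus');
console.log(loads); // 1 — the module is only loaded once, on first focus
```

(As noted in a later comment, this particular optimisation doesn't fit the satisfaction schema, which also instruments post-search page views.)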

rECIRab19311a7fc8: Report common terms usage to javascript

Unlikely to be the cause; the change here is just too small. It amounts to a couple dozen bytes embedded into the page.
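As a rough sanity check on that estimate: serialising the var roughly the way `mw.config` values end up embedded in the page HTML (key name taken from the commit quoted above) gives a few dozen bytes.

```javascript
// Rough size check: the injected config var as JSON, approximating how
// mw.config values are embedded in the page.
const embedded = JSON.stringify({ wgCirrusCommonTermsApplicable: true });
console.log(embedded.length + ' bytes'); // a few dozen bytes
```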

rEWMVc5b6de098f16: Add new fields to TestSearchSatisfaction

This does load a bit more code than before, but IIUC jquery.jStorage in particular shouldn't be the culprit, since the extra scripts are brought in by mw.loader.load, which (again, IIUC; I've been wrong before) shouldn't change the time to the initial domComplete. I can refactor this bit of code to push any potential data loading after $( document ).ready() to guarantee that, though. Additionally, I could refactor out the code that runs after mw.loader.using() so that it is also loaded later (when necessary) rather than for all users.

But I don't really know, beyond guessing, whether this is the culprit.
Unfortunately the satisfaction schema doesn't just measure usage of the search inputs, it measures how much the user interacts with our sites after clicking through a search result to an article page.
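The refactor proposed above, pushing the data loading past the load-critical path, boils down to deferring the expensive initialisation. A minimal sketch, with `initSession` as a hypothetical stand-in for the jStorage-backed session bootstrap:

```javascript
// Sketch: defer expensive initialisation so it cannot delay the
// load-critical path. setTimeout(…, 0) yields to the event loop, so the
// task runs only after the current synchronous work has finished.
function deferUntilAfterLoad(task) {
  return new Promise(resolve => {
    setTimeout(() => resolve(task()), 0);
  });
}

// Hypothetical expensive init (in the real module: jStorage/session setup):
deferUntilAfterLoad(() => 'session initialised').then(console.log);
```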

Reverted wmf.2 and wmf.3 back to before rEWMVc5b6de098f16: Add new fields to TestSearchSatisfaction, at 8:50 am PDT.

Post-deploy the timing dipped, came back, and dipped again. Will look again this afternoon.

The sheer payload size of adding a new script should in itself not cause a regression of this sort, so I suspect it is not so much the adding of the script but rather what the script is doing.


  • The new version of the script may contain code (or trigger code, such as jStorage) that is synchronous and expensive.
    • The use of $( window ).scrollTop(); seems suspicious and could be a notable cost, one that doesn't belong in the top scope.
  • Or: jStorage itself may have unconditional initialisation code that is expensive and should be deferred.
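Either suspicion has the same fix pattern: move the expensive call out of module top scope and behind a lazy getter, so nothing runs at script load. A sketch under that assumption; `readScroll` is injected here so the example is self-contained (in the real module it would be the `$( window ).scrollTop()` call, which can force a synchronous reflow):

```javascript
// Sketch: lazy, cached getter instead of a top-scope layout read.
function makeLazyScrollTop(readScroll) {
  let cached = null;
  return function getScrollTop() {
    if (cached === null) {
      cached = readScroll(); // the (possibly reflow-forcing) read happens
    }                        // only on first use, not at script load
    return cached;
  };
}

let reads = 0;
const getScrollTop = makeLazyScrollTop(() => { reads++; return 120; });
// Nothing has been read yet at "module load" time:
console.log(reads); // 0
getScrollTop();
getScrollTop();
console.log(reads); // 1 — a single read, on first use
```

(Caching is only safe if a one-time snapshot is what the instrumentation needs; the main point is that no layout read happens at load time.)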

On the graph, the increase can be seen to be almost half a second:

Krinkle renamed this task from Investigate loadEventEnd / domComplete regression to Investigate loadEventEnd / domComplete regression as of 2015-10-13. Oct 21 2015, 5:00 PM

Looks like rEWMVc5b6de was indeed the cause of the regression. Thanks for the help, @hashar, and thanks @EBernhardson for the quick response. Erik, could you please solicit a review of this change from someone on the performance team before re-deploying?

@ori wrote:

Thanks for the help, @hashar,

You are most welcome. It has been quite fun digging in the logs while you guys were asleep, knowing you would wake up with some digging results :-}

Kudos on shaving some 100ms.

I've asked our two new frontend engineers in Discovery, @JGirault and @Jdrewniak, for one of them to pick this up and try to work out what caused the perf regression. I'll make sure to ping you when we've figured something out.

This has been redeployed in the Wednesday morning SWAT (~16:45 UTC). I looked at the last 12 hours of data[1] four hours after deployment and it looks reasonable, but please double-check.


It was deployed around 16:45, which is normally the low point of the day (compare to yellow and blue, which are the time-shifted series). It seems this did cause a notable 150 ms spike in the hour that followed, but it went back into normal numbers after that. Probably just a small burst from cache invalidation.