These metrics got suddenly worse on Tuesday, October 13:
- Mentioned In
- T116319: Fix and re-deploy satisfaction schema.
T115031: Fix the deployed version of the Search schema
- Mentioned Here
rEWMVc5b6de098f16: Add new fields to TestSearchSatisfaction
rEWMV4d3bf0b16691: Add common terms A/B test to search satisfaction
T113291: Add new fields to searchsatisfaction2 schema
T115024: Code A/B test for AND operator relaxation test
Does that match what we have in Git?
Else I've seen some strange numbers comparing page sizes between runs (the total is larger but the individual changes doesn't add up, I'll add an issue for that at Github).
+ @EBernhardson who deployed some potentially related changes to Search
From production SAL https://tools.wmflabs.org/sal/production?d=2015-10-13
23:30 <ebernhardson@tin> Synchronized wmf-config/InitialiseSettings.php: Set $wgUploadNavigationUrl to use uselang=$lang for commonsuploads wikis by default (duration: 01m 14s) 23:27 <ebernhardson@tin> Synchronized php-1.27.0-wmf.2/extensions/WikimediaEvents/: Turn on cirrus common terms test (duration: 01m 15s) 23:25 <ebernhardson@tin> Synchronized php-1.27.0-wmf.2/extensions/CirrusSearch/: Add information for common terms ab test (duration: 01m 15s) 23:19 <ebernhardson@tin> Synchronized php-1.27.0-wmf.2/extensions/Flow: Make flow board descriptions editable again (duration: 01m 16s) 23:13 <ebernhardson@tin> Synchronized php-1.27.0-wmf.3/extensions/Flow: Bump flow submodule in 1.27.0-wmf.3 (duration: 01m 15s) 23:10 <ebernhardson@tin> Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/: Update WME for common terms AB test (duration: 01m 14s) 23:08 <ebernhardson@tin> Synchronized php-1.27.0-wmf.3/extensions/CirrusSearch/: Update cirrus for common terms AB tes (duration: 01m 15s) 23:02 <ebernhardson@tin> Synchronized wmf-config/InitialiseSettings.php: Turn on logging of the "warning" channel (duration: 01m 13s)
2015-10-13 23:00–00:00 UTC was an evening SWAT deployment slot: https://wikitech.wikimedia.org/wiki/Deployments/Archive/2015/10#Tuesday.2C.C2.A0October.C2.A013
[wmf.2] [wmf.3] Gerrit 246079 Gerrit 246076 Make Flow board descriptions editable again
[wmf.3] Gerrit 246080 Fix missing summarize step when resolving Flow topics
[wmf.3] Gerrit 246124 Un-disappear the categories list in Flow
Erik B (ebernhardson)
[config] Gerrit 244832 - Log messages to the 'warning' channel
[wmf.2] [wmf.3] CirrusSearch Gerrit 246108 Gerrit 246109 Add info about common terms to output for A/B test
[wmf.2] [wmf.3] WikimediaEvents Gerrit 246119 Gerrit 246120 Add more information to the TestSearchSatisfaction schema
[wmf.2] [wmf.3] WikimediaEvents Gerrit 246123 Gerrit 246122 Add common terms A/B test to satisfaction schema
[config] Gerrit 245945 Set $wgUploadNavigationUrl to use uselang=$lang for commonsuploads wikis by default
The WikimediaEvents changes from CirrusSearch and WikimediaEvents in master branches read as:
Injects a JS variable wgCirrusCommonTermsApplicable:
$wgOut->addJsConfigVars( 'wgCirrusCommonTermsApplicable', true );
WikimediaEvents rEWMVc5b6de098f16: Add new fields to TestSearchSatisfaction
Add new fields to TestSearchSatisfaction Depends on Ia0e8a95 in mediawiki/core for access to the exact query thta was run, and for the new wprov= attached to all search result links. Bug: T113291
Add common terms A/B test to search satisfaction Bug: T115024
Both WikimediaEvents make use of $.jStorage and are related to the EventLogging schema TestSearchSatisfication2. So that is inline with @Peter found out and the addition of the script:
So maybe adding that new .js slightly delay the load time?
The root cause would be both of:
I am not sure how it works, but if that script is only used for search queries, maybe it can be loaded later on or only when the search input get focus. (random thought with no idea how all the JS/Frontend works).
unlikely to be the cause, the change here is just too small. it amounts to a couple dozen bytes embedded into the page.
rEWMVc5b6de098f16: Add new fields to TestSearchSatisfaction
This does load a bit more code than before, but IIUC jquery.jStorage in particular shouldn't be the culprit since the extra scripts are brought in by mw.loader.load which shouldn't (again, IIUC. i've been wrong before) cause changes to the time to initial domComplete. I can refactor this bit of code to push any potential data loading after a $( document ).ready() to guarantee that though. Additionally i could refactor out the code that runs after mw.loader.using() such that it is also loaded later (when necessary) rather than for all users.
But i don't really know beyond guessing if this is the culprit.
Unfortunately the satisfaction schema doesn't just measure usage of the search inputs, it measures how much the user interacts with our sites after clicking through a search result to an article page.
Reverted wmf.2 and wmf.3 back to before rEWMVc5b6de098f16: Add new fields to TestSearchSatisfaction
with https://gerrit.wikimedia.org/r/#/c/247853/ and https://gerrit.wikimedia.org/r/#/c/247853/ and 8:50 am PDT.
Post-deploy the timing dipped, came back, and dipped again. Will look again this afternoon
The sheer payload size of adding a new script should in itself not cause a regression of this sort, so I suspect is it not so much the adding of the script but rather what the script is doing.
- The new version of the ext.wikimediaEvents.search.js script may contain code (or triggers code, such as jStorage) that is synchronous and expensive.
- The use of $( window ).scrollTop(); seems suspicious and could be a notable cost, one that doesn't belong in the top scope.
- or; jStorage itself may have unconditional initialisation code that is expensive and should be deferred.
On https://grafana.wikimedia.org/dashboard/db/performance-metrics?panelId=8&fullscreen&from=now-14d the increase can be seen as almost half a second:
Can refer to this graph (from ori) later to compare the today vs. yesterday timings: http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1445445736.446&from=-4hours&target=movingAverage(timeShift(coal.domComplete%2C%221d%22)%2C10)&target=movingAverage(coal.domComplete%2C10)
This has been redeployed in wed morning SWAT (~16:45 UTC), i looked at the last 12 hours of data 4 hours after deployment and it looks reasonable, but please double check.
It was deployed around 16:45 which is normally the low point of the day (compare to yellow and blue which are the timeshifts), it seems this did cause a notable 150ms spike in the hour that followed, but it went back in into normal numbers after that. Probably just a small burst from cache invalidation.