
Investigate 2X regression in StartupModule::buildContent generation time
Closed, Resolved · Public

Description

The p95 time for generating the startup module response appears to have regressed very significantly: from less than 35 milliseconds to roughly 750 ms.

Regression 1: 2018-09-05

From 35 ms to 350 ms.

https://grafana.wikimedia.org/d/000000067/resourceloadermodule?refresh=5m&panelId=25&fullscreen&orgId=1&from=1534736814482&to=1537368960396

Screenshot 2019-05-14 at 12.50.56.png (245 KB)

Regression 2: 2019-03-01

From 450 ms to 720 ms.

https://grafana.wikimedia.org/d/000000067/resourceloadermodule?panelId=25&fullscreen&orgId=1&from=1536364800000&to=1579478400000

Screenshot 2019-05-14 at 12.52.48.png (301 KB)

Event Timeline

The first regression correlates exactly with the points in time at which a change of mine was deployed. The change was meant to fix T201686 and actually decrease the time spent computing the startup response, by sharing the in-process cache between getScript and getVersionHash (https://gerrit.wikimedia.org/r/456327).

  • [2018-09-05] 01:42 krinkle@deploy1001: Synchronized php-1.32.0-wmf.20/includes/resourceloader/: I8e8d3a2cd2cc - T201686 (duration: 01m 31s)

Here it goes to group0.

  • [2018-09-05] 13:08 hashar@deploy2001: Synchronized php: group1 wikis to 1.32.0-wmf.20 (duration: 01m 07s)

Here it goes to group1.

  • [2018-09-06] 13:08 hashar@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.20

Here it goes to group2 (Wikipedias).

As for why it caused this metric to appear 10× slower…

The metric plotted on that dashboard measures time spent in ResourceLoaderModule::buildContent, which is what calls the module's getScript method and so on. The result is process-cached by ResourceLoaderModule::getModuleContent, and only the first "real" call is measured. So far so good.

With the change applied, the code path for calculating the module response ETag and the code path for calculating the HTTP response body both use ResourceLoaderModule::getModuleContent, and the first call is measured and reported to Graphite. The "after" state appears correct, and the sad reality appears to be that this code path can indeed be incredibly slow in 5% of cases. We need to look into why this happens and fix it.
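
To make that concrete, here is a minimal sketch of the shared-cache-plus-measurement pattern, assuming simplified method names and bodies (reportTiming() is a made-up stand-in for the real statsd/Graphite call):

```php
<?php
// Minimal sketch only; the real ResourceLoaderModule has far more state.
abstract class SketchModule {
	/** @var array|null Process cache shared by the ETag and body code paths */
	private $contents = null;

	/** The expensive part: compute scripts, styles, etc. */
	abstract protected function buildContent(): array;

	public function getModuleContent(): array {
		if ( $this->contents === null ) {
			// Only this first, uncached call is timed; subsequent calls
			// (e.g. body generation after ETag computation) hit the cache.
			$start = microtime( true );
			$this->contents = $this->buildContent();
			$this->reportTiming( ( microtime( true ) - $start ) * 1000 );
		}
		return $this->contents;
	}

	private function reportTiming( float $ms ): void {
		// Hypothetical stand-in for the statsd reporting in the real code.
	}
}
```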

So why did it appear so incredibly fast (35 ms) before this change? Well, before this commit, the StartupModule class used a custom method for determining the version hash, based on getDefinitionSummary, and it had its own custom in-process cache as well.

This means that, in the old code, calling getDefinitionSummary (which we don't measure in Graphite) did all the expensive work before we got to the response body (getScript). Then, when we reached the response body and called getScript, it was super fast. And that warm-cache call is the step we measured, because from the perspective of ResourceLoaderModule::buildContent it was actually a cache miss and thus the one we measure.

This inaccuracy wasn't known to me, but I got rid of it in this commit; hence, we are now measuring the correct time it takes to compute the startup module response.
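
For contrast, a sketch of the old, mis-measured flow (again simplified; the method bodies are hypothetical):

```php
<?php
// Sketch of the pre-change StartupModule behaviour (simplified, hypothetical).
class OldStartupModuleSketch {
	/** @var string|null Custom in-process cache, private to this class */
	private $script = null;

	/** Used for the version hash; NOT covered by the buildContent metric. */
	public function getDefinitionSummary(): array {
		// All the expensive work happens here, warming the private cache
		// as a side effect before the response body is ever requested.
		$this->script = $this->computeStartupScript();
		return [ 'script' => md5( $this->script ) ];
	}

	/** Called (and timed) via buildContent; by then the cache is warm. */
	public function getScript(): string {
		if ( $this->script === null ) {
			$this->script = $this->computeStartupScript();
		}
		return $this->script;
	}

	private function computeStartupScript(): string {
		return '/* registry and versions of ~3000 modules */';
	}
}
```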

Conclusion on regression 1

It is not a regression. It is a fix to the way the metric is measured for the startup module. Previously it was doing the expensive work before the stopwatch started counting the time spent. Now, we start counting the time correctly and thus realise how bad it really was.

Next steps:

  • In this task, investigate regression 2 as well.
  • In another task, figure out why this 5% is so slow, and fix it.
Regression 2: 2019-03-01

From 450 ms to 720 ms.

Correlates directly with the 1.33.0-wmf.19 train deployment.

Startup build time vs 1.33.0-wmf.19 promotion events (dashboard)
Screenshot 2019-05-14 at 13.12.10.png (426 KB)
  • [2019-02-27 Wed] 20:04 hashar@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.19
  • [2019-02-27 Wed] 20:50 hashar: 1.33.0-wmf.19 not rolled to group1 (reverted).
  • [2019-02-28 Thu] 14:40 hashar@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.19
  • [2019-02-28 Thu] 20:13 hashar@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.19

Commits in mediawiki/core:/includes/resourceloader between 1.33.0-wmf.18 and 1.33.0-wmf.19:

Commits in mediawiki/core:/resources between 1.33.0-wmf.18 and 1.33.0-wmf.19:

This does not yet include commits made to extensions during that branch week, which is another place where the cause might be.

I suspect 9ba1d8f526e1 as the culprit. It added parsing of wikitext as a required part of computing the modules' version hash during the startup response.

Screenshot 2019-05-14 at 13.39.11.png (623 KB)

Screenshot 2019-05-14 at 13.39.37.png (816 KB)

The ChangesListSpecialPage::getRcFiltersConfigVars method is sampled at 3.87% compared to 28% for all of ResourceLoaderStartUpModule::getScript.

This means that, out of ~3000 registered modules, this one module alone consumes approximately 13.8% of the time to produce the startup module response (3.87 / 28 ≈ 13.8%).

This should probably be done only when the module is actually needed, with cache versioning based only on the list of tags or their raw wikitext, or something else that lets us avoid reparsing the wikitext repeatedly.

Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).
Krinkle moved this task from Inbox to Blocked on the Growth-Team board.
Krinkle added a subscriber: Catrope.
Krinkle renamed this task from "Investigate 10X regression in StartupModule::buildContent generation time" to "Investigate 2X regression in StartupModule::buildContent generation time". (May 20 2019, 8:29 PM)
Krinkle moved this task from Blocked to Inbox on the Growth-Team board.
Krinkle moved this task from Inbox to Blocked on the Growth-Team board.
Krinkle moved this task from Blocked to Inbox on the Growth-Team board.
Krinkle moved this task from Inbox, needs triage to Blocked (old) on the Performance-Team board.

(Queuing for internal re-triage next week)

JTannerWMF subscribed.

@Krinkle is going to work on this task, and @Catrope will provide support as needed.

Change 516580 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] [WIP] resourceloader: Support 'versionCallback' for computed package files

https://gerrit.wikimedia.org/r/516580

Change 516850 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] [WIP] RCFilters: Reduce startup overhead from 'config.json' computation

https://gerrit.wikimedia.org/r/516850

Change 516580 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: Support 'versionCallback' for computed package files

https://gerrit.wikimedia.org/r/516580
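
For illustration, a sketch of what this enables in a module definition. The 'name', 'callback', and 'versionCallback' keys come from the change above; the module name and the ExampleFilters helper class are made up:

```php
<?php
// Hypothetical module definition using 'versionCallback'.
$wgResourceModules['ext.example.filters'] = [
	'packageFiles' => [
		'index.js',
		[
			'name' => 'config.json',
			// Expensive: builds the full config (message parsing, etc.).
			// Only invoked when the module body is actually requested.
			'callback' => static function ( $context ) {
				return ExampleFilters::buildConfigVars( $context );
			},
			// Cheap: used when computing the version hash for the startup
			// module, keeping the expensive callback off that hot path.
			'versionCallback' => static function ( $context ) {
				return ExampleFilters::getConfigVarsSummary( $context );
			},
		],
	],
];
```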

Change 516850 merged by jenkins-bot:
[mediawiki/core@master] RCFilters: Reduce startup overhead from 'config.json' computation

https://gerrit.wikimedia.org/r/516850

Krinkle moved this task from Blocked (old) to Doing (old) on the Performance-Team board.

This regression should fade away after this week's train goes out (1.34.0-wmf.13). Assigning back to me to confirm.

Confirmed. It went from 3-4% to ~1%. Still non-trivial (one of a dozen or so major things in the startup module), but no longer stands out among the rest.

I'm now seeing this as 29% of the wall clock time for a startup module request on enwiki: https://performance.wikimedia.org/xhgui/run/callgraph?id=5d3b723cbb8544407f9cb8ae

Part of the issue is that OAuth's hook for dynamically injecting change tag i18n messages is slow (it instantiates a DAO object and does a DB query for every tag separately; 70 tags on enwiki, 100 on wikidata), but that "only" accounts for about a third of that time (9% of the total). Building a data structure with the contents of ~300 i18n messages is just slow, even without message parsing.

Options:

  • Cache ChangesListSpecialPage::getChangeTagSummary() (and getChangeTagList()) for an hour (or 24 hours), without invalidation (see the sketch after this list)
  • Come up with a more complex way of caching parsed unparameterized messages (with correct invalidation) like what's contemplated in T189229
  • Move this blob out of an RL module and into an API module (but this would add an additional request for every RC/WL page view)
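
A minimal sketch of the first option, assuming MediaWiki's WANObjectCache (getWithSetCallback computes on miss and caches for the given TTL). getChangeTagSummaryUncached() is a hypothetical stand-in for the existing expensive computation; the actual patch follows below:

```php
<?php
// Sketch of option 1: TTL-based caching without invalidation.
use MediaWiki\MediaWikiServices;

function getCachedChangeTagSummary( string $wikiId ): array {
	$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
	return $cache->getWithSetCallback(
		$cache->makeKey( 'changetags-summary', $wikiId ),
		WANObjectCache::TTL_HOUR, // tolerate up to an hour of staleness
		static function () {
			// Expensive: per-tag DB lookups plus ~300 i18n messages.
			return getChangeTagSummaryUncached();
		}
	);
}
```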

Change 525882 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/core@master] ChangesListSpecialPage: Cache the change tag list and its summary

https://gerrit.wikimedia.org/r/525882

Change 525882 merged by jenkins-bot:
[mediawiki/core@master] ChangesListSpecialPage: Cache the change tag list and its summary

https://gerrit.wikimedia.org/r/525882

Task description § Regression 2: 2019-03-01

This has now recovered, and even got a bit better:

capture.png (88 KB)