
Investigate 2X regression in StartupModule::buildContent generation time
Closed, Resolved · Public

Description

The p95 time for generating the startup module response appears to have regressed very significantly: from less than 35 milliseconds to roughly 750 ms.

Regression 1: 2018-09-05

From 35 ms to 350 ms.

https://grafana.wikimedia.org/d/000000067/resourceloadermodule?refresh=5m&panelId=25&fullscreen&orgId=1&from=1534736814482&to=1537368960396

Screenshot 2019-05-14 at 12.50.56.png (245 KB)

Regression 2: 2019-03-01

From 450 ms to 720 ms.

https://grafana.wikimedia.org/d/000000067/resourceloadermodule?panelId=25&fullscreen&orgId=1&from=1536364800000&to=1579478400000

Screenshot 2019-05-14 at 12.52.48.png (301 KB)

Event Timeline

The first regression correlates exactly with the points in time at which a change of mine was deployed. The change was meant to fix T201686 and actually decrease the time spent computing the startup response, by sharing the in-process cache between getScript and getVersionHash (https://gerrit.wikimedia.org/r/456327).

  • [2018-09-05] 01:42 krinkle@deploy1001: Synchronized php-1.32.0-wmf.20/includes/resourceloader/: I8e8d3a2cd2cc - T201686 (duration: 01m 31s)

Here it goes to group0.

  • [2018-09-05] 13:08 hashar@deploy2001: Synchronized php: group1 wikis to 1.32.0-wmf.20 (duration: 01m 07s)

Here it goes to group1.

  • [2018-09-06] 13:08 hashar@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.20

Here it goes to group2 (Wikipedias).

As for why it caused this metric to appear 10× slower…

The metric plotted on that dashboard measures time spent in ResourceLoaderModule::buildContent, which is what calls the module's getScript method and so on. The result is process-cached by ResourceLoaderModule::getModuleContent, and only the first "real" call is measured. So far so good.

With the change applied, the code path for calculating the module response ETag and the code path for calculating the HTTP response body both use ResourceLoaderModule::getModuleContent, and the first call is measured and reported to Graphite. The "after" state appears correct, and the sad reality appears to be that this code path can indeed be incredibly slow in 5% of cases. We need to look into why this happens and fix it.
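
To make that concrete, here is a minimal sketch of the shared-cache-plus-measurement pattern, assuming simplified method names and bodies (reportTiming() is a made-up stand-in for the real statsd/Graphite call):

```php
<?php
// Minimal sketch only; the real ResourceLoaderModule has far more state.
abstract class SketchModule {
	/** @var array|null Process cache shared by the ETag and body code paths */
	private $contents = null;

	/** The expensive part: compute scripts, styles, etc. */
	abstract protected function buildContent(): array;

	public function getModuleContent(): array {
		if ( $this->contents === null ) {
			// Only this first, uncached call is timed; subsequent calls
			// (e.g. body generation after ETag computation) hit the cache.
			$start = microtime( true );
			$this->contents = $this->buildContent();
			$this->reportTiming( ( microtime( true ) - $start ) * 1000 );
		}
		return $this->contents;
	}

	private function reportTiming( float $ms ): void {
		// Hypothetical stand-in for the statsd reporting in the real code.
	}
}
```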

So why did it appear so incredibly fast (35 ms) before this change? Well, before this commit, the StartupModule class used a custom method for determining the version hash, based on getDefinitionSummary, and it had its own custom in-process cache as well.

This means that, in the old code, calling getDefinitionSummary (which we don't measure in Graphite) did all the expensive work before we got to the response body (getScript). Then, when we reached the response body and called getScript, it was super fast. And that warm-cache call is the step we measured, because from the perspective of ResourceLoaderModule::buildContent it was actually a cache miss and thus the one we measure.

This inaccuracy wasn't known to me, but I got rid of it in this commit; hence, we are now measuring the correct time it takes to compute the startup module response.
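
For contrast, a sketch of the old, mis-measured flow (again simplified; the method bodies are hypothetical):

```php
<?php
// Sketch of the pre-change StartupModule behaviour (simplified, hypothetical).
class OldStartupModuleSketch {
	/** @var string|null Custom in-process cache, private to this class */
	private $script = null;

	/** Used for the version hash; NOT covered by the buildContent metric. */
	public function getDefinitionSummary(): array {
		// All the expensive work happens here, warming the private cache
		// as a side effect before the response body is ever requested.
		$this->script = $this->computeStartupScript();
		return [ 'script' => md5( $this->script ) ];
	}

	/** Called (and timed) via buildContent; by then the cache is warm. */
	public function getScript(): string {
		if ( $this->script === null ) {
			$this->script = $this->computeStartupScript();
		}
		return $this->script;
	}

	private function computeStartupScript(): string {
		return '/* registry and versions of ~3000 modules */';
	}
}
```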

Conclusion on regression 1

It is not a regression. It is a fix to the way the metric is measured for the startup module. Previously it was doing the expensive work before the stopwatch started counting the time spent. Now, we start counting the time correctly and thus realise how bad it really was.

Next steps:

  • In this task, investigate regression 2 as well.
  • In another task, figure out why this 5% is so slow, and fix it.
Regression 2: 2019-03-01

From 450 ms to 720 ms.

Correlates directly with the 1.33.0-wmf.19 train deployment.

Startup build time vs 1.33.0-wmf.19 promotion events (dashboard)
Screenshot 2019-05-14 at 13.12.10.png (426 KB)
  • [2019-02-27 Wed] 20:04 hashar@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.19
  • [2019-02-27 Wed] 20:50 hashar: 1.33.0-wmf.19 not rolled to group1 (reverted).
  • [2019-02-28 Thu] 14:40 hashar@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.19
  • [2019-02-28 Thu] 20:13 hashar@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.19

Commits in mediawiki/core:/includes/resourceloader between 1.33.0-wmf.18 and 1.33.0-wmf.19:

Commits in mediawiki/core:/resources between 1.33.0-wmf.18 and 1.33.0-wmf.19:

This does not yet include commits made to extensions during that branch week, which is another place where the cause might be.

I suspect 9ba1d8f526e1 as the culprit. It added parsing of wikitext as a required part of computing the modules' version hash during the startup response.

Screenshot 2019-05-14 at 13.39.11.png (623 KB)

Screenshot 2019-05-14 at 13.39.37.png (816 KB)

The ChangesListSpecialPage::getRcFiltersConfigVars method is sampled at 3.87% compared to 28% for all of ResourceLoaderStartUpModule::getScript.

This means that, out of ~3000 registered modules, this one module alone consumes approximately 13.8% of the time to produce the startup module response (3.87 / 28 ≈ 13.8%).

This should probably be done only when the module is actually needed, with cache versioning based only on the list of tags or their raw wikitext, or something else that lets us avoid reparsing the wikitext repeatedly.

Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).
Krinkle moved this task from Inbox to Blocked on the Growth-Team board.
Krinkle added a subscriber: Catrope.
Krinkle renamed this task from "Investigate 10X regression in StartupModule::buildContent generation time" to "Investigate 2X regression in StartupModule::buildContent generation time". (May 20 2019, 8:29 PM)
Krinkle moved this task from Blocked to Inbox on the Growth-Team board.
Krinkle moved this task from Inbox to Blocked on the Growth-Team board.
Krinkle moved this task from Blocked to Inbox on the Growth-Team board.
Krinkle moved this task from Inbox, needs triage to Blocked (old) on the Performance-Team board.

(Queuing for internal re-triage next week)

JTannerWMF subscribed.

@Krinkle is going to work on this task, and @Catrope will provide support as needed.

Change 516580 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] [WIP] resourceloader: Support 'versionCallback' for computed package files

https://gerrit.wikimedia.org/r/516580

Change 516850 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] [WIP] RCFilters: Reduce startup overhead from 'config.json' computation

https://gerrit.wikimedia.org/r/516850

Change 516580 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: Support 'versionCallback' for computed package files

https://gerrit.wikimedia.org/r/516580
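
For illustration, a sketch of what this enables in a module definition. The 'name', 'callback', and 'versionCallback' keys come from the change above; the module name and the ExampleFilters helper class are made up:

```php
<?php
// Hypothetical module definition using 'versionCallback'.
$wgResourceModules['ext.example.filters'] = [
	'packageFiles' => [
		'index.js',
		[
			'name' => 'config.json',
			// Expensive: builds the full config (message parsing, etc.).
			// Only invoked when the module body is actually requested.
			'callback' => static function ( $context ) {
				return ExampleFilters::buildConfigVars( $context );
			},
			// Cheap: used when computing the version hash for the startup
			// module, keeping the expensive callback off that hot path.
			'versionCallback' => static function ( $context ) {
				return ExampleFilters::getConfigVarsSummary( $context );
			},
		],
	],
];
```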

Change 516850 merged by jenkins-bot:
[mediawiki/core@master] RCFilters: Reduce startup overhead from 'config.json' computation

https://gerrit.wikimedia.org/r/516850

Krinkle moved this task from Blocked (old) to Doing (old) on the Performance-Team board.

This regression should fade away after this week's train goes out (1.34.0-wmf.13). Assigning back to me to confirm.

Confirmed. It went from 3-4% to ~1%. Still non-trivial (one of a dozen or so major things in the startup module), but no longer stands out among the rest.

I'm now seeing this as 29% of the wall clock time for a startup module request on enwiki: https://performance.wikimedia.org/xhgui/run/callgraph?id=5d3b723cbb8544407f9cb8ae

Part of the issue is that OAuth's hook for dynamically injecting change tag i18n messages is slow (it instantiates a DAO object and does a DB query for every tag separately; 70 tags on enwiki, 100 on wikidata), but that "only" accounts for about a third of that time (9% of the total). Building a data structure with the contents of ~300 i18n messages is just slow, even without message parsing.

Options:

  • Cache ChangesListSpecialPage::getChangeTagSummary() (and getChangeTagList()) for an hour (or 24 hours), without invalidation (see the sketch after this list)
  • Come up with a more complex way of caching parsed unparameterized messages (with correct invalidation) like what's contemplated in T189229
  • Move this blob out of an RL module and into an API module (but this would add an additional request for every RC/WL page view)
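
A minimal sketch of the first option, assuming MediaWiki's WANObjectCache (getWithSetCallback computes on miss and caches for the given TTL). getChangeTagSummaryUncached() is a hypothetical stand-in for the existing expensive computation; the actual patch follows below:

```php
<?php
// Sketch of option 1: TTL-based caching without invalidation.
use MediaWiki\MediaWikiServices;

function getCachedChangeTagSummary( string $wikiId ): array {
	$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
	return $cache->getWithSetCallback(
		$cache->makeKey( 'changetags-summary', $wikiId ),
		WANObjectCache::TTL_HOUR, // tolerate up to an hour of staleness
		static function () {
			// Expensive: per-tag DB lookups plus ~300 i18n messages.
			return getChangeTagSummaryUncached();
		}
	);
}
```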

Change 525882 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/core@master] ChangesListSpecialPage: Cache the change tag list and its summary

https://gerrit.wikimedia.org/r/525882

Change 525882 merged by jenkins-bot:
[mediawiki/core@master] ChangesListSpecialPage: Cache the change tag list and its summary

https://gerrit.wikimedia.org/r/525882

Task description § Regression 2: 2019-03-01

This has now recovered, and even got a bit better:

capture.png (88 KB)