
Investigate 2X regression in StartupModule::buildContent generation time
Open, High, Public

Description

The p95 for generating the startup module response appears to have regressed very significantly: from less than 35 milliseconds to almost a second (750 ms).

Regression 1: 2018-09-05

From 35 ms to 350 ms.

https://grafana.wikimedia.org/d/000000067/resourceloadermodule?refresh=5m&panelId=25&fullscreen&orgId=1&from=1534736814482&to=1537368960396

Regression 2: 2019-03-01

From 450 ms to 720 ms.

https://grafana.wikimedia.org/d/000000067/resourceloadermodule?panelId=25&fullscreen&orgId=1&from=1536364800000&to=1579478400000

Event Timeline

Krinkle created this task. · May 14 2019, 11:52 AM
Restricted Application added subscribers: Liuxinyu970226, Aklapper. · View Herald Transcript · May 14 2019, 11:52 AM

The first regression correlates exactly with the two points in time at which a change of mine was deployed. That change was meant to fix T201686 and actually decrease the time spent computing the startup response, by sharing the in-process cache between getScript and getVersionHash (https://gerrit.wikimedia.org/r/456327).

  • [2018-09-05] 01:42 krinkle@deploy1001: Synchronized php-1.32.0-wmf.20/includes/resourceloader/: I8e8d3a2cd2cc - T201686 (duration: 01m 31s)

Here it goes to group0.

  • [2018-09-05] 13:08 hashar@deploy2001: Synchronized php: group1 wikis to 1.32.0-wmf.20 (duration: 01m 07s)

Here it goes to group1.

  • [2018-09-06] 13:08 hashar@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.32.0-wmf.20

Here it goes to group2 (Wikipedias).

As for why this change made the metric appear 10× slower…

The metric plotted on that dashboard measures time spent in ResourceLoaderModule::buildContent, which is what calls the module's getScript method etc. It is process-cached by ResourceLoaderModule::getModuleContent, and only the first "real" call has the metric measurement. So far so good.

With the change applied, the code path for calculating the module response E-Tag and the code path for calculating the HTTP response body both use ResourceLoaderModule::getModuleContent, and the first call is the one measured and reported to Graphite. The "after" state appears correct, and the sad reality appears to be that this code path really can be incredibly slow in 5% of cases. We need to look into why this happens and fix it.
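
To make the measurement relationship concrete, here is a minimal sketch of the new flow (illustrative only; the real logic lives in ResourceLoaderModule::getModuleContent and ::buildContent, and the helper names below are made up):

```lang=php
<?php
// Minimal sketch only; real code is ResourceLoaderModule::getModuleContent()
// and ::buildContent(). Names and bodies below are simplified for illustration.
class SketchModule {
	/** @var array|null Process cache now shared by the E-Tag and body paths */
	private $contents = null;

	public function getModuleContent( $context ) {
		if ( $this->contents === null ) {
			$start = microtime( true );
			// All the expensive work (getScript, getStyles, ...) happens here.
			$this->contents = $this->buildContent( $context );
			// Only this first "real" call is what the Graphite metric records.
			$this->recordTiming( 'buildContent', microtime( true ) - $start );
		}
		return $this->contents;
	}

	public function getVersionHash( $context ) {
		// E-Tag path: usually the first caller, so it pays the full cost,
		// and that cost is now what gets measured.
		return md5( json_encode( $this->getModuleContent( $context ) ) );
	}

	public function getResponseBody( $context ) {
		// HTTP body path: re-uses the process cache, effectively free.
		return $this->getModuleContent( $context )['scripts'];
	}

	private function buildContent( $context ) {
		return [ 'scripts' => '/* expensive to compute */' ];
	}

	private function recordTiming( $metric, $seconds ) {
		// Placeholder for the statsd/Graphite reporting.
	}
}
```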

So why did it appear so incredibly fast (35 ms) before this change? Well, before this commit, the StartupModule class was using a custom method for determining the version hash, based on getDefinitionSummary. The StartupModule class had its own custom in-process cache as well.

This means that, in the old code, when getDefinitionSummary was called (which we don't measure in Graphite), all the expensive steps were happening before we started on the response body (getScript). Then, when we got to the response body and called getScript, it was super fast. And that "cache hit" is the step we measured, because from the perspective of ResourceLoaderModule::buildContent it was actually a cache miss.

This inaccuracy wasn't known to me, but I got rid of it in this commit; hence, we are now measuring the actual time it takes to compute the startup module response.
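
For contrast, the old flow looked roughly like this (again a sketch with made-up names, not the removed code):

```lang=php
<?php
// Sketch of the pre-I8e8d3a2cd2cc flow; names and bodies are illustrative only.
class OldStartupModuleSketch {
	/** @var string|null StartupModule's own private cache */
	private $registryBlob = null;

	public function getDefinitionSummary( $context ) {
		// Expensive: enumerate every registered module and compute its version.
		// This ran first (for the E-Tag) and was never timed in Graphite.
		$this->registryBlob = $this->computeRegistryBlob( $context );
		return [ 'registry' => md5( $this->registryBlob ) ];
	}

	public function getScript( $context ) {
		// By the time the timed buildContent() call reaches getScript(), the
		// private cache is already warm, so the stopwatch only saw ~35 ms.
		if ( $this->registryBlob === null ) {
			$this->registryBlob = $this->computeRegistryBlob( $context );
		}
		return $this->registryBlob;
	}

	private function computeRegistryBlob( $context ) {
		return '/* registration blob for ~3000 modules */';
	}
}
```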

Conclusion on regression 1

It is not a regression. It is a fix to the way the metric is measured for the startup module. Previously it was doing the expensive work before the stopwatch started counting the time spent. Now, we start counting the time correctly and thus realise how bad it really was.

Next steps:

  • In this task, investigate regression 2 as well.
  • In another task, figure out why this 5% is so slow, and fix it.
Regression 2: 2019-03-01

From 450 ms to 720 ms.

Correlates directly to the 1.33.0-wmf.19 train deployment.

Startup build time vs 1.33.0-wmf.19 promotion events (dashboard)
  • [2019-02-27 Wed] 20:04 hashar@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.19
  • [2019-02-27 Wed] 20:50 hashar: 1.33.0-wmf.19 not rolled to group1 (reverted).
  • [2019-02-28 Thu] 14:40 hashar@deploy1001: rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.19
  • [2019-02-28 Thu] 20:13 hashar@deploy1001: rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.19

Commits in mediawiki/core:/includes/resourceloader between 1.33.0-wmf.18 and 1.33.0-wmf.19:

Commits in mediawiki/core:/resources between 1.33.0-wmf.18 and 1.33.0-wmf.19:

This does not yet include commits made to extensions during that branch week, which is another place where the cause might be.

I suspect 9ba1d8f526e1 as the culprit. It added parsing of wikitext as a required part of computing the module's version hash during the startup response.


The ChangesListSpecialPage::getRcFiltersConfigVars method is sampled at 3.87% compared to 28% for all of ResourceLoaderStartUpModule::getScript.

This means that, out of roughly 3000 modules, this one module alone is consuming approximately 13.8% (3.87 / 28) of the time to produce the startup module response.

This should probably be done only when the module is actually needed, with the cache versioning based on just the list of tags or their plain (unparsed) wikitext; or something else that allows us to avoid reparsing the wikitext repeatedly.
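
For example, something along these lines (a sketch of the idea only; getRcFiltersConfigSummary and its signature are assumptions, not existing code):

```lang=php
<?php
// Sketch of the idea, not existing code: derive the version hash from cheap,
// unparsed inputs, and leave the expensive message parsing to the code path
// that actually builds the module body. getRcFiltersConfigSummary is a
// hypothetical name.

// Cheap enough to run for every startup-module version computation.
function getRcFiltersConfigSummary( ResourceLoaderContext $context ) {
	return [
		// Tag names and hit counts; changes whenever tags are added or removed.
		'tags' => ChangeTags::tagUsageStatistics(),
		'lang' => $context->getLanguage(),
	];
}

// The expensive part (parsing ~300 i18n messages) would then only run when the
// RCFilters module body itself is built, i.e. inside
// ChangesListSpecialPage::getRcFiltersConfigVars(), not on every startup request.
```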

Krinkle triaged this task as High priority. · May 14 2019, 12:41 PM
Krinkle edited projects, added Performance-Team; removed Performance-Team (Radar).
Krinkle moved this task from Inbox to Blocked on the Growth-Team board.
Krinkle added a subscriber: Catrope.
Krinkle renamed this task from Investigate 10X regression in StartupModule::buildContent generation time to Investigate 2X regression in StartupModule::buildContent generation time. · May 20 2019, 8:29 PM
Krinkle moved this task from Blocked to Inbox on the Growth-Team board.
Krinkle moved this task from Inbox to Blocked on the Growth-Team board.
Krinkle moved this task from Blocked to Inbox on the Growth-Team board.
Krinkle moved this task from Inbox to Blocked or Needs-CR on the Performance-Team board.
Krinkle moved this task from Blocked or Needs-CR to Inbox on the Performance-Team board. · Edited · Jun 10 2019, 1:49 PM

(Queuing for internal re-triage next week)

JTannerWMF moved this task from Inbox to External on the Growth-Team board. · Jun 11 2019, 6:08 PM
JTannerWMF added a subscriber: JTannerWMF.

@Krinkle is going to work on this task, and @Catrope will provide support as needed.

Krinkle claimed this task. · Jun 11 2019, 6:10 PM

Change 516580 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] [WIP] resourceloader: Support 'versionCallback' for computed package files

https://gerrit.wikimedia.org/r/516580

Change 516850 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] [WIP] RCFilters: Reduce startup overhead from 'config.json' computation

https://gerrit.wikimedia.org/r/516850

Krinkle moved this task from Inbox to Doing on the Performance-Team board. · Jun 14 2019, 1:18 PM

Change 516580 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: Support 'versionCallback' for computed package files

https://gerrit.wikimedia.org/r/516580

Krinkle reassigned this task from Krinkle to Catrope. · Jun 26 2019, 2:58 PM

Change 516850 merged by jenkins-bot:
[mediawiki/core@master] RCFilters: Reduce startup overhead from 'config.json' computation

https://gerrit.wikimedia.org/r/516850

Krinkle claimed this task. · Edited · Jul 1 2019, 8:37 PM
Krinkle moved this task from Blocked or Needs-CR to Doing on the Performance-Team board.

This regression should fade away after this week's train goes out (1.34.0-wmf.13). Assigning back to me to confirm.

Krinkle closed this task as Resolved. · Jul 11 2019, 11:09 PM

Confirmed. It went from 3-4% to ~1%. Still non-trivial (one of a dozen or so major things in the startup module), but no longer stands out among the rest.

Catrope reopened this task as Open. · Jul 27 2019, 2:03 AM

I'm now seeing this as 29% of the wall clock time for a startup module request on enwiki: https://performance.wikimedia.org/xhgui/run/callgraph?id=5d3b723cbb8544407f9cb8ae

Part of the issue is that OAuth's hook for dynamically injecting change tag i18n messages is slow (it instantiates a DAO object and does a DB query for every tag separately; 70 tags on enwiki, 100 on Wikidata), but that "only" accounts for about a third of that time (9% of the total). Building a data structure with the contents of ~300 i18n messages is just slow, even without message parsing.

Options:

  • Cache ChangesListSpecialPage::getChangeTagSummary() (and getChangeTagList()) for an hour (or 24 hours), without invalidation (a rough sketch follows after this list)
  • Come up with a more complex way of caching parsed unparameterized messages (with correct invalidation) like what's contemplated in T189229
  • Move this blob out of an RL module and into an API module (but this would add an additional request for every RC/WL page view)
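
For the first option, a rough sketch of the shape it could take (an assumption for illustration, not the eventual patch; the getChangeTagSummary() parameter is guessed):

```lang=php
<?php
// Rough sketch of option 1; not the merged change. Assumes getChangeTagSummary()
// accepts a ResourceLoaderContext; the real signature may differ.
use MediaWiki\MediaWikiServices;

function getChangeTagSummaryCached( ResourceLoaderContext $context ) {
	$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
	return $cache->getWithSetCallback(
		// Vary by language, since the summary embeds localized tag labels.
		$cache->makeKey( 'changeslistspecialpage-changetags', $context->getLanguage() ),
		WANObjectCache::TTL_HOUR,
		function () use ( $context ) {
			// The expensive part: OAuth's per-tag hook/DB queries plus ~300
			// message lookups. Stale for up to an hour, with no invalidation.
			return ChangesListSpecialPage::getChangeTagSummary( $context );
		}
	);
}
```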

Change 525882 had a related patch set uploaded (by Catrope; owner: Catrope):
[mediawiki/core@master] ChangesListSpecialPage: Cache the change tag list and its summary

https://gerrit.wikimedia.org/r/525882

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.
Krinkle removed a project: MediaWiki-ResourceLoader.
Krinkle removed Krinkle as the assignee of this task. · Jul 28 2019, 3:25 PM
Krinkle assigned this task to Catrope. · Mon, Aug 26, 7:10 PM

Change 525882 merged by jenkins-bot:
[mediawiki/core@master] ChangesListSpecialPage: Cache the change tag list and its summary

https://gerrit.wikimedia.org/r/525882

Krinkle updated the task description. · Sat, Sep 14, 10:27 PM
Krinkle moved this task from Untriaged to Not yet on the Performance-Team-publish board.