
Investigate cause of irregular 1s+ spikes on load.php latencies
Closed, ResolvedPublic

Assigned To
Authored By
Krinkle
Sep 16 2019, 9:41 PM
Referenced Files
F31476824: Screenshot 2019-12-15 at 18.34.06.png
Dec 15 2019, 6:53 PM
F31476840: 12-14.png
Dec 15 2019, 6:53 PM
F31476829: 11-7.png
Dec 15 2019, 6:53 PM
F31476831: 11-8.png
Dec 15 2019, 6:53 PM
F31476836: 12-13.png
Dec 15 2019, 6:53 PM
F30375021: Screenshot 2019-09-16 at 22.40.46.png
Sep 16 2019, 9:41 PM

Description

Out of the ~200 req/s of load.php traffic, every few minutes a handful of requests fairly consistently take over a second to respond, and up to 10-20 seconds at times.

Scope of task

Figure out what's causing it.

Screenshot 2019-09-16 at 22.40.46.png (712×2 px, 236 KB)

Event Timeline

Restricted Application added a subscriber: Aklapper.
Krinkle triaged this task as Medium priority. Sep 19 2019, 8:15 PM
Krinkle moved this task from Inbox to Confirmed Problem on the MediaWiki-ResourceLoader board.

To start investigating this, I'm actually looking at a recent regression in overall backend timing for load.php first. Some general improvements here may help bring down the extremes, or at least narrow down what they are caused by.

In early November, the p95 latency was around 230 ms. Around 22 and 23 November, latencies shot up to over 800 ms, then mostly recovered the next day; from 24 November onwards it hovered around 250 ms.

Then, after 4 December, it went up again toward 300-360 ms and has stayed around there ever since.

Flame graphs

Before:

Thu 7 Nov: 11-7.png (1×2 px, 612 KB)
Fri 8 Nov: 11-8.png (1×2 px, 580 KB)

On Thu 7 Nov we collected 6,430 samples, and 6,343 on Fri 8 Nov.

Thu 13 Dec: 12-13.png (1×2 px, 318 KB)
Fri 14 Dec: 12-14.png (1×2 px, 314 KB)

On Thu 13 Dec we collected 8,666 samples, and 7,181 on Fri 14 Dec.

Flame graphs don't measure wall-clock time exactly, but the number of samples collected gives us a good idea of how much time was spent in the code overall (assuming even distribution of the sampler, and a very similar request rate on these four dates). On that basis, we're seeing a 24% increase from this regression alone.
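The arithmetic behind that 24% figure can be checked directly from the sample counts above (a sketch; the equal-sampler and similar-request-rate assumptions are the same ones stated in the text):

```python
# Flame-graph sample counts from the four sampled dates.
before = 6430 + 6343  # Thu 7 Nov + Fri 8 Nov
after = 8666 + 7181   # Thu 13 Dec + Fri 14 Dec

# With an evenly distributed sampler and a similar request rate,
# the ratio of sample counts approximates the change in total
# time spent in the code.
increase = after / before - 1
print(f"{increase:.0%}")  # → 24%
```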

This is further confirmed by looking at where that time is spent. For example, the left side of the flame graphs, where WebStart.php sits, has roughly the same number of samples collected before and after.

WebStart.php:

  • Before: 1885 samples (Fri 8 Nov). After: 1745 (Fri 14 Dec).

The same for ResourceLoader::preloadModuleInfo:

  • Before: 237 (Fri 8 Nov). After: 268 (Fri 14 Dec).

But, looking at ResourceLoaderStartUpModule::getScript we can start to narrow down where the big increase comes from:

  • Before: 1601 (Fri 8 Nov). After: 2158 (Fri 14 Dec).

Specifically: …/ResourceLoaderStartUpModule::getScript/…/ResourceLoaderFileModule::getFileHashes/ResourceLoaderModule::safeFileHash:

  • Before: 245 (3.86% of Fri 8 Nov). After: 505 (7.03% of Fri 14 Dec).

That alone doesn't explain the whole increase, but it's a starting point.

Change 557620 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] resourceloader: Remove ResourceLoaderModule::safeFilemtime()

https://gerrit.wikimedia.org/r/557620

Change 557632 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] resourceloader: Various optimisations in FileModule::getFileHashes()

https://gerrit.wikimedia.org/r/557632

Change 557620 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: Remove ResourceLoaderModule::safeFilemtime()

https://gerrit.wikimedia.org/r/557620

Change 557632 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: Various optimisations in FileModule::getFileHashes()

https://gerrit.wikimedia.org/r/557632

Change 559189 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@wmf/1.35.0-wmf.11] resourceloader: Various optimisations in FileModule::getFileHashes()

https://gerrit.wikimedia.org/r/559189

Change 559189 merged by jenkins-bot:
[mediawiki/core@wmf/1.35.0-wmf.11] resourceloader: Various optimisations in FileModule::getFileHashes()

https://gerrit.wikimedia.org/r/559189

Focussing on T242024 first.

Krinkle claimed this task.
Krinkle removed a project: Wikimedia-Incident.

Looking much better nowadays:

https://grafana.wikimedia.org/d/000000067/resourceloadermodule?orgId=1&from=now-7d&to=now

There is still a random spike in the max() time once or twice a week, but that's a very small proportion. I'm hoping it'll be explained by T245464, but if not, I don't think it's worth investigating right now.