Out of the ~200 req/s of load.php traffic, every few minutes a handful of requests fairly consistently take over a second to respond, up to 10-20 seconds at times.
Scope of task
Figure out what's causing it.
Reported by Krinkle on Sep 16 2019, 9:41 PM.

Attached screenshots (latency dashboards and flame graphs): Screenshot 2019-09-16 at 22.40.46.png, Screenshot 2019-12-15 at 18.34.06.png, 11-7.png, 11-8.png, 12-13.png, 12-14.png.
To start investigating this, I'm actually looking at a recent regression in overall backend timing for load.php first. Some general improvements here may help bring down the extremes, or at least narrow down what they are caused by.
In early November, the p95 latency was around 230 ms. Around 22 and 23 November latencies shot up to over 800 ms but then recovered again the next day – almost. From 24 November onwards it hovered around 250 ms.
Then, after 4 December, it went up again to 300-360 ms and has stayed around there ever since.
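To make the p95 figures above concrete, here is a minimal sketch (with made-up latency values, not real data) of how a p95 is computed from a window of request timings, and why a 10-20 s tail request barely moves it:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the ceil(p/100 * N)-th smallest value."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Illustrative latencies in ms: mostly fast, a few slow, one extreme outlier.
latencies = [120] * 90 + [300] * 8 + [900, 15000]
print(percentile(latencies, 50))  # → 120
print(percentile(latencies, 95))  # → 300
```

Note that the 15,000 ms outlier is invisible at p95, which is why the slow handful shows up in max() rather than in the percentile graphs.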
Before:
| Thu 7 Nov | Fri 8 Nov |
|---|---|
On Thu 7 Nov we collected 6,430 samples, and 6,343 on Fri 8 Nov.
| Thu 13 Dec | Fri 14 Dec |
|---|---|
On Thu 13 Dec we collected 8,666 samples, and 7,181 on Fri 14 Dec.
Flame graphs don't measure time spent exactly, but the number of samples collected gives us a good idea of how much time was spent in the code overall (given an evenly distributed sampler and a very similar request rate on these four dates). Based on this, we're seeing a 24% increase from this regression alone.
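The 24% figure follows directly from the sample counts quoted above:

```python
# Back-of-the-envelope check of the increase, using the per-day
# flame graph sample counts from the tables above.
before = 6430 + 6343   # Thu 7 Nov + Fri 8 Nov
after = 8666 + 7181    # Thu 13 Dec + Fri 14 Dec
increase = (after - before) / before
print(f"{increase:.1%}")  # → 24.1%
```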
This is further confirmed by looking at where that time is spent. For example, the left side of the flame graphs, where WebStart.php is, has roughly the same number of samples collected before and after.
WebStart.php:
The same for ResourceLoader::preloadModuleInfo:
But, looking at ResourceLoaderStartUpModule::getScript we can start to narrow down where the big increase comes from:
Specifically: …/ResourceLoaderStartUpModule::getScript/…/ResourceLoaderFileModule::getFileHashes/ResourceLoaderModule::safeFileHash:
That alone doesn't explain the whole increase, but it's a starting point.
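For context on why safeFileHash can be hot: hashing every registered module file on every request is expensive. A common way to cheapen such a path is to cache the content hash keyed by (path, mtime, size), so unchanged files skip re-hashing. A minimal sketch in Python; this is hypothetical and is not the contents of the actual MediaWiki patches below:

```python
import hashlib
import os

# Hypothetical cache: (path, mtime, size) -> hex digest.
# Unchanged files hit the cache and skip reading/hashing entirely.
_hash_cache = {}

def safe_file_hash(path):
    """Return a content hash for path, or '' if the file doesn't exist."""
    try:
        st = os.stat(path)
    except OSError:
        return ''
    key = (path, st.st_mtime, st.st_size)
    if key not in _hash_cache:
        with open(path, 'rb') as f:
            _hash_cache[key] = hashlib.sha1(f.read()).hexdigest()
    return _hash_cache[key]
```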
Change 557620 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] resourceloader: Remove ResourceLoaderModule::safeFilemtime()
Change 557632 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] resourceloader: Various optimisations in FileModule::getFileHashes()
Change 557620 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: Remove ResourceLoaderModule::safeFilemtime()
Change 557632 merged by jenkins-bot:
[mediawiki/core@master] resourceloader: Various optimisations in FileModule::getFileHashes()
Change 559189 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@wmf/1.35.0-wmf.11] resourceloader: Various optimisations in FileModule::getFileHashes()
Change 559189 merged by jenkins-bot:
[mediawiki/core@wmf/1.35.0-wmf.11] resourceloader: Various optimisations in FileModule::getFileHashes()
Looking much better nowadays:
https://grafana.wikimedia.org/d/000000067/resourceloadermodule?orgId=1&from=now-7d&to=now
There is still a random spike in the max() time once or twice a week, but that affects a very small proportion of requests. I'm hoping it'll be explained by T245464; if not, it's not worth investigating right now, I think.