DiscussionTools isFeatureEnabled check is taking 5% of all requests
Closed, ResolvedPublic

Description

Currently, MediaWiki\Extension\DiscussionTools\Hooks\HookUtils::hasPagePropCached is taking 5.5% of all requests to production by querying page props non-stop. Can you cache this in memcached for an hour or so?

Event Timeline

For comparison, if you look at all other consumers of databases, this is by far the biggest. The second biggest one is only consuming 1.6% of resources.

I see why it's such a big consumer. It's making the same query more than 30 times in the same request: https://logstash.wikimedia.org/goto/b44eab1488241ab5b1f124c72e66bbf5

Edit: 34 times to be exact
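
For readers unfamiliar with the pattern, here is a minimal sketch of per-request memoization, which is what a working static cache should provide: the first lookup hits the database and the remaining 33 are served from memory. This is an illustration in Python, not the DiscussionTools code or the actual patch; `PagePropCache`, `fake_query`, the page ID and the prop name are all hypothetical.

```python
from typing import Callable, Dict, Tuple


class PagePropCache:
    """Per-request memoization of page-prop lookups (illustrative only)."""

    def __init__(self, query: Callable[[int, str], bool]) -> None:
        self._query = query  # hypothetical function that hits the database
        self._cache: Dict[Tuple[int, str], bool] = {}
        self.queries = 0  # number of real database round trips

    def has_page_prop(self, page_id: int, prop: str) -> bool:
        key = (page_id, prop)
        # Test membership rather than truthiness, so a cached False
        # (the page lacks the prop) still counts as a cache hit.
        if key not in self._cache:
            self.queries += 1
            self._cache[key] = self._query(page_id, prop)
        return self._cache[key]


def fake_query(page_id: int, prop: str) -> bool:
    # Stand-in for the real page_props query; always reports "not set".
    return False


cache = PagePropCache(fake_query)
results = [cache.has_page_prop(42, "newsectionlink") for _ in range(34)]
print(cache.queries)  # 1: the other 33 lookups never reach the database
```

Negative results are the easy part to get wrong: a cache that checks `if cached_value:` instead of `if key in cache` re-queries on every call for pages without the prop.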

Change #1030603 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/DiscussionTools@master] Fix static cache access

https://gerrit.wikimedia.org/r/1030603

Found the issue. A review of this would be really appreciated. ^

Ladsgroup triaged this task as High priority.
Ladsgroup moved this task from Triage to In progress on the DBA board.

FWIW, that's roughly 24% of all the load on the databases coming from the main appservers, or something like 20% of the load on all the dbs.

Wow... Nice finding. Let's deploy the patch sooner rather than later. 20% is quite a lot.

Nice one Amir!

Change #1030866 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/DiscussionTools@wmf/1.43.0-wmf.4] Fix static cache access

https://gerrit.wikimedia.org/r/1030866

Change #1030867 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/DiscussionTools@REL1_42] Fix static cache access

https://gerrit.wikimedia.org/r/1030867

Change #1030866 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@wmf/1.43.0-wmf.4] Fix static cache access

https://gerrit.wikimedia.org/r/1030866

Change #1030867 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@REL1_42] Fix static cache access

https://gerrit.wikimedia.org/r/1030867

Mentioned in SAL (#wikimedia-operations) [2024-05-13T07:41:30Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:1030866|Fix static cache access (T364693)]]

Mentioned in SAL (#wikimedia-operations) [2024-05-13T07:44:01Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:1030866|Fix static cache access (T364693)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-05-13T07:58:25Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:1030866|Fix static cache access (T364693)]] (duration: 16m 54s)

Here is the load on the databases going down:
(deploy finished at 7:56)

grafik.png (791×1 px, 437 KB)

Mean latency of mw requests (k8s, currently 80% of traffic):
grafik.png (822×1 px, 118 KB)

Mean latency of mw requests (bare metal)
grafik.png (822×1 px, 142 KB)

Change #1030881 had a related patch set uploaded (by Reedy; author: Amir Sarabadani):

[mediawiki/extensions/DiscussionTools@REL1_41] Fix static cache access

https://gerrit.wikimedia.org/r/1030881

Change #1030881 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@REL1_41] Fix static cache access

https://gerrit.wikimedia.org/r/1030881

Change #1030603 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Fix static cache access

https://gerrit.wikimedia.org/r/1030603

The bug was introduced in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/960158, which was attempting to improve the very same problem :( (T347123: Reduce database queries on parsercached logged-out page views (Sep 2023))

Curiously, this is the second time that we've had the same performance issue in DiscussionTools; the previous one was T297132: DiscussionTools is making duplicate DB requests back to back. I think at least some of the blame lies with the PageProps class being difficult to use for simple cases.

Ladsgroup moved this task from In progress to Done on the DBA board.

It happens. I wish we had better monitoring to see jumps like this. I actually have a plan for an SLO that would have failed with such regressions, forcing us to look at it. Hopefully soon.