Page MenuHomePhabricator

Rdbms overhead due to "SELECT @@GLOBAL.read_only" queries
Closed, ResolvedPublic

Description

I did another profile for T221119: "This namespace is reserved for content page translations" when trying to translate a recently created translation unit since testing with time showed it takes about the same or even longer, contrary to my expectations.

I think I found a major regression in performance for this use case due to huge amount of additional queries:

Previous profile (call count)Latest profile on 1.34-wmf.35
15887 - Wikimedia\Rdbms\Database::query17271 - Wikimedia\Rdbms\Database::query
4279 - section.query-m: SELECT page_id,page_len,page_is_redirect,page_latest,page_restrictions,page_content_model,page_lang FROM page WHERE page_namespace = N AND page_title = 'X' LIMIT N
5790 - section.query-m: SELECT trs_key,trs_text FROM translate_sections WHERE trs_page = N ORDER BY trs_order5803 - section.query-m: SELECT trs_key,trs_text FROM page,translate_sections WHERE page_namespace = N AND page_title = 'X' AND (page_id = trs_page) ORDER BY trs_order
5798 - Wikimedia\Rdbms\LoadBalancer::isMasterConnectionReadOnly5808 - Wikimedia\Rdbms\LoadBalancer::isMasterConnectionReadOnly
10 - Wikimedia\Rdbms\DatabaseMysqlBase::serverIsReadOnly5808 - Wikimedia\Rdbms\DatabaseMysqlBase::serverIsReadOnly
10 - section.query-m: SELECT @@GLOBAL.read_only AS Value5808 - section.query-m: SELECT @@GLOBAL.read_only AS Value

These are taking about 5s of additional time. This happened between May 25 and today.

Event Timeline

Leads I came up with:

Marostegui subscribed.

The READ ONLY increase is interesting, from what I can see it is at https://gerrit.wikimedia.org/g/mediawiki/core/+/209bb5450e684d108d83f2b4ee393b13d3673543/includes/libs/rdbms/database/DatabaseMysqlBase.php#1099 and looks like it was introduced here: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/523660/7/includes/libs/rdbms/database/DatabaseMysqlBase.php

(Removing DBA tag as there is no actionable for us - but I will keep on the task as this is something to keep an eye on for us)

Krinkle renamed this task from Performance regression due to SELECT @@GLOBAL.read_only AS Value to Rdbms overhead due to "SELECT @@GLOBAL.read_only" queries.Jun 4 2020, 1:20 PM
eprodromou added subscribers: aaron, eprodromou.

@aaron this would be a good ticket to discuss while you're visiting for Clinic Duty!

eprodromou raised the priority of this task from Medium to High.Aug 4 2020, 8:39 PM

I'm lacking some context to interpret the profiling run... What method is this for? How is it triggered? On which wiki? How can I reproduce the behavior?

From looking at the code of isMasterConnectionReadOnly, the read-only check should be cached for 3 seconds. So it should call serverIsReadOnly() only a couple of times over the course of 20 seconds, not 5000 times.

It looks like the value isn't being cached, which would imply that LoadBalancer::srvCache isn't set up right. Indeed, it defaults to EmptyBagOStuff for some reason. Seems to me like it should default to a HashBagOStuff. In fact, in Database, it does. But in LoadBalancer, it's an EmptyBagOStuff... for a shot in the dark, i'd change the default to be a HashBagOStuff, and see if that fixes things.

But then, the value of srvCache ultimately comes from the service wiring call to MWLBFactory::applyDefaultConfig, and thus from $services->getLocalServerObjectCache. That should work fine. Unless in this specific case, the LocaBalancer in question is somehow special.

@daniel T253547: Command line profiling not working on production has some context. Basically I'm trying to optimize MessageIndex::rebuildof Translate. The profile is from a debug server in production for metawiki.

It seems not possible to get a profile from non-debug server, so I don't know how to rule out whether this happens on a debug server only.

Analysis:

LoadBalancer::isMasterConnectionReadOnly relies on caching in $this->srvCache. srvCache defaults to EmptyBagOStuff, which would explain why isMasterConnectionReadOnly hits serverIsReadOnly every time. However, LoadBalancer doesn't use the default value for srvCache, since LBFactory actually provides that values. It in turn gets it from ServiceWiring via MWLBFactory. ServiceWiring gets it from ObjectCache::makeLocalServerCache(). But ObjectCache::makeLocalServerCache() will return an EmptyBagOStuff in CLI mode.

I suppose that means that this isn't a real problem for us in production, since we call the job runner via the web. But for people who use maintenance/runJobs.php, it would be a problem.

In general, it seems strange that ObjectCache::makeLocalServerCache() wouldn't just return a HashBagOStuff. Code that wants a "local server cache" would probably work better with a cache that is transient, than with no cache at all. Is there a good reason not to have makeLocalServerCache() return a HashBagOStuff as a fallback?

Analysis:

LoadBalancer::isMasterConnectionReadOnly relies on caching in $this->srvCache. srvCache defaults to EmptyBagOStuff, which would explain why isMasterConnectionReadOnly hits serverIsReadOnly every time. However, LoadBalancer doesn't use the default value for srvCache, since LBFactory actually provides that values. It in turn gets it from ServiceWiring via MWLBFactory. ServiceWiring gets it from ObjectCache::makeLocalServerCache(). But ObjectCache::makeLocalServerCache() will return an EmptyBagOStuff in CLI mode.

I suppose that means that this isn't a real problem for us in production, since we call the job runner via the web. But for people who use maintenance/runJobs.php, it would be a problem.

In general, it seems strange that ObjectCache::makeLocalServerCache() wouldn't just return a HashBagOStuff. Code that wants a "local server cache" would probably work better with a cache that is transient, than with no cache at all. Is there a good reason not to have makeLocalServerCache() return a HashBagOStuff as a fallback?

Maybe some callers might only want caches that persist across script instances and would rather do something special if it does not provide one. OTOH, one could just check getQoS(), though HashBagOStuff should actually set the attributes...

But for people who use maintenance/runJobs.php, it would be a problem.

Translatewiki.net uses runJobs.php, and we have a big message index there, so from my point it's worth fixing for CLI.

Did you notice anything that would explain why this started showing up in May-June?

Maybe some callers might only want caches that persist across script instances and would rather do something special if it does not provide one. OTOH, one could just check getQoS(), though HashBagOStuff should actually set the attributes...

It seems more likely to me that callers just expect a best-effort cache that "just works", especially within the current script run.... I filed T262905, let's discuss there.

Change 627466 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Fix caching in CLI mode.

https://gerrit.wikimedia.org/r/627466

Change 627555 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] LBFactory: use HashBagOStuff in CLI mode

https://gerrit.wikimedia.org/r/627555

Change 627555 merged by jenkins-bot:
[mediawiki/core@master] LBFactory: use HashBagOStuff in CLI mode

https://gerrit.wikimedia.org/r/627555

Fix merged, moving to "done".
@Nikerabbit please confirm that this works and close the ticket.

I'll re-run the profiling once the patch is deployed and close if it looks okay.

Change 627466 abandoned by Daniel Kinzler:
[mediawiki/core@master] Fix caching in CLI mode.

Reason:
Per Krinkle'S comment

https://gerrit.wikimedia.org/r/627466

In latest profile on 1.36.0-wmf.10: 7 - section.query-m: SELECT @@GLOBAL.read_only AS Value