Page MenuHomePhabricator

⬆️ Investigate Expectation (masterConns <= 0) errors
Open, Needs TriagePublic

Assigned To
Authored By
Andrew-WMDE
Oct 23 2025, 11:52 AM
Referenced Files
F71028862: image.png
Dec 12 2025, 11:21 AM
F71027617: image.png
Dec 12 2025, 10:02 AM
F70984173: image.png
Dec 11 2025, 10:59 AM

Description

Investigate whether this warning can be resolved, or if they are simply overzealous logging by MediaWiki.

[rdbms] Expectation (masterConns <= 0) by MediaWiki\Actions\ActionEntryPoint::execute not met (actual: 1):

from https://console.cloud.google.com/errors/detail/COyO-aiI_5q2Zg;time=P7D;locations=global?project=wikibase-cloud

Event Timeline

dang changed the task status from Open to Stalled.Nov 5 2025, 10:31 PM
dang moved this task from Doing to To do on the Wikibase Cloud (Kanban Board) board.

I took a look at Logs Explorer for and I have a hypothesis for the warnings:

[rdbms] Expectation (masterConns <= 0) by MediaWiki\MediaWikiEntryPoint::restInPeace not met (actual: 1): [connect to sql-mariadb-primary.default.svc.cluster.local ()]

  • MediaWiki was trying to establish a new master database connection and it expected no active master connections but there was actually an existing one. Reason: we have 2 versions of mediawiki with 2 database connections, causing conflicts

[SQLBagOStuff] Duplicate get(): "mwdb_f036984f01-mwt_93f623752b_:echo:seen:message:time:1" fetched 2 times

  • We have Duplicate get() warning because the application was trying to fetch the same data multiple times (twice in this case) from the same data store. And the warning above also indicated that there was violation of master database number.

In conclusion, these messages and also messages in T408103 are in fact due to the fact that we are having 2 versions of Mediawiki running at the moment. Once we fix the routing problem, these warnings will be resolved.

dang removed dang as the assignee of this task.Nov 5 2025, 10:36 PM
dang subscribed.

We want to look again after the routing stuff has sat for a couple of days. We want to check if this is overzealous logging or if this is some how actually causing a request or a job to fail.

It seems like these did not get resolved yet, after we fixed the routing issues.

Ollie.Shotton_WMDE changed the task status from Stalled to Open.Nov 28 2025, 11:27 AM

I think I dug up this ticket T237143 as relevant to

[SQLBagOStuff] Duplicate get(): "mwdb_f036984f01-mwt_93f623752b_:echo:seen:message:time:1" fetched 2 times

Seems like the solution temporarily is to silence it; or to wait until 1.44+ where there is a fix. An example of how to silence (but for a different cache type) can be seen at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/552257/3/wmf-config/CommonSettings.php

Tarrow updated the task description. (Show Details)

To me it looked maybe a bit similar to T388165 but we aren't using SUL; maybe it is still some Auth Manager thing though?

We looked a bit more and tried to determine which request could be causing this error. We learnt that we can also see this error thrown by other classes for example:
`[rdbms] Expectation (masterConns <= 0) by MediaWiki\Actions\ActionEntryPoint::execute not met (actual: 1):
[connect to sql-mariadb-primary.default.svc.cluster.local ()]`. This was occurring on our production DB

We looked in the +/- 3 seconds around this log and saw the only request to this exact MediaWiki pod was [09/Dec/2025:10:47:58 +0000] "GET /wiki/Item:Q1 HTTP/1.1" 200 7255 "-" "addshore-wikibase-cloud-status/py""

The complete stacktrace was:

[rdbms] Expectation (masterConns <= 0) by MediaWiki\Actions\ActionEntryPoint::execute not met (actual: 1):
[connect to sql-mariadb-primary.default.svc.cluster.local ()]

#0 /var/www/html/w/includes/libs/rdbms/TransactionProfiler.php(269): Wikimedia\Rdbms\TransactionProfiler->reportExpectationViolated(string, string, int)
#1 /var/www/html/w/includes/libs/rdbms/loadbalancer/LoadBalancer.php(807): Wikimedia\Rdbms\TransactionProfiler->recordConnection(string, null, bool)
#2 /var/www/html/w/includes/libs/rdbms/loadmonitor/LoadMonitor.php(244): Wikimedia\Rdbms\LoadBalancer->getServerConnection(int, string, int)
#3 /var/www/html/w/includes/libs/rdbms/loadmonitor/LoadMonitor.php(186): Wikimedia\Rdbms\LoadMonitor->computeServerState(int, null)
#4 /var/www/html/w/includes/libs/objectcache/WANObjectCache.php(1809): Wikimedia\Rdbms\LoadMonitor->Wikimedia\Rdbms\{closure}(bool, int, array, null, array)
#5 /var/www/html/w/includes/libs/objectcache/WANObjectCache.php(1623): Wikimedia\ObjectCache\WANObjectCache->fetchOrRegenerate(string, int, Closure, array, array)
#6 /var/www/html/w/includes/libs/rdbms/loadmonitor/LoadMonitor.php(188): Wikimedia\ObjectCache\WANObjectCache->getWithSetCallback(string, int, Closure, array)
#7 /var/www/html/w/includes/libs/rdbms/loadmonitor/LoadMonitor.php(168): Wikimedia\Rdbms\LoadMonitor->getStateFromWanCache(int, null)
#8 /var/www/html/w/includes/libs/rdbms/loadmonitor/LoadMonitor.php(103): Wikimedia\Rdbms\LoadMonitor->getServerStates(array)
#9 /var/www/html/w/includes/libs/rdbms/loadbalancer/LoadBalancer.php(446): Wikimedia\Rdbms\LoadMonitor->scaleLoads(array)
#10 /var/www/html/w/includes/libs/rdbms/loadbalancer/LoadBalancer.php(772): Wikimedia\Rdbms\LoadBalancer->getReaderIndex(string)
#11 /var/www/html/w/includes/libs/rdbms/database/DBConnRef.php(111): Wikimedia\Rdbms\LoadBalancer->getConnectionInternal(int, array, string, int)
#12 /var/www/html/w/includes/libs/rdbms/database/DBConnRef.php(125): Wikimedia\Rdbms\DBConnRef->ensureConnection()
#13 /var/www/html/w/includes/libs/rdbms/database/DBConnRef.php(369): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#14 /var/www/html/w/includes/libs/rdbms/querybuilder/SelectQueryBuilder.php(793): Wikimedia\Rdbms\DBConnRef->selectRow(array, array, array, string, array, array)
#15 /var/www/html/w/includes/page/WikiPage.php(363): Wikimedia\Rdbms\SelectQueryBuilder->fetchRow()
#16 /var/www/html/w/includes/page/WikiPage.php(390): WikiPage->pageData(Wikimedia\Rdbms\DBConnRef, array, array)
#17 /var/www/html/w/includes/page/WikiPage.php(434): WikiPage->pageDataFromTitle(Wikimedia\Rdbms\DBConnRef, MediaWiki\Title\Title, int)
#18 /var/www/html/w/includes/page/WikiPage.php(549): WikiPage->loadPageData()
#19 /var/www/html/w/includes/page/WikiPage.php(609): WikiPage->exists()
#20 /var/www/html/w/includes/page/WikiPage.php(252): WikiPage->getContentModel()
#21 /var/www/html/w/includes/actions/ActionEntryPoint.php(630): WikiPage->getContentHandler()
#22 /var/www/html/w/includes/actions/ActionEntryPoint.php(508): MediaWiki\Actions\ActionEntryPoint->initializeArticle()
#23 /var/www/html/w/includes/actions/ActionEntryPoint.php(146): MediaWiki\Actions\ActionEntryPoint->performRequest()
#24 /var/www/html/w/includes/MediaWikiEntryPoint.php(200): MediaWiki\Actions\ActionEntryPoint->execute()
#25 /var/www/html/w/index.php(62): MediaWiki\MediaWikiEntryPoint->run()
#26 {main}

One idea we just had looking at this stacktrace is: are we getting a primary DB to write to some SQL cache which is then causing this profiler error?

We believe we have locally reproduced this error by opening an item page, purging the parser cache and then reparsing the page such that it is inserted into the parser cache. We then suppressed it by setting $wgParserCacheType = CACHE_NONE; and $wgParserCacheType = 'redis2';

This probably means that between 1.43 and 1.39 there is some change that means that some transaction profiling is no longer silenced or something)

We struggle to see any other instances of this on phabricator so we have some questions to think about:

  • we should check if anybody else has see this problem since upgrading to MW143?
  • should we use CACHE_DB for the parsercache? Is this a bad idea anyway and we should use redis or memcached or something?
  • is is fine to keep using SQL for the cache but we should just suppress the warnings?
  • is this a bug? should this case be suppressed anyway somehow? should we open some ticket for this?
Tarrow renamed this task from ⬆️ Investigate SQL-related warnings in MW 1.43 to ⬆️ Investigate Expectation (masterConns <= 0) errors.Dec 9 2025, 3:43 PM

I suspect we now see this error post I7ca996618e41b93f488cb5c4de82000bb36e0dd3 from T320873. Since this point the previously suppressed by us DBPerformance log is no longer used for the txnProfiler. Instead it uses the generic rdbms log.

dumping extra context that we miss in production on my dev machine results in:

"context":{"db_log_category":"performance","measure":"masterConns","maxSeconds":0,"by":"MediaWiki\\Actions\\ActionEntryPoint::execute","actualSeconds":1,"query":"[connect to sql-mariadb-primary.default.svc.cluster.local ()]","exception":{},"trxId":null,"fullQuery":"[connect to sql-mariadb-primary.default.svc.cluster.local ()]","dbHost":null}}

maybe we need to now find a way to exclude stuff with "db_log_category":"performance" set in the context?

I looked at this a bit this morning trying to understand our $wgObjectCaches configuration and if I could adjust it so that we'd silence these errors.
I had hoped that perhaps more explicitly setting a DB server here would mean that no TransactionProfiler was used as per T154424#3413306
I failed to make this work by any of:

  • adding an explicit server or servers line to the $wgObjectCaches['db-replicated'] definition. This didn't prevent the error from appearing in the logs
  • attempting to set it in a similar way to pre-update like
$wgObjectCaches['db-replicated'] = [
    'factory' => [ 'ObjectCache', 'newFromParams' ],
        'args'  => [ [
            'class' => SqlBagOStuff::class,
            'purgePeriod' => 5,
            'purgeLimit' => 1000,
        ] ]
    'loggroup'  => 'SQLBagOStuff',
    'reportDupes' => false,
];

I just see some 500s but presumably no useful stack trace

image.png (239×2 px, 114 KB)

The current logspam from this isn't great but I think we should probably not invest loads more time looking into this.

The next two quick things to try:

  • Adjust the profiler limits as per T154424#3532475
  • adjust the CustomLogger we can add a specific silence for this bit of the rdbms channel

tarrow closed https://github.com/wbstack/mediawiki/pull/503

Remove masterConns limit for TransactionProfiler

Tarrow removed a project: Patch-For-Review.

Deployed; I guess we need to see tomorrow morning if this is working?

Check if these logs really have dropped right off?

Ollie.Shotton_WMDE subscribed.

There has been a drop-off in masterConns <= 0 logs, but we are still getting about 10-15 / hour:

image.png (164×1 px, 16 KB)

Query used:

jsonPayload.message:"(masterConns <= 0)"
logName:"stderr"
resource.type="k8s_container"
resource.labels.namespace_name="default"
resource.labels.project_id="wikibase-cloud"
resource.labels.cluster_name="wbaas-3"
resource.labels.container_name="mediawiki"
resource.labels.location="europe-west3-a"

Moving to todo to investigate further.

Unassigning Tom as they are OOO today

Seems like we get different types:

At least of one type of them we got rid of, I could confirm this locally as well. Try filter by this instead [rdbms] Expectation (masterConns <= 0) by MediaWiki\\Actions\\ActionEntryPoint::execute not met
https://cloudlogging.app.goo.gl/dmgXtCcHw1te9uXn7

image.png (236×1 px, 39 KB)

The ones we still see now say [rdbms] Expectation (masterConns <= 0) by MediaWiki\MediaWikiEntryPoint::restInPeace not met

I believe at 15 an hour we can probably now ignore this and think about something else.

I did a tiny bit more looking and I see now both [rdbms] Expectation (masterConns <= 0) by MediaWiki\Actions\ActionEntryPoint::execute not met (actual: 1) and [rdbms] Expectation (masterConns <= 0) by MediaWiki\MediaWikiEntryPoint::restInPeace not met (actual: 1):

Ollie.Shotton_WMDE moved this task from In Review to Done on the Wikibase Cloud (Kanban Board) board.

I notice we also go warnings like this:

[rdbms] Expectation (masterConns <= 0) by MediaWiki\Api\ApiMain::setRequestExpectations not met (actual: 1):

but thanksfully they were supressed by wbstack/mediawiki/pull/503.

Happy to leave at this for now. Created T412698: Reduce '[rdbms] Expectation (masterConns <= 0)' warnings to follow up.