Page MenuHomePhabricator

MessageGroupStats::queueUpdates does large REPLACE query (DBPerformance: affects 1000+ rows)
Closed, ResolvedPublic

Description

Query affected 1553 row(s):
query-m: REPLACE INTO `translate_groupstats` (tgs_group,tgs_lang,tgs_total,tgs_translated,tgs_fuzzy,tgs_proofread) VALUES ('X') [TRX#90add3]
#0 /srv/mediawiki/php-1.33.0-wmf.4/includes/libs/rdbms/database/Database.php(1296): Wikimedia\Rdbms\TransactionProfiler->recordQueryCompletion()
#1 /srv/mediawiki/php-1.33.0-wmf.4/includes/libs/rdbms/database/Database.php(1193): Wikimedia\Rdbms\Database->doProfiledQuery()
#2 /srv/mediawiki/php-1.33.0-wmf.4/includes/libs/rdbms/database/Database.php(2852): Wikimedia\Rdbms\Database->query()
#3 /srv/mediawiki/php-1.33.0-wmf.4/includes/libs/rdbms/database/DatabaseMysqlBase.php(498): Wikimedia\Rdbms\Database->nativeReplace()
#4 /srv/mediawiki/php-1.33.0-wmf.4/includes/libs/rdbms/database/DBConnRef.php(49): Wikimedia\Rdbms\DatabaseMysqlBase->replace()
#5 /srv/mediawiki/php-1.33.0-wmf.4/includes/libs/rdbms/database/DBConnRef.php(412): Wikimedia\Rdbms\DBConnRef->__call()
#6 /srv/mediawiki/php-1.33.0-wmf.4/extensions/Translate/utils/MessageGroupStats.php(537): Wikimedia\Rdbms\DBConnRef->replace()
#7 /srv/mediawiki/php-1.33.0-wmf.4/extensions/Translate/utils/MessageGroupStats.php(558): Closure$MessageGroupStats::queueUpdates()
#8 /srv/mediawiki/php-1.33.0-wmf.4/includes/deferred/MWCallableUpdate.php(34): Closure$MessageGroupStats::withLock()
#9 /srv/mediawiki/php-1.33.0-wmf.4/includes/deferred/DeferredUpdates.php(270): MWCallableUpdate->doUpdate()
#10 /srv/mediawiki/php-1.33.0-wmf.4/includes/deferred/DeferredUpdates.php(216): DeferredUpdates::runUpdate()
#11 /srv/mediawiki/php-1.33.0-wmf.4/includes/deferred/DeferredUpdates.php(140): DeferredUpdates::execute()
#12 /srv/mediawiki/php-1.33.0-wmf.4/includes/MediaWiki.php(905): DeferredUpdates::doUpdates()
#13 /srv/mediawiki/php-1.33.0-wmf.4/includes/MediaWiki.php(728): MediaWiki->restInPeace()
#14 (): Closure$MediaWiki::doPostOutputShutdown()
#15 {main}

Impact

This is a production issue, which alerts about inefficient database updates which may cause lag and other operational issues.

Dashboard

Status

Code was improved in multiple iterations, and those changes have eliminated these warnings completely (and gone even further). Remaining cases are limited to the case when a page is marked for translation for the first time, and they are coming from the <languages> tag when the page is parsed.

Outcome

The reported production warning about large database writes to translation statistics table is no longer happening.

In more detail, there is mechanism to do writes in smaller batches, offloaded to the job queue, instead of doing them during web requests. This is working well, except for the case when page is marked for translation for the first time. It was deemed not necessary to fix this case. For these cases, we write as many rows as there are supported languages, around 400 (varies by the wiki), which is still less than 1000.

Event Timeline

Krinkle renamed this task from MessageGroupStats::queueUpdates does large writes (affects >1000 rows) [DBPerformance] to MessageGroupStats::queueUpdates does large REPLACE query (DBPerformance: affects 1000+ rows).Mar 9 2021, 9:18 PM
Krinkle moved this task from Untriaged to Older on the Wikimedia-production-error board.
Krinkle added a subscriber: Krinkle.

This was not seen in production for a few weeks, but seemingly started again around 3 Mar 2021 at a rate of several thousand a day, and then slowly declining. It's as if maybe it is repeatedly failing and mass re-trying over and over until all no more failing changes are left to ensure being applied.

cap.png (774×2 px, 152 KB)

Logstash mediawiki dashboard query:

Sample URLs:

  • GET meta.wikimedia.org /w/api.php?action=query&format=json&meta=languagestats&lslanguage=rw
  • GET commons.wikimedia.org /wiki/Template:<something>/i18n/ar
  • GET mediawiki.org /wiki/Category:<something>/tr
Expectation (writes <= 0) by MediaWiki::restInPeace not met (actual: 1):
query-m: REPLACE INTO `translate_groupstats` (tgs_group,tgs_lang,tgs_total,tgs_translated,tgs_fuzzy,tgs_proofread) VALUES ('X',N,...,N) [TRX#9f5a6b]
trace
from /srv/mediawiki/php-1.36.0-wmf.33/includes/libs/rdbms/TransactionProfiler.php(378)
#0 /srv/mediawiki/php-1.36.0-wmf.33/includes/libs/rdbms/TransactionProfiler.php(250): Wikimedia\Rdbms\TransactionProfiler->reportExpectationViolated(string, Wikimedia\Rdbms\GeneralizedSql, integer)
#1 /srv/mediawiki/php-1.36.0-wmf.33/includes/libs/rdbms/database/Database.php(1416): Wikimedia\Rdbms\TransactionProfiler->recordQueryCompletion(Wikimedia\Rdbms\GeneralizedSql, double, boolean, integer)
#2 /srv/mediawiki/php-1.36.0-wmf.33/includes/libs/rdbms/database/Database.php(1298): Wikimedia\Rdbms\Database->executeQueryAttempt(string, string, boolean, string, integer)
#3 /srv/mediawiki/php-1.36.0-wmf.33/includes/libs/rdbms/database/Database.php(1227): Wikimedia\Rdbms\Database->executeQuery(string, string, integer)
#4 /srv/mediawiki/php-1.36.0-wmf.33/includes/libs/rdbms/database/DatabaseMysqlBase.php(1367): Wikimedia\Rdbms\Database->query(string, string, integer)
#5 /srv/mediawiki/php-1.36.0-wmf.33/includes/libs/rdbms/database/Database.php(3256): Wikimedia\Rdbms\DatabaseMysqlBase->doReplace(string, array, array, string)
#6 /srv/mediawiki/php-1.36.0-wmf.33/includes/libs/rdbms/database/DBConnRef.php(68): Wikimedia\Rdbms\Database->replace(string, array, array, string)
#7 /srv/mediawiki/php-1.36.0-wmf.33/includes/libs/rdbms/database/DBConnRef.php(496): Wikimedia\Rdbms\DBConnRef->__call(string, array)
#8 /srv/mediawiki/php-1.36.0-wmf.33/extensions/Translate/utils/MessageGroupStats.php(623): Wikimedia\Rdbms\DBConnRef->replace(string, array, array, string)
#9 /srv/mediawiki/php-1.36.0-wmf.33/extensions/Translate/utils/MessageGroupStats.php(644): MessageGroupStats::{closure}(Wikimedia\Rdbms\DBConnRef, string)
#10 /srv/mediawiki/php-1.36.0-wmf.33/includes/deferred/MWCallableUpdate.php(38): MessageGroupStats::{closure}()
#11 /srv/mediawiki/php-1.36.0-wmf.33/includes/deferred/DeferredUpdates.php(513): MWCallableUpdate->doUpdate()
#12 /srv/mediawiki/php-1.36.0-wmf.33/includes/deferred/DeferredUpdates.php(390): DeferredUpdates::attemptUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#13 /srv/mediawiki/php-1.36.0-wmf.33/includes/deferred/DeferredUpdates.php(221): DeferredUpdates::run(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti, Monolog\Logger, BufferingStatsdDataFactory, string)
#14 /srv/mediawiki/php-1.36.0-wmf.33/includes/deferred/DeferredUpdatesScope.php(267): DeferredUpdates::{closure}(MWCallableUpdate, integer)
#15 /srv/mediawiki/php-1.36.0-wmf.33/includes/deferred/DeferredUpdatesScope.php(196): DeferredUpdatesScope->processStageQueue(integer, integer, Closure)
#16 /srv/mediawiki/php-1.36.0-wmf.33/includes/deferred/DeferredUpdates.php(242): DeferredUpdatesScope->processUpdates(integer, Closure)
#17 /srv/mediawiki/php-1.36.0-wmf.33/includes/MediaWiki.php(1097): DeferredUpdates::doUpdates(string)
#18 /srv/mediawiki/php-1.36.0-wmf.33/includes/MediaWiki.php(834): MediaWiki->restInPeace()
#19 /srv/mediawiki/php-1.36.0-wmf.33/includes/MediaWiki.php(846): MediaWiki->{closure}()
#20 /srv/mediawiki/php-1.36.0-wmf.33/includes/MediaWiki.php(584): MediaWiki->doPostOutputShutdown()
#21 /srv/mediawiki/php-1.36.0-wmf.33/index.php(53): MediaWiki->run()
#22 /srv/mediawiki/php-1.36.0-wmf.33/index.php(46): wfIndexMain()
#23 /srv/mediawiki/w/index.php(3): require(string)
#24 {main}

Hard to know the cause without further information. The value of tgs_group in the query would be helpful. Is there way to get those without code changes?

Basically this code path is triggered when someone requests translation statistics, but cached statistics are not available, and they are calculated on the fly, and then a deferred update is added to write them to the cache (database).

Regular code paths in Translate do not delete cache values (they are updated on various events). But caches may be missing for new message groups, new languages, or for translatable moves (I think).

@Nikerabbit Does it seem like a bug / downstream problem that a group could have this many rows, and thus you think it doesn't need any form of batching and we should fix the source? If not, then I think it might not matter what group has that many results, as the problem is the fact that it is doing an unbounded write query based on what is presumably user-generated data and thus not limited to something reasonable like 100 rows.

Anyway, if we want to get the group, I suppose some kind of getLogger('Translate')->warning() statement could work, possibly conditional on count($updates) > 100 or some such to reduce noise.

If the information required to compute these is not limited to the current web request, then maybe queuing a job would be better, and then perform the compute and batched inserts/replacements there. The jobs could be de-duplicatable, and at run-time you'd double-check to make sure it is still needed, and use the same locks to gracefully skip/succeed when another job is doing the work already. Something like that?

The batches may contain different groups. I asked for that information because it should help to identify what is the root cause of missing statistics and fix those.

Both of Special:LanguageStats and Special:MessageGroupStats have logic to only use only cached values and spawn a job to fill in missing ones.

Also ApiQueryLanguageStats and ApiQueryMessageGroupStats use only cached values, but they spawn deferred updates to fill missing values. That's why you can see URLs like /w/api.php?action=query&format=json&meta=languagestats&lslanguage=rw. I will change these to use jobs instead.

Then there are other callers, mainly from translatable pages, which are not prepared to handle missing statistics. https://codesearch.wmcloud.org/search/?q=MessageGroupStats%3A%3Afor&i=nope&files=&excludeFiles=&repos= lists the callers.

I'm also looking at MessageGroupStats::queueUpdates method:

	protected static function queueUpdates( $flags ) {
		if ( wfReadOnly() ) {
			return;
		}

		if ( self::$updates === [] ) {
			return;
		}

		$lb = MediaWikiServices::getInstance()->getDBLoadBalancer();
		$dbw = $lb->getLazyConnectionRef( DB_MASTER ); // avoid connecting yet
		$table = self::TABLE;
		$updates = &self::$updates;

		$updateOp = self::withLock(
			$dbw,
			'updates',
			__METHOD__,
			function ( IDatabase $dbw, $method ) use ( $table, &$updates ) {
				// Maybe another deferred update already processed these
				if ( $updates === [] ) {
					return;
				}

				$primaryKey = [ 'tgs_group', 'tgs_lang' ];
				$dbw->replace( $table, [ $primaryKey ], $updates, $method );
				$updates = [];
			}
		);

		if ( $flags & self::FLAG_IMMEDIATE_WRITES ) {
			call_user_func( $updateOp );
		} else {
			DeferredUpdates::addCallableUpdate( $updateOp );
		}
	}

The batching for web requests looks broken. This method is called every time self::$updates % 100 === 0. For jobs this works fine, but for web request, it keeps spawning deferred updates, so that first one will write all the updates, and rest will be no-ops.

Nikerabbit updated the task description. (Show Details)

Change 673009 had a related patch set uploaded (by Nikerabbit; owner: Nikerabbit):
[mediawiki/extensions/Translate@master] Defer statistics rebuild in API modules to jobs

https://gerrit.wikimedia.org/r/673009

Change 673014 had a related patch set uploaded (by Nikerabbit; owner: Nikerabbit):
[mediawiki/extensions/Translate@master] Add debug logging for big stats updates

https://gerrit.wikimedia.org/r/673014

The batches may contain different groups. I asked for that information because it should help to identify what is the root cause of missing statistics and fix those.

Thanks, I understand now!

Both of Special:LanguageStats and Special:MessageGroupStats have logic to only use only cached values and spawn a job to fill in missing ones.

Also ApiQueryLanguageStats and ApiQueryMessageGroupStats use only cached values, but they spawn deferred updates to fill missing values. That's why you can see URLs like /w/api.php?action=query&format=json&meta=languagestats&lslanguage=rw. I will change these to use jobs instead.

👍

Then there are other callers, mainly from translatable pages, which are not prepared to handle missing statistics. https://codesearch.wmcloud.org/search/?q=MessageGroupStats%3A%3Afor&i=nope&files=&excludeFiles=&repos= lists the callers.

If that can be limited to happening within (other) maintenance scripts or jobs only that would be ideal, but I guess that's not the case here since it's directly connected to a web request. I suppose in that case maybe take the compute hit (as now) but don't write the result to the DB? It can still use a job to schedule the real backfill I suppose, or a deferred update to write them - in batches - if it is okay to have a delay between the compute and the db write (still fresh? etc...). This last pattern is actually something ResourceLoader does today, but it's indeed a compromise, until a bigger refactor can happen.

When not prepared to handle missing stats - are these all POST requests? If not, you may run into execution timeouts as well, which, not saving to the DB could actually help with a little bit.

Change 673014 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] Add debug logging for big stats updates

https://gerrit.wikimedia.org/r/673014

New log entries are coming in, and I looked into one of them. This is only 451 rows, e.g. all languages for one group. The page was re-marked for translation.

Apr 14, 2021 @ 06:53:06.071mw1300TranslationsUpdateJob [Extension:SimpleBatchUpload]: Starting TranslationsUpdateJob
Apr 14, 2021 @ 06:53:06.440mw1300TranslationsUpdateJob [Extension:SimpleBatchUpload]: Finished running 14 MessageUpdate jobs for 14 sections
Apr 14, 2021 @ 06:53:06.572mw1300TranslationsUpdateJob [Extension:SimpleBatchUpload]: Cleared caches
Apr 14, 2021 @ 06:53:07.134mw1300TranslationsUpdateJob [Extension:SimpleBatchUpload]: Updated the message group statsThe stats are calculated now, but saving is deferred due to missing FLAG_IMMEDIATE_WRITES
Apr 14, 2021 @ 06:53:07.137mw1300TranslationsUpdateJob [Extension:SimpleBatchUpload]: Finished purging
Apr 14, 2021 @ 06:53:07.162mw1300TranslationsUpdateJob [Extension:SimpleBatchUpload]: Added 5 RenderJobs to the queueAfter this, the spawned jobs are being executed on different executors, the mw1300 starts running deferred updates
Apr 14, 2021 @ 06:53:07.186mw1334TranslateRenderJob [Extension:SimpleBatchUpload/en]: Starting TranslateRenderJob
Apr 14, 2021 @ 06:53:07.187mw1307TranslateRenderJob [Extension:SimpleBatchUpload/ja]: Starting TranslateRenderJob
Apr 14, 2021 @ 06:53:07.188mw1300TranslationsUpdateJob [Extension:SimpleBatchUpload]: Finished TranslationsUpdateJob
Apr 14, 2021 @ 06:53:07.210mw1338TranslateRenderJob [Extension:SimpleBatchUpload/pt-br]: Starting TranslateRenderJob
Apr 14, 2021 @ 06:53:07.228mw1338[MessageIndex] Started rebuild. Initiated by MediaWiki\Extension\EventBus\JobExecutor->execute/MessageIndexRebuildJob->run/MessageIndex->rebuild
Apr 14, 2021 @ 06:53:07.343mw1338[MessageIndex] Got lock in 0.00093197822570801
Apr 14, 2021 @ 06:53:08.405mw1307TranslateRenderJob [Extension:SimpleBatchUpload/ja]: Finished TranslateRenderJob
Apr 14, 2021 @ 06:53:08.405mw1307TranslateRenderJob [Extension:SimpleBatchUpload/ja]: Finished page edit operation
Apr 14, 2021 @ 06:53:08.412mw1338TranslateRenderJob [Extension:SimpleBatchUpload/pt-br]: Finished TranslateRenderJob
Apr 14, 2021 @ 06:53:08.412mw1338TranslateRenderJob [Extension:SimpleBatchUpload/pt-br]: Finished page edit operation
Apr 14, 2021 @ 06:53:08.447mw1334TranslateRenderJob [Extension:SimpleBatchUpload/en]: Finished TranslateRenderJob
Apr 14, 2021 @ 06:53:08.447mw1334TranslateRenderJob [Extension:SimpleBatchUpload/en]: Finished page edit operation
Apr 14, 2021 @ 06:53:09.573mw1293TranslateRenderJob [Extension:SimpleBatchUpload/tr]: Starting TranslateRenderJob
Apr 14, 2021 @ 06:53:10.837mw1293TranslateRenderJob [Extension:SimpleBatchUpload/tr]: Finished TranslateRenderJob
Apr 14, 2021 @ 06:53:10.837mw1293TranslateRenderJob [Extension:SimpleBatchUpload/tr]: Finished page edit operation
Apr 14, 2021 @ 06:53:13.209mw1338[MessageIndex] Finished critical section in 5.8656761646271
Apr 14, 2021 @ 06:53:27.661mw1300Huge translation update of 451 rows for group(s) page-Extension:SimpleBatchUploadFinally, after 20 seconds, the updates have been written to the database. It is not clear what else is happening in the deferred updates. Any code reading the stats in the meanwhile reads stale values

So far no entries with > 1000 rows.

Change 679308 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] TranslationsUpdateJob: Use MessageGroupStats::FLAG_IMMEDIATE_WRITES

https://gerrit.wikimedia.org/r/679308

Change 679532 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/extensions/Translate@master] MessageGroupStats: Remove use of by-ref $updates variable

https://gerrit.wikimedia.org/r/679532

Change 679308 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] TranslationsUpdateJob: Use MessageGroupStats::FLAG_IMMEDIATE_WRITES

https://gerrit.wikimedia.org/r/679308

Change 673009 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] Defer statistics rebuild in API modules to jobs

https://gerrit.wikimedia.org/r/673009

Planning to return to this in two weeks time to recheck for frequency and causes of the big writes.

For past week there was 271 instances of "Huge translation update of {count} rows for group(s) {groups}". Out of them 229 were /rpc/RunSingleJob.php which I means they are coming from the job queue.

I am submitting a patch that should fix majority of them.

Change 693902 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] MessageGroupStatesUpdaterJob: Avoid large stats inserts

https://gerrit.wikimedia.org/r/693902

Change 693902 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] MessageGroupStatesUpdaterJob: Avoid large stats inserts

https://gerrit.wikimedia.org/r/693902

Change 697594 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] MessageGroupStats: Log callers for "Huge translation update" log event

https://gerrit.wikimedia.org/r/697594

Change 697594 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] MessageGroupStats: Log callers for "Huge translation update" log event

https://gerrit.wikimedia.org/r/697594

Looking at callers, they seem to be all coming from parsing <languages/> tag, and from the job queue. I don't know which job, nor do I know how it ends up creating more rows than there are languages.

Change 700673 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] MessageGroupStats: Log more callers

https://gerrit.wikimedia.org/r/700673

Change 700681 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] MessageGroupStats: check for values in queued updates

https://gerrit.wikimedia.org/r/700681

Change 700673 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] MessageGroupStats: Log more callers

https://gerrit.wikimedia.org/r/700673

Change 701727 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/Translate@master] MessageGroupStats: Add $flags to clearGroup

https://gerrit.wikimedia.org/r/701727

Change 700681 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] MessageGroupStats: check for values in queued updates

https://gerrit.wikimedia.org/r/700681

Change 701727 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] MessageGroupStats: Add $flags to clearGroup

https://gerrit.wikimedia.org/r/701727

Change 679532 merged by jenkins-bot:

[mediawiki/extensions/Translate@master] MessageGroupStats: Remove use of by-ref $updates variable

https://gerrit.wikimedia.org/r/679532

Nikerabbit removed a project: Patch-For-Review.
Nikerabbit updated the task description. (Show Details)