
Fix a performance regression affecting wikibase dumps when using mediawiki analytics replica of s8 - dbstore1009
Closed, ResolvedPublic

Description

We recently carried out T382947: Switch dumps 1.0 processes to use the analytics MariaDB replicas (dbstore100[7-9]), which caused all of the current-generation dump processes to start using dbstore100[7-9] as their preferred database servers.

Performance has been largely unaffected for most of the migrated dump processes.

However, we have observed a serious performance regression in the Wikidata entity dumps (see T386255).

The dumps still progressed, but they were much slower when the file /etc/wikimedia-servergroup was present on the server and contained the word dumps.

Once this file was removed, mediawiki-config no longer applied the dumps-specific database server configuration and the dump processes continued at normal speed.

This ticket is about identifying the cause of the performance regression and removing it, since we would prefer to use dbstore1009 (i.e. the s8 replica) for this dump in future.

Details

Related Changes in Gerrit:
Related Changes in GitLab:

Title: Switch the wikibase dumps to use the core database servers
Reference: repos/data-engineering/airflow-dags!1747
Author: btullis
Source branch: update_wikibase_servergroup
Destination branch: main


Event Timeline

Some leads:

There is an ongoing migration for wikidatawiki over at T183490, although it would be weird for one replica and not another to be affected. But I wanted to point it out:

enwiki ~70.7%
wikidatawiki ~57.5%


All Wikibase-related dumps are super slow, not just the wikidata entity dumps: the wikidata RDF dumps are also slow, and so are the commonswiki mediainfo dumps. It seems like anything stemming from Wikibase. That got me thinking that maybe these scripts are doing something special when not connected to the usual dump / vslow replicas. It seems like T147169: Make sure Wikibase dump maintenance scripts solely use the "dump" db group points to such behavior?

I wonder if @hoo might be able to provide any insights here. I've been trying to dig into how the dbgroupdefault settings are used in Wikibase, such as in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/583734, but I am struggling to trace through the PHP.

Maybe it could also be related to some changes in here: T321770: Entity dumpers should reload db config on the fly

Given the new setup in T382947, we only have one replica for each section (specifically s8 for wikidata).
Therefore, I'm wondering if there is any way that we can disable or bypass the reloading of database configuration for these Wikibase dump processes.

I'll have a more thorough look tomorrow or next week, but one thing to quickly test would be: run the script with --dbgroupdefault somethingthatdoesnotexist and compare that to the speed without that parameter. AFAIR this will make MediaWiki "fall back" to the usual DBs. A sketch of such a comparison follows below.
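
For example, a comparison along these lines, in the same style as the session further down (the dumpJson.php invocation here is an assumption for illustration; only the --dbgroupdefault parameter comes from the suggestion above):

dumpsgen@snapshot1015:~$ # Run with a non-existent group, so MediaWiki falls back to the usual DBs:
dumpsgen@snapshot1015:~$ time $php $multiversionscript extensions/Wikibase/repo/maintenance/dumpJson.php --wiki wikidatawiki --dbgroupdefault somethingthatdoesnotexist > /dev/null
dumpsgen@snapshot1015:~$ # Then the same run without the parameter, and compare the wall-clock times:
dumpsgen@snapshot1015:~$ time $php $multiversionscript extensions/Wikibase/repo/maintenance/dumpJson.php --wiki wikidatawiki > /dev/null

In practice you would want to bound these runs somehow (a full dump takes far too long for a quick test); the point is only the relative timing.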

I looked into this a bit (by comparing manual script runs on snapshot1015 [which has /etc/wikimedia-servergroup] and snapshot1016). The additional time the scripts need seems to be almost completely user time and increases when lowering the batch size.

My current working theory is that Wikibase calls autoReconfigure way too often. Currently we call it after every batch (and each batch runs within a couple of seconds at most)… I will test this hypothesis some more tomorrow and maybe upload a change then for reducing the calls to autoReconfigure to a reasonable amount (TBD).

dumpsgen@snapshot1015:~$ time $php $multiversionscript shell.php --wiki wikidatawiki
Psy Shell v0.12.7 (PHP 8.1.31 — cli) by Justin Hileman
> $db = wbr::getRepoDomainDbFactory()->newRepoDb();
= Wikibase\Lib\Rdbms\RepoDomainDb {#5585}

> $db->autoReconfigure();
= null

> $ts0 = microtime( 1 ); for ( $i = 0; $i < 1e3; $i++ ) { $db->autoReconfigure(); }; $ts1 = microtime( 1 ); echo ( $ts1 - $ts0 ) . ", avg: " . (  $ts1 - $ts0 ) / 1e3;
13.321059942245, avg: 0.013321059942245⏎

So autoReconfigure takes roughly 13 ms. Given the current item count of roughly 115M (+ redirects and other entity types, depending on dump flavour) and the batch size used (2000), this will slow down each shard by at least 13 ms per batch × (115e6 / 2000 ≈ 57,500 batches) ≈ 750 s (shards are run in parallel, so this should also be roughly the overall slowdown). Unless I'm missing something, and even considering the other entity types and redirects, this is probably not the huge slowdown you saw, but it is still significant and I'll address it.

Update 2025-06-16: I missed that autoReconfigure is also run when /etc/wikimedia-servergroup is not present (only the additional call to DBRecordCache::getInstance()->repopulateDbConf differs), so the slowdown should be even smaller than initially described above.

Change #1159525 had a related patch set uploaded (by Hoo man; author: Hoo man):

[mediawiki/extensions/Wikibase@master] DumpEntities: Reconfigure the DB once a minute at most

https://gerrit.wikimedia.org/r/1159525
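
For illustration, the kind of throttling that change describes could look roughly like this (a hypothetical sketch, not the actual Wikibase patch; the member and method names are made up):

// Hypothetical: call the ~13 ms autoReconfigure() at most once per
// minute, instead of after every 2000-entity batch.
private float $lastReconfigure = 0.0;

private function maybeReconfigure( \Wikibase\Lib\Rdbms\RepoDomainDb $db ): void {
	$now = microtime( true );
	if ( $now - $this->lastReconfigure >= 60 ) {
		$db->autoReconfigure();
		$this->lastReconfigure = $now;
	}
}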

After more poking, both on mwdebug1001 (hacking ClusterConfig::getInstance()->isDumps()) and by comparing identical runs on snapshot1015 / snapshot1016, I think that the performance differences are not reproducible anymore. Earlier on I missed that snapshot1016 has a considerably faster CPU than snapshot1015, which should explain the timing difference I saw between these hosts (and which I couldn't reproduce on mwdebug1001).

Maybe this was related to T183490 after all? I think it might be worth a shot to add /etc/wikimedia-servergroup on snapshot1016 again. If we want / need more assurance, we could also hack ClusterConfig::getInstance()->isDumps() there and compare similar dump runs side-by-side.

I uploaded my patch for calling autoReconfigure less often anyway, but I don't think it will make a noticeable difference (but it should make the JSON dump at least 0.2 % faster).

Change #1159525 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] DumpEntities: Reconfigure the DB once a minute at most

https://gerrit.wikimedia.org/r/1159525

Thanks @hoo for all of this investigation work. I'll try to pick it up and see where we are.
It's quite possible that this is no longer an issue and was a transient problem related to T183490.
I'll check so that we can be certain.

It's also worth noting that a lot of changes have happened to dumps in the last few months.
The snapshot hosts are only running the dumps in a backup capacity at the moment and we're about to disable them altogether: Disable all dumps timers on snapshot hosts.

All of the Dumps 1.0 have now been migrated to Airflow and Kubernetes as part of T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes

The wikibase dumps are now run as DAGs that are created by this file: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/test_k8s/dags/dumps/mediawiki_wikibase_dumps.py
You should be able to see them here: https://airflow-test-k8s.wikimedia.org/home?tags=dumps-wikibase

(Screenshot of the Airflow DAG list attached: image.png, 212 KB)

It's still pretty much the same code paths, though, just in containers.

I think that the override works a little differently there: instead of checking for the /etc/wikimedia-servergroup file, it picks up the SERVERGROUP environment variable.
We can see from a recent wikibase run that the following is set in the dump pod:

'env': [{'name': 'SERVERGROUP', 'value': 'kube-dumps'}, <snip>

(Screenshot of the pod environment attached: image.png, 139 KB)

I think that this SERVERGROUP value of kube-dumps means that it matches the dumps trait that was added here: ClusterConfig: add support for dumps trait

...which means that it should already be running with the same settings as if it were on bare-metal with the /etc/wikimedia-servergroup file containing dumps.
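
To make that concrete, here is a rough sketch of how a dumps-trait check could combine the two signals discussed in this task (illustrative only: this is not the actual ClusterConfig code, and the function name and exact matching rules are assumptions):

// Hypothetical check: is this process running with the "dumps" trait?
function isDumpsHost(): bool {
	// Kubernetes pods: the SERVERGROUP env var, e.g. 'kube-dumps'.
	$group = getenv( 'SERVERGROUP' );
	if ( is_string( $group ) && str_contains( $group, 'dumps' ) ) {
		return true;
	}
	// Bare-metal hosts: /etc/wikimedia-servergroup containing the word dumps.
	$file = '/etc/wikimedia-servergroup';
	if ( is_readable( $file ) ) {
		return trim( (string)file_get_contents( $file ) ) === 'dumps';
	}
	return false;
}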

I think I'm going to resolve this task, as we believe the performance issue has gone away.
Thanks again for your help @hoo. If you have any questions about dumps on airflow, please do let me know.