
Investigate why wikidata dump generation is slowed when reading data from dump db servers
Closed, Declined (Public)

Description

As observed in T406429#11288936, when wikibase dumps for Wikidata (or Commons) are configured to read data from the "kube-dumps" database servers, dump generation is significantly slower than expected.
As demonstrated through https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1747/diffs, the slowdown does not occur when using the main database server group.

This task is to coordinate troubleshooting what part of wikibase dump logic, potentially what database queries used in wikibase dump scripts, are slow when run against one server group but not when run against the other.

Event Timeline

Thanks so much for looking into this. Let us know if we can help at all with anything.

I thought it might help to share something about my opinion around this statement...

...potentially what database queries used in wikibase dump scripts, are slow when run against one server group but not when run against the other.

Fundamentally, I don't believe that this is an issue with the database servers, themselves.
I think that it is something related to the code path by which the servers are selected and added to the mediawiki configuration.

I'll explain what I know about the recent (~last 18 months) changes to that code path.

The original reason for wanting to make a change came about because of this ticket: T368098: Dumps generation cause disruption to the production environment
The SQL/XML dumps were regularly impacting on a number of important bots running against en.wikipedia.org because the dumps and vslow load groups.

Thanks @BTullis . I might have been clumsy with words as often.
To confirm we understand each other: Do I understand correctly that you're convinced that database queries used by wikidata dump scripts run, give or take, equally fast in both cases?

I think that it is something related to the code path by which the servers are selected and added to the mediawiki configuration.

I might have misunderstood you for over a week, if not longer. Apologies.
So what you mean by the quoted sentence is that due to an issue in DB server selection logic, the "dump" group gets servers that are somehow slow, and hence the dump generation ends up being excessively slow and times out?

What we decided to do at that point was to use the analytics replica MariaDB servers for all dumps, which we did here:
T382947: Switch dumps 1.0 processes to use the analytics MariadB replicas (dbstore100[7-9])

The mechanism was discussed at some length, but the proposal here: T382947#10436809 was agreed upon.

...I was wondering whether it would be practical to use our DNS SRV records to locate the right dbstore host and port.
I can see a reference to where SRV records are queried for etcd servers here, so I wonder if we might be able to use the same mechanism for _s1-analytics._tcp.eqiad.wmnet and the other sections.

The technical implementation is described here: T382947#10443749

I've created two patches, that implement a version of what we've said:

  • The first allows us to identify dumps deployments via ClusterConfig, as long as either the SERVERGROUP env variable contains the string "dumps" or the file /etc/wikimedia-servergroup contains that string.
  • The second, in case the running program is identified as dumps, will query the SRV records for the various dbstore sections, using _$section-analytics._tcp.eqiad.wmnet, and then repopulate the db data structures using those databases. We just leave the masters in place, and of course external storage/parsercache/x1 are untouched.
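To make the two steps easier to follow, here is a rough Python sketch of the logic as described in the bullets above. It is not the actual mediawiki-config code (which is PHP): the function names, the shape of the `sections` dictionary, the injected `resolve_srv` callable and the fallback to the existing replicas when no SRV records come back are all assumptions made for illustration.

```python
import os
from typing import Callable, Dict, List, Mapping, Optional, Tuple

# Hypothetical resolver type: takes an SRV name, returns (host, port) pairs.
SrvResolver = Callable[[str], List[Tuple[str, int]]]


def is_dumps_deployment(
    env: Optional[Mapping[str, str]] = None,
    servergroup_file: str = "/etc/wikimedia-servergroup",
) -> bool:
    """Step 1: treat the process as a dumps deployment if either the
    SERVERGROUP env variable or the servergroup file contains "dumps"."""
    env = os.environ if env is None else env
    if "dumps" in env.get("SERVERGROUP", ""):
        return True
    try:
        with open(servergroup_file, encoding="utf-8") as fh:
            return "dumps" in fh.read()
    except OSError:
        # A missing or unreadable file is treated as "not dumps".
        return False


def srv_name(section: str, datacenter: str = "eqiad") -> str:
    """Build the SRV record name for a section, e.g. s1 -> _s1-analytics._tcp.eqiad.wmnet."""
    return f"_{section}-analytics._tcp.{datacenter}.wmnet"


def repopulate_replicas(
    sections: Dict[str, dict], resolve_srv: SrvResolver
) -> Dict[str, dict]:
    """Step 2: swap each section's replicas for the SRV targets.

    `sections` uses a made-up layout: {section: {"master": str, "replicas": [str]}}.
    Masters are left in place. Keeping the old replicas when no SRV record
    is returned is an assumption of this sketch, not documented behaviour.
    """
    updated = {}
    for section, conf in sections.items():
        targets = resolve_srv(srv_name(section))
        replicas = [f"{host}:{port}" for host, port in targets] or list(conf["replicas"])
        updated[section] = {"master": conf["master"], "replicas": replicas}
    return updated
```

Passing the resolver in as a parameter keeps the sketch free of any particular DNS library, and makes it possible to try the selection logic with a fake resolver that returns fixed hosts.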

These two patches are, respectively:

This change to the mediawiki-config was then deployed on Jan 20th 2025.
It was tested for a week against the enwiki dump, then enabled globally for all dumps (including wikibase dumps) on Jan 29th.

The problem with the wikibase dumps was first identified on Feb 12th in T386255: wmf.wikidata_item_page_link and wmf.wikidata_entity snapshots stuck at 2025-01-20
Investigation on that ticket continued into March. The observations were the same as the recent incident, in that the wikibase dumps didn't show any errors, they simply went really slowly by comparison.
We then created: T389199: Fix a performance regression affecting wikibase dumps when using mediawiki analytics replica of s8 - dbstore1009 after identifying the performance regression.

Thanks @BTullis . I might have been clumsy with words as often.

:wikiheart: - I had submitted this comment too early, by mistake, so I think that my own response came across as brusque - Apologies.
I didn't feel that there was any clumsiness, I just wanted to share how the sequence of events had happened, from my perspective.

So what you mean by the quoted sentence is that due to an issue in DB server selection logic, the "dump" group gets servers that are somehow slow, and hence the dump generation ends up being excessively slow and times out?

Yes, I think that it's definitely in this field. What I meant to say before accidentally posting my previous half-finished comment is that the hardware for the core database servers and the analytics replicas is, more or less, identical.
The analytics replicas use the same hardware and storage configuration, so I think that we can effectively rule out hardware differences as a cause of this massive discrepancy in the runtimes of these wikibase dumps.
We can also see from various graphs that the weekly wikibase dumps do not put significant load on the analytics replica servers.

Here's a visual representation of the discrepancy in run-time, from just one of the wikibase dump types.

image.png (1,855×669 px, 95 KB)

My feeling is that we got close to the root cause when we were investigating with @hoo in T389199: Fix a performance regression affecting wikibase dumps when using mediawiki analytics replica of s8 - dbstore1009 - but were unable to identify it specifically.

Historically, wikibase dumps had some specific configuration that doesn't affect the other dump types, and I feel that it is likely within this area that a problem with the new database server selection logic occurs.

These old tickets may hold some clues:

I feel that if we could perhaps undo some of the changes made in these tickets, so that the selection of the database servers used by wikibase follows the standard path for mediawiki (if there is such a thing) then we might be able to fix this performance regression.

Perhaps we might find that a lot of DNS SRV type queries are being sent by the wikibase dumps, if it tries to reload its config on the fly. Or perhaps it's reloading its configuration between every request. I'm not sure.
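One way to check this hypothesis would be to count how often the SRV lookup actually runs compared with how often the configuration asks for it. The Python sketch below is only an illustration of that idea and does not describe what MediaWiki or Wikibase does today: `CountingSrvCache`, its TTL and the injected `resolve` callable are all hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple

Target = Tuple[str, int]  # (host, port)


class CountingSrvCache:
    """Wrap an SRV resolver, cache its answers for a TTL, and count calls.

    `requests` counts every call to get(); `lookups` counts only the calls
    that reached the underlying resolver. A large gap between the two would
    suggest the configuration is being re-resolved far more often than needed.
    """

    def __init__(
        self,
        resolve: Callable[[str], List[Target]],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, List[Target]]] = {}
        self.requests = 0
        self.lookups = 0

    def get(self, name: str) -> List[Target]:
        self.requests += 1
        now = self._clock()
        entry = self._cache.get(name)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: no DNS query needed
        self.lookups += 1
        targets = self._resolve(name)
        self._cache[name] = (now, targets)
        return targets
```

If the counters showed one lookup per request without the cache, that would point at repeated config reloads rather than slow queries on the database servers.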

I only hope that this has been somewhat helpful. If you have any questions, please don't hesitate to reach out.

I'm closing this investigation given that WMDE won't be picking up this issue any time soon, since the WMF is still debating the future strategy on Wikidata dumps.