
Dumps generation cause disruption to the production environment
Open, In Progress, High, Public

Description

Today, June 20th 2024 we got paged because of MariaDB Replica lag on the s1 cluster, on db1206.

The configured threshold for paging is 300 seconds, and the actual value was 300.70 seconds, ever so slightly above the limit.

Shortly afterwards, the issue resolved itself without manual intervention.

A first investigation by the DBAs showed:

18:50 < Amir1> db1206 starts lagging a lot when dumps start, something in the query is either wrong or it bombard the replica. Either way, it needs to be investigated.
18:55 < Amir1> > | 236130644 | wikiadmin2023   | 10.64.0.157:37742    | enwiki | Query     |       1 | Creating sort index                                    | SELECT /* WikiExporter::dumpPages  */  /*!  .. STRAIGHT_JOIN */ re

Notification Type: PROBLEM

Service: MariaDB Replica Lag: s1 #page
Host: db1206 #page
Address: 10.64.16.89
State: CRITICAL

Date/Time: Thu Jun 20 18:15:09 UTC 2024

Notes URLs: https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica

Acknowledged by :

Additional Info:

CRITICAL slave_sql_lag Replication lag: 300.70 seconds


The query:

SELECT /* WikiExporter::dumpPages  */  /*! STRAIGHT_JOIN */ rev_id,rev_page,rev_actor,actor_rev_user.actor_user AS `rev_user`,actor_rev_user.actor_name AS `rev_user_text`,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,comment_rev_comment.comment_text AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,slot_revision_id,slot_content_id,slot_origin,slot_role_id,content_size,content_sha1,content_address,content_model  FROM `revision` JOIN `page` ON ((rev_page=page_id)) JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = rev_actor)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = rev_comment_id)) JOIN `slots` ON ((slot_revision_id = rev_id)) JOIN `content` ON ((content_id = slot_content_id))   WHERE (page_id >= 673734 AND page_id < 673816) AND ((rev_page > 0 OR (rev_page = 0 AND rev_id > 0)))  ORDER BY rev_page ASC,rev_id ASC LIMIT 50000
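For context, the WHERE clause above implements keyset (seek) pagination: each batch resumes strictly after the last (rev_page, rev_id) pair seen, rather than using OFFSET. A minimal sketch of that resume logic (illustrative Python, not MediaWiki code; the names are hypothetical):

```python
def next_batch(rows, last_page, last_id, limit):
    """Keyset pagination: return up to `limit` rows strictly after
    (last_page, last_id).

    Mirrors the SQL predicate
      (rev_page > X OR (rev_page = X AND rev_id > Y))
      ORDER BY rev_page ASC, rev_id ASC LIMIT N
    using Python's lexicographic tuple comparison.
    `rows` is an iterable of (rev_page, rev_id) pairs.
    """
    keyed = sorted(rows)  # ORDER BY rev_page ASC, rev_id ASC
    return [r for r in keyed if r > (last_page, last_id)][:limit]

revs = [(673734, 10), (673734, 11), (673735, 3), (673815, 99)]
print(next_batch(revs, 673734, 10, 2))  # [(673734, 11), (673735, 3)]
```

The first batch of a page range uses (0, 0) as the resume key, which is why the query above reads `rev_page > 0 OR (rev_page = 0 AND rev_id > 0)`.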

In addition, dumps generation for English Wikipedia also caused network saturation in eqiad:

 <Amir1>	yup it's dumps: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=snapshot1012&var-datasource=thanos&var-cluster=dumps&from=1719089686711&to=1719101545535
<Amir1>	it seems to be only snapshot1012 and that host has enwiki dump running

This had severe consequences, including a full outage for editing that persisted for more than half an hour.


Event Timeline


We can't fully isolate the replica from the rest of the traffic but as you can see all alerts are on db1206.

That is the part I never understood. Why can't we fully isolate the replica, even if just temporarily until Dumps 2.0 lands?

We have the capacity to lose a replica for a couple of days (even two replicas now, with multi-dc), but we don't have the capacity for the sustained loss of a replica over a long period of time. DB servers are quite expensive (each is north of $30K) and we need to be careful in spending. And even if we can find the budget for it from somewhere (let's say the data engineering budget), it needs to go through the process for new hardware, which takes around a quarter (budget, approvals, procurement, racking, setup, ...)

I have prepared a patch (T373904) to lower the number of slots used for the enwiki dumps, as suggested by @xcollazo. This won't take effect until September 20th, when the next dumps kick off.

As an alternative approach, I have been wondering whether we should switch the DB servers that the snapshot servers use from the core servers to the analytics replica servers running on dbstore100[7-9].

For instance, the hardware is pretty much identical between db1206 and dbstore1008. dbstore1008 runs several sections (s1, s5, s7) whereas db1206 only runs s1, so that would be a bit of a concern. However, we can see that the utilization of dbstore1008 is pretty low, even at the beginning of the month, when we run sqoop jobs from it. This looks like under-utilization to me.

We would probably need to adjust the grants on these dbstore servers to allow the dumps, but I don't think that should be a problem.

This approach seems to me like it would be a sensible way to split the concerns between production mediawiki and the dumps architecture.
It would shift more of the responsibility of the database availability for dumps to the Data-Platform-SRE team and substantially mitigate this risk to production mediawiki services.

The main thing that I'm not sure about is how we would go about configuring the snapshot servers to use these database servers, because presumably this would have to go through dbctl and etcd.

There are custom port numbers for the different sections. We already have some DNS aliases in use, e.g. s1-analytics-replica.eqiad.wmnet, plus SRV records to help determine the right hostname and port.
See: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/MariaDB#Database_setup for more information on how these details are set up.

Maybe someone more familiar with mediawiki-config than I am could help me to understand how feasible this reconfiguration of the snapshot workers is.

I have prepared a patch (T373904) to lower the number of slots used for the enwiki dumps, as suggested by @xcollazo. This won't take effect until September 20th, when the next dumps kick off.

Just FYI, that patch is deployed now, so we will be monitoring how the enwiki dump is affected, from the run beginning on September 20th.

Ladsgroup renamed this task from Dumps generation without prefetch cause disruption to the production environment to Dumps generation cause disruption to the production environment.Sep 9 2024, 11:34 AM

started alerting again. Hasn't paged yet.

started alerting again. Hasn't paged yet.

Same enwiki query?

I'm not seeing any right now. If it alerts again, I'll copy-paste it here, but I think it's very likely dumps; everything else should be using codfw replicas.

This isn't alerting right now as far as I can tell, but we have new information that's probably related to the original report. Basically, dumps thinks a bunch of revisions are bad when they're not:

milimetric@deploy2002:~$ mwscript-k8s -- maintenance/findBadBlobs.php --wiki=enwiki --revisions 3823741
⏳ Starting maintenance/findBadBlobs.php on Kubernetes as job mw-script.codfw.rd64nxa7 ...
🚀 Job is running. ...
milimetric@deploy2002:~$ K8S_CLUSTER=codfw KUBECONFIG=/etc/kubernetes/mw-script-codfw.config kubectl logs -f job/mw-script.codfw.rd64nxa7 mediawiki-rd64nxa7-app
Scanning 1 ids
	- Scanned a batch of 1 revisions
Found 0 bad revisions.

So these are retried 5 times each, which leads to extra load and broken dumps. This is where the check lives in the code: https://gerrit.wikimedia.org/g/mediawiki/core/+/c74cab847c7e4b675d24d0822c8b76a3a978bde5/maintenance/includes/TextPassDumper.php#690
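The retry behaviour described above can be sketched as follows. This is a hedged illustration of the pattern, not the actual TextPassDumper code; the function names and return convention are assumptions:

```python
def fetch_revision_text(fetch, rev_id, attempts=5):
    """Retry fetching a revision's text up to `attempts` times.

    If every attempt fails, the revision ends up reported as 'bad'
    even when the underlying data is fine, and every retry adds
    database load -- the failure mode discussed in this task.
    """
    for _ in range(attempts):
        text = fetch(rev_id)
        if text is not None:
            return text
    return None  # treated as a bad/broken revision by the dump run

# Demo: a hypothetical backend that fails twice, then succeeds.
calls = {"n": 0}
def flaky(rev_id):
    calls["n"] += 1
    return "revision text" if calls["n"] >= 3 else None

print(fetch_revision_text(flaky, 3823741), calls["n"])
```

When the fetch failures are spurious (as findBadBlobs suggests here), each "bad" revision costs five database round trips instead of one.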

Mentioned in SAL (#wikimedia-operations) [2024-10-15T07:03:28Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2147 (re)pooling @ 25%: post sunday p.age T368098', diff saved to https://phabricator.wikimedia.org/P69889 and previous config saved to /var/cache/conftool/dbconfig/20241015-070327-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-10-15T07:18:33Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2147 (re)pooling @ 50%: post sunday p.age T368098', diff saved to https://phabricator.wikimedia.org/P69891 and previous config saved to /var/cache/conftool/dbconfig/20241015-071833-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-10-15T07:33:38Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2147 (re)pooling @ 75%: post sunday p.age T368098', diff saved to https://phabricator.wikimedia.org/P69893 and previous config saved to /var/cache/conftool/dbconfig/20241015-073338-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-10-15T07:48:44Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2147 (re)pooling @ 100%: post sunday p.age T368098', diff saved to https://phabricator.wikimedia.org/P69897 and previous config saved to /var/cache/conftool/dbconfig/20241015-074843-arnaudb.json

It seems that the issue occurred again today:

<arnaudb> "Creating sort index SELECT /* WikiExporter::dumpPages"
<arnaudb> Amir1: marostegui processlist.log & fullprocesslist.log are available in my homedir on db1206 if needed
<arnaudb> (it has recovered)

Mentioned in SAL (#wikimedia-operations) [2024-11-20T15:33:38Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: host overworked by dumps - T368098

Mentioned in SAL (#wikimedia-operations) [2024-11-20T15:33:51Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: host overworked by dumps - T368098

@xcollazo what the status of this? We keep seeing issues with dumps. We just got the enwiki dumps replica lagged again while dumps were running.

@xcollazo what the status of this? We keep seeing issues with dumps. We just got the enwiki dumps replica lagged again while dumps were running.

@Marostegui I am not actively working on this.

But here is a summary of the things that have been discussed here to stabilize the Dumps:

  • Considering T368098#9921287, we lowered the LIMIT of the offending query from 50K to 10K. This yielded a minor improvement.
  • Considering T368098#9918849, we lowered the maxslots for enwiki. Unfortunately, this made that wiki's dump not work. So we reverted that.

Some possible future work:

  • T368098#9961448 suggests that changing the ORDER BY of the offending query would make it significantly less expensive. But this requires surgery on MediaWiki, and I am not well versed in it. Perhaps someone else could help with that.
  • There is also @BTullis's proposal on T368098#10116854 to simply start running these dumps on the analytics replicas. These replicas are only ever used for analytics and not production.

To stop these pages from reaching upstream SREs, I think the best option would be to go forward with @BTullis's proposal, as it is actionable with the skill set of our team, and if the analytics replicas slow down or lag, it doesn't materially affect our processes. But I know the DPRE SRE team is quite overwhelmed with other work right now, so I'll defer to Ben on whether we can take that work.

@xcollazo I like @BTullis's idea. @BTullis, do you think you could find some time to explore it? I am interested in knowing how the traffic from the dump hosts would flow to the analytics replicas, as I guess we'd require some firewall changes there. I'm not sure how doable it is, but I think we need to explore it.

The current situation isn't ideal, as that host is regularly lagging; we get notified about it, and there may be some user impact during the periods when it is lagging behind.

@BTullis do you think you could find some time to explore this idea.

Yes, I think that this is workable. I'm sorry that we've not been able to have a greater effect on fixing the issue to-date.

We're going to be working extensively on dumps 1.0 in the New Year, as we have committed to get this done by the end of March 2025: T352650: WE 5.4 KR - Hypothesis 5.4.4 - Q3 FY24/25 - Migrate current-generation dumps to run on kubernetes
I think that it should be realistic for us to change the destination database server at the same time, so that we stop using the core servers and switch to the dbstore100[7-9] servers.

I appreciate that we don't want to have these alerts for another three months, or more, while we work on the containerization of dumps, so I'll try to make sure that we prioritize this part of the work accordingly.

Thanks @BTullis. This really needs some priority, as it keeps paging the on-call SREs. Today we got two pages for replication lag.

Should we disable dumps on enwiki during the break? This is generating pages for SREs and also affecting bots (T382625#10419982)

I've sent an email to @LSobanski @KOfori @Ladsgroup @BTullis @xcollazo to see if we can at least disable them during the break, to avoid pages and affecting bots.
The email has been sent because not everyone reads Phabricator notifications every day, especially so close to Monday the 23rd.

As I've said before, dumps should be disabled until they no longer cause db lag. Causing db lag for 6 months is unacceptable.

+1. Bots have been persistently running into errors and impacting on-wiki workflows. For my own bot, I am getting failure emails almost every other day, for months now. Today, I got 18. If this continues, bot operators may have no choice other than to disable maxlag detection.

I've sent an email to @LSobanski @KOfori @Ladsgroup @BTullis @xcollazo to see if we can at least disable them during the break, to avoid pages and affecting bots.
The email has been sent because not everyone reads Phabricator notifications every day, especially so close to Monday the 23rd.

Thanks @Marostegui. I have forwarded your email to @odimitrijevic and @Ahoelzl to make them aware of your request and to draw their attention to this matter.

I don't feel that I have the authority to decide to omit the enwiki dump for 20250101 and to kill any remaining process from the 20241220 run. As you will be able to see from T377594: Fix Dumps - errors exporting good revisions and this report, such an omission is considered a serious data quality incident. There are many downstream consumers of dumps, who would be adversely affected by this, if we were to pause the dumps, as you suggest.

I'm not trying to say that the status quo is acceptable, but I wonder if there are any other options. For example, would it be possible to disable pages for lag on db1206, just for a couple of weeks?

I also appreciate the input from @SD0001 above, about disruption to on-wiki workflows. Again, I can only try to say that many other users' workflows would be adversely affected if the enwiki dumps are skipped, so I don't feel that I can currently make this decision, unilaterally.

When it was lagged, these were the top queries:

SELECT /* WikiExporter::dumpPages  */  /*! STRAIGHT_JOIN */ rev_id,rev_page,rev_actor,actor_rev_user.actor_user AS `rev_user`,actor_rev_user.actor_name AS `rev_user_text`,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,comment_rev_comment.comment_text AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,slot_revision_id,slot_content_id,slot_origin,slot_role_id,content_size,content_sha1,content_address,content_model  FROM `revision` JOIN `page` ON ((rev_page=page_id)) JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = rev_actor)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = rev_comment_id)) JOIN `slots` ON ((slot_revision_id = rev_id)) JOIN `content` ON ((content_id = slot_content_id))   WHERE (page_id >= 56646852 AND page_id < 56649214) AND ((rev_page > 56648004 OR (rev_page = 56648004 AND rev_id > 833368004))) ORDER BY rev_page ASC,rev_id ASC LIMIT 10000;



SELECT /* WikiExporter::dumpPages  */  /*! STRAIGHT_JOIN */ rev_id,rev_page,rev_actor,actor_rev_user.actor_user AS `rev_user`,actor_rev_user.actor_name AS `rev_user_text`,rev_timestamp,rev_minor_edit,rev_deleted,rev_len,rev_parent_id,rev_sha1,comment_rev_comment.comment_text AS `rev_comment_text`,comment_rev_comment.comment_data AS `rev_comment_data`,comment_rev_comment.comment_id AS `rev_comment_cid`,page_namespace,page_title,page_id,page_latest,page_is_redirect,page_len,slot_revision_id,slot_content_id,slot_origin,slot_role_id,content_size,content_sha1,content_address,content_model  FROM `revision` JOIN `page` ON ((rev_page=page_id)) JOIN `actor` `actor_rev_user` ON ((actor_rev_user.actor_id = rev_actor)) JOIN `comment` `comment_rev_comment` ON ((comment_rev_comment.comment_id = rev_comment_id)) JOIN `slots` ON ((slot_revision_id = rev_id)) JOIN `content` ON ((content_id = slot_content_id))   WHERE (page_id >= 35191045 AND page_id < 35192867) AND ((rev_page > 0 OR (rev_page = 0 AND rev_id > 0))) ORDER BY rev_page ASC,rev_id ASC LIMIT 10000;

They are quite slow on other hosts too. Here's the query plan:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: revision
         type: range
possible_keys: PRIMARY,rev_actor_timestamp,rev_page_actor_timestamp,rev_page_timestamp
          key: rev_page_actor_timestamp
      key_len: 4
          ref: NULL
         rows: 19004
        Extra: Using index condition; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
...

(The most important one is the first row)

This really shouldn't be doing a filesort.

Changing the order to ORDER BY rev_page ASC, rev_timestamp ASC, rev_id ASC (or any ordering based on the indexes) would remove the filesort. It doesn't make the query that much faster, but I think it reduces its memory footprint. It might fix the lag; it might not.
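The effect described here can be reproduced on any engine with composite indexes: an ORDER BY that skips a column of the index forces an extra sort step. A small illustration using SQLite (schema simplified to three columns; SQLite reports the sort step as "USE TEMP B-TREE FOR ... ORDER BY" where MariaDB's EXPLAIN says "Using filesort"):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE revision (rev_id INT, rev_page INT, rev_timestamp TEXT)")
# Simplified stand-in for the rev_page_actor_timestamp index picked above.
con.execute("CREATE INDEX rev_page_timestamp ON revision (rev_page, rev_timestamp, rev_id)")

def plan(order_by):
    """Return the EXPLAIN QUERY PLAN detail strings for a dump-style query."""
    rows = con.execute(
        "EXPLAIN QUERY PLAN SELECT rev_id FROM revision "
        "WHERE rev_page >= 673734 ORDER BY " + order_by
    ).fetchall()
    return [r[3] for r in rows]  # column 3 holds the plan detail text

# ORDER BY rev_page, rev_id skips rev_timestamp, so the index order
# cannot be reused and an explicit sort step appears:
print(plan("rev_page, rev_id"))
# An ORDER BY matching the index column order needs no sort:
print(plan("rev_page, rev_timestamp, rev_id"))
```

The same reasoning applies to the MariaDB plan above: the query orders by (rev_page, rev_id) while the chosen index is ordered by (rev_page, rev_actor, rev_timestamp), hence the "Using filesort".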

Based on my understanding, given that these are partial dumps they won't have downstream cascading effects on the internal use cases. We can wait to start the full run when folks are back after the New Year. @BTullis @Marostegui let's go ahead and pause the run.

Thanks @odimitrijevic - @BTullis could you go ahead and disable them? Thanks.

OK, first to kill the current run. Following guidelines from here: https://wikitech.wikimedia.org/wiki/Dumps/Rerunning_a_job#Fixing_a_broken_dump

btullis@snapshot1012:~$ sudo -u dumpsgen bash

dumpsgen@snapshot1012:/home/btullis$ cd /srv/deployment/dumps/dumps/xmldumps-backup

dumpsgen@snapshot1012:/srv/deployment/dumps/dumps/xmldumps-backup$ cat /mnt/dumpsdata/xmldatadumps/private/enwiki/lock_20241220 
snapshot1012.eqiad.wmnet 2795803

That process tree certainly exists:

dumpsgen@snapshot1012:/srv/deployment/dumps/dumps/xmldumps-backup$ pstree -alp 2832990
python3,2832990 ./worker.py --configfile /etc/dumps/confs/wikidump.conf.dumps:en --log --job metacurrentdump,metacurrentdumprecombine --skipdone --exclusive --prereqs --date 20241220
  ├─{python3},2833006
  └─{python3},2833007

But for some reason the dumpadmin.py script isn't giving me a list of PIDs to kill.

dumpsgen@snapshot1012:/srv/deployment/dumps/dumps/xmldumps-backup$ python3 dumpadmin.py --kill --configfile /etc/dumps/confs/wikidump.conf.dumps:en --wiki enwiki --dryrun
would kill processes []

I will kill them manually.

dumpsgen@snapshot1012:/srv/deployment/dumps/dumps/xmldumps-backup$ kill 2832990 2833006 2833007

Then I removed the lock file.

dumpsgen@snapshot1012:/srv/deployment/dumps/dumps/xmldumps-backup$ python3 dumpadmin.py --unlock --wiki enwiki --configfile /etc/dumps/confs/wikidump.conf.dumps:en
dumpsgen@snapshot1012:/srv/deployment/dumps/dumps/xmldumps-backup$ ls -l /mnt/dumpsdata/xmldatadumps/private/enwiki
total 32
drwxr-xr-x 2 dumpsgen dumpsgen 4096 Sep  1 08:05 20240901
drwxr-xr-x 2 dumpsgen dumpsgen 4096 Sep 20 08:05 20240920
drwxr-xr-x 2 dumpsgen dumpsgen 4096 Oct  1 08:05 20241001
drwxr-xr-x 2 dumpsgen dumpsgen 4096 Oct 20 08:05 20241020
drwxr-xr-x 2 dumpsgen dumpsgen 4096 Nov  6 08:05 20241101
drwxr-xr-x 2 dumpsgen dumpsgen 4096 Nov 20 08:05 20241120
drwxr-xr-x 2 dumpsgen dumpsgen 4096 Dec  3 14:29 20241201
drwxr-xr-x 2 dumpsgen dumpsgen 4096 Dec 20 08:05 20241220

We can confirm that the run has stopped, because there are no files open over NFS on this host.

btullis@snapshot1012:~$ sudo lsof -N
btullis@snapshot1012:~$

Next I will work on making sure that the enwiki 20241220 run doesn't get picked up and restarted by the systemd timer on this, or any other of the snapshot hosts.

Just as a point of note, the replication lag on db1206 had already returned to zero since about 01:45 this morning. It was probably just going through a compression part of the process, so it could have triggered lag again when moving on to another part of the database dump.


Change #1106019 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Temporarily disable the enwiki dumps on snapshot1012

https://gerrit.wikimedia.org/r/1106019

I have checked the logic in the fulldumps.sh script here and I am confident that if we disable/absent the systemd timers on snapshot1012, then no other snapshot host will restart the enwiki dump.

The change that I have made in https://gerrit.wikimedia.org/r/1106019 will absent both the partialdumps-rest and the fulldumps-rest services, which means that the 20250101 run will also be paused.

If we revert this change before the 14th of January, then the full dump will start at the next 08:00 or 20:00 point.

Change #1106019 merged by Btullis:

[operations/puppet@production] Temporarily disable the enwiki dumps on snapshot1012

https://gerrit.wikimedia.org/r/1106019

The enwiki dumps are now disabled.

Notice: /Stage[main]/Snapshot::Dumps::Systemdjobs/Systemd::Timer::Job[fulldumps-rest]/Systemd::Timer[fulldumps-rest]/Systemd::Service[fulldumps-rest]/Systemd::Unit[fulldumps-rest.timer]/File[/lib/systemd/system/fulldumps-rest.timer]/ensure: removed
Notice: /Stage[main]/Snapshot::Dumps::Systemdjobs/Systemd::Timer::Job[partialdumps-rest]/Systemd::Timer[partialdumps-rest]/Systemd::Service[partialdumps-rest]/Systemd::Unit[partialdumps-rest.timer]/File[/lib/systemd/system/partialdumps-rest.timer]/ensure: removed

I sent an email to the xmldatadumps-l list explaining that the 20241220 dump of enwiki will not complete and that the 20250101 dump of enwiki will be delayed by a few days.

Thanks Ben!

If we revert this change before the 14th of January, then the full dump will start at the next 08:00 or 20:00 point.

I think we should gather some more thoughts on this. It basically means re-enabling a service that actively pages and disrupts production traffic.

Thanks for the update on the XML data dumps list. I see there's progress on the other side: https://phabricator.wikimedia.org/T382947#10476420 . Hopefully this will allow us to re-enable the dumps soon.

Change #1114991 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] dumps: Re-enable the enwiki dumps on snapshot1012

https://gerrit.wikimedia.org/r/1114991

Change #1114991 merged by Btullis:

[operations/puppet@production] dumps: Re-enable the enwiki dumps on snapshot1012

https://gerrit.wikimedia.org/r/1114991

Change #1128386 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] dumps: Stop using the analytics replicas for misc dumps

https://gerrit.wikimedia.org/r/1128386

Hello. Just FYI, we are planning to switch snapshot1016 back to using the core database servers for miscellaneous dumps, for troubleshooting purposes.
The reason for this is part of the investigation for T386255: wmf.wikidata_item_page_link and wmf.wikidata_entity snapshots stuck at 2025-01-20

We wish to ascertain whether switching from db1167 to dbstore1009 has caused a performance regression in the wikidata entities dumps, or whether the cause is something else.

Change #1128386 merged by Btullis:

[operations/puppet@production] dumps: Stop using the analytics replicas for misc dumps

https://gerrit.wikimedia.org/r/1128386

Just to follow up on this, we have confirmed that there is a performance regression when using dbstore1009 (i.e. s8-analytics-replica.eqiad.wmnet) for the wikibase dumps.

We are investigating this and looking for a fix in T389199: Fix a performance regression affecting wikibase dumps when using mediawiki analytics replica of s8 - dbstore1009.

In the meantime, we have removed /etc/wikimedia-servergroup from snapshot1016 (in https://gerrit.wikimedia.org/r/1128386), so all of the miscellaneous dumps that run on that host will currently be using the core db servers.
For the wikibase dumps against s8, that currently means that they use db1167.

Given that all of the known disruption described on this ticket was related to enwiki and db1206, perhaps we have reached a state where this ticket could be considered resolved.
On the other hand, if you would rather wait until we have successfully migrated the wikibase dumps as well, that's fine with me.

Just to follow up on this, we have confirmed that there is a performance regression when using dbstore1009 (i.e. s8-analytics-replica.eqiad.wmnet) for the wikibase dumps.

We are investigating this and looking for a fix in T389199: Fix a performance regression affecting wikibase dumps when using mediawiki analytics replica of s8 - dbstore1009.

In the meantime, we have removed /etc/wikimedia-servergroup from snapshot1016 (in https://gerrit.wikimedia.org/r/1128386), so all of the miscellaneous dumps that run on that host will currently be using the core db servers.
For the wikibase dumps against s8, that currently means that they use db1167.

We'll keep an eye on this as we're going to be running all the traffic in eqiad for a couple of weeks due to the DC switch T385155: 🧭 Northward Datacentre Switchover (March 2025)

Given that all of the known disruption described on this ticket was related to enwiki and db1206, perhaps we have reached a state where this ticket could be considered resolved.
On the other hand, if you would rather wait until we have successfully migrated the wikibase dumps as well, that's fine with me.

I'd prefer to leave this ticket open so we don't forget about it.

Thank you!

Hello again. It looks like the wikibase dumps performance issue described in T389199: Fix a performance regression affecting wikibase dumps when using mediawiki analytics replica of s8 - dbstore1009 may have returned, since September 25th 2025.
We are currently investigating in T406429: No Wikidata dumps for Week 40 of 2025 (recurring issue).

As part of our troubleshooting, we have just merged this patch: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1747

This will cause the wikibase dumps to start using the core database servers again temporarily, while we troubleshoot the performance issues. I hope this is OK with everyone.

Specifically, I think that the current servers will be:

  • db1248 and db1221 for the commonswiki wikibase dumps
  • db1167 for the wikidata wikibase dumps
btullis@deploy2002:~$ kube-env mediawiki-dumps-legacy-deploy dse-k8s-eqiad

btullis@deploy2002:~$ kubectl get pods
NAME                                                            READY   STATUS    RESTARTS   AGE
enwiki-sql-xml-enwiki-dump-remaining-full-4h2akc8               4/4     Running   0          3d1h
mediawiki-cirrussearch-dump-s4                                  4/4     Running   0          2d19h
mediawiki-dump-commons-mediainfo-json-dump                      4/4     Running   0          32m
mediawiki-dump-commons-mediainfo-rdf-dump                       4/4     Running   0          3d16h
mediawiki-dump-wikidata-all-json-dump                           4/4     Running   0          3d8h
mediawiki-dump-wikidata-all-rdf-dump                            4/4     Running   0          2d12h
mediawiki-dump-wikidata-truthy-rdf-dump                         4/4     Running   0          12h
mediawiki-dumps-legacy-sync-toolbox-78dfff7f4f-5bzt8            2/2     Running   0          10d
mediawiki-dumps-legacy-toolbox-7dc4f66666-554mj                 2/2     Running   0          9d
wikidatawiki-sql-xml-wikidatawiki-dump-remaining-full-keqbd8w   4/4     Running   0          9d

btullis@deploy2002:~$ kubectl exec -it mediawiki-dump-commons-mediainfo-json-dump -- bash
Defaulted container "base" out of: base, mediawiki-production-mcrouter, mediawiki-production-tls-proxy, mediawiki-production-rsyslog
www-data@mediawiki-dump-commons-mediainfo-json-dump:/$ php /srv/mediawiki/multiversion/MWScript.php shell.php --wiki wikidatawiki
Psy Shell v0.12.10 (PHP 8.1.33 — cli) by Justin Hileman

> var_dump($wgLBFactoryConf["groupLoadsBySection"]["s4"]);
array(2) {
  ["dump"]=>
  array(2) {
    ["db1248"]=>
    int(100)
    ["db1221"]=>
    int(100)
  }
  ["vslow"]=>
  array(2) {
    ["db1248"]=>
    int(100)
    ["db1221"]=>
    int(100)
  }
}
= null

> var_dump($wgLBFactoryConf["groupLoadsBySection"]["s8"]);
array(2) {
  ["dump"]=>
  array(1) {
    ["db1167"]=>
    int(100)
  }
  ["vslow"]=>
  array(1) {
    ["db1167"]=>
    int(100)
  }
}
= null

Please do let me know if there are any unforeseen repercussions from this change. We will keep you updated as to whether or not this change of database servers fixes the issue.

Doing it temporarily, and especially on the wikibase dumpers, should be fine™ (we had a lot of issues with the XML dumpers but not with the wikibase ones, unless Manuel has seen issues too)

I don't recall any - let's keep an eye on them though