
Adjust the pool parallelism settings after having performed a load test of the dumpsv1 process
Closed, ResolvedPublic

Description

The dumps v1 tasks can put strain on the databases they query, as well as on the ES (external storage) servers. Running too many dumps tasks in parallel can have a dramatic effect on production if we overload these ES databases, as they store the actual wiki contents for all wikis.

We need to perform some gradual load tests to assess what would be good parallelism settings for the dumps pools. (cc @Ladsgroup )

Reminder: we're planning to set up 2 pools: one for the misc wiki dumps and one for the large wiki dumps. This way, we can adjust parallelism on the fly, as well as ensure that the large wiki dumps start as soon as possible.
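Conceptually, an Airflow pool behaves like a counting semaphore: tasks queue until one of the pool's slots frees up. A minimal stdlib sketch of that throttling behavior (the pool names and slot counts here are illustrative assumptions, not the production configuration):

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pools; slot counts chosen for illustration only.
POOLS = {
    "dumps_misc_wikis": threading.Semaphore(32),
    "dumps_large_wikis": threading.Semaphore(16),
}

completed = []

def run_dump(wiki: str, pool: str) -> None:
    """Run one wiki dump, blocking until a slot in its pool is free."""
    with POOLS[pool]:
        time.sleep(0.01)  # stand-in for the actual dump work
        completed.append(wiki)

# Submit more tasks than slots; the semaphore caps concurrency at 32.
with ThreadPoolExecutor(max_workers=64) as ex:
    for i in range(40):
        ex.submit(run_dump, f"wiki{i}", "dumps_misc_wikis")

print(len(completed))  # → 40
```

Having two independent pools means the large-wiki slots can never be starved by a backlog of misc-wiki tasks, which is what lets the large dumps start as soon as possible.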

Event Timeline

brouberol triaged this task as Medium priority.

Note: we should also consider the effect of the dumps on our Ceph cluster, as they will induce a lot of IO activity on the mediawiki-dumps-legacy-fs CephFS volume.

@Ladsgroup we've just performed a dump test of 64 regular (i.e. non-large) wikis at parallelism=32, observable here. Did it have any adverse effect on the external storage servers?

We have not observed any issues so far \o/

That's great to hear. So far, I've settled on a parallelism of 32 for regular wiki dumps, and I'll probably go with something between 10 and 16 for the large wikis. What remains to be seen is whether we can run the dumps in the allotted time with that new setup. Thanks!
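Whether the dumps fit in the allotted time can be sanity-checked with a back-of-envelope wave model: with parallelism P, N wikis finish in roughly ceil(N/P) waves. A sketch, with wiki counts and per-wiki durations that are purely hypothetical placeholders:

```python
import math

def estimated_wall_time(n_wikis: int, parallelism: int, avg_hours_per_wiki: float) -> float:
    """Rough upper bound: wikis run in successive waves of `parallelism` at a time."""
    return math.ceil(n_wikis / parallelism) * avg_hours_per_wiki

# Hypothetical figures for illustration only, not measured values.
print(estimated_wall_time(800, 32, 2.0))  # → 50.0 hours
```

This ignores variance between wikis (a single slow large wiki dominates its wave), so it is only useful as a lower-bound feasibility check before a real end-to-end run.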

Hi, can you give the exact times the dumpers run? I'm seeing connection pile ups that only show up in x1/es and wondering if they would be related. Here is an example: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=2025-05-11T09:40:00.000Z&to=2025-05-11T10:11:59.000Z&timezone=utc&var-job=$__all&var-server=db1179&var-port=9104&refresh=1m&viewPanel=panel-9

I shouldn't think that they're related to the tests we've been doing on dumps from Airflow.
Here are the only test runs that we have done from an Airflow job in the last week:
https://airflow-test-k8s.wikimedia.org/dags/mediawiki_sql_xml_regular_dumps_a_to_b/grid

(attachment: image.png, 146 KB)

It might still be related to the production dumps running on the snapshot servers, though.

Made https://gerrit.wikimedia.org/r/1145243 (haven't tested it). I will check the timestamp of outages with dump runs and see if I can find anything interesting.

We have run our full-scale dumps v1 orchestration on airflow/k8s, with 32 pool slots for regular-sized wiki dumps and 16 pool slots for large wiki dumps. We're satisfied with the throughput and haven't heard that it negatively impacted the ES servers (while the OG dumps v1 processes were still running) in https://phabricator.wikimedia.org/T397848.

(attachment: Screenshot 2025-07-04 at 17.26.45.png, 259 KB)

I'll close this one.