
Find a good way to run the updateVarDumps script on large wikis
Closed, Resolved (Public)

Description

For detailed context, see the parent task, T246539.

This is a maintenance script that will update lots of rows in the abuse_filter_log and text tables: nearly every row in abuse_filter_log, and a number of text rows equal to roughly 75% of that. The script (source) runs in batches of 500 rows and waits for replication to catch up after each batch.
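The strategy the script uses can be sketched roughly as follows (a Python sketch for illustration only; the real script is a PHP maintenance script, and `update_batch` / `wait_for_replication` are hypothetical stand-ins for its helpers):

```python
# Illustrative sketch of the batching strategy used by updateVarDumps.
# update_batch and wait_for_replication are hypothetical stand-ins for
# the real PHP helpers; here they only simulate the work.
BATCH_SIZE = 500

def update_batch(total_rows, start, size):
    """Pretend to update up to `size` rows starting at `start`; return the count."""
    return max(0, min(size, total_rows - start))

def wait_for_replication():
    """Placeholder for MediaWiki's wait-for-replication step between batches."""
    pass

def run(total_rows, batch_size=BATCH_SIZE):
    updated = 0
    start = 0
    while start < total_rows:
        # One small transaction per batch, never running ahead of the replicas.
        updated += update_batch(total_rows, start, batch_size)
        start += batch_size
        wait_for_replication()
    return updated
```

Waiting for replication after every batch is what keeps replica lag bounded even when the total row count is in the millions.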

However, it takes quite a long time to run: on mediawikiwiki, it took 20 minutes to update 140k text rows and 186k abuse_filter_log rows. On wikis like enwiki, where the abuse_filter_log has over 20M rows, this script would take over a day to complete. This doesn't look great, so I'm looking for ways to reduce the pressure on servers & LBs.
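The "over a day" figure follows from scaling the observed mediawikiwiki run linearly to enwiki's table size (an assumption for estimation only; real throughput varies with row size and server load):

```python
# Linear extrapolation of the observed mediawikiwiki run to enwiki.
observed_rows = 186_000      # abuse_filter_log rows updated on mediawikiwiki
observed_minutes = 20
enwiki_rows = 20_000_000     # approximate abuse_filter_log rows on enwiki

estimated_minutes = observed_minutes * enwiki_rows / observed_rows
estimated_hours = estimated_minutes / 60
print(round(estimated_hours, 1))  # → 35.8, i.e. roughly a day and a half
```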

Some ideas include changing the batch size, or making it sleep between batches, or spawning & killing it periodically. What do DBAs recommend here?

Related Objects

Status    Subtype           Assigned
Resolved                    None
Open                        None
Open                        None
Resolved                    None
Resolved                    Daimona
Resolved  PRODUCTION ERROR  Daimona
Resolved  PRODUCTION ERROR  Daimona
Stalled                     None
Stalled                     None
Resolved                    Daimona
Open                        None
Resolved                    Daimona
Resolved                    Daimona
Resolved  PRODUCTION ERROR  Daimona
Resolved  PRODUCTION ERROR  Daimona
Resolved                    Daimona
Resolved                    Daimona
Resolved  PRODUCTION ERROR  Daimona
Resolved                    Urbanecm
Resolved                    Daimona

Event Timeline

Marostegui subscribed.

Actually taking a day isn't that bad if the script is that safe.
500 rows can probably be changed to 1000, but other than that, if we want to find the perfect batch, that requires coordination. However, as a rule of thumb, if you do sleep and wait for replication between batches, that's a very healthy approach that will ensure that we have no lag on the hosts, which is what would be killing us.
Normally, lots of small transactions are preferred over fewer but bigger ones.

My recommendation would be to do some sort of connection pooling, maybe increase your batch from 500 to 1000 rows (for now), and do a wait for replication after each batch. How often does this script run?

> Actually taking a day isn't that bad if the script is that safe.
> 500 rows can probably be changed to 1000, but other than that, if we want to find the perfect batch, that requires coordination.

Thanks for the recommendation! If you wish, I can poke you or Jaime on IRC when running the script on bigger wikis.

> However, as a rule of thumb, if you do sleep and wait for replication between batches, that's a very healthy approach that will ensure that we have no lag on the hosts, which is what would be killing us.

It currently doesn't sleep, but it does wait for replication.

> My recommendation would be to do some sort of connection pooling, maybe increase your batch from 500 to 1000 rows (for now)

I'm unsure about connection pooling, but increasing the batch is simple.

> How often does this script run?

Only once, fortunately.

>> Actually taking a day isn't that bad if the script is that safe.
>> 500 rows can probably be changed to 1000, but other than that, if we want to find the perfect batch, that requires coordination.

> Thanks for the recommendation! If you wish, I can poke you or Jaime on IRC when running the script on bigger wikis.

If you can add it to: https://wikitech.wikimedia.org/wiki/Deployments that's enough. But also give us a heads up on IRC so we are aware the script is running.

>> However, as a rule of thumb, if you do sleep and wait for replication between batches, that's a very healthy approach that will ensure that we have no lag on the hosts, which is what would be killing us.

> It currently doesn't sleep, but it does wait for replication.

It wouldn't hurt to add a few seconds' sleep between batches; since the script is meant to be run just once, being extra careful is worthwhile.

>> My recommendation would be to do some sort of connection pooling, maybe increase your batch from 500 to 1000 rows (for now)

> I'm unsure about connection pooling, but increasing the batch is simple.

>> How often does this script run?

> Only once, fortunately.

Nice, so let's:

  • Add sleep between batches
  • Increase it to 1000
  • Ping us on IRC when you are about to start
  • Add it to the deployments page
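The first two checklist items boil down to a loop along these lines (a Python sketch; `process_all`, the placeholder update step, and the 2-second sleep value are illustrative assumptions, not the script's actual code):

```python
import time

BATCH_SIZE = 1000    # raised from 500 per the DBA recommendation
SLEEP_SECONDS = 2    # illustrative value for "a few seconds" between batches

def process_all(total_rows, sleep=time.sleep):
    """Process all rows in batches, sleeping between batches; returns batch count."""
    start = 0
    batches = 0
    while start < total_rows:
        size = min(BATCH_SIZE, total_rows - start)
        # ... update `size` rows here, then wait for replication ...
        start += size
        batches += 1
        if start < total_rows:
            sleep(SLEEP_SECONDS)  # extra safety margin on top of replication waits
    return batches
```

Injecting the sleep function makes the pause trivial to disable in tests while keeping the production behavior conservative.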

Change 597059 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] updateVarDumps: Add more options, aesthetic changes

https://gerrit.wikimedia.org/r/597059

> If you can add it to: https://wikitech.wikimedia.org/wiki/Deployments that's enough. But also give us a heads up on IRC so we are aware the script is running.

Thanks, will do as soon as Jaime and I find a good date for running the script on the remaining wikis.

> It wouldn't hurt to add a few seconds' sleep between batches; since the script is meant to be run just once, being extra careful is worthwhile.

I was a bit worried about the effect on the server (which would still be busy running those sleep statements), but given that it's a maintenance-only server, I guess there's nothing to worry about. I've added this option on master.

> Nice, so let's:
>
>   • Add sleep between batches
>   • Increase it to 1000
>   • Ping us on IRC when you are about to start
>   • Add it to the deployments page

Wilco, thank you.

Change 597059 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] updateVarDumps: Add more options, aesthetic changes

https://gerrit.wikimedia.org/r/597059

Removing DBA as there are no actionables for us here. I will stay subscribed to the task to make sure I am aware of any updates and of when the run happens.
Thank you!

Daimona claimed this task.

AFAIK, the script now runs without any performance issue.