
Find a good way to run the updateVarDumps script on large wikis
Closed, Resolved (Public)

Description

For detailed context, see the parent task, T246539.

This is a maintenance script that will update lots of rows in the abuse_filter_log and text tables: nearly every row in abuse_filter_log, and a number of text rows equal to roughly 75% of that. The script (source) runs in batches of 500 rows and waits for replication to catch up after each batch.
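The strategy the script uses can be sketched roughly as follows (a Python sketch for illustration only; the real script is a PHP maintenance script, and `update_batch` / `wait_for_replication` are hypothetical stand-ins for its helpers):

```python
# Illustrative sketch of the batching strategy used by updateVarDumps.
# update_batch and wait_for_replication are hypothetical stand-ins for
# the real PHP helpers; here they only simulate the work.
BATCH_SIZE = 500

def update_batch(total_rows, start, size):
    """Pretend to update up to `size` rows starting at `start`; return the count."""
    return max(0, min(size, total_rows - start))

def wait_for_replication():
    """Placeholder for MediaWiki's wait-for-replication step between batches."""
    pass

def run(total_rows, batch_size=BATCH_SIZE):
    updated = 0
    start = 0
    while start < total_rows:
        # One small transaction per batch, never running ahead of the replicas.
        updated += update_batch(total_rows, start, batch_size)
        start += batch_size
        wait_for_replication()
    return updated
```

Waiting for replication after every batch is what keeps replica lag bounded even when the total row count is in the millions.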

However, it takes quite a long time to run: on mediawikiwiki, it took 20 minutes to update 140k text rows and 186k abuse_filter_log rows. On wikis like enwiki, where the abuse_filter_log has over 20M rows, this script would take over a day to complete. This doesn't look great, so I'm looking for ways to reduce the pressure on servers & LBs.
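The "over a day" figure follows from scaling the observed mediawikiwiki run linearly to enwiki's table size (an assumption for estimation only; real throughput varies with row size and server load):

```python
# Linear extrapolation of the observed mediawikiwiki run to enwiki.
observed_rows = 186_000      # abuse_filter_log rows updated on mediawikiwiki
observed_minutes = 20
enwiki_rows = 20_000_000     # approximate abuse_filter_log rows on enwiki

estimated_minutes = observed_minutes * enwiki_rows / observed_rows
estimated_hours = estimated_minutes / 60
print(round(estimated_hours, 1))  # → 35.8, i.e. roughly a day and a half
```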

Some ideas include changing the batch size, or making it sleep between batches, or spawning & killing it periodically. What do DBAs recommend here?

Related Objects

Status    Subtype           Assigned
Resolved                    None
Open                        None
Open                        None
Resolved                    None
Resolved                    Daimona
Resolved  PRODUCTION ERROR  Daimona
Resolved  PRODUCTION ERROR  Daimona
Stalled                     None
Stalled                     None
Resolved                    Daimona
Open                        None
Resolved                    Daimona
Resolved                    Daimona
Resolved  PRODUCTION ERROR  Daimona
Resolved  PRODUCTION ERROR  Daimona
Resolved                    Daimona
Resolved                    Daimona
Resolved  PRODUCTION ERROR  Daimona
Resolved                    Urbanecm
Resolved                    Daimona

Event Timeline

Marostegui subscribed.

Actually taking a day isn't that bad if the script is that safe.
500 rows can probably be changed to 1000, but other than that, if we want to find the perfect batch, that requires coordination. However, as a rule of thumb, if you do sleep and wait for replication between batches, that's a very healthy approach that will ensure that we have no lag on the hosts, which is what would be killing us.
Normally, lots of small transactions are preferred over fewer but bigger ones.

My recommendation would be to do some sort of connection pooling, maybe increase your batch from 500 to 1000 rows (for now), and do a wait for replication after each batch. How often does this script run?

> Actually taking a day isn't that bad if the script is that safe.
> 500 rows can probably be changed to 1000, but other than that, if we want to find the perfect batch, that requires coordination.

Thanks for the recommendation! If you wish, I can poke you or Jaime on IRC when running the script on bigger wikis.

> However, as a rule of thumb, if you do sleep and wait for replication between batches, that's a very healthy approach that will ensure that we have no lag on the hosts, which is what would be killing us.

It currently doesn't sleep, but it does wait for replication.

> My recommendation would be to do some sort of connection pooling, maybe increase your batch from 500 to 1000 rows (for now)

I'm unsure about connection pooling, but increasing the batch is simple.

> How often does this script run?

Only once, fortunately.

>> Actually taking a day isn't that bad if the script is that safe.
>> 500 rows can probably be changed to 1000, but other than that, if we want to find the perfect batch, that requires coordination.

> Thanks for the recommendation! If you wish, I can poke you or Jaime on IRC when running the script on bigger wikis.

If you can add it to: https://wikitech.wikimedia.org/wiki/Deployments that's enough. But also give us a heads up on IRC so we are aware the script is running.

>> However, as a rule of thumb, if you do sleep and wait for replication between batches, that's a very healthy approach that will ensure that we have no lag on the hosts, which is what would be killing us.

> It currently doesn't sleep, but it does wait for replication.

It wouldn't hurt to add a few seconds' sleep between batches; since the script is meant to be run just once, being extra careful is worthwhile.

>> My recommendation would be to do some sort of connection pooling, maybe increase your batch from 500 to 1000 rows (for now)

> I'm unsure about connection pooling, but increasing the batch is simple.

>> How often does this script run?

> Only once, fortunately.

Nice, so let's:

  • Add sleep between batches
  • Increase it to 1000
  • Ping us on IRC when you are about to start
  • Add it to the deployments page
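The first two checklist items boil down to a loop along these lines (a Python sketch; `process_all`, the placeholder update step, and the 2-second sleep value are illustrative assumptions, not the script's actual code):

```python
import time

BATCH_SIZE = 1000    # raised from 500 per the DBA recommendation
SLEEP_SECONDS = 2    # illustrative value for "a few seconds" between batches

def process_all(total_rows, sleep=time.sleep):
    """Process all rows in batches, sleeping between batches; returns batch count."""
    start = 0
    batches = 0
    while start < total_rows:
        size = min(BATCH_SIZE, total_rows - start)
        # ... update `size` rows here, then wait for replication ...
        start += size
        batches += 1
        if start < total_rows:
            sleep(SLEEP_SECONDS)  # extra safety margin on top of replication waits
    return batches
```

Injecting the sleep function makes the pause trivial to disable in tests while keeping the production behavior conservative.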

Change 597059 had a related patch set uploaded (by Daimona Eaytoy; owner: Daimona Eaytoy):
[mediawiki/extensions/AbuseFilter@master] updateVarDumps: Add more options, aesthetic changes

https://gerrit.wikimedia.org/r/597059

> If you can add it to: https://wikitech.wikimedia.org/wiki/Deployments that's enough. But also give us a heads up on IRC so we are aware the script is running.

Thanks, will do as soon as Jaime and I find a good date for running the script on the remaining wikis.

> It wouldn't hurt to add a few seconds' sleep between batches; since the script is meant to be run just once, being extra careful is worthwhile.

I was a bit worried about the effect on the server (which would still be busy running those sleep statements), but given that it's a maintenance-only server, I guess there's nothing to worry about. I've added this option on master.

> Nice, so let's:
>
>   • Add sleep between batches
>   • Increase it to 1000
>   • Ping us on IRC when you are about to start
>   • Add it to the deployments page

Wilco, thank you.

Change 597059 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] updateVarDumps: Add more options, aesthetic changes

https://gerrit.wikimedia.org/r/597059

Removing DBA as there are no actionables for us here. I will stay subscribed to the task to make sure I am aware of any updates and of when the run happens.
Thank you!

Daimona claimed this task.

AFAIK, the script now runs without any performance issue.