
Speed up captcha generation
Closed, Resolved · Public

Description

Along with making captcha.py threaded in T157734, there might be further ways to make the whole process quicker.

For example, the code does numerous "store" operations in a for loop...

Event Timeline

Reedy created this task. Feb 9 2017, 9:32 PM
Restricted Application added a subscriber: Aklapper. Feb 9 2017, 9:32 PM
[21:27:26] <Reedy> AaronSchulz: is there a way with filebackend stuff to store many files in one go?
[21:27:33] <Reedy> rather than a for loop calling quickStore?
[21:29:57] <AaronSchulz> like doQuickOperations?
	 * Perform a set of independent file operations on some files.


	 * b) Copy a file system file into storage
	 * @code
	 *     [
	 *         'op'                  => 'store',
	 *         'src'                 => <file system path, FSFile, or TempFSFile>,
	 *         'dst'                 => <storage path>,
	 *         'headers'             => <HTTP header name/value map> # since 1.21
	 *     ]
	 * @endcode
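
To illustrate why one batched call can beat a loop of single calls, here is a minimal Python sketch. It is not MediaWiki code: `FakeBackend`, `store_one`, `store_many`, and `build_store_ops` are hypothetical names, and the operation dictionaries only mirror the shape of the `'op' => 'store'` array above. Whether the real backend gains come from fewer round trips, from concurrency, or from both is not stated in this task.

```python
from dataclasses import dataclass, field


@dataclass
class FakeBackend:
    """Hypothetical stand-in for a storage backend; counts calls made to it."""
    calls: int = 0
    stored: dict = field(default_factory=dict)

    def store_one(self, src, dst):
        # One call per file, analogous to calling quickStore in a loop.
        self.calls += 1
        self.stored[dst] = src

    def store_many(self, ops):
        # One call for the whole batch, analogous to doQuickOperations.
        self.calls += 1
        for op in ops:
            if op["op"] != "store":
                raise ValueError(f"unsupported op: {op['op']}")
            self.stored[op["dst"]] = op["src"]


def build_store_ops(paths):
    """Turn a {src: dst} map into a list of 'store' operation descriptors."""
    return [{"op": "store", "src": src, "dst": dst} for src, dst in paths.items()]


# Hypothetical local paths and storage paths, for illustration only.
paths = {f"/tmp/captcha_{i}.png": f"captcha/{i}.png" for i in range(100)}

looped = FakeBackend()
for src, dst in paths.items():
    looped.store_one(src, dst)

batched = FakeBackend()
batched.store_many(build_store_ops(paths))

print(looped.calls, batched.calls)
```

Both backends end up holding the same files; only the number of calls differs, which is where any per-call overhead would be paid.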
Reedy added a comment. Feb 16 2017, 6:24 PM

So for a full run, generating 10k captchas

reedy@terbium:~$ /usr/local/bin/mwscript extensions/ConfirmEdit/maintenance/GenerateFancyCaptchas.php mediawikiwiki --wordlist=/etc/fancycaptcha/words --font=/usr/share/fonts/truetype/freefont/FreeMonoBoldOblique.ttf --blacklist=/etc/fancycaptcha/badwords --fill=120000 --oldcaptcha
Current number of captchas is 110000.
Generating 10000 new captchas.. Done.

Generated 10000 captchas in 1594.0 seconds
Copying the new captchas to storage... Done.

Copied 14008 captchas to storage in 3178.9 seconds
Removing temporary files... Done.

Whole captchas generation process took 4775.3 seconds
| Process | Time |
| --- | --- |
| Generate Captcha | 26m 34s |
| Copying Captchas | 52m 59s |
| Total | 79m 35s |
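
The minute/second figures can be reproduced from the logged seconds. A minimal sketch, assuming each value is rounded to the nearest whole second (`format_minutes_seconds` is a name chosen here for illustration):

```python
def format_minutes_seconds(seconds):
    """Format a duration in seconds as 'Xm Ys', rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs}s"


# Durations in seconds, taken from the log output above.
timings = {
    "Generate Captcha": 1594.0,
    "Copying Captchas": 3178.9,
    "Total": 4775.3,
}

for name, secs in timings.items():
    print(f"{name}: {format_minutes_seconds(secs)}")
```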

Change 358395 had a related patch set uploaded (by Reedy; owner: Reedy):
[operations/puppet@production] Generate FancyCaptchas in 4 threads

https://gerrit.wikimedia.org/r/358395

Reedy added a comment. Sep 4 2017, 1:36 AM

I should have a look how quick this is running now...

Reedy added a comment. Edited Sep 4 2017, 2:39 AM
reedy@terbium:~$ cat /var/log/mediawiki/generate-fancycaptcha/cron.log-20170901 
Generating 10000 new captchas.. Done.

Generated 10000 captchas in 1180.9 seconds
Getting a list of old captchas to delete... Done.
Copying the new captchas to storage... Done.

Copied 10000 captchas to storage in 525.1 seconds
Deleting 10000 old captchas...
Done.

Deleted 10000 old captchas in 354.1 seconds
Removing temporary files... Done.

Whole captchas generation process took 2061.3 seconds
reedy@terbium:~$

10,000 captchas took 34 minutes.

Compared with the earlier run:

Whole captchas generation process took 4775.3 seconds

Roughly 57% quicker, which is pretty good going from where we were before.

| Process | Old Time (s) | New Time (s) | Improvement |
| --- | --- | --- | --- |
| Generate Captcha | 1594.0 | 1180.9 | -25.9% |
| Copying Captchas | 3178.9 | 525.1 | -83.5% |
| Deleting old Captchas | 1587.4 | 354.1 | -77.7% |
| Total | 4775.3 | 2061.3 | -56.8% |
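
For reference, the Improvement column is the relative change in elapsed time. A short sketch of that arithmetic using the figures from the table (`percent_change` is a helper name chosen here, not part of any script in this task):

```python
def percent_change(old, new):
    """Relative change from old to new, as a percentage rounded to one decimal."""
    return round((new - old) / old * 100, 1)


# (old seconds, new seconds) per step, copied from the table above.
rows = {
    "Generate Captcha": (1594.0, 1180.9),
    "Copying Captchas": (3178.9, 525.1),
    "Deleting old Captchas": (1587.4, 354.1),
    "Total": (4775.3, 2061.3),
}

for name, (old, new) in rows.items():
    print(f"{name}: {percent_change(old, new)}%")
```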

So the generation improvement came from T157734: Add threading to captcha.py, and the deleting and copying improvements came from T157738: Use doQuickOperations instead of foreach loops calling quickStore/quickStore.
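
As a rough illustration of splitting a generation run across workers, here is a minimal sketch. It is not the code from captcha.py: `split_evenly`, `generate_chunk`, and `generate_all` are hypothetical names, and the "captcha" is just a hash string standing in for real image rendering. For CPU-bound rendering in CPython, a process pool may be needed instead of threads to get a real speedup.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def split_evenly(total, workers):
    """Split `total` items into `workers` chunk sizes that differ by at most one."""
    base, extra = divmod(total, workers)
    return [base + (1 if i < extra else 0) for i in range(workers)]


def generate_chunk(worker_id, count):
    """Placeholder for real image generation: returns `count` fake captcha IDs."""
    return [
        hashlib.sha1(f"{worker_id}-{i}".encode()).hexdigest()[:16]
        for i in range(count)
    ]


def generate_all(total, workers=4):
    """Run one chunk per worker and collect all results into a single list."""
    sizes = split_evenly(total, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(generate_chunk, range(workers), sizes)
        return [captcha for chunk in chunks for captcha in chunk]


captchas = generate_all(10000, workers=4)
print(len(captchas), split_evenly(10000, 4))
```

Giving each worker a fixed share up front keeps the sketch simple; a work queue would balance better if individual captchas took very different amounts of time.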

Still need to get https://gerrit.wikimedia.org/r/#/c/358395/ reviewed and deployed. Gonna shove that in Puppet SWAT tomorrow and we'll see how we look again in a month :)

Change 358395 merged by Elukey:
[operations/puppet@production] Generate FancyCaptchas in 4 threads

https://gerrit.wikimedia.org/r/358395

Reedy triaged this task as Normal priority. Sep 6 2017, 4:25 PM
Reedy added a comment. Oct 4 2017, 10:41 PM

So from the 1st October, 2017 run:

Generating 10000 new captchas.. Done.

Generated 10000 captchas in 295.8 seconds
Getting a list of old captchas to delete... Done.
Copying the new captchas to storage... Done.

Copied 10000 captchas to storage in 359.9 seconds
Deleting 10000 old captchas...
Done.

Deleted 10000 old captchas in 289.4 seconds
Removing temporary files... Done.

Whole captchas generation process took 946.5 seconds

lol, so we're down to about 15 minutes, down from around 80 minutes originally

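| Process | Original Time (s) | Time after PHP improvements (s) | Time after threading improvements (s) | Improvement |
| --- | --- | --- | --- | --- |
| Generate Captcha | 1594.0 | 1180.9 | 295.8 | -81.4% |
| Copying Captchas | 3178.9 | 525.1 | 359.9 | -88.7% |
| Deleting old Captchas | 1587.4 | 354.1 | 289.4 | -81.8% |
| Total | 4775.3 | 2061.3 | 946.5 | -80.2% |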

Of course, the key figure here is the 81.4% decrease in the time spent generating captchas compared with the original run. It is also about 75% quicker than after the PHP improvements were done (1180.9 seconds down to 295.8 seconds).
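
Put another way, the generation step sped up by almost exactly the number of threads. A small sketch of that calculation, assuming the only relevant change between the September and October runs was the move to 4 threads and that the September run effectively used one (the task does not state either explicitly):

```python
def speedup(before, after):
    """How many times faster the new run is than the old one."""
    return before / after


THREADS = 4  # from the Puppet change "Generate FancyCaptchas in 4 threads"

# Generation times in seconds from the September and October runs.
gen_speedup = speedup(1180.9, 295.8)

# Parallel efficiency: achieved speedup divided by the ideal (one per thread).
efficiency = gen_speedup / THREADS

print(f"speedup: {gen_speedup:.2f}x, efficiency: {efficiency:.1%}")
```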

The difference in deleting/copying captchas could just be down to terbium/swift load; I'm sure that'll vary somewhat.

That's a hell of a lot better than where we were in February.

I'm gonna make a patch so we regenerate captchas weekly as part of the continuous improvement cycle

Change 382322 had a related patch set uploaded (by Reedy; owner: Reedy):
[operations/puppet@production] Regenerate FancyCaptchas weekly rather than monthly

https://gerrit.wikimedia.org/r/382322

Change 382322 merged by Filippo Giunchedi:
[operations/puppet@production] Regenerate FancyCaptchas weekly rather than monthly

https://gerrit.wikimedia.org/r/382322

Reedy closed this task as Resolved. Oct 26 2017, 6:54 PM
Reedy claimed this task.

I'm closing this. There may be further improvements down the line, but 15 minutes to generate 10,000 captchas end to end doesn't seem bad to me.