
beta-scap-eqiad always rebuild l10n cache since March 17th causing build to take more than 10 minutes.
Closed, ResolvedPublic

Description

The beta cluster scap time trend shows that the runtime went from less than a minute to more than ten minutes.

Success	#45591	12 min	deployment-bastion.eqiad
Success	#45590	12 min	deployment-bastion.eqiad
Success	#45589	13 min	deployment-bastion.eqiad
Success	#45588	12 min	deployment-bastion.eqiad
Success	#45587	12 min	deployment-bastion.eqiad
Success	#45586	12 min	deployment-bastion.eqiad
Success	#45585	14 min	deployment-bastion.eqiad
Failed	#45584	0.32 sec	deployment-bastion.eqiad
Failed	#45583	0.86 sec	deployment-bastion.eqiad
Success	#45582	35 sec	deployment-bastion.eqiad
Success	#45581	44 sec	deployment-bastion.eqiad
Success	#45580	42 sec	deployment-bastion.eqiad
Success	#45579	40 sec	deployment-bastion.eqiad
Success	#45578	46 sec	deployment-bastion.eqiad

The two failed builds tried to unlink() /var/lock/scap but failed with a permission error:

File "/mnt/srv/deployment/scap/scap/scap/utils.py", line 256, in lock
    os.unlink(filename)
Operation not permitted: '/var/lock/scap'

The next build, #45585, ran after those two failures and took 14 minutes. It occurred on March 17th at 16:54:06 UTC. The console log with elapsed time (hh:mm:ss.micro) shows:

00:01:44.958 16:55:51 Updating LocalisationCache for master using 2 thread(s)
00:13:55.018 17:08:01 Generating JSON versions and md5 files

All subsequent builds rebuild the LocalisationCache as well. So something is broken and causes scap / the l10n updater to always consider the cache outdated, thus rebuilding it every time :(
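
The elapsed-time column of the two log lines quoted above gives the duration of the LocalisationCache step. A minimal sketch of that arithmetic in Python, assuming the hh:mm:ss.micro format stated above (the helper name `parse_elapsed` is made up for this illustration):

```python
from datetime import timedelta


def parse_elapsed(stamp: str) -> timedelta:
    """Parse an elapsed-time stamp of the form hh:mm:ss.micro."""
    hours, minutes, seconds = stamp.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


# Elapsed-time stamps copied from the console log of build #45585.
start = parse_elapsed("00:01:44.958")  # "Updating LocalisationCache ..."
end = parse_elapsed("00:13:55.018")    # "Generating JSON versions and md5 files"

l10n_duration = end - start
print(l10n_duration)  # a little over 12 minutes
```

In other words, the l10n rebuild alone accounts for roughly 12 minutes 10 seconds of the 14 minute build.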

It seems some operation or change made around that time is the root cause of the slowness.

Event Timeline

hashar raised the priority of this task from to High.
hashar updated the task description.
hashar subscribed.
hashar renamed this task from "beta-scap-eqiad runtime went from less than a minute to more than 10 minutes" to "beta-scap-eqiad always rebuild l10n cache since March 17th causing build to take more than 10 minutes.".
Mar 24 2015, 11:23 AM
hashar set Security to None.

Possibly related: T92823 (The fix for that went in around that time)

But I don't see anything in that change that _should_ cause this problem.

Adding Bryan because he seems to be the only one who really understands all the moving pieces of this thing.

(Sorry Bryan!)

I think this was an unintended side effect of https://gerrit.wikimedia.org/r/#/c/197262/. Scap is now building the l10n cdb files as the www-data user in a temporary location. A quick look at the code for rebuildLocalisationCache.php and LocalisationCacheBulkLoad makes me think that we need to seed this temporary directory with the current l10n CDB files to reduce the number of keys that are added to the existing CDBs.
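
As a rough illustration of the seeding idea (not the actual patch; the function name, the `*.cdb` glob pattern and both directory arguments are assumptions made for this sketch), copying the current CDB files into the temporary build directory before the rebuild could look like this:

```python
import glob
import os
import shutil


def seed_l10n_tmp_dir(cache_dir: str, tmp_dir: str) -> list:
    """Copy existing CDB files from cache_dir into tmp_dir.

    Hypothetical helper: the idea is that the rebuild then starts from the
    current cache contents instead of an empty directory.
    """
    os.makedirs(tmp_dir, exist_ok=True)
    copied = []
    for src in sorted(glob.glob(os.path.join(cache_dir, "*.cdb"))):
        dst = os.path.join(tmp_dir, os.path.basename(src))
        # copy2 also preserves modification times, which matters if the
        # rebuild compares timestamps (an assumption, not verified here).
        shutil.copy2(src, dst)
        copied.append(dst)
    return copied
```

The temporary directory would still need to be writable by the user running the rebuild (www-data in this case), which the sketch does not handle.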

Change 199318 had a related patch set uploaded (by BryanDavis):
Copy l10n CDB files to rebuildLocalisationCache.php tmp dir

https://gerrit.wikimedia.org/r/199318

Earlier this morning I found a potential culprit with https://gerrit.wikimedia.org/r/#/c/197355/

commit b6b07421cc888fddff91bca62bb5dd064e159bbe
Author: YuviPanda <yuvipanda@gmail.com>
Date:   Tue Mar 17 22:37:23 2015 +0530

    scap: Clone mediawiki-config on all scap masters
    
    - Get rid of scap role, wasn't giving us much.
    - Also make group ownership in l10nupdate configurable
    - Include l10nupdate on all scap masters
    
    Bug: T88442
    Change-Id: I34112d01af093cf13f31c7f32d21925aa600f9dc

It changes l10nupdate files to be owned by group 'project-deployment-prep' instead of 'wikidev'.

I have no idea why that was changed, nor whether ALL files have been adjusted manually. There are probably some leftovers.
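
One way to check for leftover files would be to walk the l10n directory and list everything whose group differs from the expected one. A minimal sketch, assuming a hypothetical helper name and taking the expected group as a numeric gid:

```python
import os


def find_files_with_other_group(root: str, expected_gid: int) -> list:
    """Return paths under root whose group id differs from expected_gid."""
    mismatched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # lstat so that symlinks are checked themselves, not their targets.
            if os.lstat(path).st_gid != expected_gid:
                mismatched.append(path)
    return sorted(mismatched)
```

On a Unix host the gid for a group name can be looked up with `grp.getgrnam("project-deployment-prep").gr_gid` and passed as `expected_gid`; the directory to scan depends on the local setup and is not specified here.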

Change 199318 merged by jenkins-bot:
Copy l10n CDB files to rebuildLocalisationCache.php tmp dir

https://gerrit.wikimedia.org/r/199318

bd808 claimed this task.

After applying https://gerrit.wikimedia.org/r/199318 on deployment-prep the average sync time went back down to under 2 minutes. Change merged and pushed to both beta and prod.

Thank you for the fix! I know how complicated our l10n cache sync is, so kudos!