LocalisationUpdate broken since 2014-12-16
Closed, Resolved (Public)

Description

The last run of LocalisationUpdate on the WMF cluster was on 2014-12-16, per https://wikitech.wikimedia.org/wiki/Server_Admin_Log.

Details

Related Gerrit Patches:
mediawiki/tools/scap (master): Delete scap lock file on unlock

Event Timeline

Raymond created this task. Jan 5 2015, 11:29 AM
Raymond set the priority of this task to Needs Triage.
Raymond updated the task description.
Raymond added a subscriber: Raymond.

Are you sure this is not intentional? Last MediaWiki software deployment was 2014-12-17, and that's on purpose: https://www.mediawiki.org/wiki/MediaWiki_1.25/Roadmap

LocalisationUpdate has run every night for some years, independent of any deployment cycle.

Aklapper triaged this task as High priority. Jan 5 2015, 4:36 PM

@Reedy / @mmodell help please.

I don't see a !log in the SAL re disabling it... so....

Reedy added a comment. Edited Jan 5 2015, 4:58 PM
reedy@tin:/var/log/l10nupdatelog$ ls -al l10nupdate.log-2015010*
-rw-rw-r-- 1 l10nupdate l10nupdate 10437 Jan  1 02:09 l10nupdate.log-20150101.gz
-rw-rw-r-- 1 l10nupdate l10nupdate  9405 Jan  2 02:09 l10nupdate.log-20150102.gz
-rw-rw-r-- 1 l10nupdate l10nupdate 13952 Jan  3 02:11 l10nupdate.log-20150103.gz
-rw-rw-r-- 1 l10nupdate l10nupdate 12782 Jan  4 02:10 l10nupdate.log-20150104.gz
-rw-rw-r-- 1 l10nupdate l10nupdate 12692 Jan  5 02:10 l10nupdate.log-20150105.gz

If we zcat this... At the end we see

02:10:15 Unhandled error:
Traceback (most recent call last):
  File "/srv/deployment/scap/scap/scap/cli.py", line 273, in run
    exit_status = app.main(extra_args)
  File "/srv/deployment/scap/scap/scap/main.py", line 318, in main
    super(SyncDir, self).main(*extra_args)
  File "/srv/deployment/scap/scap/scap/main.py", line 37, in main
    with utils.lock(self.config['lock_file']):
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/srv/deployment/scap/scap/scap/utils.py", line 249, in lock
    raise LockFailedError('Failed to lock %s: %s' % (filename, e))
LockFailedError: Failed to lock /var/lock/scap: [Errno 13] Permission denied: '/var/lock/scap'
02:10:15 sync-dir failed: <LockFailedError> Failed to lock /var/lock/scap: [Errno 13] Permission denied: '/var/lock/scap'
Failed to sync-dir 'php-1.25wmf12/cache/l10n'
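
For anyone repeating this check without shell tools, here is a minimal Python sketch of the same "zcat and look at the end" step. The helper name tail_gz is made up for illustration and is not part of scap or the l10nupdate scripts.

```python
import collections
import gzip


def tail_gz(path, n=20):
    """Return the last n lines of a gzip-compressed text log.

    Hypothetical helper, roughly equivalent to `zcat path | tail -n N`.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        # A bounded deque keeps only the final n lines while streaming.
        last = collections.deque(fh, maxlen=n)
    return [line.rstrip("\n") for line in last]


if __name__ == "__main__":
    # Path taken from the listing above; adjust as needed.
    for line in tail_gz("/var/log/l10nupdatelog/l10nupdate.log-20150105.gz"):
        print(line)
```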

I see

reedy@tin:~$ ls -al /var/lock/scap
-rw-rw-r-- 1 ori wikidev 0 Jan  5 17:43 /var/lock/scap

But not sure what it was before

Reedy added a comment. Jan 6 2015, 10:09 PM

Well, the l10nupdate user can't write to the scap lock file as it's not in the wikidev user group....

l10nupdate@tin:/var/log/l10nupdatelog$ groups
l10nupdate
l10nupdate@tin:/var/log/l10nupdatelog$
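
To spell out why the write fails, here is a rough sketch of the classic POSIX owner/group/other check applied to the mode and ownership shown in the ls output above. The function name and the numeric uid/gid values are invented for illustration; root, ACLs and capabilities are ignored.

```python
import stat


def can_write(mode, owner_uid, owner_gid, uid, gids):
    """Return True if a process with `uid` and group ids `gids` may write
    a file with the given permission bits and ownership.

    Simplified POSIX check: exactly one of the owner, group or other
    bits applies, in that order.
    """
    if uid == owner_uid:
        return bool(mode & stat.S_IWUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


# -rw-rw-r-- is 0o664; the file is owned by ori:wikidev.
# The numeric ids below are placeholders, not the real ones on tin.
ORI, L10NUPDATE = 1001, 1002
WIKIDEV, L10NUPDATE_GROUP = 500, 600

# l10nupdate is only in its own group, so the "other" bits apply: no write.
print(can_write(0o664, ORI, WIKIDEV, L10NUPDATE, {L10NUPDATE_GROUP}))
# If it were also in wikidev, the group write bit would apply.
print(can_write(0o664, ORI, WIKIDEV, L10NUPDATE, {L10NUPDATE_GROUP, WIKIDEV}))
```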

Operations help, please?

Reedy added a subscriber: bd808. Jan 8 2015, 5:58 PM

I think @bd808 suggested "we" just make scap delete the lock file at the end of an operation...

chasemp added a subscriber: chasemp. Jan 8 2015, 6:08 PM
In T85790#963669, @greg wrote:

Well, the l10nupdate user can't write to the scap lock file as it's not in the wikidev user group....

l10nupdate@tin:/var/log/l10nupdatelog$ groups
l10nupdate
l10nupdate@tin:/var/log/l10nupdatelog$

Operations help, please?

Wikidev is the default group for all users, so adding a system account to it seems weird to me. Why isn't this file written with permissions appropriate to how l10nupdate is meant to be managed?

greg added a comment. Jan 8 2015, 6:11 PM

I think @bd808 suggested "we" just make scap delete the lock file at the end of an operation...

This seems like the easier solution.
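
A minimal sketch of the "delete the lock file on unlock" idea, assuming an fcntl-based lock. This is not the actual scap code or the eventual patch: only the names lock and LockFailedError and the error message format come from the traceback above; the rest is an assumption.

```python
import contextlib
import fcntl
import os


class LockFailedError(Exception):
    """Raised when the lock file cannot be opened or locked."""


@contextlib.contextmanager
def lock(filename):
    """Hold an exclusive lock on `filename`, removing the file on release.

    Because the file is removed afterwards, the next user (whoever it is)
    creates a fresh file it owns instead of tripping over one left behind
    by a different user.
    """
    fd = None
    try:
        fd = os.open(filename, os.O_WRONLY | os.O_CREAT, 0o664)
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (IOError, OSError) as e:
        if fd is not None:
            os.close(fd)
        raise LockFailedError('Failed to lock %s: %s' % (filename, e))
    try:
        yield
    finally:
        # Unlink while still holding the lock, then close (which releases it).
        os.unlink(filename)
        os.close(fd)


if __name__ == "__main__":
    import tempfile
    path = os.path.join(tempfile.mkdtemp(), "scap.lock")
    with lock(path):
        print("locked:", os.path.exists(path))
    print("after unlock:", os.path.exists(path))
```

This sketch does not guard against another process opening the file just before it is unlinked and then locking the stale inode, and it still needs a lock directory that every user involved can write to.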

Reedy added a comment. Jan 8 2015, 7:00 PM

I've just had ops chown the lock file to l10nupdate so I can at least do a manual run. Whether the permissions will actually stick...

Change 183560 had a related patch set uploaded (by BryanDavis):
Delete scap lock file on unlock

https://gerrit.wikimedia.org/r/183560

Patch-For-Review

Change 183560 merged by jenkins-bot:
Delete scap lock file on unlock

https://gerrit.wikimedia.org/r/183560

Reedy added a comment. Jan 14 2015, 5:55 PM

So we're back to

02:16:13 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'php-1.25wmf13', '--include', 'php-1.25wmf13/cache', '--include', 'php-1.25wmf13/cache/l10n', '--include', 'php-1.25wmf13/cache/l10n/***', 'mw1070.eqiad.wmnet', 'mw1161.eqiad.wmnet', 'mw1201.eqiad.wmnet'] on mw1018 returned [255]: Permission denied (publickey).

@bd808 did we have a task for tracking this issue? I can't see it at first glance

T76061: "l10nupdate user can't access scap shared ssh key causing nightly l10nupdate sync process to fail", which could use a renaming to something like "l10nupdate can't access shared ssh-agent".

Where are we here? Is this still a problem, or has the underlying issue been sorted out?

Per https://wikitech.wikimedia.org/wiki/Server_Admin_Log it has been working again since January 9th.

The SAL entries are a lie unfortunately.

02:14:23 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n', '--include', 'php-1.25wmf16', '--include', 'php-1.25wmf16/cache', '--include', 'php-1.25wmf16/cache/l10n', '--include', 'php-1.25wmf16/cache/l10n/***'] on mw1070.eqiad.wmnet returned [255]: Permission denied (publickey).

sync-proxies:  16% (ok: 0; fail: 1; left: 5)

The l10nupdate user that runs the nightly script is still not able to use scap to push the changes it makes to the wikis.

chasemp removed a subscriber: chasemp. Mar 11 2015, 8:56 PM
Se4598 added a subscriber: Se4598. Mar 12 2015, 11:29 AM

Any update here? I suspect it is still not working.

I have seen patches in Gerrit trying to work out the permissions issues. Someone closer to those is needed to give a more detailed status update.

We had a successful manual run in the early morning (UTC) of 2015-03-12. If the normal cron job is successful on 2015-03-13 we may be able to declare victory. Attempts to fix the backing problem are being tracked by T76061.

greg awarded a token. Mar 12 2015, 3:31 PM
greg moved this task from In-progress to Done on the Deployments board. Mar 16 2015, 3:46 PM