
Beta cluster deployment failed since October 11
Closed, Resolved · Public

Description

Since October 11 (the date shown on Special:Version on en.wp.beta.wmflabs), the beta cluster has not been fetching the latest master version of our git repositories.

The beta-code-update-eqiad job in Jenkins runs without any failures, but scap seems to have a problem with a missing dblist file:
https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/74056/console

I'm not sure whether this is the cause of the problem, but I think so:

06:04:21 06:04:21 Started scap: beta-scap-eqiad (build #74056)
06:04:22 06:04:22 Copying to deployment-bastion.deployment-prep.eqiad.wmflabs from deployment-bastion.eqiad.wmflabs
06:04:22 06:04:22 Started rsync common
06:04:29 06:04:29 Finished rsync common (duration: 00m 07s)
06:04:30 06:04:30 Last output:
06:04:30 06:04:30 Unhandled error:
06:04:30 Traceback (most recent call last):
06:04:30   File "/mnt/srv/deployment/scap/scap/scap/cli.py", line 284, in run
06:04:30     exit_status = app.main(extra_args)
06:04:30   File "/mnt/srv/deployment/scap/scap/scap/main.py", line 127, in main
06:04:30     tasks.compile_wikiversions('deploy', self.config)
06:04:30   File "/mnt/srv/deployment/scap/scap/scap/utils.py", line 338, in context_wrapper
06:04:30     return func(*args, **kwargs)
06:04:30   File "/mnt/srv/deployment/scap/scap/scap/tasks.py", line 139, in compile_wikiversions
06:04:30     all_dbs = set(line.strip() for line in open(all_dblist_file))
06:04:30 IOError: [Errno 2] No such file or directory: '/srv/mediawiki/all.dblist'
06:04:30 06:04:30 compile-wikiversions failed: <IOError> [Errno 2] No such file or directory: '/srv/mediawiki/all.dblist'
06:04:30 06:04:30 Unhandled error:
06:04:30 Traceback (most recent call last):
06:04:30   File "/mnt/srv/deployment/scap/scap/scap/cli.py", line 284, in run
06:04:30     exit_status = app.main(extra_args)
06:04:30   File "/mnt/srv/deployment/scap/scap/scap/main.py", line 216, in main
06:04:30     return super(Scap, self).main(*extra_args)
06:04:30   File "/mnt/srv/deployment/scap/scap/scap/main.py", line 52, in main
06:04:30     self._before_cluster_sync()
06:04:30   File "/mnt/srv/deployment/scap/scap/scap/main.py", line 231, in _before_cluster_sync
06:04:30     self.get_script_path('compile-wikiversions'))
06:04:30   File "/mnt/srv/deployment/scap/scap/scap/utils.py", line 338, in context_wrapper
06:04:30     return func(*args, **kwargs)
06:04:30   File "/mnt/srv/deployment/scap/scap/scap/utils.py", line 448, in sudo_check_call
06:04:30     raise subprocess.CalledProcessError(proc.returncode, cmd)
06:04:30 CalledProcessError: Command '/srv/deployment/scap/scap/bin/compile-wikiversions' returned non-zero exit status 70
06:04:30 06:04:30 scap failed: CalledProcessError Command '/srv/deployment/scap/scap/bin/compile-wikiversions' returned non-zero exit status 70 (duration: 00m 08s)
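The first traceback comes from opening `/srv/mediawiki/all.dblist` directly, so a moved file surfaces as a bare IOError. As a rough illustration only (not scap's actual code), a loader could try several candidate locations and report every path it checked. The function name `load_dblist` and the `dblists/` subdirectory are assumptions, the latter guessed from the title of the related patch; the sketch targets Python 3, whereas the log above shows Python 2.7.

```python
import os


def load_dblist(base_dir, name="all", subdirs=("dblists", "")):
    """Return the set of database names listed in <name>.dblist.

    Hypothetical helper: tries each candidate directory under base_dir
    in order and raises an error naming every path that was checked.
    """
    candidates = [
        os.path.join(base_dir, sub, name + ".dblist") for sub in subdirs
    ]
    for path in candidates:
        if os.path.isfile(path):
            with open(path) as handle:
                # Strip whitespace and drop blank lines so an empty
                # trailing line does not become an empty database name.
                return {line.strip() for line in handle if line.strip()}
    raise FileNotFoundError(
        "no dblist found; tried: " + ", ".join(candidates)
    )
```

Called as `load_dblist("/srv/mediawiki")`, this would look for `dblists/all.dblist` first and fall back to the legacy top-level `all.dblist`.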

By the way, Jenkins was also unable to send a notification email to the qa-alerts list. I'm not sure whether that is expected, but I think it should be investigated too:

06:04:30 Email was triggered for: Failure - Any
06:04:30 Sending email for trigger: Failure - Any
06:04:30 Sending email to: qa-alerts@lists.wikimedia.org
06:04:37 Connection error sending email, retrying once more in 10 seconds...
06:04:54 Connection error sending email, retrying once more in 10 seconds...
06:05:04 Failed after second try sending email

Details

Related Gerrit Patches:
mediawiki/tools/scap : master: update scap for dblists/* change

Event Timeline

Florian created this task. Oct 13 2015, 6:31 AM
Florian raised the priority of this task from to Unbreak Now!.
Florian updated the task description.
Florian added a subscriber: Florian.
Restricted Application added subscribers: Luke081515, Aklapper. Oct 13 2015, 6:31 AM
Legoktm set Security to None. Oct 13 2015, 7:08 AM
Legoktm added a subscriber: ori.
Legoktm added a subscriber: Legoktm.

Should be fixed by https://gerrit.wikimedia.org/r/245917 which is being deployed now.

greg assigned this task to ori. Oct 13 2015, 3:54 PM
greg added a subscriber: greg.

Change 245917 had a related patch set uploaded (by BryanDavis):
update scap for dblists/* change

https://gerrit.wikimedia.org/r/245917

Change 245917 merged by jenkins-bot:
update scap for dblists/* change

https://gerrit.wikimedia.org/r/245917

greg added a comment. Oct 13 2015, 8:36 PM

Still failing:

20:34:06 sync-common: 80% (ok: 4; fail: 0; left: 1)

20:34:55 20:34:55 ['/srv/deployment/scap/scap/bin/sync-common', '--no-update-l10n'] on deployment-mediawiki01.deployment-prep.eqiad.wmflabs returned [70]: Warning: Permanently added 'deployment-mediawiki01.deployment-prep.eqiad.wmflabs,10.68.17.170' (ECDSA) to the list of known hosts.
20:34:55 20:33:48 Copying to deployment-mediawiki01.deployment-prep.eqiad.wmflabs from deployment-bastion.eqiad.wmflabs
20:34:55 20:33:48 Started rsync common
20:34:55 rsync: write failed on "/srv/mediawiki/php-master/cache/l10n/upstream/l10n_cache-ab.cdb.json": No space left on device (28)
20:34:55 rsync error: error in file IO (code 11) at receiver.c(389) [receiver=3.1.0]
20:34:55 20:34:23 Finished rsync common (duration: 00m 34s)
20:34:55 20:34:23 Unhandled error:
20:34:55 Traceback (most recent call last):
20:34:55   File "/srv/deployment/scap/scap/scap/cli.py", line 284, in run
20:34:55     exit_status = app.main(extra_args)
20:34:55   File "/srv/deployment/scap/scap/scap/main.py", line 317, in main
20:34:55     verbose=self.verbose
20:34:55   File "/srv/deployment/scap/scap/scap/utils.py", line 338, in context_wrapper
20:34:55     return func(*args, **kwargs)
20:34:55   File "/srv/deployment/scap/scap/scap/tasks.py", line 300, in sync_common
20:34:55     subprocess.check_call(rsync)
20:34:55   File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
20:34:55     raise CalledProcessError(retcode, cmd)
20:34:55 CalledProcessError: Command '['sudo', '-u', 'mwdeploy', '-n', '--', '/usr/bin/rsync', '--archive', '--delete-delay', '--delay-updates', '--compress', '--delete', '--exclude=**/.svn/lock', '--exclude=**/.git/objects', '--exclude=**/.git/**/objects', '--exclude=**/cache/l10n/*.cdb', '--no-perms', 'deployment-bastion.eqiad.wmflabs::common', '/srv/mediawiki']' returned non-zero exit status 11
20:34:55 20:34:23 sync-common failed: <CalledProcessError> Command '['sudo', '-u', 'mwdeploy', '-n', '--', '/usr/bin/rsync', '--archive', '--delete-delay', '--delay-updates', '--compress', '--delete', '--exclude=**/.svn/lock', '--exclude=**/.git/objects', '--exclude=**/.git/**/objects', '--exclude=**/cache/l10n/*.cdb', '--no-perms', 'deployment-bastion.eqiad.wmflabs::common', '/srv/mediawiki']' returned non-zero exit status 11
20:34:55
20:34:55 sync-common: 100% (ok: 4; fail: 1; left: 0)
20:34:55 sync-common: 100% (ok: 4; fail: 1; left: 0)
20:34:55
20:34:55 20:34:55 1 apaches had sync errors

Florian added a comment. Edited Oct 13 2015, 8:39 PM

The scap update seems to have fixed the deploy issue: I can at least see a MobileFrontend change, merged yesterday, live on enwiki beta. Thanks for the quick fix! However, the versions on Special:Version still show October 11, and the scap job still fails :/
https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/74142/console

Should I open a new task for that? :)

Now it's failing because deployment-mediawiki01 is full.

(I deployed the fixed version of scap to some of the beta hosts earlier.)

Krenair closed this task as Resolved. Oct 13 2015, 9:45 PM

Cleaned some stuff off deployment-mediawiki01, the job is succeeding now.
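
The second failure was rsync hitting "No space left on device (28)" partway through a sync. A pre-flight check along the following lines could flag a full target before rsync starts. This is a minimal sketch under stated assumptions, not part of scap: the helper name `has_free_space` and the threshold in the usage note are hypothetical, and `shutil.disk_usage` requires Python 3.3 or later.

```python
import shutil


def has_free_space(path, required_bytes, usage=shutil.disk_usage):
    """Return True if the filesystem holding `path` has at least
    `required_bytes` free.

    `usage` is injectable so the check can be exercised without
    depending on the real disk state.
    """
    return usage(path).free >= required_bytes
```

For example, `has_free_space("/srv/mediawiki", 2 * 1024**3)` would ask for roughly 2 GiB of headroom on the sync target; the right threshold depends on the size of the l10n cache and is not something the log above tells us.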