During the 1.23wmf18 deploy, and again today, scap failed to update servers in row D of eqiad. Neither time did the scap UI report the failures. Running sync-common manually on one of the failing hosts shows that sync-common itself reports the error correctly:
bd808@mw1202:~$ sync-common mw1010.eqiad.wmnet mw1070.eqiad.wmnet
00:20:39 DEBUG - Copying to mw1202.eqiad.wmnet from mw1010.eqiad.wmnet
00:20:39 DEBUG - Started rsync common
@Error: access denied to common from mw1202.eqiad.wmnet (10.64.48.34)
rsync error: error starting client-server protocol (code 5) at main.c(1534) [Receiver=3.0.9]
00:20:39 INFO - Finished rsync common (duration: 00m 00s)
00:20:39 DEBUG - Unhandled error:
Traceback (most recent call last):
  File "/srv/scap/scap/cli.py", line 201, in run
    exit_status = app.main(extra_args)
  File "/srv/scap/scap/main.py", line 70, in main
    tasks.sync_common(self.config, self.arguments.servers)
  File "/srv/scap/scap/tasks.py", line 167, in sync_common
    subprocess.check_call(rsync)
  File "/usr/lib/python2.7/subprocess.py", line 511, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '('sudo', '-u', 'mwdeploy', '/usr/bin/rsync', '-a', '--delete-delay', '--delay-updates', '--compress', '--delete', '--exclude=/.svn/lock', '--exclude=/.git/objects', '--exclude=/.git//objects', '--exclude=**/cache/l10n/*.cdb', '--no-perms', 'mw1010.eqiad.wmnet::common', '/usr/local/apache/common-local')' returned non-zero exit status 5
00:20:39 ERROR - sync-common failed: <CalledProcessError> Command '('sudo', '-u', 'mwdeploy', '/usr/bin/rsync', '-a', '--delete-delay', '--delay-updates', '--compress', '--delete', '--exclude=/.svn/lock', '--exclude=/.git/objects', '--exclude=/.git//objects', '--exclude=**/cache/l10n/*.cdb', '--no-perms', 'mw1010.eqiad.wmnet::common', '/usr/local/apache/common-local')' returned non-zero exit status 5
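For illustration only, here is a rough sketch of the behavior I would expect from whatever drives sync-common across hosts: collect each host's exit status and surface the non-zero ones instead of dropping them. This is not scap's actual code. The names run_on_hosts, report and fake_command are hypothetical, and a local Python subprocess stands in for the real per-host command so the sketch runs anywhere.

```python
import subprocess
import sys


def run_on_hosts(hosts, make_command):
    """Run one command per host and return {host: exit_status} for failures.

    Hypothetical helper, not part of scap. make_command(host) must return
    an argv list for subprocess.
    """
    failures = {}
    for host in hosts:
        # subprocess.call returns the exit status rather than raising,
        # so every host is attempted and every failure is recorded.
        status = subprocess.call(make_command(host))
        if status != 0:
            failures[host] = status
    return failures


def report(failures):
    """Format failures as one line per host, sorted by host name."""
    return ["%s: exit status %d" % (host, status)
            for host, status in sorted(failures.items())]


def fake_command(host):
    """Stand-in command (assumption): exits 5 on host-b, 0 elsewhere.

    5 mirrors the rsync exit status seen in the log above.
    """
    code = 5 if host == "host-b" else 0
    return [sys.executable, "-c", "import sys; sys.exit(%d)" % code]


failures = run_on_hosts(["host-a", "host-b"], fake_command)
lines = report(failures)
for line in lines:
    print(line)
```

The point of the sketch is only that a non-zero status from any host ends up in the summary shown to the deployer, which is what did not happen in the two deploys described here.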
Version: wmf-deployment
Severity: critical
See Also:
https://rt.wikimedia.org/Ticket/Display.html?id=7080