scap sync failed on i18n
Open, HighPublic

Description

My scap sync failed on i18n changes.

It does not seem related to any code issue (e.g. invalid JSON files). That should be caught by Jenkins, and I manually checked both the files changed and one of the languages that failed (hif-latn).

First error I see is:

"15:34:43 ['/usr/bin/scap', 'pull-master', 'tin.eqiad.wmnet'] on naos.codfw.wmnet returned [70]: Could not chdir to home directory /var/lib/mwdeploy: No such file or directory"

Then the CDB issues later.

1​mattflaschen@tin:/srv/mediawiki-staging/php-1.30.0-wmf.16$ scap sync 'Prepare to enable RCFilters (WLFilters) on Watchlist'
2​ ___ ____
3​ ⎛ ⎛ ,----
4​ \ //==--'
5​ _//|,.·//==--' ____________________________
6​ _OO≣=- ︶ ᴹw ⎞_§ ______ ___\ ___\ ,\__ \/ __ \
7​ (∞)_, ) ( | ______/__ \/ /__ / /_/ / /_/ /
8​ ¨--¨|| |- ( / ______\____/ \___/ \__^_/ .__/
9​ ««_/ «_/ jgs/bd808 /_/
10
11​15:27:31 Started scap: Prepare to enable RCFilters (WLFilters) on Watchlist
12​15:27:54 Copying to tin.eqiad.wmnet from tin.eqiad.wmnet
13​15:27:54 Started rsync common
14​15:27:59 Finished rsync common (duration: 00m 05s)
15​15:27:59 Started cache_git_info
16​15:28:02 Finished cache_git_info (duration: 00m 03s)
17​15:28:03 Started l10n-update
18​15:28:03 Updating ExtensionMessages-1.30.0-wmf.16.php
19​15:28:04 Updating LocalisationCache for 1.30.0-wmf.16 using 10 thread(s)
20​15:33:01 Generating JSON versions and md5 files
21​15:33:20 Finished l10n-update (duration: 05m 17s)
22​15:33:20 Started sync-masters
23​15:34:43 ['/usr/bin/scap', 'pull-master', 'tin.eqiad.wmnet'] on naos.codfw.wmnet returned [70]: Could not chdir to home directory /var/lib/mwdeploy: No such file or directory
24​15:33:21 Copying to naos.codfw.wmnet from tin.eqiad.wmnet
25​15:33:21 Started rsync master
26​cannot delete non-empty directory: php-1.30.0-wmf.9/cache/l10n
27​cannot delete non-empty directory: php-1.30.0-wmf.9/cache/l10n
28​cannot delete non-empty directory: php-1.30.0-wmf.9/cache
29​cannot delete non-empty directory: php-1.30.0-wmf.9/cache
30​cannot delete non-empty directory: php-1.30.0-wmf.9
31​cannot delete non-empty directory: php-1.30.0-wmf.7/cache/l10n
32​cannot delete non-empty directory: php-1.30.0-wmf.7/cache/l10n
33​cannot delete non-empty directory: php-1.30.0-wmf.7/cache
34​cannot delete non-empty directory: php-1.30.0-wmf.7/cache
35​cannot delete non-empty directory: php-1.30.0-wmf.7
36​cannot delete non-empty directory: php-1.30.0-wmf.11/cache/l10n
37​cannot delete non-empty directory: php-1.30.0-wmf.11/cache/l10n
38​cannot delete non-empty directory: php-1.30.0-wmf.11/cache
39​cannot delete non-empty directory: php-1.30.0-wmf.11/cache
40​cannot delete non-empty directory: php-1.30.0-wmf.11
41​cannot delete non-empty directory: php-1.30.0-wmf.10/cache/l10n
42​cannot delete non-empty directory: php-1.30.0-wmf.10/cache/l10n
43​cannot delete non-empty directory: php-1.30.0-wmf.10/cache
44​cannot delete non-empty directory: php-1.30.0-wmf.10/cache
45​cannot delete non-empty directory: php-1.30.0-wmf.10
46​cannot delete non-empty directory: php-1.30.0-wmf.12/cache/l10n
47​cannot delete non-empty directory: php-1.30.0-wmf.13/cache/l10n
48​cannot delete non-empty directory: php-1.30.0-wmf.14/cache/l10n
49​15:34:13 Finished rsync master (duration: 00m 51s)
50​15:34:13 Started rebuild CDB staging files
51​15:34:43 Failure processing (u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n', u'l10n_cache-hif-latn.cdb', True)
52​Traceback (most recent call last):
53​ File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 477, in update_l10n_cdb_wrapper
54​ return update_l10n_cdb(*args)
55​ File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 401, in context_wrapper
56​ return func(*args, **kwargs)
57​ File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 458, in update_l10n_cdb
58​ raise IOError(errno.ENOENT, 'Failed to create CDB', tmp_cdb_path)
59​IOError: [Errno 2] Failed to create CDB: u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n/l10n_cache-hif-latn.cdb.tmp'
60​15:34:43 Failure processing (u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n', u'l10n_cache-mrj.cdb', True)
61​Traceback (most recent call last):
62​ File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 477, in update_l10n_cdb_wrapper
63​ return update_l10n_cdb(*args)
64​ File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 401, in context_wrapper
65​ return func(*args, **kwargs)
66​ File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 458, in update_l10n_cdb
67​ raise IOError(errno.ENOENT, 'Failed to create CDB', tmp_cdb_path)
68​IOError: [Errno 2] Failed to create CDB: u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n/l10n_cache-mrj.cdb.tmp'
69​15:34:43 Failure processing (u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n', u'l10n_cache-din.cdb', True)
70​Traceback (most recent call last):
71​ File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 477, in update_l10n_cdb_wrapper
72​ return update_l10n_cdb(*args)
73​ File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 401, in context_wrapper
74​ return func(*args, **kwargs)
75​ File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 458, in update_l10n_cdb
76​ raise IOError(errno.ENOENT, 'Failed to create CDB', tmp_cdb_path)
77​IOError: [Errno 2] Failed to create CDB: u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n/l10n_cache-din.cdb.tmp'
78​15:34:43 Unhandled error:
79​Traceback (most recent call last):
80​ File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 308, in run
81​ exit_status = app.main(app.extra_arguments)
82​ File "/usr/lib/python2.7/dist-packages/scap/main.py", line 416, in main
83​ cache_dir, use_cores, True, self.arguments.mute)
84​ File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 401, in context_wrapper
85​ return func(*args, **kwargs)
86​ File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 280, in merge_cdb_updates
87​ itertools.repeat(trust_mtime))), 1):
88​ File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next
89​ raise value
90​IOError: [Errno 2] Failed to create CDB: u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n/l10n_cache-mrj.cdb.tmp'
91​15:34:43 cdb-rebuild failed: <IOError> [Errno 2] Failed to create CDB: u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n/l10n_cache-mrj.cdb.tmp'
92​15:34:43 Finished rebuild CDB staging files (duration: 00m 29s)
93​15:34:43 Unhandled error:
94​Traceback (most recent call last):
95​ File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 308, in run
96​ exit_status = app.main(app.extra_arguments)
97​ File "/usr/lib/python2.7/dist-packages/scap/main.py", line 548, in main
98​ verbose=self.verbose
99​ File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 401, in context_wrapper
100​ return func(*args, **kwargs)
101​ File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 322, in sync_master
102​ '--staging', '--verbose'])
103​ File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
104​ raise CalledProcessError(retcode, cmd)
105​CalledProcessError: Command '['sudo', '-u', 'l10nupdate', '-n', '--', '/usr/bin/scap', 'cdb-rebuild', '--no-progress', '--staging', '--verbose']' returned non-zero exit status 70
106​15:34:43 pull-master failed: <CalledProcessError> Command '['sudo', '-u', 'l10nupdate', '-n', '--', '/usr/bin/scap', 'cdb-rebuild', '--no-progress', '--staging', '--verbose']' returned non-zero exit status 70
107
108​sync-masters: 100% (ok: 0; fail: 1; left: 0)
109​15:34:43 1 masters had sync errors
110​15:34:43 Finished sync-masters (duration: 01m 22s)
111​15:34:43 Started sync-pull-masters
112​sync-pull-masters: 100% (ok: 1; fail: 0; left: 0)
113​15:35:38 Finished sync-pull-masters (duration: 00m 54s)
114​15:35:38 Started sync-check-canaries
115​^C15:36:14 Finished sync-check-canaries (duration: 00m 36s)
116​15:36:14 scap aborted: Prepare to enable RCFilters (WLFilters) on Watchlist (duration: 08m 42s)

Details

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptTue, Sep 5, 4:25 PM
Mattflaschen-WMF triaged this task as High priority.Tue, Sep 5, 4:25 PM

I Ctrl-C'ed when I saw the problem, reverted the i18n changes, then did a sync-dir of everything (in progress). I think I reverted early enough above it did not start deploying to actual machines.

I Ctrl-C'ed when I saw the problem, reverted the i18n changes, then did a sync-dir of everything (in progress). I think I reverted early enough above it did not start deploying to actual machines.

Yeah, it didn't even start "Waiting for canary traffic", and without finishing that, it won't deploy.

I'm not 100% sure, but I don't think it deployed to the canaries before either, since there is no "check-canaries:" line.

Cleanup sync-dir:

1​mattflaschen@tin:/srv/mediawiki-staging$ scap sync-dir php-1.30.0-wmf.16/ 'Prepare to enable RCFilters (WLFilters) on Watchlist, but without i18n changes (reverted)' ___ ____
2​ ⎛ ⎛ ,----
3​ \ //==--'
4​ _//|,.·//==--' ____________________________
5​ _OO≣=- ︶ ᴹw ⎞_§ ______ ___\ ___\ ,\__ \/ __ \
6​ (∞)_, ) ( | ______/__ \/ /__ / /_/ / /_/ /
7​ ¨--¨|| |- ( / ______\____/ \___/ \__^_/ .__/
8​ ««_/ «_/ jgs/bd808 /_/
9
10​16:28:03 Copying to tin.eqiad.wmnet from tin.eqiad.wmnet
11​16:28:03 Started rsync common
12​16:28:43 Finished rsync common (duration: 00m 39s)
13​16:28:43 Started cache_git_info
14​16:28:47 Finished cache_git_info (duration: 00m 03s)
15​16:28:47 Started sync-masters
16​sync-masters: 100% (ok: 1; fail: 0; left: 0)
17​16:28:54 Finished sync-masters (duration: 00m 07s)
18​16:28:54 Started sync-pull-masters
19​sync-pull-masters: 100% (ok: 1; fail: 0; left: 0)
20​16:28:57 Finished sync-pull-masters (duration: 00m 03s)
21​16:28:57 Started sync-check-canaries
22​check-canaries: 100% (ok: 11; fail: 0; left: 0)
23​16:29:02 Finished sync-check-canaries (duration: 00m 04s)
24​16:29:02 Waiting for canary traffic...
25​16:29:22 Executing check 'Logstash Error rate for mw1279.eqiad.wmnet'
26​16:29:22 Executing check 'Logstash Error rate for mw1276.eqiad.wmnet'
27​16:29:22 Executing check 'Logstash Error rate for mw1261.eqiad.wmnet'
28​16:29:22 Executing check 'Logstash Error rate for mw1264.eqiad.wmnet'
29​16:29:22 Executing check 'Logstash Error rate for mwdebug1002.eqiad.wmnet'
30​16:29:22 Executing check 'Logstash Error rate for mwdebug1001.eqiad.wmnet'
31​16:29:22 Executing check 'Logstash Error rate for mw1263.eqiad.wmnet'
32​16:29:22 Executing check 'Logstash Error rate for mw1262.eqiad.wmnet'
33​16:29:22 Executing check 'Logstash Error rate for mw1278.eqiad.wmnet'
34​16:29:22 Executing check 'Logstash Error rate for mw1277.eqiad.wmnet'
35​16:29:22 Executing check 'Logstash Error rate for mw1265.eqiad.wmnet'
36​16:29:22 Started sync-proxies
37​sync-proxies: 100% (ok: 8; fail: 0; left: 0)
38​16:30:32 Finished sync-proxies (duration: 01m 09s)
39​16:30:32 Started sync-apaches
40​sync-apaches: 100% (ok: 280; fail: 0; left: 0)
41​16:33:54 Finished sync-apaches (duration: 03m 22s)
42​16:33:54 Synchronized php-1.30.0-wmf.16/: Prepare to enable RCFilters (WLFilters) on Watchlist, but without i18n changes (reverted) (duration: 12m 41s)

thcipriani moved this task from Needs triage to Debt on the Scap board.Tue, Sep 5, 4:47 PM
thcipriani added a subscriber: thcipriani.

Hrm, so it's failing at:

to run sudo -u l10nupdate -n -- scap cdb-rebuild --no-progress --staging --verbose (from here: https://github.com/wikimedia/scap/blob/master/scap/tasks.py#L317-L322)

I think this is a failure in the cdb writer. I'm basing this on the fact that it seems to make it to the point that it has written the cdb file, but is failing at the verification step: https://github.com/wikimedia/scap/blob/master/scap/tasks.py#L444-L458

demon added a subscriber: demon.Tue, Sep 5, 5:28 PM
15:34:43 ['/usr/bin/scap', 'pull-master', 'tin.eqiad.wmnet'] on naos.codfw.wmnet returned [70]: Could not chdir to home directory /var/lib/mwdeploy: No such file or directory

We really should get around to fixing that....

So there a few items in the output that are red herrings:

​cannot delete non-empty directory: php-1.30.0-wmf.14/cache/l10n

This happens because there are .cdb files under php-1.30.0-wmf.14/cache/l10n and php-1.30.0-wmf.14/cache/l10n has been deleted from tin but not on naos when we try to rsync. It's a problem.

Could not chdir to home directory /var/lib/mwdeploy: No such file or directory

Also a problem, but did not cause this failure.

The weird and unexplainable problem is:

IOError: [Errno 2] Failed to create CDB: u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n/l10n_cache-din.cdb.tmp'
IOError: [Errno 2] Failed to create CDB: u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n/l10n_cache-mrj.cdb.tmp'
​IOError: [Errno 2] Failed to create CDB: u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n/l10n_cache-hif-latn.cdb.tmp'

not invalid json:

>>> with open('/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n/upstream/l10n_cache-hif-latn.cdb.json') as f:                                                          
...   data = json.load(f)                 
...                                       
>>> with open('/tmp/test.cdb', 'wb') as fp:                                          
...   writer = cdblib.Writer(fp)          
...   for k, v in data.items():           
...     writer.put(k.encode('utf-8'), v.encode('utf-8'))                             
...   writer.finalize()                   
...                                       
>>> os.path.isfile('/tmp/test.cdb')   
True

not disk space.

There aren't two l10nupdate users with different uids afaict:

[thcipriani@naos ~]$ stat /srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n                                                                                            
  File: ‘/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n’                                                                                                            
  Size: 20480           Blocks: 40         IO Block: 4096   directory                                                                                                    
Device: fd00h/64768d    Inode: 1573099     Links: 3                                                                                                                      
Access: (0755/drwxr-xr-x)  Uid: (10002/l10nupdate)   Gid: (10002/l10nupdate)                                                                                             
Access: 2017-09-05 17:39:31.125337137 +0000                                                                                                                              
Modify: 2017-09-05 17:39:06.029384708 +0000                                                                                                                              
Change: 2017-09-05 17:39:06.029384708 +0000                                                                                                                              
 Birth: -                                                                                                                                                                
[thcipriani@naos ~]$ getent passwd l10nupdate                                                                                                                            
l10nupdate:x:10002:10002::/home/l10nupdate:/bin/bash                                                                                                                     
[thcipriani@naos ~]$ cat /etc/passwd | grep -i l10nup                                                                                                                    
l10nupdate:x:10002:10002::/home/l10nupdate:/bin/bash

That's what I've got so far.

I think the problem was caused by (from SAL):

15:36 mattflaschen@tin: scap aborted: Prepare to enable RCFilters (WLFilters) on Watchlist (duration: 08m 42s)
15:35 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Repool db1083 - T174509 (duration: 00m 47s)
15:27 mattflaschen@tin: Started scap: Prepare to enable RCFilters (WLFilters) on Watchlist

Which shouldn't have been possible.

Both a sync-file and a sync run tasks.update_l10n_cdb which builds a file at a tmp location and moves it to the final location after it verifies that it has been created. Both processes logged Started rebuild CDB staging files and logged starting at 2017-09-05T15:34:13 and 2017-09-05T15:34:43. I think that one process moved the tmp file before the other process finished with it, causing the error.

The core issue here is that scap locking failed and allowed two sync* processes to happen at the same time. This caused some weird behavior as there are some race conditions within the deployment since it is not meant to run concurrently with itself.

The bug I found with lockfiles should be fixed by D767. In the interim removing git_repo from /srv/mediawiki-staging/scap/scap.cfg should prevent the issue.

Mentioned in SAL (#wikimedia-operations) [2017-09-08T20:10:01Z] <demon@tin> Synchronized scap/scap.cfg: drop git_repo for now, T175041 (duration: 00m 46s)