Page MenuHomePhabricator

scap sync failed on i18n
Closed, ResolvedPublic

Description

My scap sync failed on i18n changes.

It does not seem related to any code issue (e.g. invalid JSON files). That should be caught by Jenkins, and I manually checked both the files changed and one of the languages that failed (hif-latn).

First error I see is:

"15:34:43 ['/usr/bin/scap', 'pull-master', 'tin.eqiad.wmnet'] on naos.codfw.wmnet returned [70]: Could not chdir to home directory /var/lib/mwdeploy: No such file or directory"

Then the CDB issues later.

1mattflaschen@tin:/srv/mediawiki-staging/php-1.30.0-wmf.16$ scap sync 'Prepare to enable RCFilters (WLFilters) on Watchlist'
2 ___ ____
3 ⎛ ⎛ ,----
4 \ //==--'
5 _//|,.·//==--' ____________________________
6 _OO≣=- ︶ ᴹw ⎞_§ ______ ___\ ___\ ,\__ \/ __ \
7 (∞)_, ) ( | ______/__ \/ /__ / /_/ / /_/ /
8 ¨--¨|| |- ( / ______\____/ \___/ \__^_/ .__/
9 ««_/ «_/ jgs/bd808 /_/
10
1115:27:31 Started scap: Prepare to enable RCFilters (WLFilters) on Watchlist
1215:27:54 Copying to tin.eqiad.wmnet from tin.eqiad.wmnet
1315:27:54 Started rsync common
1415:27:59 Finished rsync common (duration: 00m 05s)
1515:27:59 Started cache_git_info
1615:28:02 Finished cache_git_info (duration: 00m 03s)
1715:28:03 Started l10n-update
1815:28:03 Updating ExtensionMessages-1.30.0-wmf.16.php
1915:28:04 Updating LocalisationCache for 1.30.0-wmf.16 using 10 thread(s)
2015:33:01 Generating JSON versions and md5 files
2115:33:20 Finished l10n-update (duration: 05m 17s)
2215:33:20 Started sync-masters
2315:34:43 ['/usr/bin/scap', 'pull-master', 'tin.eqiad.wmnet'] on naos.codfw.wmnet returned [70]: Could not chdir to home directory /var/lib/mwdeploy: No such file or directory
2415:33:21 Copying to naos.codfw.wmnet from tin.eqiad.wmnet
2515:33:21 Started rsync master
26cannot delete non-empty directory: php-1.30.0-wmf.9/cache/l10n
27cannot delete non-empty directory: php-1.30.0-wmf.9/cache/l10n
28cannot delete non-empty directory: php-1.30.0-wmf.9/cache
29cannot delete non-empty directory: php-1.30.0-wmf.9/cache
30cannot delete non-empty directory: php-1.30.0-wmf.9
31cannot delete non-empty directory: php-1.30.0-wmf.7/cache/l10n
32cannot delete non-empty directory: php-1.30.0-wmf.7/cache/l10n
33cannot delete non-empty directory: php-1.30.0-wmf.7/cache
34cannot delete non-empty directory: php-1.30.0-wmf.7/cache
35cannot delete non-empty directory: php-1.30.0-wmf.7
36cannot delete non-empty directory: php-1.30.0-wmf.11/cache/l10n
37cannot delete non-empty directory: php-1.30.0-wmf.11/cache/l10n
38cannot delete non-empty directory: php-1.30.0-wmf.11/cache
39cannot delete non-empty directory: php-1.30.0-wmf.11/cache
40cannot delete non-empty directory: php-1.30.0-wmf.11
41cannot delete non-empty directory: php-1.30.0-wmf.10/cache/l10n
42cannot delete non-empty directory: php-1.30.0-wmf.10/cache/l10n
43cannot delete non-empty directory: php-1.30.0-wmf.10/cache
44cannot delete non-empty directory: php-1.30.0-wmf.10/cache
45cannot delete non-empty directory: php-1.30.0-wmf.10
46cannot delete non-empty directory: php-1.30.0-wmf.12/cache/l10n
47cannot delete non-empty directory: php-1.30.0-wmf.13/cache/l10n
48cannot delete non-empty directory: php-1.30.0-wmf.14/cache/l10n
4915:34:13 Finished rsync master (duration: 00m 51s)
5015:34:13 Started rebuild CDB staging files
5115:34:43 Failure processing (u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n', u'l10n_cache-hif-latn.cdb', True)
52Traceback (most recent call last):
53 File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 477, in update_l10n_cdb_wrapper
54 return update_l10n_cdb(*args)
55 File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 401, in context_wrapper
56 return func(*args, **kwargs)
57 File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 458, in update_l10n_cdb
58 raise IOError(errno.ENOENT, 'Failed to create CDB', tmp_cdb_path)
59IOError: [Errno 2] Failed to create CDB: u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n/l10n_cache-hif-latn.cdb.tmp'
6015:34:43 Failure processing (u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n', u'l10n_cache-mrj.cdb', True)
61Traceback (most recent call last):
62 File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 477, in update_l10n_cdb_wrapper
63 return update_l10n_cdb(*args)
64 File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 401, in context_wrapper
65 return func(*args, **kwargs)
66 File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 458, in update_l10n_cdb
67 raise IOError(errno.ENOENT, 'Failed to create CDB', tmp_cdb_path)
68IOError: [Errno 2] Failed to create CDB: u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n/l10n_cache-mrj.cdb.tmp'
6915:34:43 Failure processing (u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n', u'l10n_cache-din.cdb', True)
70Traceback (most recent call last):
71 File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 477, in update_l10n_cdb_wrapper
72 return update_l10n_cdb(*args)
73 File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 401, in context_wrapper
74 return func(*args, **kwargs)
75 File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 458, in update_l10n_cdb
76 raise IOError(errno.ENOENT, 'Failed to create CDB', tmp_cdb_path)
77IOError: [Errno 2] Failed to create CDB: u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n/l10n_cache-din.cdb.tmp'
7815:34:43 Unhandled error:
79Traceback (most recent call last):
80 File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 308, in run
81 exit_status = app.main(app.extra_arguments)
82 File "/usr/lib/python2.7/dist-packages/scap/main.py", line 416, in main
83 cache_dir, use_cores, True, self.arguments.mute)
84 File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 401, in context_wrapper
85 return func(*args, **kwargs)
86 File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 280, in merge_cdb_updates
87 itertools.repeat(trust_mtime))), 1):
88 File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next
89 raise value
90IOError: [Errno 2] Failed to create CDB: u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n/l10n_cache-mrj.cdb.tmp'
9115:34:43 cdb-rebuild failed: <IOError> [Errno 2] Failed to create CDB: u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n/l10n_cache-mrj.cdb.tmp'
9215:34:43 Finished rebuild CDB staging files (duration: 00m 29s)
9315:34:43 Unhandled error:
94Traceback (most recent call last):
95 File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 308, in run
96 exit_status = app.main(app.extra_arguments)
97 File "/usr/lib/python2.7/dist-packages/scap/main.py", line 548, in main
98 verbose=self.verbose
99 File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 401, in context_wrapper
100 return func(*args, **kwargs)
101 File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 322, in sync_master
102 '--staging', '--verbose'])
103 File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
104 raise CalledProcessError(retcode, cmd)
105CalledProcessError: Command '['sudo', '-u', 'l10nupdate', '-n', '--', '/usr/bin/scap', 'cdb-rebuild', '--no-progress', '--staging', '--verbose']' returned non-zero exit status 70
10615:34:43 pull-master failed: <CalledProcessError> Command '['sudo', '-u', 'l10nupdate', '-n', '--', '/usr/bin/scap', 'cdb-rebuild', '--no-progress', '--staging', '--verbose']' returned non-zero exit status 70
107
108sync-masters: 100% (ok: 0; fail: 1; left: 0)
10915:34:43 1 masters had sync errors
11015:34:43 Finished sync-masters (duration: 01m 22s)
11115:34:43 Started sync-pull-masters
112sync-pull-masters: 100% (ok: 1; fail: 0; left: 0)
11315:35:38 Finished sync-pull-masters (duration: 00m 54s)
11415:35:38 Started sync-check-canaries
115^C15:36:14 Finished sync-check-canaries (duration: 00m 36s)
11615:36:14 scap aborted: Prepare to enable RCFilters (WLFilters) on Watchlist (duration: 08m 42s)

Revisions and Commits

rMSCA Scap
Restricted Differential Revision

Event Timeline

I Ctrl-C'ed when I saw the problem, reverted the i18n changes, then did a sync-dir of everything (in progress). I think I reverted early enough above it did not start deploying to actual machines.

I Ctrl-C'ed when I saw the problem, reverted the i18n changes, then did a sync-dir of everything (in progress). I think I reverted early enough above it did not start deploying to actual machines.

Yeah, it didn't even start "Waiting for canary traffic", and without finishing that, it won't deploy.

I'm not 100% sure, but I don't think it deployed to the canaries before either, since there is no "check-canaries:" line.

Cleanup sync-dir:

1mattflaschen@tin:/srv/mediawiki-staging$ scap sync-dir php-1.30.0-wmf.16/ 'Prepare to enable RCFilters (WLFilters) on Watchlist, but without i18n changes (reverted)' ___ ____
2 ⎛ ⎛ ,----
3 \ //==--'
4 _//|,.·//==--' ____________________________
5 _OO≣=- ︶ ᴹw ⎞_§ ______ ___\ ___\ ,\__ \/ __ \
6 (∞)_, ) ( | ______/__ \/ /__ / /_/ / /_/ /
7 ¨--¨|| |- ( / ______\____/ \___/ \__^_/ .__/
8 ««_/ «_/ jgs/bd808 /_/
9
1016:28:03 Copying to tin.eqiad.wmnet from tin.eqiad.wmnet
1116:28:03 Started rsync common
1216:28:43 Finished rsync common (duration: 00m 39s)
1316:28:43 Started cache_git_info
1416:28:47 Finished cache_git_info (duration: 00m 03s)
1516:28:47 Started sync-masters
16sync-masters: 100% (ok: 1; fail: 0; left: 0)
1716:28:54 Finished sync-masters (duration: 00m 07s)
1816:28:54 Started sync-pull-masters
19sync-pull-masters: 100% (ok: 1; fail: 0; left: 0)
2016:28:57 Finished sync-pull-masters (duration: 00m 03s)
2116:28:57 Started sync-check-canaries
22check-canaries: 100% (ok: 11; fail: 0; left: 0)
2316:29:02 Finished sync-check-canaries (duration: 00m 04s)
2416:29:02 Waiting for canary traffic...
2516:29:22 Executing check 'Logstash Error rate for mw1279.eqiad.wmnet'
2616:29:22 Executing check 'Logstash Error rate for mw1276.eqiad.wmnet'
2716:29:22 Executing check 'Logstash Error rate for mw1261.eqiad.wmnet'
2816:29:22 Executing check 'Logstash Error rate for mw1264.eqiad.wmnet'
2916:29:22 Executing check 'Logstash Error rate for mwdebug1002.eqiad.wmnet'
3016:29:22 Executing check 'Logstash Error rate for mwdebug1001.eqiad.wmnet'
3116:29:22 Executing check 'Logstash Error rate for mw1263.eqiad.wmnet'
3216:29:22 Executing check 'Logstash Error rate for mw1262.eqiad.wmnet'
3316:29:22 Executing check 'Logstash Error rate for mw1278.eqiad.wmnet'
3416:29:22 Executing check 'Logstash Error rate for mw1277.eqiad.wmnet'
3516:29:22 Executing check 'Logstash Error rate for mw1265.eqiad.wmnet'
3616:29:22 Started sync-proxies
37sync-proxies: 100% (ok: 8; fail: 0; left: 0)
3816:30:32 Finished sync-proxies (duration: 01m 09s)
3916:30:32 Started sync-apaches
40sync-apaches: 100% (ok: 280; fail: 0; left: 0)
4116:33:54 Finished sync-apaches (duration: 03m 22s)
4216:33:54 Synchronized php-1.30.0-wmf.16/: Prepare to enable RCFilters (WLFilters) on Watchlist, but without i18n changes (reverted) (duration: 12m 41s)

thcipriani subscribed.

Hrm, so it's failing at:

to run sudo -u l10nupdate -n -- scap cdb-rebuild --no-progress --staging --verbose (from here: https://github.com/wikimedia/scap/blob/master/scap/tasks.py#L317-L322)

I think this is a failure in the cdb writer. I'm basing this on the fact that it seems to make it to the point that it has written the cdb file, but is failing at the verification step: https://github.com/wikimedia/scap/blob/master/scap/tasks.py#L444-L458

15:34:43 ['/usr/bin/scap', 'pull-master', 'tin.eqiad.wmnet'] on naos.codfw.wmnet returned [70]: Could not chdir to home directory /var/lib/mwdeploy: No such file or directory

We really should get around to fixing that....

So there a few items in the output that are red herrings:

​cannot delete non-empty directory: php-1.30.0-wmf.14/cache/l10n

This happens because there are .cdb files under php-1.30.0-wmf.14/cache/l10n and php-1.30.0-wmf.14/cache/l10n has been deleted from tin but not on naos when we try to rsync. It's a problem.

Could not chdir to home directory /var/lib/mwdeploy: No such file or directory

Also a problem, but did not cause this failure.

The weird and unexplainable problem is:

IOError: [Errno 2] Failed to create CDB: u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n/l10n_cache-din.cdb.tmp'
IOError: [Errno 2] Failed to create CDB: u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n/l10n_cache-mrj.cdb.tmp'
​IOError: [Errno 2] Failed to create CDB: u'/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n/l10n_cache-hif-latn.cdb.tmp'

not invalid json:

>>> with open('/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n/upstream/l10n_cache-hif-latn.cdb.json') as f:                                                          
...   data = json.load(f)                 
...                                       
>>> with open('/tmp/test.cdb', 'wb') as fp:                                          
...   writer = cdblib.Writer(fp)          
...   for k, v in data.items():           
...     writer.put(k.encode('utf-8'), v.encode('utf-8'))                             
...   writer.finalize()                   
...                                       
>>> os.path.isfile('/tmp/test.cdb')   
True

not disk space.

There aren't two l10nupdate users with different uids afaict:

[thcipriani@naos ~]$ stat /srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n                                                                                            
  File: ‘/srv/mediawiki-staging/php-1.30.0-wmf.16/cache/l10n’                                                                                                            
  Size: 20480           Blocks: 40         IO Block: 4096   directory                                                                                                    
Device: fd00h/64768d    Inode: 1573099     Links: 3                                                                                                                      
Access: (0755/drwxr-xr-x)  Uid: (10002/l10nupdate)   Gid: (10002/l10nupdate)                                                                                             
Access: 2017-09-05 17:39:31.125337137 +0000                                                                                                                              
Modify: 2017-09-05 17:39:06.029384708 +0000                                                                                                                              
Change: 2017-09-05 17:39:06.029384708 +0000                                                                                                                              
 Birth: -                                                                                                                                                                
[thcipriani@naos ~]$ getent passwd l10nupdate                                                                                                                            
l10nupdate:x:10002:10002::/home/l10nupdate:/bin/bash                                                                                                                     
[thcipriani@naos ~]$ cat /etc/passwd | grep -i l10nup                                                                                                                    
l10nupdate:x:10002:10002::/home/l10nupdate:/bin/bash

That's what I've got so far.

I think the problem was caused by (from SAL):

15:36 mattflaschen@tin: scap aborted: Prepare to enable RCFilters (WLFilters) on Watchlist (duration: 08m 42s)
15:35 marostegui@tin: Synchronized wmf-config/db-eqiad.php: Repool db1083 - T174509 (duration: 00m 47s)
15:27 mattflaschen@tin: Started scap: Prepare to enable RCFilters (WLFilters) on Watchlist

Which shouldn't have been possible.

Both a sync-file and a sync run tasks.update_l10n_cdb which builds a file at a tmp location and moves it to the final location after it verifies that it has been created. Both processes logged Started rebuild CDB staging files and logged starting at 2017-09-05T15:34:13 and 2017-09-05T15:34:43. I think that one process moved the tmp file before the other process finished with it, causing the error.

thcipriani added a revision: Restricted Differential Revision.Sep 5 2017, 7:59 PM

The core issue here is that scap locking failed and allowed two sync* processes to happen at the same time. This caused some weird behavior as there are some race conditions within the deployment since it is not meant to run concurrently with itself.

The bug I found with lockfiles should be fixed by D767. In the interim removing git_repo from /srv/mediawiki-staging/scap/scap.cfg should prevent the issue.

Mentioned in SAL (#wikimedia-operations) [2017-09-08T20:10:01Z] <demon@tin> Synchronized scap/scap.cfg: drop git_repo for now, T175041 (duration: 00m 46s)