Page MenuHomePhabricator

l10nupdate aborts due to l10n_cache-ti.cdb.json being truncated
Closed, ResolvedPublic

Description

sync-l10n aborted: (no justification provided) (duration: 01m 19s)

ValueError: Unterminated string starting at: line 23529 column 8 (char 2654110)
hashar@deploy1001$ tail -n1 /srv/mediawiki/php-1.33.0-wmf.1/cache/l10n/upstream/l10n_cache-ti.cdb.json
...\"mediastatistics-header-total\";i:4310;s:24:\"jsohashar@deploy1001$

The file is not complete :(

109:03:48 Failure processing (u'/srv/mediawiki/php-1.33.0-wmf.1/cache/l10n', u'l10n_cache-ti.cdb', True)
2Traceback (most recent call last):
3 File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 502, in update_l10n_cdb_wrapper
4 return update_l10n_cdb(*args)
5 File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 402, in context_wrapper
6 return func(*args, **kwargs)
7 File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 471, in update_l10n_cdb
8 data = json.load(f)
9 File "/usr/lib/python2.7/json/__init__.py", line 291, in load
10 **kw)
11 File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
12 return _default_decoder.decode(s)
13 File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
14 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
15 File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode
16 obj, end = self.scan_once(s, idx)
17ValueError: Unterminated string starting at: line 23529 column 8 (char 2654110)
1809:03:48 Unhandled error:
19Traceback (most recent call last):
20 File "/usr/lib/python2.7/dist-packages/scap/cli.py", line 342, in run
21 exit_status = app.main(app.extra_arguments)
22 File "/usr/lib/python2.7/dist-packages/scap/main.py", line 558, in main
23 cache_dir, use_cores, True, self.arguments.mute)
24 File "/usr/lib/python2.7/dist-packages/scap/utils.py", line 402, in context_wrapper
25 return func(*args, **kwargs)
26 File "/usr/lib/python2.7/dist-packages/scap/tasks.py", line 301, in merge_cdb_updates
27 for i, result in enumerate(l10n_update_pool, 1):
28 File "/usr/lib/python2.7/multiprocessing/pool.py", line 668, in next
29 raise value
30ValueError: Unterminated string starting at: line 23529 column 8 (char 2654110)
3109:03:48 cdb-rebuild failed: <ValueError> Unterminated string starting at: line 23529 column 8 (char 2654110)

Event Timeline

In logstash looking for l10n and host: deploy1001 and _type: scap:

The automatic update from Wednesday Oct 24th worked fine:

2018-10-24T01:05:03 Started l10n-update
2018-10-24T01:07:22 Updated 414 JSON file(s) in /srv/mediawiki-staging/php-1.32.0-wmf.26/cache/l10n
2018-10-24T01:09:41 Updated 414 JSON file(s) in /srv/mediawiki-staging/php-1.33.0-wmf.1/cache/l10n
2018-10-24T01:09:42 Finished l10n-update (duration: 04m 38s)

One has been triggered manually on Oct 25th:

2018-10-25T19:30:18 Started l10n-update
2018-10-25T19:33:16 Updated 414 JSON file(s) in /srv/mediawiki-staging/php-1.32.0-wmf.26/cache/l10n
2018-10-25T19:38:10 Finished l10n-update (duration: 07m 51s)
2018-10-25T19:38:10 Updated 414 JSON file(s) in /srv/mediawiki-staging/php-1.33.0-wmf.1/cache/l10n
2018-10-25T19:54:44 Updated 414 CDB files(s) in /srv/mediawiki/php-1.32.0-wmf.26/cache/l10n
2018-10-25T19:55:33 Updated 414 CDB files(s) in /srv/mediawiki/php-1.33.0-wmf.1/cache/l10n

That is part of a full sync:

2018-10-25 19:26 	<twentyafterfour@deploy1001> 	Started scap: full sync to be sure that 1.33.0-wmf.1 is fully deployed
2018-10-25 20:02 	<twentyafterfour@deploy1001> 	Finished scap: full sync to be sure that 1.33.0-wmf.1 is fully deployed (duration: 36m 57s)

There is no entry logs matching past that until @Addshore attempt this Monday:

2018-10-29T08:32:45 Started l10n-update
2018-10-29T08:35:34 Finished l10n-update (duration: 02m 49s)
...

It was also done earlier today but it was aborted as I told gilles that they only needed to sync the dir not do a full scap.

08:42 gilles@deploy1001: Synchronized wmf-config/InitialiseSettings.php: T208088 Enable performance perception survey shuffling (duration: 00m 47s)
08:38 gilles@deploy1001: Synchronized php-1.33.0-wmf.1/extensions/QuickSurveys: T208088 Add ability to shuffle answers display order (duration: 01m 51s)
08:35 gilles@deploy1001: sync aborted: T208088 Enable performance QuickSurvey shuffling (duration: 00m 00s)
08:35 gilles@deploy1001: Started scap: T208088 Enable performance QuickSurvey shuffling
08:35 gilles@deploy1001: sync aborted: T208088 Enable performance QuickSurvey shuffling (duration: 06m 30s)
08:29 gilles@deploy1001: Started scap: T208088 Enable performance QuickSurvey shuffling

I also updated the docs slightly to try to make this more obvious https://wikitech.wikimedia.org/w/index.php?title=How_to_deploy_code&diff=prev&oldid=1807154

/var/log/l10nupdate is empty. The automatic cronjob has been disabled on Sept 27th by 9482dfc3f5d4d805eb437465a615fb0e2d4c96b7

I have no idea how it is triggered nowadays :\

Mentioned in SAL (#wikimedia-operations) [2018-10-29T10:12:01Z] <hashar> Moving l10n_cache-ti.cdb files on deploy1001: sudo -u l10nupdate mv l10n_cache-ti.cdb.json l10n_cache-ti.cdb.json-back-T208196 && sudo -u l10nupdate mv l10n_cache-ti.cdb.MD5 l10n_cache-ti.cdb.MD5-back-T208196 # T208196

Mentioned in SAL (#wikimedia-operations) [2018-10-29T10:12:01Z] <hashar> Moving l10n_cache-ti.cdb files on deploy1001: sudo -u l10nupdate mv l10n_cache-ti.cdb.json l10n_cache-ti.cdb.json-back-T208196 && sudo -u l10nupdate mv l10n_cache-ti.cdb.MD5 l10n_cache-ti.cdb.MD5-back-T208196 # T208196

Mentioned in SAL (#wikimedia-operations) [2018-10-29T10:12:01Z] <hashar> Moving l10n_cache-ti.cdb files on deploy1001: sudo -u l10nupdate mv l10n_cache-ti.cdb.json l10n_cache-ti.cdb.json-back-T208196 && sudo -u l10nupdate mv l10n_cache-ti.cdb.MD5 l10n_cache-ti.cdb.MD5-back-T208196 # T208196

Mentioned in SAL (#wikimedia-operations) [2018-10-29T10:21:21Z] <hashar> Restore ti l10n files on deploy1001:/srv/mediawiki-staging/php-1.33.0-wmf.1/cache/l10n/upstream # T208196

/srv/mediawiki-staging/php-1.33.0-wmf.1/cache/l10n/upstream/l10n_cache-ti.cdb.json is corrupted. I wanted to back them up then run the l10n update script but I could not find out how to run it. I have restored the files to known (broken) previous state.

We need someone with knowledges of l10nupdate to step in and fix it :\

08:35 gilles@deploy1001: sync aborted: T208088 Enable performance QuickSurvey shuffling (duration: 06m 30s)

Looks like we must've aborted sync before json generation was complete.

Looking at scap logs for this time period (https://logstash.wikimedia.org/app/kibana#/dashboard/scap?_g=h@a69215a&_a=h@84b4780) you see pairs "Processing: [some cdb]; Updated: [corresponding json]". This is all happening async, so several are processing and updated at roughly the same time. https://logstash.wikimedia.org/goto/50250ee480e89c4c01d9dcc28e2841f6

It looks like it started "Processing" a lot, but only "Updated" a few.

After D983: Remove `scap l10n-update` we no longer have the code to just update the localisation cache on the master, but a full sync should get everything back in order.

After D983: Remove `scap l10n-update` we no longer have the code to just update the localisation cache on the master, but a full sync should get everything back in order.

I preformed a full sync in the hour after the issue, but still got a log message, I guess we can try again though!

After D983: Remove `scap l10n-update` we no longer have the code to just update the localisation cache on the master, but a full sync should get everything back in order.

I preformed a full sync in the hour after the issue, but still got a log message, I guess we can try again though!

I'll fiddle with this in a few minutes and post what I find.

Mentioned in SAL (#wikimedia-operations) [2018-10-29T18:31:38Z] <thcipriani> deploy1001:sudo -u l10nupdate scap cdb-refresh-json --directory /srv/mediawiki-staging/php-1.33.0-wmf.1/cache/l10n (ref: T208196)

Mentioned in SAL (#wikimedia-operations) [2018-10-29T18:33:07Z] <thcipriani> deploy1001:scap pull (ref: T208196)

thcipriani claimed this task.

So it looks like

After D983: Remove `scap l10n-update` we no longer have the code to just update the localisation cache on the master, but a full sync should get everything back in order.

I preformed a full sync in the hour after the issue, but still got a log message, I guess we can try again though!

Should be fixed. At least I can confirm that the l10n_cache-ti.cdb.json file is now valid.

The command I ran to regenerate json files from existing cdbs is sudo -u l10nupdate scap cdb-refresh-json --directory /srv/mediawiki-staging/php-1.33.0-wmf.1/cache/l10n

This should happen as part of update_localization_cache which happens after sync_common (which syncs deploy1001:/srv/mediawiki-staging -> deploy1001:/srv/mediawiki).

The corrupt json was not re-written because the cdb file had not been updated since the md5 file was written. I filled in more details on wikitech https://wikitech.wikimedia.org/wiki/Scap#scap_sync

If there are other problematic json files:

  1. rm the md5 file in /srv/mediawiki-staging/php-[version]/cache/l10n/upstream/[whatever].MD5
  2. run: sudo -u l10nupdate scap cdb-refresh-json --directory /srv/mediawiki-staging/php-[version]/cache/l10n
  3. Ensure that the json file and MD5 file are rewritten
  4. scap pull for good measure

...
After D983: Remove `scap l10n-update` we no longer have the code to just update the localisation cache on the master, but a full sync should get everything back in order.

Seems that is replaced by scap cdb-refresh-json so.

Should we have a follow up task to validate the json files against the MD5 sums?

Should we have a follow up task to validate the json files against the MD5 sums?

Those are md5 sums of the cdb files.

Before train I validated json files:

[thcipriani@deploy1001 ~]$ for file in /srv/mediawiki-staging/php-1.33.0-wmf.1/cache/l10n/upstream/*.json; do if [[ $(jq '.' < "$file" | wc -l) == 0 ]]; then echo "$file"; fi; done

All seems good.