CX Published parallel corpus is invalid json
Open, Medium, Public

Description

The file https://dumps.wikimedia.org/other/contenttranslation/20200214/cx-corpora._2_.html.json.gz is not valid JSON.
The error is at line 1043441:

cx-corpora._2_.html.json
1043439-        }
1043440-    },
1043441-    ,
1043442-    {
1043443:        "id": "421914/mwAqc",
1043444-        "sourceLanguage": "en",
1043445-        "targetLanguage": "si",
1043446-        "source": {

Note the extra comma.
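
For anyone checking other dump files, here is a minimal sketch of how the failing line can be located with Python's standard json module. The helper name find_json_error and the tiny sample string are made up for illustration. The sketch also loads the whole text into memory, which may be impractical for a dump of this size.

```python
import json


def find_json_error(text):
    """Return (lineno, colno, message) for the first syntax error, or None if the text parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno are 1-based positions computed by the json module
        return err.lineno, err.colno, err.msg
    return None


# Hypothetical miniature of the reported problem: a comma alone on the third line.
sample = '[\n    {"id": "a"},\n    ,\n    {"id": "b"}\n]'
print(find_json_error(sample))
```

The reported line number should match what a line-oriented tool such as grep -n or sed expects, since both count lines from 1.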

Event Timeline

Removed that line using sed -i '1043441d' cx-corpora._2_.html.json, and the JSON is now valid. So that comma is the only issue.
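
The sed command depends on knowing the exact line number. A rough Python equivalent that does not need the line number is sketched below. It assumes the stray comma always sits alone on its own line and that the previous non-blank line already ends with a comma, as in the excerpt in the description. The function name drop_stray_comma_lines is hypothetical.

```python
import json


def drop_stray_comma_lines(lines):
    """Yield lines, skipping comma-only lines that follow a line already ending in a comma."""
    prev_ends_with_comma = False
    for line in lines:
        stripped = line.strip()
        if stripped == "," and prev_ends_with_comma:
            continue  # redundant separator, like line 1043441 in the report
        if stripped:
            prev_ends_with_comma = stripped.endswith(",")
        yield line


# Hypothetical miniature of the broken dump, one list entry per line.
broken = ['[\n', '    {"id": "a"},\n', '    ,\n', '    {"id": "b"}\n', ']\n']
fixed = "".join(drop_stray_comma_lines(broken))
records = json.loads(fixed)  # raises json.JSONDecodeError if the text is still invalid
print(len(records))
```

The function accepts any iterable of lines, so it should also work on a file object from gzip.open(path, "rt", encoding="utf-8") without decompressing the dump to disk first.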

https://dumps.wikimedia.org/other/contenttranslation/20200214/cx-corpora.en2tr.html.json also has this problem, on line 423746:

p></p>"         }     },     ,     {         "id": "431229/c
          (right here) ------^
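
The excerpt above is whitespace-collapsed, so it does not show whether the stray comma is alone on its line. A layout-independent check is sketched below: it scans the text character by character, ignores anything inside string literals, and reports every comma that directly follows another comma. The function name find_duplicate_commas and the sample input are illustrative only.

```python
def find_duplicate_commas(text):
    """Return offsets of commas that follow another comma (whitespace ignored), outside strings."""
    offsets = []
    in_string = False
    escaped = False
    last_token_was_comma = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False      # character after a backslash is taken literally
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False    # closing quote
            continue
        if ch.isspace():
            continue                 # whitespace between tokens does not reset the state
        if ch == '"':
            in_string = True
            last_token_was_comma = False
        elif ch == ",":
            if last_token_was_comma:
                offsets.append(i)
            last_token_was_comma = True
        else:
            last_token_was_comma = False
    return offsets


# Hypothetical sample: the ", ," inside the string value must not be reported.
sample = '[{"a": "x, ,y"}, , {"b": 1}]'
print(find_duplicate_commas(sample))
```

Tracking string state matters here because the HTML content inside the dump can itself contain consecutive commas.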

@Nikerabbit In T217899: Duplicate commas in JSON Content Translation Dumps you fixed the issue of dangling commas. That was around April 2019. In this ticket, we see errors in dumps from Feb 2020. Is there any chance that the code without the fix was used to generate these dumps? Asking because I could not quickly find a reason for the dangling comma other than the one you already identified and fixed.

There haven't been any changes to the output since my last fix in 2019.