
Provide parallel corpora dumps
Closed, Resolved · Public

Description

T119618: Create API for accessing parallel corpora by providing translation id added support for JSON-formatted corpora output for individual articles. We should also provide dumps for easier use.

Perhaps we can use https://dumps.wikimedia.org/?

Many tools support the TMX format, so that is a likely candidate, although JSON would be easier to implement. Not sure if we can have both.

Also see design document https://docs.google.com/document/d/11fFVBcu190u8J4uVJdyrghuHx5BdWrRHRZvd7fArwE0
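
For reference, here is a minimal sketch of fetching the existing per-article JSON output through the contenttranslationcorpora module from T119618. The translation id and the way the response is printed are illustrative assumptions, not the exact API contract.

import requests

# Sketch: fetch the JSON parallel corpora for a single published translation
# via the contenttranslationcorpora API added in T119618. The translation id
# below is a made-up example; real ids come from cxpublishedtranslations.
API = "https://ca.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "contenttranslationcorpora",
    "translationid": 1234,  # hypothetical id
    "format": "json",
}
data = requests.get(API, params=params, timeout=30).json()
print(data)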

Event Timeline

santhosh raised the priority of this task to Medium.
santhosh updated the task description. (Show Details)
santhosh added subscribers: santhosh, Nikerabbit.

TMX formatted dumps are likely to be more useful than exporting a single article in TMX format.

santhosh renamed this task from "Support TMX format in contenttranslationcorpora API" to "Provide TMX formatted parallel corpora". Dec 23 2015, 4:55 AM
santhosh set Security to None.

The source URL in the design doc uses spaces, not underscores. Is that a typo?

Not a typo. It reflects the current output. See https://ca.wikipedia.org/w/api.php?action=query&list=cxpublishedtranslations&limit=50&offset=40000

Would you like it to have underscores instead of spaces?
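
In case it helps, a minimal sketch of paging through that same cxpublishedtranslations listing with Python; the result/translations nesting in the response is an assumption based on the current output.

import requests

# Sketch: list published translations via the cxpublishedtranslations API
# linked above. The "result" -> "translations" nesting is an assumption
# about the current JSON shape.
API = "https://ca.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "cxpublishedtranslations",
    "limit": 50,
    "offset": 40000,
    "format": "json",
}
resp = requests.get(API, params=params, timeout=30).json()
for translation in resp.get("result", {}).get("translations", []):
    print(translation)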

Nikerabbit renamed this task from "Provide TMX formatted parallel corpora" to "Provide parallel corpora dumps". Jan 26 2016, 10:06 AM
Nikerabbit updated the task description. (Show Details)

Change 271217 had a related patch set uploaded (by Nikerabbit):
First steps towards dumps

https://gerrit.wikimedia.org/r/271217

Change 272461 had a related patch set uploaded (by Nikerabbit):
Add support for TMX level 1 (plaintext)

https://gerrit.wikimedia.org/r/272461
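
For context, TMX Level 1 carries plain text only inside the <seg> elements, which is what this change targets. A minimal sketch of writing such a file with Python; the header attributes and the example sentences are illustrative, not what the dump script actually emits.

import xml.etree.ElementTree as ET

# Sketch: build a minimal TMX 1.4 Level 1 (plain text) document with a single
# translation unit. The segments and header values are made-up examples.
tmx = ET.Element("tmx", version="1.4")
ET.SubElement(tmx, "header", {
    "creationtool": "cx-corpora-sketch",  # hypothetical tool name
    "creationtoolversion": "0.1",
    "segtype": "sentence",
    "o-tmf": "ContentTranslation",
    "adminlang": "en",
    "srclang": "es",
    "datatype": "plaintext",
})
body = ET.SubElement(tmx, "body")
tu = ET.SubElement(body, "tu")
for lang, text in [("es", "Hola mundo."), ("ca", "Hola món.")]:
    tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
    ET.SubElement(tuv, "seg").text = text

ET.ElementTree(tmx).write("es2ca.tmx", encoding="utf-8", xml_declaration=True)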

Can you give me an idea of size, number of files, number of runs we are talking about here? Dataset hosts sound like the logical place but let's make sure we have/will have the capacity.

Right now we are talking about the contents of about 60k published articles. At worst we have the HTML markup of those articles in three different versions (original, mt, post-edited). The median size of drafts (including unpublished) is 13k, so that would be around 60k*13k*3, about two gigabytes. If we also include a plaintext version and TMX, that would add some more, but way less than the HTML versions. There is also the overhead of the file formats, so I would say about 5 GB initially.

The number of files would at worst be the number of language pairs we have times the formats, i.e. around 300 * 300 * 3 (a quarter of a million). But the script has a split-at feature that groups less used languages into any2xx and any2any files (see the sketch after this comment). By tweaking the threshold we can keep the number of files relatively low.

As for how frequently these should be created, and whether we need to keep old versions around, we have not discussed that in the team. I would guess the dumps would be updated at least a couple of times per year, if not more.
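
To illustrate the split-at idea mentioned above, here is a rough sketch of grouping language pairs into per-pair, any2xx and any2any dump files under a per-pair threshold; the threshold value and the exact folding rule are assumptions, not the script's actual behaviour.

from collections import Counter

# Sketch of the split-at grouping: language pairs with fewer published
# translations than the threshold are folded into any2xx files, and targets
# whose any2xx bucket is still small end up in a single any2any file.
def group_dump_files(pair_counts, split_at=100):
    """pair_counts maps (source, target) -> number of published translations."""
    files = {}
    folded = Counter()
    for (src, dst), count in pair_counts.items():
        if count >= split_at:
            files[f"{src}2{dst}"] = [(src, dst)]
        else:
            folded[dst] += count
            files.setdefault(f"any2{dst}", []).append((src, dst))
    for dst, total in folded.items():
        if total < split_at:
            files.setdefault("any2any", []).extend(files.pop(f"any2{dst}"))
    return files

With a 300 × 300 pair matrix, raising or lowering split_at is what keeps the file count far below the quarter-million worst case.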

Actually, the *average* size of a draft is 45k, so the worst case is closer to 60k*45k*3, about 8 GB; with the plaintext and TMX versions plus format overhead, I think 10 GB is a better estimate.

I tested the script in a local setup and on Labs.

Change 271217 merged by jenkins-bot:
First steps towards dumps

https://gerrit.wikimedia.org/r/271217

We can host these without worrying about capacity for some time, then; 20 or 30 GB over a year is negligible for us.

Change 272461 merged by jenkins-bot:
Add support for TMX level 1 (plaintext)

https://gerrit.wikimedia.org/r/272461