
Provide parallel corpora dumps
Closed, Resolved · Public

Description

T119618: Create API for accessing parallel corpora by providing translation id added support for JSON-formatted corpora output for individual articles. We should also provide dumps for easier use.

Perhaps we can use https://dumps.wikimedia.org/?

Many tools support the TMX format, so that is a likely candidate, although JSON would be easier to implement. Not sure if we can have both.

Also see design document https://docs.google.com/document/d/11fFVBcu190u8J4uVJdyrghuHx5BdWrRHRZvd7fArwE0
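
For reference, here is a minimal sketch of fetching the existing per-article JSON output through the contenttranslationcorpora module from T119618. The translation id and the way the response is printed are illustrative assumptions, not the exact API contract.

import requests

# Sketch: fetch the JSON parallel corpora for a single published translation
# via the contenttranslationcorpora API added in T119618. The translation id
# below is a made-up example; real ids come from cxpublishedtranslations.
API = "https://ca.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "contenttranslationcorpora",
    "translationid": 1234,  # hypothetical id
    "format": "json",
}
data = requests.get(API, params=params, timeout=30).json()
print(data)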

Event Timeline

santhosh raised the priority of this task to Medium.
santhosh updated the task description. (Show Details)
santhosh added subscribers: santhosh, Nikerabbit.

TMX formatted dumps are likely to be more useful than exporting a single article in TMX format.

santhosh renamed this task from "Support TMX format in contenttranslationcorpora API" to "Provide TMX formatted parallel corpora". Dec 23 2015, 4:55 AM
santhosh set Security to None.

The source URL in the design doc uses spaces, not underscores. Is that a typo?

Not a typo. It reflects the current output. See https://ca.wikipedia.org/w/api.php?action=query&list=cxpublishedtranslations&limit=50&offset=40000

Would you like it to have underscores instead of spaces?
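
In case it helps, a minimal sketch of paging through that same cxpublishedtranslations listing with Python; the result/translations nesting in the response is an assumption based on the current output.

import requests

# Sketch: list published translations via the cxpublishedtranslations API
# linked above. The "result" -> "translations" nesting is an assumption
# about the current JSON shape.
API = "https://ca.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "cxpublishedtranslations",
    "limit": 50,
    "offset": 40000,
    "format": "json",
}
resp = requests.get(API, params=params, timeout=30).json()
for translation in resp.get("result", {}).get("translations", []):
    print(translation)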

Nikerabbit renamed this task from "Provide TMX formatted parallel corpora" to "Provide parallel corpora dumps". Jan 26 2016, 10:06 AM
Nikerabbit updated the task description. (Show Details)

Change 271217 had a related patch set uploaded (by Nikerabbit):
First steps towards dumps

https://gerrit.wikimedia.org/r/271217

Change 272461 had a related patch set uploaded (by Nikerabbit):
Add support for TMX level 1 (plaintext)

https://gerrit.wikimedia.org/r/272461
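
For context, TMX Level 1 carries plain text only inside the <seg> elements, which is what this change targets. A minimal sketch of writing such a file with Python; the header attributes and the example sentences are illustrative, not what the dump script actually emits.

import xml.etree.ElementTree as ET

# Sketch: build a minimal TMX 1.4 Level 1 (plain text) document with a single
# translation unit. The segments and header values are made-up examples.
tmx = ET.Element("tmx", version="1.4")
ET.SubElement(tmx, "header", {
    "creationtool": "cx-corpora-sketch",  # hypothetical tool name
    "creationtoolversion": "0.1",
    "segtype": "sentence",
    "o-tmf": "ContentTranslation",
    "adminlang": "en",
    "srclang": "es",
    "datatype": "plaintext",
})
body = ET.SubElement(tmx, "body")
tu = ET.SubElement(body, "tu")
for lang, text in [("es", "Hola mundo."), ("ca", "Hola món.")]:
    tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
    ET.SubElement(tuv, "seg").text = text

ET.ElementTree(tmx).write("es2ca.tmx", encoding="utf-8", xml_declaration=True)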

Can you give me an idea of size, number of files, number of runs we are talking about here? Dataset hosts sound like the logical place but let's make sure we have/will have the capacity.

Right now we are talking about the contents of about 60k published articles. At worst we have the HTML markup of those articles in three different versions (original, mt, post-edited). The median size of drafts (including unpublished) is 13k, so that would be around 60k*13k*3, about two gigabytes. If we also include a plaintext version and TMX, that would add some more, but way less than the HTML versions. There is also the overhead of the file formats, so I would say about 5 GB initially.

The number of files would at worst be the number of language pairs we have times the formats, i.e. around 300 * 300 * 3 (a quarter of a million). But the script has a split-at feature that groups less used languages into any2xx and any2any files (see the sketch after this comment). By tweaking the threshold we can keep the number of files relatively low.

As for how frequently these should be created, and whether we need to keep old versions around, we have not discussed that in the team. I would guess the dumps would be updated at least a couple of times per year, if not more.
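
To illustrate the split-at idea mentioned above, here is a rough sketch of grouping language pairs into per-pair, any2xx and any2any dump files under a per-pair threshold; the threshold value and the exact folding rule are assumptions, not the script's actual behaviour.

from collections import Counter

# Sketch of the split-at grouping: language pairs with fewer published
# translations than the threshold are folded into any2xx files, and targets
# whose any2xx bucket is still small end up in a single any2any file.
def group_dump_files(pair_counts, split_at=100):
    """pair_counts maps (source, target) -> number of published translations."""
    files = {}
    folded = Counter()
    for (src, dst), count in pair_counts.items():
        if count >= split_at:
            files[f"{src}2{dst}"] = [(src, dst)]
        else:
            folded[dst] += count
            files.setdefault(f"any2{dst}", []).append((src, dst))
    for dst, total in folded.items():
        if total < split_at:
            files.setdefault("any2any", []).extend(files.pop(f"any2{dst}"))
    return files

With a 300 × 300 pair matrix, raising or lowering split_at is what keeps the file count far below the quarter-million worst case.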

Actually, the *average* size of a draft is 45k, so the worst case is closer to 60k*45k*3, about 8 GB; with the plaintext and TMX versions plus format overhead, I think 10 GB is a better estimate.

I tested the script in a local setup and on Labs.

Change 271217 merged by jenkins-bot:
First steps towards dumps

https://gerrit.wikimedia.org/r/271217

We can host these without worrying about capacity for some time, then; 20 or 30 GB over a year is negligible for us.

Change 272461 merged by jenkins-bot:
Add support for TMX level 1 (plaintext)

https://gerrit.wikimedia.org/r/272461