The goal of this task is to retrieve 50 sample Content Translation (CX) publications, together with their associated initial machine translation (MT) outputs and metadata, for each of three target wikis: Albanian (sq), Indonesian (id), and Standard Written Chinese (zh).
The data needed for each of these 50 items is listed below. For the nature of the article sample, please see "Specification of Articles" below. There are also more details on data published about translations by Content Translation.
For each of the articles, the following data is needed:
- CX-published article - CX-published version of the article (at the time of initial publication, excluding any later edits to the article), along with any metadata such as date stamps, editor information, etc. Ideally, the metadata should be presented alongside the article text; at a minimum, it should be linkable via a unique identifier.
- Initial unedited machine translation output for each CX publication - For each of these CX publications, the corresponding initial unedited MT output is needed, along with any information needed to match it with its CX publication.
- Corresponding CX quality algorithm-assigned scores - CX algorithm-assigned score(s), if available, for each of the CX-published articles, presented alongside the articles or in a way that allows them to be linked. (A nice-to-have would be any information available about whether or not alerts were displayed based on the algorithm-assigned scores.)
- Historical snapshot of source article at time of MT output generation (nice-to-have) - For each of the CX-published articles, we'd like to have a version of the source (English) article at the time at which the MT was generated for editing. Again, these should be presented alongside the corresponding CX-published articles, or easily linkable through a unique identifier.
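As an illustration of how the four items above could hang together under one unique identifier, here is a minimal sketch of a per-article record. All field names (translation_id, alert_shown, and so on) are hypothetical and chosen only for this example; they are not names used by Content Translation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CXSample:
    """One sampled CX publication. Every field name here is hypothetical."""
    translation_id: str        # unique identifier linking all parts of the item
    target_wiki: str           # "sq", "id", or "zh"
    published_text: str        # article text at the time of initial CX publication
    published_timestamp: str   # date stamp of the initial publication
    editor: str                # account that published the translation
    mt_output: str             # initial unedited MT output
    quality_scores: dict = field(default_factory=dict)  # algorithm-assigned score(s), if available
    alert_shown: Optional[bool] = None      # nice-to-have; None when unknown
    source_snapshot: Optional[str] = None   # nice-to-have: English source at MT generation time


def missing_required(sample: CXSample) -> list:
    """Return the names of required fields that are empty."""
    required = ["translation_id", "target_wiki", "published_text",
                "published_timestamp", "editor", "mt_output"]
    return [name for name in required if not getattr(sample, name)]
```

The nice-to-have items default to None so that a record is still valid when they cannot be retrieved.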
FORMAT OF DATA
For each of the 3 languages, the data will include 50 CX publications, each with its corresponding MT output, CX quality score, and historical snapshot of the source article. To support the linguistic analysis that will follow, we need a way to store the data in which the CX publications and MT outputs are presented side-by-side, ideally in a spreadsheet. A spreadsheet should also make it easier to present any additional metadata for each item in the same row. Having articles broken down such that the MT outputs and the CX-published articles are presented paragraph-by-paragraph would be further advantageous.
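A minimal sketch of the side-by-side, paragraph-by-paragraph layout follows. It assumes paragraphs are separated by blank lines and that MT and published paragraphs can be paired by position; both are simplifying assumptions, and the column names are hypothetical. The output is CSV, which any spreadsheet tool can open.

```python
import csv
import io
from itertools import zip_longest

COLUMNS = ["translation_id", "paragraph", "mt_output", "cx_published"]


def split_paragraphs(text):
    """Split text on blank lines, dropping empty paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def paragraph_rows(translation_id, mt_text, published_text):
    """Pair MT and published paragraphs by position, one row per paragraph."""
    pairs = zip_longest(split_paragraphs(mt_text),
                        split_paragraphs(published_text),
                        fillvalue="")  # pad the shorter side with empty cells
    return [
        {"translation_id": translation_id, "paragraph": i,
         "mt_output": mt, "cx_published": pub}
        for i, (mt, pub) in enumerate(pairs, start=1)
    ]


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Positional pairing breaks down when an editor adds, removes, or reorders paragraphs, so any alignment information available with the retrieved data would be preferable where it exists.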
SPECIFICATION OF ARTICLES
This section describes the sampling method for how to retrieve articles such that we obtain a sample that is representative enough for the type of analysis and generalizations we're interested in.
- Source language - Only articles with English as the source language should be included. English is the most frequent source language (with rates as high as 80-90%+).
- Translator diversity and experience - For each of the wikis, to establish a minimal amount of individual translator variation (i.e., we don't want to inadvertently retrieve translations from a single editor), the 50 articles should represent the work of 10 or more individual editors, with no individual editor contributing more than 5. In addition, 50% of the articles should have been published by a ‘newer’ editor, defined here as an account created no more than 2 years prior. The other half of the articles should have been published by editors with CX publications beginning at least 3 years prior.
- Machine translation engine - We assume that Google Translate (GT) may be the only service available across Albanian, Indonesian, and Chinese, and it is one of the services most commonly used by CX users overall, across all languages. All articles should therefore have been produced exclusively (across all sections/paragraphs) from initial MT outputs provided by GT.
- Topic-Category - All articles should belong to the 'nature/natural phenomena' or 'biography' category.
- Article length - All articles (CX-published versions) should contain at least 7 paragraphs, but if this is overly restrictive, a minimum of 5 is acceptable. These paragraphs may be contained in a single article section or spread across multiple sections of an article (i.e., there is no 'number of sections' specification).
- Percent modified - The CX quality algorithm calculates "percentage the MT is modified". We aim to define three categories for the overall 50 articles to fall into. These categories are (1) less than 10% modified, (2) between 11 and 50% modified, and (3) more than 51% modified.
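The editor-diversity and percent-modified criteria above can be checked mechanically once a candidate sample exists. The sketch below is illustrative only: the dictionary keys "editor" and "newer_editor" are hypothetical, and because the category definitions leave values between 10 and 11 and between 50 and 51 unassigned, the boundary handling (10 to 50 inclusive in category 2) is an assumption that should be confirmed.

```python
from collections import Counter


def modification_category(percent_modified):
    """Map a percent-modified value to category 1, 2, or 3.

    Assumption: values from 10 up to and including 50 fall in category 2;
    the written criteria do not assign the 10-11 and 50-51 ranges.
    """
    if percent_modified < 10:
        return 1
    if percent_modified <= 50:
        return 2
    return 3


def check_editor_constraints(samples, min_editors=10, max_per_editor=5):
    """Return a list of violated editor constraints for one wiki's sample.

    `samples` is a list of dicts with hypothetical keys "editor" (account
    name) and "newer_editor" (True for accounts in the 'newer' group).
    """
    problems = []
    per_editor = Counter(s["editor"] for s in samples)
    if len(per_editor) < min_editors:
        problems.append(f"only {len(per_editor)} distinct editors")
    over_cap = sorted(e for e, n in per_editor.items() if n > max_per_editor)
    if over_cap:
        problems.append(f"editors over the cap: {', '.join(over_cap)}")
    newer = sum(1 for s in samples if s["newer_editor"])
    if newer * 2 != len(samples):  # exactly half should be by newer editors
        problems.append(f"{newer} of {len(samples)} articles are by newer editors")
    return problems
```

An empty list from check_editor_constraints means the sample meets the editor-count, per-editor cap, and 50/50 experience split; the remaining criteria (source language, MT engine, topic, length) would need their own checks.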