
Retrieve sample of Content Translation publications and corresponding machine translation outputs and metadata
Closed, Resolved, Public

Description

The goal of this task is to retrieve 50 sample Content Translation (CX) publications and their associated initial machine translation outputs and metadata, for each of three wikis: sq, id, zh.

DATA NEEDED
50 articles each for the 3 target wikis: Albanian (sq), Indonesian (id), and Standard Written Chinese (zh).

Below is what data is needed for each of these 50 items. For the nature of the sample of articles, please see the "Specification of Articles" below. There are also more details on data published about translations by Content Translation.

For each of the articles, the following data is needed:

  1. CX-published article - CX-published version of the article (at the time of initial publication, excluding any later edits to the article), along with any meta data such as date stamps, editor information, etc. Ideally, the meta data should be presented alongside the article text; it should minimally be linkable via a unique identifier.
  2. Initial unedited machine translation output for each CX publication - For each of these CX publications, the corresponding initial unedited MT output is needed, along with any information to match them with their respective CX publications.
  3. Corresponding CX quality algorithm-assigned scores - CX algorithm-assigned score(s), if available, for each of the CX-published articles, presented alongside the articles or in a way that allows them to be linked. (A nice-to-have would be any information available about whether or not alerts were displayed based on algorithm-assigned scores.)
  4. Historical snapshot of source article at time of MT output generation (nice-to-have) - For each of the CX-published articles, we'd like to have a version of the source (English) article at the time at which the MT was generated for editing. Again, these should be presented alongside the corresponding CX-published articles, or easily linkable through a unique identifier.

FORMAT OF DATA
For each of the 3 languages, the data will include 50 CX publications, each alongside its corresponding MT output, CX quality score, and historical snapshot of the source article. To support the linguistic analysis that will follow, ideally we need a way to store the data in which the CX publications and MT outputs are presented side-by-side, ideally in a spreadsheet. A spreadsheet should also make it easier to present any additional meta data for each item in the same row. Having articles broken down such that MT outputs and the CX-published article are presented paragraph-by-paragraph would be further advantageous.

SPECIFICATION OF ARTICLES
This section describes the sampling method for how to retrieve articles such that we obtain a sample that is representative enough for the type of analysis and generalizations we're interested in.

  1. Source language - Only articles with English as a source language should be included. English is the most frequent source language (with rates as high as 80-90%+)
  2. Translator diversity and experience - For each of the wikis, to establish a minimal amount of individual translator variation (i.e., we don't want to inadvertently retrieve translations from a single editor), the 50 articles should represent work of 10 or more individual editors, with no individual editor contributing more than 5. In addition, 50% of the articles should have been published by a ‘newer’ editor, defined here as an account created no longer than 2 years prior. The other half of articles should have been published by editors with CX publications beginning at least 3 years prior.
  3. Machine translation engine - Assuming that Google Translate (GT) may be the only service available across Albanian, Indonesian, and Chinese, and it being one of the most common services used by CX users (overall, across all languages), all articles should have been produced exclusively (across all sections/paragraphs) using initial MT outputs provided by GT
  4. Topic-Category - All articles should belong to the 'nature/natural phenomena' or 'biography' category.
  5. Article length - All articles (CX-published versions) should contain a minimum of 7 paragraphs, but if this is overly restrictive, a minimum of 5 is acceptable. These paragraphs may be contained in a single article section or across multiple sections of an article (i.e., no 'number of sections' specification).
  6. Percent modified - The CX quality algorithm calculates "percentage the MT is modified". We aim to define three categories for the overall 50 articles to fall into. These categories are (1) less than 10% modified, (2) between 11 and 50% modified, and (3) more than 51% modified.
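
As a rough illustration of item 6, the three categories could be assigned with a small helper like the one below. This is only a sketch: the function name is hypothetical, and because the stated ranges leave small gaps (between 10 and 11, and between 50 and 51), it assumes cut points at 10 and 50, which the task does not specify.

```python
def percent_modified_category(pct: float) -> str:
    """Map a percent-modified value (0-100) to one of the three categories.

    Assumption (not from the task): values up to and including 10 fall in
    the first category, and values up to and including 50 in the second.
    """
    if not 0 <= pct <= 100:
        raise ValueError(f"percent modified out of range: {pct}")
    if pct <= 10:
        return "less than 10% modified"
    if pct <= 50:
        return "between 11 and 50% modified"
    return "more than 51% modified"
```

If the category definitions are revisited later, only the two cut points would need to change.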

Event Timeline


Per discussions with @Pginer-WMF, I'll investigate and identify any requirements or steps I think will be needed to obtain this sample data.

The CX database captures most of the required data, except the real-time MT modification information about sections as users edit. This information is used for edit-time warnings shown to translators per section, and it is not stored in the database. The overall MT percentage is what we have in the database, and that is what we use to decide whether the translation can be published.

The translation debugger tool provided at https://cxdebugger.toolforge.org/ gives translation metadata and a side-by-side comparison of the source article, the unmodified MT, and the human-edited translation. Information about MT engines and the progress calculated by CX is also provided.

Sorry for the delay! Update on my investigation into this:

  • As @santhosh commented above, we can obtain the majority of the data needed to pull the requested sample using a combination of the CX database and the translation debugger tool with the exception of edit time warnings issued to translators during in-progress translations. A few of the specifications such as user account creation (needed for the Translator diversity and experience specification) and article topic category will require joining the data to other sources, which makes querying more complicated but it is feasible.

Open Questions

@Easikingarmager - I had a few clarifying questions about the article specifications. See below:

Topic-Category - All articles should belong to the 'nature/natural phenomena' or 'biography' category.

  1. The best place to obtain articles by topic is the ORES API, whose taxonomy currently includes a biography subcategory but not a specific nature/natural phenomena category. Based on the current taxonomy, the Earth and Environment subcategory would probably be the closest. Does that work? Let me know if there are other or different categories you'd like to include.

Percent modified - The CX quality algorithm calculates "percentage the MT is modified". We aim to define three categories for the overall 50 articles to fall into. These categories are (1) less than 10% modified, (2) between 11 and 50% modified, and (3) more than 51% modified.

  1. For the above specification, will this categorization occur after the sample is obtained or do you want a certain split of articles across these three categories represented in the sample (e.g. 15% of the 50 articles are less than 10% modified)?

To support the linguistic analysis that will follow, ideally we need a way to store the data in which the CX publications and MT outputs are presented side-by-side, ideally in a spreadsheet.

  1. The translation debugger tool currently presents the source, MT output and CX publication side by side with a section-by-section split. One option would be for me to provide a spreadsheet that includes the translation id and requested meta data, such as timestamps and editor information, for each article in the pulled sample. You could then insert the translation id for each article into the debugger tool to obtain the side-by-side comparison. Would that work for the purposes of the linguistic analysis or would you prefer the side-by-side text to also be included in spreadsheet form for each of the articles?

SPECIFICATION OF ARTICLES

  1. Do we want any of this information retained as part of the meta data presented with each article in the sample? For example, the spreadsheet could include a column titled "editor experience" to indicate the level of experience of the editors publishing the CX publication and "topic/category" column.

Recommended Process

Based on the data and sources available, I'd recommend the following steps:

  1. Write and run a query to generate a random sample of translation_ids that meet the requested article specifications using the CX databases (with joins to other data sources as needed).
  2. Use the results of the query to populate a spreadsheet that contains the translation_ids and requested meta data for each article in the sample.
  3. Use the translation_ids to obtain side by side comparison of the source, MT output and CX publication for each article (Provided in either spreadsheet form or viewed within the debugger tool interface - See open question #3 above)
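
To make step 1 a bit more concrete, here is a minimal sketch of how a random sample could respect the translator diversity limits once the candidate translations have been queried. It is not the query that was actually run: the function name and the `(translation_id, editor_id)` input shape are hypothetical, and the defaults simply mirror the numbers in the task description.

```python
import random
from collections import Counter


def sample_with_editor_cap(candidates, sample_size=50, max_per_editor=5,
                           min_editors=10, seed=0):
    """Randomly pick translations while capping how many each editor contributes.

    candidates: iterable of (translation_id, editor_id) tuples (assumed shape).
    """
    rng = random.Random(seed)  # fixed seed so a sample pull can be reproduced
    shuffled = list(candidates)
    rng.shuffle(shuffled)

    per_editor = Counter()
    picked = []
    for translation_id, editor_id in shuffled:
        if per_editor[editor_id] >= max_per_editor:
            continue  # this editor already reached the cap
        picked.append((translation_id, editor_id))
        per_editor[editor_id] += 1
        if len(picked) == sample_size:
            break

    if len(picked) < sample_size or len(per_editor) < min_editors:
        raise ValueError("not enough eligible translations for these constraints")
    return picked
```

Raising an error when the constraints cannot be met makes it obvious when a specification has to be relaxed for a given wiki.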

Any thoughts/concerns with the above plan?

Thanks @MNeisler for the initial exploration. I'll let @Easikingarmager provide more details based on the research needs. Only to add a bit of context for one of the questions:

  1. The best place to obtain articles by topic is the ORES API, whose taxonomy currently includes a biography subcategory but not a specific nature/natural phenomena category. Based on the current taxonomy, the Earth and Environment subcategory would probably be the closest. Does that work? Let me know if there are other or different categories you'd like to include.

Since ORES topics can be used in search, this search query may be useful for Eli to check the kind of articles that may fall in the intersection of both topic areas (biography + Earth and Environment).

Hi @MNeisler, thanks for the review and questions. I'll do my best to answer in-line below:

Topic-Category - All articles should belong to the 'nature/natural phenomena' or 'biography' category.

  1. The best place to obtain articles by topic is the ORES API, whose taxonomy currently includes a biography subcategory but not a specific nature/natural phenomena category. Based on the current taxonomy, the Earth and Environment subcategory would probably be the closest. Does that work? Let me know if there are other or different categories you'd like to include.

I don't see any major concerns with using 'Earth and Environment'. Having looked through it a bit, as Pau suggested, I think if using the intersection of 'biography' + 'earth and environment' yields sufficient results to draw from, that would be a good approach as it's more narrow. Largely, what we're trying to do is narrow topic/genre to avoid any manipulation of it as an independent variable this time.

Percent modified - The CX quality algorithm calculates "percentage the MT is modified". We aim to define three categories for the overall 50 articles to fall into. These categories are (1) less than 10% modified, (2) between 11 and 50% modified, and (3) more than 51% modified.

  1. For the above specification, will this categorization occur after the sample is obtained or do you want a certain split of articles across these three categories represented in the sample (e.g. 15% of the 50 articles are less than 10% modified)?

The latter - 'we want a certain split of articles across these three categories...' Should this become difficult, we could reconsider our definitions of these categories as needed. They're meant to capture general categories, but the exact definition is flexible to a degree.

To support the linguistic analysis that will follow, ideally we need a way to store the data in which the CX publications and MT outputs are presented side-by-side, ideally in a spreadsheet.

  1. The translation debugger tool currently presents the source, MT output and CX publication side by side with a section-by-section split. One option would be for me to provide a spreadsheet that includes the translation id and requested meta data, such as timestamps and editor information, for each article in the pulled sample. You could then insert the translation id for each article into the debugger tool to obtain the side-by-side comparison. Would that work for the purposes of the linguistic analysis or would you prefer the side-by-side text to also be included in spreadsheet form for each of the articles?

Ideally I think we probably want the side-by-side text to be included in the spreadsheet form for each of the articles. That way either option is available since we'd still have the translation ids available.

SPECIFICATION OF ARTICLES

  1. Do we want any of this information retained as part of the meta data presented with each article in the sample? For example, the spreadsheet could include a column titled "editor experience" to indicate the level of experience of the editors publishing the CX publication and "topic/category" column.

That would be great. It would be very helpful to have each entry clearly and easily associated with any of this additional data - editor experience, a unique editor identifier, etc. Thanks for thinking of this.

Recommended Process

Based on the data and sources available, I'd recommend the following steps:

  1. Write and run a query to generate a random sample of translation_ids that meet the requested article specifications using the CX databases (with joins to other data sources as needed).
  2. Use the results of the query to populate a spreadsheet that contains the translation_ids and requested meta data for each article in the sample.
  3. Use the translation_ids to obtain side by side comparison of the source, MT output and CX publication for each article (Provided in either spreadsheet form or viewed within the debugger tool interface - See open question #3 above)

Any thoughts/concerns with the above plan?

This sounds good (response to #3 above). Let me know how it goes with step 1 and if you're seeing any modifications needed after running that query.
Thanks for your work on this!

Hi @Easikingarmager,

I've pulled an initial sample list of translation ids for the specified target wikis: idwiki, sqwiki, and zhwiki for review. I've compiled the translation ids and associated meta data for each translation in this google sheet. The google sheet includes a tab for each target wiki sample.

I'm looking into possible ways to provide the translations in a side-by-side format in a spreadsheet as well but before doing that wanted to check that the current sample pull meets your requirements. In the meantime, you can view a side by side of each sample by inputting the translation id into the translation debugger tool provided at https://cxdebugger.toolforge.org/

I ended up needing to make some adjustments to the specifications to meet the requested sample size. See summary below and let me know if you have any questions or suggested changes:

The sample currently meets the following specifications:

  • 50 articles each for the 3 target wikis: Albanian (sq), Indonesian (id), and Standard Written Chinese (zh).
  • Source Language - The source language is English for all translations
  • Translator Diversity and Experience - The 50 articles represent the work of 10 or more individual editors, with no individual editor contributing more than 5. 50% of the articles should have been published by a ‘newer’ editor, defined here as an account created no longer than 2 years prior (see note below regarding modification to 'senior' editor definition).
  • Topic-Category - All articles should belong to the 'earth and environment' or 'biography' category.

There were not enough translations to meet all the specified requirements for all the target wikis. See list of exceptions and modifications below:

  • Machine Translation Engine: Zhwiki and idwiki currently also include some Yandex translated articles. There were not enough Google translations in the pulled data that met the article length requirements for these target wikis.
  • Translator Diversity and Experience: The sample requirements specify that "the other half of articles should have been published by editors with CX publications beginning at least 3 years prior." Sqwiki and idwiki did not have enough translations published by editors with CX publications beginning at least 3 years prior. Instead, the other 50% of the translations in the sample were created by editors whose accounts were created over 2 years prior. I can adjust this definition as needed.
  • Article Length: There's not a great way to check paragraph length in the aggregated data so some of these may be shorter than needed. I used the number of sections as a rough indicator but, if during the review, you identify a translation that is too short - let me know and I can find a replacement.
  • Percent Modified: I was not able to meet an equal split across the three categories but got as close as possible while trying to meet the other sample specifications identified above.
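
For reference, the adjusted editor experience split described above can be sketched as below. This is an illustration only, not the code used for the pull: the function name is hypothetical, and it assumes that account age is measured against the publication date of the translation.

```python
from datetime import datetime


def editor_experience(account_created: datetime, published: datetime,
                      cutoff_years: int = 2) -> str:
    """Label an editor 'newer' or 'experienced' for a given translation.

    Assumption: the cutoff is measured from account creation to the
    publication timestamp of the translation.
    """
    try:
        cutoff = account_created.replace(year=account_created.year + cutoff_years)
    except ValueError:
        # Account created on Feb 29 and the cutoff year is not a leap year.
        cutoff = account_created.replace(
            year=account_created.year + cutoff_years, day=28
        )
    return "newer" if published <= cutoff else "experienced"
```

Changing `cutoff_years` back to 3 would give a split based on the original cutoff, although the original definition was tied to an editor's first CX publication rather than account creation.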

Let me know if there are certain specifications that you would like specifically prioritized over others or adjustments to the existing sample specification definitions. I can repull the sample as needed.

Some additional details and high-level insights from the first sample pull in case helpful:

  • There were 34,719 published translations available across all three identified target wikis (sq, id, zh) with English identified as the source language.
  • 3,511 of these translations were identified as belonging to either the 'Biography' or 'Earth and Environment' topics. (503 for idwiki, 139 for sqwiki, and 2,869 for zhwiki)

This list of 3,511 translations was what I used to pull a random sample to meet the other identified article specifications for each target wiki. Identified trends and limitations within the reviewed sample:

  • Translator Diversity and Experience:
    • About 40% of the translations were completed by translators identified as experienced (defined as editors with CX publications beginning at least 3 years prior). However, there were only 6 translations on sqwiki and 10 on idwiki by editors that met this definition.
  • Machine Translation Engine: As mentioned in my comment above, Yandex was also a commonly used machine translation tool for these articles, especially for translations with multiple sections on zhwiki and idwiki. Google was the primary machine translation tool on sqwiki.
  • Percentage the MT is modified: The majority (64%) of translations across all three target wikis were "between 11 and 50% modified" but this trend varied for each individual target wiki.
    • idwiki: the majority of translations (66%) fell into the "over 50 percent modified" category
    • zhwiki: the majority of translations (71%) fell into the "between 11 and 50 percent modified" category
    • sqwiki: Almost a 50/50 split between translations in the "less than 10 percent modified" and "between 11 and 50 percent modified" with only a small percentage of translations that were more than 50 percent modified.

Code Repo for further details.

Hi @MNeisler, thanks for all your work so far on this!

I've pulled an initial sample list of translation ids for the specified target wikis: idwiki, sqwiki, and zhwiki for review. I've compiled the translation ids and associated meta data for each translation in this google sheet. The google sheet includes a tab for each target wiki sample.

I'm looking into possible ways to provide the translations in a side-by-side format in a spreadsheet as well but before doing that wanted to check that the current sample pull meets your requirements. In the meantime, you can view a side by side of each sample by inputting the translation id into the translation debugger tool provided at https://cxdebugger.toolforge.org/

Makes sense to review before jumping into this step. To the extent that we can find a way to automate the pulling of these MT outputs and user translations (as they're referred to in the translation debugger), I think that will save a lot of manual work later on.

I ended up needing to make some adjustments to the specifications to meet the requested sample size. See summary below and let me know if you have any questions or suggested changes:

The sample currently meets the following specifications:

  • 50 articles each for the 3 target wikis: Albanian (sq), Indonesian (id), and Standard Written Chinese (zh).
  • Source Language - The source language is English for all translations
  • Translator Diversity and Experience - The 50 articles represent the work of 10 or more individual editors, with no individual editor contributing more than 5. 50% of the articles should have been published by a ‘newer’ editor, defined here as an account created no longer than 2 years prior (see note below regarding modification to 'senior' editor definition).
  • Topic-Category - All articles should belong to the 'earth and environment' or 'biography' category.

There were not enough translations to meet all the specified requirements for all the target wikis. See list of exceptions and modifications below:

  • Machine Translation Engine: Zhwiki and idwiki currently also include some Yandex translated articles. There were not enough Google translations in the pulled data that met the article length requirements for these target wikis.

Just the fact that you found this is sort of interesting as I've previously made guesses about the most frequently used engines for various language pairs. I see that zhwiki data in the spreadsheet show a 30/20 split for google/yandex, and there are just slightly fewer yandex-created translations for idwiki. Overall, I think this will be fine. The only slight modification I wonder about is if we should replace translation id 1351171 with another instance since it was produced with 2 different engines. This is interesting, and I'm curious how common that is, but for this first pass analysis maybe we should just restrict to translations published with a single MT engine. It's only one instance I see, so if it's a lot of extra work, it's not critical.

  • Translator Diversity and Experience: The sample requirements specify that "the other half of articles should have been published by editors with CX publications beginning at least 3 years prior." Sqwiki and idwiki did not have enough translations published by editors with CX publications beginning at least 3 years prior. Instead, the other 50% of the translations in the sample were created by editors whose accounts were created over 2 years prior. I can adjust this definition as needed.

The 3 year cutoff is ultimately arbitrary to some degree, so adjusting to 2 should be fine, thanks.

  • Article Length: There's not a great way to check paragraph length in the aggregated data so some of these may be shorter than needed. I used the number of sections as a rough indicator but, if during the review, you identify a translation that is too short - let me know and I can find a replacement.

Yes, I'd wondered if this would be a bit tricky, so not too surprised. What did you set the number of sections cutoff as when pulling them?

  • Percent Modified: I was not able to meet an equal split across the three categories but got as close as possible while trying to meet the other sample specifications identified above.

Ok, thanks for flagging. I see that we also have a column for raw percentages, so if we need to reconsider how we initially thought we might categorize these, that should be fine.

Let me know if there are certain specifications that you would like specifically prioritized over others or adjustments to the existing sample specification definitions. I can repull the sample as needed.

Thanks @Easikingarmager!

To the extent that we can find a way to automate the pulling of these MT outputs and user translations (as they're referred to in the translation debugger), I think that will save a lot of manual work later on.

I'm currently looking into options. I think there might be a way using the parallel corpora API. There might still be some formatting required but it would hopefully reduce a good portion of the manual work to copy and paste the articles into the side by side format.

The only slight modification I wonder about is if we should replace translation id 1351171 with another instance since it was produced with 2 different engines. This is interesting, and I'm curious how common that is, but for this first pass analysis maybe we should just restrict to translations published with a single MT engine. It's only one instance I see, so if it's a lot of extra work, it's not critical.

I only found around 3 instances of translations produced with more than 1 engine in the samples reviewed across all three of the target languages, so this appears to be pretty uncommon. I went ahead and replaced translation id 1351171 with another one (577969) that was published with a single MT engine.

Yes, I'd wondered if this would be a bit tricky, so not too surprised. What did you set the number of sections cutoff as when pulling them?

I ended up setting it as 4 sections. Note that some of these sections might have very limited text or sections such as References, which might not be ideal for the research study. I'm unfortunately not able to detect this in the aggregate data but as I mentioned I can provide replacement translations as needed. If it looks like a lot of the articles are too short, then we might want to re-prioritize or redefine some of the article specifications to extend the translations available for review.

@Easikingarmager

I'm currently looking into options. I think there might be a way using the parallel corpora API. There might still be some formatting required but it would hopefully reduce a good portion of the manual work to copy and paste the articles into the side by side format.

Update: I've confirmed that we will be able to use the API to export and present the data so that publications and MT outputs are presented side-by-side in an excel spreadsheet. After finalizing any adjustments to the current sample pull, I can start work on providing the data in this format for the study.
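
A minimal sketch of what the export could look like is below. It only builds the request URL and flattens an already-decoded response into spreadsheet rows; it does not make the request. The parameter names (`translationid`, `types`, `striphtml`) and the assumed response shape (sections keyed by id, each with optional `source`, `mt` and `user` entries carrying a `content` field) reflect my understanding of the contenttranslationcorpora API and should be checked against the API help page. The helper names and the example id are hypothetical.

```python
from urllib.parse import urlencode


def corpora_url(wiki: str, translation_id: int) -> str:
    """Build a query URL for the parallel corpora API (parameters assumed)."""
    params = {
        "action": "query",
        "list": "contenttranslationcorpora",
        "translationid": translation_id,
        "types": "source|mt|user",
        "striphtml": "true",
        "format": "json",
    }
    return f"https://{wiki}.wikipedia.org/w/api.php?{urlencode(params)}"


def side_by_side_rows(translation_id, sections):
    """Flatten a sections mapping into one row per section.

    sections: dict of section_id -> {"source": {...}, "mt": {...}, "user": {...}}
    (assumed shape; a missing or null entry becomes an empty string).
    """
    rows = []
    for section_id, versions in sections.items():
        rows.append({
            "translation_id": translation_id,
            "section_id": section_id,
            "source": (versions.get("source") or {}).get("content", ""),
            "mt": (versions.get("mt") or {}).get("content", ""),
            "user": (versions.get("user") or {}).get("content", ""),
        })
    return rows
```

The resulting list of dicts can be written out with `csv.DictWriter`, giving one spreadsheet row per section with the source, MT output and published text in adjacent columns.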

[Posted on Slack as well] Do you think setting the cut off at anything more than ‘min 4 sections’ would greatly increase the difficulty of pulling a sample? Since we can’t detect references (and other non-core prose sections), maybe setting the limit slightly higher would account for 1 or more of the sections often being a non-core section? What do you think based on what you’ve seen?

Based on what I've seen, I think we can probably increase the cut-off to 5 (maybe 6 sections at the most). It will be difficult to increase the number of sections any more than that without making any adjustments to the other specifications. In the first sample pull, I tried to get a diverse sample of translations with as close to an equal split across the identified "Percent Modified" and "translator experience" categories. If we want to prioritize article length over either of these specifications, that will make it much easier to increase the section number cut-off.

Let me know how you'd like to proceed.

Hi @MNeisler, thanks for confirming about the side-by-side presentation you're working on automating. As for the section number cut off, I think it would be helpful to increase the cutoff to 5 or 6. At that number we should likely have at least a few core sections for each article. We can revisit regarding increasing that number and adjusting other specifications only if it seems absolutely necessary. Just for purposes of coordination (i.e., no pressure intended), when do you think the final spreadsheet may be ready to go? Thanks again for your work on this.

@Easikingarmager

I was able to increase the cutoff to 6 sections for idwiki and zhwiki while still staying within the article specifications as described in T290906#7632664. I did a quick manual review of the samples using the translation debugger tool and it looks like the majority of these have at least 5 paragraphs of text. The sample translation ids for these wikis are updated in the spreadsheet.

For sqwiki, I am not able to increase the section cutoff beyond 4 sections without adjusting the article sample specifications. The biggest limitation for this wiki is the translator diversity specification. For this wiki, there are only 51 distinct editors contributing to a total of 139 translations within the identified topic categories and using Google MT engine (with 2 of those editors contributing to over 20 translations each in the sample). After limiting each of those editors to only 5 translations each, there were not enough translations that met the article length requirement and other translator diversity specifications. If you'd like to prioritize the section length for this wiki, I think increasing the number of translations allowed by an individual editor from 5 to at least 7 should help significantly.

Just for purposes of coordination (i.e., no pressure intended), when do you think the final spreadsheet may be ready to go? Thanks again for your work on this.

@mpopov has kindly offered to work on providing the side-by-side text spreadsheet using contenttranslationcorpora API for this once the final list of content translation ids is confirmed. I'll create a separate task for that work and he can advise of timing. Is there any specific time you need the final spreadsheet by?

Also, in regard to the final side-by-side sample presentation, how would you like to handle any rich content in articles (for example, images, templates, and formatting such as links and bold text)? Would you like to see if those can be included in the final sample presentation or are you only interested in reviewing the article text for the research study?

Hi @MNeisler, increasing the translations allowed by any individual editor from 5 to 7 is ok. I think for consistency/comparison purposes later on we could consider making that adjustment for all three languages if it allows us to increase the amount of content we're able to pull for each of the articles. So, adjusting from 5 to 7 is a minimal arbitrary change, whereas lacking sufficient content to analyze potentially compromises the analysis more.

In terms of rich content, thanks for bringing that up. To the extent that it's possible, it would be ideal to include any and all content that is translated into the target article as it gives us a better overall picture. I'm actually quite curious to know how much of this content is actually making it in, and in what form. This would be helpful in understanding possible gaps in terms of where the translation tool might be able to provide better support.

@mpopov, thanks in advance for being willing to help, much appreciated. In terms of timeline, I anticipate we could be ready to start within 3-4 weeks, but if you're still wrapping up then, we could always begin by pulling them manually at first. In the very early stage we'll be sorting out some details in how we're approaching things so it'll go slower at first.

Thanks again for your continued work on this.

increasing the translations allowed by any individual editor from 5 to 7 is ok.

Thanks @Easikingarmager! I ended up having to increase the translations allowed by any individual editor to 8 in order to obtain enough samples to meet the section length requirement while meeting the other translator diversity specifications. This adjustment was done across all three target languages for consistency. However, after a quick look through some of the pulled samples, it looks like there are no instances of this occurring for either idwiki or zhwiki because there are more unique translators in these wikis.

Let me know if you have any concerns or need any adjustments. Hopefully, these adjustments provided samples with sufficient content for the analysis but, as mentioned earlier, if you discover a translation that doesn't work well due to issues we were not able to detect in the available aggregate data, let me know and I can look into replacing it.

Parallel (Side-by-Side) Presentation/Formatting
I created a separate task (T304453) for @mpopov's work on providing the side-by-side text spreadsheet using the final sample translation ids and the contenttranslationcorpora API. I took an initial pass at specifying any format requirements in that task description based on the info in this task and our discussions but please update as needed.