
Provide parallel text format of Content Translation sample publications
Open, Medium, Public

Description

In T290906, we pulled a sample of content translation IDs that met a specific set of article specifications for an upcoming design research study (see T288012). The sample consists of 50 published translations for each of the 3 target languages: Albanian (sq), Indonesian (id), and Standard Written Chinese (zh).

To support the linguistic analysis that will follow, we now want to deliver these translations with the CX publications and MT outputs presented side-by-side, ideally in a spreadsheet.

See suggested format specifications below:

Format Specifications:
For each of the translations in the sample, the following data is needed in a side-by-side format (a sketch of one possible row layout follows the list):

  • CX-published article at the time of initial publication. This is the user version from the API.
  • Initial unmodified machine-translation output for each CX publication. This is the mt version from the API.
  • [Nice to Have] Historical snapshot of source article at time of MT output generation. This is the source version from the API.
  • Associated metadata, or a link to the metadata provided in the sample pull doc.
  • [Nice to Have] Breaking the articles down so that MT outputs and the CX-published article are presented paragraph by paragraph would be further advantageous.
  • [Nice to Have] To the extent that it's possible, it would be ideal to include any and all content (such as images, templates, etc.) that is translated into the target article, as it gives us a better overall picture.
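For concreteness, here is a minimal sketch (in R, which the delivery scripts below are written in) of how one such side-by-side row could be laid out. Every column name and value here is an illustrative assumption, not an actual API field:

```r
library(tibble)

# One row per translation section, with the three API versions side-by-side.
# All names/values below are hypothetical placeholders:
side_by_side <- tibble(
  translation_id = 12345L,         # hypothetical CX translation ID
  section_index  = 1L,             # paragraph/section position within the article
  source_text    = "...",          # 'source' version (nice to have)
  mt_text        = "...",          # unmodified 'mt' version
  user_text      = "...",          # published 'user' version
  metadata_link  = "https://..."   # link back to the sample pull doc
)
```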

Data Sources:

Timeline
Per @Easikingarmager: "I anticipate we could be ready to start within 3-4 weeks, but if you're still wrapping up then, we could always begin by pulling them manually at first. In the very early stage we'll be sorting out some details in how we're approaching things so it'll go slower at first."

Event Timeline

mpopov triaged this task as Medium priority. Mar 29 2022, 5:10 PM

Hi @mpopov, just wanted to check in on this ticket. I'm in the process of onboarding the language consultants who will be helping with this work, so we're expecting to start soon. Depending on where you are with the task, I have a couple of thoughts about organization that I'd like to run by you, if you don't mind me proposing a short meeting on your calendar. Thank you.

@Easikingarmager: I was finally able to work on this. I have the data ready to be delivered, but I'm not sure how to proceed with actually delivering it.

The maximum character limit for cells in Google Sheets is 5000, but here's the data we're dealing with (a sketch of how such counts can be computed follows the table):

| Content format | Version | Characters in biggest observed translation section |
|---|---|---|
| HTML | MT | 131594 |
| HTML | Source | 494966 |
| HTML | User | 549828 |
| Plain text (HTML-stripped) | MT | 13424 |
| Plain text (HTML-stripped) | Source | 32745 |
| Plain text (HTML-stripped) | User | 18093 |
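For reference, counts like these can be produced along the following lines. This is a sketch, not the actual script, and it assumes a hypothetical data frame `translations` with columns `version` ("mt"/"source"/"user") and `html`:

```r
library(dplyr)
library(rvest)  # read_html() is re-exported from xml2

# Strip HTML tags to get the plain-text rendering of each section:
strip_html <- function(html) {
  vapply(html, function(x) html_text2(read_html(x)), character(1))
}

translations %>%
  mutate(plain = strip_html(html)) %>%
  group_by(version) %>%
  summarize(
    max_html_chars  = max(nchar(html)),
    max_plain_chars = max(nchar(plain))
  )
```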

Now, according to https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3, Excel's limit is 32767 characters per cell.

So if we restrict ourselves to plain text with the HTML stripped out then we just barely make it (32745 < 32767), but the translations could only be viewed in Microsoft Excel and Apple Numbers.

I was able to open the plain text translations (all 3 wikis in 1 CSV file) in Numbers and didn't get any errors, so presumably it's at least as good as Excel. I was then able to see that some translations include some wikitext markup (such as "[[File…]]").
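A quick heuristic for spotting such remnants in the stripped text, again assuming the hypothetical `translations` data frame with a `plain` column:

```r
# Count stripped sections that still contain wikitext-style link markup
# such as "[[File:...]]"; the pattern is a rough heuristic, not exhaustive:
has_wikitext <- grepl("\\[\\[", translations$plain)
sum(has_wikitext)
```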

I can share the CSV via Google Drive (and can split it up into multiple files by wiki if you'd like). Happy to talk about the next steps in a short meeting, too.

@mpopov and I met on 26 April to discuss a solution to the gsheets cell character limit problem encountered. Thanks for your time and help with this!

I'm making note of what we discussed and the solution we arrived at, so that I don't forget and to make sure I've got the details correct:

  • We manually reviewed the cells that would exceed 5000 characters and identified that all but one were reference sections. Because reference sections are outside of our analysis (we're only looking at 'core' article content), we've opted to simply remove them.
  • The one remaining cell exceeding 5000 characters belonged to a section with no corresponding machine translation output/user output (not all sections of a given article have these), so we were able to simply remove that row as well.
  • With these removals, no cell should exceed the 5000 character limit, so we'll proceed with using gsheets. A separate gsheet will be produced for each of the 3 language pairs. (A rough R sketch of this cleanup follows.)
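Roughly, the cleanup amounts to the following. This is a sketch; `is_reference_section`, `mt_text`, and `user_text` are hypothetical column names:

```r
library(dplyr)

CELL_LIMIT <- 5000  # Google Sheets cell limit discussed above

cleaned <- translations %>%
  # Reference sections are outside the 'core' content analysis:
  filter(!is_reference_section) %>%
  # Drop the row whose section has no corresponding MT/user output:
  filter(!(is.na(mt_text) & is.na(user_text)))

# Sanity check: no remaining cell should exceed the limit.
stopifnot(all(nchar(cleaned$mt_text)   <= CELL_LIMIT, na.rm = TRUE),
          all(nchar(cleaned$user_text) <= CELL_LIMIT, na.rm = TRUE))
```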

Spreadsheet up at https://docs.google.com/spreadsheets/d/1ASDsYrA6Jz_i2Vc69nMnFVjjc8IMiRLRUfTNcg3o8HI/edit#

Source code at https://gitlab.wikimedia.org/bearloga/cx-machine-translations-text-sample

@MNeisler: here's the guide to the scripts in case you need to re-run them while I'm out this week (a sketch of the final upload step follows the list):

  • translation_ids.R retrieves the translation IDs from the metadata spreadsheet – this is meant to be run locally (due to requiring Google auth/credentials)
  • translations_text.R retrieves & wrangles the translations from the MW API using an internal endpoint – this is meant to be run on our stat100X.eqiad.wmnet hosts
  • translations_deliverable.R puts together the final dataset and uploads it to Google Sheets – this is meant to be run locally (due to requiring Google auth/credentials)
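For reference, the upload step in translations_deliverable.R presumably does something like the following. This is a sketch, not the actual script; `cleaned` and `target_language` are hypothetical names, and googlesheets4 is the R package whose interactive Google auth is why the script must run locally:

```r
library(dplyr)
library(googlesheets4)

gs4_auth()  # interactive Google authentication, hence "run locally"

# One spreadsheet per language pair, as agreed above:
for (lang in c("sq", "id", "zh")) {
  gs4_create(
    name   = paste0("cx-translations-", lang),  # hypothetical naming scheme
    sheets = list(translations = filter(cleaned, target_language == lang))
  )
}
```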