Retrieve more Content translation publications to add to Indonesian Wikipedia sample
Closed, Resolved (Public)

Description

This task looks into adding more Content Translation publications to the Indonesian samples retrieved in T290906, with the aim of reaching 45+ total Indonesian articles that meet the sample specifications.

Proposed steps:

  1. Investigate and confirm availability by re-pulling the sample from Indonesian Wikipedia to see how many new translations are available since the last sample pull.
  2. If sufficient, re-run scripts documented in T304453 to produce a spreadsheet with parallel text for the new translations.
  3. QA spreadsheet
  4. Provide spreadsheet with new samples to @Easikingarmager to transfer to a new spreadsheet for review

Event Timeline

MNeisler renamed this task from "Retrieve more Content translation publications to add to to Indonesian Wikipedia sample" to "Retrieve more Content translation publications to add to Indonesian Wikipedia sample". Jun 9 2022, 6:36 PM
MNeisler claimed this task.
MNeisler triaged this task as Medium priority.
MNeisler updated the task description.
MNeisler added a subscriber: Pginer-WMF.

@Easikingarmager I retrieved all Indonesian (id) translations that have been published since the last sample pull. See summary below.

Indonesian (id)

  • There have been 388 distinct published English-to-Indonesian translations that were started since the last sample pull in January 2022
  • Only 7 of these were labeled as fitting within the identified topic categories ('nature/natural phenomena' or 'biography')

Since there are only 7 and it's likely a portion of these will be discarded due to insufficient text, we won't be able to meet the 45+ requirement at this point.
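For illustration only, the availability check behind these counts can be sketched as below. This is not the script documented in T304453: the `Translation` record shape, the `new_candidates` helper, the toy rows, and the exact cutoff date are all assumptions made up for the example. Only the two topic labels come from the summary above.

```python
from dataclasses import dataclass
from datetime import date

# Topic labels quoted in the summary above.
TARGET_TOPICS = frozenset({"nature/natural phenomena", "biography"})


@dataclass(frozen=True)
class Translation:
    # Hypothetical record shape; the real CX data has its own schema.
    translation_id: int
    source_lang: str
    target_lang: str
    started: date
    topic: str


def new_candidates(rows, target_lang, since, topics=TARGET_TOPICS):
    """Return (distinct new en->target translations, those in wanted topics)."""
    seen = {}
    for row in rows:
        if (
            row.source_lang == "en"
            and row.target_lang == target_lang
            and row.started >= since
        ):
            # Keep the first row per translation id so duplicates count once.
            seen.setdefault(row.translation_id, row)
    distinct = list(seen.values())
    on_topic = [r for r in distinct if r.topic in topics]
    return distinct, on_topic


# Toy data, not real translations.
rows = [
    Translation(1, "en", "id", date(2022, 2, 3), "biography"),
    Translation(1, "en", "id", date(2022, 2, 3), "biography"),  # duplicate row
    Translation(2, "en", "id", date(2022, 3, 10), "sports"),
    Translation(3, "en", "id", date(2021, 12, 20), "biography"),  # before cutoff
    Translation(4, "en", "sq", date(2022, 4, 1), "nature/natural phenomena"),
]

# Assumed cutoff; the thread only says "January 2022".
distinct, on_topic = new_candidates(rows, "id", since=date(2022, 1, 1))
print(len(distinct), len(on_topic))
```

The same helper could be called with "sq" or "zh" as the target language to produce the counts for the other two wikis.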

However, I added these translations to the CX sample translation ID list in the "idwiki_sample_new" worksheet in case you'd like to check whether any are worth adding to the final sample. The quickest way to preview the side-by-side text is to enter the translation ID into the translation debugger tool.

Let me know if you think it would be worthwhile to rerun the code to retrieve the parallel text for any of these new samples.

For reference, I also took a quick look at any new published translations for Albanian (sq) and Chinese (zh) in case additional samples are needed for either of these wikis.

Albanian (sq)

  • 475 English to Albanian (sq) distinct published translations
  • 8 of these were labeled as fitting within the identified topic categories

Chinese (zh)

  • 1627 English to Chinese (zh) distinct published translations
  • 18 of these fit within the identified topic categories

Hi @Easikingarmager - Just following up. Is anything else needed on this task or can this be resolved?

Hi Megan, thanks for checking in. Yes, we can consider this resolved; I don't think anything else will be needed. Thanks for your collaboration on this!
Eli

Closing per last comment