
Qualitative Exploration of Content Translation Tools
Closed, Resolved, Public

Description

Building on this notebook (https://paws-public.wmflabs.org/paws-public/User:Isaac_(WMF)/Content%20Translation%20Example.ipynb), start to explore what types of content are translated and what happens to these articles once they are created.

Some starting examples include trying to better understand what happens to the translated article after it is created. The page history for every Wikipedia article is publicly available. Each article also has a corresponding talk page, on which editors might discuss the content of the page and other related items. If you are unfamiliar with how to access this content, see these overviews of how to access page history (https://en.wikipedia.org/wiki/Help:Page_history) and talk pages (https://en.wikipedia.org/wiki/Help:Talk_pages).

For example, for the English version of Gradient Boosting, these can be found at:

  • Page history: https://en.wikipedia.org/w/index.php?title=Gradient_boosting&action=history
  • Talk page: https://en.wikipedia.org/wiki/Talk:Gradient_boosting
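
If you would rather pull an article's edit history programmatically than browse it on the wiki, here is a minimal sketch using the mwapi library and the standard MediaWiki revisions API (the user_agent string here is just an example):

  import mwapi

  session = mwapi.Session(host='https://en.wikipedia.org',
                          user_agent='content translation exploration')

  # Fetch the 20 most recent revisions of the article, with the
  # timestamp, editor, and edit summary for each.
  res = session.get({'action': 'query',
                     'prop': 'revisions',
                     'titles': 'Gradient boosting',
                     'rvprop': 'timestamp|user|comment',
                     'rvlimit': 20,
                     'format': 'json'})

  for page in res['query']['pages'].values():
      for rev in page['revisions']:
          print(rev['timestamp'], rev['user'], rev.get('comment', ''))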

Go through the edit histories for a few articles and begin to identify whether any trends emerge about the types of edits that happen to translated articles. Compare the translated and source articles in their current state. What types of content were added after the translation? Are the articles diverging in content or staying similar? What sorts of discussions occur on the talk pages of translated articles?

Eventually we can do this in a more robust manner: more carefully choosing which articles to examine, developing more concrete questions to answer, building a code book for annotating article histories, content, or discussions, etc.

Event Timeline

Hello @Isaac, thanks for providing the list of tasks. I would like to take T218003.

@Isaac I have a few doubts. Actually, I am unable to identify what exactly I have to do, so correct me if I am wrong:
"Basically, my task is to write code which will compare all the translations done in the past?"

Hello! @Isaac, I also have a similar question regarding this task. Do I need to access the History and Talk pages programmatically and then analyze them, or can I do it manually?

"Basically my task is to write a code which will compare all the translations done in the past?"

Hey @NuKira : this is a research-oriented project, so it's going to be more open-ended than many bugs are. I recognize that this will be new to many of us, so I'll try my best to keep the goals clear, but this project, at least at the start, is going to be less about producing code that does something very specific and more about beginning to understand how the content translation tool is used and the impact of these articles on the wikis to which they are added. At this stage, you'll need to generate and implement some ideas about how you could describe the articles that are being created by the content translation tool. This might be some statistics on the page views they receive, the amount of new content added after they are translated, the number of edits by new users, etc. It could also be a description of the types of content that people are translating / choosing not to translate for a number of example articles, which could help guide future quantitative analyses. Hopefully that helps, but keep asking questions if not!

Do I need to access the History and Talk pages programmatically and then analyze them, or can I do it manually?

@Cherrywins : that's up to you at this point. If you do it programmatically, you should do some statistical analyses as well (# of edits etc.). If you choose to do it manually, take advantage of looking deeply into the data and note what patterns you see in the text (that could then potentially be quantified).
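
If you do go the programmatic route, here is a minimal sketch of one such statistic - counting the edits and distinct editors on an article - using the standard MediaWiki revisions API with manual continuation (the host and title are placeholders to swap for a translated article you are studying):

  import mwapi

  session = mwapi.Session(host='https://es.wikipedia.org',
                          user_agent='content translation exploration')

  # Count edits and distinct editors for one article, following the API's
  # 'continue' paging so articles with many revisions are fully counted.
  parameters = {'action': 'query',
                'prop': 'revisions',
                'titles': 'Gradient boosting',  # placeholder title
                'rvprop': 'user',
                'rvlimit': 'max',
                'format': 'json'}
  edits = 0
  editors = set()
  while True:
      res = session.get(parameters)
      for page in res['query']['pages'].values():
          for rev in page.get('revisions', []):
              edits += 1
              editors.add(rev.get('user'))
      if 'continue' not in res:
          break
      parameters.update(res['continue'])

  print(edits, 'edits by', len(editors), 'distinct editors')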

Hi Isaac,
I don't understand how to download the dump file which I need to get the corresponding parallel translation. I also don't have any output when I run the example code in my notebook on PAWS.
I hope you can direct me to what I need to do after running the examples in my notebook.

I'm also trying to change the target language from Spanish to Arabic by putting ar instead of es, but I don't have any output either, so I don't know if this is right or not.

I don't understand how to download the dump file which I need to get the corresponding parallel translation.

No problem - the notebook has a lot to take in. The directions are under the section "Get corresponding parallel translation". Briefly, with a bit more information about the download process:

  1. In my example PAWS notebook, I directed you towards: https://www.mediawiki.org/wiki/Content_translation/Published_translations#Dumps
  2. In that wiki, you'll find the link to the dumps site: https://dumps.wikimedia.org/other/contenttranslation/
  3. Choose the most recent date and find the .text.json.gz dump file corresponding to the two languages you chose. In my example, I was interested in articles translated from English to Spanish, so that file was: cx-corpora.en2es.text.json.gz
  4. Download that file and upload it to PAWS, where you can then access it as if it were a local file on your computer (I give further details and an example in the notebook).
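
Once the file is in PAWS, reading it is a couple of lines of Python - a minimal sketch, assuming the dump is a single gzipped JSON array of section-level records as in the example notebook:

  import gzip
  import json

  # Each record holds the source text, the machine translation (when one
  # was offered), and the published target text for one section.
  with gzip.open('cx-corpora.en2es.text.json.gz', 'rt') as fin:
      parallel_corpus = json.load(fin)

  # Inspect the first record to see its structure.
  print(parallel_corpus[0])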

I also don't have any output when I run the example code in my notebook on PAWS.

You'll have to give me further details about what code you're running for me to help you more, but for a cell to have output, you'll need to explicitly print() something or just call the variable's name. For the future, it is generally good practice to include the exact code you are running when asking a question so that someone can easily reproduce your issue.

I’m also try to change the target language from spanish to Arabic by put ar instead of es

Yes, that should work. The English -> Arabic file would then be cx-corpora.en2ar.text.json.gz.

Hi,
I have downloaded the dump file and uploaded it to PAWS, but I still don't have any output from the parallel translation cell. This is my public PAWS URL: https://paws-public.wmflabs.org/paws-public/User:Israashahin/Untitled.ipynb?kernel_name=python3
I want to see output because I'm new to this, so that I can then change the target language and the articles in the parallel translation.
Thanks for your help.

After I restart the kernel and run the cell for the parallel translation, I get:
NameError: name 'gzip' is not defined

The code for gzip:

  with gzip.open('cx-corpora.en2es.text.json.gz', 'r') as fin:

And in the cell that follows I get:
NameError: name 'parallel_corpus' is not defined

The code:

  for sec in parallel_corpus:

What I want to do now, after I understand why I have these errors, is to look at the Arabic parallel translations for some of the most popular articles on Wikipedia, examine the fluency and accuracy of the translations, and see the number of translations for these articles. My question here: is this considered a contribution to the task or not?

@Isaac I have a doubt: can't I use the notebook that you have created (the example notebook)?

Please be patient with me because I am still a beginner with notebooks. I have another question: when I try to set the parameters for articles translated from English to Arabic, I don't have any output after running the cell; I just get an empty cell like this.

The code here is:

  parameters = {'action': 'query',
                'format': 'json',
                'list': 'cxpublishedtranslations',
                'from': 'en',
                'to': 'ar',
                'limit': 500,
                'offset': 20000}

  res = session.get(parameters)
  res['result']['translations'][:10]

@Israashahin Did you use the correct link for the JSON file?

Hello again!
If I have some comments/suggestions about the mwapi API work, can I include them in my report?

Ekaterina

@Israashahin : correct me if I'm wrong, but I think you're working from this notebook: https://paws-public.wmflabs.org/paws-public/57578672/wiki%20project.ipynb
No worries about being new to this - a few things:

  • When you start a Jupyter Notebook session, you'll still see output from the previous session, but you need to re-run a cell for it to be active. If the notebook says it doesn't understand the name "gzip", that means you need to re-run the cell with the import gzip code.
  • This is also true if a cell throws an error - any cells below it probably won't work if they depend on that cell running to completion. This is why parallel_corpus is not being recognized.
  • I'll also note that, for whatever reason, your gzipped en2es corpus does seem to be misformatted. If you want to continue to use it, I'd suggest downloading another version. I'm not sure why it is misformatted (perhaps an error when downloading), but gzip should not have any issue opening a healthy copy of the file - see the sketch below for a quick way to check.
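
A minimal sketch for that integrity check - it tries to decompress the whole file; a clean download reads to the end, while a truncated or corrupted one raises an exception partway through:

  import gzip

  def is_valid_gzip(path):
      try:
          with gzip.open(path, 'rb') as fin:
              # Read in 1 MB chunks until EOF; corruption surfaces here.
              while fin.read(1024 * 1024):
                  pass
          return True
      except (EOFError, OSError):  # gzip.BadGzipFile is a subclass of OSError
          return False

  print(is_valid_gzip('cx-corpora.en2es.text.json.gz'))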

I don't have any output after running the cell; I just get an empty cell

I noted in my notebook that 20000 was an arbitrary offset for the API and that if you got an empty output, you should decrease this number and try again. Check the notes again at the top after the Initialization cell: https://paws-public.wmflabs.org/paws-public/User:Isaac_(WMF)/Content%20Translation%20Example.ipynb
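
If you want to automate that, a minimal sketch that steps the offset down until the API returns results (this reuses the session from the Initialization cell; the offsets tried here are arbitrary):

  # Try progressively smaller offsets until the API returns translations.
  for offset in (20000, 10000, 5000, 1000, 0):
      res = session.get({'action': 'query',
                         'format': 'json',
                         'list': 'cxpublishedtranslations',
                         'from': 'en',
                         'to': 'ar',
                         'limit': 500,
                         'offset': offset})
      translations = res.get('result', {}).get('translations', [])
      if translations:
          print('offset', offset, 'returned', len(translations), 'translations')
          break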

What I want to do now, after I understand why I have these errors, is to look at the Arabic parallel translations for some of the most popular articles on Wikipedia, examine the fluency and accuracy of the translations, and see the number of translations for these articles. My question here: is this considered a contribution to the task or not?

Yeah, this is a good start. If you're selecting just a few popular articles to look at, I'd highly suggest going through the history of those articles since the translation to also see how they have changed and whether there is now new content in Arabic that is not in the English article and vice versa.

I have a doubt: can't I use the notebook that you have created (the example notebook)?

@NuKira: could you be a bit more detailed about what your doubts are, and then I can see how I can help.
And thanks for trying to help out with the JSON issue -- in general, I highly encourage you to help each other, especially if I'm slow in responding!

If I have some comments/suggestions about mwapi API work, can I include in it my report?

@Cherrywins : sure -- I'd suggest focusing on any insights you have into the data, but do feel free to comment on the API itself if it seems pertinent.

@Isaac do I have to make an exact copy of the example notebook that you have shared, or can I do something different?

@Isaac isn't there any way that I can directly access your notebook?

do I have to make an exact copy of the example notebook that you have shared, or can I do something different?

@NuKira: You should build on the examples that I provided, but don't feel constrained by what I included. I wanted to give a few examples so that everyone could focus more on looking at the data and thinking of ways to analyze it, rather than figuring out how to access the data in the first place. Specifically, I'm hoping that you choose languages that you know personally, so that you can provide some insight when you explore what types of content are or are not translated between those two languages, what types of content are added after the translation, etc.

isn't there any way that I can directly access your notebook?

No - you should create your own notebook. If you want, you could download my notebook and upload it to your PAWS directory as a starting place.

To add some more ideas to the mix and to provide some assistance if you're having trouble choosing something to focus on: consider choosing a topic like vaccines (see this article: https://blog.wikimedia.org/2016/03/29/wikipedias-essential-vaccines/) and specifically focusing on those articles to ask questions like "has new content in the source language been added to the target language?"

Isaac triaged this task as Medium priority. Mar 13 2019, 6:06 PM

https://paws-public.wmflabs.org/paws-public/User:Israashahin/Untitled.ipynb?kernel_name=python3

This is my public link to my PAWS notebook; I have two questions here.

The first one: why don't I get a translation in the parallel translation output? Is it because the Arabic translation covers only the general description of the articles I chose (not the whole article), or is there a problem with the download of the dump file?

The second one: how can I get back to the history of the translations for the articles I chose?

The first one: why don't I get a translation in the parallel translation output? Is it because the Arabic translation covers only the general description of the articles I chose (not the whole article), or is there a problem with the download of the dump file?

Hey @Israashahin : why don't you try another example to see whether the user just did not translate that particular section or whether this is a more widespread issue.

The second one: how can I get back to the history of the translations for the articles I chose?

Check out the Quantitative Analyses section of the notebook that I provided for example code for that. If you have already and still don't understand, let me know which parts specifically are an issue.

Hi @Isaac, I have been trying to run the code from https://paws-public.wmflabs.org/paws-public/User:Isaac_(WMF)/Content%20Translation%20Example.ipynb, but I am getting no output on running:

  session = mwapi.Session(host='https://en.wikipedia.org', user_agent='mwapi (python) -- outreachy content translation')

  parameters = {'action': 'query',
                'format': 'json',
                'list': 'cxpublishedtranslations',
                'from': 'en',
                'to': 'es',
                'limit': 500,
                'offset': 20000}

  res = session.get(parameters)
  res['result']['translations'][:10]

I have tried testing different values of offset but I still don't get any output.

Hey @Mansi29ag : is this the notebook you're trying this out in: https://paws-public.wmflabs.org/paws-public/57510755/Untitled.ipynb

If so, it looks like you changed the res['result']['translations'][:10] line into a Markdown cell instead of a code cell. For any cell in a notebook, you can set it either to code (where in this case it'll assume it's Python code) or Markdown (where it will be rendered as text and so will not execute code). There is a dropdown in the toolbar at the top of the page that allows you to set this for each cell.

@Isaac I'm a bit confused. Can you please give an example of how to do qualitative analysis?

@NuKira at this point I understand that not everyone will have experience with qualitative methods, so do not worry if you're not certain of the right approach. What you should focus on is whether you can generate some questions or hypotheses around the content translation tool. These can be aimed at the types of content that are / are not translated, or at what happens after an article is translated. So go through some of the articles that have been translated and look for patterns. For example: do you see that overview content is translated but that more detailed specifics of an article are often left behind? If so, maybe give some examples of sections that correspond to each. Do you find that new content that is more culturally specific is added to the translated article after it has been created?

Practically, to report this, you could collect a few examples in your PAWS notebook and provide markdown cells that describe your observations. If you include quantitative analyses, include those outputs in the notebook as well.

Hope that helps!

Hey @Mansi29ag : is this the notebook you're trying this out in: https://paws-public.wmflabs.org/paws-public/57510755/Untitled.ipynb

If so, it looks like you changed the res['result']['translations'][:10] line into a Markdown cell instead of a code cell. For any cell in a notebook, you can set it either to code (where in this case it'll assume it's Python code) or Markdown (where it will be rendered as text and so will not execute code). There is a dropdown in the toolbar at the top of the page that allows you to set this for each cell.

Yes, thanks @Isaac. I completely forgot that it might have been changed to a Markdown cell.

@Isaac I was working on English to Hindi translations to see what types of edits are made by the user after machine translation. While I was into this, I found a strange thing: the stats mentioned do not seem to be correct.
For example, for the source titled "Chewang Norphel", the stats say changes by humans are 1.07 out of 1.23, but in the history of the Hindi translation (https://hi.wikipedia.org/w/index.php?title=%E0%A4%9A%E0%A5%87%E0%A4%B5%E0%A4%BE%E0%A4%82%E0%A4%97_%E0%A4%A8%E0%A5%89%E0%A4%B0%E0%A4%AB%E0%A4%B9%E0%A5%87%E0%A4%B2&action=history) there are only 5 edits, of which the first one is the machine translation and the remaining 4 are very minor edits made by a person. How can the human edit percentage be so high?


Also, since this project was added quite late, will the last date for submitting applications be extended for this project, as some projects have 2 April as the last date?
Thank you :)

@Isaac I have a similar question to @Mansi29ag's.
While I explored the data from the list of published source and target titles (in the initialization step), I became curious about the translation percentages in the stats data. I found that there are some articles where 'mt' (the machine translation percentage) is higher than 1. How is it possible that the machine translated more than what's in the source article?

@Mansi29ag and @Supida_h : Glad you're looking into this -- I believe those statistics are for the initial translation (not what happens afterwards, which is one reason that this is an important research project) and indicate what proportion of content is translated over and whether it was created by humans or came from the machine translation. Because they're based on word counts, if the translated article has more words than the source article, this will result in a number over 1. For example, if the source article had 1000 words and the translated article had 1200 words, then this would result in 1.2 for 'any', and if half of those 1200 words were suggested by the machine translation and half were added by the editor, that would be 0.6 for 'mt' and 0.6 for 'human'.
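
A toy illustration of that word-count arithmetic (an assumption about how the ratios relate, not the tool's exact formula):

  # Hypothetical word counts for one translated article.
  source_words = 1000
  mt_words = 600      # words kept from the machine translation suggestion
  human_words = 600   # words typed or reworked by the editor

  # Ratios are relative to the source article's length, so 'any' can
  # exceed 1 whenever the translation is longer than the source.
  any_ratio = (mt_words + human_words) / source_words   # 1.2
  mt_ratio = mt_words / source_words                    # 0.6
  human_ratio = human_words / source_words              # 0.6
  print(any_ratio, mt_ratio, human_ratio)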

Also, since this project was added quite late, will the last date for submitting applications be extended for this project, as some projects have 2 April as the last date? Thank you :)

I'll check and get back to you, but I'd say do your best to put your thoughts down in your notebook, and I understand if they are not complete (feel free to also add ways in which you would like to continue the work).

@Mansi29ag and @Supida_h : Glad you're looking into this -- I believe those statistics are for the initial translation (not what happens afterwards, which is one reason that this is an important research project) and indicate what proportion of content is translated over and whether it was created by humans or came from the machine translation.

I think the stats information for some of the articles has counted machine translation as human edits, as I have found many articles which are mostly machine translation but still show a very, very high human edit percentage.

Hi @Isaac,

Am I correct in understanding that the talk page basically records the changes made to the original article in English? And do we use different tools to compare the articles after translation?

Am I correct in understanding that the talk page basically records the changes made to the original article in English?

@Doriszhou1224 : see this for an introduction to Talk Pages: https://en.wikipedia.org/wiki/Help:Talk_pages

And do we use different tools to compare the articles after translation?

Feel free to define your own research questions, but I provided as an example that one question you can seek to answer is what happens to the source and target articles after translation -- for example, whether they stay similar or whether substantive content is added to one but not the other.
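
As a very rough first pass at that question, here is a minimal sketch comparing the current sizes of the two articles (page length in bytes via prop=info; the hosts and titles are placeholders - take the real source/target titles from the dump, since they usually differ between wikis):

  import mwapi

  def page_length(host, title):
      # prop=info returns the current length of the page in bytes.
      session = mwapi.Session(host=host, user_agent='content translation exploration')
      res = session.get({'action': 'query',
                         'prop': 'info',
                         'titles': title,
                         'format': 'json'})
      page = list(res['query']['pages'].values())[0]
      return page.get('length')

  print(page_length('https://en.wikipedia.org', 'Gradient boosting'))
  print(page_length('https://es.wikipedia.org', 'TARGET_TITLE_FROM_DUMP'))  # substitute the target title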

Closing this task as it served primarily as a discussion space for the qualitative analysis component of the Outreachy application. Research is now ongoing under T223765.