Better understand impact of content translation tools
Closed, ResolvedPublic

Description

NOTE: Outreachy application deadline for this project extended to April 2nd

The Content Translation tool has supported the creation of over 400,000 articles across a large variety of Wikipedia language projects. We have very little understanding, however, of what parts of the tool work well and what happens to the articles after they have been created. For instance, what types of article sections are translated as-is? What sections are changed substantially? Do translated articles see subsequent editing and linking to other articles in the project?

This task links to a number of related projects that could happen around these larger questions about the adoption of translated content on Wikipedia. We would hope for a mixed-methods approach that uses both quantitative analyses (e.g., edit counts, topics that are more frequently translated, etc.) and qualitative analyses (e.g., content analysis of translated pages and subsequent edits, talk pages, etc.).

Mentors

  • @Isaac (IRC channel: #wikimedia-research)

Students

  • TBD

Skills

You should have at least one of the below and a strong desire to learn:

  • Jupyter Notebooks: this is how we would like to present the output of any project. If you have prior experience with Jupyter notebooks, great! If not, we can help you learn it.
  • Quantitative: Python has the most generous support for analyzing Wikimedia data and is well-suited to these sorts of analyses. Other languages such as R might be appropriate for data visualization or other related tasks but are secondary.
  • Qualitative: a basic level of familiarity or strong desire to learn qualitative coding or other content analysis techniques.
  • Language: not required, but if you are confident in reading at least one language other than English, that can help us define a focus area and make qualitative analyses and the checking/debugging of code much simpler for translations that involve that language.

Set-up

Micro Tasks

Example analyses are shown in the Content Translation Example notebook linked in Set-up. Using that notebook as a starting point, create your own notebook and do one or both of the qualitative / quantitative analyses. All PAWS notebooks have the option of generating a public link, which can be shared back so that we can evaluate what you did. Use a mixture of code cells and markdown to document what you find and your thoughts.

  • Exploratory qualitative analysis (T218003)
  • Exploratory quantitative analysis (T218004)

Further reading

Event Timeline

Thanks @Pginer-WMF! Exactly the resources we were looking for to make sure we're not duplicating work that has already been done. Just to give you a sense of the timeline:

I'm going to make this task officially part of Outreachy and then over the next few weeks, we will see if we get any interested participants. If we do, then they would be starting work on this at the end of May. So we have a little bit of time to define a specific project (and will want to match that to any participant's particular skills/interests anyway). It seems from earlier discussions with @Amire80 and what you posted above that some of the general metrics are being worked on already, but there are still plenty of open questions around what happens to articles after translation (at a more fine-grained level than survive/delete, though that already suggests hopeful results for the tool) and whether there are certain types of content for which contributors use the ML tools with only minor changes vs. more extensive editing.

We'll keep you in the loop as this moves forward and set up a call if we do get a participant to work on this research.

@Isaac Hey! I am Nikita, an Outreachy applicant. I am interested in this project.
Skills

  • I work with Jupyter notebooks for Python programming and projects.
  • I program in Python and know quite a bit.
  • Always ready to learn :P
  • I know how to read and write two Indian languages, Hindi and Punjabi.

I am particularly interested in qualitative analysis. I would love to get some more insight into the project.

@Isaac
Hi, I am Israa Shahin, an Outreachy applicant, and I am interested in this project. I am Arab, so I know how to read and write Arabic. I am a beginner in Python and am also interested in qualitative analysis.
I would love to know how to get started and contribute to this project.
Thanks.

@Isaac
Hey, I am Areefa, an Outreachy applicant; this project seems really cool.
I have a little experience with Python and am still a beginner, and I would like to improve this skill further.
I'd love to know how to contribute to this project and get more information on it.
Does the data come from the Wikimedia APIs, or are there other data sources planned for use in the analysis?
Thanks :D

@Isaac
Hi, I am Aakanksha Shastri, an aspiring Outreachy applicant. This project fascinates me and I'm really excited to contribute to it.
I possess the following skills:

  • I have experience working in Python using Jupyter Notebook.
  • I am a self-motivated learner who desires to learn the relevant skills mentioned.
  • I know how to read and write the following languages: Hindi, Telugu, Marathi, Kannada.

I would love to know about how to start contributing to this project.

Thanks.

Hi!

I'm Trevina Tan and am an Outreachy applicant. I'm highly intrigued by the direction of this project!

I have extensive experience with both Python and Jupyter Notebook, and have an especially high proficiency level with the numpy library as well as the data science library (which some of my own professors were in charge of creating). I've used Jupyter Notebook for purposes anywhere from programming circuits to doing statistical analyses and data visualizations based on given datasets.

I'm also extremely interested in learning qualitative coding! I am currently enrolled in a data science course at UC Berkeley and have worked with the basics of quantifying data from natural language and written texts. I'm interested in learning more about how we can utilize more content analysis techniques to further contribute to an understanding of how content translation can affect the meaning of the original text.

Furthermore, I'm an avid language enthusiast. Other than English, I am proficient in Spanish, verbally fluent in Mandarin Chinese, and proficient in Italian. I'd love to apply my background of different languages to this project.

Hello! This is Nupur Munda. I am an Outreachy applicant. This project grabbed my attention because I have worked with Python in Jupyter notebooks; most of my projects have been done using them. But I am new to open source and haven't contributed before, though I really want to contribute. The only problem is that I don't know how to get started, and proper direction would help a lot.
Please help me get started.
Hoping to work with you in the future

Hi @Isaac ,

Thanks for sharing the micro tasks, I would want to get started by taking up T218004.

Thanks.

Welcome @NuKira @TrevinaTan @AakankshaShastri_1 @Areefa @Israashahin @Nikitrain ! Excited to see the interest! I've added new details to the task description and subtasks. Most pertinently, I generated an example Jupyter notebook that shows how to access information about articles that have been created with the Content Translation tool. We will be using PAWS, which hosts the Jupyter notebooks on our servers, so you won't have to do any intense analyses on your own computers. There are two subtasks -- one where you can ask questions etc. about the qualitative side of this project and another where you can focus on the more quantitative side of this project. There are many types of questions and approaches that you can take, so do not worry about "claiming" a task at this point.

Go through the example I provided and then if you're still interested, create your own notebook and begin to play with the data. You can focus on either qualitative or quantitative aspects right now, though I'd encourage you to mix the two. It's exciting to see the diversity of languages you all know -- choose a pair of languages to focus on (preferably with English so I can be of more assistance when giving feedback) and see what questions you can begin to raise and data you can begin to gather to answer them. I gave some example questions in my notebook, but if in exploring the translations, other questions arise about what content is being translated or what happens to these articles, feel free to explore that as well.

I'll be out Monday so if I don't get back to you immediately, I ask for your patience, but then I'll be around the rest of this week to answer questions etc.

Hi @Isaac ,

I am Sadanand, an Outreachy applicant and got to know about this upcoming project. I am excited for this project as I am a data enthusiast and have recently completed a course in business analytics from Indian School of Business which was focused on data science. I have done various machine learning projects in python using Jupyter Notebooks and have a good understanding of python and anaconda environments along with dockers/containers.

I have also studied social media and web analytics and done related projects which included data collection, data cleaning/wrangling followed by data visualization to understand the underlying patterns and deriving insights using statistical tools such as R/python for modelling using supervised learning techniques such as regression analysis, classification and text analysis/sentiment analysis. I am well acquainted with other languages such as - Hindi and Marathi in addition to English. I am currently exploring seq2seq models/library used by google for language translation and also trying to implement a research paper for automated visualization of data using this seq2seq models.

I look forward to learning, contributing, and collaborating with the Wikimedia team and fellow interns, working on real-life data and solving interesting problems. Thanks!

Hello, @Isaac!

My name is Ekaterina, and I am from St. Petersburg, Russia. I really like this project and the tasks it involves. These are exactly the skills I want to develop. I have some experience using Jupyter Notebook from my studies on Udacity, and I would be super excited to practice in a real project that brings real benefit!

Where should I share the link to my Jupyter notebook after I finish the task on Exploratory qualitative/quantitative analysis? Should I indicate somewhere which task I choose?

Hey @Cherrywins - thanks for the interest! Based on the Outreachy site ( https://www.outreachy.org/apply/ ), if you decide this is a project you want to apply for, make sure you include the public link to your notebook (make sure you can access it from a browser session where you are signed out) in your application. In the meantime, if you have questions or want some feedback, feel free to add a comment to the Phabricator task with the link and a detailed description of what you're stuck with or would like feedback on. I'll do my best to get back to you in a timely manner (or perhaps others will have run into the same issue and can provide assistance).

And at this stage, you do not need to indicate officially whether you're starting with a qualitative or quantitative approach. I'd suggest doing a little of both to start (generating some statistics about the translated articles and also looking at a few specific examples).

Hello, I am Mansi Agrawal, a third-year computer science student at Aligarh Muslim University. I have a modest background in machine learning concepts and have worked on some related projects.
This project is really interesting and aligns with my interests. It would be immensely gratifying for me to contribute, and I look forward to working on the project.

Hello @Isaac, I am Tanupriya Rajput, an Outreachy applicant and a Computer Science student from Bengaluru, India. I have worked on a related project with machine learning. I have also worked on Projects with Python and Jupyter Notebook before. I know English and the Indian language Hindi very well. This project matches my skills and interests perfectly.

Really excited to work with you and fellow enthusiastic members! Heading for the contribution!

Hello @Isaac, my name is Doris Zhou and I am a CS and Statistics student from Montreal, Canada. I've used Jupyter before in a machine learning course. I am fluent in English and Mandarin and have some comprehension of French. Excited to learn and contribute.

Hello @Isaac, this is Megha Varshney, an Outreachy applicant from India. I found this project really interesting. I was previously a Data Science Scholar at Udacity and would like to further expand my skillset.

Looking forward to working on this project.

Isaac triaged this task as Medium priority. Mar 13 2019, 6:03 PM

Hello @Isaac ,
My name is Supida. I am an Outreachy applicant. I have basic Python skills from self-study. I'm interested in data analysis and would like to use quantitative and qualitative methods to build a better understanding of the use of the Content Translation tool for Wikipedia. I'm Thai, so I can look into the data on articles translated from English to Thai. I don't have much experience with Jupyter notebooks, but I'm willing to learn.

Welcome everyone who has joined in the past few days! As you may see from the others, feel free to ask questions and let me know if you're running into challenges with getting started on this research. It's an open-ended task so don't be discouraged!

Hello all
May I ask if it is possible to get the contents of the translated pages and put them in a dataframe to apply the analysis code and extract the information? Is that enough for the contribution?
I found the dump file difficult to deal with and to extract information from.
Thank you.

I have a problem opening JupyterHub. The notebook loaded halfway and stopped, so I closed it and reopened it. Then it shows "Your server is stopping. You will be able to start it again once it has finished stopping." I tried logging out and back in and have been waiting for more than an hour now, but I am still not able to access my Jupyter notebooks.

I am also having the same problem

@Supida_h and @NuKira thanks for alerting me to the JupyterHub issue. I'll continue to monitor and hopefully it clears up soon, but I'll take that into account.

May I ask if it is possible to get the contents of the translated pages and put them in a dataframe to apply the analysis code and extract the information? Is that enough for the contribution?

@Batoulkh12 While that is definitely a good first step, you would also need to come up with and apply some analyses to show you have ideas about how to analyze the translations. Look through the Further Reading in the task description if you're having trouble generating ideas.

@Isaac
Since I still can't access Jupyter notebooks on PAWS, I created a new Jupyter notebook and uploaded it to GitHub. For the final Outreachy submission, can I submit the GitHub link, or do I need to upload it to PAWS (once the problem on PAWS is solved)?

@Supida_h Yes - while I'd prefer that you upload to PAWS and submit that link, if the service is not responding, an open GitHub link would be acceptable as well.

Hey all - considering that PAWS was unreachable for a while and this project was posted later in the cycle, I am going to extend the deadline for working on this until April 2nd. That gives you another two weeks to explore the data and begin to generate questions / analyses that you could build on in a summer project. I'll update Outreachy's website as well.

@Isaac Thank you for the deadline extension. I am able to use Jupyter notebook on PAWS now.

I want to ask about the contribution: what should we put in the Application Information for the Outreachy internship project timeline?

An issue raised by @Muraran : even with the removal of duplicate commas in the .text.json.gz file, there can be a trailing comma at the very end that interferes with proper loading. Here's how you can figure out what's going on when you get these errors and how to fix it:

  1. If you try something like parallel_corpus = json.loads(json_str) and get an error of the form json.decoder.JSONDecodeError: Expecting value: line 1 column 356418517 (char 356418516), then you should see what's going on around character 356418516
  2. If you look at characters in that range (print(json_str[356418400:356418600])), you'll see something like this: 'r John Cobb\'s head, outside the workshops of Thomson & Taylor\\". Brooklands photo archive.\xa0[permanent dead link]"}},]'
  3. You'll notice at the end of that string that there is a comma but nothing after it (just the end of the list)
  4. You can identify which character that comma is through some trial and error (it's actually the character right before where the error occurs: json_str[356418515])
  5. You can remove that comma and load the JSON by slicing json_str right before and after it: parallel_corpus = json.loads(json_str[:356418515] + json_str[356418515+1:])
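The manual slicing in the steps above works for one known offset; a small helper can generalize it by stripping a structural trailing comma before retrying the parse. This is a hedged sketch (the helper name and regex approach are my own, not part of the original instructions), and it assumes the stray comma is structural, i.e. not inside a string value, which matches the dump problem described here:

```python
import json
import re

def load_json_lenient(json_str):
    """Parse JSON, retrying once after removing a trailing comma
    that directly precedes a closing bracket or brace."""
    try:
        return json.loads(json_str)
    except json.JSONDecodeError:
        # Turn e.g. '},]' into '}]'; assumes the offending comma is
        # structural rather than inside a quoted string value.
        cleaned = re.sub(r',\s*([\]}])', r'\1', json_str)
        return json.loads(cleaned)

# A miniature version of the problem described above:
broken = '[{"caption": "Brooklands photo archive"},]'
parallel_corpus = load_json_lenient(broken)
# parallel_corpus == [{"caption": "Brooklands photo archive"}]
```

For the actual corpus file this avoids hunting for the exact character index by hand, at the cost of a second full parse attempt.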

Hey @Isaac, my name is Kripa and I'm an Outreachy applicant. I've worked extensively with Python and have some experience with Jupyter Notebook and nltk. Additionally, I am fluent in English, Malayalam and Hindi and have a reasonably strong background in French. This project looks super interesting to me and I really hope to learn a lot through this experience!

Hello, how can I get all the articles that are translated from English to any other language, whatever it is?
I tried to change the 'to' parameter or delete it, but it keeps giving me the same target language.
Thank you.

@Batoulkh12 It should work when you set the 'to' parameter in parameters. For example, for Hindi, use 'to': 'hi'. If you are not getting the correct output, add your code in a comment so that the problem in your code can be found.

@Mansi29ag Hi, thank you for your reply.
Yes, I edited the parameter, but I don't know what I should put in the 'to' parameter. I tried 'to': 'all' and 'to': '_', but I still get only one language.

'from': 'en',
'to': ' ', (what do I need to put here?)

@Batoulkh12 'to' is the language that the English article gets translated to. As in the example, 'from': 'en', 'to': 'es' finds data translated from English to Spanish.
What you need to put in 'to' is your target language. It's in the Wikipedia URL, or you can find the full list here.

@Supida_h Hi, thank you for your help.
Yes, I know that 'to' is the target language, and I used it ('from': 'en', 'to': 'ar') to get the articles translated from English to Arabic.
But now I need to get all articles that are translated from English to any other language, in order to apply some analysis and find out which target language is most often translated into from English.
I hope you get my point.

@Batoulkh12 I don't know whether you can get this information from the API, but this might be helpful: https://en.wikipedia.org/wiki/Special:ContentTranslationStats

Hello Isaac,

Thank you for your reply. However, JSON is new to me. I am not sure what you mean. I am still fighting with JSON...
Thanks anyway.
Muraran

Kindly, may somebody help me? How can I get the complete information for the translated articles? I need the category, readers, article edits, and the rest of the information.
The imported file gives me only 128461 rows × 6 columns, with these column titles:
id, mt, source, sourceLanguage, target, targetLanguage
These are not enough to analyze and understand the work of Content Translation. Could anyone guide me on how to get all the information?

@Batoulkh12 I don't know whether you can get this information from the API, but this might be helpful: https://en.wikipedia.org/wiki/Special:ContentTranslationStats

Thank you, but I still have the problem of how to get all the translated articles in code.

Hey @Batoulkh12, have you tried just removing the 'to' parameter? It works for me, but it will return a maximum of 500 articles, as cxpublishedtranslations does. Sometimes it will return an empty list because the 'offset' parameter is too high; just try decreasing its value. For all articles translated from Hindi, it works when the offset value is set to 900.
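To sketch how this might look in code: the snippet below builds a query against the MediaWiki API's cxpublishedtranslations list, omitting 'to' to request translations into any target language and paging with 'offset'. This is a hedged sketch, not a definitive implementation: the parameter names follow the thread above, but the assumed response shape ('result' → 'translations') should be verified against the live API.

```python
import json
import urllib.parse
import urllib.request

API_URL = 'https://en.wikipedia.org/w/api.php'

def build_params(source_lang, target_lang=None, limit=500, offset=0):
    """Build parameters for list=cxpublishedtranslations.
    Omit target_lang to request translations into any language."""
    params = {
        'action': 'query',
        'list': 'cxpublishedtranslations',
        'from': source_lang,
        'limit': limit,
        'offset': offset,
        'format': 'json',
    }
    if target_lang is not None:
        params['to'] = target_lang
    return params

def fetch_translations(source_lang, target_lang=None, limit=500, offset=0):
    """Fetch one page of published translations from the API."""
    url = API_URL + '?' + urllib.parse.urlencode(
        build_params(source_lang, target_lang, limit, offset))
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    # The 'result' -> 'translations' path is assumed; check the raw
    # response if this returns an empty list unexpectedly.
    return data.get('result', {}).get('translations', [])

# e.g. fetch_translations('en')        # English -> any target language
#      fetch_translations('en', 'ar')  # English -> Arabic only
```

To go beyond the 500-result cap, you would call fetch_translations repeatedly while increasing offset until an empty page comes back.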

@Isaac For the final Outreachy application, applicants have to fill out a timeline for the internship project. I want to ask what we should put on the timeline. Another question: does this project have community-specific questions that applicants have to answer on the final application?

I asked the same question as in Supida_h's comment from a few days ago, but I haven't received an answer.

Always worth saying: thanks, all, for answering each other's questions and being supportive.

@Supida_h and @Israashahin I actually don't know what the application looks like, so I won't be much help here, but in general I would say do your best to show what you have done so far and how that raises additional questions that you would be interested in trying to answer through continued research. Scoping research to focus on a cohesive set of questions is never easy, and I don't expect you to do a perfect job of that.

@Isaac I see. So do you mean we should create our own timeline (like writing week 1 to week 12 goals) based on our work and the additional questions? Or do you mean we don't necessarily need a timeline, and just writing what we have done so far and what additional questions to answer next is enough?
Thank you.

@Supida_h the latter is sufficient but do your best to keep it to an amount of work that you could reasonably complete during the program.

Can anyone help me with the Markdown code? Code like this is highlighted as I expect in a working Jupyter notebook, but it shows as normal text when viewed from the public link.

Thank you

@Isaac Thanks for providing vivid description and wonderful resources!

@Supida_h

Did you mean that the Markdown style changed after you gave public access?
This is an excellent resource for Jupyter notebooks; please look into it.

@puja_jaji Thank you for the resource. I mean it changes when viewed from the public access link.

I tried to find a solution, but I found that I cannot see highlighted code in other Jupyter notebooks either. For example, in this document https://jupyter.brynmawr.edu/services/public/dblank/Jupyter%20Notebook%20Users%20Manual.ipynb#4.7-Including-Code-Examples under topic 4.7, Including Code Examples, they give the example "The word monospace will appear in a code-like form," but in that document it appears as normal text without highlighting.

Can anyone help me with the application submission? Do we have to submit the final application only on the Outreachy website, or do we have to make a separate proposal for it too?

As far as I know, the final application is submitted only on the Outreachy website.

Yes, I believe the same, because there is no information indicating that we have to make a proposal here on Phabricator.
Submitting the final application to Outreachy is sufficient.

Hi, @Isaac! Could I ask for any kind of feedback on my analysis? It would be very useful to know what I need to pay attention to next time.

Hey @Cherrywins -- yes, I can do that. I'll email you by the end of the week using the email you provided on your application.

In general, thank you everyone for the contributions and applications! I very much enjoyed reading through them and was quite impressed with the quality and breadth. I strongly encourage those who were not accepted to continue to seek out opportunities for research (especially with awesome open-source communities)!

Closing this task as it served as the main task for the Outreachy Summer '19 applications.

Research now ongoing under T223765