Proposal - Research into translation imbalances - Outreachy - 26 - ABHISHEK BHARDWAJ
Closed, ResolvedPublic

Description

Proposal for T328597: [Outreachy round 26] Research into translation imbalances

Name: Abhishek Bhardwaj
Email: abhishek02bhardwaj.er@gmail.com
User Page: Wikimedia Page
Location: Delhi, India
LinkedIn: Abhishek Bhardwaj
Zulip: abhishek02bhardwaj.er@gmail.com (Abhishek Bhardwaj)
Phabricator: Abhishek02bhardwaj
Blog URL: https://medium.com/@abhishek02bhardwaj.er
Time Zone: UTC+5:30
Working Hours: 9 AM to 2 AM (UTC+05:30)
Availability: 29 May, 2023 - 20 July, 2023 - Available full time (8 weeks)
On College Days, occupied between 9 AM to 2 PM (UTC+05:30)

Abstract

Wikipedia is a free, online encyclopedia that provides information on a vast array of topics, from history to technology to pop culture. It is maintained by a global community of volunteer editors who collaborate to create and update its content. With millions of articles available in multiple languages, Wikipedia has become a go-to resource for people seeking knowledge and information on a wide range of subjects.
Wikipedia provides translation services that allow users to access articles in their preferred language. This feature enables content to be available to a wider audience, and it is achieved through the efforts of volunteer translators who work to create and update articles in various languages. The translation services also include machine translation, which uses artificial intelligence to automatically translate articles into different languages. While the quality of machine translation can vary, it has made it easier for users to access information in their native language, regardless of the original language of the article.
One important issue that concerns wiki enthusiasts, however, is the imbalance in translations. Comparing the number of translations made between pairs of languages shows that articles from languages with a larger presence on Wikipedia are translated into languages with a smaller presence at very high rates. English, specifically, is the source language for 70% of all published translations, and this trend is also present for other colonial languages. This project will focus on researching these imbalances and understanding the reasons behind them.

Extension

Apart from the Analysis stream of research, I will also work in the UX research direction, interviewing translators to understand how their perception of language importance influences their language selection. I will also investigate how the software design impacts the selection of languages and the translation workflow.

Mentors

@awight and @Simulo

Experience and Contributions made to the project

Having been a Wikipedia user since the age of 12, I always wondered who was knowledgeable enough to write all of this information by themselves (as a kid I thought Wikipedia was written by one person, like a book). As I grew older and realised that it wasn't an individual but a community who did this, I never thought in my wildest dreams that one day I would be sitting in front of my laptop, capable enough to write a proposal to those people to be a part of their team. This experience is more important to me than all of the knowledge that I have gained participating in this contribution period for the Wikimedia Foundation, so it is already a dream come true for me.
During the past 21 days (from 6th of March to 26th of March) I have learned a lot while working on this project:

  1. In the task #T331199 I summarised a paper and, on the basis of it, gave hypotheses and informed guesses about how it applies to translators.

My Contribution

  2. In the task #T331200 I did a light systematic review of literature that might be relevant to our research.

My Contribution

  3. In the task #T331201 I created a parser from scratch to extract the cxserver configuration and write it to a CSV file.

My Contribution

  4. In the task #T331202 I created a time machine that walks the git history of a data repository and analyses the data at each commit. The information obtained is then stored in memory, along with the timestamp of the git commit, forming a complete sequence.

My Contribution

  5. In the task #T331204 I plotted flow diagrams illustrating translation imbalances.

My Contribution

  6. In the task #T331207 I learned how to compose a survey. In this task I drafted a survey for Content Translation software users, investigating how the software is used and how the languages are chosen.

My Contribution

  7. In the task #T332643 we had to integrate the configuration scraper that we built in task #T331201 with the time machine built in task #T331202, running the configuration scraper on every git commit of the cxserver source repository.

My Contribution

  8. In the task #T332647 I compared the API results to the output of the scraper I built in the task #T331201. The scraper's output matched the API results 100%. As an extension to the task I also compared other contributors' output and recorded their match percentage.

My Contribution
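
The comparison in the last item can be illustrated with a minimal sketch. It is not the code submitted for #T332647. It assumes that both the scraper output and the API results can be reduced to (source language, target language, engine) tuples, and it defines "match percentage" as the share of entries common to both sets out of all entries seen in either. Both the tuple shape and this definition are assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

# Hypothetical record shape: (source language, target language, engine).
Entry = Tuple[str, str, str]


def match_percentage(scraped: Iterable[Entry], reference: Iterable[Entry]) -> float:
    """Percentage of entries shared by both sets, out of all entries in either set."""
    scraped_set: Set[Entry] = set(scraped)
    reference_set: Set[Entry] = set(reference)
    union = scraped_set | reference_set
    if not union:
        # Two empty outputs agree trivially.
        return 100.0
    return 100.0 * len(scraped_set & reference_set) / len(union)


if __name__ == "__main__":
    api = [("en", "hi", "Google"), ("en", "es", "Apertium")]
    mine = [("en", "hi", "Google"), ("en", "es", "Apertium")]
    print(f"{match_percentage(mine, api):.1f}%")
```

Using the union as the denominator means that both missing and spurious entries lower the score, so 100% is only reached when the two outputs are identical.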

Past Experience with Open Source Software

This is my first time contributing to open source, but I have been an active open-source user for the past 10 years. From using VLC Media Player to watch videos to using the Android operating system on smartphones, open-source software has been an integral part of the technology in my life. I started to learn coding on Dev-C++, which is a free open-source IDE for Windows. Then, when I expanded my horizons and learned more programming languages, I switched to an IDE that is compatible with multiple languages, Visual Studio Code, which is again built on open source. I use Firefox to browse the internet, which provides phishing and malware protection. I use WordPress to develop websites that are well designed and function very well. I use PHP to manage dynamic content and session tracking. MySQL is my favourite RDBMS for managing databases. In this way, whether for entertainment, learning or any other purpose, open source has helped me a lot by providing excellent tools.

About the Project

Among the proposed streams, I am most interested in UX research and Analysis, as my past projects involved product development, in which both are integral. Through the UX research I aim to dig deeper into the thought process of translators when they select languages for translation, and into the factors that motivate and prompt them to choose a language. Can these factors be countered from the project's end, or are there other factors that need to be addressed with a better approach? Through the Analysis stream, I will analyse the possible technical reasons for these imbalances and try to find ways to counter them.
I have divided my timeline into 5 phases, each phase addresses one important aspect of research into the translation imbalances.

Timeline

Pre-Selection Period

April 4th - May 3rd

  • Study the code responsible for setting the default languages in CXDashboard.findValidDefaultLanguagePair.
  • Carry task #T331204 forward by producing further flow diagrams that illustrate translation imbalances.
  • Gather information about the two algorithms at play: one chooses the pair of source and target languages between which to translate, and the other chooses which articles to show for translation.

May 4th

  • Celebrations

Community Bonding Period

May 5th - May 28th

  • Refine the survey developed in task #T331207 with the help of mentors, WMF Language Engineering Team and the already provided contributions on the task.
  • Send out the survey to translators (contingent on discussion with WMF Language Engineering team).
  • Find information about potential candidates for interview.
  • Prepare and discuss a questionnaire with the mentors and WMF Language Engineering Team that will be used for interviewing.
  • Contact the potential candidates and ask for a suitable date and time for their interview.
  • Review the papers that were listed in the task #T331200 ultralight systematic review and try to observe if there is anything we can relate to or use in our research.

Contribution Period Begins

May 29th - June 2nd
WEEK 1

  • Review the papers that were listed in the task #T331200 ultralight systematic review and try to observe if there is anything we can relate to or use in our research.

June 5
FEEDBACK #1

Phase I Testing of the Possible Hypothesis

June 5th - June 9th + June 12th - June 16th
WEEK 2 + WEEK 3

June 19th - June 23rd + June 26th - June 30th
WEEK 4 + WEEK 5

  • Hypothesis 2 - When two languages are more closely related geographically or historically, translations are more likely to occur. So languages which are more linked to other languages geographically or historically will see more translations.

July 3rd
FEEDBACK #2

Phase II Interview and Survey Documentation

July 3rd - July 7th + July 10th - July 14th
WEEK 6 + WEEK 7

  • Conduct interviews with translators to gain an understanding of how their perception of language importance influences their language selection. Also, investigate how software design impacts the selection of languages and the translation workflow.
  • Analyse the responses from the survey and deduce important observations and document them appropriately.

July 17th - July 21st
WEEK 8

  • Document the interviews and the findings into a structured blog that can be used for further reference in the research work.

July 24th
FEEDBACK #3

Phase III Analysis and Improvement of Algorithm

July 24th - July 28th + July 31st - August 4th
WEEK 9 + WEEK 10

  • Analyse the algorithm used for suggesting articles for translation. Look for any potential bias and, if any is found, devise a way to remove it.

Phase IV Build a Quantitative View on the Issue

August 7th - August 11th
WEEK 11

  • Complete the Configuration Time Machine work with the required documentation.
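
The idea behind the Configuration Time Machine (from tasks #T331202 and #T332643) can be sketched as follows. This is a minimal illustration rather than the submitted code: it assumes a local clone of the repository, a single tracked file of interest, and a caller-supplied `analyze` function. The names `parse_log`, `run_git` and `history` are hypothetical. It relies only on `git log --format` with the `%H` and `%cI` placeholders and on `git show <commit>:<path>`.

```python
import subprocess
from datetime import datetime
from typing import Callable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def parse_log(output: str) -> List[Tuple[str, datetime]]:
    """Parse lines of the form '<hash> <ISO 8601 timestamp>' into (hash, datetime) pairs."""
    commits = []
    for line in output.splitlines():
        if not line.strip():
            continue
        commit_hash, timestamp = line.split(" ", 1)
        commits.append((commit_hash, datetime.fromisoformat(timestamp.strip())))
    return commits


def run_git(repo: str, *args: str) -> str:
    """Run a git command inside `repo` and return its standard output."""
    result = subprocess.run(
        ["git", "-C", repo, *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def history(repo: str, path: str, analyze: Callable[[str], T]) -> Iterator[Tuple[datetime, T]]:
    """Yield (commit time, analyze(file content)) for each commit touching `path`, oldest first."""
    log = run_git(repo, "log", "--reverse", "--format=%H %cI", "--", path)
    for commit_hash, when in parse_log(log):
        try:
            content = run_git(repo, "show", f"{commit_hash}:{path}")
        except subprocess.CalledProcessError:
            # The file does not exist at this commit (for example, it was deleted).
            continue
        yield when, analyze(content)
```

For example, `list(history("path/to/clone", "some/config.yaml", len))` would give the size of the file at every commit that touched it. The repository path and file name here are placeholders.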

August 14th - August 18th
WEEK 12

  • Figure out the discontinuities in the machine translation and check for correlations with the step changes in the published translations.
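
One simple way to check whether a known configuration change lines up with a step change in published translations is sketched below. It assumes the translation counts are available as an evenly spaced series (for example, one value per month) and that each configuration change has already been mapped to an index in that series. The window size, the ratio threshold and the function names are illustrative assumptions, not part of the project.

```python
from statistics import mean
from typing import List, Optional, Sequence


def step_ratio(counts: Sequence[float], index: int, window: int = 3) -> Optional[float]:
    """Ratio of the mean count in the `window` periods from `index` onward
    to the mean count in the `window` periods just before `index`."""
    before = counts[max(0, index - window):index]
    after = counts[index:index + window]
    if not before or not after:
        return None
    base = mean(before)
    if base == 0:
        return None
    return mean(after) / base


def changes_with_steps(
    counts: Sequence[float],
    change_indices: Sequence[int],
    window: int = 3,
    min_ratio: float = 1.5,
) -> List[int]:
    """Return the change indices at which the series rises or falls by at least `min_ratio`."""
    flagged = []
    for index in change_indices:
        ratio = step_ratio(counts, index, window)
        if ratio is not None and (ratio >= min_ratio or ratio <= 1 / min_ratio):
            flagged.append(index)
    return flagged
```

A flagged index only shows that a step coincides with a configuration change. Ruling out other causes, such as editing campaigns or seasonal effects, would still need a closer look at the data.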

August 21st
FEEDBACK #4

Phase V Conclusion

August 21st - August 25th
WEEK 13

  • Conclude the research work, prepare a report of the findings, and publish the raw data links, code, graphs and exceptions.
  • Write a blog article containing a concise report of all the work done that can be used to carry the research forward.

August 28th and Later

  • Celebrations.
  • Continue code-based contributions to Wikimedia.
  • Be an active member of the Wikimedia community and start exploring other communities to work with.
  • Actively maintain the code and documentation and guide beginners who are interested in contributing.

Stretch Goals

  • Test the hypothesis that some wikis are more challenging to contribute to than others (perhaps because of cultural or stylistic differences between wikis), either through survey questions or by looking at translator activity over time.

Other Deliverables during the Internship

  • Weekly Blog posts on my internship progress/experience.
  • Blog posts about my experience with the open source community and the Wikimedia Foundation.
  • Regular communication with my mentors and other members at the Wikimedia Community.

About Me

I am a sophomore pursuing a Bachelor of Technology in Information Technology and Mathematical Innovation at the Cluster Innovation Centre, University of Delhi. I am currently in the 4th semester of the 8-semester program and will graduate in May 2025. I am an active member of the coding society of our college, where we have built an ecosystem of peer learning and have also helped organise numerous workshops and technical fests.

Past Projects

  1. DHAMNI:
    • It is a web-development project which aims to provide a platform for blood donors and recipients to share information.
    • Donors can upload their details, which can then be searched by people who need a blood donation.
  2. Curio:
    • It is also a web-development project, currently in development, which aims to solve the language gap (in available audio) on YouTube.
    • Educational content is difficult to understand through subtitles, which makes dubs and audio translations necessary.
    • Through this platform, users can record their own dub of a YouTube video and upload it, so that other users can watch the video in that language.
  3. Facial Recognition Software using Principal Component Analysis:
    • Developed MATLAB software capable of identifying and retrieving an image if it exists in its training database.
  4. Maze Solving Algorithm Comparison:
    • Compared Dijkstra's and A* algorithms, along with their time and space complexity.

How did I learn about Outreachy?

Our college has a really good culture of contributing to open source, and open source is discussed around our campus almost all of the time. This culture sparked my initial interest in open-source software, and I started learning about it. As I explored the communities further, I realised what kind of impact they have on people's lives, which further motivated me to be a part of this community and start contributing. I came to know about Outreachy from a college senior who interned through Outreachy and contributed to Inkscape in 2021. She told me about the program and about how it supports diversity in free and open source software, which caught my interest, and I started to work in this direction.

Event Timeline

Hello, @awight, @srishakatux and @Simulo. I have drafted a proposal for the project Research into translation imbalances. I wanted to ask for feedback or a review if possible. Please have a look and let me know about any changes you would like to suggest.

Thank you for the proposal!

I don't think we can accomplish everything listed here, although you're welcome to try and we would support you. It's more realistic to test one or two hypotheses in this time frame. Similarly, I'm not even sure we can send out a survey, this would be the domain of the team actively building Content Translation (which I'm not a part of), so perhaps mark that as "contingent on discussion with WMF Language Engineering". Here are links to the project and team, https://www.mediawiki.org/wiki/Content_translation and https://www.mediawiki.org/wiki/Wikimedia_Language_engineering .

I suggest picking the one or two hypotheses you're most interested in testing.

So far, we have a very provisional agreement to run a software experiment. It would be up to us to design something unobtrusive and write a proposal explaining what we hope to learn, what we see as the risks, etc.

Thank you for the proposal!
Similarly, I'm not even sure we can send out a study, this would be the domain of the team actively building Content Translation (which I'm not a part of), so perhaps mark that as "contingent on discussion with WMF Language Engineering". Here are links to the project and team, https://www.mediawiki.org/wiki/Content_translation and https://www.mediawiki.org/wiki/Wikimedia_Language_engineering

@awight, Thanks for the review. Sure, I will choose the two hypotheses of my interest and stick to testing them. Also pardon me but I couldn't understand what did you mean by sending out a study, I would be very thankful if you could elaborate that.

@awight and @Simulo I made a few changes to the timeline. I will be looking forward to your review.

[...] what did you mean by sending out a study, I would be very thankful if you could elaborate that.

Wow, I meant to write "send out a survey", thank you for asking! I'll edit my comment above.

If we want to send out a survey to translators, the team actively developing would be one of the most important stakeholders, there might be an assumption on the users' part that they are involved, users can't be asked too often to fill out a survey so our question needs to be balanced against the team's other needs, and so on.

Ebere_O updated the task description.
Ebere_O updated the task description.
Ebere_O subscribed.
Abhishek02bhardwaj renamed this task from Proposal - Research into translation imbalances - Outreachy - 26 to Proposal - Research into translation imbalances - Outreachy - 26 - ABHISHEK BHARDWAJ. Apr 3 2023, 2:58 PM

@awight and @Simulo Hi, I hope you are doing great. Firstly, thank you for showing trust in me; I will give my best. As @awight has already seen my proposal for review, I wanted to ask how I should refine it and whether we could discuss a more specific timeline for the internship. Of the tasks that I mentioned in my pre-contribution period, I am a bit slower than I expected in reviewing the papers, since I am having my semester-end exams, but I'll cover as many papers as I can before the beginning. Also, it would be a great help if you could help me connect with Kavitha A, the other mentor for the project.

Hi! Please consider resolving this task and moving any pending items to a new task, as GSoC/Outreachy rounds are now over, and this workboard will soon be archived.