Proposal for T328597: [Outreachy round 26] Research into translation imbalances
Name: Abhishek Bhardwaj
Email: abhishek02bhardwaj.er@gmail.com
User Page: Wikimedia Page
Location: Delhi, India
LinkedIn: Abhishek Bhardwaj
Zulip: abhishek02bhardwaj.er@gmail.com (Abhishek Bhardwaj)
Phabricator: Abhishek02bhardwaj
Blog URL: https://medium.com/@abhishek02bhardwaj.er
Time Zone: UTC+5:30
Working Hours: 9 AM to 2 AM (UTC+05:30)
Availability: 29 May, 2023 - 20 July, 2023 - Available full time (8 weeks)
On college days, occupied between 9 AM and 2 PM (UTC+05:30)
Abstract
Wikipedia is a free, online encyclopedia that provides information on a vast array of topics, from history to technology to pop culture. It is maintained by a global community of volunteer editors who collaborate to create and update its content. With millions of articles available in multiple languages, Wikipedia has become a go-to resource for people seeking knowledge and information on a wide range of subjects.
Wikipedia provides translation services that allow users to access articles in their preferred language. This feature enables content to be available to a wider audience, and it is achieved through the efforts of volunteer translators who work to create and update articles in various languages. The translation services also include machine translation, which uses artificial intelligence to automatically translate articles into different languages. While the quality of machine translation can vary, it has made it easier for users to access information in their native language, regardless of the original language of the article.
But one important issue that concerns Wikipedia enthusiasts is the imbalance in translations. Comparing the number of translations made between pairs of languages shows that articles from languages with a larger presence on Wikipedia are being translated into languages with a smaller presence at very high rates. English, specifically, is the source language for 70% of all published translations, and this trend is also present for other colonial languages. This project will focus on researching these imbalances and understanding the reasons behind them.
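A minimal sketch of how such an imbalance could be quantified is given below. It computes each source language's share of all translations and the ratio of translations flowing in each direction between two languages. The counts are made up, and the dictionary layout keyed by (source, target) pairs, as well as the function names, are assumptions for illustration rather than the format of any Wikimedia dataset.

```python
from collections import Counter


def source_share(translations):
    """Return each source language's share of all translations (0.0 to 1.0)."""
    totals = Counter()
    for (source, _target), count in translations.items():
        totals[source] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {lang: n / grand_total for lang, n in totals.items()}


def pair_imbalance(translations, a, b):
    """Ratio of a->b translations to b->a translations (inf if none flow back)."""
    forward = translations.get((a, b), 0)
    backward = translations.get((b, a), 0)
    return float("inf") if backward == 0 else forward / backward


# Made-up counts, for illustration only.
example = {("en", "hi"): 70, ("hi", "en"): 10, ("fr", "hi"): 20}
shares = source_share(example)
en_hi_ratio = pair_imbalance(example, "en", "hi")
```

A ratio far from 1 for a language pair indicates that translations flow mostly in one direction.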
Extension
Apart from the Analysis stream of research, I will also work in the UX research direction, interviewing translators to gain an understanding of how their perception of language importance influences their language selection. I will also try to investigate how the software design impacts the selection of languages and the translation workflow.
Mentors
Experience and Contributions made to the project
Having been a Wikipedia user since the age of 12, I always wondered who was knowledgeable enough to write all of this information by themselves (as a kid I thought Wikipedia was written by one person, like a book). As I grew older and realised it wasn't an individual but a community who did this, I never thought in my wildest dreams that one day I would be sitting in front of my laptop, capable enough to write a proposal to those people to be a part of their team. This experience is more important to me than all of the knowledge that I have gained during this contribution period for the Wikimedia Foundation. So it is already a dream come true for me.
During the past 21 days (from the 6th of March to the 26th of March) I have learned a lot while working on this project:
- In task #T331199, I summarised a paper and, on the basis of it, gave hypotheses and informed guesses about how it applies to translators.
- In task #T331200, I did a light systematic review of literature that might be relevant to our research.
- In task #T331201, I created a parser from scratch to extract the cxserver configuration and export it to a CSV file.
- In task #T331202, I created a time machine that accesses the git history of a data repository and analyses the data at each commit. The information obtained is then stored in memory, along with the timestamp of the git commit, forming a complete sequence.
- In task #T331204, I plotted flow diagrams illustrating translation imbalances.
- In task #T331207, I learned how to compose a survey. In this task I drafted a survey for Content Translation software users, investigating how the software is used and how the languages are chosen.
- In task #T332643, we integrated the configuration scraper built in task #T331201 with the time machine built in task #T331202, running the configuration scraper on every git commit of the cxserver source repository.
- In task #T332647, I compared the API results to the output of the scraper I built in task #T331201. The scraper's output matched the API results 100%. As an extension to the task, I also compared other contributors' outputs and recorded their match percentages.
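The comparison in task #T332647 can be sketched roughly as follows. This is not the code I submitted for the task; it is a minimal illustration that assumes each configuration entry is represented as a hypothetical (source, target, engine) tuple, and the sample records are made up.

```python
def match_percentage(scraped, reference):
    """Percentage of reference records that also appear in the scraped output."""
    reference = set(reference)
    if not reference:
        return 100.0
    return 100.0 * len(reference & set(scraped)) / len(reference)


def differences(scraped, reference):
    """List records missing from, or extra in, the scraped output."""
    scraped, reference = set(scraped), set(reference)
    return {
        "missing": sorted(reference - scraped),
        "extra": sorted(scraped - reference),
    }


# Made-up sample records: (source language, target language, engine).
api_records = {("en", "hi", "Google"), ("en", "fr", "Apertium")}
scraped_records = {("en", "hi", "Google")}

score = match_percentage(scraped_records, api_records)
diff = differences(scraped_records, api_records)
```

Listing the missing and extra records alongside the percentage makes it easier to see why two outputs disagree.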
Past Experience with Open Source Software
This is my first time contributing to open source, but I have been an active open-source user for the past 10 years. From using VLC Media Player to watch videos to using the Android operating system on smartphones, open-source software has been an integral part of the technology in my life. I started to learn coding on Dev-C++, which is a free, open-source IDE for Windows. Then, when I expanded my horizons and learned more programming languages, I switched to an IDE that is compatible with multiple languages, Visual Studio Code, which is again built on open source. I use Firefox to browse the internet, which provides phishing and malware protection. I use WordPress to develop websites that are well designed and function very well. I use PHP to manage dynamic content and session tracking. MySQL is my favourite RDBMS for managing databases. In this way, whether for entertainment, learning or any other purpose, open source has helped me a lot by providing excellent tools.
About the Project
Among the proposed streams, I am most interested in UX research and Analysis, as my past projects involved product development, in which UX research and analysis are integral parts. Through the UX research, I aim to dig deeper into the thought process of translators while they select languages for translation, and into the factors that motivate and prompt them to choose a language. Can these factors be countered from the project's end, or are there other factors that need to be addressed with a better approach? Through the Analysis stream, I will analyse the possible technical reasons for these imbalances and try to find a way to counter them.
I have divided my timeline into 5 phases; each phase addresses one important aspect of research into the translation imbalances.
Timeline
Pre-Selection Period
April 4th - May 3rd
- Study the code responsible for setting the default languages from CXDashboard.findValidDefaultLanguagePair.
- Carry task #T331204 forward, producing further flow diagrams that illustrate translation imbalances.
- Gather information about the two algorithms at play: one chooses the pair of source and target languages between which to translate, and the other chooses which articles to show for translation.
May 4th
- Celebrations
Community Bonding Period
May 5th - May 28th
- Refine the survey developed in task #T331207 with the help of the mentors, the WMF Language Engineering Team and the contributions already provided on the task.
- Send out the survey to translators (contingent on discussion with the WMF Language Engineering Team).
- Find information about potential candidates for interview.
- Prepare and discuss a questionnaire with the mentors and WMF Language Engineering Team that will be used for interviewing.
- Contact the potential candidates and ask for a suitable date and time for their interview.
- Review the papers that were listed in the task #T331200 ultralight systematic review and try to observe if there is anything we can relate to or use in our research.
Contribution Period Begins
May 29th - June 2nd
WEEK 1
- Review the papers that were listed in the task #T331200 ultralight systematic review and try to observe if there is anything we can relate to or use in our research.
June 5th
FEEDBACK #1
Phase I Testing of the Possible Hypothesis
June 5th - June 9th + June 12th - June 16th
WEEK 2 + WEEK 3
- Hypothesis 1 – The distribution of translators around the world will be significantly different from that of the editors and contributors. Analyse the data of translators on the basis of location, in a way similar to what was done in the paper Graham, Straumann, Hogan. 2015. “Digital Divisions of Labor and Informational Magnetism: Mapping Participation in Wikipedia.”
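One simple way to get a first number for Hypothesis 1 is sketched below: the total variation distance between two location distributions, which is 0 when they are identical and 1 when they do not overlap at all. This measure is my own choice for illustration and is not the method of the cited paper; the country counts are made up, and a proper significance test (for example a chi-square test) would still be needed on real data.

```python
def normalise(counts):
    """Turn raw counts per country into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must contain at least one observation")
    return {country: n / total for country, n in counts.items()}


def total_variation_distance(p_counts, q_counts):
    """Half the sum of absolute differences between the two distributions."""
    p, q = normalise(p_counts), normalise(q_counts)
    countries = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in countries)


# Made-up counts per country, for illustration only.
translators = {"IN": 50, "US": 50}
editors = {"IN": 25, "US": 75}

distance = total_variation_distance(translators, editors)
```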
June 19th - June 23rd + June 26th - June 30th
WEEK 4 + WEEK 5
- Hypothesis 2 – When two languages are more closely related geographically or historically, translations between them are more likely to occur. So languages that are more linked to other languages geographically or historically will see more translations.
July 3rd
FEEDBACK #2
Phase II Interview and Survey Documentation
July 3rd - July 7th + July 10th - July 14th
WEEK 6 + WEEK 7
- Conduct interviews with translators to gain an understanding of how their perception of language importance influences their language selection. Also, investigate how software design impacts the selection of languages and the translation workflow.
- Analyse the responses from the survey and deduce important observations and document them appropriately.
July 17th - July 21st
WEEK 8
- Document the interviews and the findings into a structured blog that can be used for further reference in the research work.
July 24th
FEEDBACK #3
Phase III Analysis and Improvement of Algorithm
July 24th - July 28th + July 31st - August 4th
WEEK 9 + WEEK 10
- Analyse and research the algorithm used for suggesting articles for translation. Look for any potential bias and, if any is found, devise a way to remove it.
Phase IV Build a Quantitative View on the Issue
August 7th - August 11th
WEEK 11
- Complete the Configuration Time Machine work with the required documentation.
August 14th - August 18th
WEEK 12
- Figure out the discontinuities in the machine translation and check for correlations with the step changes in the published translations.
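A minimal sketch of this check is given below. It assumes the published translations are available as a list of counts per period (for example per week) and that the dates of machine translation configuration changes have already been mapped to indices in that list; both the data layout and the threshold are assumptions for illustration, and the numbers are made up.

```python
def step_changes(series, threshold):
    """Indices i where the jump from series[i - 1] to series[i] is at least threshold."""
    return [
        i
        for i in range(1, len(series))
        if abs(series[i] - series[i - 1]) >= threshold
    ]


def coinciding_steps(steps, config_changes, window=1):
    """Steps that fall within `window` periods of any configuration change."""
    return [
        s for s in steps
        if any(abs(s - c) <= window for c in config_changes)
    ]


# Made-up weekly counts of published translations for one language pair.
weekly_counts = [10, 11, 10, 30, 31, 29, 5]
# Hypothetical index of the week in which the configuration changed.
config_change_weeks = [3]

steps = step_changes(weekly_counts, threshold=10)
matched = coinciding_steps(steps, config_change_weeks)
```

Steps that do not coincide with any configuration change (here the drop in the last week) would need a different explanation.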
August 21st
FEEDBACK #4
Phase V Conclusion
August 21st - August 25th
WEEK 13
- Conclude the research work, prepare a report of the findings, and publish the raw data links, code, graphs and exceptions.
- Write a blog article containing a concise report of all the work done that can be used to carry the research forward.
August 28th and Later
- Celebrations.
- Continue code-based contributions to Wikimedia.
- Be an active member of the Wikimedia community and start exploring other communities to work with.
- Actively maintain the code and documentation and guide beginners who are interested in contributing.
Stretch Goals
- Test the hypothesis that some wikis are more challenging to contribute to than others (maybe because of the cultural or stylistic differences between wikis), either through survey questions or by looking at translator activity over time.
Other Deliverables during the Internship
- Weekly Blog posts on my internship progress/experience.
- Blog posts about my experience with the open source community and the Wikimedia Foundation.
- Regular communication with my mentors and other members at the Wikimedia Community.
About Me
I am a sophomore, pursuing a Bachelor of Technology degree in Information Technology and Mathematical Innovation at the Cluster Innovation Centre, University of Delhi. I am currently in the 4th semester of the 8-semester program and will graduate in May 2025. I am an active member of the coding society of our college, where we have built an ecosystem of peer learning and have also helped organise numerous workshops and technical fests.
Past Projects
- DHAMNI:
  - It is a web-development project which aims to provide a platform for blood donors and recipients to share information.
  - Donors can upload their details, which can be searched by people who are in need of a blood donation.
- Curio:
  - It is also a web-development project, currently in development, which aims to solve the language gap (in available audio) problem of YouTube.
  - Educational content is difficult to understand through subtitles, which is why dubs and audio translations become a need.
  - Through this platform, users can record their dub of a YouTube video and upload it, so that it can be viewed by other users who wish to watch the video in that language.
- Facial Recognition Software using Principal Component Analysis:
  - Developed MATLAB software capable of identifying and retrieving an image if it exists in its training database.
- Maze Solving Algorithm Comparison:
  - Compared Dijkstra's and A* algorithms along with their time and space complexity.
How did I learn about Outreachy?
In our college, we have a really good culture of contributing to open source, and open source is discussed around our campus almost all of the time. This culture sparked my initial interest in open-source software and I started learning about it. As I explored the communities further, I realised what kind of impact they have on people's lives, which further motivated me to be a part of this community and start contributing. I came to know about Outreachy from a college senior, who interned through Outreachy and contributed to Inkscape in 2021. She told me about the program and about how it supports diversity in free and open-source software, which caught my interest, and I started to work in this direction.