
Outreachy Proposal: Anshika Bhatt
Closed, Declined (Public)


Proposal for the project: Research into translation imbalances.

Note: Here I have researched and collected some resources that I believe can be helpful for the project.
Link to the resources:

Name: Anshika Bhatt
User page: Anshika_bhatt_20
Zulip: anshika bhatt
Location: Indore, India
Time zone: UTC+05:30
Typical working hours: can be flexible


Wikipedia is a free, multilingual online encyclopedia that allows users to create and edit articles. It was launched in 2001 and has since become one of the most widely used sources of information on the internet. Wikipedia articles are collaboratively written and maintained by volunteers from around the world. The content is licensed under a Creative Commons Attribution-ShareAlike license, which means that it can be reused and modified by anyone.

However, one issue with Wikipedia is the translation imbalance. The English edition has far more articles than most other language editions, which creates a significant language gap: speakers of less dominant languages have limited access to information in their native language. The imbalance results from various factors, including differences in language proficiency, access to technology, and cultural barriers.

Efforts have been made to address the translation imbalance, including tools and initiatives that encourage volunteers to contribute to non-English language editions of the site. However, much work remains before Wikipedia is truly multilingual and accessible to people from all language backgrounds. This project will research the matter further to better understand both the problems and the possible solutions.


My primary focus for this project on translation imbalances will be on utilizing data analysis and user research to understand the root causes of translation imbalances and to develop effective strategies for addressing them. I plan to use data analysis techniques to identify patterns and trends in the distribution of Wikipedia articles across languages, as well as track changes in article coverage over time. Additionally, I will conduct user research to gain insight into the needs and preferences of Wikipedia users from diverse language backgrounds, and to identify any barriers to participation in non-English language editions of the site.

It's worth noting that some obstacles affect translation in only one direction. For instance, English Wikipedia has disabled machine translation into English, which could help explain why so many translations land in non-English language editions but not the other way around. Understanding these types of challenges can help inform strategies for improving multilingualism and accessibility on the platform. Ultimately, I hope to develop evidence-based recommendations that address these challenges and make use of the opportunities that translation offers for reducing imbalances in Wikipedia.


@awight and @Simulo

Past Experience and Contributions Made to the Project:

I am a longtime user of Wikipedia, but I had never contributed to it before beginning this application. Wanting to start my open-source journey, I decided to apply for the Outreachy program and contribute to Wikipedia. For as long as I can remember, the site has been my go-to source for knowledge and inspiration, a virtual library that never closes and never runs out of books. I visit it several times a week to look up topics ranging from history and science to current events and pop culture, like a curious traveler wandering through a vast and fascinating landscape. I appreciate the platform for its breadth of knowledge and the convenience of having all this information in one place, and I am grateful and excited to have the chance to give back to this community.

With my skills in data analysis and user research, combined with my passion for making knowledge accessible to all, I am confident that I can help make Wikipedia an even more inclusive and accessible resource. I am eager to work alongside the Wikipedia community to address translation imbalances for users from diverse backgrounds.

During the contribution period, I had the opportunity to make a few contributions to Wikipedia. The feedback and guidance I received from my mentors on those contributions helped me learn and grow, and I gained valuable experience in working with others toward a shared goal.

My initial contributions to my project of interest, Research into translation imbalances:
• In task #T331199, I wrote a summary of the paper and provided informed hypotheses and guesses on how the insights from the paper could be relevant to the work of translators. The paper explores the ways in which participation in Wikipedia is influenced by digital divisions of labor and the informational magnetism of certain topics. My summary then discusses three hypotheses related to the distribution of translation work in different language versions of Wikipedia, based on the findings of the paper. Overall, the summary highlights the potential implications of the paper's findings for translators working on Wikipedia.

• In task #T331200, I conducted a light systematic literature review on four academic articles related to Wikipedia translations and cultural biases in Wikipedia content.

• In task #T331201, I wrote a Python script that extracts the supported language pairs from CXServer configuration files and exports them to a CSV file. The script efficiently extracts configuration files from the CXServer repository and creates a well-structured, in-memory representation of all supported language pairs.

• In task #T331202, I created a Git repository to track changes to a CSV file containing city names and temperatures, with each commit representing a snapshot of the data at a specific time. I used the my_parser file to read the data into memory and the timemachine.rb file to play back the Git history and store the entire sequence with timestamps for each commit.

• In task #T331204, I used the public Content Translation data source and wrote Python code to refine and extend a Sankey diagram. I also created scatterplot diagrams based on the ratio of translations between different languages, and attempted to estimate translation ratios against Wikipedia article count.

• In task #T331207, I contributed to a survey about the Content Translation tool, aimed at improving its quality by gathering feedback from users on their language selection process, satisfaction with translations, and challenges encountered while using the tool.

• In task #T332647, I evaluated the accuracy of a scraper I built by comparing its output to API results and found that it had a perfect match. In addition, I extended the task by comparing the output of other contributors and recording their match percentages.
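
The proposal does not say how the match percentages in #T332647 were computed. A minimal sketch of one possible scoring, assuming both the scraper output and the API result can be reduced to sets of (source, target) language pairs: the score is the size of the intersection divided by the size of the union, so it only reaches 100% when the two sets are identical. The function name and the sample pairs are hypothetical.

```python
def match_percentage(scraped_pairs, api_pairs):
    """Percentage overlap between two collections of (source, target) pairs.

    Computed as |intersection| / |union| * 100, so missing pairs and
    extra pairs both lower the score. This metric is an assumption, not
    necessarily the one used in the original task.
    """
    scraped = set(scraped_pairs)
    reference = set(api_pairs)
    union = scraped | reference
    if not union:
        raise ValueError("both inputs are empty; nothing to compare")
    return 100.0 * len(scraped & reference) / len(union)


if __name__ == "__main__":
    # Illustrative pairs only, not real CXServer data.
    api = [("en", "es"), ("es", "en"), ("en", "fr")]
    mine = [("es", "en"), ("en", "es"), ("en", "fr")]
    other = [("en", "es"), ("en", "fr")]
    print(f"my scraper: {match_percentage(mine, api):.1f}%")
    print(f"other scraper: {match_percentage(other, api):.1f}%")
```

Using sets makes the comparison independent of the order in which pairs are emitted, which matters when different scrapers walk the configuration files differently.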

Past experience with other software

Prior to this application, I did not have any experience contributing to open-source projects. However, I have always been interested in learning from and contributing to the community. I am a daily user of multiple FOSS projects, including Git, Firefox, Python, Linux, and Ubuntu, and using them has shown me how much these projects make possible. Git has been particularly useful for version control, and Python has been my go-to programming language thanks to its extensive library support and ease of use. I am currently learning Linux and Ubuntu, which I use as my operating system. Having gained a lot from these projects and their communities, I am eager to give back by contributing to them.

Project timeline

The following timeline is a detailed outline of the work I plan to do to prepare for, carry out, and wrap up my internship. This is based on my current understanding of the project and may need to be revised as I become more familiar with the project.
As noted below, revising the timeline is a goal for the beginning of the internship, and it may need to be further revised as the internship progresses.

Pre-selection period (April 4 - May 3)

• I will continue exploring the task of producing flow diagrams illustrating translation imbalances #T331204. Building on my initial work, I plan to experiment with various diagrams and APIs to uncover intriguing imbalances between languages. My mentor has provided me with a workaround involving running two APIs: first, the site matrix listing of all wikis, then filtering to just Wikipedias, and then running an additional site info API call on each of those sites. Although this approach can be cumbersome, I am willing to give it a try to gain a better understanding of the task.
• In addition to this workaround, there is a CSV available in the repository that can be used to streamline the process. I plan to use both the APIs and the CSV to further explore the task and uncover intriguing imbalances between languages.
• Furthermore, I believe that this project has the potential to benefit a wider audience beyond the immediate team. Therefore, I am excited about the possibility of building a publicly-accessible resource, such as a Jupyter notebook, that would allow readers to explore the data and insights on their own. I am committed to continuing to learn and grow throughout this project, and I look forward to collaborating with my mentor and the rest of the team to produce high-quality deliverables that advance our understanding of translation imbalances.
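
The two-step workaround described above (list all wikis, keep only Wikipedias, then query each one) can be sketched as follows. This is a hedged illustration, not the mentor's code: it assumes the JSON shape returned by the MediaWiki `action=sitematrix` API as I understand it (numeric keys per language group, each with a `site` list whose entries carry `code`, `url`, and an optional `closed` flag), and it assumes each wiki serves its API at `/w/api.php`. The function names and the hand-written sample response are hypothetical, and no network call is made.

```python
from urllib.parse import urlencode


def wikipedia_sites(sitematrix_response):
    """Yield (language code, site URL) for each open Wikipedia."""
    matrix = sitematrix_response["sitematrix"]
    for key, group in matrix.items():
        # Language groups use numeric keys; "count" and "specials" do not.
        if not key.isdigit():
            continue
        for site in group.get("site", []):
            # Wikipedias carry the site code "wiki"; skip closed wikis.
            if site.get("code") == "wiki" and "closed" not in site:
                yield group["code"], site["url"]


def siteinfo_url(site_url):
    """Build the follow-up siteinfo statistics request for one wiki."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    }
    return f"{site_url}/w/api.php?{urlencode(params)}"


if __name__ == "__main__":
    # Hand-written, heavily trimmed sample in the assumed response shape.
    sample = {
        "sitematrix": {
            "count": 3,
            "0": {"code": "de", "name": "Deutsch", "site": [
                {"url": "https://de.wikipedia.org", "dbname": "dewiki", "code": "wiki"},
                {"url": "https://de.wiktionary.org", "dbname": "dewiktionary", "code": "wiktionary"},
            ]},
            "1": {"code": "aa", "name": "Qafár af", "site": [
                {"url": "https://aa.wikipedia.org", "dbname": "aawiki", "code": "wiki", "closed": ""},
            ]},
            "specials": [],
        }
    }
    for lang, url in wikipedia_sites(sample):
        print(lang, siteinfo_url(url))
```

Keeping the filtering separate from the HTTP requests means the same function could be pointed at the CSV-derived list instead, and the per-wiki requests can be rate-limited or cached independently.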

Community bonding period (May 4th - May 28th)

Before the internship begins, I will immerse myself in the active research project, Research: Content Translation language imbalances, and its component on Research into translation imbalances.
• I will invest time in understanding the project's goals, objectives, and key deliverables, as well as the resources and tools that are part of this project.
• To gain a deeper understanding of the project, I will explore the existing literature and research on translation imbalances and their impact on the Wikimedia community.
• I will review the project's documentation, including project plans, specifications, and previous research findings.
• In addition, I plan to contribute to Wikipedia as a way of giving back to the community and deepening my understanding of the challenges and opportunities of translation on the platform. I will seek out opportunities to participate in Wikipedia editing, translation, and community-building activities.
• I will invest time in studying the project's wiki page on Meta to access additional resources and connect with other researchers and community members.
• If needed, I am open to exploring additional resources or tools that could help produce more insightful and interactive visualizations, such as an interactive graph or a Jupyter notebook.
• Furthermore, I hope to contribute to the further development of this project by conducting user research and data analysis to uncover new insights and inform the development of new features and tools.

During the internship (May 29th – August 25th)

Week 1: Initial Review and Survey Development

( May 29th - June 2nd )
• Review project documentation, including the project plan, specifications, and goals.
• Meet with project mentors and community members to discuss project goals and clarify research questions.
• Develop and refine the survey and interview questions in task #T331207 to ensure that they are clear, relevant, and unbiased based on the feedback from the mentors.
• Discuss with WMF Language Engineering the possibility of sending out the survey to translators.

Week 2-3: Conducting Interviews and Analyzing Data on Translation Imbalances in Wikipedia

( June 5th - June 16th )
• Recruit interviewees among translators of minority languages, non-Western languages, and indigenous languages.
• Conduct the interviews.
• Transcribe the interviews, then clean and preprocess the data so that it is ready for analysis.
• Code the data to identify common themes and patterns, using data visualization tools to help surface relationships and insights.

  • Write at least 2-3 blog posts. Possible topics:
  • Behind the Scenes of a Translation Research Project: My Journey So Far.
  • Uncovering the Challenges Faced by Translators of Minority and Indigenous Languages: Insights from My Research.
  • Making Sense of Data: My Experience Preprocessing and Analyzing Interviews with Translators.

Week 4: Synthesizing Findings and Writing Report

( June 19th - June 23rd )
• Synthesize the findings by bringing together the results from the survey and interviews to provide a comprehensive overview of the challenges and opportunities related to translation imbalances.
• Identify key themes, insights, and recommendations.
• Write a report to document the results and share the findings with the mentor.
• Review and edit the final report or presentation to ensure that it is clear, concise, and effectively communicates the findings and recommendations.
• Present the findings to the mentor and the community members to get feedback and suggestions for improvement.

Week 5-6: Algorithm Investigation and Experimental Intervention

( June 26th - July 7th )
• Conduct experiments to evaluate the effects of changes to the currently used algorithm for default language selection.
• Develop a statistical analysis plan to test the hypotheses and assess the significance of the findings.
• Examine the existing algorithm employed for default language selection and propose an alternative approach.

  • Write 2-3 blog posts. Possible topics:
  • Insights from Synthesizing Findings on Translation Imbalances.
  • The Role of Machine Learning in Translation Imbalances.
  • The Importance of Data Preprocessing in Analyzing Translation Imbalances.

Week 7-8: Historical Log Analysis and Correlation Determination

( July 10th - July 21st )
• Conduct a thorough investigation into the algorithm utilized to recommend articles for translation. Identify any potential sources of bias and develop strategies to mitigate them.
• Identify potential changes to the model that could address imbalances.
• Initiate the passive analysis of historical content translation logs.

Week 9-10: Analyzing Translation Processes and Language Pair Metrics on Wikipedia

( July 24th - August 4th )
• Determine correlations between the translation process and language pair metrics, such as article count, active editors, and page views.
• Conduct a thorough analysis of smaller language subsets to enable effective comparisons.
• Classify findings based on the translation source, explicit use of machine translation, and availability of external or internal machine translators.
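
As a rough illustration of the correlation step above, the sketch below computes a Pearson correlation coefficient between one language metric and a translation count. It is only one possible choice (a rank-based measure may suit skewed wiki data better), the function name is hypothetical, and the numbers are made up for demonstration; they are not Content Translation statistics.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up values: article count of the target wiki and
    # number of translations published into it.
    article_counts = [1000, 2000, 3000, 4000]
    translations_in = [10, 20, 30, 40]
    print(f"r = {pearson(article_counts, translations_in):.3f}")
```

The same function can be reused for each metric mentioned in the plan (article count, active editors, page views) by swapping the first argument.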

  • Write at least 2-3 blog posts. Possible topics:
  • Addressing Translation Imbalances: A Case Study of Algorithmic Bias and its Mitigation Strategies.
  • Finding Common Ground in Translation: Strategies for Addressing Language Imbalances and Ensuring Linguistic Diversity in Online Content.
  • Uncovering Correlations between Language Pair Metrics and Translation Processes: Implications for Global Communication and Cross-Cultural Understanding.
  • From Default Language Selection to Alternative Approaches: A Critical Examination of Language Bias in Machine Translation.

Week 11-12: Finalizing Findings and Recommendations

( August 7th - August 18th )
• Finalize the analysis and prepare a comprehensive presentation of the findings to share with the team.
• Use data visualization tools to communicate the results of the analysis.
• Synthesize the findings to develop recommendations for addressing imbalances in Wikipedia.
• Develop a plan to test the effects of changes to the algorithm used for suggesting articles for translation.

Week 13: Wrapping up

• Prepare the final report and presentation.
• Present the final findings and recommendations to the project mentors and community members.
• Receive feedback and suggestions for improvement.
• Finalize the report and recommendations.
• Document the research process by writing a concise report outlining the most noteworthy findings and making all supporting materials available.

  • Write 2 blog posts. Possible topic:
  • Outreachy and Wikipedia: My Experience and What I've Learned.

Other deliverables during the internship

  • Blog posts on my progress every week.
  • Blog posts on my experience with the Wikipedia community and about the project.
  • Regular communication with my mentors and other members of the Wikimedia Community.

About me

Hello, I'm Anshika, a self-taught software engineer. My background is in business studies, but my passion for technology led me to make a career switch, and since then I've been pursuing software engineering on my own. I'm an active member of online communities such as FreeCodeCamp, where I engage with other passionate individuals to learn more about software development and open-source projects. These communities have given me a platform to connect with like-minded people from all over the world and to stay up to date with the latest trends in the tech industry.

Relevant projects:

During my time in school, I led a team of three in a data analysis project focused on consumer behavior. As part of my role, I researched and analyzed data to identify patterns, trends, and key factors influencing purchases. I conducted statistical analyses, including frequency and correlation analyses, to gain insights into the relationships between factors such as age, income, and buying behavior. Our team generated valuable insights that could inform future decision-making for businesses and organizations.

In my first year of college, I gained experience in conducting case studies and surveys on various topics, including two studies on the “determinants of buying behavior and mobile phone demand among university students in India and Sri Lanka.” Additionally, I was part of an active research team in my college, where I conducted research, case studies, reports, and surveys on economic and business-related topics. Through these experiences, I honed my analytical and research skills and developed expertise in report development and effective communication of findings. These experiences sparked my interest in pursuing data-driven work in the future.

How I learned about Outreachy

I first learned about Outreachy through my participation in online communities such as FreeCodeCamp and other tech-related forums. These communities have provided me with a wealth of information and resources, and have helped me stay up-to-date with the latest trends and opportunities in the tech industry. Through these channels, I discovered Outreachy and was immediately drawn to its mission of promoting diversity and inclusion in tech by providing internships to underrepresented groups. I was excited to learn more about the program and to apply for an internship, which ultimately led to a valuable learning experience and a chance to contribute to meaningful open-source projects.

Event Timeline

Anshika_bhatt_20 renamed this task from outreachy proposal: Anshika Bhatt to Outreachy Proposal: Anshika Bhatt. Apr 3 2023, 2:17 PM

Outreachy results are out! Declining this task as the proposal was not selected. You could consider finishing up any pending pull requests or tasks remaining from the contribution period. Your ideas and contributions to Wikimedia projects are still welcome. If you would still be eligible for the next round of Outreachy, we look forward to your participation!