=Proposal for https://phabricator.wikimedia.org/T328597=
Name: Abhishek Bhardwaj
Email: abhishek02bhardwaj.er@gmail.com
User Page: [[ https://commons.wikimedia.org/wiki/User:Abhishek02bhardwaj | Wikimedia Page ]]
Location: Delhi, India
LinkedIn: [[ https://www.linkedin.com/in/abhishek-bhardwaj-821054241/ | Abhishek Bhardwaj ]]
Zulip: abhishek02bhardwaj.er@gmail.com (Abhishek Bhardwaj)
Phabricator: Abhishek02bhardwaj
Time Zone: UTC+5:30
Working Hours: 9 AM to 2 AM (UTC+05:30)
On college days, occupied between 9 AM and 2 PM (UTC+05:30)
=**Abstract**=
Wikipedia is a free, online encyclopedia that provides information on a vast array of topics, from history to technology to pop culture. It is maintained by a global community of volunteer editors who collaborate to create and update its content. With millions of articles available in multiple languages, Wikipedia has become a go-to resource for people seeking knowledge and information on a wide range of subjects.
Wikipedia provides translation services that allow users to access articles in their preferred language. This feature enables content to be available to a wider audience, and it is achieved through the efforts of volunteer translators who work to create and update articles in various languages. The translation services also include machine translation, which uses artificial intelligence to automatically translate articles into different languages. While the quality of machine translation can vary, it has made it easier for users to access information in their native language, regardless of the original language of the article.
But one important issue that concerns wiki enthusiasts is the imbalance in translations. Comparing the number of translations made between pairs of languages shows that articles from languages with a larger presence on Wikipedia are translated into languages with a smaller presence at very high rates. English, specifically, is the source language for 70% of all published translations, and this trend is also present for other colonial languages. [[ https://phabricator.wikimedia.org/T328597 | This project ]] will focus on researching these imbalances and understanding the reasons behind them.
=**Extension**=
Apart from the Analysis stream of research, I will also work in the UX research direction, interviewing translators to understand how their perception of language importance influences their language selection. I will also investigate how the software design impacts the selection of languages and the translation workflow.
=**Mentors**=
@awight and @Simulo
=**Experience and Contributions made to the project**=
Having used Wikipedia since the age of 12, I always wondered who could be knowledgeable enough to write all of this information by themselves (as a kid I thought Wikipedia was written by one person, like a book). As I grew older, I realised it was not an individual but a community that did this. I never thought, even in my wildest dreams, that one day I would be sitting in front of my laptop, capable of writing a proposal to those people to become a part of their team. This experience is more important to me than all of the knowledge I have gained during this contribution period for the Wikimedia Foundation, so it is already a dream come true for me.
During the past 21 days (from the 6th of March to the 26th of March), I have learned a lot while working on this project:
1. In the task [[ https://phabricator.wikimedia.org/T331199 | #T331199 ]] I summarised a paper and on the basis of it gave hypotheses and informed guesses about how it applies to translators.
[[ https://github.com/Abhishek02bhardwaj/Submission-for-T331199-and-T331200 | My Contribution ]]
2. In the task [[ https://phabricator.wikimedia.org/T331200 | #T331200 ]] I did a light systematic review of literature that might be relevant to our research.
[[ https://github.com/Abhishek02bhardwaj/Submission-for-T331199-and-T331200 | My Contribution ]]
3. In the task [[ https://phabricator.wikimedia.org/T331201 | #T331201 ]] I created a parser from scratch to extract the cxserver configuration and export it to a CSV file.
[[ https://github.com/Abhishek02bhardwaj/Extract-cxserver-configuration-and-export-to-CSV | My Contribution ]]
4. In the task [[ https://phabricator.wikimedia.org/T331202 | #T331202 ]] I created a time machine that walks through the git history of a data repository and analyses the data at each commit. The information obtained is then stored in memory, along with the timestamp of the git commit, forming a complete sequence.
[[ https://github.com/Abhishek02bhardwaj/Evolution-Tracker | My Contribution ]]
5. In the task [[ https://phabricator.wikimedia.org/T331204 | #T331204 ]] I plotted flow diagrams illustrating translation imbalances.
[[ https://github.com/Abhishek02bhardwaj/Flow-Diagrams-Illustrating-Translation-Imbalances | My Contribution ]]
6. In the task [[ https://phabricator.wikimedia.org/T331207 | #T331207 ]] I learned about how to compose a survey. In this task I drafted a survey for Content Translation software users, investigating how the software is used and how the languages are chosen.
[[ https://etherpad.wikimedia.org/p/xGzVywcafj65F66Gea2n | My Contribution ]]
7. In the task [[ https://phabricator.wikimedia.org/T332643 | #T332643 ]] we had to integrate the configuration scraper that we built in task [[ https://phabricator.wikimedia.org/T331201 | #T331201 ]] with the time machine built in task [[ https://phabricator.wikimedia.org/T331202 | #T331202 ]], running the configuration scraper on every git commit of the cxserver source repository.
[[ https://github.com/Abhishek02bhardwaj/Rough-Integration-of-Time-Machine-and-Configuration-Scrapper | My Contribution ]]
8. In the task [[ https://phabricator.wikimedia.org/T332647 | #T332647 ]] I compared the API results to the output of the scraper I built in the task #T331201. The accuracy of the scraper is 100%. As an extension to the task, I also compared other contributors' outputs and recorded their match percentages.
[[ https://github.com/Abhishek02bhardwaj/Compare-config-scraper-output-with-config-API | My Contribution ]]
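The comparison in task #T332647 can be pictured with a small sketch. It assumes both the scraper output and the API results have already been reduced to sets of (source, target, engine) triples; the function name `match_percentage`, the engine names, and the sample triples are hypothetical and are not taken from the linked repository.

```python
def match_percentage(scraped, reference):
    """Percentage of reference entries that also appear in the scraped output."""
    reference = set(reference)
    if not reference:
        # Nothing to match against, so treat the comparison as a full match.
        return 100.0
    return 100.0 * len(reference & set(scraped)) / len(reference)


# Hypothetical (source, target, engine) triples, for illustration only.
api_pairs = {("en", "es", "engine_a"), ("en", "hi", "engine_a"), ("fr", "es", "engine_b")}
scraper_pairs = {("en", "es", "engine_a"), ("en", "hi", "engine_a"), ("fr", "es", "engine_b")}
other_pairs = {("en", "es", "engine_a"), ("fr", "es", "engine_b")}

print(match_percentage(scraper_pairs, api_pairs))
print(round(match_percentage(other_pairs, api_pairs), 1))
```

Using sets makes the comparison independent of row order, which matters when two outputs list the same language pairs in a different sequence.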
=**Past Experience with Open Source Software**=
As a contributor, this is my first time contributing to open source, but I have been an active open source user for the past 10 years. From using VLC Media Player to watch videos to using the Android operating system on smartphones, open-source software has been an integral part of the technology in my life. I started to learn coding on Dev-C++, which is a free, open-source IDE for Windows. Then, when I expanded my horizons and learned more programming languages, I switched to an IDE that is compatible with multiple languages, Visual Studio Code, which is again built on open source. I browse the internet with Firefox, which provides phishing and malware protection. I use WordPress to develop websites that are well designed and function very well. I use PHP to manage dynamic content and session tracking. MySQL is my favourite RDBMS for managing databases. In this way, whether for entertainment, learning or any other purpose, open source has helped me a lot by providing excellent utilities.
=**About the Project**=
Among the proposed streams, I am quite interested in UX research and Analysis, as my past projects involved product development, in which UX research and analysis are integral. Through the UX research, I aim to dig deeper into the thought process of translators when they select languages for translation, and into the factors that motivate and prompt them to choose a language. Can these factors be countered from the project's end, or are there other factors that need to be addressed with a better approach? Through the Analysis stream, I will analyse the possible technical reasons for these imbalances and try to find ways to counter them.
I have divided my timeline into 5 phases, each addressing one important aspect of research into the translation imbalances.
=**Timeline**=
===Pre-Selection Period===
**April 4th - May 3rd**
- Study the code responsible for setting the default languages from [[ https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/ContentTranslation/%2B/refs/heads/master/modules/dashboard/ext.cx.dashboard.js | CXDashboard.findValidDefaultLanguagePair ]].
- Continue task [[ https://phabricator.wikimedia.org/T331204 | #T331204 ]], producing further flow diagrams that illustrate translation imbalances.
- Gather information about the two algorithms at play: one chooses the pair of source and target languages between which to translate, and the other chooses which articles to show for translation.
**May 4th**
- Celebrations
===Community Bonding Period===
**May 5th - May 28th**
- Refine the survey developed in task [[ https://phabricator.wikimedia.org/T331207 | #T331207 ]] with the help of mentors and the already provided contributions on the task.
- Send out the survey to translators.
- Find information about potential candidates for interview.
- Prepare and discuss a questionnaire with the mentors that will be used for interviewing.
- Contact the potential candidates and ask for a suitable date and time for their interview.
- Review the papers that were listed in the task [[ https://phabricator.wikimedia.org/T331200 | #T331200 ]] ultralight systematic review and try to observe if there is anything we can relate to or use in our research.
===Contribution Period Begins===
**May 29th - June 2nd**
**WEEK 1**
- Review the papers that were listed in the task [[ https://phabricator.wikimedia.org/T331200 | #T331200 ]] ultralight systematic review and try to observe if there is anything we can relate to or use in our research.
**June 5**
**FEEDBACK #1**
===Phase I Testing of the Possible Hypothesis===
**June 5th - June 9th + June 12th - June 16th**
**WEEK 2 + WEEK 3**
- Hypothesis 1 – The distribution of translators around the world will be significantly different from that of the editors and contributors. Analyse the data of translators on the basis of location, in a way similar to what was done in the paper [[ https://www.tandfonline.com/doi/full/10.1080/00045608.2015.1072791 | Graham, Straumann, Hogan. 2015. “Digital Divisions of Labor and Informational Magnetism: Mapping Participation in Wikipedia.” ]]
**June 12th - June 16th**
**WEEK 3**
- Hypothesis 2 – The availability of broadband connections and internet access in a region directly impacts the number of translations to and from the languages spoken in that region. Analyse the correlation between the number of broadband connections (or internet availability) in different regions and the number of translations to and from the languages spoken in those regions.
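A minimal sketch of the correlation check for Hypothesis 2, assuming per-region figures are available as two aligned lists. The numbers below are made-up placeholders, not measurements, and `pearson` is an illustrative helper rather than part of any existing tooling.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values, one entry per hypothetical region.
broadband_per_100 = [5.0, 12.0, 20.0, 33.0, 41.0]
translations = [120, 300, 450, 800, 950]

r = pearson(broadband_per_100, translations)
print(round(r, 3))
```

A coefficient close to 1 or -1 would only indicate a linear association between the two series; it would not by itself show that internet access causes the difference in translation counts.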
**June 19th - June 23rd**
**WEEK 4**
- Hypothesis 3 - Use the data about gross enrolment ratio in education in regional languages and assess its relation and impact on the number of eligible and active translators in that language.
**June 26th - June 30th**
**WEEK 5**
- Hypothesis 4 - When two languages are more closely related geographically or historically, translations between them are more likely to occur. Languages with more geographical or historical links to other languages will therefore see more translations.
**July 3rd**
**FEEDBACK #2**
===Phase II Interview and Survey Documentation===
**July 3rd - July 7th**
**WEEK 6**
- Analyse the responses from the survey and deduce important observations and document them appropriately.
**July 10th - July 14th**
**WEEK 7**
- Conduct interviews with translators to gain an understanding of how their perception of language importance influences their language selection. Also, investigate how software design impacts the selection of languages and the translation workflow.
**July 17th - July 21st**
**WEEK 8**
- Document the interviews and the findings into a structured blog that can be used for further reference in the research work.
**July 24th**
**FEEDBACK #3**
===Phase III Analysis and Improvement of Algorithm===
**July 24th - July 28th**
**WEEK 9**
- Analyse the currently used algorithm for default language selection and devise a potential alternative to it.
**July 31st - August 4th**
**WEEK 10**
- Analyse and research the algorithm used for suggesting articles for translation. Look for any potential bias and, if any is found, devise a way to remove it.
===Phase IV Passive Analysis and Experimental Intervention: Build a Quantitative View on the Issue===
**August 7th - August 11th**
**WEEK 11**
- Passive analysis of content translation historical logs:
1. Find connections between the translation process and the relative metrics for each language in the pair, such as the total number of articles, active editors, and pageviews.
2. Analyse smaller language subsets to make comparisons.
3. Break down all statistics based on whether the translation came from a suggestion, whether machine translation was used explicitly, and whether external or internal machine translators are available for the language pair.
- Complete the Configuration Time Machine work with the required documentation.
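The breakdown planned for the passive analysis could look roughly like the sketch below. It assumes each published translation is available as a record with a source language, a target language, and two flags; the field names `from_suggestion` and `used_mt`, the helper names, and the sample records are hypothetical and do not describe the actual log schema.

```python
from collections import Counter

# Hypothetical log records, for illustration only.
records = [
    {"source": "en", "target": "hi", "from_suggestion": True, "used_mt": True},
    {"source": "en", "target": "hi", "from_suggestion": False, "used_mt": True},
    {"source": "hi", "target": "en", "from_suggestion": False, "used_mt": False},
    {"source": "en", "target": "bn", "from_suggestion": True, "used_mt": False},
]


def pair_totals(records):
    """Count published translations per (source, target) pair."""
    return Counter((r["source"], r["target"]) for r in records)


def breakdown(records):
    """Count translations per pair, split by suggestion and MT flags."""
    return Counter(
        (r["source"], r["target"], r["from_suggestion"], r["used_mt"])
        for r in records
    )


def forward_share(totals, a, b):
    """Share of translations between a and b that go from a to b."""
    forward = totals[(a, b)]
    backward = totals[(b, a)]
    total = forward + backward
    return forward / total if total else None


totals = pair_totals(records)
print(totals[("en", "hi")], forward_share(totals, "en", "hi"))
```

A forward share near 0.5 would indicate a balanced pair, while values near 1 or 0 would point to the kind of one-directional flow this project investigates.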
**August 14th - August 18th**
**WEEK 12**
- Discuss and design experimental interventions in coordination with the Language Engineering team.
- Figure out the discontinuities in the machine translation and check for correlations with the step changes in the published translations.
**August 21st**
**FEEDBACK #4**
===Phase V Conclusion===
**August 21st - August 25th**
**WEEK 13**
- Conclude the research work, prepare a report of the findings, and publish the raw data links, code, graphs and exceptions.
- Write a blog article containing a concise report of all the work done, which can be used to carry the research forward.
**August 28th and Later**
- Celebrations.
- Continue code-based contributions to Wikimedia.
- Be an active member of the Wikimedia community and start exploring other communities to work with.
- Actively maintain the code and documentation and guide beginners who are interested in contributing.
**Stretch Goals**
- Test the hypothesis that some wikis are more challenging to contribute to than others (maybe because of cultural or stylistic differences between wikis), either through survey questions or by looking at translator activity over time.
==Other Deliverables during the Internship==
- Weekly Blog posts on my internship progress/experience.
- Blog posts about my experience with the open source community and the Wikimedia Foundation.
- Regular communication with my mentors and other members at the Wikimedia Community.
==About Me==
I am a sophomore pursuing a Bachelor of Technology in Information Technology and Mathematical Innovation at the Cluster Innovation Centre, University of Delhi. I am currently in the 4th semester of the 8-semester program and will graduate in May 2025. I am an active member of our college's coding society, where we have built an ecosystem of peer learning and have helped organise numerous workshops and technical fests.
==Past Projects==
1. **[[ http://dhamni-cic.infinityfreeapp.com/?i=2 | DHAMNI ]]** :
- It is a web-development project that aims to provide a platform for blood donors and recipients to share information.
- The donors can upload their details which can be searched by people who are in need of blood donation.
2. **[[ https://curiocic.netlify.app/ | Curio ]]** :
- It is also a web-development project, currently in development, that aims to solve the language gap (in available audio) problem of YouTube.
- Educational content is difficult to understand through subtitles alone, which is why dubs and audio translations become necessary.
- Through this platform, users can record their own dub of a YouTube video and upload it, so that other users who wish to watch the video in that language can view it.
3. **[[ https://github.com/Abhishek02bhardwaj/Facial-Recognition-Using-Principal-Component-Analysis | Facial Recognition Software using Principal Component Analysis ]]** :
- Developed MATLAB software capable of identifying and retrieving an image if it exists in its training database.
4. **[[ https://github.com/Abhishek02bhardwaj/mazes-with-dijkstra-and-A- | Maze Solving Algorithm Comparison ]]** :
- Compared the Dijkstra and A* algorithms, along with their time and space complexity.
==How did I learn about Outreachy?==
In our college, we have a really good culture of contributing to open source, and it is discussed around our campus almost all of the time. This culture sparked my initial interest in open source software, and I started learning about it. As I explored the communities further, I realised what kind of impact they make on people's lives, which motivated me to be a part of this community and start contributing myself. I came to know about Outreachy from a college senior who was an Outreachy intern and contributed to Inkscape in 2021. She told me about the program and how it supports diversity in free and open source software, which caught my interest, and I started working in this direction.