Outreachy proposal - Nathaly Toledo
Closed, Resolved (Public)

Authored By

	Ahn-nath
	Apr 3 2023, 4:05 AM

Description

Profile

Name : Nathaly Toledo
Time zone : GMT -4:00
Email : nathaly12toledo@gmail.com
Zulip : Nathaly Toledo
Github : https://github.com/ahn-nath
Location : Venezuela
Blog : https://medium.com/@nathaly12toledo

Synopsis

The project studies, proposes, and tests solutions for the translation imbalances perceived on Wikipedia and studied in the past. From the description provided by the mentors of the project, we have:

When we compare the number of translations made between pairs of languages, we find very high ratios of articles being translated from languages with a larger wiki presence into languages with a smaller presence. English alone is the source language for 70% of all published translations, and the pattern seems to repeat for other colonial tongues.

We would like to understand why this is. We've begun to find explanations in the software design choices, and there are many potential influences behind each translator's choice of article and languages. Some of these factors might be: the number of articles available in each language, cultural richness and blind spots, suggestions made by software, the availability and quality of machine translation, and more.

The Outreachy component of our project will follow one of these possible avenues for investigation.

Mentors: @awight
Co-mentors: @Simulo

My contributions

Contribution #1

Link: https://docs.google.com/document/d/1lXfRC9kgPWGlpYPqpDgIH1kZUkAeVAzy7qOCe0PwUhc/edit?usp=sharing
This is my solution for a task that required comparing an official API result with the scraper result from a GitHub contribution. I identified the differences after parsing and converting the data of both outputs, and validated them programmatically. Finally, I compared all public contributions to see how they differ and how accurate each file is relative to the others.
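As a rough illustration of this kind of comparison, the sketch below treats each output as a collection of (source, target, engine) tuples and reports what each side is missing. The function name `compare_pairs`, the tuple layout, and the sample data are assumptions made for this example; they are not taken from the linked document.

```python
def compare_pairs(api_pairs, scraper_pairs):
    """Compare two collections of (source, target, engine) tuples."""
    api, scraped = set(api_pairs), set(scraper_pairs)
    union = api | scraped
    return {
        # Records the API reports but the scraper missed.
        "only_in_api": sorted(api - scraped),
        # Records the scraper produced that the API does not report.
        "only_in_scraper": sorted(scraped - api),
        # Share of all records that both sides agree on (Jaccard index).
        "agreement": len(api & scraped) / len(union) if union else 1.0,
    }


# Made-up sample data, for illustration only.
api_result = [("en", "es", "Google"), ("es", "ca", "Apertium")]
scraper_result = [("en", "es", "Google"), ("es", "pt", "Apertium")]
report = compare_pairs(api_result, scraper_result)
```

Using sets keeps the comparison independent of record order, which matters when the two outputs list the same pairs in different sequences.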

Contribution #2

Link: https://etherpad.wikimedia.org/p/r.df3d6f2e35e02a3cfa8912b58abb6e36
The goal of the survey is to learn more about how this software is used, and how translation languages are chosen. My approach was to test assumptions, drawn from reading research papers, about why translation imbalances may be present in Wikipedia's community of translators. The survey also includes general questions that may guide further research and the formulation of new assumptions.

Contribution #3

Link: https://github.com/ahn-nath/configuration-evolution-over-time.time-machine
This project is a time machine for CSV files. It allows you to track changes in CSV files over time, restore CSV files to a previous state, compare CSV files to a previous state, keep track of the last time a CSV file was changed, and update it accordingly, without having to rewrite the data each time. Essentially, it works as a parser, which reads the data into a native structure in memory and plays back the data repository's git history to parse the data at each commit, storing the entire sequence in memory along with the timestamp of the git commit. It uses the GitHub API to access the git history of the data repository. As of now, it uses the GitHub repository Configuration Evolution Over Time: Source File as the primary data source, but it can be easily extended to use other data sources.
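The playback idea can be sketched as follows. This is a minimal illustration and not the code in the linked repository: it assumes the CSV text of each commit has already been fetched (the GitHub API calls are left out), that snapshots are recorded in chronological order with ISO 8601 UTC timestamps, and that the first CSV line is a header. The class name `CsvTimeMachine` and the sample rows are hypothetical.

```python
import bisect
import csv
import io


class CsvTimeMachine:
    """Keeps one parsed CSV state per commit, ordered by commit time."""

    def __init__(self):
        self._timestamps = []  # ISO 8601 strings, ascending
        self._states = []      # frozenset of row tuples per commit

    def record(self, timestamp, csv_text):
        """Parse one commit's CSV text and append it to the history."""
        if self._timestamps and timestamp < self._timestamps[-1]:
            raise ValueError("snapshots must be recorded in chronological order")
        rows = [tuple(row) for row in csv.reader(io.StringIO(csv_text))]
        self._timestamps.append(timestamp)
        self._states.append(frozenset(rows[1:]))  # skip the header row

    def state_at(self, timestamp):
        """Return the rows as they were at the given point in time."""
        index = bisect.bisect_right(self._timestamps, timestamp) - 1
        if index < 0:
            raise LookupError("no snapshot at or before this time")
        return self._states[index]

    def diff(self, earlier, later):
        """Rows added and removed between two points in time."""
        old, new = self.state_at(earlier), self.state_at(later)
        return {"added": new - old, "removed": old - new}


# Made-up history, for illustration only.
machine = CsvTimeMachine()
machine.record("2023-01-01T00:00:00Z", "source,target,engine\nen,es,Google\n")
machine.record(
    "2023-03-01T00:00:00Z",
    "source,target,engine\nen,es,Google\nes,ca,Apertium\n",
)
changes = machine.diff("2023-02-01T00:00:00Z", "2023-03-15T00:00:00Z")
```

Holding every state in memory keeps lookups simple (a binary search over the commit timestamps), at the cost of memory that grows with the length of the history.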

Contribution #4

Link: https://github.com/ahn-nath/wikimedia-cxserver-config-parser
This is a simple parser for the Wikimedia cxserver configuration files. It is designed to be used with the “https://github.com/wikimedia/mediawiki-services-cxserver/tree/master/config” directory. Essentially, it parses these files and creates a single flat, in-memory structure with all the supported pairs. It exports the data from a list of accepted YAML files as a CSV of all pairs, with at least the following columns: “source language”, “target language”, “translation engine”, and “is preferred engine?”

The configuration files have several file structures. Most have the source language as the top-level key, and the target languages as a list of values under that key. Those with “handler” indicate a non-standard interpretation of the file. My solution parses "mt-defaults.wikimedia.yaml" to find the designated files and then generates a CSV file with all records. It includes tests, a pickle file to improve running time, the main script, and various folders with the target data.

It uses:

Python 3.6 or higher, pip and git.
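The flattening step for the standard file structure could look roughly like the sketch below. It assumes the YAML files have already been loaded into plain dictionaries, that each engine maps a source language to a list of targets, and that a hypothetical `defaults` mapping names the preferred engine for a pair. The names `flatten_pairs` and `rows_to_csv` and all sample data are invented for this example, and files that use a “handler” are not covered.

```python
import csv
import io

COLUMNS = [
    "source language",
    "target language",
    "translation engine",
    "is preferred engine?",
]


def flatten_pairs(engine_configs, defaults):
    """Turn {engine: {source: [targets]}} into one flat row per pair."""
    rows = []
    for engine, mapping in sorted(engine_configs.items()):
        for source, targets in sorted(mapping.items()):
            for target in targets:
                rows.append({
                    "source language": source,
                    "target language": target,
                    "translation engine": engine,
                    # Preferred only if the defaults name this engine.
                    "is preferred engine?": defaults.get((source, target)) == engine,
                })
    return rows


def rows_to_csv(rows):
    """Serialize the flat rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up configuration, for illustration only.
engine_configs = {
    "Apertium": {"es": ["ca", "pt"]},
    "Google": {"en": ["es"], "es": ["ca"]},
}
defaults = {("es", "ca"): "Apertium"}
rows = flatten_pairs(engine_configs, defaults)
csv_text = rows_to_csv(rows)
```

One row per (source, target, engine) combination means a pair served by several engines appears several times, with the preferred-engine column marking which one is used by default.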

Contribution #5

https://docs.google.com/document/d/1k2Ek4V2kzmjBXc4cOb9EmVsgV0IR08TGkSo8edXc-Ng/edit?usp=sharing
The task required analyzing and summarizing a paper that studies participation on Wikipedia. It also involved making well-founded assumptions and explaining them. I read the research paper provided and added my own conclusions about its topics and about how they apply to the challenge we would target with this Outreachy project.

Timeline (2023)

Week 1 (30th May - 5th June)

Get familiar with the topic and discuss priorities with mentors.
Solve task #1, which could involve improving the integration of the cxserver scraper and the time machine, as well as extending their functionality or the functionality of past tasks.

Week 2 (6th June - 12th June)

Document (task #1).
Get mentors' feedback.
Continue working on task #1.
Submit the final version.

Week 3 (13th June - 19th June)

Document new changes to task #1.
Discuss with mentors potential software improvements to address the researched imbalances, as well as evidence-based proposals that we could test.
Decide on three potential software solutions.

Week 4 (20th June - 26th June)

Start working on proposed solution #1 - Translation imbalances visualizer enhancements: a tool that allows community contributors from around the world to visualize translation imbalances across regions on Wikipedia and compare them with past statistics as well as forecasted statistics, so that those who are interested in solving the problem, or in making contributions based on that data, can do so with a clear statement backed by evidence. Some features I suggest adding are country of origin (1) and filters by timeframe (2). This is based on two existing solutions: https://en.wikipedia.org/w/index.php?title=Wikipedia%3AEdits_by_project_and_country_of_origin and https://en.wikipedia.org/wiki/Special:ContentTranslationStats.
Submit the project plan (project requirements, motivations, and research to back the proposal, wireframes/user journeys, software architecture, and development plan) for feedback.

Week 5 (27th June - 3rd July)

Implement feedback.
Continue working on proposed solution #1 – Translation imbalances visualizer.
Discuss with mentors the final version.

Week 6 (4th July - 10th July) + Week 7 (11th July - 17th July)

Complete proposed solution #1 - Translation imbalances visualizer.

Week 8 (18th July - 24th July)

Start working on proposed solution #2: a set of two features to integrate into Wikipedia to help influence user behavior in a way that promotes a reduction of imbalance in translations.
Submit the project plan (project requirements, motivations, and research to back the proposal, wireframes/user journeys, software architecture, and development plan) for feedback.
Start blog post draft #1.

Week 9 (25th July - 31st July)

Implement feedback.
Continue working on proposed solution #2: a set of two features to reduce imbalances.
Discuss with mentors the final version.
Complete and publish blog post #1.

Week 10 (1st August - 7th August)

Complete feature #1 of the proposed solution #2 – a set of two features to reduce imbalances.
Start blog post draft #2.

Week 11 (8th August - 14th August)

Start working on feature #2 of the proposed solution #2 – a set of two features to reduce imbalances.
Submit progress for feedback.
Complete and publish blog post #2.

Week 12 (15th August - 21st August)

Implement feedback.
Continue working on proposed solution #2 and the two features.
Discuss with mentors the final version.
Start blog post draft #3.

Week 13 (22nd August - 30th August)

Complete proposed solution #2.
Complete and publish blog post #3.

30th August and later

Final improvements.
Additions to the documentation of projects.
Continue working as a volunteer if improvements are needed for the project.

About me

I am Nathaly Toledo, a senior student from Caracas, Venezuela. I studied at a technical institute during my high school years and continued my education at the University of the People, in a computer science degree program. Next year I will be completing a Master's level diploma in Software Architecture with Open Classrooms. Additionally, I participate in different programs and take different specializations to improve my software engineering skills and my understanding of social problems in the world, in South America, and in my country. I have three years of working experience with international clients as a professional software developer and have been officially recognized as being in the top 3% of all professionals on a freelancing platform.

I have decided to apply to Outreachy because I feel it is a highly useful way of connecting with mentors and projects that can help me gain a deeper understanding of quality software, learn the best approaches to tackling problems, and improve my skills. Furthermore, I have decided to apply only to this Wikipedia project because I felt, after studying the other projects, that it was the only one that matched my interests, skills, and purpose. I believe that this contribution has the potential to improve participation and make knowledge more accessible to people in my country and in countries like mine. I would like to study the gap between countries when it comes to translation, the impact it has on the access certain groups have to relevant information, why we should care, and how to do something about it.

Past experience with this community

User:
Like most people, I have used Wikipedia to inform myself about relevant topics and have a starting point for important information. Sometimes, I just use Wikipedia to find significant or useful groups of references targeting one concept. I recently visited: https://es.wikipedia.org/wiki/Instituto_del_Hemisferio_Occidental_para_la_Cooperaci%C3%B3n_en_Seguridad

Similarly, I have direct contact with Wikimedia Venezuela because I am a member of the Impact Hub Caracas, which is associated with the organization, and can connect with them. I recently contacted Galahad (https://meta.wikimedia.org/wiki/User:Galahad) to get his views on the problem the project is trying to study.

Contributions:
Via Outreachy, I applied to a Wikimedia project before and worked on one contribution during the application period. This is the link to my work: https://public-paws.wmcloud.org/66093174/task-01.ipynb

Past experience with other communities

Contributions:
As mentioned later, I was an open-source developer through a fellowship. I worked for two months on several contributions to ProgramEquity, an open-source project used to promote action on climate change and help advocacy groups meet their goals.

Repository link: https://github.com/ProgramEquity/amplify

Users:
Although I am a Windows user, I have lately been getting more and more involved with Linux. I have used Docker Web many times to interact with Ubuntu via virtual machines, and I was also selected by the Linux Foundation to receive the “Shubhra Kar Linux Foundation Training (LiFT) Scholarship 2022”, as listed on their public website (https://www.linuxfoundation.org/about/lift-scholarships). Thanks to it, I am training in Linux and getting involved with it professionally.

Naturally, as a developer, I am a user of many wonderful open-source projects like Python and many libraries. Nevertheless, I would like to focus on more specific cases where I was a user who took advantage of the tool in a technical way:

I used a tool developed by another open-source contributor for my open-source contributions, and I discussed with the author how to use it. I also analyzed and compared other alternatives to implement a solution.

Relevant links:

Relevant projects

As a freelancer with several years of experience working for international clients and as someone with three months of experience with open-source development, I have worked on a wide range of products that involved analyzing existing systems to improve them, as well as detecting the root cause of issues that could be solved or addressed with software, as is the case with translation imbalances. I would like to highlight one closed-source project I worked on and one open-source project:

Open source project: Program Equity – Amplify
After being selected as an MLH Fellow last fall, I was chosen by the GitHub organization to contribute as an open-source developer in one of the projects they sponsor: ProgramEquity. The Amplify project is “an open-source app created for users to take the initiative in being part of an actionable step in the efforts to protect against climate change.” Besides that, they also help indigenous communities in North America by enabling advocacy groups through the app. I was the most active contributor and successfully closed more than five issues in two months.

Some of my contributions:

Add social media icons with links to representative card: https://github.com/ProgramEquity/amplify/pull/361
Fix the display of the filter in the campaign page form: https://github.com/ProgramEquity/amplify/pull/365
Crop representatives photo: [Low Priority] Crop representative photos using Cicero API data: https://github.com/ProgramEquity/amplify/pull/367
Cache option APIs: https://github.com/ProgramEquity/amplify/pull/373
Support queries by address: https://github.com/ProgramEquity/amplify/pull/402
[Documentation] Add ORM diagram to README: https://github.com/ProgramEquity/amplify/issues/354, https://github.com/ProgramEquity/amplify/wiki/Data-Structures/382ff30431f1a13375838e8dc89934de62252a17

Skills I gained:

Collaborating in an open-source environment.
Proposing ideas and discussing them with an open source community before implementing a solution.
Assertive communication and collaboration while pair programming with senior developers.
API caching and understanding techniques about performance improvement.

Relevant links:

ProgramEquity - Amplify: https://github.com/ProgramEquity/amplify
GitHub blog highlighting my participation: https://github.blog/2022-09-23-meet-the-github-campus-experts-selected-for-the-fall-2022-mlh-fellowship-cohort-powered-by-github/

Closed source project: Queensbury.io
For one year, I worked on the creation of a minimum viable product (MVP) and a proof of concept (PoC) for a startup with a focus on the US. As a junior software engineer, I was responsible for Queensbury.io, one of Cryptius's projects. Queensbury.io is a system that aims to revolutionize the boxing world by providing an accurate way to analyze and score boxers’ performance via artificial intelligence and body mechanics.

Skills and knowledge I gained:
Overall, I designed and built the proof of concept and MVP versions of the solution. I have learned how to:

Use different communication protocols, like TCP, to understand and implement a WebSockets architecture that let the team create an event-driven, highly functional notification system.
Define and study the software architecture to use.
Lead the initial design process and handle the low-fidelity and high-fidelity proposals.
Fully implement each module of the web app with Django and Plotly.
Integrate the Vimeo and Google Cloud Storage APIs.
Help with the data pipeline process, the creation of Google Cloud Functions, and the Docker container logic.
Use data analysis skills to process inputs and outputs and simplify existing calculations.
Generate Plotly diagrams/graphs with the data received.
Document the project on Notion and GitHub, by adding the architecture overview, technical specifications, product requirements, testing specifications, maintenance guide, and README file.

Relevant links:
Unfortunately, due to the nature of my freelancing contract, I cannot share many substantial details about the project. However, I can share a secret Gist that contains some files that the client allowed me to share and that date back to the initial stage of the project:

Secret Gist with some files that describe my involvement with the project during the initial stages: https://gist.github.com/ahn-nath/34b559ca0648f577fe46c73e09e60105
Excerpt of feedback from the client. This feedback is listed as a recommendation on LinkedIn: https://www.linkedin.com/in/nathaly-toledo/

“With so many positive things to say about Nathaly, it's impossible to know where to begin. We are a startup company, and as such we are very much “sink or swim”, and we constantly challenge our people to think creatively to solve complex problems with no easy or readily identifiable solution. In this challenging atmosphere, I have yet to see a challenge that she was unwilling to take on or unable to solve. Nathaly is an incredibly talented engineer. She designed and built our platform from the ground up, often with little or no guidance on which direction would be the best to take. She basically started with an idea that was not her own, some very loose guidelines on what the end product would look like, and hit the ground running...”

Time commitments on initial application

As a student of an online institution, I have complete control of my schedule and can easily adapt my commitments to the project and the availability required. My current time commitments with the university require up to 15 hours per week and, on average, 10 hours per week.

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		None	T328597 [Outreachy round 26] Research into translation imbalances
		Resolved		Ahn-nath	T333792 Outreachy proposal - Nathaly Toledo

Event Timeline

Ahn-nath created this task. Apr 3 2023, 4:05 AM

Anoop merged a task: T333791: Outreachy proposal - Nathaly Toledo. Apr 3 2023, 4:35 AM

Thank you for putting all of this work into your contributions and proposal, it's been an honor to be a reviewer!

Two major tasks is a good scope, IMHO. We might only complete one, but if that ends surprisingly quickly then we'll still have a second project to focus on.

Point of information, there's already a statistics view for ContentTranslation here (see especially the "translations from" graphs, which require something like a manually-split log scale!). Coordinating with the https://www.mediawiki.org/wiki/Special:MyLanguage/Wikimedia_Language_engineering team to integrate this new view within our timeline would probably increase the effort unreasonably, so I would suggest studying the existing tool and building our enhancement as an independent mock-up, but perhaps using a similar tech stack (main module, chart.js) so that it's easy to port into MediaWiki if approved. Also: really cool idea! Starting to address the problem by creating more transparency seems like a great strategy.

A similar note applies to the suggestion in task #2 of making software changes: merging into the main software comes with a lot of coordination and the feature should be validated ahead of this. However, I could see us validating the feature by running a small user research study where 5-10 wiki translators offer to try out a specially-modified version of the software with us in an interview setting.

Lastly, please be aware that Outreachy has a project requirement to blog every two weeks. Hopefully that's more of an opportunity than a burden! Don't feel that you need to adjust the timeline to include these, I'm mostly mentioning so that the "job" parameters are clear.

In T333792#8749394, @awight wrote:

Thank you for putting all of this work into your contributions and proposal, it's been an honor to be a reviewer!

Thanks for guiding us and helping me improve my analytical and software engineering skills.

Point of information, there's already a statistics view for ContentTranslation here (see especially the "translations from" graphs, which require something like a manually-split log scale!). Coordinating with the https://www.mediawiki.org/wiki/Special:MyLanguage/Wikimedia_Language_engineering team to integrate this new view within our timeline would probably increase the effort unreasonably, so I would suggest studying the existing tool and building our enhancement as an independent mock-up, but perhaps using a similar tech stack (main module, chart.js) so that it's easy to port into MediaWiki if approved. Also: really cool idea! Starting to address the problem by creating more transparency seems like a great strategy.

I based my choice on the observations of another tool: https://en.wikipedia.org/w/index.php?title=Wikipedia%3AEdits_by_project_and_country_of_origin (no longer available), which also shows the country of origin. I believe showing the country or tracking it in aggregate for the creation of the tool can provide better insights on how to address the issue for future contributors. Furthermore, I believe that having a starting point makes a small contribution from our side more likely.

A similar note applies to the suggestion in task #2 of making software changes: merging into the main software comes with a lot of coordination and the feature should be validated ahead of this. However, I could see us validating the feature by running a small user research study where 5-10 wiki translators offer to try out a specially-modified version of the software with us in an interview setting.

Yes, my proposed timeline may seem out of touch with reality or with the intended scope of the collaboration for the selected participant. I did my best to adapt it to what I feel makes sense to work on, based on my understanding of the project and some potential patterns we can address, as well as on my experience with similar projects in the past and how long they can take. Of course, I know that behind each feature there is a lot of feedback to be received, intervention, and waiting time for approval. I also know that these features will not make it to a main repository the next day, but even if they are not immediately approved, I hope to work on something that is solid and efficient enough to advance the goals mentioned in the project abstract/description.

Lastly, please be aware that Outreachy has a project requirement to blog every two weeks. Hopefully that's more of an opportunity than a burden! Don't feel that you need to adjust the timeline to include these, I'm mostly mentioning so that the "job" parameters are clear.

This project requirement was included in my timeline.

Thanks again for your time.

Ahn-nath updated the task description. Apr 3 2023, 3:57 PM

Ahn-nath updated the task description. Apr 3 2023, 6:30 PM

srishakatux moved this task from Backlog to Project Proposals on the Outreachy (Round 26) board. Apr 4 2023, 6:11 AM

Gopavasanth moved this task from Project Proposals to Selected Projects on the Outreachy (Round 26) board. May 8 2023, 12:40 AM

Hello, @awight and @Simulo. Firstly, I am honored to be selected.

Because I would like to do a good job before and after the community bonding period, I would like to know if there is anything else you would like me to focus on. I would also like to know if you have any thoughts about the route or role I would be taking during this internship (that is, if you already have some ideas about what I would be doing or taking care of).
I have already completed most tasks in this list:

Create a task around your project (as a sub-task of the featured project) in Phabricator if you skipped this step during the application period. Use this task to track progress, for discussions with mentors and other community members, and share updates frequently in a comment (e.g., T266916).

Refine your project proposal with guidance from mentors, and discuss communication, development, overall timeline, and deliverables plan to follow throughout the internship.

Join Zulip (more info) to keep yourself up to date with the announcements related to the program and opportunities for participating in Wikimedia activities.

Write bi-weekly progress reports or blog posts in a language you are most comfortable writing, share them with fellow interns on Zulip, and add them to the "Updates" column on the Outreachy wiki page on MediaWiki.org. You can use this column to share any other project-related updates.

Set up your MediaWiki user page. You can use it to document your project work and link reports (e.g., User:Martyav, User:Gopavasanth).

Stay in touch with Wikimedia technical discussions by subscribing to the wikitech-l mailing list.

[Optional] Read stories from the Wikimedia movement on Wikimedia Foundation’s blog, about the technology and software behind running Wikipedia and its sister projects on the Wikimedia technical blog.
Watch previous videos on Wikimedia technical topics.

The exceptions are the blog, refining the proposal, and watching more Wikimedia-related videos on technical topics. Let me know if I should be tagging you both or just one mentor at a time.

In the meantime, I would like to continue reading some papers to either strengthen or reconsider my initial views on the topic.

Thanks and regards!

Hi! Please consider resolving this task and moving any pending items to a new task, as GSoC/Outreachy rounds are now over, and this workboard will soon be archived.

Updates from the project: https://www.mediawiki.org/wiki/Outreachy/Past_projects#Content_Translation_language_imbalances. As Outreachy Round 26 is long over, closing this task now.
