
Outreachy Round 29: Expand Scribe-Data translation functionality to use Wikimedia project dumps
Closed, Resolved, Public

Description

Brief summary

The Scribe community uses Wikimedia based data to create software for language learners. The main user facing application that the community makes is Scribe-iOS – a collection of keyboards for second language learners that can be used in any app to translate words, conjugate verbs and much more! The community is now also working on Scribe-Android and Scribe-Desktop.

The processes by which the Scribe community derives Wikimedia data are found in the Scribe-Data project, which until now has been based on Wikidata lexicographical data and Wikipedia texts. Scribe-Data is a Python-based command line interface, with usage examples including:

  • Getting all nouns, their genders and their plurals for a given language from Wikidata
  • Getting all verbs and needed conjugations from Wikidata
  • Generating autosuggestions for Scribe keyboards via Wikipedia dumps

Translations are an important functionality of the end-user Scribe applications, but till now the translation functionality has been reliant on Hugging Face based machine translations that are quite time intensive. This Outreachy project will focus on adding translation functionality that's based on Wikimedia project dumps - either Wikidata if the available lexeme data allows or Wiktionary based functionality if not - to Scribe-Data. This data will then be used to add functionality to downstream Scribe applications as well as others making use of Scribe-Data. Specifically this project will add the following commands to the Scribe-Data command line interface:

  • An improved version of the translate functionality that will parse Wikimedia project dumps for all translations for any language
  • Data outputs should be formatted both for Scribe-Data end users and Scribe-iOS/Android (we'll explain)
  • Potentially also similar functionality for deriving synonyms of words
  • If using Wikidata dumps, then the results of this process should mirror Scribe-Data's SPARQL query results as closely as possible, and dumps should be offered as an alternative data source for the user for current Scribe-Data functionality
    • This allows for experimentation and using Scribe-Data without large requests to the Wikidata Query Service

The above processes will need unit tests to make sure that future changes to the code do not cause breaking changes. Efficient parsing of Wikimedia project dumps or other data sources will also be key to the success of this project. The tasks above are the confirmed goals for this project, with aspirational goals being set by mentors and the mentee once the program starts.

Note: The decision was made to use Wikidata dumps as the data source for the project.

Skills required

  • Skills in Python
  • Prior experience working with Wikimedia information would be a plus
  • Project tag: affects-scribe-org

Possible mentor(s)

Microtasks

  • Issues for Scribe-Data
    • We'll be making more issues in the coming weeks to add more languages to the CLI's functionality
  • Any issues for other Scribe projects
    • Note that working on Scribe-iOS requires coding on macOS so that you'll have access to Xcode

Please look for the good first issue or help wanted tag in all projects! We'll be happy to help you onboard :)

Communication

Please join our community Matrix spaces to chat with the team and learn more about Scribe! We'd suggest using Element as your Matrix client, if you haven't used it before. Specifically we have a room for Scribe-Data and for Mentorship programs. During the program your mentors will be happy to communicate with you on GitHub or via Matrix. You'll also be invited to the Scribe bi-weekly developer calls where you'll have time to present your progress and work with the team on any problems. Calls and checkins outside of the syncs can also happen if needed :)

Event Timeline

Hi @debt 👋 Will port this over to outreachy:communities/cfp/wikimedia soon :) Please let us know if further information is needed and if the tags or any other Phabricator specifics should be updated!

Hi @AndrewTavis - this all looks great, please port it over to Outreachy! :)

AndrewTavis renamed this task from Outreachy Round 29: Scribe-Data Wiktionary translation and synonym to Outreachy Round 29: Scribe-Data Wiktionary based translation and synonym commands. Sep 6 2024, 8:04 PM
AndrewTavis renamed this task from Outreachy Round 29: Scribe-Data Wiktionary based translation and synonym commands to Outreachy Round 29: Create Scribe-Data Wiktionary based translation and synonym commands. Sep 6 2024, 8:10 PM

Project posted on Outreachy, @debt :) Let us know if there's anything else needed for now!

Thanks, @AndrewTavis - please add in your additional mentors to the Outreachy site!

Hi @AndrewTavis, My name is Abhishek Bhardwaj and I interned with Outreachy for the summer 2023 cohort on the Content Translation Language Imbalances under the mentorship of @awight. As a past Outreachy intern I would really like to help in the capacity of a co-mentor if you allow me to. This project also falls under my area of expertise. If you like we can discuss this further on a different communication channel.

Hey @Abhishek02bhardwaj 👋 @awight and I actually discussed your project at work as we were both mentors for different Outreachy/GSoC projects that same summer :) I just added communication guidelines to the task description, so you'd be welcome to reach out to me on Scribe's community Matrix spaces. Wouldn't just be my call on co-mentorship, but let me say that I do really appreciate your dedication to Outreachy and willingness to help!

debt changed the task status from Open to In Progress. Dec 17 2024, 5:59 PM

Week 1 (Dec. 9 - 13)

Tasks Completed:

  • Completed community bonding tasks: updated my Meta-Wiki user page, wrote my first blog post, and joined Zulip.
  • Had a sync meeting with mentors.

Learning:

Mostly finding my way around Scribe's translation functionality and getting started.

This comment was removed by Afi570.

Week 2 (Dec. 16 - 20) Update:

Tasks Completed:
Opened a PR for downloading the Wikidata lexeme dump for Scribe-Data.

  • New feature: download Wikidata dumps
  • Integration with the existing CLI
  • Code cleanup and removal of the Get -all test

Learnings:
Learning the inner workings and relationships of the Wikidata lexeme dump.

Week 3 (Dec. 23 - 27)

Tasks Completed:

Studied the source code of qwikidata.json_dump and related Wikidata references to work out how to parse the large lexeme bz2 dump file in a reasonable time, so that Scribe-Data is more comfortable to use.
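
As a rough illustration of the parsing problem described above, here is a minimal sketch that streams a bz2-compressed dump one entity at a time using only the standard library. It assumes the usual Wikidata JSON dump layout (a single JSON array with one entity per line, each followed by a comma). The function name `iter_dump_entities` is hypothetical and is not taken from Scribe-Data or qwikidata.

```python
import bz2
import json
from typing import Iterator


def iter_dump_entities(dump_path: str) -> Iterator[dict]:
    """Yield one entity dict at a time from a bz2-compressed JSON dump."""
    # Text mode decompresses lazily, so the whole file is never held in memory.
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as dump_file:
        for line in dump_file:
            # Each entity line ends with a comma inside the enclosing array.
            line = line.strip().rstrip(",")
            # Skip the array brackets and any blank lines.
            if line in ("", "[", "]"):
                continue
            yield json.loads(line)
```

Because the function is a generator, a caller can filter lexemes by language or category while reading and keep only what it needs.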

Tasks yet to be completed:

Understanding lexeme syntax and the structure of the dump file.

Week 4 (Dec. 30 - Jan 3rd)

Tasks Completed:

  • Met with AndrewTavis
  • Started onboarding and researching
  • Tried multithreading in dump parsing to figure out which approach is best for our case.
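
One way such an experiment could be set up is sketched below: JSON lines are grouped into batches and parsed on a thread pool. This is only an illustration; the helper names and default sizes are hypothetical, and it does not show which approach the project settled on. Note that JSON parsing in CPython is CPU-bound, so threads may give little speedup and a process pool is worth comparing against.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Iterable, Iterator, List


def batched(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group an iterable of lines into lists of at most `size` items."""
    iterator = iter(lines)
    while batch := list(islice(iterator, size)):
        yield batch


def parse_batch(batch: List[str]) -> List[dict]:
    """Parse one batch of JSON lines into dicts."""
    return [json.loads(line) for line in batch]


def parse_lines_threaded(
    lines: Iterable[str], batch_size: int = 1000, workers: int = 4
) -> List[dict]:
    """Parse JSON lines in batches on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order the batches were submitted.
        results = pool.map(parse_batch, batched(lines, batch_size))
        return [entity for batch in results for entity in batch]
```

Timing this against a plain single-threaded loop on a sample of the dump is a cheap way to decide whether the extra complexity pays off.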

Learnings:

Understood the structure of the Wikidata dumps and how they are used in Scribe-Data.

Thanks so much for the updates to the task, @Afi570! Project is going wonderfully so far. Thanks for all your hard work! 😊

AndrewTavis renamed this task from Outreachy Round 29: Create Scribe-Data Wiktionary based translation and synonym commands to Outreachy Round 29: Expand Scribe-Data translation functionality to use Wikimedia project dumps. Jan 8 2025, 11:25 PM
AndrewTavis assigned this task to Afi570.
AndrewTavis updated the task description.

Week 5 (Jan. 6 - 10, 2025)

Tasks Completed:

  • Modified translations and forms to be keyed by their unique ID.
  • Added a unique forms check for the Wikidata lexeme dump.
  • Modified the translation command and fixed bugs in it.
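
To make the idea of keying translations by a unique ID concrete, here is a small sketch. It assumes the Wikibase lexeme JSON layout (`id`, `lemmas`, and `senses` with per-language `glosses`) and assumes that translations are read from sense glosses. The function name and output shape are hypothetical and are not Scribe-Data's actual format.

```python
from typing import Dict


def glosses_by_lexeme_id(lexeme: dict, source_lang: str) -> Dict[str, dict]:
    """Map a lexeme ID to its lemma and per-language sense glosses."""
    lemma = lexeme.get("lemmas", {}).get(source_lang, {}).get("value")
    if lemma is None:
        # The lexeme has no lemma in the requested language.
        return {}

    translations: Dict[str, str] = {}
    for sense in lexeme.get("senses", []):
        for lang, gloss in sense.get("glosses", {}).items():
            # Keep the first gloss seen per language; later senses are ignored.
            translations.setdefault(lang, gloss["value"])

    return {lexeme["id"]: {"lemma": lemma, "translations": translations}}
```

Keying on the lexeme ID rather than the lemma means two lexemes with the same spelling cannot overwrite each other.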

Learnings:

  • Learned to manage large tasks by breaking them into smaller parts and tackling them step by step.
  • Effective communication with mentors has helped me gain a better understanding of my work.
  • Learning GitHub Actions and exploring how to automate manual processes.

Week 6 (Jan 13th - Jan 17th)

Tasks Completed:

Learnings:

  • Understanding the flow of GitHub Actions and how we can use them to automate manual processes.

Week 8 (Jan 27 - 31) Update:

Tasks Completed:
Wrote blog posts about how to start with Scribe.

Added a sub-language filter for translations and forms.
Implemented air configuration for hot reload in scribe-server.

Week 9 (Feb 3 - 7) Update:

Tasks Completed:
Changed paths for oapi-codegen, which moved GitHub organizations.
Set up air for development.

Tasks yet to be completed:
Create the database schema for Scribe-Server.

Week 10 (Feb 10 - 14) Update:

Tasks Completed:
Modified the translations database and forms table to include lastModified. Refactored data_to_sqlite for translations and forms.
Took the initial steps for migrating data from MySQL files into MariaDB.
Migrated the SQL data into MariaDB.
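
A minimal sketch of what tracking lastModified on a translations table could look like, using Python's built-in sqlite3 module. Only the lastModified column name comes from the update above; the table layout, the other column names, and the `upsert_translation` helper are assumptions for illustration and are not the schema used by data_to_sqlite.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional


def upsert_translation(
    conn: sqlite3.Connection,
    lexeme_id: str,
    word: str,
    translation: str,
    last_modified: Optional[str] = None,
) -> None:
    """Insert or replace one translation row and stamp it with lastModified."""
    if last_modified is None:
        # Default to the current UTC time in ISO 8601 format.
        last_modified = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS translations ("
        "lexemeID TEXT PRIMARY KEY, word TEXT, translation TEXT, lastModified TEXT)"
    )
    # The primary key makes a second write for the same lexeme replace the row.
    conn.execute(
        "INSERT OR REPLACE INTO translations VALUES (?, ?, ?, ?)",
        (lexeme_id, word, translation, last_modified),
    )
    conn.commit()
```

A timestamp per row lets a downstream consumer request only the rows that changed since its last sync.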

Tasks yet to be completed
Will create the route logic for scribe-server.

Week 11 (Feb 17 - 21) Update:

Improved the testing for the get and total commands and fixed a condition in the get command.
Added dump accessibility for the get command in interactive mode.

Helped fix some known issues and combined genders into comma-separated strings for noun queries.
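
The gender-combining step can be pictured with the short sketch below. It assumes that query results arrive as (noun, gender) pairs, with one row per gender when a noun has several. The function name and input shape are hypothetical, not Scribe-Data's actual code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def combine_genders(rows: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Merge repeated (noun, gender) rows into one comma-separated string per noun."""
    genders: Dict[str, List[str]] = defaultdict(list)
    for noun, gender in rows:
        # Preserve first-seen order and drop duplicate genders.
        if gender not in genders[noun]:
            genders[noun].append(gender)
    return {noun: ", ".join(values) for noun, values in genders.items()}
```

For example, the German noun "See" is both masculine and feminine, so two result rows would collapse into the single value "masculine, feminine".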

Week 12 (Feb 24 - 27) Update:

Finished expanding the Scribe-Data translation functionality to use Wikimedia project dumps.

Added tests and other CLI commands.

Also planning how we want to implement the route logic for scribe-server.

I'll also be writing the remaining blog posts in the coming days.

Assuming that this task is resolved as Outreachy 29 is over.

And once more, congratulations to @Afi570! Such amazing work in this internship and now mentoring in the current Outreachy round 😊 Thank you for all of your efforts!