Brief summary
The Scribe community uses Wikimedia-based data to create software for language learners. The community's main user-facing application is Scribe-iOS, a collection of keyboards for second language learners that can be used in any app to translate words, conjugate verbs and much more! The community is now also working on Scribe-Android and Scribe-Desktop.
The processes by which the Scribe community derives Wikimedia data live in the Scribe-Data project, which until now has drawn on Wikidata lexicographical data and Wikipedia texts. Scribe-Data is a Python-based command line interface, with usage examples including:
- Getting all nouns, their genders and their plurals for a given language from Wikidata
- Getting all verbs and needed conjugations from Wikidata
- Generating autosuggestions for Scribe keyboards via Wikipedia dumps
Translation is an important feature of the end-user Scribe applications, but until now it has relied on Hugging Face-based machine translation, which is quite time-intensive. This Outreachy project will focus on adding translation functionality to Scribe-Data that is based on Wikimedia project dumps: Wikidata if the available lexeme data allows, or Wiktionary if not. This data will then be used to add functionality to downstream Scribe applications as well as others making use of Scribe-Data. Specifically, this project will add the following commands to the Scribe-Data command line interface:
- An improved version of the translate functionality that will parse Wikimedia project dumps for all translations for any language
  - Data outputs should be formatted both for Scribe-Data end users and for Scribe-iOS/Android (the mentors will explain the required formats)
- Potentially also similar functionality for deriving synonyms of words
- If using Wikidata dumps, then the results of this process should mirror Scribe-Data's SPARQL query results as closely as possible, and dumps should be offered to the user as an alternative data source for current Scribe-Data functionality
  - This allows for experimentation and for using Scribe-Data without large requests to the Wikidata Query Service
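As a rough illustration of the dump-based approach described above, the following is a minimal sketch of streaming lexemes out of a Wikidata JSON dump, not Scribe-Data's actual implementation. It relies on the dump being a single JSON array with one entity per line; the function names, the `open_dump` helper and the sample entities (including their IDs) are hypothetical and only serve the example.

```python
import gzip
import json
from typing import Iterable, Iterator


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one entity dict per line of a Wikidata-style JSON dump.

    The dump is one JSON array with an entity on each line, so every line
    can be parsed on its own once the trailing comma is removed.
    """
    for line in lines:
        line = line.strip().rstrip(",")
        # Skip blank lines and the brackets that open and close the array.
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def iter_lexemes(lines: Iterable[str], language_qid: str) -> Iterator[dict]:
    """Yield only the lexemes whose language matches the given Wikidata QID."""
    for entity in iter_entities(lines):
        if entity.get("type") == "lexeme" and entity.get("language") == language_qid:
            yield entity


def open_dump(path: str):
    """Hypothetical helper: open a gzip-compressed dump as a text stream."""
    return gzip.open(path, "rt", encoding="utf-8")


# Tiny in-memory stand-in for a dump file (illustrative entities only).
SAMPLE_LINES = [
    "[",
    '{"type": "lexeme", "id": "L1", "language": "Q188", '
    '"lemmas": {"de": {"language": "de", "value": "Buch"}}},',
    '{"type": "lexeme", "id": "L2", "language": "Q1860", '
    '"lemmas": {"en": {"language": "en", "value": "book"}}}',
    "]",
]

# Q188 is the Wikidata item for German.
german_lemmas = [
    lexeme["lemmas"]["de"]["value"] for lexeme in iter_lexemes(SAMPLE_LINES, "Q188")
]
```

Because both functions are generators, only one line of the dump is held in memory at a time. With a real file, the same code could be driven by `iter_lexemes(open_dump(path), "Q188")`, where `path` points to a locally downloaded dump.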
The above processes will need unit tests to make sure that future changes to the code do not cause breaking changes. Efficient parsing of Wikimedia project dumps or other data sources will also be key to the success of this project. The tasks above are the confirmed goals for this project, with aspirational goals being set by mentors and the mentee once the program starts.
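To give a sense of what such unit tests might look like, here is a small sketch using Python's built-in `unittest` module. The `sense_glosses` function is a hypothetical helper that reads the glosses attached to a lexeme's senses; using glosses as a stand-in for translations is an assumption made for this example, and the project's real extraction logic may differ.

```python
import unittest
from typing import List


def sense_glosses(lexeme: dict, target_lang: str) -> List[str]:
    """Return the gloss of each sense of `lexeme` in `target_lang`, if present."""
    glosses = []
    for sense in lexeme.get("senses", []):
        gloss = sense.get("glosses", {}).get(target_lang)
        if gloss is not None:
            glosses.append(gloss["value"])
    return glosses


class SenseGlossesTest(unittest.TestCase):
    def setUp(self):
        # A hand-written lexeme in the shape used by Wikidata's JSON
        # (the ID and content are illustrative, not real data).
        self.lexeme = {
            "type": "lexeme",
            "id": "L1",
            "lemmas": {"de": {"language": "de", "value": "Buch"}},
            "senses": [
                {
                    "id": "L1-S1",
                    "glosses": {
                        "en": {"language": "en", "value": "book"},
                        "fr": {"language": "fr", "value": "livre"},
                    },
                }
            ],
        }

    def test_returns_gloss_for_available_language(self):
        self.assertEqual(sense_glosses(self.lexeme, "en"), ["book"])

    def test_returns_empty_list_for_missing_language(self):
        self.assertEqual(sense_glosses(self.lexeme, "es"), [])

    def test_handles_lexeme_without_senses(self):
        self.assertEqual(sense_glosses({"id": "L2"}, "en"), [])
```

Saved in a test module, these cases can be run with `python -m unittest`. Small hand-written fixtures like the one in `setUp` keep the tests fast and independent of any downloaded dump.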
Note: The decision was made to use Wikidata dumps as the data source for the project.
Skills required
- Skills in Python
- Prior experience working with Wikimedia information would be a plus
- Project tag: affects-scribe-org
Possible mentor(s)
Microtasks
- Issues for Scribe-Data
  - We'll be making more issues in the coming weeks to add more languages to the CLI's functionality
- Any issues for other Scribe projects
  - Note that working on Scribe-iOS requires coding on macOS so that you'll have access to Xcode
Please look for the good first issue or help wanted tag in all projects! We'll be happy to help you onboard :)
Communication
Please join our community Matrix spaces to chat with the team and learn more about Scribe! If you haven't used Matrix before, we'd suggest Element as your client. Specifically, we have rooms for Scribe-Data and for Mentorship programs. During the program your mentors will be happy to communicate with you on GitHub or via Matrix. You'll also be invited to the Scribe bi-weekly developer calls, where you'll have time to present your progress and work with the team on any problems. Calls and check-ins outside of the syncs can also happen if needed :)