
GSoC '24 Proposal: Refactor Scribe-Data into a multi purpose Wikidata language pack CLI tool
Closed, Declined (Public)

Description

Profile Information

Name: Jacob Kondo
GitHub: https://github.com/Jk40git
Country: Ghana
Time zone: (GMT+00:00) UTC
Working hours: 9:00-16:00 UTC

Project Size:

  • Large

Synopsis

Scribe-Data processes data from Wikidata and other sources and prepares it for the Scribe applications to use. Currently it is run locally for every new release of Scribe-iOS. My main goal for this proposal is to turn it into a command line interface that anyone can use to easily get language data from Wikidata.

Code snippets:

Current data process

python3 src/scribe_data/extract_transform/wikidata/update_data.py '["French", "German"]' '["nouns", "verbs"]'

update_data.py is run for every released version

Data process after the project

python3 scribe-data --update --language French --word-type nouns
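
As a rough sketch of how the proposed command could be parsed, assuming Python's standard argparse module: the flag names come from the command above, while the function names, the multi-value behaviour and the returned dictionary are hypothetical and not part of Scribe-Data.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the proposed flags (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        prog="scribe-data",
        description="Get language data packs from Wikidata (sketch only).",
    )
    parser.add_argument(
        "--update", action="store_true", help="refresh data from the sources"
    )
    # Assumption: each flag accepts one or more values, e.g. French German.
    parser.add_argument("--language", nargs="+", default=[], help="language names")
    parser.add_argument("--word-type", nargs="+", default=[], help="e.g. nouns verbs")
    return parser


def main(argv=None) -> dict:
    """Parse the arguments and return the request that would be processed."""
    args = build_parser().parse_args(argv)
    # A real implementation would call the query and formatting code here.
    return {
        "update": args.update,
        "languages": args.language,
        "word_types": args.word_type,
    }


# Explicit argument list so the sketch does not depend on sys.argv.
print(main(["--update", "--language", "French", "--word-type", "nouns"]))
```

Passing several values, for example `--language French German`, would cover the two-language case of the current script.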

This project will help:
- To facilitate the extraction of language data from Scribe-Data’s sources
- To get information from the sources (Wikidata, Wikipedia, Hugging Face, Unicode, ...) with a simple command
- To provide a service whereby packs of language data can be extracted from Wikidata via a CLI that is installed on a person's computer via pip, Conda Forge or by cloning the Git repository
- To restructure the output to be less Scribe focused and more general
- To improve the translation process by filtering out bad translations via benchmarking accuracy metrics
- To create a database from which the applications can make calls to get new data, with us running Scribe-Data on Scribe-Server every two weeks to get new language data from the data sources

Benefits to the community

By working on this project I will be able to contribute to bringing more languages to Scribe using the abundant Wikidata data stores. The GSoC visibility can help Scribe and Wikidata reach more linguistic communities while further showing the potential of Wikidata as a communal data source. The work I would do during GSoC would also benefit Wikidata and the broader Wikimedia communities by improving one of the strongest uses of Wikidata in mobile applications.

Have you contacted your mentors already? Yes

Deliverables

General

- Show timely and consistent work throughout
- Document changes and the reasons behind them
- Break down the project into smaller goals
- Test the data processes
- Ensure proper commit messages and well-commented code to streamline PR reviews
- Communicate GSoC milestones via Phabricator and other available means
- Finish GitHub issues and write up some blog posts

Before Mid-term Evaluation

- Have the language files/database for the other Scribe languages ready
- Convert the way the project is used, add tests for it, allow it to be installed via Anaconda, and use Docker to set up development environments
- Have regular check-ins with the project mentor to update the status

Before Final Evaluation

- Add language data for all the languages discussed with the mentor
- Test the process by which we get data such as nouns, adjectives and verbs for a given language and then format them for wider use
- Refactor the way we get the data into a CLI

Timeline

Community Bonding:

Getting acquainted with the code base of Scribe-Org and understanding how the various repositories work together beyond Scribe-Data. Checking in with mentors and contributors to create a plan for GSoC, including MVP, medium and long term goals as well as a work schedule that includes check-ins and documentation of my work. Discussing with the project mentor the general skeleton of adding new languages to the keyboard and how we can refactor it into a CLI. Setting up the environment to work with Scribe-Data, including extensions for code formatting and virtual environment use.

Week 1:

Look into ways of adding in a cutoff for the translations.
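
One possible shape for such a cutoff, as a minimal sketch: it assumes each candidate translation already comes with an accuracy score between 0 and 1 from whichever benchmark is chosen. The function name, the default cutoff and the sample scores are made up for illustration.

```python
def filter_translations(scored: dict, cutoff: float = 0.7) -> dict:
    """Keep only translations whose score is at or above the cutoff.

    `scored` maps a source word to a (translation, score) pair. How the
    score is computed is left open; it is an assumption of this sketch.
    """
    return {
        word: translation
        for word, (translation, score) in scored.items()
        if score >= cutoff
    }


# Made-up scores, purely for illustration.
candidates = {
    "chien": ("dog", 0.95),
    "chat": ("cat", 0.91),
    "avocat": ("avocado", 0.40),
}
print(filter_translations(candidates))
```

The cutoff value itself would need to be tuned against real benchmark results rather than fixed in advance.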

Week 2:

Explore the Scribe-Data codebase to gain a holistic understanding. Set up and explore the SQLite databases. Look into how to include documentation for Scribe-Data as a CLI.

Week 3-4:

Understand the Wikidata Query Service and its intricacies. Look into the Python scripts in Scribe-Data to understand data organization and processing. Research the formatting process so that we can maintain a unique identifier for later data updates, where we only update the data that needs updating rather than all of it.
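
To illustrate the idea of updating only what changed, here is a minimal sketch using Python's built-in sqlite3 module. It assumes the Wikidata lexeme ID serves as the unique identifier; the table layout, the function name and the sample IDs are hypothetical and do not reflect the actual Scribe-Data schema.

```python
import sqlite3


def sync_nouns(conn: sqlite3.Connection, records: dict) -> dict:
    """Insert new rows and update changed ones; leave identical rows alone.

    `records` maps a lexeme ID (the stable key) to a (singular, plural) pair.
    Returns how many rows were inserted, updated or left unchanged.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nouns ("
        "lexeme_id TEXT PRIMARY KEY, singular TEXT, plural TEXT)"
    )
    counts = {"inserted": 0, "updated": 0, "unchanged": 0}
    for lexeme_id, (singular, plural) in records.items():
        row = conn.execute(
            "SELECT singular, plural FROM nouns WHERE lexeme_id = ?",
            (lexeme_id,),
        ).fetchone()
        if row is None:
            conn.execute(
                "INSERT INTO nouns VALUES (?, ?, ?)", (lexeme_id, singular, plural)
            )
            counts["inserted"] += 1
        elif row != (singular, plural):
            conn.execute(
                "UPDATE nouns SET singular = ?, plural = ? WHERE lexeme_id = ?",
                (singular, plural, lexeme_id),
            )
            counts["updated"] += 1
        else:
            counts["unchanged"] += 1
    conn.commit()
    return counts


# Made-up IDs and forms; the second run corrects one row and adds one.
conn = sqlite3.connect(":memory:")
first = sync_nouns(conn, {"L1": ("chien", "chiens"), "L2": ("chat", "chat")})
second = sync_nouns(
    conn,
    {"L1": ("chien", "chiens"), "L2": ("chat", "chats"), "L3": ("maison", "maisons")},
)
print(first, second)
```

Because every row is looked up by its identifier, a later run touches only the rows that are new or different.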

Week 5-6:

Learn modern data processes, ETL processes (extract, transform, load) and how to work with Wikidata using the Python package SPARQLWrapper.
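
A small sketch of the extract and transform steps with SPARQLWrapper, assuming the public Wikidata Query Service endpoint and the lexeme data model, where Q1084 is the item for "noun" and Q150 is French. The function names and the user agent string are hypothetical. SPARQLWrapper is imported inside the fetch function so that the query builder and the transform step can be tried without network access.

```python
def build_noun_query(language_qid: str, limit: int = 10) -> str:
    """Build a SPARQL query for noun lexemes of one language."""
    return f"""
SELECT ?lexeme ?lemma WHERE {{
  ?lexeme dct:language wd:{language_qid} ;
          wikibase:lexicalCategory wd:Q1084 ;
          wikibase:lemma ?lemma .
}}
LIMIT {limit}
"""


def rows_from_results(results: dict) -> list:
    """Transform step: flatten SPARQL JSON results into (lexeme_id, lemma)."""
    rows = []
    for binding in results["results"]["bindings"]:
        # The lexeme value is a URI ending in the lexeme ID.
        lexeme_id = binding["lexeme"]["value"].rsplit("/", 1)[-1]
        rows.append((lexeme_id, binding["lemma"]["value"]))
    return rows


def fetch_nouns(language_qid: str, limit: int = 10) -> list:
    """Extract step: run the query against the Wikidata Query Service."""
    from SPARQLWrapper import JSON, SPARQLWrapper  # third-party package

    sparql = SPARQLWrapper(
        "https://query.wikidata.org/sparql",
        agent="scribe-data-sketch/0.1 (example user agent)",  # hypothetical
    )
    sparql.setQuery(build_noun_query(language_qid, limit))
    sparql.setReturnFormat(JSON)
    return rows_from_results(sparql.query().convert())


# Offline check of the transform step with a hand-written sample result.
sample = {
    "results": {
        "bindings": [
            {
                "lexeme": {"value": "http://www.wikidata.org/entity/L1"},
                "lemma": {"value": "exemple"},
            }
        ]
    }
}
print(rows_from_results(sample))
```

Calling `fetch_nouns("Q150")` would then perform the live query; the load step, such as writing the rows to a database, is left out here.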

Week 7-9:

Research and document the various Python tools for building a CLI and how to implement them in the project.
Write more Wikidata queries and formatting processes to expand the reach of the service to non-Scribe languages.

Week 10-11:

Test the CLI with different languages, develop the test suite for Scribe-Data, and identify and fix bugs as needed.

Week 12-13:

Debugging and additional work that might be required.

Extras:

I would like to help later on in the making of the API calls in the development of Scribe-Server.

Participation

I will communicate through Zulip chat or Element.
My code will be published to scribe-org/Scribe-Data.

About Me

I have just completed my Diploma in Software Engineering and am planning to further my education. I speak French and English; I love learning new things and I am a bit inquisitive. In my free time I play guitar and bass.
I learned about the program via the Scribe community, which I joined after the Tech Safari event.
I will probably be doing a part-time job during the program.

Past Experience

I joined Scribe-org in mid-February 2024. My first contribution was to standardize function docstrings across Scribe-Data, bringing them to a consistent style of documentation throughout the codebase. I also created the base noun and verb SPARQL queries for Scribe-Data, which are added to Scribe-iOS.

I have also been contributing to Wikipedia since 2020, editing and creating articles.

I contributed to OpenStreetMap by mapping some areas in Accra, and I help translate weeklyOSM from English to French.

My training coursework has given me skills in various languages and technologies such as Python, JavaScript, MySQL, React.js and Node.js. During my course I created a shopping cart application using JavaScript.

Potential Personal Takeaways

Through this project I will be able to gain a more holistic understanding of the organization. This would allow me to work with large amounts of data and learn how to code in other programming languages.
This would be a wonderful opportunity for me to practice and learn different project management skills, as well as to work with SPARQLWrapper. I would also gain hands-on experience of working with the Wikidata Query Service and other online APIs. This would allow me to work with abundant data in any personal projects that may follow. It would also ingrain in me the good practices that come with working in an accountable environment. Meeting more open-source peers would also bring networking opportunities and a larger learning landscape.

Event Timeline

Jacob4code renamed this task from GSoC '24 Proposal: Create French language to all other Scribe languages translation process. to GSoC '24 Proposal: Create French language to all other Scribe languages translation process.(WIP). Feb 26 2024, 10:33 PM
Jacob4code renamed this task from GSoC '24 Proposal: Create French language to all other Scribe languages translation process.(WIP) to GSoC '24 Proposal: Refactor into a multi purpose Wikidata language pack CLI tool. Mar 29 2024, 6:31 PM
Jacob4code updated the task description.
AndrewTavis renamed this task from GSoC '24 Proposal: Refactor into a multi purpose Wikidata language pack CLI tool to GSoC '24 Proposal: Refactor Scribe-Data into a multi purpose Wikidata language pack CLI tool. Mar 29 2024, 7:57 PM
AndrewTavis closed this task as Declined. Edited Jun 9 2024, 2:34 AM

Cleaning up the affects-scribe-org board now. Thanks for your application, @Jacob4code! Wishing you all the best :)