=== Profile Information
Name: Wenqi Du
Github: [Linfye](https://github.com/Linfye)
Location: China
Time Zone: UTC+08:00
Typical working hours: 14:00 to 22:00 (UTC+08:00)
=== Synopsis
- Scribe is an open-source community that develops software to help people learn languages. So far, Scribe has developed a keyboard for iOS and iPadOS that assists language learners while they type.
- Scribe-Data is a program that retrieves language data from Wikidata, Wikipedia, HuggingFace, and Unicode for use in applications. However, the current retrieval process is manual, and each source is handled separately.
- This project aims to create a universal entry point, in the form of a CLI, that gives users easy access to this data for various purposes. A further goal is to make the project deployable via Docker to simplify environment management.
- **Possible Mentors:**
- Will Yoshida, Andrew McAllister, Henrik Thomasson
- **Have you contacted your mentors already?**
- Yes.
### Deliverables
- **CLI Tool:** a user-friendly command-line tool that lets users supply query parameters and retrieve language data. The entry point will be implemented with the argparse library. When the program is launched with the command `scribe-data`, the CLI will parse the command entered by the user and call the corresponding script execution functions within Scribe-Data. For example, to obtain all English verbs and their variations, the user can execute `query --type verb --language English --format JSON --path [opt]`. This command first searches for the relevant files locally. If they are not found, or if the user requests the latest data, the script queries the Wikidata Query Service with SPARQL to download the data. It then uses conversion scripts to transform the data into the formats requested by the community, which so far are JSON and TSV. The final project will include other similar commands, such as comparing translations between two languages or updating local data. These commands will require further exploration and refinement to ensure completeness.
- **Data Integration and Formatting:** modules that consolidate data from the various sources into a unified data structure and write it out in an appropriate format such as plain text, JSON, XML, or TSV. Python libraries already provide many tools for these conversions.
- **Docker Containerization:** a Dockerfile for packaging the CLI tool and all its dependencies into a Docker image.
- **Documentation and Examples:** detailed documentation, including installation instructions, usage examples, and command-line parameter explanations, to help users get started quickly with the CLI tool.
- **Test Suite:** a set of test cases to validate the functionality of the CLI tool under various scenarios and edge cases.
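
As a rough illustration of the CLI entry point described in the first deliverable, the sketch below wires a `query` subcommand to `argparse`. The flag names follow the example command above. The `build_parser`, `run_query`, and `main` functions, the restriction of `--format` to JSON and TSV, and the default values are assumptions made for illustration and are not part of the existing Scribe-Data code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical top-level parser for the `scribe-data` command."""
    parser = argparse.ArgumentParser(
        prog="scribe-data",
        description="Retrieve language data (illustrative sketch).",
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    query = subparsers.add_parser("query", help="Retrieve data for one word type.")
    query.add_argument("--type", dest="word_type", required=True, help="Word type, e.g. verb.")
    query.add_argument("--language", required=True, help="Language name, e.g. English.")
    # JSON and TSV are the formats named in this proposal; str.upper makes the flag case-insensitive.
    query.add_argument(
        "--format",
        dest="output_format",
        type=str.upper,
        choices=["JSON", "TSV"],
        default="JSON",
    )
    query.add_argument("--path", default=None, help="Optional output directory.")
    return parser


def run_query(args: argparse.Namespace) -> str:
    """Placeholder for the real dispatch into the Scribe-Data scripts (assumed)."""
    return f"query {args.word_type} in {args.language} as {args.output_format}"


def main(argv=None) -> str:
    """Parse the arguments and route them to the matching command function."""
    args = build_parser().parse_args(argv)
    if args.command == "query":
        return run_query(args)
    raise ValueError(f"Unknown command: {args.command}")
```

Calling `main(["query", "--type", "verb", "--language", "English"])` returns `"query verb in English as JSON"`. Using subparsers keeps each future command, such as a translation comparison or a local data update, in its own small parser.
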
{F43812528}
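
The data formatting deliverable could be sketched with the Python standard library alone. The example assumes that each entry is a flat dictionary and that all entries share the same keys. The `to_json` and `to_tsv` helpers and the sample verb records are hypothetical and only show the shape of the conversion, not the actual Scribe-Data schema.

```python
import csv
import io
import json


def to_json(records: list[dict]) -> str:
    """Serialize records as indented JSON, keeping non-ASCII characters readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_tsv(records: list[dict]) -> str:
    """Serialize records as tab-separated values with a header row."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(records[0].keys()),  # column order follows the first record
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical sample data, not taken from Wikidata.
sample = [
    {"infinitive": "go", "past": "went"},
    {"infinitive": "see", "past": "saw"},
]
```

Here `to_tsv(sample)` yields a header line followed by one line per verb, and `json.loads(to_json(sample))` gives back the original list, so both writers can share one in-memory structure.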
=== Timeline
| Dates | Tasks |
| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| May 1st - May 26th  | Reading the documentation for Wikidata and the Wikidata Query Service to learn the technical details. Diving deeper into the Scribe-Data codebase to understand the current architecture and functionality of the application. I may also complete a few small issues to gain more practical experience. Discussing specific requirements and exploring suitable and efficient CLI interaction methods with my mentor. |
| May 27th - Jun 2nd  | Reviewing the existing code files and identifying common functionality or patterns among them, such as data retrieval from different sources, data processing, and formatting. My main task is to read the various scripts to familiarize myself with the different parts of the current project and with the code style of Scribe-Data. |
| Jun 3rd - Jun 9th | Organizing the codebase into modular components based on the identified common functionalities. Each component should encapsulate related functionality and have well-defined interfaces. |
| Jun 10th - Jun 16th | Creating a new main file or script that serves as the entry point to the project. This script will handle and parse user inputs. Adding a formatting routine that generates output documents according to user requirements. |
| Jun 17th - Jun 23rd | Modifying the existing code files to fit into the modular structure. Extracting common functionality into separate modules and refactoring the code to use these modules. Modifying the code to accept the corresponding external parameters. |
| Jun 24th - Jun 30th | Implementing a CLI in the main script to allow users to interact with the project, using libraries like argparse to parse command-line arguments and invoke the appropriate functions. I won't be very active on this project for the next two weeks as I need to prepare for my end-of-semester exams, but I will catch up on the missed work afterward. |
| Jul 1st - Jul 7th   | Continuing the tasks from the previous week, as this work could be complex and time-consuming. |
| Jul 8th - Jul 14th  | Organizing the current code for review by the mentors and submitting the midterm report. |
| Jul 15th - Jul 21st | Implementing error handling and exception handling mechanisms to gracefully handle unexpected situations or errors during the execution of the program. |
| Jul 22nd - Jul 28th | Testing each module and the unified entry point thoroughly to ensure that they work as expected. Writing unit tests to validate the functionality of individual components and integration tests to verify the interaction between modules. |
| Jul 29th - Aug 4th  | Ensuring that the refactored codebase can be easily containerized using Docker. Writing a Dockerfile to define the environment and dependencies required to run the project. |
| Aug 5th - Aug 11th  | Writing documentation for this project and providing usage examples to help users get started. |
| Aug 12th - Aug 18th | Flextime for handling any previously unfinished tasks. |
| Aug 19th - Aug 26th | Submitting the tested code and completing the final report. |
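
The error handling planned for Jul 15th - Jul 21st might take a shape like the sketch below: one wrapper around every command that turns exceptions into short messages and exit codes instead of raw tracebacks. The `ScribeDataError` class, the `run_safely` function, and the specific exit codes are hypothetical choices for illustration and would be settled with the mentors.

```python
import sys


class ScribeDataError(Exception):
    """Hypothetical base class for errors the CLI expects and reports cleanly."""


def run_safely(command, *args, stream=sys.stderr) -> int:
    """Run a command function and translate failures into process exit codes."""
    try:
        command(*args)
    except ScribeDataError as err:
        # Expected problem (e.g. an unsupported language): short message, no traceback.
        print(f"error: {err}", file=stream)
        return 1
    except Exception as err:
        # Unexpected problem: still exit cleanly, but mark it as a likely bug.
        print(f"unexpected error: {err!r}", file=stream)
        return 2
    return 0
```

The returned integer can be passed to `sys.exit` by the entry point, and the `stream` parameter lets the unit tests capture the message without touching the real standard error.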
=== Participation
- I plan to contact my mentors through Matrix; email is also fine.
- The code will be gradually submitted to the community via pull requests for review by mentors.
=== About Me
- **Education (in progress):**
- I am a third-year undergraduate student majoring in Applied Mathematics at Tianjin University of Technology.
- **How did you hear about this program?**
- I first learned about GSoC through a video shared by a YouTube content creator.
- **Will you have any other time commitments, such as school work, another job, planned vacation, etc, during the duration of the program?**
- No, I don't have any additional plans for my summer break at the moment. I can assure you that I'll complete the workload assigned each week on time.
- **We advise all candidates eligible for Google Summer of Code and Outreachy to apply for both programs. Are you planning to apply to both programs and, if so, with what organization(s)?**
- No, I have only applied for GSoC, and only for this project with the Wikimedia Foundation.
- **What does making this project happen mean to you?**
- Firstly, I aim to familiarize myself with Python application development and the open-source process through this opportunity.
- Secondly, I'm a language enthusiast who enjoys self-learning languages and linguistic theories in my spare time. This project presents an opportunity for both others and myself to learn foreign languages, which aligns with my interests.
- Thirdly, I'm a contributor to the Chinese Wikipedia and would like to contribute to Wikimedia in other areas as well.
=== Past Experience
- **Please add links to any feature or bug fix you have written for a Wikimedia project during the application phase.**
- [Scribe-Data pull 111](https://github.com/scribe-org/Scribe-Data/pull/111)
- [Scribe-Data pull 81](https://github.com/scribe-org/Scribe-Data/pull/81)
- **Describe any relevant projects that you've worked on previously and what knowledge you gained from working on them.**
- In my Python fundamentals course, I've completed several major assignments, including web scraping, basic data structures, and algorithms.
- I have previously utilized Google Colab to complete assignments involving deep learning algorithms for image and video processing.
- **Describe any open source projects you have contributed to as a user and contributor (include links).**
- I'm a beginner in open source, with no previous similar experience.
=== Post GSoC
I have read the full architecture diagram of Scribe-Data and, if possible, I'd like to keep contributing to complete this project for language lovers.
Scribe-Data is responsible for fetching and storing data in the database, while Scribe-Server provides a set of APIs that let applications conveniently retrieve data from the database. I will therefore start by solving some small issues to gain practical experience and become familiar with Scribe-Server and backend development in Go.