# Profile Information
Name: Shashank Mittal
Email: [shashank.mittal.mec22@itbhu.ac.in](mailto:shashank.mittal.mec22@itbhu.ac.in)
Education: Indian Institute of Technology (BHU), Varanasi
Github: [shashank-iitbhu](https://github.com/shashank-iitbhu)
LinkedIn: [Shashank Mittal](https://www.linkedin.com/in/shashankmittal27/)
Phabricator: [Shashankmittaliitbhu](https://phabricator.wikimedia.org/p/Shashankmittaliitbhu/)
Other communication Channels: [Zulip](https://wikimedia.zulipchat.com/#user/698100), [Alternate email](mailto:latashashank.3@gmail.com)
Location: Varanasi, India
Typical Working Hours: 9 am to 10 pm IST (UTC+5:30)
# Synopsis
The goal of this project is to refactor the existing Scribe-Data scripts into a multi-purpose Wikidata language pack CLI tool. The tool will allow users to easily download, extract, process, and update language data from Wikipedia, Wikidata, and Unicode sources.
Mentors:
- Primary: Will Yoshida (@wkyoshida)
- Secondary: Andrew McAllister (@AndrewTavis)
- Tertiary: Henrik Thomasson (@Henrikt93)
> Have you contacted your mentors already?
Yes
# Deliverables
#### Generalize Scribe-Data
Refactor the existing Scribe-Data process to be more structured for the CLI tool and usable for a wider range of applications beyond Scribe's keyboard apps.
#### CLI Tool Development
Develop a CLI tool that allows users to fetch, process, and format language data from Wikidata with ease.
Set up the CLI framework with the specified commands and flags for language pack generation.
#### Expand Language Coverage
Increase the number of languages and word types supported by the tool, making it more inclusive and versatile.
#### Language Pack Generation
Implement functionality to generate language packs containing words and translations, with support for export formats such as JSON, TSV, XML, ZIP, and SQLite.
#### PyMultiDictionary Integration
Utilize PyMultiDictionary to enhance language packs with comprehensive translations and synonyms from multiple online dictionaries. ([Documentation Link](https://pypi.org/project/PyMultiDictionary/))
Note: this is an alternate solution for obtaining synonyms and antonyms if Wiktionary is not used.
#### Testing and Documentation
Develop a comprehensive test suite for the CLI tool and provide detailed documentation for users and contributors.
# Implementation Strategy
#### Modularise and modify existing files
- Modify `update_data.py` to be more accessible via the CLI tool; currently, passing multiple arguments on the command line is cumbersome.
Here’s a quick look at how **scribe-data-cli** will work:
```
(scribedev) Scribe-Data % scribe-data-cli update-data --languages "French" --word_types "nouns"
Data updated: 0%| | 0/1 [00:00<?, ?dirs/s]Querying and formatting French nouns
Wrote file nouns.json with 17,442 nouns.
Data updated: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00, 5.24s/dirs]
```
- The same modifications are proposed for files like `process_wiki.py`, `extract_wiki.py`, `process_unicode.py`, `data_to_sqlite.py` and `send_dbs_to_scribe.py`.
- Modularise `gen_autosuggestions.ipynb` and `gen_emoji_lexicon.ipynb`.
- Modify translation scripts to accept arguments via the CLI.
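As an illustration of the direction, here is a minimal sketch of how the update process could parse its arguments with `argparse`; the flag names mirror the proposal and are assumptions, not the current `update_data.py` interface:

```python
import argparse


def parse_update_args(argv=None):
    """Parse arguments for the update process.

    Hypothetical sketch: flag names follow the proposal above,
    not the existing update_data.py interface.
    """
    parser = argparse.ArgumentParser(prog="scribe-data-cli update-data")
    parser.add_argument(
        "--languages", nargs="+", default=["all"],
        help="Languages to update (default: all).",
    )
    parser.add_argument(
        "--word-types", nargs="+", default=["all"],
        help="Word types to update, e.g. nouns, verbs (default: all).",
    )
    return parser.parse_args(argv)
```

With `nargs="+"`, a call like `scribe-data-cli update-data --languages French German` yields clean Python lists instead of the tricky quoted strings the current scripts require.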
#### WDQS queries and formatting scripts
- Expand SPARQL queries to include prepositions for languages where they are not yet implemented.
- Note: German and Russian prepositions have already been implemented, given that grammatical case in these languages depends on the preposition.
- Develop formatting scripts for new languages and data types, such as adjectives and prepositions.
#### Support extraction of new word data types from Wiktionary
- Relevant Issues [#18](https://github.com/scribe-org/Scribe-Data/issues/18) (Add synonym data to Wikidata) and [Scribe-iOS #20](https://github.com/scribe-org/Scribe-iOS/issues/20) (Add Synonym Command)
- Synonyms for many words can be found on Wiktionary, but this Wikimedia service is not set up for optimal machine reading. Ideally, this information would be ported to Wikidata so that it could be queried simply.
- Synonym data would be a valuable future addition to Scribe-Data language packs. Given the project's focus on Wikidata, however, this will likely need to wait: the migration of this information from Wiktionary instances and other sources is not yet far enough along to make it a priority.
#### PyMultiDictionary Implementation
- PyMultiDictionary is a dictionary module for Python 3+ to get meanings, translations, synonyms and antonyms of words in 20 different languages. It uses educalingo.com, synonym.com, and WordNet for getting meanings, translations, synonyms, and antonyms. ([Documentation Link](https://pypi.org/project/PyMultiDictionary/))
- This implementation is subject to the availability of free requests; the package uses WordNet for meanings, which, as far as I know, is free.
- Seven of the eight languages currently implemented in Scribe-Data are among these 20 supported languages; Swedish is the exception.
- PyMultiDictionary is heavily tested for English words, so even if we cannot get the meaning, synonyms, or antonyms of a word in a given language, we can fall back to querying its English translation.
#### Why a CLI tool?
##### Impact on Scribe Data
The CLI will streamline the process of fetching, processing, and updating language data, making the Scribe-Data project more efficient and accessible.
##### Impact on Scribe-Server
The CLI will enable seamless updates and management of language packs, ensuring that Scribe-Server always serves the most up-to-date data to Scribe applications.
##### Impact on The Wikimedia Foundation
This CLI tool will make it easier for other projects within the Wikimedia Foundation to access and utilize this language data.
##### Impact on organizations outside the Scribe and Wikimedia ecosystems
This tool can be used for educational purposes, language research, or any project that requires access to structured language data.
#### Set up the CLI framework
- Use **`argparse`** [(Link to the Documentation)](https://docs.python.org/3/library/argparse.html) to set up the CLI framework with the specified commands and flags.
- Why `argparse` ?
- There are several options for parsing command-line arguments in Python, such as `click` and `typer`.
- `argparse` is a powerful module in Python for parsing command-line arguments. It allows the creation of user-friendly command-line interfaces and provides a simple way to specify the type of arguments, default values, and help messages.
- It's part of the Python Standard Library, which means it doesn't require any additional installation and is readily available for use.
- Since `checkquery.py` already uses `argparse`, standardizing on it throughout the project makes sense.
- Proposed directory structure for the CLI tool:
```
scribe_data/
├── src/
│   └── scribe_data/
│       ├── __init__.py
│       └── cli/
│           ├── __init__.py
│           ├── commands.py       # Main command-line interface logic.
│           ├── cli_utils.py      # Helper functions and utilities specific to the CLI.
│           ├── language_pack.py  # Generating and managing language packs.
│           ├── data_fetch.py     # Fetching data from Wikidata and other sources.
│           └── data_process.py   # Processing and cleaning the fetched data.
```
- Configure `setup.py` for CLI and define the entry-points for the CLI.
- Running `pip install .` from the project root builds the package along with the CLI.
- Create a Homebrew formula for the CLI. This gives users a convenient way to install and manage the tool, ensuring they always have the latest version and can easily update when new releases are available.
- Offering users multiple options to install and manage the CLI tool would be a plus.
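To make this concrete, the CLI entry point could be declared in `setup.py` under `console_scripts` as something like `scribe-data-cli = scribe_data.cli.commands:main` (a hypothetical module path), with subcommands dispatched via `argparse`. A minimal sketch, assuming the command and flag names from this proposal rather than a final interface:

```python
import argparse


def build_parser():
    """Top-level parser with subcommands for the proposed CLI.

    Sketch only: command and flag names follow this proposal
    and are assumptions, not the final interface.
    """
    parser = argparse.ArgumentParser(prog="scribe-data-cli")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # `update-data` subcommand, mirroring the example output above.
    update = subparsers.add_parser("update-data", help="Update language data from Wikidata.")
    update.add_argument("--languages", nargs="+", default=["all"])
    update.add_argument("--word-types", nargs="+", default=["all"])

    # `generate-pack` subcommand for language pack generation.
    pack = subparsers.add_parser("generate-pack", help="Generate a language pack.")
    pack.add_argument("--language", default="all")
    pack.add_argument(
        "--output-format",
        choices=["json", "tsv", "xml", "zip", "sqlite"],
        default="json",
    )

    return parser
```

Each subcommand would then map to a handler function in `commands.py`, keeping parsing and business logic separate.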
#### Implement flags and command handlers
{F44052603, layout=center, float, size=full, alt="CLI flowchart"}
##### Language Data Based on Data Type and Language
- `--lang`: Specify the language to fetch or process.
- `--data-type`: Specify the type of data to fetch or process (e.g., nouns, verbs, prepositions).
- Not specifying any data type would run the process for all the data types.
##### Language Pack Generation
- `--generate-pack`: Generate complete language packs.
- `--language`: Specify the language for which to generate the language pack (default: all languages).
- `--output-format`: Specify the output format for the language pack (e.g., JSON, TSV, XML, ZIP, or SQLite).
- `--include-translations`: Include translations in the language pack.
- `--include-data-types`: Include specific data types.
- `--include-synonyms`: Include synonyms for words in the language pack using Wiktionary.
- `--include-antonyms`: Include antonyms for words in the language pack using Wiktionary.
##### Translation
- `--translate`: Enable translation functionality.
- `--source-lang`: Specify the source language for translation.
- `--target-lang`: Specify the target language for translation (default: all languages).
- `--batch-size`: Specify batch size for batch processing of words.
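A minimal sketch of the batching behind a hypothetical `--batch-size` flag; `batched` is an illustrative helper, not an existing Scribe-Data function:

```python
def batched(words, batch_size):
    """Yield successive batches of words for batch translation."""
    for i in range(0, len(words), batch_size):
        yield words[i:i + batch_size]
```

The translation handler would then submit each batch to the translation backend in turn, keeping memory use and request sizes bounded.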
##### Updating Data from Wikidata
- `--update`: Update the language data by running WDQS queries and formatting scripts.
- `--update-langs`: Specify one or more languages for which to update the data (default: all languages). Multiple languages can be specified as a comma-separated list (e.g., `--update-langs en,fr,de`).
- `--update-data-types`: Specify one or more data types to update (e.g., nouns, verbs, prepositions). Multiple data types can be specified as a comma-separated list (e.g., `--update-data-types nouns,verbs`). If not specified, all data types will be updated.
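Comma-separated flag values like these can be handled with a custom `type=` callable in `argparse`. A small sketch, with flag names as proposed above and the final interface subject to change:

```python
import argparse


def comma_list(value):
    """Split a comma-separated flag value into a clean list of items."""
    return [item.strip() for item in value.split(",") if item.strip()]


parser = argparse.ArgumentParser(prog="scribe-data-cli")
parser.add_argument("--update-langs", type=comma_list, default=["all"])
parser.add_argument("--update-data-types", type=comma_list, default=["all"])
```

`argparse` applies the callable to the raw string, so `--update-langs en,fr,de` arrives in the handler as `["en", "fr", "de"]` with no extra parsing code.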
##### Extracting and Processing Language Data from Wikipedia
- `--extract`: Extract language data from Wikipedia dumps.
- `--process`: Process the extracted language data for use in Scribe applications.
- `--wiki-lang`: Specify the language of Wikipedia to extract data from (default: all languages).
##### Output Formats
- `--output-format`: Specify the output format for the data (e.g., JSON, TSV, XML, ZIP and SQLite).
##### Database Integration (ideation)
- `--store-db`: Store the output data directly into an SQL database.
- `--db-connection`: Specify the database connection string or parameters.
- `--db-table`: Specify the database table to store the data.
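As a sketch of what `--store-db` could do using SQLite from the standard library; the table and column names here are purely illustrative:

```python
import sqlite3


def store_rows(db_path, table, rows):
    """Create the table if needed, upsert (word, translation) rows,
    and return the resulting row count.

    Illustrative only: the real schema would follow the language
    pack format, and the table name should come from a trusted
    source since it is interpolated into the SQL.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} "
            "(word TEXT PRIMARY KEY, translation TEXT)"
        )
        conn.executemany(
            f"INSERT OR REPLACE INTO {table} (word, translation) VALUES (?, ?)",
            rows,
        )
        conn.commit()
        return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    finally:
        conn.close()
```

For other SQL backends, `--db-connection` would select the driver and connection parameters while the handler logic stays the same.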
##### Output Location
- `--output-path`: Specify the output location for the generated files or language packs.
##### Miscellaneous
- `--verbose`: Enable verbose output for debugging purposes.
- `--help`: Display help information for the CLI tool and provide detailed descriptions of all available commands, flags, and their usage.
##### General
- `--config`: Specify a configuration file that contains default settings or parameters for the CLI tool.
##### More flags …
- I recently came across [WikibaseJS-cli](https://www.wikidata.org/wiki/Wikidata:Tools/WikibaseJS-cli), a command-line interface to Wikidata or any other Wikibase instance.
- It is essentially an alternative to the web browser interface, and it can also be used in scripts to perform many edits.
#### Testing Strategy
##### Unit Tests
- Test individual command handler functions and components of the CLI tool in isolation.
- Write unit tests for each function using a testing framework such as pytest (with pytest-cov for coverage) or unittest.
##### Integration Tests
- Test the integration of different flags and their combinations.
- Write integration tests that simulate real-world scenarios, such as fetching data from Wikidata, processing it, and generating a language pack. These tests should cover the end-to-end workflow of the CLI tool.
- For example, test the `--generate-pack` command with different `--language` and `--output-format` values to ensure that the language pack is generated correctly and saved in the specified format.
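A unit test of this kind might look as follows; `build_parser` is a minimal stand-in for the real CLI parser, and the flags follow this proposal rather than an existing interface:

```python
import argparse


def build_parser():
    """Minimal stand-in for the proposed CLI parser (flags per the
    proposal; illustrative only)."""
    parser = argparse.ArgumentParser(prog="scribe-data-cli")
    parser.add_argument("--generate-pack", action="store_true")
    parser.add_argument("--language", default="all")
    parser.add_argument(
        "--output-format",
        choices=["json", "tsv", "xml", "zip", "sqlite"],
        default="json",
    )
    return parser


def test_generate_pack_flags():
    """Flag combinations for pack generation parse as expected."""
    args = build_parser().parse_args(
        ["--generate-pack", "--language", "Swedish", "--output-format", "tsv"]
    )
    assert args.generate_pack
    assert args.language == "Swedish"
    assert args.output_format == "tsv"


def test_defaults():
    """Omitting flags falls back to the documented defaults."""
    args = build_parser().parse_args([])
    assert not args.generate_pack
    assert args.language == "all"
    assert args.output_format == "json"
```

Under pytest, both functions are collected and run automatically; the integration tests would build on the same pattern but exercise real file output.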
#### Documentation
- Document the usage of the CLI tool, including all commands and flags, in a user-friendly manner.
# Timeline
| **Period** | **Milestones** | **Task Description** |
| --- | --- | --- |
| May 1 to May 26 | Community bonding period | Join the community and connect with mentors; introductions and connecting with other contributors; start gathering feedback on the proposal and working on suggestions. |
| May 27 to May 31 | | Get started with the refactoring and modularisation of Scribe-Data. |
| May 31 to June 2 | | Set up the CLI framework and repository. |
| June 2 to June 16 | | Configure `setup.py` for the CLI and create a subcommand for the update process; write new WDQS queries for new data types. |
| June 16 to June 18 | | Complete the formatting process for newly introduced data types. |
| June 18 to June 20 | | Complete the CLI implementation for updating data from Wikidata. |
| June 20 to June 28 | | Complete the CLI implementation for extracting and processing language data from Wikipedia. |
| June 28 to July 8 | | Structure the command handlers for language pack generation. |
| July 8 to July 12 | Mid-Term Evaluations | Code cleanup, testing, and progress report/blog. |
| July 13 to July 20 | | Complete the CLI implementation for the translation process and for language pack generation with the `--include-***` flags. |
| July 20 to August 1 | | Complete the remaining command handlers (output format, output location, and miscellaneous flags). |
| August 1 to August 4 | | Create a Homebrew formula for the CLI and test it; (Scribe-Server) define and conceptualize the SQL queries and endpoints for the server database. |
| August 4 to August 9 | | (Scribe-Server) Define endpoints for fetching and adding data to the database; improve the database schema to efficiently store language pack data. |
| August 9 to August 15 | | (Scribe-Server) Get started with the API endpoints; unit and integration testing. |
| August 15 to August 20 | | Documentation. |
| August 19 to August 26 | | Final week: submit the final work product and the final mentor evaluation. |
| August 26 to September 2 | Final Evaluations | Mentors submit final GSoC contributor evaluations. |
# Participation
- I plan to communicate through GitHub and the Scribe team’s matrix channels.
- I plan to attend the Scribe team's bi-weekly developer synchronization meetings. Additionally, I am available for separate calls focusing on updates and discussions related to my GSoC project.
- I plan to report the successful completion of tasks through detailed blogs on either Hashnode or Medium.
- I am available online during my working hours and I am always available on my Email.
- I am also available during my working hours for Online meetings through Element Call or Google Meet or Other Mediums if required.
# Previous Contributions to Scribe-Data
I started contributing to Scribe-Data in January 2024 and have made several pull requests since then.
**Pull Requests Created By Me:**
| **S.No.** | **PR No.** | **Issue No.** | **Description** | **Status** |
| --- | --- | --- | --- | --- |
| 1. | [PR #60](https://github.com/scribe-org/Scribe-Data/pull/60) | [#55](https://github.com/scribe-org/Scribe-Data/issues/55) | Refactor ISO code usage using Python langcodes | Merged |
| 2. | [PR #83](https://github.com/scribe-org/Scribe-Data/pull/83) | [#80](https://github.com/scribe-org/Scribe-Data/issues/80) | Add loading and exporting functions to utils | Merged |
| 3. | [PR #89](https://github.com/scribe-org/Scribe-Data/pull/89) | [#77](https://github.com/scribe-org/Scribe-Data/issues/77) | Set up the translation process and translate words from Russian to other languages | Merged |
| 4. | [PR #99](https://github.com/scribe-org/Scribe-Data/pull/99) | [#57](https://github.com/scribe-org/Scribe-Data/issues/57) | Fix for Sphinx autodoc | Merged |
| 5. | [PR #123](https://github.com/scribe-org/Scribe-Data/pull/123) | [#122](https://github.com/scribe-org/Scribe-Data/issues/122) | Avoid modifying sys.path | Merged |
**Issues Created By Me:**
| **S.No.** | **Issue No.** | **Description** | **Status** | **Type** |
| --- | --- | --- | --- | --- |
| 1. | [#122](https://github.com/scribe-org/Scribe-Data/issues/122) | Simplify module imports by avoiding modification of sys.path | Closed | Bug |
| 2. | [#124](https://github.com/scribe-org/Scribe-Data/issues/124) | Query timeout limit reached while updating German nouns | Open | Bug |
**Other involvements in Scribe-Data:**
| **S.No.** | **Issue No.** | **Description** |
| --- | --- | --- |
| 1. | [#96](https://github.com/scribe-org/Scribe-Data/issues/96) | Remove articles from the machine translation process |
| 2. | [#68](https://github.com/scribe-org/Scribe-Data/issues/68) | Reproduced this issue for the fix of the Portuguese verb process |
| 3. | [#61](https://github.com/scribe-org/Scribe-Data/issues/61) | Assisted in finding the root cause of a macOS build failure |
| 4. | [#80](https://github.com/scribe-org/Scribe-Data/issues/80) | Helping new contributors |
# About Me
I am Shashank Mittal, a sophomore at the Indian Institute of Technology (BHU), Varanasi, pursuing a B.Tech in Mechanical Engineering. I am passionate about software development and love to break software.
> How did you hear about this program?
I got to know about this program through the Google Developer Student Clubs (GDSC).
>We advise all candidates eligible for Google Summer of Code and Outreachy to apply for both programs. Are you planning to apply to both programs and, if so, with what organization(s)?
I am only applying to Google Summer of Code.
# Past Experience
- Developed a web extension for detecting dark patterns on e-commerce websites for the Dark Patterns Buster Hackathon 2023, organized by the Government of India. [Link](https://github.com/shashank-iitbhu/Titans_dpbh23). This project demonstrates my proficiency in Python and machine learning models such as XLNet and RoBERTa.
- Constructed a REST API for the LFX Mentorship metrics website using Go. This microservice parses mentorship-related data into a PostgreSQL database, generates statistics, and exposes them through a REST API. [Link](https://github.com/EshaanAgg/LFXMM-Backend).
# Relevant Skills
- Python
- Go and Rust
- Git and GitHub
- REST APIs
- SQLite and SPARQL queries
- Understanding of Wikidata
- HTML, CSS, JS, and the MERN stack
- CI/CD pipeline
- Docker and Kubernetes
# Volunteer Experience
Core Team Member of Club of Programmers (Software Development Group), IIT (BHU) Varanasi
# Availability
>Are you eligible for Google Summer of Code?
Yes.
>Do you plan to submit any other proposal apart from this one?
No, I plan to submit only this proposal.
>Do you have any other plans during the contribution period?
No, I do not have any other plans during the contribution period.
>How many hours per week can you dedicate to this?
I will work for 40 hours per week.
>Have you been accepted to GSoC before?
No. I am applying to the GSoC program for the first time.
# Post GSoC
I genuinely like this project, and I love how active and enthusiastic the whole Scribe-Data contributor community is. It is a community I want to be part of for a long time; I intend to keep contributing to and improving this project well after GSoC, and I would be happy to take part in major changes to it. I have many ideas for this project and for Scribe-Server, and I would love to discuss and implement them over the long term.