Profile:
Name: Sandeep Subramanian
Email: sandsub95@berkeley.edu
Location: Berkeley, CA
Working Hours: 8-1; 4-7 (PDT)
About Me/Motivations:
I am a second-year undergraduate student intending to pursue chemistry and computer science at the University of California, Berkeley (UCB).
I have been fascinated by geography and languages since childhood. I have long believed that anyone should be able to advance their knowledge regardless of language, and the exchange of information over the Internet is an extension of this principle. I can read and write over a dozen scripts and have won several national geography competitions.
By chance, I was exposed to programming during my second semester of college and absolutely loved it. I am now eager to use my newfound love for programming to remove language barriers from the web and make knowledge available to all.
I can now program comfortably in Python, Java, HTML/CSS, MATLAB, C/C++, and JavaScript, and I know basic Django. I have been programming independently for only the past couple of months, and I have focused my efforts on designing dynamic websites. In Fall 2014, as a scripts fanatic, I joined the Script Encoding Initiative (SEI) at UC Berkeley and worked with several other internationalization professionals at the 38th Internationalization & Unicode Conference in November 2014. I also work part-time in the Computational Research Division of Lawrence Berkeley National Laboratory (LBNL) as a web developer and high-throughput programmer for materials informatics research. I am currently designing websites for SEI and my LBNL group. Neither has given me permission to publish my designs yet, though I am told they are impressive. I will update this proposal as necessary.
Wikipedia is my home that doesn’t require a physical roof. I love contributing to Wikipedia, especially on pages concerning languages, geography, and Indian music. I have recently started using Pywikibot, and I am working on building bots that sync airport destination pages and their maps.
Wikipedia is beyond doubt my constant source of intellectual engagement, and nothing would be more motivating and interesting to me than to develop tools that will allow people around the world to interact with Wikipedia like I do. I’ve been looking for an opportunity to bring my internationalization & localization ideas to life, and I think Wikimedia’s GSoC program presents the perfect opportunity for me to express my interests and meaningfully impact my world.
If we want everyone to use Wikipedia, we need to make it usable for everyone. As a language fanatic, I know from first-hand experience that people who could contribute a great deal to our digital knowledge bank are unable to do so because of unsupported scripts and locales, and this frustrates me as someone who wants all human knowledge readily available at his fingertips. That is why I want to make this the best wiki enhancement I can, so that all of human knowledge may eventually be just one click away. I am interested in getting more involved with and contributing to Wikimedia’s internationalization projects, and I see this as a great leap toward that goal. Making this project happen means many more people can contribute to Wikipedia in the way they like, which makes me happy and inspires me to do more.
Objectives:
A wide variety of internationalization projects are required to universalize access to digital information. I have identified five different types of localization, as specified below. I will try to implement one of each, in order of increasing complexity, so that a framework exists for (hopefully) quicker implementation of future internationalization tasks that fall under each category.
1) Font Variation: Farsi/Urdu: Naskh-Nastaliq
Viewers should have the choice of viewing material in the style/font of their choice.
2) Script Transliteration: Malay/Indonesian: Rumi-Jawi
The same article should be available in multiple scripts used for the same language.
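As a rough illustration of how table-driven transliteration could work, here is a minimal Python sketch. The names RUMI_TO_JAWI and transliterate are hypothetical, the table covers only a handful of consonants, and real Jawi spelling has vowel and orthographic rules that this sketch ignores entirely.

```python
# Hypothetical, deliberately tiny Rumi -> Jawi table (illustration only).
# Real Jawi orthography needs vowel handling and many spelling rules.
RUMI_TO_JAWI = {
    "ng": "\u06A0",  # Jawi nga
    "ny": "\u06BD",  # Jawi nya
    "b": "\u0628",   # ba
    "t": "\u062A",   # ta
    "m": "\u0645",   # mim
    "n": "\u0646",   # nun
    "l": "\u0644",   # lam
}


def transliterate(text, table=RUMI_TO_JAWI):
    """Replace Rumi letters using longest-match-first lookup.

    Input is lowercased first; characters missing from the table
    are passed through (lowercased) unchanged.
    """
    lowered = text.lower()
    # Try longer keys first so the digraph "ng" wins over "n".
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(lowered):
        for key in keys:
            if lowered.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No table entry starts here: keep the character as-is.
            out.append(lowered[i])
            i += 1
    return "".join(out)
```

Matching the longest key first is what keeps digraphs such as "ng" from being split into two separate letters. A production converter would need context-sensitive rules and a reverse (Jawi to Rumi) direction as well.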
3) Simple Version Control: Punjabi: Gurmukhi-Shahmukhi
Some languages use multiple scripts and already have differentiated content in each script. Since that content cannot be modified, I can simply let users choose the script in which to view an article and render the associated article. The advantage of this unified page is that users who can read both scripts and understand finer nuances in dialects can easily translate articles to the other script. It also opens an easier pathway to dialect-friendly machine transliteration, which can be explored if time permits.
4) Dialectal Variation: English: American-British-Australian-Indian
For multinational languages, multiple user communities can exist with different spelling and numerical conventions. For example, British spellings differ from American ones, and Indian English far more often counts large numbers in lakhs and crores instead of millions and billions. Mixing conventions on the same webpage prevents the localization needed to match what users around the world are familiar with. I plan to have localized versions of each webpage that follow the conventions agreed upon by each local user community.
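To make the numerical difference concrete, the following Python sketch groups the digits of an integer either in thousands or in the Indian style, where the last three digits form one group and the rest are grouped in pairs (lakhs and crores). The function names and the variant codes "en-US" and "en-IN" are assumptions chosen for illustration, not part of any existing tool.

```python
def group_western(n):
    """Group digits in threes: 12345678 -> '12,345,678'."""
    return f"{n:,}"


def group_indian(n):
    """Group the last three digits, then pairs: 12345678 -> '1,23,45,678'."""
    sign = "-" if n < 0 else ""
    digits = str(abs(n))
    if len(digits) <= 3:
        return sign + digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # Peel two digits at a time off the right end of the head.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return sign + ",".join(groups + [tail])


def localize_number(n, variant):
    """Pick a grouping style from a (hypothetical) variant code."""
    if variant == "en-IN":
        return group_indian(n)
    return group_western(n)
```

For example, localize_number(12345678, "en-IN") gives '1,23,45,678' (1 crore, 23 lakh, 45 thousand, 678), while localize_number(12345678, "en-US") gives '12,345,678'. Spelling differences could be handled by a similar per-variant lookup table agreed upon by each community.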
5) Form Variation: Arabic: Diacritics-No Diacritics
Allowing the user to toggle the display of vowel diacritics can help the user identify or pronounce a word that is not easily recognizable and look it up in a dictionary. This can also help in displaying diacritics for Arabic-based scripts in which diacritics are obligatory.
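The easy half of this toggle, hiding diacritics that are already present in the text, can be sketched in a few lines of Python. This assumes that "vowel diacritics" means the Unicode combining marks U+064B through U+0652 (fathatan through sukun) plus the superscript alef U+0670; the function names are hypothetical. Adding missing diacritics is the much harder direction and is not attempted here.

```python
import re

# Assumption: the marks to hide are U+064B..U+0652 (fathatan, dammatan,
# kasratan, fatha, damma, kasra, shadda, sukun) and U+0670 (superscript alef).
HARAKAT = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text):
    """Return the text with the Arabic diacritics listed above removed."""
    return HARAKAT.sub("", text)


def render(text, show_diacritics):
    """Return the text as stored, or stripped, depending on the user's toggle."""
    return text if show_diacritics else strip_diacritics(text)
```

With the fully vowelled word "\u0643\u064E\u062A\u064E\u0628\u064E" as input, render(word, False) returns the bare letters "\u0643\u062A\u0628", and render(word, True) returns the word unchanged. Storing the diacritized form and stripping at display time keeps the toggle lossless.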
And if time permits:
6) Simple Version Control: German: German-Alemannic-Luxembourgish-Plattdeitsch-etc.
7) Dialectal Translation: Chinese: Mandarin-Yue-Hakka-Minnan-Wu-Classical
Participation:
I can publish code on my GitHub account, sandsub95. I will ask members of the Wikimedia GSoC community, including mentors and students, for help with questions I cannot answer through Stack Overflow or elsewhere. I will communicate with my mentor at least weekly to ensure satisfactory progress.
Timeline (11 Weeks):
Weeks 1-2
Goals: Learn PHP; Find Naskh/Nastaliq Fonts for Urdu/Farsi Across Platforms; Implement System to Change Font on Page from Menu
Skills: PHP, CSS, JavaScript
Deliverables: Naskh-Nastaliq Rendering Activated in Farsi/Urdu Wikipedias; Free, Open-Source Font Choice for other Language Wikipedias
Week 3
Goals: Implement Transliteration between Rumi and Jawi; Implement for Javanese and Rumi in Basa Jawa Wikipedia if Time
Skills: PHP, CSS, JavaScript
Deliverables: Rumi & Jawi Options Activated in Bahasa Melayu and Bahasa Indonesia Wikipedias from Drop-Down Menu
Week 4
Goals: Collect Gurmukhi & Shahmukhi Articles, with Mappings between them, and combine into version control system.
Skills: PHP, CSS, JavaScript
Deliverables: New Punjabi Wikipedia with Drop Down Menu for Shahmukhi & Gurmukhi
Weeks 5-6
Goals: Develop a tool that will allow users to specify required localizations for different English Wikipedia user communities; once the tool is approved, implement changes.
Skills: PHP, CSS, JavaScript
Deliverables: Organized Localization Request System & Community Page for Reviewing Submissions
Weeks 7-10
Goals: Develop a tool that can predict Arabic diacritics from undiacritized words: collect a database of Arabic words used on Wikipedia, map each to the possible diacritized words it could represent, and choose among them contextually based on part of speech, correlation tags, etc. Also set up version control using the previous formats to allow for user contributions.
Skills: PHP, CSS, JavaScript
Deliverables: Tool with Arabic diacritics for Arabic articles (unlikely to be finished; this is a project of substance)
Week 11
User Feedback & Bug Fixes & Slip Week