
Linguistic Version Control for Polygraphia
Closed, Declined · Public



Name: Sandeep Subramanian
Location: Berkeley, CA
Working Hours: 8-1; 4-7 (PDT)

About Me/Motivations:

I am a second-year undergraduate student intending to pursue chemistry and computer science at the University of California, Berkeley (UCB).

I have been a fanatic of geography and languages since childhood. I have strongly believed that anyone should be able to advance their knowledge independent of language, and the exchange of information over the Internet is an extension of this principle. I myself can read and write over a dozen scripts and have won several national geography competitions.

By chance, I was exposed to programming during my second semester of college and absolutely loved it. I am now incredibly impassioned to use my newfound love for programming to remove language barriers from the web and make knowledge available to all.

I can now program comfortably in Python, Java, HTML/CSS, MATLAB, C/C++, JavaScript, and basic Django. I have been programming independently for only the past couple of months, and I have focused my efforts on designing dynamic websites. In Fall 2014, as a scripts fanatic, I joined the Script Encoding Initiative (SEI) at UC Berkeley and worked with several other internationalization professionals at the 38th Unicode & Internationalization Conference in November 2014. I also work part-time in the Computational Research Division of Lawrence Berkeley National Laboratory (LBNL) as a web developer and high-throughput programmer for materials informatics research. I am currently designing websites for SEI and my LBNL group; neither has given me permission to publish my designs yet, but they are supposedly impressive. I will update this proposal as necessary.

Wikipedia is my home that doesn’t require a physical roof. I love contributing to Wikipedia, especially on pages concerning languages, geography, and Indian music. I have recently gotten into pywikibot, and I am working on building bots that sync airport destination pages and their maps.

Wikipedia is beyond doubt my constant source of intellectual engagement, and nothing would be more motivating and interesting to me than to develop tools that will allow people around the world to interact with Wikipedia like I do. I’ve been looking for an opportunity to bring my internationalization & localization ideas to life, and I think Wikimedia’s GSoC program presents the perfect opportunity for me to express my interests and meaningfully impact my world. I would like to work with any of the mentors involved in internationalization, such as Alolita Sharma, Amir Aharoni, Santosh Thottingal, and others.

If we want everyone to use Wikipedia, we need to make it usable for everyone. As a language fanatic, I know from first-hand experience that people who could contribute a lot to our digital knowledge bank are unable to do so because of unsupported scripts and locales, and as someone who wants all human knowledge readily available at his fingertips, this frustrates me. And that’s why I want to make this the most awesomest wiki enhancement ever -- so that all of human knowledge may eventually be just one click away. As such, I am interested in getting more involved with and contributing to Wikimedia’s internationalization projects, and I see this as a great leap toward that goal. Making this project happen means many more people can contribute to Wikipedia in a way they like, which makes me happy and inspired to do more.


A wide variety of internationalization projects are required to universalize access to digital information. I have identified five different types of localizations, as specified below. I will try to implement one of each, in increasing complexity, so that a framework exists for (hopefully) quicker implementations of other future internationalization tasks that may fall under each category.

1) Font Variation: Farsi/Urdu: Naskh-Nastaliq
Viewers should have the choice of viewing material in the style/font of their choice.
2) Script Transliteration: Malay/Indonesian: Rumi-Jawi
The same article should be available in multiple scripts used for the same language.
3) Simple Version Control: Punjabi: Gurmukhi-Shahmukhi
For languages which use multiple scripts but already have differentiated content in each script, since the content cannot be modified, I can simply give users the option to view the original article in the script of their choice and render the associated article. The advantage of this unified page is that users who can read both scripts and understand finer nuances in dialects can easily translate articles to other scripts. Also, this opens an easier pathway for dialect-friendly machine transliteration, which can be explored if time permits.
4) Dialectal Variation: English: American-British-Australian-Indian
For multinational languages, multiple user communities can exist with different spelling and numerical conventions. For example, British spellings differ from American ones, and Indian English far more often counts large numbers in lakhs and crores instead of millions and billions. Mixing these conventions on the same webpage prevents the page from being localized to what users in each region are familiar with. I plan to have localized versions of each webpage that account for conventions agreed upon by localized user communities.
5) Form Variation: Arabic: Diacritics-No Diacritics
Allowing the user to toggle the display of vowel diacritics can help the user identify or pronounce a word that is not easily recognizable, and can help with looking up the word in a dictionary. This can also help in displaying diacritics for Arabic-based scripts in which diacritics are obligatory.
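
As a minimal sketch of the "no diacritics" half of the toggle in item 5, the snippet below removes vowel marks from Arabic text. It assumes that "diacritics" means the Unicode marks U+064B through U+0652 (tanwin, short vowels, shadda, sukun) plus the superscript alef U+0670; a real implementation might choose a different set. The names `HARAKAT` and `strip_harakat` are hypothetical and are not part of MediaWiki.

```python
# Sketch only: assumes "vowel diacritics" means the Arabic marks
# U+064B (fathatan) through U+0652 (sukun), plus superscript alef U+0670.
# HARAKAT and strip_harakat are hypothetical names, not MediaWiki APIs.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def strip_harakat(text: str) -> str:
    """Return text with the assumed set of vowel marks removed."""
    return "".join(ch for ch in text if ch not in HARAKAT)


# "kitabun" (a book), fully vowelled:
# kaf + kasra, ta + fatha, alef, ba + dammatan
vowelled = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C"
plain = strip_harakat(vowelled)  # leaves kaf, ta, alef, ba
```

Removing marks is the easy direction. Showing diacritics on text that was written without them requires choosing the right vowelling from context, which is the harder problem scheduled for Weeks 7-10.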

And if time permits:

  1. Simple Version Control: German: German-Alemannic-Luxembourgish-Plattdeitsch-etc.
  2. Dialectal Translation: Chinese: Mandarin-Yue-Hakka-Minnan-Wu-Classical

I can publish code on my GitHub account, sandsub95. I will ask for help from members of the Wikimedia GSoC community, including mentors or students who can answer questions that I cannot resolve through Stack Overflow or elsewhere. I will communicate with my mentor at least weekly to ensure satisfactory progress.

Timeline (11 Weeks):
Weeks 1-2
Goals: Learning PHP, Find Naskh/Nastaliq Fonts for Urdu/Farsi Across Platforms, Implement System to Change Font on Page from Menu
Skills: PHP, CSS, Javascript
Deliverables: Naskh-Nastaliq Rendering Activated in Farsi/Urdu Wikipedias; Free, Open-Source Font Choice for other Language Wikipedias

Week 3
Goals: Implement Transliteration between Rumi and Jawi; Implement for Javanese and Rumi in Basa Jawa Wikipedia if Time
Skills: PHP, CSS, Javascript
Deliverables: Rumi & Jawi Options Activated in Bahasa Melayu and Bahasa Indonesia Wikipedias from Drop-Down Menu

Week 4
Goals: Collect Gurmukhi & Shahmukhi Articles, with Mappings between them, and combine into version control system.
Skills: PHP, CSS, Javascript
Deliverables: New Punjabi Wikipedia with Drop Down Menu for Shahmukhi & Gurmukhi

Weeks 5-6
Goals: Develop tool that will allow users to specify required localizations for different English Wikipedia user communities; when tools approved, implement changes.
Skills: PHP, CSS, Javascript
Deliverables: Organized Localization Request System & Community Page for Reviewing Submissions
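
One localization such a request system might carry is the number-grouping convention from item 4 (lakhs and crores versus millions and billions). The sketch below is only an illustration of that difference, assuming plain integers and comma separators; `group_indian` and `group_western` are hypothetical helper names, not existing MediaWiki functions.

```python
def group_indian(n: int) -> str:
    """Group digits Indian-style: the last three digits, then pairs."""
    sign = "-" if n < 0 else ""
    digits = str(abs(n))
    if len(digits) <= 3:
        return sign + digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # Peel two digits at a time off the right end of the leading part.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return sign + ",".join(groups + [tail])


def group_western(n: int) -> str:
    """Group digits in threes, as in American and British usage."""
    return f"{n:,}"


print(group_indian(12345678))   # 1,23,45,678  (1 crore 23 lakh 45 thousand 678)
print(group_western(12345678))  # 12,345,678
```

The same value renders differently for each community, so a per-locale rule like this would need to be applied at display time without altering the stored article text.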

Weeks 7-10
Goals: Develop tool that can predict Arabic diacritics based on words: Collect database of Arabic words used on Wikipedia, map to possible words that could be represented with diacritics, choose contextually based on part of speech, correlation tags, etc.; Also set up version control using previous formats to allow for user contributions.
Skills: PHP, CSS, Javascript
Deliverables: Tool with Arabic diacritics for Arabic articles (unlikely to be finished; this is a project of substance)

Week 11
User Feedback & Bug Fixes & Slip Week

Event Timeline

sandsub95 claimed this task.
sandsub95 raised the priority of this task to Needs Triage.
sandsub95 updated the task description.
sandsub95 added a subscriber: sandsub95.
Restricted Application added a subscriber: Aklapper. · Mar 27 2015, 12:36 PM
sandsub95 updated the task description. · Mar 27 2015, 12:51 PM
sandsub95 set Security to None.

Hi @sandsub95,
I think your proposal doesn't match any of the possible GSoC projects we have here:

We encourage new ideas, but I'm afraid it's quite late to find mentors for this project. Not to mention getting community approval for it. Sorry. :(

I hope you'll apply earlier next year so we get ample time to discuss it and find you mentors.

Thanks for applying!

Hi Niharika,

Oh, that's a shame. As it turns out, I only found out about this opportunity a couple of days ago, so I decided to give it a shot anyway :P We'll see about next year; thanks for the suggestion! Offhand, do you have any suggestions on how my proposal could be improved or whether it could serve as a feasible project in the future? Please let me know if you get the chance. Thanks!

The feature you propose already exists and is called language converter. See and

Font issues are unrelated and dealt with by ; regional variants are generally handled by editors.

If you want to add a new converter, I suggest starting by fixing bugs and reviewing patches for the existing ones: see MediaWiki-Language-converter for a list of suggestions. Good knowledge of the chosen language, or ample access to a native speaker, is required.

You still have time to improve your proposal, but it needs to be scoped much more tightly and you'll need to work hard to catch up with the coding microtasks (language converter is not an easy area of our code base).

Is your airport destination pages pywikibot code open source? Is the bot deployed; what is its username?

If you have a github account, it would be useful to link to that so we can see some of your code.

@Bennylin might have some feedback regarding Basa Jawa Wikipedia. See , and

Thanks, jay, for the ping. My reading of the original post is that it is more about the Jawi script (a.k.a. Malay Arabic) than the Jawa (or Javanese) script, which is derived from Sanskrit. As far as I know, Malay projects (-pedia, -tionary) try to incorporate Jawi script, and I heard that (Minangkabau) has also been thinking about it. Javanese script, on the other hand, is incorporated in Javanese (jv) projects, and it is our plan to someday have a converter like that of Kyrgyz or Uyghur wikipedias. Help on that aspect would really be appreciated.

But, go ahead, sandsub, look at those links and ping me on my meta talk page if you're interested, and we could have a long discussion if you wish.

Hi Mr. Vandenburg,
Unfortunately, I have not deployed it yet; I am still writing and testing it. Since I am a student, I will likely finish it in June when I have more time. I'm glad you're interested though, and I'll share my GitHub account once I have pushed it.

Hi Mr. Benny Lin,
I would be happy to work on this! I was just giving five examples of different kinds of i18n projects that I would be interested in working on. In fact, I have a much larger list of projects that I see as needing implementation on Wikipedia (about 30 or so, at the moment), and Javanese script is one of them (along with Jawi/Kawi, Sundanese, etc.). I think this would be a good introductory project to get me acquainted; we can chat on your talk page, as you indicated.

Hi @sandsub95! If you'd like to work on this outside of GSoC, or apply for the next round, now would be a good time to start contacting relevant people and scoping this project well. i18n folks hang out on the MediaWiki-Internationalization IRC channel. For all other general queries, MediaWiki-General is your go-to channel.

Qgil added a subscriber: Qgil. · Sep 23 2015, 9:13 AM

This is a message posted to all tasks under "Backlog" at Possible-Tech-Projects. Outreachy-Round-11 is around the corner. If you want to propose this task as a featured project idea, we need a clear plan with community support, and two mentors willing to support it.

Qgil added a comment. · Sep 23 2015, 9:35 AM

This is a message sent to all Possible-Tech-Projects. The new round of Wikimedia Individual Engagement Grants is open until 29 Sep. For the first time, technical projects are within scope, thanks to the feedback received at Wikimania 2015, before, and after (T105414). If someone is interested in obtaining funds to push this task, this might be a good way.

Qgil added a comment. · Oct 5 2015, 11:42 AM

Hi @sandsub95, what is the situation of this project? Are you working on it?

Sumit added a subscriber: Sumit. · Mar 1 2016, 5:37 PM
IMPORTANT: This is a message posted to all tasks under "Need Discussion" at Possible-Tech-Projects. Wikimedia has been accepted as a mentor organization for GSoC '16. If you want to propose this task as a featured project idea, we need a clear plan with community support, and two mentors willing to support it.
Restricted Application added a subscriber: StudiesWorld. · Mar 1 2016, 5:37 PM
Qgil closed this task as Declined. · Mar 2 2016, 10:42 AM

I will boldly resolve this task as Declined. It was a project idea by @sandsub95 but he seems not to be pursuing it anymore. The likelihood of someone taking it as is and finding mentors or other developers for it is pretty low. If Sandeep or anyone else wants to work on it, feel free to reopen it.

@Qgil: Any chance I can start working on this now? I have a little more experience coding at this point and would like to get this going. What's the administrative procedure to get this running? Thank you so much for your help!

We don't know yet if Wikimedia will be chosen to take part in GSoC 2017, but by taking T94160#1173319 and other feedback into account you could maybe create an updated proposal? :) Also see