
moving palmleaf.org platform to Balinese Wikisource/Wikipedia
Open, Needs Triage · Public

Description

Last year, PanLex (my project) and the Internet Archive created a new platform at palmleaf.org for digitizing Balinese palm-leaf manuscripts, specifically for transcribing the Internet Archive's huge collection of Balinese manuscripts into Unicode text in the Balinese script. The goal is to provide a rich digital resource for these manuscripts, including text, description, translation, etc. The link above provides more context.

PanLex's work last year focused on developing the platform and organizing a team in Bali to transcribe manuscripts. Technical work included developing a new Balinese font and a MediaWiki extension (ArchiveLeaf) to manage the workflow. The extension is needed to import Internet Archive items into wiki pages and to provide a workable transcription interface. It includes a React app that provides a split-pane view where users can view the leaf image above and transcribe it into text below. The app also provides an on-screen keyboard for inputting text in Balinese script.
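
As a rough illustration of the import step, the sketch below lists the image files of an Internet Archive item and builds their download URLs. It is not the ArchiveLeaf extension's code. It assumes the public `https://archive.org/metadata/<identifier>` endpoint, whose JSON response has a `files` list with `name` and `format` keys, and the `https://archive.org/download/<identifier>/<filename>` URL pattern. The function names, the example identifier, and the set of image format labels are hypothetical and chosen for illustration only.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: format labels that mark leaf images. Real items may use others.
IMAGE_FORMATS = {"JPEG", "PNG", "TIFF", "JPEG 2000"}


def fetch_metadata(identifier):
    """Fetch an item's metadata as a dict from the public metadata endpoint."""
    url = f"https://archive.org/metadata/{quote(identifier)}"
    with urlopen(url) as response:
        return json.load(response)


def leaf_image_urls(identifier, metadata):
    """Return download URLs for the image files listed in the item's metadata."""
    urls = []
    for entry in metadata.get("files", []):
        if entry.get("format") in IMAGE_FORMATS:
            name = quote(entry["name"])  # percent-encode spaces etc.
            urls.append(f"https://archive.org/download/{quote(identifier)}/{name}")
    return urls


# Offline example with a hand-written metadata dict (hypothetical item).
sample = {
    "files": [
        {"name": "leaf 01.jpg", "format": "JPEG"},
        {"name": "example-item_meta.xml", "format": "Metadata"},
    ]
}
print(leaf_image_urls("example-item", sample))
```

An importer would then hand each URL to the wiki's upload mechanism, or, as suggested later in this thread, the files could be hosted on Commons and embedded from there.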

Last year's work was essentially a pilot, so the priority was making the platform work and developing relationships in Bali. It was a success -- the team in Bali transcribed 3,000 leaves containing 92 complete works, and they are excited to continue. They plan to raise the total to 20,000 leaves over the next 1-2 years. Meanwhile, the Internet Archive, which hosts palmleaf.org, has suggested handing the platform over to Balinese Wikipedia (or perhaps more appropriately Balinese Wikisource, which doesn't exist yet). This would put the platform more directly in the hands of the community, make the content more widely available and more likely to be improved, etc.

I recently spoke about this with Carma Citrawati, who has been managing the work on palmleaf.org and is also involved in the Balinese Wikipedia community. She spoke to other community members and they are definitely interested in finding a way for Balinese Wikipedia or Wikisource to host the content and platform in some way. I'm excited about the possibility of getting this done, but there are two main obstacles: (1) figuring out the technical details so the ArchiveLeaf extension can be deployed on Wikimedia, and updating the extension accordingly; (2) funding this work if it is substantial.

I'd appreciate any suggestions for who to contact and what to do next. Thanks!

Event Timeline


By the way, there's the possibility of using the ArchiveLeaf extension to work with palm-leaf manuscript collections in other languages. There are no solid plans for that currently but it's something PanLex is investigating. Just wanted to mention that it may be useful beyond this project.

I'd appreciate any suggestions for who to contact and what to do next.

I provided some over IRC at #wikimedia-tech. Let me know if you need more (ideally with a summary of open questions; I realise I brought up a number of topics, not all necessarily of immediate interest for you).

2020-02-10 23.11 < ningu> hi, I have some general questions about getting a mediawiki extension approved for use on wikimedia projects and how to make sure the extension plays nice and does things the way people expect/want them to be done
2020-02-10 23.11 < ningu> hopefully this channel is better than #wikimedia where I just asked?
2020-02-10 23.12 < mutante> ningu: hi. there are probably too many channels but it's getting closer
2020-02-10 23.12 < ningu> haha ok
2020-02-10 23.12 < DSquirrelGM> maybe ... but idk for sure
2020-02-10 23.13 < mutante> ningu: one way to do it would be to create a ticket that says "deploy extension XY to production" and explain there why you think it would be good to have
2020-02-10 23.13 < ningu> mutante: so I know that the balinese wikipedia committee wants this
2020-02-10 23.14 < ningu> but they don't know anything about the technical implementation
2020-02-10 23.14 < mutante> assuming by wikimedia projects you mean the main projects like Wikipedia, Wikidata, Wiktionary... and not just cloud or tools
2020-02-10 23.14 < mutante> i see
2020-02-10 23.14 < ningu> it will most likely be wikisource, possibly just wikipedia
2020-02-10 23.14 < andre__> ningu: which extension is this about?
2020-02-10 23.14 < ningu> andre__: not released through regular channels yet but code is here: https://github.com/internetarchive/mediawiki-extension-archive-leaf
2020-02-10 23.15 < mutante> ningu: have you ever used gerrit and made it a mediawiki config change before? that's another way to suggest the actual code change and add reviewers on it .. and maybe combined with a mailing list post that points to it
2020-02-10 23.15 < andre__> ningu: Does that mean this extension is NOT on any Wikimedia site yet?
2020-02-10 23.15 < ningu> I am not really a mediawiki developer or even php developer. so I suspect it needs to be cleaned up but that's fine.
2020-02-10 23.15 < ningu> andre__: correct, it's on palmleaf.org which is an independent mediawiki install
2020-02-10 23.15 < andre__> ningu: In that case, see https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment
2020-02-10 23.15 < ningu> the initial conception was for this just to be a separate site, but now the balinese community is interested in moving it to their wikipedia or wikisource (wikisource doesn't exist yet but can be created)
2020-02-10 23.15 < ningu> andre__: I've read all that, I have more specific questions about this extension though
2020-02-10 23.16 < ningu> since it isn't really a "normal" extension in some ways
2020-02-10 23.16 < Reedy> Yeah...
2020-02-10 23.16 < Reedy> Being a react app and stuff.. And needing a specific skin?
2020-02-10 23.16 < ningu> Reedy: I am already eliminating the skin requirement, thankfully
2020-02-10 23.16 < ningu> the react app is essential though
2020-02-10 23.17 < ningu> basically it makes a workflow for balinese people to import palm-leaf manuscripts stored at archive.org and transcribe and translate them
2020-02-10 23.17 < ningu> the react app lets them view the image of the palm leaf above and type below
2020-02-10 23.17 < ningu> it also injects webfonts that allow display of balinese text, and an on-screen keyboard for balinese script for people who don't have a layout
2020-02-10 23.18 < Reedy> WMF deploy an extension for webfonts (UniversalLanguageSelector)
2020-02-10 23.18 < Reedy> So not re-inventing the wheel is nice where possible
2020-02-10 23.18 < ningu> Reedy: I looked into that, one sec, I'll try to remember why it wasn't ok
2020-02-10 23.18 < Reedy> File bugs :)
2020-02-10 23.19 < Reedy> For the react part... I guess it would need turning into a "service" to have any chance of being deployed
2020-02-10 23.19 < mutante> it could possibly start as a tool in cloud aka 'labs' and then ask to be moved to prod in a second step
2020-02-10 23.20 < ningu> what does being a "service" entail?
2020-02-10 23.20 < andre__> right, the current code loads webfonts from 3rd party websites like https://bali.panlex.org/transcriber/fonts/ which would be a privacy no-go
2020-02-10 23.20 < mutante> tool would mean something in wmflabs.org domain
2020-02-10 23.21 < ningu> andre__: sure, we can host the fonts elsewhere
2020-02-10 23.21 < ningu> problem is these fonts didn't exist and we had to create them
2020-02-10 23.21 < ningu> not sure what would be better
2020-02-10 23.21 < Reedy> If you could work out what was wrong with ULS..
2020-02-10 23.21 < mutante> we would have to build Debian packages that install the fonts
2020-02-10 23.21 < Reedy> mutante: Or bundle them in ULS, which does a lot of shit like that :)
2020-02-10 23.22 < ningu> Reedy: so part of the issue here is, the text is in multiple languages but in only one script (writing system)
2020-02-10 23.22 < ningu> using webfonts that always get pulled in for the balinese code block seemed easiest
2020-02-10 23.22 < mutante> heh, ok. i just know we also install a bunch of fonts from packages on appservers
2020-02-10 23.23 < ningu> and then you also don't have to tag which language each bit of text is or assume one language for the whole site
2020-02-10 23.23 < ningu> plus we'd have to fork ULS and add our fonts
2020-02-10 23.23 < ningu> but the whole site as currently conceived only has one special font need, so ULS didn't solve any problem for us
2020-02-10 23.23 < ningu> whole site = palmleaf.org
2020-02-10 23.24 < ningu> it was easier to write 5 lines of CSS
2020-02-10 23.24 < ningu> I don't know what the "right" solution is for wikimedia level
2020-02-10 23.24 < Reedy> You wouldn't need to fork ULS... Submit a patch/ticket to upstream to include them
2020-02-10 23.24 < Reedy> As long as they have an appropriate license, they should be includeable
2020-02-10 23.25 < ningu> Reedy: fair enough but we didn't have time to worry about that at the time (weren't being paid for it basically and short deadlines)
2020-02-10 23.25 < Reedy> sure :)
2020-02-10 23.25 < ningu> now the people funding this may be more willing
2020-02-10 23.25 < ningu> license will be fine
2020-02-10 23.25 < ningu> I think
2020-02-10 23.25 < Reedy> the TLDR is basically, it's possible, but it's not simple to get a complex extension deployed to wmf wikis
2020-02-10 23.25 < ningu> iirc the font author licensed them with CC-BY-NC-ND
2020-02-10 23.26 < ningu> I dunno if NC or ND is a problem
2020-02-10 23.27 < Reedy> Not sure either... But at least being a CC license should be a reasonable start
2020-02-10 23.27 < ningu> Reedy: yeah that's understandable. I've thought of another way being to have a bot that periodically copies stuff from palmleaf.org to wikisource, which might get around all this, if it just means balinese wikisource has to approve the bot
2020-02-10 23.27 < mutante> for content it would be .. for a font that is an interesting question.
2020-02-10 23.27 < ningu> but it makes it hard to simultaneously edit both wikis then
2020-02-10 23.27 < ningu> raises the question of which is primary
2020-02-10 23.27 < mutante> printing wikipedia articles and selling the books needs to be allowed though
2020-02-10 23.28 < ningu> hmmm... yeah
2020-02-10 23.28 < ningu> good point actually
2020-02-10 23.28 < ningu> and I assume you can't print the article with the font without a license to the font? I guess? haha
2020-02-10 23.28 < ningu> I mean and sell it
2020-02-10 23.28 < mutante> you could export raw text and print it in another font if there is one .. i guess
2020-02-10 23.29 < mutante> that could be an interesting one for legal
2020-02-10 23.29 < ningu> there are other balinese fonts but this is the only one that has proper opentype handling. noto is ok-ish. actually we got noto to hire the guy who did this for us to improve noto balinese, so probably sooner or later can just use that
2020-02-10 23.29 < mutante> as in "let's just ask"
2020-02-10 23.29 < ningu> yeah it's ok, it's solvable one way or another
2020-02-10 23.30 < ningu> I can always explain the issue to the designer and ask him too if he's flexible on license
2020-02-10 23.30 < mutante> maybe the easier route would be to approach the font author later and try to convince him to license it differently for use in Wikipedia
2020-02-10 23.30 < ningu> my understanding is it can't be a negotiated license just for wikimedia, right (even if no cost)?
2020-02-10 23.31 < ningu> like it's noncommercial but he explicitly allows wikimedia alone to use it commercially?
2020-02-10 23.31 < ningu> well maybe not alone, I just mean, as listed explicitly
2020-02-10 23.31 < mutante> hmm.. i don't know. that is really "not a lawyer" territory
2020-02-10 23.31 < ningu> hahaha ok
2020-02-10 23.31 < ningu> yeah I have no clue either
2020-02-10 23.32 < ningu> my group is sort of stuck in the middle here, there's a funder and the people in bali are interested in putting this stuff on wikisource/wikipedia, and our job is to do it, but if it turns out to be too hard or expensive the funder might balk anyway
2020-02-10 23.32 < mutante> the whole palm leaf thing is a very interesting project though. you should not shy away from creating one or multiple tickets / bugs about it
2020-02-10 23.32 < mutante> and see what you get from that
2020-02-10 23.32 < ningu> mutante: it's a really cool project, yeah... no one has done anything quite like it to my knowledge. the typed transcription, descriptions, etc are all from young people in bali who have studied this stuff in school
2020-02-10 23.33 < ningu> basically the perfect wiki contributors, since professors and other experts don't have time
2020-02-10 23.33 < mutante> i mean WMF specifically wants to reach underserved languages/groups/media and that's a great example
2020-02-10 23.33 < ningu> example https://palmleaf.org/wiki/carcan-kucing
2020-02-10 23.33 < ningu> we tried to get them to do short english descriptions
2020-02-10 23.34 < ningu> there isn't a lot of balinese text on the web in balinese script. palmleaf.org might have most of it at this point
2020-02-10 23.34 < ningu> I am sure the font business can be sorted one way or another but the complexity of the extension is another matter maybe
2020-02-10 23.34 < mutante> ningu: i don't understand but i know one word. kucing means cat
2020-02-10 23.34 < ningu> yeah, same as indonesian, but you can also say meong (= meow) for cat
2020-02-10 23.36 < ningu> I don't speak balinese but I speak indonesian
2020-02-10 23.37 < ningu> doesn't matter though, I just need to make the site work and talk to them :)
2020-02-10 23.38 < mutante> Reedy: ^ i was about to suggest to talk to "Community Engagement" but that is being integrated into other teams?
2020-02-10 23.40 < mutante> ningu: maybe these people would be good to talk to https://meta.wikimedia.org/wiki/Community_Programs_team
2020-02-10 23.40 < ningu> yes, a conversation would be really useful
2020-02-10 23.40 < mutante> this is a library in a way
2020-02-10 23.40 < ningu> I know folks in Bali but not at wikimedia
2020-02-10 23.41 < ningu> the people in Bali are on the committee for ban.wikipedia.org but that isn't enough for this
2020-02-10 23.43 < Reedy> Speaking to the language team probably isn't a bad idea either
2020-02-10 23.43 < ningu> I wonder if WMF is interested if they would even fund a little of our work, or at least help develop a technical plan
2020-02-10 23.44 < Reedy> Wikimedia do do grants that can be used for this sort of thing https://meta.wikimedia.org/wiki/Grants
2020-02-10 23.44 < mutante> PMing about how to create a phab ticket
2020-02-10 23.44 < mutante> so at least we have contact data and something to point to
2020-02-10 23.46 < andre__> ningu: https://meta.wikimedia.org/wiki/Grants:Project
2020-02-10 23.46 < andre__> (deadline for the current round is soon though)
2020-02-11 23.33 < Nemo_bis> ningu: This is the same palm leaf project that Brewster Kahle is super fond of, right?
2020-02-11 23.34 < ningu> Nemo_bis: yes. he is funding PanLex's work on palmleaf.org and some of the work in Bali
2020-02-11 23.34 < ningu> and he funded the initial digitization too
2020-02-11 23.34 < Nemo_bis> Right.
2020-02-11 23.35 < Nemo_bis> There were several similar projects in the last 5-10 years.
2020-02-11 23.35 < ningu> but there were a bunch of technical hurdles to get the unicode balinese working and get some kind of platform up
2020-02-11 23.35 < ningu> Nemo_bis: you mean digitization of other collections through the internet archive?
2020-02-11 23.35 < Nemo_bis> I mean mainly integration of Internet Archive with other digital libraries.
2020-02-11 23.35 < ningu> ah ok, yeah
2020-02-11 23.35 < ningu> they do a lot of stuff
2020-02-11 23.35 < Nemo_bis> The biggest is https://www.biodiversitylibrary.org/ , are you by any chance using the same software?
2020-02-11 23.35 < ningu> I don't know all of it :)
2020-02-11 23.35 < ningu> no
2020-02-11 23.36 < ningu> palmleaf.org is basically just mediawiki plus the ArchiveLeaf extension
2020-02-11 23.36 < ningu> which we made to give an interface for viewing leaves and transcribing them
2020-02-11 23.36 < ningu> the initial goal wasn't actually to integrate with wikipedia at all, but since they chose mediawiki it makes it a lot more possible
2020-02-11 23.36 < ningu> that's a more recent idea
2020-02-11 23.37 < Nemo_bis> fyi https://www.mediawiki.org/wiki/Wikipmediawiki
2020-02-11 23.37 < Nemo_bis> We've discussed all this stuff so many times that I'm struggling to choose which URL to link. :)
2020-02-11 23.38 < ningu> thanks. I think I mostly know those details at this point
2020-02-11 23.38 < ningu> haha
2020-02-11 23.39 < ningu> I am not any sort of mediawiki expert but I've had to read a fair amount of the code to get the extension working as I wanted -- found that easier than documentation at times
2020-02-11 23.39 < Nemo_bis> Anyway, one of the earlier projects was https://phabricator.wikimedia.org/T59813 / https://www.mediawiki.org/wiki/Google_Books,_Internet_Archive,_Commons_upload_cycle
2020-02-11 23.39 < ningu> another archive project is the thing where they've been converting dead links to wayback machine
2020-02-11 23.39 < ningu> I mean throughout wikipedia
2020-02-11 23.40 < ningu> ok, that thing you linked to is definitely relevant
2020-02-11 23.40 < ningu> similar import idea
2020-02-11 23.40 < Nemo_bis> One of the traditional requests/projects is https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2017/Wikisource/Improve_workflow_for_uploading_books_to_Wikisource
2020-02-11 23.41 < Nemo_bis> It's not just Wikipedia, it's all Wikimedia sites. See https://www.mediawiki.org/wiki/Archived_Pages and https://meta.wikimedia.org/wiki/InternetArchiveBot
2020-02-11 23.41 < ningu> so would the idea be, instead of importing scanned images as media into the mediawiki instance, to import the internet archive items into wikimedia commons? would that make more sense for a workflow into wikisource?
2020-02-11 23.42 < ningu> right now our importer retrieves images from the internet archive item and uploads them into mediawiki
2020-02-11 23.43 < Nemo_bis> That doesn't sound very efficient
2020-02-11 23.43 < Nemo_bis> You can probably reuse the ia-upload system
2020-02-11 23.43 < Nemo_bis> Some people are working on https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2019/Multimedia_and_Commons/Improve_the_PDF/book_reader / https://meta.wikimedia.org/wiki/Indic-TechCom/Tools/BookReader
2020-02-11 23.43 < Nemo_bis> (If you want to reuse files from archive.org, using the same book reader can help.)
2020-02-11 23.44 < Nemo_bis> I can't remember when was the last time we deployed a major new functionality to Wikisource viewers, possibly https://www.mediawiki.org/wiki/Extension:Score
2020-02-11 23.45 < ningu> hmm
2020-02-11 23.45 < ningu> ok, all this is really interesting to know about
2020-02-11 23.45 < ningu> I agree that ia-upload looks perfect
2020-02-11 23.45 < Nemo_bis> Well "perfect" sounds a bit excessive. :D
2020-02-11 23.45 < Nemo_bis> But if the same workflow happens to work for you, you can already start hosting stuff on Commons and embed them from there.
2020-02-11 23.46 < Nemo_bis> That will bring the two communities together and simplify any future cooperation.
2020-02-11 23.47 < ningu> Nemo_bis: so one thing to keep in mind here is, the palmleaf.org workflow focuses on people transcribing text (what OCR would do if it worked for balinese) and to break it into manageable chunks, they do it leaf by leaf
2020-02-11 23.48 < ningu> so the wiki page ends up looking like https://palmleaf.org/wiki/carcan-kucing
2020-02-11 23.48 < ningu> with one section per page
2020-02-11 23.48 < Nemo_bis> ningu: yes, that's the purpose of Wikisource as well (ProofreadPage extension)
2020-02-11 23.48 < ningu> ah right, I looked at that a long time ago
2020-02-11 23.48 < Nemo_bis> What I don't understand is what generates the transliteration
2020-02-11 23.48 < Nemo_bis> ProofreadPage is mostly for multi-page documents, most of its functionality may be less relevant for you
2020-02-11 23.49 < ningu> I can't remember right now why I didn't use ProofreadPage. it seemed like more trouble than it was worth but I can't remember why
2020-02-11 23.49 < Nemo_bis> I suppose archive.org doesn't embed any OCR yet
2020-02-11 23.49 < ningu> the issue with OCR is this is Balinese script and there is no OCR for it yet
2020-02-11 23.49 < ningu> to develop OCR, you need a manually transcribed corpus ... which is what they're making on palmleaf.org by the way :)
2020-02-11 23.49 < Nemo_bis> This would actually be a nice case for a Wikisource-specific OCR because it's relatively easy to add new OCR to tesseract while I doubt ABBYY is especially interesting
2020-02-11 23.50 < ningu> it may be a challenge, it's hand-written and all, but it would be fun for someone to give it a go
2020-02-11 23.50 < ningu> yeah it's totally possible, I dunno how well it would work
2020-02-11 23.50 < ningu> the transliteration uses this little thing I wrote: https://github.com/longnow/icu-transliterator-service
2020-02-11 23.51 < ningu> we developed our own ICU rule-based transliterator using the language for writing the rules, and we run it on the palmleaf.org server. it's also exposed via an API method so front-end code can use it
2020-02-11 23.51 < Nemo_bis> The rules are already in ICU?
2020-02-11 23.51 < ningu> the code to interpret the rules is in ICU
2020-02-11 23.52 < ningu> we could try to get it into ICU, I guess, although it's sort of a work in progress and it doesn't conform to any standard (to the extent there is one, which is not much)
2020-02-11 23.52 < Nemo_bis> Alright. We don't really use either, we have our own. But it's not that hard if you already have field-tested rules.
2020-02-11 23.52 < Nemo_bis> LongNow, so I suppose you know SJ?
2020-02-11 23.52 < ningu> SJ?
2020-02-11 23.52 < ningu> maybe
2020-02-11 23.53 < Nemo_bis> Our lovely documentation for languageconverter is at https://meta.wikimedia.org/wiki/Wikipedias_in_multiple_writing_systems
2020-02-11 23.54 < ningu> the transliteration rules try to accomplish two somewhat conflicting things at once: (1) as much as possible unambiguously distinguish all balinese characters in latin using diacritics etc., (2) make the latin readable by balinese people even if they ignore the diacritics entirely, according to the latin orthography they are used to
2020-02-11 23.54 < ningu> who/what is SJ?
2020-02-11 23.55 < Nemo_bis> https://blogs.harvard.edu/sj/
2020-02-11 23.55 < ningu> because of the above conflicting goals, the transliteration rules don't perfectly conform to either balinese latin orthography (which leaves out a ton of distinctions) or scholarly transliteration (which tends to be derived from other indic scripts and is hard for untrained balinese people to interpret)
2020-02-11 23.55 < Nemo_bis> In the interest of fairness, Wikisource is not the only and possibly not the best transcription system out there although it's one of the few in free software. The Smithsonian promised me to release theirs as well https://en.wikisource.org/w/index.php?title=Wikisource:Scriptorium&diff=prev&oldid=9543219
2020-02-11 23.56 < Nemo_bis> ningu: ok, that's the perfect setup for a neverending fight on whether to merge the languageconverter or not
2020-02-11 23.56 < ningu> haha
2020-02-11 23.56 < ningu> well another option is two alternative systems
2020-02-11 23.56 < ningu> I guess
2020-02-11 23.56 < Nemo_bis> Sure, it's easy to support that
2020-02-11 23.57 < Nemo_bis> But the writing system really needs to be standardised already, we don't make up languages or scripts or orthographies at Wikimedia.
2020-02-11 23.57 < Nemo_bis> (Or we try not to. Sometimes it's hard to keep the bar firm.)
2020-02-11 23.57 < ningu> so, that's a good point but I'd argue this case is a little different from what you're describing
2020-02-11 23.58 < ningu> the transliteration is not meant to be the primary way anyone writes the languages in question
2020-02-11 23.58 < ningu> the writing system is the balinese script, this is just a way to read it if you don't know it
2020-02-11 23.58 < ningu> the balinese script side is standardized, at least in terms of common practice, plus there are works written on how to spell things etc
2020-02-11 23.59 < Nemo_bis> Which is why we have https://translatewiki.net/wiki/Portal:Ban already
2020-02-11 23.59 < ningu> the problem with using balinese latin orthography is twofold I guess: (1) it is no longer nearly as good of a guide for people using the transliteration as a crib to learn the original, (2) the original works are not all in balinese in the first place but rather several languages, so using balinese spelling is inappropriate
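
The second transliteration goal mentioned in the log, Latin output that stays readable when the diacritics are ignored, can be illustrated with a small sketch. This is not the palmleaf.org transliterator or its ICU rules. It only shows how a diacritic-rich Latin string reduces to a plain one by Unicode decomposition. The function name and the sample word are chosen for illustration.

```python
import unicodedata


def fold_diacritics(text):
    """Drop combining marks so that e.g. 't with dot below' reads as plain 't'."""
    # NFD splits precomposed letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining diacritics.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


# Illustrative input only; not taken from the project's transliteration output.
print(fold_diacritics("bha\u1e6d\u0101ra"))  # prints "bhatara"
```

If the transliteration meets that goal, the folded form is close to what a reader used to everyday Latin spelling would expect, while the unfolded form still distinguishes the underlying Balinese characters.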

For the Balinese and Bali script support see https://wikisource.org/w/index.php?title=Wikisource:Scriptorium&oldid=772813#Balinese_Wikisource . Bali and Indonesian users have already helped me identify some things we should do in any case; at least one of them I'll do right away.
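
On the script-support side, the log describes applying the webfont to the Balinese code block regardless of language, so that text need not be language-tagged. A minimal sketch of that idea, assuming only the Unicode Balinese block range U+1B00 to U+1B7F, is shown below. The function names are hypothetical, and this is not code from ArchiveLeaf or UniversalLanguageSelector.

```python
from itertools import groupby

# Unicode "Balinese" block: U+1B00 through U+1B7F.
BALINESE_BLOCK = range(0x1B00, 0x1B80)


def is_balinese(ch):
    """True if the character's code point lies in the Balinese block."""
    return ord(ch) in BALINESE_BLOCK


def contains_balinese(text):
    """True if any character of the text is in the Balinese block."""
    return any(is_balinese(ch) for ch in text)


def split_runs(text):
    """Split text into (is_balinese, substring) runs, in order."""
    return [(flag, "".join(chars)) for flag, chars in groupby(text, key=is_balinese)]


sample = "abc\u1b13\u1b38 x"
print(contains_balinese(sample))
print(split_runs(sample))
```

A renderer could use the runs to decide which spans need the Balinese font, independent of whether the underlying language is Balinese or another language written in the same script.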