Paste P10418

#wikimedia-tech 2020-02-{10,11} palmleaf

Authored by Nemo_bis on Feb 16 2020, 8:56 AM.
2020-02-10 23.11 < ningu> hi, I have some general questions about getting a mediawiki extension approved for use on wikimedia projects and how to make sure the extension plays nice and does things the way people expect/want them to be done
2020-02-10 23.11 < ningu> hopefully this channel is better than #wikimedia where I just asked?
2020-02-10 23.12 < mutante> ningu: hi. there are probably too many channels but it's getting closer
2020-02-10 23.12 < ningu> haha ok
2020-02-10 23.12 < DSquirrelGM> maybe ... but idk for sure
2020-02-10 23.13 < mutante> ningu: one way to do it would be to create a ticket that says "deploy extension XY to production" and explain there why you think it would be good to have
2020-02-10 23.13 < ningu> mutante: so I know that the balinese wikipedia committee wants this
2020-02-10 23.14 < ningu> but they don't know anything about the technical implementation
2020-02-10 23.14 < mutante> assuming by wikimedia projects you mean the main projects like Wikipedia, Wikidata, Wiktionary... and not just cloud or tools
2020-02-10 23.14 < mutante> i see
2020-02-10 23.14 < ningu> it will most likely be wikisource, possibly just wikipedia
2020-02-10 23.14 < andre__> ningu: which extension is this about?
2020-02-10 23.14 < ningu> andre__: not released through regular channels yet but code is here: https://github.com/internetarchive/mediawiki-extension-archive-leaf
2020-02-10 23.15 < mutante> ningu: have you ever used gerrit and made it a mediawiki config change before? that's another way to suggest the actual code change and add reviewers on it .. and maybe combined with a mailing list post that points to it
2020-02-10 23.15 < andre__> ningu: Does that mean this extension is NOT on any Wikimedia site yet?
2020-02-10 23.15 < ningu> I am not really a mediawiki developer or even php developer. so I suspect it needs to be cleaned up but that's fine.
2020-02-10 23.15 < ningu> andre__: correct, it's on palmleaf.org which is an independent mediawiki install
2020-02-10 23.15 < andre__> ningu: In that case, see https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment
2020-02-10 23.15 < ningu> the initial conception was for this just to be a separate site, but now the balinese community is interested in moving it to their wikipedia or wikisource (wikisource doesn't exist yet but can be created)
2020-02-10 23.15 < ningu> andre__: I've read all that, I have more specific questions about this extension though
2020-02-10 23.16 < ningu> since it isn't really a "normal" extension in some ways
2020-02-10 23.16 < Reedy> Yeah...
2020-02-10 23.16 < Reedy> Being a react app and stuff.. And needing a specific skin?
2020-02-10 23.16 < ningu> Reedy: I am already eliminating the skin requirement, thankfully
2020-02-10 23.16 < ningu> the react app is essential though
2020-02-10 23.17 < ningu> basically it makes a workflow for balinese people to import palm-leaf manuscripts stored at archive.org and transcribe and translate them
2020-02-10 23.17 < ningu> the react app lets them view the image of the palm leaf above and type below
2020-02-10 23.17 < ningu> it also injects webfonts that allow display of balinese text, and an on-screen keyboard for balinese script for people who don't have a layout
2020-02-10 23.18 < Reedy> WMF deploy an extension for webfonts (UniversalLanguageSelector)
2020-02-10 23.18 < Reedy> So not re-inventing the wheel is nice where possible
2020-02-10 23.18 < ningu> Reedy: I looked into that, one sec, I'll try to remember why it wasn't ok
2020-02-10 23.18 < Reedy> File bugs :)
2020-02-10 23.19 < Reedy> For the react part... I guess it would need turning into a "service" to have any chance of being deployed
2020-02-10 23.19 < mutante> it could possibly start as a tool in cloud aka 'labs' and then ask to be moved to prod in a second step
2020-02-10 23.20 < ningu> what does being a "service" entail?
2020-02-10 23.20 < andre__> right, the current code loads webfonts from 3rd party websites like https://bali.panlex.org/transcriber/fonts/ which would be a privacy no-go
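[Editor's sketch] The third-party font concern above can be checked mechanically. The following is a minimal sketch, assuming the extension's stylesheets are available as text; the function name `third_party_urls`, the sample CSS, and the host `fonts.example.org` are all hypothetical and not taken from the ArchiveLeaf code.

```python
import re
from urllib.parse import urlparse

# Matches url(...) references in CSS, with or without quotes.
URL_RE = re.compile(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)")


def third_party_urls(css, allowed_hosts):
    """Return URLs in the CSS whose host is not in allowed_hosts."""
    found = []
    for match in URL_RE.finditer(css):
        url = match.group(1).strip()
        host = urlparse(url).hostname
        # Relative URLs have no host and are served from the same origin.
        if host is not None and host not in allowed_hosts:
            found.append(url)
    return found


# Hypothetical stylesheet: one font loaded from another site, one local.
SAMPLE_CSS = """
@font-face {
  font-family: "ExampleBalinese";
  src: url("https://fonts.example.org/example.woff2") format("woff2"),
       url(/local/fonts/example.woff) format("woff");
}
"""

if __name__ == "__main__":
    for url in third_party_urls(SAMPLE_CSS, {"palmleaf.org"}):
        print("third-party resource:", url)
```

Any URL this reports would have to be re-hosted or bundled before the stylesheet could be considered for a Wikimedia site.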
2020-02-10 23.20 < mutante> tool would mean something in wmflabs.org domain
2020-02-10 23.21 < ningu> andre__: sure, we can host the fonts elsewhere
2020-02-10 23.21 < ningu> problem is these fonts didn't exist and we had to create them
2020-02-10 23.21 < ningu> not sure what would be better
2020-02-10 23.21 < Reedy> If you could work out what was wrong with ULS..
2020-02-10 23.21 < mutante> we would have to build Debian packages that install the fonts
2020-02-10 23.21 < Reedy> mutante: Or bundle them in ULS, which does a lot of shit like that :)
2020-02-10 23.22 < ningu> Reedy: so part of the issue here is, the text is in multiple languages but in only one script (writing system)
2020-02-10 23.22 < ningu> using webfonts that always get pulled in for the balinese code block seemed easiest
2020-02-10 23.22 < mutante> heh, ok. i just know we also install a bunch of fonts from packages on appservers
2020-02-10 23.23 < ningu> and then you also don't have to tag which language each bit of text is or assume one language for the whole site
2020-02-10 23.23 < ningu> plus we'd have to fork ULS and add our fonts
2020-02-10 23.23 < ningu> but the whole site as currently conceived only has one special font need, so ULS didn't solve any problem for us
2020-02-10 23.23 < ningu> whole site = palmleaf.org
2020-02-10 23.24 < ningu> it was easier to write 5 lines of CSS
2020-02-10 23.24 < ningu> I don't know what the "right" solution is for wikimedia level
2020-02-10 23.24 < Reedy> You wouldn't need to fork ULS... Submit a patch/ticket to upstream to include them
2020-02-10 23.24 < Reedy> As long as they have an appropriate license, they should be includeable
2020-02-10 23.25 < ningu> Reedy: fair enough but we didn't have time to worry about that at the time (weren't being paid for it basically and short deadlines)
2020-02-10 23.25 < Reedy> sure :)
2020-02-10 23.25 < ningu> now the people funding this may be more willing
2020-02-10 23.25 < ningu> license will be fine
2020-02-10 23.25 < ningu> I think
2020-02-10 23.25 < Reedy> the TLDR is basically, it's possible, but it's not simple to get a complex extension deployed to wmf wikis
2020-02-10 23.25 < ningu> iirc the font author licensed them with CC-BY-NC-ND
2020-02-10 23.26 < ningu> I dunno if NC or ND is a problem
2020-02-10 23.27 < Reedy> Not sure either... But at least being a CC license should be a reasonable start
2020-02-10 23.27 < ningu> Reedy: yeah that's understandable. I've thought of another way being to have a bot that periodically copies stuff from palmleaf.org to wikisource, which might get around all this, if it just means balinese wikisource has to approve the bot
2020-02-10 23.27 < mutante> for content it would be .. for a font that is an interesting question.
2020-02-10 23.27 < ningu> but it makes it hard to simultaneously edit both wikis then
2020-02-10 23.27 < ningu> raises the question of which is primary
2020-02-10 23.27 < mutante> printing wikipedia articles and selling the books needs to be allowed though
2020-02-10 23.28 < ningu> hmmm... yeah
2020-02-10 23.28 < ningu> good point actually
2020-02-10 23.28 < ningu> and I assume you can't print the article with the font without a license to the font? I guess? haha
2020-02-10 23.28 < ningu> I mean and sell it
2020-02-10 23.28 < mutante> you could export raw text and print it in another font if there is one .. i guess
2020-02-10 23.29 < mutante> that could be an interesting one for legal
2020-02-10 23.29 < ningu> there are other balinese fonts but this is the only one that has proper opentype handling. noto is ok-ish. actually we got noto to hire the guy who did this for us to improve noto balinese, so probably sooner or later can just use that
2020-02-10 23.29 < mutante> as in "let's just ask"
2020-02-10 23.29 < ningu> yeah it's ok, it's solvable one way or another
2020-02-10 23.30 < ningu> I can always explain the issue to the designer and ask him too if he's flexible on license
2020-02-10 23.30 < mutante> maybe the easier route would be to approach the font author later and try to convince him to license it differently for use in Wikipedia
2020-02-10 23.31 < ningu> my understanding is it can't be a negotiated license just for wikimedia, right (even if no cost)?
2020-02-10 23.31 < ningu> like it's noncommercial but he explicitly allows wikimedia alone to use it commercially?
2020-02-10 23.31 < ningu> well maybe not alone, I just mean, as listed explicitly
2020-02-10 23.31 < mutante> hmm.. i don't know. that is really "not a lawyer" territory
2020-02-10 23.31 < ningu> hahaha ok
2020-02-10 23.31 < ningu> yeah I have no clue either
2020-02-10 23.32 < ningu> my group is sort of stuck in the middle here, there's a funder and the people in bali are interested in putting this stuff on wikisource/wikipedia, and our job is to do it, but if it turns out to be too hard or expensive the funder might balk anyway
2020-02-10 23.32 < mutante> the whole palm leaf thing is a very interesting project though. you should not shy away from creating one or multiple tickets / bugs about it
2020-02-10 23.32 < mutante> and see what you get from that
2020-02-10 23.32 < ningu> mutante: it's a really cool project, yeah... no one has done anything quite like it to my knowledge. the typed transcription, descriptions, etc are all from young people in bali who have studied this stuff in school
2020-02-10 23.33 < ningu> basically the perfect wiki contributors, since professors and other experts don't have time
2020-02-10 23.33 < mutante> i mean WMF specifically wants to reach underserved languages/groups/media and that's a great example
2020-02-10 23.33 < ningu> example https://palmleaf.org/wiki/carcan-kucing
2020-02-10 23.33 < ningu> we tried to get them to do short english descriptions
2020-02-10 23.34 < ningu> there isn't a lot of balinese text on the web in balinese script. palmleaf.org might have most of it at this point
2020-02-10 23.34 < ningu> I am sure the font business can be sorted one way or another but the complexity of the extension is another matter maybe
2020-02-10 23.34 < mutante> ningu: i don't understand but i know one word. kucing means cat
2020-02-10 23.34 < ningu> yeah, same as indonesian, but you can also say meong (= meow) for cat
2020-02-10 23.36 < ningu> I don't speak balinese but I speak indonesian
2020-02-10 23.37 < ningu> doesn't matter though, I just need to make the site work and talk to them :)
2020-02-10 23.38 < mutante> Reedy: ^ i was about to suggest to talk to "Community Engagement" but that is being integrated into other teams?
2020-02-10 23.40 < mutante> ningu: maybe these people would be good to talk to https://meta.wikimedia.org/wiki/Community_Programs_team
2020-02-10 23.40 < ningu> yes, a conversation would be really useful
2020-02-10 23.40 < mutante> this is a library in a way
2020-02-10 23.40 < ningu> I know folks in Bali but not at wikimedia
2020-02-10 23.41 < ningu> the people in Bali are on the committee for ban.wikipedia.org but that isn't enough for this
2020-02-10 23.43 < Reedy> Speaking to the language team probably isn't a bad idea either
2020-02-10 23.43 < ningu> I wonder if WMF is interested if they would even fund a little of our work, or at least help develop a technical plan
2020-02-10 23.44 < Reedy> Wikimedia do do grants that can be used for this sort of thing https://meta.wikimedia.org/wiki/Grants
2020-02-10 23.44 < mutante> PMing about how to create a phab ticket
2020-02-10 23.44 < mutante> so at least we have contact data and something to point to
2020-02-10 23.46 < andre__> ningu: https://meta.wikimedia.org/wiki/Grants:Project
2020-02-10 23.46 < andre__> (deadline for the current round is soon though)
2020-02-11 23.33 < Nemo_bis> ningu: This is the same palm leaf project that Brewster Kahle is super fond of, right?
2020-02-11 23.34 < ningu> Nemo_bis: yes. he is funding PanLex's work on palmleaf.org and some of the work in Bali
2020-02-11 23.34 < ningu> and he funded the initial digitization too
2020-02-11 23.34 < Nemo_bis> Right.
2020-02-11 23.35 < Nemo_bis> There were several similar projects in the last 5-10 years.
2020-02-11 23.35 < ningu> but there were a bunch of technical hurdles to get the unicode balinese working and get some kind of platform up
2020-02-11 23.35 < ningu> Nemo_bis: you mean digitization of other collections through the internet archive?
2020-02-11 23.35 < Nemo_bis> I mean mainly integration of Internet Archive with other digital libraries.
2020-02-11 23.35 < ningu> ah ok, yeah
2020-02-11 23.35 < ningu> they do a lot of stuff
2020-02-11 23.35 < Nemo_bis> The biggest is https://www.biodiversitylibrary.org/ , are you by any chance using the same software?
2020-02-11 23.35 < ningu> I don't know all of it :)
2020-02-11 23.35 < ningu> no
2020-02-11 23.36 < ningu> palmleaf.org is basically just mediawiki plus the ArchiveLeaf extension
2020-02-11 23.36 < ningu> which we made to give an interface for viewing leaves and transcribing them
2020-02-11 23.36 < ningu> the initial goal wasn't actually to integrate with wikipedia at all, but since they chose mediawiki it makes it a lot more possible
2020-02-11 23.36 < ningu> that's a more recent idea
2020-02-11 23.37 < Nemo_bis> fyi https://www.mediawiki.org/wiki/Wikipmediawiki
2020-02-11 23.37 < Nemo_bis> We've discussed all this stuff so many times that I'm struggling to choose which URL to link. :)
2020-02-11 23.38 < ningu> thanks. I think I mostly know those details at this point
2020-02-11 23.38 < ningu> haha
2020-02-11 23.39 < ningu> I am not any sort of mediawiki expert but I've had to read a fair amount of the code to get the extension working as I wanted -- found that easier than documentation at times
2020-02-11 23.39 < Nemo_bis> Anyway, one of the earlier projects was https://phabricator.wikimedia.org/T59813 / https://www.mediawiki.org/wiki/Google_Books,_Internet_Archive,_Commons_upload_cycle
2020-02-11 23.39 < ningu> another archive project is the thing where they've been converting dead links to wayback machine
2020-02-11 23.39 < ningu> I mean throughout wikipedia
2020-02-11 23.40 < ningu> ok, that thing you linked to is definitely relevant
2020-02-11 23.40 < ningu> similar import idea
2020-02-11 23.40 < Nemo_bis> One of the traditional requests/projects is https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2017/Wikisource/Improve_workflow_for_uploading_books_to_Wikisource
2020-02-11 23.41 < Nemo_bis> It's not just Wikipedia, it's all Wikimedia sites. See https://www.mediawiki.org/wiki/Archived_Pages and https://meta.wikimedia.org/wiki/InternetArchiveBot
2020-02-11 23.41 < ningu> so would the idea be, instead of importing scanned images as media into the mediawiki instance, to import the internet archive items into wikimedia commons? would that make more sense for a workflow into wikisource?
2020-02-11 23.42 < ningu> right now our importer retrieves images from the internet archive item and uploads them into mediawiki
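[Editor's sketch] The import step described above can be outlined roughly as follows. This is a sketch under stated assumptions, not the ArchiveLeaf importer itself: it assumes the item's file list comes from the Internet Archive metadata endpoint (`https://archive.org/metadata/<identifier>`), that image files can be recognised by extension, and that files are fetched from `https://archive.org/download/<identifier>/<name>`. The helper names and the sample item are hypothetical, and the upload into MediaWiki or Commons is left out.

```python
from urllib.parse import quote

# Assumption: leaf scans are identified by file extension.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".tif", ".tiff", ".jp2")


def metadata_url(identifier):
    """URL of the Internet Archive metadata record for an item."""
    return "https://archive.org/metadata/" + quote(identifier, safe="")


def image_download_urls(identifier, metadata):
    """List download URLs for the image files named in a metadata record."""
    urls = []
    for entry in metadata.get("files", []):
        name = entry.get("name", "")
        if name.lower().endswith(IMAGE_EXTENSIONS):
            urls.append(
                "https://archive.org/download/%s/%s"
                % (quote(identifier, safe=""), quote(name))
            )
    return urls


# Hypothetical, hand-written metadata record for illustration only.
SAMPLE_METADATA = {
    "files": [
        {"name": "leaf 01.jpg"},
        {"name": "item_meta.xml"},
        {"name": "scans/leaf02.JP2"},
    ]
}

if __name__ == "__main__":
    print(metadata_url("example-item"))
    for url in image_download_urls("example-item", SAMPLE_METADATA):
        print(url)
```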
2020-02-11 23.43 < Nemo_bis> That doesn't sound very efficient
2020-02-11 23.43 < Nemo_bis> You can probably reuse the ia-upload system
2020-02-11 23.43 < Nemo_bis> Some people are working on https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2019/Multimedia_and_Commons/Improve_the_PDF/book_reader / https://meta.wikimedia.org/wiki/Indic-TechCom/Tools/BookReader
2020-02-11 23.43 < Nemo_bis> (If you want to reuse files from archive.org, using the same book reader can help.)
2020-02-11 23.44 < Nemo_bis> I can't remember when was the last time we deployed a major new functionality to Wikisource viewers, possibly https://www.mediawiki.org/wiki/Extension:Score
2020-02-11 23.45 < ningu> hmm
2020-02-11 23.45 < ningu> ok, all this is really interesting to know about
2020-02-11 23.45 < ningu> I agree that ia-upload looks perfect
2020-02-11 23.45 < Nemo_bis> Well "perfect" sounds a bit excessive. :D
2020-02-11 23.45 < Nemo_bis> But if the same workflow happens to work for you, you can already start hosting stuff on Commons and embed them from there.
2020-02-11 23.46 < Nemo_bis> That will bring the two communities together and simplify any future cooperation.
2020-02-11 23.47 < ningu> Nemo_bis: so one thing to keep in mind here is, the palmleaf.org workflow focuses on people transcribing text (what OCR would do if it worked for balinese) and to break it into manageable chunks, they do it leaf by leaf
2020-02-11 23.48 < ningu> so the wiki page ends up looking like https://palmleaf.org/wiki/carcan-kucing
2020-02-11 23.48 < ningu> with one section per page
2020-02-11 23.48 < Nemo_bis> ningu: yes, that's the purpose of Wikisource as well (ProofreadPage extension)
2020-02-11 23.48 < ningu> ah right, I looked at that a long time ago
2020-02-11 23.48 < Nemo_bis> What I don't understand is what generates the transliteration
2020-02-11 23.48 < Nemo_bis> ProofreadPage is mostly for multi-page documents, most of its functionality may be less relevant for you
2020-02-11 23.49 < ningu> I can't remember right now why I didn't use ProofreadPage. it seemed like more trouble than it was worth but I can't remember why
2020-02-11 23.49 < Nemo_bis> I suppose archive.org doesn't embed any OCR yet
2020-02-11 23.49 < ningu> the issue with OCR is this is Balinese script and there is no OCR for it yet
2020-02-11 23.49 < ningu> to develop OCR, you need a manually transcribed corpus ... which is what they're making on palmleaf.org by the way :)
2020-02-11 23.49 < Nemo_bis> This would actually be a nice case for a Wikisource-specific OCR because it's relatively easy to add new OCR to tesseract while I doubt ABBYY is especially interesting
2020-02-11 23.50 < ningu> it may be a challenge, it's hand-written and all, but it would be fun for someone to give it a go
2020-02-11 23.50 < ningu> yeah it's totally possible, I dunno how well it would work
2020-02-11 23.50 < ningu> the transliteration uses this little thing I wrote: https://github.com/longnow/icu-transliterator-service
2020-02-11 23.51 < ningu> we developed our own ICU rule-based transliterator using the language for writing the rules, and we run it on the palmleaf.org server. it's also exposed via an API method so front-end code can use it
2020-02-11 23.51 < Nemo_bis> The rules are already in ICU?
2020-02-11 23.51 < ningu> the code to interpret the rules is in ICU
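[Editor's sketch] The split between "rules" and "code that interprets the rules" can be illustrated with a toy interpreter. This is a simplification in plain Python, not ICU and not the icu-transliterator-service code: rules are tried in order at each position and the first match wins. The two rules shown are hypothetical examples, not the project's actual rule set.

```python
# Hypothetical toy rules. More specific rules must come before general ones.
RULES = [
    ("\u1b13\u1b44", "k"),  # BALINESE LETTER KA + ADEG ADEG (vowel killer)
    ("\u1b13", "ka"),       # KA alone carries the inherent vowel "a"
]


def transliterate(text, rules=RULES):
    """Apply ordered rewrite rules left to right; copy unmatched characters."""
    out = []
    i = 0
    while i < len(text):
        for source, target in rules:
            if text.startswith(source, i):
                out.append(target)
                i += len(source)
                break
        else:
            # No rule matched at this position: pass the character through.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("\u1b13\u1b13\u1b44"))
```

Keeping the rules as data is what lets the same interpreter run on the server and be exposed through an API, as described above.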
2020-02-11 23.52 < ningu> we could try to get it into ICU, I guess, although it's sort of a work in progress and it doesn't conform to any standard (to the extent there is one, which is not much)
2020-02-11 23.52 < Nemo_bis> Alright. We don't really use either, we have our own. But it's not that hard if you already have field-tested rules.
2020-02-11 23.52 < Nemo_bis> LongNow, so I suppose you know SJ?
2020-02-11 23.52 < ningu> SJ?
2020-02-11 23.52 < ningu> maybe
2020-02-11 23.53 < Nemo_bis> Our lovely documentation for languageconverter is at https://meta.wikimedia.org/wiki/Wikipedias_in_multiple_writing_systems
2020-02-11 23.54 < ningu> the transliteration rules try to accomplish two somewhat conflicting things at once: (1) as much as possible unambiguously distinguish all balinese characters in latin using diacritics etc., (2) make the latin readable by balinese people even if they ignore the diacritics entirely, according to the latin orthography they are used to
2020-02-11 23.54 < ningu> who/what is SJ?
2020-02-11 23.55 < Nemo_bis> https://blogs.harvard.edu/sj/
2020-02-11 23.55 < ningu> because of the above conflicting goals, the transliteration rules don't perfectly conform to either balinese latin orthography (which leaves out a ton of distinctions) or scholarly transliteration (which tends to be derived from other indic scripts and is hard for untrained balinese people to interpret)
2020-02-11 23.55 < Nemo_bis> In the interest of fairness, Wikisource is not the only and possibly not the best transcription system out there although it's one of the few in free software. The Smithsonian promised me to release theirs as well https://en.wikisource.org/w/index.php?title=Wikisource:Scriptorium&diff=prev&oldid=9543219
2020-02-11 23.56 < Nemo_bis> ningu: ok, that's the perfect setup for a neverending fight on whether to merge the languageconverter or not
2020-02-11 23.56 < ningu> haha
2020-02-11 23.56 < ningu> well another option is two alternative systems
2020-02-11 23.56 < ningu> I guess
2020-02-11 23.56 < Nemo_bis> Sure, it's easy to support that
2020-02-11 23.57 < Nemo_bis> But the writing system really needs to be standardised already, we don't make up languages or scripts or orthographies at Wikimedia.
2020-02-11 23.57 < Nemo_bis> (Or we try not to. Sometimes it's hard to keep the bar firm.)
2020-02-11 23.57 < ningu> so, that's a good point but I'd argue this case is a little different from what you're describing
2020-02-11 23.58 < ningu> the transliteration is not meant to be the primary way anyone writes the languages in question
2020-02-11 23.58 < ningu> the writing system is the balinese script, this is just a way to read it if you don't know it
2020-02-11 23.58 < ningu> the balinese script side is standardized, at least in terms of common practice, plus there are works written on how to spell things etc
2020-02-11 23.59 < Nemo_bis> Which is why we have https://translatewiki.net/wiki/Portal:Ban already
2020-02-11 23.59 < ningu> the problem with using balinese latin orthography is twofold I guess: (1) it is no longer nearly as good of a guide for people using the transliteration as a crib to learn the original, (2) the original works are not all in balinese in the first place but rather several languages, so using balinese spelling is inappropriate