2020-02-10 23.11 < ningu> hi, I have some general questions about getting a mediawiki extension approved for use on wikimedia projects and how to make sure the extension plays nice and does things the way people expect/want them to be done
2020-02-10 23.11 < ningu> hopefully this channel is better than #wikimedia where I just asked?
2020-02-10 23.12 < mutante> ningu: hi. there are probably too many channels but it's getting closer
2020-02-10 23.12 < ningu> haha ok
2020-02-10 23.12 < DSquirrelGM> maybe ... but idk for sure
2020-02-10 23.13 < mutante> ningu: one way to do it would be to create a ticket that says "deploy extension XY to production" and explain there why you think it would be good to have
2020-02-10 23.13 < ningu> mutante: so I know that the balinese wikipedia committee wants this
2020-02-10 23.14 < ningu> but they don't know anything about the technical implementation
2020-02-10 23.14 < mutante> assuming by wikimedia projects you mean the main projects like Wikipedia, Wikidata, Wiktionary... and not just cloud or tools
2020-02-10 23.14 < mutante> i see
2020-02-10 23.14 < ningu> it will most likely be wikisource, possibly just wikipedia
2020-02-10 23.14 < andre__> ningu: which extension is this about?
2020-02-10 23.14 < ningu> andre__: not released through regular channels yet but code is here: https://github.com/internetarchive/mediawiki-extension-archive-leaf
2020-02-10 23.15 < mutante> ningu: have you ever used gerrit and made it a mediawiki config change before? that's another way to suggest the actual code change and add reviewers on it .. and maybe combined with a mailing list post that points to it
2020-02-10 23.15 < andre__> ningu: Does that mean this extension is NOT on any Wikimedia site yet?
2020-02-10 23.15 < ningu> I am not really a mediawiki developer or even php developer. so I suspect it needs to be cleaned up but that's fine.
2020-02-10 23.15 < ningu> andre__: correct, it's on palmleaf.org which is an independent mediawiki install
2020-02-10 23.15 < andre__> ningu: In that case, see https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment
2020-02-10 23.15 < ningu> the initial conception was for this just to be a separate site, but now the balinese community is interested in moving it to their wikipedia or wikisource (wikisource doesn't exist yet but can be created)
2020-02-10 23.15 < ningu> andre__: I've read all that, I have more specific questions about this extension though
2020-02-10 23.16 < ningu> since it isn't really a "normal" extension in some ways
2020-02-10 23.16 < Reedy> Yeah...
2020-02-10 23.16 < Reedy> Being a react app and stuff.. And needing a specific skin?
2020-02-10 23.16 < ningu> Reedy: I am already eliminating the skin requirement, thankfully
2020-02-10 23.16 < ningu> the react app is essential though
2020-02-10 23.17 < ningu> basically it makes a workflow for balinese people to import palm-leaf manuscripts stored at archive.org and transcribe and translate them
2020-02-10 23.17 < ningu> the react app lets them view the image of the palm leaf above and type below
2020-02-10 23.17 < ningu> it also injects webfonts that allow display of balinese text, and an on-screen keyboard for balinese script for people who don't have a layout
2020-02-10 23.18 < Reedy> WMF deploy an extension for webfonts (UniversalLanguageSelector)
2020-02-10 23.18 < Reedy> So not re-inventing the wheel is nice where possible
2020-02-10 23.18 < ningu> Reedy: I looked into that, one sec, I'll try to remember why it wasn't ok
2020-02-10 23.18 < Reedy> File bugs :)
2020-02-10 23.19 < Reedy> For the react part... I guess it would need turning into a "service" to have any chance of being deployed
2020-02-10 23.19 < mutante> it could possibly start as a tool in cloud aka 'labs' and then ask to be moved to prod in a second step
2020-02-10 23.20 < ningu> what does being a "service" entail?
2020-02-10 23.20 < andre__> right, the current code loads webfonts from 3rd party websites like https://bali.panlex.org/transcriber/fonts/ which would be a privacy no-go
2020-02-10 23.20 < mutante> tool would mean something in wmflabs.org domain
2020-02-10 23.21 < ningu> andre__: sure, we can host the fonts elsewhere
2020-02-10 23.21 < ningu> problem is these fonts didn't exist and we had to create them
2020-02-10 23.21 < ningu> not sure what would be better
2020-02-10 23.21 < Reedy> If you could work out what was wrong with ULS..
2020-02-10 23.21 < mutante> we would have to build Debian packages that install the fonts
2020-02-10 23.21 < Reedy> mutante: Or bundle them in ULS, which does a lot of shit like that :)
2020-02-10 23.22 < ningu> Reedy: so part of the issue here is, the text is in multiple languages but in only one script (writing system)
2020-02-10 23.22 < ningu> using webfonts that always get pulled in for the balinese code block seemed easiest
2020-02-10 23.22 < mutante> heh, ok. i just know we also install a bunch of fonts from packages on appservers
2020-02-10 23.23 < ningu> and then you also don't have to tag which language each bit of text is or assume one language for the whole site
2020-02-10 23.23 < ningu> plus we'd have to fork ULS and add our fonts
2020-02-10 23.23 < ningu> but the whole site as currently conceived only has one special font need, so ULS didn't solve any problem for us
2020-02-10 23.23 < ningu> whole site = palmleaf.org
2020-02-10 23.24 < ningu> it was easier to write 5 lines of CSS
2020-02-10 23.24 < ningu> I don't know what the "right" solution is for wikimedia level
2020-02-10 23.24 < Reedy> You wouldn't need to fork ULS... Submit a patch/ticket to upstream to include them
2020-02-10 23.24 < Reedy> As long as they have an appropriate license, they should be includeable
2020-02-10 23.25 < ningu> Reedy: fair enough but we didn't have time to worry about that at the time (weren't being paid for it basically and short deadlines)
2020-02-10 23.25 < Reedy> sure :)
2020-02-10 23.25 < ningu> now the people funding this may be more willing
2020-02-10 23.25 < ningu> license will be fine
2020-02-10 23.25 < ningu> I think
2020-02-10 23.25 < Reedy> the TLDR is basically, it's possible, but it's not simple to get a complex extension deployed to wmf wikis
2020-02-10 23.25 < ningu> iirc the font author licensed them with CC-BY-NC-ND
2020-02-10 23.26 < ningu> I dunno if NC or ND is a problem
2020-02-10 23.27 < Reedy> Not sure either... But at least being a CC license should be a reasonable start
2020-02-10 23.27 < ningu> Reedy: yeah that's understandable. I've thought of another way being to have a bot that periodically copies stuff from palmleaf.org to wikisource, which might get around all this, if it just means balinese wikisource has to approve the bot
2020-02-10 23.27 < mutante> for content it would be .. for a font that is an interesting question.
2020-02-10 23.27 < ningu> but it makes it hard to simultaneously edit both wikis then
2020-02-10 23.27 < ningu> raises the question of which is primary
2020-02-10 23.27 < mutante> printing wikipedia articles and selling the books needs to be allowed though
2020-02-10 23.28 < ningu> hmmm... yeah
2020-02-10 23.28 < ningu> good point actually
2020-02-10 23.28 < ningu> and I assume you can't print the article with the font without a license to the font? I guess? haha
2020-02-10 23.28 < ningu> I mean and sell it
2020-02-10 23.28 < mutante> you could export raw text and print it in another font if there is one .. i guess
2020-02-10 23.29 < mutante> that could be an interesting one for legal
2020-02-10 23.29 < ningu> there are other balinese fonts but this is the only one that has proper opentype handling. noto is ok-ish. actually we got noto to hire the guy who did this for us to improve noto balinese, so probably sooner or later can just use that
2020-02-10 23.29 < mutante> as in "let's just ask"
2020-02-10 23.29 < ningu> yeah it's ok, it's solvable one way or another
2020-02-10 23.30 < ningu> I can always explain the issue to the designer and ask him too if he's flexible on license
2020-02-10 23.30 < mutante> maybe the easier route would be to approach the font author later and try to convince him to license it differently for use in Wikipedia
2020-02-10 23.30 < ningu> my understanding is it can't be a negotiated license just for wikimedia, right (even if no cost)?
2020-02-10 23.31 < ningu> like it's noncommercial but he explicitly allows wikimedia alone to use it commercially?
2020-02-10 23.31 < ningu> well maybe not alone, I just mean, as listed explicitly
2020-02-10 23.31 < mutante> hmm.. i don't know. that is really "not a lawyer" territory
2020-02-10 23.31 < ningu> hahaha ok
2020-02-10 23.31 < ningu> yeah I have no clue either
2020-02-10 23.32 < ningu> my group is sort of stuck in the middle here, there's a funder and the people in bali are interested in putting this stuff on wikisource/wikipedia, and our job is to do it, but if it turns out to be too hard or expensive the funder might balk anyway
2020-02-10 23.32 < mutante> the whole palm leaf thing is a very interesting project though. you should not shy away from creating one or multiple tickets / bugs about it
2020-02-10 23.32 < mutante> and see what you get from that
2020-02-10 23.32 < ningu> mutante: it's a really cool project, yeah... no one has done anything quite like it to my knowledge. the typed transcription, descriptions, etc are all from young people in bali who have studied this stuff in school
2020-02-10 23.33 < ningu> basically the perfect wiki contributors, since professors and other experts don't have time
2020-02-10 23.33 < mutante> i mean WMF specifically wants to reach underserved languages/groups/media and that's a great example
2020-02-10 23.33 < ningu> example https://palmleaf.org/wiki/carcan-kucing
2020-02-10 23.33 < ningu> we tried to get them to do short english descriptions
2020-02-10 23.34 < ningu> there isn't a lot of balinese text on the web in balinese script. palmleaf.org might have most of it at this point
2020-02-10 23.34 < ningu> I am sure the font business can be sorted one way or another but the complexity of the extension is another matter maybe
2020-02-10 23.34 < mutante> ningu: i don't understand but i know one word. kucing means cat
2020-02-10 23.34 < ningu> yeah, same as indonesian, but you can also say meong (= meow) for cat
2020-02-10 23.36 < ningu> I don't speak balinese but I speak indonesian
2020-02-10 23.37 < ningu> doesn't matter though, I just need to make the site work and talk to them :)
2020-02-10 23.38 < mutante> Reedy: ^ i was about to suggest to talk to "Community Engagement" but that is being integrated into other teams?
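
Note on the webfont approach ningu describes above (a font applied to everything in the Balinese code block, with no per-language tagging): the idea is to choose the font by script rather than by language. A minimal Python sketch of that selection logic follows. The block range U+1B00 to U+1B7F comes from the Unicode standard; the function names and font labels are hypothetical, and this is not the extension's actual code, which per the log does this in a few lines of CSS.

```python
from itertools import groupby

# Balinese Unicode block (range per the Unicode standard; constant names are illustrative).
BALINESE_START = 0x1B00
BALINESE_END = 0x1B7F


def is_balinese(ch: str) -> bool:
    """Return True if the character falls inside the Balinese Unicode block."""
    return BALINESE_START <= ord(ch) <= BALINESE_END


def split_by_script(text: str) -> list[tuple[bool, str]]:
    """Split text into consecutive runs, each tagged True (Balinese block) or False."""
    return [(flag, "".join(chars)) for flag, chars in groupby(text, key=is_balinese)]


if __name__ == "__main__":
    # Mixed Latin and Balinese-script sample; the font labels are placeholders.
    sample = "kucing \u1b13\u1b38 cat"
    for balinese, run in split_by_script(sample):
        print("balinese-font" if balinese else "default-font", repr(run))
```

Because the split depends only on code points, the same logic applies whichever language a given passage is in, which is the property ningu was after.
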
2020-02-10 23.40 < mutante> ningu: maybe these people would be good to talk to https://meta.wikimedia.org/wiki/Community_Programs_team
2020-02-10 23.40 < ningu> yes, a conversation would be really useful
2020-02-10 23.40 < mutante> this is a library in a way
2020-02-10 23.40 < ningu> I know folks in Bali but not at wikimedia
2020-02-10 23.41 < ningu> the people in Bali are on the committee for ban.wikipedia.org but that isn't enough for this
2020-02-10 23.43 < Reedy> Speaking to the language team probably isn't a bad idea either
2020-02-10 23.43 < ningu> I wonder if WMF is interested if they would even fund a little of our work, or at least help develop a technical plan
2020-02-10 23.44 < Reedy> Wikimedia do do grants that can be used for this sort of things https://meta.wikimedia.org/wiki/Grants
2020-02-10 23.44 < mutante> PMing about how to create a phab ticket
2020-02-10 23.44 < mutante> so at least we have contact data and something to point to
2020-02-10 23.46 < andre__> ningu: https://meta.wikimedia.org/wiki/Grants:Project
2020-02-10 23.46 < andre__> (deadline for the current round is soon though)
2020-02-11 23.33 < Nemo_bis> ningu: This is the same palm leaf project that Brewster Kahle is super fond of, right?
2020-02-11 23.34 < ningu> Nemo_bis: yes. he is funding PanLex's work on palmleaf.org and some of the work in Bali
2020-02-11 23.34 < ningu> and he funded the initial digitization too
2020-02-11 23.34 < Nemo_bis> Right.
2020-02-11 23.35 < Nemo_bis> There were several similar projects in the last 5-10 years.
2020-02-11 23.35 < ningu> but there were a bunch of technical hurdles to get the unicode balinese working and get some kind of platform up
2020-02-11 23.35 < ningu> Nemo_bis: you mean digitization of other collections through the internet archive?
2020-02-11 23.35 < Nemo_bis> I mean mainly integration of Internet Archive with other digital libraries.
2020-02-11 23.35 < ningu> ah ok, yeah
2020-02-11 23.35 < ningu> they do a lot of stuff
2020-02-11 23.35 < Nemo_bis> The biggest is https://www.biodiversitylibrary.org/ , are you by any chance using the same software?
2020-02-11 23.35 < ningu> I don't know all of it :)
2020-02-11 23.35 < ningu> no
2020-02-11 23.36 < ningu> palmleaf.org is basically just mediawiki plus the ArchiveLeaf extension
2020-02-11 23.36 < ningu> which we made to give an interface for viewing leaves and transcribing them
2020-02-11 23.36 < ningu> the initial goal wasn't actually to integrate with wikipedia at all, but since they chose mediawiki it makes it a lot more possible
2020-02-11 23.36 < ningu> that's a more recent idea
2020-02-11 23.37 < Nemo_bis> fyi https://www.mediawiki.org/wiki/Wikipmediawiki
2020-02-11 23.37 < Nemo_bis> We've discussed all this stuff so many times that I'm struggling to choose which URL to link. :)
2020-02-11 23.38 < ningu> thanks. I think I mostly know those details at this point
2020-02-11 23.38 < ningu> haha
2020-02-11 23.39 < ningu> I am not any sort of mediawiki expert but I've had to read a fair amount of the code to get the extension working as I wanted -- found that easier than documentation at times
2020-02-11 23.39 < Nemo_bis> Anyway, one of the earlier projects was https://phabricator.wikimedia.org/T59813 / https://www.mediawiki.org/wiki/Google_Books,_Internet_Archive,_Commons_upload_cycle
2020-02-11 23.39 < ningu> another archive project is the thing where they've been converting dead links to wayback machine
2020-02-11 23.39 < ningu> I mean throughout wikipedia
2020-02-11 23.40 < ningu> ok, that thing you linked to is definitely relevant
2020-02-11 23.40 < ningu> similar import idea
2020-02-11 23.40 < Nemo_bis> One of the traditional requests/projects is https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2017/Wikisource/Improve_workflow_for_uploading_books_to_Wikisource
2020-02-11 23.41 < Nemo_bis> It's not just Wikipedia, it's all Wikimedia sites. See https://www.mediawiki.org/wiki/Archived_Pages and https://meta.wikimedia.org/wiki/InternetArchiveBot
2020-02-11 23.41 < ningu> so would the idea be, instead of importing scanned images as media into the mediawiki instance, to import the internet archive items into wikimedia commons? would that make more sense for a workflow into wikisource?
2020-02-11 23.42 < ningu> right now our importer retrieves images from the internet archive item and uploads them into mediawiki
2020-02-11 23.43 < Nemo_bis> That doesn't sound very efficient
2020-02-11 23.43 < Nemo_bis> You can probably reuse the ia-upload system
2020-02-11 23.43 < Nemo_bis> Some people are working on https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2019/Multimedia_and_Commons/Improve_the_PDF/book_reader / https://meta.wikimedia.org/wiki/Indic-TechCom/Tools/BookReader
2020-02-11 23.43 < Nemo_bis> (If you want to reuse files from archive.org, using the same book reader can help.)
2020-02-11 23.44 < Nemo_bis> I can't remember when was the last time we deployed a major new functionality to Wikisource viewers, possibly https://www.mediawiki.org/wiki/Extension:Score
2020-02-11 23.45 < ningu> hmm
2020-02-11 23.45 < ningu> ok, all this is really interesting to know about
2020-02-11 23.45 < ningu> I agree that ia-upload looks perfect
2020-02-11 23.45 < Nemo_bis> Well "perfect" sounds a bit excessive. :D
2020-02-11 23.45 < Nemo_bis> But if the same workflow happens to work for you, you can already start hosting stuff on Commons and embed them from there.
2020-02-11 23.46 < Nemo_bis> That will bring the two communities together and simplify any future cooperation.
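
Note on the import step ningu mentions above (the importer "retrieves images from the internet archive item"): a rough Python sketch of how the image files of an item could be listed, assuming the item exposes its leaves as individual image files. It uses the public archive.org metadata and download URL patterns; the helper names, the suffix list, and the item identifier in the usage note are hypothetical. This is not the ArchiveLeaf importer or the ia-upload tool, and the upload into MediaWiki or Commons is left out.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

METADATA_URL = "https://archive.org/metadata/{identifier}"
DOWNLOAD_URL = "https://archive.org/download/{identifier}/{name}"
# Assumed set of image suffixes; real items may package leaves differently (e.g. in archives).
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".tif", ".tiff", ".jp2")


def fetch_metadata(identifier: str) -> dict:
    """Fetch the item's metadata JSON from archive.org (network call)."""
    with urlopen(METADATA_URL.format(identifier=quote(identifier)), timeout=30) as resp:
        return json.load(resp)


def image_urls(identifier: str, metadata: dict) -> list[str]:
    """Build download URLs for the image files listed in the metadata, sorted by file name."""
    names = [entry.get("name", "") for entry in metadata.get("files", [])]
    return [
        DOWNLOAD_URL.format(identifier=quote(identifier), name=quote(name))
        for name in sorted(names)
        if name.lower().endswith(IMAGE_SUFFIXES)
    ]
```

With a placeholder identifier, `image_urls("example-item", fetch_metadata("example-item"))` would yield one URL per image file; each could then be fetched and handed to whatever upload step the target wiki requires.
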
2020-02-11 23.47 < ningu> Nemo_bis: so one thing to keep in mind here is, the palmleaf.org workflow focuses on people transcribing text (what OCR would do if it worked for balinese) and to break it into manageable chunks, they do it leaf by leaf
2020-02-11 23.48 < ningu> so the wiki page ends up looking like https://palmleaf.org/wiki/carcan-kucing
2020-02-11 23.48 < ningu> with one section per page
2020-02-11 23.48 < Nemo_bis> ningu: yes, that's the purpose of Wikisource as well (ProofreadPage extension)
2020-02-11 23.48 < ningu> ah right, I looked at that a long time ago
2020-02-11 23.48 < Nemo_bis> What I don't understand is what generates the transliteration
2020-02-11 23.48 < Nemo_bis> ProofreadPage is mostly for multi-page documents, most of its functionality may be less relevant for you
2020-02-11 23.49 < ningu> I can't remember right now why I didn't use ProofreadPage. it seemed like more trouble than it was worth but I can't remember why
2020-02-11 23.49 < Nemo_bis> I suppose archive.org doesn't embed any OCR yet
2020-02-11 23.49 < ningu> the issue with OCR is this is Balinese script and there is no OCR for it yet
2020-02-11 23.49 < ningu> to develop OCR, you need a manually transcribed corpus ... which is what they're making on palmleaf.org by the way :)
2020-02-11 23.49 < Nemo_bis> This would actually be a nice case for a Wikisource-specific OCR because it's relatively easy to add new OCR to tesseract while I doubt ABBYY is especially interesting
2020-02-11 23.50 < ningu> it may be a challenge, it's hand-written and all, but it would be fun for someone to give it a go
2020-02-11 23.50 < ningu> yeah it's totally possible, I dunno how well it would work
2020-02-11 23.50 < ningu> the transliteration uses this little thing I wrote: https://github.com/longnow/icu-transliterator-service
2020-02-11 23.51 < ningu> we developed our own ICU rule-based transliterator using the language for writing the rules, and we run it on the palmleaf.org server. it's also exposed via an API method so front-end code can use it
2020-02-11 23.51 < Nemo_bis> The rules are already in ICU?
2020-02-11 23.51 < ningu> the code to interpret the rules is in ICU
2020-02-11 23.52 < ningu> we could try to get it into ICU, I guess, although it's sort of a work in progress and it doesn't conform to any standard (to the extent there is one, which is not much)
2020-02-11 23.52 < Nemo_bis> Alright. We don't really use either, we have our own. But it's not that hard if you already have field-tested rules.
2020-02-11 23.52 < Nemo_bis> LongNow, so I suppose you know SJ?
2020-02-11 23.52 < ningu> SJ?
2020-02-11 23.52 < ningu> maybe
2020-02-11 23.53 < Nemo_bis> Our lovely documentation for languageconverter is at https://meta.wikimedia.org/wiki/Wikipedias_in_multiple_writing_systems
2020-02-11 23.54 < ningu> the transliteration rules try to accomplish two somewhat conflicting things at once: (1) as much as possible unambiguously distinguish all balinese characters in latin using diacritics etc., (2) make the latin readable by balinese people even if they ignore the diacritics entirely, according to the latin orthography they are used to
2020-02-11 23.54 < ningu> who/what is SJ?
2020-02-11 23.55 < Nemo_bis> https://blogs.harvard.edu/sj/
2020-02-11 23.55 < ningu> because of the above conflicting goals, the transliteration rules don't perfectly conform to either balinese latin orthography (which leaves out a ton of distinctions) or scholarly transliteration (which tends to be derived from other indic scripts and is hard for untrained balinese people to interpret)
2020-02-11 23.55 < Nemo_bis> In the interest of fairness, Wikisource is not the only and possibly not the best transcription system out there although it's one of the few in free software. The Smithsonian promised me to release theirs as well https://en.wikisource.org/w/index.php?title=Wikisource:Scriptorium&diff=prev&oldid=9543219
2020-02-11 23.56 < Nemo_bis> ningu: ok, that's the perfect setup for a neverending fight on whether to merge the languageconverter or not
2020-02-11 23.56 < ningu> haha
2020-02-11 23.56 < ningu> well another option is two alternative systems
2020-02-11 23.56 < ningu> I guess
2020-02-11 23.56 < Nemo_bis> Sure, it's easy to support that
2020-02-11 23.57 < Nemo_bis> But the writing system really needs to be standardised already, we don't make up languages or scripts or orthographies at Wikimedia.
2020-02-11 23.57 < Nemo_bis> (Or we try not to. Sometimes it's hard to keep the bar firm.)
2020-02-11 23.57 < ningu> so, that's a good point but I'd argue this case is a little different from what you're describing
2020-02-11 23.58 < ningu> the transliteration is not meant to be the primary way anyone writes the languages in question
2020-02-11 23.58 < ningu> the writing system is the balinese script, this is just a way to read it if you don't know it
2020-02-11 23.58 < ningu> the balinese script side is standardized, at least in terms of common practice, plus there are works written on how to spell things etc
2020-02-11 23.59 < Nemo_bis> Which is why we have https://translatewiki.net/wiki/Portal:Ban already
2020-02-11 23.59 < ningu> the problem with using balinese latin orthography is twofold I guess: (1) it is no longer nearly as good of a guide for people using the transliteration as a crib to learn the original, (2) the original works are not all in balinese in the first place but rather several languages, so using balinese spelling is inappropriate
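
Note on the rule-based transliteration discussed above: the general idea of script-to-Latin rules can be sketched in plain Python as a greedy longest-match substitution. This is a stand-in technique, not ICU and not ICU's rule syntax, and the three rules below are toy examples for illustration only; they are not the project's rule set and make no claim about its diacritic conventions.

```python
# Toy rules: NOT the palmleaf.org rule set and NOT ICU rule syntax.
TOY_RULES = {
    "\u1b13\u1b38": "ku",  # letter KA + vowel sign SUKU
    "\u1b13\u1b44": "k",   # letter KA + ADEG ADEG (suppresses the inherent vowel)
    "\u1b13": "ka",        # bare letter KA with its inherent vowel
}


def transliterate(text: str, rules: dict[str, str]) -> str:
    """Apply rules left to right, always preferring the longest matching source sequence."""
    keys = sorted(rules, key=len, reverse=True)  # longer sequences are tried first
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(rules[key])
                i += len(key)
                break
        else:
            # No rule matched at this position: copy the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

Trying longer sequences first is what keeps a consonant followed by a vowel sign from being read as the bare consonant plus a stray mark; characters without a rule pass through untouched, so mixed Latin and Balinese text survives.
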