
Implement Tatar language LanguageConverter
Open, Medium, Public

Description

tt converter classes, made from the kazakh classes by replacing kk with tt and adding some letters

this is code i made from the kazakh converter by replacing kk with tt, etc.

(i made this several months ago, but have not worked on it further since then.)

i will attach 6 files: 3 of them in the messages folder and 3 in the classes folder. a readme file is also attached.

(and i have added some letters that are not in the kazakh language.)


Version: unspecified
Severity: enhancement

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 11:19 PM
bzimport set Reference to bz25537.
bzimport added a subscriber: Unknown Object (MLST).

Please submit these as an SVN diff against trunk.

kaldari renamed this task from imperfect but useful converter code for tatar language to imperfect but useful LanguageConverter code for tatar language. Jan 14 2015, 11:38 PM
kaldari set Security to None.
gerritbot added a subscriber: gerritbot.

Change 185090 had a related patch set uploaded (by Kaldari):
Adding LanguageConverter files for Tatar Language

https://gerrit.wikimedia.org/r/185090

Patch-For-Review

hi. i have made a new converter and uploaded it to gerrit:
https://gerrit.wikimedia.org/r/#/c/164049/

3 texts i tested the new converter with

Change 185090 abandoned by Kaldari:
Adding LanguageConverter files for Tatar Language

Reason:
Replaced by change I18768eb1b13

https://gerrit.wikimedia.org/r/185090

Change 164049 had a related patch set uploaded (by Nikerabbit):
Add Tatar LanguageConverter

https://gerrit.wikimedia.org/r/164049

@Arrbee, @Amire80, can review of this feature please be put on the Language Engineering team's workboard?

Reedy renamed this task from imperfect but useful LanguageConverter code for tatar language to Implement Tatar language LanguageConverter. Nov 22 2019, 3:32 PM
Reedy removed a subscriber: wikibugs-l-list.

is there community consensus for this code? there were many discussions, so it must be wanted. links to the discussions are here: https://tt.wikipedia.org/wiki/Кулланучы:Qdinar#википедиядагы_сөйләшүләр . a standalone version of this converter is referenced at https://tt.wikipedia.org/wiki/Татар_Википедиясе#TATLAT .

direct links to the standalone version, cyr->lat and lat->cyr, applied to tt.wikipedia.org:
http://https.tt.wikipedia.org.ttcysuttlart1999.aylandirow.tmf.org.ru/wiki/Баш_бит
http://https.tt.wikipedia.org.ttlart2012ttcysu.aylandirow.tmf.org.ru/wiki/Baş_bit

i personally do not "push" this project hard, because i generally dislike how these latin and cyrillic alphabets are designed. for example, the cyrillic/latin letter e is used for the "i/e" sound, while there is also a real "e" sound in words like "electron". this causes confusion with european languages and with turkish. i am the programmer here; the wikipedians decided to use an authoritative alphabet, just as all of wikipedia is built on authoritative sources, so i programmed it using governmental latin alphabet projects.

a comment from the code, which i am going to mostly delete from the code:
2017-02-18, author dinar qurbanov: by making this converter, i may look like i support this alphabet, but that is not so. *i think this alphabet has many disadvantages, and i do not want to make it popular.* i regard it as a historical museum showpiece. i think it should be ok to put it into tatar wikipedia, into the conversion system of mediawiki. as far as i know, the converted pages are blocked from search engine indexing. the exact version of the latin orthography (and alphabet) was not chosen by a vote of wikipedians, and wikipedians have not voted to edit the rules of the tatar latin orthography for use in wikipedia, so i decided to make this exactly as prescribed by resolution #882 (2000) of the cabinet of ministers of tatarstan. i use scans published by user Kitap ( https://tt.wikipedia.org/wiki/Татарстанда_татар_телен_дәүләт_теле_буларак_куллану_кануны#Татар_теленең_латин_язулы_орфографиясенең_гамәлдән_чыккан,_хәзерге_вакытта_рәсми_булмаган_кайбер_кагыйдәләре ), but i am not sure whether they are of resolution #882 or #618. the 2000 resolution #882 was canceled by russian law and by resolution #38 (2013) of the cabinet of ministers of the republic of tatarstan, and a new alphabet was adopted by the 2013 law of tatarstan 1-ЗРТ, but that new alphabet is (even) less usable: there are no rules, there is no character for palatalisation in russian words, and the alphabet table does not show all use cases of the cyrillic letters. i was going to mark this script as tt-latn-2000, but i found from a gerrit comment that this is not ok (the "2000" variant subtag is not registered with iana yet, but it must be; see https://en.wikipedia.org/wiki/IETF_language_tag ). so maybe i will mark it as tt-latn-x-2000, where 2000 is not a variant but a private-use subtag.
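to illustrate the difference between the two tags mentioned in the comment: in an IETF language tag, the single-letter subtag "x" starts the private-use part, so everything after it needs no registration. below is a minimal sketch in python, not code from the converter; the function name split_private_use is made up for this example.

```python
def split_private_use(tag):
    """Split a language tag into (regular subtags, private-use subtags).

    Hypothetical helper for illustration; it does not validate the tag.
    """
    subtags = tag.lower().split("-")
    if "x" in subtags:
        # The singleton "x" introduces the private-use sequence.
        i = subtags.index("x")
        return subtags[:i], subtags[i + 1:]
    return subtags, []


if __name__ == "__main__":
    # "2000" is an ordinary (variant) subtag here and would need registration.
    print(split_private_use("tt-latn-2000"))
    # "2000" sits after "x" here, so it is private-use.
    print(split_private_use("tt-latn-x-2000"))
```

with tt-latn-2000 the "2000" part lands among the regular subtags; with tt-latn-x-2000 it lands in the private-use list.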

renamed 2000 to 2013, because wikipedians would not like it being named 2000: the 2000 laws are canceled, and now there is the 2013 law. there are several letter differences, like ɵ -> ö, though ö was also somewhat permitted for computer usage. this converter uses ö. there is no letter for hamza and palatalisation in the 2013 law, and no rules/orthography are given. this converter uses the apostrophe for hamza and palatalisation, as in the 2000 law, and the rules/orthography as given in the 2000 law.

the converter has been ready for a long time. the code has not been accepted into mediawiki. @thiemowmde voted -1 and requested that the code be split into more files.

2019-11-25, Thiemo Kreuz:

... the maintenance costs for a monolith like this are unbearable ...
This code needs to be split into small services other human beings not keeping track of this for the past 5 (!) years are able to grasp, understand, and feel responsible for.
This possibly needs to be a separate extension.

2019-7-5, Thiemo Kreuz:

... Why was it not possible to split this up into smaller patches that have been merged years ago?
...
Was it really necessary to pack 3500 lines of code into a single file? If there is one mistake MediaWiki core code suffers from then it's this: unmaintainable classes with too much code in one place.
Please, please split this in multiple ways: multiple smaller classes, each introduced in a separate patch, each covered by a separate set of test cases. Ideally all this code is created first as part of a separate library in a separate Git repository. You can use GitHub or ask for one here on Gerrit. If this library is solid, well tested and reviewed, it will be much easier to add and use it in MediaWiki core.

2019-11-25, i answered:

i think it is possible to separate the tatar language grammar functions into a separate class, but by itself that class would be almost useless at this stage. useless because, for example, it works only with "thick"-sounding suffixes (with a, o, etc.), and not with "thin" ones (ä, ö), because only thick ones may be confused with russian words (words borrowed from russian), and thus the work is only needed for "thick" words.
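the thick/thin distinction described above could be sketched roughly like this. this is python for illustration only, not the converter's code: the vowel sets beyond a, o, ä, ö are assumptions, and the sample suffixes lar and lär are used just as examples.

```python
import unicodedata

# Assumed vowel classes in the latin script. Only a, o (thick) and
# ä, ö (thin) are named in the comment above; the rest are assumptions.
THICK_VOWELS = set("aouı")
THIN_VOWELS = set("äöüei")


def harmony(suffix):
    """Return "thick", "thin", or None, based on the last vowel found."""
    text = unicodedata.normalize("NFC", suffix.lower())
    for ch in reversed(text):
        if ch in THICK_VOWELS:
            return "thick"
        if ch in THIN_VOWELS:
            return "thin"
    return None  # no vowel at all


def needs_loanword_check(suffix):
    """Only thick suffixes are treated as possibly clashing with russian loans."""
    return harmony(suffix) == "thick"


if __name__ == "__main__":
    for s in ("lar", "lär"):
        print(s, harmony(s), needs_loanword_check(s))
```

a class built only around this check would cover the thick case and nothing else, which is why it would be of little use on its own.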

i once thought about putting the regex replace strings into arrays, but forgot about it. maybe i will do that soon.
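as a rough idea of what "regex replace strings in arrays" can look like, here is a minimal python sketch. it is not the converter's code (which is php in mediawiki), and the rules are hypothetical: just enough letters to convert the lowercase form of the main page title linked above.

```python
import re

# Hypothetical rule table: (compiled pattern, replacement) pairs, applied in
# order. The real converter has far more rules and context-dependent ones.
RULES = [
    (re.compile("ш"), "ş"),
    (re.compile("б"), "b"),
    (re.compile("а"), "a"),
    (re.compile("и"), "i"),
    (re.compile("т"), "t"),
]


def apply_rules(text, rules=RULES):
    """Apply every rule to the whole text, one rule after another."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(apply_rules("баш бит"))
```

the table can live in a separate file and be tested on its own, but every entry has to fit the same (pattern, replacement) shape, so a step that is not a plain replacement no longer fits between two rules.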

you can see more comments at https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/164049/ .

why have i been stalled since 2019-12-11, the date of my last comment there?

i feel reluctant/lazy to do these things, i.e. to work on this code, to split it into more files, etc., independently of the suspicion problem that i describe below.

i had problems with suspicion/paranoia that my operating system was hacked, and i reinstalled different oses several times, starting around october. now i am also afraid that simply installing the mediawiki developer version onto the ubuntu operating system i use now is not very trustworthy, so i would like to use a virtual machine, which also slows me down and makes me lazy. this does not mean that i especially distrust the mediawiki developer version, nor that i suspected it in the previous cases. i also do not {trust the ubuntu repository very much}, nor different packages in it like firefox and others. i also distrust the firefox extensions that i use.

at the reviewers' request, i formed 2 arrays from a big statement and a big expression; they can later be moved into a separate file. i personally doubt that forming these arrays is a good change, because before i could insert a different statement between the statements, and i could add a different logical expression (not joined with "or"), and now i cannot.

i think i have problems with getting reviews for the gerrit commit.

i said a long time ago that i think the code is ready (for example at patch sets 105 (jan 7, 2018) and 124 (apr 3, 2018)), but it has not been accepted.

from a comment i wrote on nov 25, 2019 in gerrit to patch set 212:

also, i did not know that a patch is expected to be, or should be, small. why has this happened? Siebrand said at patch set 7: "Qdinar: You write in your commit message "does not work correctly". So this will not be mergable until it works correctly..." . so i kept working on it until it reached an upper limit, where further improvement of the conversion quality goes slowly, because i need to search for exceptions manually by converting different texts.

so, from a reviewer's (@siebrand) comment, i concluded that it is ok to make it bigger and bigger. then another reviewer (@thiemowmde) said that it is not ok.

now, i read https://www.mediawiki.org/wiki/Gerrit/Code_review/Getting_reviews nearly a year ago, and there too it is not strictly said that commits should be small:

However, if your commits are going to be touching the same files repeatedly, bundle them up into one large commit (using either --amend or squashing after the fact).

there are also other problems. a reviewer (@TJones) said on patch set 149 (apr 27, 2018) that so many lookaheads and lookbehinds can be expensive. i have not fixed this.
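for readers unfamiliar with the term, here is a small python sketch of a lookbehind/lookahead rule and one possible rewrite with capturing groups. the rule itself is made up (replace "b" with "X" only between "a" and "c") and is not from the converter. whether the rewrite is actually cheaper would have to be measured, and the two forms are only interchangeable when the context of one match cannot also be the context of the next one.

```python
import re

# Hypothetical rule, written two ways.
LOOKAROUND = re.compile(r"(?<=a)b(?=c)")   # context is checked, not consumed
CAPTURING = re.compile(r"(a)b(c)")         # context is consumed and re-emitted


def with_lookaround(text):
    return LOOKAROUND.sub("X", text)


def with_groups(text):
    return CAPTURING.sub(r"\1X\2", text)


if __name__ == "__main__":
    sample = "abc abd abcabc"
    print(with_lookaround(sample))
    print(with_groups(sample))
```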

until about 1-2 years ago, i did not know that every piece of negative feedback and every comment should be addressed...

it seems this is needed only by tt wikipedia for now. i do not know of any other public tatar site that uses mediawiki. and i think this code is probably temporary, for 15 years or less, after which it will be replaced by a neural network solution, if god wills.

in https://tt.wikipedia.org/wiki/%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D1%8F:%D0%A2%D0%B0%D0%B2%D1%8B%D1%88_%D0%B1%D0%B8%D1%80%D2%AF:%D0%A2%D0%B0%D1%82%D0%B0%D1%80_%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D1%8F%D1%81%D0%B5_%D0%BA%D0%BE%D0%BD%D0%B2%D0%B5%D1%80%D1%82%D0%B5%D1%80%D1%8B , it was voted that editors of tatar wikipedia agree to use a cyrl-lat converter, as in other languages.

if this code is not going to be accepted, maybe i will try to make another gerrit commit with a smaller solution. even if its result is bad, it may be better than nothing.

but the gerrit code is not ready for now, because i do not see the converter menu in my local install.