Background
- The Arabic language has a complex system of diacritics (Example images on Commons) that changes both the appearance and meaning of words.
- Strictly speaking, every Arabic word should contain diacritics.
- However, because of technical limitations (specifically, the lack of proper diacritic support in widely used fonts and the poor handling of Arabic diacritics by search engines), most Arabic websites don't use diacritics widely.
- On Arabic Wikipedia (arwiki), most articles don't contain diacritics, but many do. This is particularly true for featured and good articles on arwiki. (As those articles undergo thorough review, diacritics are usually added to them to ensure they are correct from a linguistic point of view.)
fixes.py
- Former arwiki admin Alnokta (currently inactive) is the original author of a (non-regex based) replacement dictionary that was added to the Pywikibot file fixes.py ("correct-ar", line 446; see for example r4726).
- The idea of the code is that we (arwiki bot operators) use Pywikibot to make automatic typographic corrections from a predefined list of typographic errors. (Meaning that the bot replaces only words that we know are wrong.) This ensures that the code will be 100% accurate when running in bot mode.
Problem
- In r5942 the code was changed to use regex.
- The problem here is that the current code uses \b (word boundary) and treats every Arabic diacritic as a word boundary. (Meaning that if an Arabic word contains diacritics, the code will treat it as two words and will apply the regex-based corrections on each word separately.)
- This introduces a huge number of errors when running the code. (A test run on 90 articles resulted in 4 mistakes, an error rate of 4/90 = 4.4%, which is of course not acceptable for bot operation.)
- The current (regex-based) code is useless: it can't be run in bot mode as it introduces many errors, and of course we (in arwiki) don't have enough manpower to run the code in human-assisted mode on the ~630k articles that we have.
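The word-boundary behaviour described above can be reproduced with plain Python, outside Pywikibot. This is only a minimal sketch: the sample word and the helper name words_seen_by_regex are my own illustration and are not taken from fixes.py. It relies on the fact that Python's \w does not match combining marks such as FATHA (U+064E), so \b matches between a letter and its diacritic.

```python
import re

FATHA = "\u064e"  # ARABIC FATHA, a combining mark that \w does not match
KAF, TEH, BEH = "\u0643", "\u062a", "\u0628"

plain = KAF + TEH + BEH                             # the word without diacritics
vowelled = KAF + FATHA + TEH + FATHA + BEH + FATHA  # the same word with diacritics


def words_seen_by_regex(text):
    """Return the chunks that a \\b...\\b pattern treats as separate words."""
    return re.findall(r"\b\w+\b", text)


print(words_seen_by_regex(plain))     # one chunk: the whole word
print(words_seen_by_regex(vowelled))  # three chunks: one per letter
```

Any replacement pattern wrapped in \b can therefore match a fragment of a word that carries diacritics, which is how the false corrections arise.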
Solution
- There are two possible solutions here. Either:
- Modify the Pywikibot regex handling to take Arabic diacritics into account (which I doubt a non-native Arabic speaker will be able to implement correctly, as implementing such a feature requires understanding the semantics of the language), OR
- Make a partial revert of r5942 (which is the solution that I prefer here).
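For reference, the first option could look roughly like the sketch below. It is not part of the patch and not Pywikibot code: the helper whole_word is hypothetical, and the range U+064B to U+0652 is my assumption for the common diacritics (it does not cover every Arabic mark). It replaces \b with lookarounds that count diacritics as part of the word.

```python
import re

# Assumption: U+064B..U+0652 covers the common Arabic diacritics (tashkeel).
DIACRITICS = "\u064b-\u0652"
NOT_IN_WORD_BEFORE = rf"(?<![\w{DIACRITICS}])"
NOT_IN_WORD_AFTER = rf"(?![\w{DIACRITICS}])"


def whole_word(word):
    """Compile a pattern matching `word` only when it stands alone (hypothetical helper)."""
    return re.compile(NOT_IN_WORD_BEFORE + re.escape(word) + NOT_IN_WORD_AFTER)


KAF, FATHA = "\u0643", "\u064e"
vowelled = KAF + FATHA + "\u062a" + FATHA + "\u0628" + FATHA

# With \b, the single letter is found inside the longer vowelled word.
print(re.search(r"\b" + KAF + r"\b", vowelled) is not None)
# With diacritic-aware boundaries, it is only found when it stands alone.
print(whole_word(KAF).search(vowelled) is None)
print(whole_word(KAF).search("x " + KAF + " y") is not None)
```

Even this only stops matches inside words that carry diacritics. It does not make an entry written without diacritics match the same word written with them, which is part of why I prefer the partial revert.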
Request
My request here is to apply this clean patch (against Pywikibot core).
The patch does the following:
- Partially reverts r5942.
- Adds "gallery" to "exceptions" (because of file names).
- Adds some Arabic (ar) translations.
(By the way, if you are wondering why it took me so long to report this issue, it is because I was myself inactive as a bot operator on arwiki. I recently decided to resume operating the bot, so I need this issue fixed.)
Thank you