Legacy LanguageConverter uses top-level ::guessVariant on srwiki and banwiki
Open, Needs Triage, Public

Event Timeline

We should drop guessVariant and instead decide on a way to set a different source language for the wikitext.

100% agreed. This bug is apparently more subtle, though: when I reviewed the code, the new LC implementation *does* call ::guessVariant(), yet some text on Cyrillic/Latin wikis is (correctly) translated by the new implementation but *not* by the old language converter implementation. Compare the infobox on:

old: https://sr.wikipedia.org/w/index.php?title=Bitka_kod_Pantine&useparsoid=0&variant=sr-el

image.png (468×902 px, 522 KB)

and

new: https://sr.wikipedia.org/wiki/Bitka_kod_Pantine?useparsoid=1&parsoidnewlc=1&variant=sr-el

image.png (468×818 px, 501 KB)

Change #1269716 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] LanguageConverter: Allow disabling top-level variant "guess"

https://gerrit.wikimedia.org/r/1269716

OK, on investigation, Parsoid does invoke guessVariant(), but the legacy LanguageConverter invokes it *twice*: once on the overall text of any string to be converted (including the embedded HTML tags and attributes), and then again on the text substrings between tags. That seems to be a bug: if the topmost "guess" returns false, then nothing on the page will be converted at all. The intended behavior seems to be that the individual strings, paragraphs, etc. are the proper subjects of "guessing".
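To make the double-guess problem concrete, here is a toy Python model of it. This is only a sketch, not the MediaWiki code: `guess_should_convert`, `convert_segment` and the "mostly Cyrillic letters" heuristic are hypothetical stand-ins, chosen so that a negative guess suppresses conversion as described above.

```python
import re

TAG = re.compile(r"(<[^>]*>)")  # naive HTML tag matcher, good enough for the demo


def guess_should_convert(text: str) -> bool:
    """Hypothetical stand-in for ::guessVariant(): True if most letters are Cyrillic."""
    letters = [c for c in text if c.isalpha()]
    cyrillic = [c for c in letters if "\u0400" <= c <= "\u04ff"]
    return bool(letters) and len(cyrillic) * 2 > len(letters)


def convert_segment(text: str) -> str:
    """Placeholder converter: just marks the text that would be converted."""
    return f"[{text}]"


def convert(html: str, top_level_guess: bool = True) -> str:
    # Top-level guess: sees tags and attributes too, so Latin markup can
    # outweigh a short Cyrillic string and suppress conversion entirely.
    if top_level_guess and not guess_should_convert(html):
        return html
    out = []
    for part in TAG.split(html):
        # Lower-level guess: applied only to the text between tags.
        if TAG.fullmatch(part) or not guess_should_convert(part):
            out.append(part)
        else:
            out.append(convert_segment(part))
    return "".join(out)


html = '<div class="infobox" title="aaaaaaaaaaaa">Битка</div>'
print(convert(html, top_level_guess=True))   # markup dominates, nothing is converted
print(convert(html, top_level_guess=False))  # only the text node is converted
```

With the top-level guess skipped, the same per-segment guess still runs, which is the behavior the lower-level check appears to be designed for.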

I've added a patch that experimentally allows disabling the top-level "guess" via ?nolcguess=1 on the URL, while keeping the lower-level guesses. This lets us perform an apples-to-apples comparison with Parsoid's implementation via visualdiff, unblocking that work. It will also be easier to take this behavior to the community if we can show the difference between the two renderings on specific pages.
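For anyone who wants to compare renderings by hand, a small Python sketch that builds the URL pair might look like this. The query parameters (variant, useparsoid, parsoidnewlc, nolcguess) are the ones used in this task; the helper name `render_url` and the idea of always pairing parsoidnewlc with useparsoid=1 are my assumptions.

```python
from urllib.parse import urlencode

BASE = "https://sr.wikipedia.org/w/index.php"


def render_url(title: str, variant: str, parsoid: bool,
               no_top_level_guess: bool = False) -> str:
    """Build a page URL for the legacy or the Parsoid rendering (hypothetical helper)."""
    params = {
        "title": title,
        "variant": variant,
        "useparsoid": "1" if parsoid else "0",
    }
    if parsoid:
        params["parsoidnewlc"] = "1"  # assumption: only meaningful with Parsoid
    if no_top_level_guess:
        params["nolcguess"] = "1"     # disables the top-level guess only
    return f"{BASE}?{urlencode(params)}"


legacy = render_url("Bitka_kod_Pantine", "sr-el", parsoid=False, no_top_level_guess=True)
new = render_url("Bitka_kod_Pantine", "sr-el", parsoid=True)
print(legacy)
print(new)
```

Non-ASCII titles are percent-encoded by `urlencode`, which the wiki should accept in the same way as the raw form.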

cscott renamed this task from Parsoid LanguageConverter implementation doesn't support ::guessVariant on srwiki to Legacy LanguageConverter uses top-level ::guessVariant on srwiki. Apr 11 2026, 5:14 AM

Change #1269716 merged by jenkins-bot:

[mediawiki/core@master] LanguageConverter: Allow disabling top-level variant "guess"

https://gerrit.wikimedia.org/r/1269716

Change #1271038 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@wmf/1.46.0-wmf.24] LanguageConverter: Allow disabling top-level variant "guess"

https://gerrit.wikimedia.org/r/1271038

Change #1271038 merged by jenkins-bot:

[mediawiki/core@wmf/1.46.0-wmf.24] LanguageConverter: Allow disabling top-level variant "guess"

https://gerrit.wikimedia.org/r/1271038

Mentioned in SAL (#wikimedia-operations) [2026-04-14T20:30:12Z] <cscott@deploy1003> Started scap sync-world: Backport for [[gerrit:1271030|ParsoidLanguageConverter: convert inside <indicator> (T422961)]], [[gerrit:1271038|LanguageConverter: Allow disabling top-level variant "guess" (T419328)]]

Mentioned in SAL (#wikimedia-operations) [2026-04-14T20:32:00Z] <cscott@deploy1003> cscott: Backport for [[gerrit:1271030|ParsoidLanguageConverter: convert inside <indicator> (T422961)]], [[gerrit:1271038|LanguageConverter: Allow disabling top-level variant "guess" (T419328)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2026-04-14T20:40:31Z] <cscott@deploy1003> Finished scap sync-world: Backport for [[gerrit:1271030|ParsoidLanguageConverter: convert inside <indicator> (T422961)]], [[gerrit:1271038|LanguageConverter: Allow disabling top-level variant "guess" (T419328)]] (duration: 10m 18s)

cscott renamed this task from Legacy LanguageConverter uses top-level ::guessVariant on srwiki to Legacy LanguageConverter uses top-level ::guessVariant on srwiki abd banwiki. Apr 17 2026, 8:38 PM
cscott renamed this task from Legacy LanguageConverter uses top-level ::guessVariant on srwiki abd banwiki to Legacy LanguageConverter uses top-level ::guessVariant on srwiki and banwiki.

For Serbian Wikipedia, there should be no guessing at all. If the user selects Cyrillic, everything outside lang tags and -{...}- syntax should be transliterated to Cyrillic. If the article is written in sr-Cyrl, transliterating Cyrillic to Cyrillic is a no-op anyway.

@cscott, the ?nolcguess=1 workaround does not work on Serbian Wikipedia. Turn on Parsoid and check out https://sr.wikipedia.org/wiki/Џефри_Сакс?nolcguess=1. It is still messed up.

How about trying these (I think I found another problem: the interface messages are not loaded according to the language variant when nolcguess is active):

https://sr.wikipedia.org/w/index.php?title=Џефри_Сакс&variant=sr-ec&nolcguess=1&useparsoid=1

https://sr.wikipedia.org/w/index.php?title=Џефри_Сакс&variant=sr-el&nolcguess=1&useparsoid=1

  • sr-ec is broken in the same way.
  • sr-el does not have that problem, as the troublesome parts of the article are already in Latin script.

Basically, I would say that detection of content is totally irrelevant and unnecessary.

  • If the user selects sr-ec, display the article as-is. Any Latin text is intentionally Latin, so it can stay that way.
  • If the user selects sr-el, convert every Serbian Cyrillic letter to its Latin equivalent, skipping text inside -{...}- and inside <lang> tags.

Simple as that.
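The two rules above can be sketched in a few lines of Python. This is a rough illustration only, not the converter's code: the function names and the regex for protected regions are assumptions, nested -{...}- blocks are not handled, the markers are left in the output, and all-caps digraphs (e.g. "ЏЕФРИ") are not special-cased.

```python
import re

# Serbian Cyrillic -> Latin, lowercase letters.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}
# Uppercase letters map to title-cased Latin, e.g. "Џ" -> "Dž".
CYR_TO_LAT.update({k.upper(): v.capitalize() for k, v in CYR_TO_LAT.items()})

# Regions that must never be transliterated: -{...}- and <lang>...</lang>.
PROTECTED = re.compile(r"(-\{.*?\}-|<lang\b[^>]*>.*?</lang>)", re.DOTALL)


def to_latin(text: str) -> str:
    """Letter-by-letter transliteration; non-Cyrillic characters pass through."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)


def render(wikitext: str, variant: str) -> str:
    if variant == "sr-ec":
        return wikitext  # rule 1: display as-is, no content detection
    parts = PROTECTED.split(wikitext)
    # re.split with one capturing group puts protected matches at odd indices.
    return "".join(p if i % 2 else to_latin(p) for i, p in enumerate(parts))


print(render("Џефри Сакс -{Џефри}- <lang>Њу</lang> љубав", "sr-el"))
```

Note that there is no guessing step anywhere: the selected variant alone decides whether anything is converted.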

I made some changes on Serbian Wikipedia, and I came to the conclusion that we can proceed with no detection of content.

I updated the Lang, URL, and Citation/CS1 modules to implement transliteration prevention.

Lang and URL are straightforward, but CS1 may be doing too much. It may need to be adjusted so that a citation marked as language=sr* is still left for transliteration. Basically, I am making an educated guess about what to transliterate and what not, but it is good enough for now.