QINU issue occurs when Lua module modifies text that contains a Cite reference
Open, Needs Triage, Public

Description

Short description

When a piece of wikitext that contains a <ref> tag is passed through a Lua module that does a simple find-and-replace on the character "2", the output string contains a malformed sequence matching the pattern UNIQ...QINU. This is a known bug referred to as the QINU issue; it is typically caused by using the wrong instance of the Parser object and running into Unicode issues.

Detailed story

In T237467 the behavior of {{formatnum:...}} was changed so that if what is passed to it is not a number, it still works but also puts the page in a tracking category called Category:Pages with non-numeric formatnum arguments. Once this went live on fawiki, we noticed that hundreds of thousands of pages were in this category. In reviewing those pages we found that most of them could be traced to specific templates that used formatnum to take a string that contains digits (but is not necessarily a number) and localize it. Examples of such input include dates or time values copied from English Wikipedia; users commonly copy-paste the infobox from an English Wikipedia article, and it may contain a date (e.g. 1/1/2020) or a time (e.g. the duration of a video clip may be recorded as 12:44). Because Persian does not use the same characters for digits (we use ۱ and ۲ and ۳ and so on, instead of 1 and 2 and 3 and so on), and because copy-pasting from English Wikipedia is common, we modify our infobox templates to pass certain parameters through formatnum to localize their digits. But that is against the spirit of formatnum, which is meant to be used with actual numbers, not non-number strings that contain digits.

What we really need in these cases is just a character swap, i.e. find the characters 1, 2, 3 ... and replace them with the corresponding Persian digits. Therefore, we have been replacing those uses of formatnum with a template (which we named {{formatnumber|...}}) that itself uses a Lua module to do a string replace on digit characters. This worked in most cases, but on some pages it caused the QINU issue to appear.
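To make the intended character swap concrete, here is a minimal sketch in Python. It is not the code of fa:Module:Numeral converter; the names `PERSIAN_DIGITS` and `localize_digits` are hypothetical and only illustrate a per-character digit mapping.

```python
# Hypothetical illustration of a plain digit swap (not the actual Lua module).
# Persian digits occupy U+06F0..U+06F9, in the same order as ASCII 0..9.
PERSIAN_DIGITS = {ord(str(d)): chr(0x06F0 + d) for d in range(10)}


def localize_digits(text: str) -> str:
    """Replace every ASCII digit with its Persian counterpart; leave the rest alone."""
    return text.translate(PERSIAN_DIGITS)


date_example = localize_digits("1/1/2020")
time_example = localize_digits("12:44")
```

Unlike formatnum, this kind of swap makes no assumption that the input is a number, so separators such as "/" and ":" pass through unchanged.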

The template is at fa:Template:Formatnumber, the module is at fa:Module:Numeral converter, and an example of the issue is on fa:User:Huji/sandbox. Because these are all in Persian, I have recreated a much simpler version on test Wikipedia.

Here is the simplified version of the module: https://test.wikipedia.org/wiki/Module:Localize

Here is a page on which this module is used: https://test.wikipedia.org/wiki/Module_talk:Localize and a screenshot is below.

Capture.JPG (237×577 px, 22 KB)

The module uses a simple pattern to replace the digit 2 with its Persian counterpart ۲, and as you can see on the module talk page, that results in a UNIQ-...-QINU output.

Interestingly, this only happens with the digit 2; replacing other digits (0, 1, or 3 through 9) doesn't cause this problem. Also note that the input string does not even contain a digit. My understanding is that the output of Cite is somehow interpreted by Lua in a way that causes a Unicode problem.

I still have not been able to pinpoint the root cause, so I am going to tag both involved components.

The known cause of the QINU issue is using an instance of the Parser class that is not correct for the context. Typically, this happens when code uses $wgOut to get the parser object. However, to the best of my knowledge, neither Cite nor Scribunto uses $wgOut.

Event Timeline

Huji added a subscriber: Amire80.

@Amire80 this may be an issue for which you can provide a fix (or some deeper insight).

thiemowmde subscribed.

These UNIQ--ref-00000002-QINU sequences are temporary placeholders for multipass parsing. They start with what's called MARKER_PREFIX in the code, followed by the tag name "ref" and an 8-digit sequence number. The sequence number in the example just happens to be 2. That's why it's unaffected when the other digits are replaced.
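The structure described above can be sketched in Python as follows. This is a simplified model: the `UNIQ--`/`-QINU` delimiters are taken from the example, the real MARKER_PREFIX may contain additional characters, and treating the sequence number as zero-padded hexadecimal is an assumption. `make_marker` and `MARKER_RE` are hypothetical names.

```python
import re

# Simplified stand-in for the marker format (assumption: 8 zero-padded hex digits).
MARKER_RE = re.compile(r"UNIQ--(?P<tag>[a-z]+)-(?P<seq>[0-9A-F]{8})-QINU")


def make_marker(tag: str, index: int) -> str:
    """Build a placeholder such as UNIQ--ref-00000002-QINU."""
    return f"UNIQ--{tag}-{index:08X}-QINU"


marker = make_marker("ref", 2)

# Replacing a digit that does not occur in the marker leaves it untouched.
# (0 is left out here because this sketch zero-pads the sequence number.)
untouched = all(marker.replace(d, "x") == marker for d in "13456789")

# Replacing "2" rewrites the sequence number, so the text is no longer a marker.
damaged = marker.replace("2", "\u06f2")
still_valid = MARKER_RE.fullmatch(damaged) is not None
```

In this model `untouched` is true and `still_valid` is false: only a replacement that hits a character inside the marker turns it into text the parser no longer recognizes.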

What apparently happens is that the inner wikitext abc<ref>Citation</ref> is partly parsed by the Cite extension, but temporarily replaced with one of these placeholders. The true content is stored in a StripState object that's part of the current Parser. This is why it's critical to keep using the correct Parser.
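A toy model of this placeholder store may help. The class below is not MediaWiki's StripState API; `ToyStripState`, `add`, and `unstrip` are hypothetical names, and the marker format is the simplified one from the example above.

```python
import re


class ToyStripState:
    """Toy per-parser store that maps placeholders back to the real content."""

    def __init__(self, start: int = 0):
        self._items = {}
        self._index = start

    def add(self, tag: str, html: str) -> str:
        """Store html and return the placeholder that stands in for it."""
        marker = f"UNIQ--{tag}-{self._index:08X}-QINU"
        self._index += 1
        self._items[marker] = html
        return marker

    def unstrip(self, text: str) -> str:
        """Swap known placeholders for their content; unknown ones stay as-is."""
        return re.sub(
            r"UNIQ--[a-z]+-[0-9A-F]{8}-QINU",
            lambda m: self._items.get(m.group(0), m.group(0)),
            text,
        )


state = ToyStripState(start=2)
half_parsed = "abc" + state.add("ref", "<sup>[1]</sup>")

intact = state.unstrip(half_parsed)                     # placeholder resolved
wrong_parser = ToyStripState().unstrip(half_parsed)     # different store: no match
damaged = state.unstrip(half_parsed.replace("2", "\u06f2"))  # marker altered
```

Both failure modes leave the placeholder in the output: a different store does not know the marker, and an altered marker no longer matches anything in the right store.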

For some reason the Lua module gets to see half-parsed wikitext that still contains these placeholders. While I agree this is weird, it appears to be an intentional quirk of the old parser. It's only a problem because the Lua module assumes that what it gets is plain text. But it's not. It's still wikitext, with wikitext features that can easily be destroyed when the module touches the wrong characters.
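One way a module could avoid touching the placeholders is to split the input on them and only transform the text in between. The Python sketch below illustrates the idea under the simplified marker format assumed from the example; `replace_outside_markers` is a hypothetical helper, not an existing Scribunto function.

```python
import re

# The capturing group makes re.split keep the markers in the result list.
MARKER_SPLIT_RE = re.compile(r"(UNIQ--[a-z]+-[0-9A-F]{8}-QINU)")


def replace_outside_markers(text: str, old: str, new: str) -> str:
    """Replace old with new everywhere except inside placeholder markers."""
    parts = MARKER_SPLIT_RE.split(text)
    # With one capturing group, odd indices hold markers and even indices hold text.
    return "".join(
        part if i % 2 else part.replace(old, new)
        for i, part in enumerate(parts)
    )


safe = replace_outside_markers("2 abcUNIQ--ref-00000002-QINU 2020", "2", "\u06f2")
```

This only protects the placeholders; other wikitext features in the half-parsed input can still be damaged by a careless replacement.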

Here is another example that destroys certain wikitext features: {{#invoke:string|replace|Example: ''x''|'|?}}

As far as I understand the situation, what the Lua module gets is equivalent to the half-parsed wikitext it gets when calling frame:preprocess.

I don't think there is anything we can do here. I mean, other than getting rid of the old parser in favor of Parsoid. 😇️