
QINU appears instead of math in section names in search results
Open, Medium, Public

Description

In the past, some articles had the string UNIQ QINU because of various edge cases of Parsoid and Content Translation. I checked whether any of that is left in the Hebrew Wikipedia by searching for "QINU". I couldn't find anything in wiki syntax or rendered text, which is great, but it does appear in search results instead of math formulas. Here's a screenshot:

The first result is the article "Function", which has a section with the following wikitext:

==קבוצת הפונקציות <math>Y^X</math>==

The wikitext makes sense and the rendering is correct, but in the search results I see this instead of anything that looks like "Y^X":

'"`UNIQ--postMath-00000072-QINU`"')

It's not the most important bug in Wikipedia, but it definitely shouldn't happen :)

This may be related to T138453 and T127738, but it's about searching and not rendering.

I'm not sure whether it's related to Math, MediaWiki-Parser, or Discovery-Search, so tagging all.

Thanks!

Event Timeline

Looking at the HTML output for one example page we have:

<span style="display:none" class="sortkey">Durener Straße&#160;040 '"`UNIQ--nowiki-00000009-QINU`"' </span>

I'm not sure how advanced the CSS selectors are for our HTML stripping utility; perhaps we could add a rule that strips HTML elements whose style attribute is exactly style="display:none". Having read a little of the parent task, though, I'm not sure whether this is an appropriate fix: it seems there might be other underlying reasons these pages have the marker, and those should be fixed instead?
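To make the idea concrete, here is a rough sketch of what "strip elements whose style is exactly display:none" could mean, written with Python's standard `html.parser` rather than the actual search indexing code. The class and function names (`VisibleText`, `visible_text`) are made up for illustration, and the exact-match rule is deliberately naive: it would miss variants such as `display: none;`.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class VisibleText(HTMLParser):
    """Collects text, skipping elements whose style is exactly 'display:none'."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.hidden_depth = 0  # > 0 while inside a hidden element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth or dict(attrs).get("style") == "display:none":
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.parts.append(data)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


sample = ('Row 35 <span style="display:none" class="sortkey">'
          'Durener Straße&#160;040 \'"`UNIQ--nowiki-00000009-QINU`"\' </span>visible')
print(visible_text(sample))
```

With the sortkey span from the example page, the hidden text (and the marker inside it) is dropped, while the surrounding text is kept.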

UNIQ/QINU tends to show up when there is either a broken or invalid template call on a page. It also sometimes shows up if an extension tag has a bug in its PHP code where it processes some of the wikitext only partially.

In this case, the bug seems to be in the page itself, and the corruption can be seen on the page itself as well, not just through search, so it seems fair to include. Text is sometimes intentionally hidden with CSS based on user interaction or other context. Hiding all "display:none" text would imho be a mistake.

There is supposed to be an image in row 35 generated through a template (Vorlage: is German for Template:), but it is broken.

Gehel raised the priority of this task from Low to Medium. Aug 28 2020, 12:27 PM
Gehel moved this task from elastic / cirrus to Bugs on the Discovery-Search board.

Seems to be a real issue with math tags, and not fixed by https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Math/+/655884 as I was hoping.

I will note that it only seems to affect section headings. Math tags in normal text don't show up in the search results at all, e.g.: https://he.wikipedia.org/w/index.php?search=QINU&title=מיוחד%3Aחיפוש&go=לערך&ns0=1&uselang=en

And in fact, the links to section titles with strip markers actually work, e.g.: https://he.wikipedia.org/wiki/פונקציה#פונקציה_%7F%27"%60UNIQ--postMath-00000088-QINU%60"%27%7F-מקומית
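For anyone who wants to check what is actually in that fragment, here is a small sketch that percent-decodes the marker part of the link and matches it against a strip-marker-shaped pattern. The regular expression is my own approximation of the marker format as it appears in this task (a DEL character, then `'"` and a backtick, `UNIQ--`, a tag name, eight hex digits, `-QINU`, and the mirrored closing characters); it is not copied from the parser code.

```python
import re
from urllib.parse import unquote

# Approximate shape of a strip marker as seen in this task (an assumption,
# not the parser's own definition): DEL '"`UNIQ--<tag>-<8 hex>-QINU`"' DEL
MARKER_RE = re.compile(r"\x7f'\"`UNIQ--(\w+)-([0-9a-fA-F]{8})-QINU`\"'\x7f")

# Marker portion of the fragment from the link above (Hebrew text omitted).
encoded = "%7F%27\"%60UNIQ--postMath-00000088-QINU%60\"%27%7F"

decoded = unquote(encoded)
match = MARKER_RE.fullmatch(decoded)

print(repr(decoded))                   # shows the \x7f delimiters explicitly
print(match.group(1), match.group(2))  # tag name and counter
```

The `%7F` at both ends decodes to the DEL control character, which is why the marker looks slightly different in the URL than in the search result snippet.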

This issue is mentioned in T94344#2093091, so I'll mark that as a dependency of this task.

matmarex renamed this task from QINU appears instead of math in search results to QINU appears instead of math in section names in search results. Feb 24 2021, 3:18 PM

This Hebrew page looks really fascinating to me. Unfortunately, I don't know any Hebrew. For reproducing and testing the bug locally, however, a shorter example would be ideal. Does it occur only in Hebrew? Is it related to the text direction? Does he.wikipedia have special extensions enabled that might interfere?

The same issue can be reproduced on other wikis, e.g. English Wikipedia: https://en.wikipedia.org/w/index.php?search=QINU

@matmarex I am not sure I understand the problem. Is this the general problem that there is no plain-text version for math? This is also a problem for PDF bookmarks, and for scientific publishers that only support plain text for title and abstract. I am not aware of a solution for that. Is this a minimal test case: https://www.mediawiki.org/wiki/Extension:Math/bug/38641#%7F'%22%60UNIQ--postMath-00000001-QINU%60%22'%7F (taken from T40641)?

It is a general problem, but somehow other extensions avoid it. I added some more examples to https://www.mediawiki.org/wiki/Extension:Math/bug/38641 – they don't all generate sensible anchors, but they all avoid exposing strip markers.

Looks like this is implemented in the TOC code in Parser.php: https://github.com/wikimedia/mediawiki/blob/da99ed653f0443fca55908000af7c96f69ac59a5/includes/parser/Parser.php#L4279. The code comment there even calls out <math>, but presumably this no longer works because Math is not using normal strip markers.

I don't really understand why the math parser hook code is the way it is. Assuming that it can't be changed, you could probably add a hook near that code in Parser.php, and implement it in Math to replace the fake strip markers with the source, similar to what is done for 'mw:editsection'.
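As a language-neutral illustration of that suggestion (not the actual hook, which would live in PHP), the replacement step could look roughly like the sketch below. Everything here is hypothetical: the function name `restore_math_source`, the idea that Math could supply a marker-to-source mapping, and the marker pattern, which is only an approximation of the format seen in this task.

```python
import re
from typing import Mapping

# Approximate strip-marker shape (an assumption based on the markers quoted
# in this task, not taken from Parser.php).
MARKER_RE = re.compile(r"\x7f'\"`UNIQ--(\w+)-([0-9a-fA-F]{8})-QINU`\"'\x7f")


def restore_math_source(heading: str, sources: Mapping[str, str]) -> str:
    """Replace each marker in a heading with its source text.

    `sources` is a hypothetical marker -> TeX source mapping. Markers that
    are not in the mapping are removed, so they are never exposed.
    """
    return MARKER_RE.sub(lambda m: sources.get(m.group(0), ""), heading)


marker = "\x7f'\"`UNIQ--postMath-00000072-QINU`\"'\x7f"
heading = "קבוצת הפונקציות " + marker
print(restore_math_source(heading, {marker: "Y^X"}))
```

For the heading from the task description this would yield the heading text followed by `Y^X`, which is at least searchable, even if it is not a proper plain-text rendering of the formula.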

(Given the mention of strip markers and issues of accessing the source text, I thought it would be good to mention T138229: math ml rendering changes and scribunto that also occurs because Math does not use normal strip markers.)

I personally really hate the solution that uses the strip-marker-inspired implementation of parallel rendering. However, I think this is better discussed in T268785.

@matmarex thank you for the examples. I think the approach of SyntaxHighlight seems promising. However, unlike images, references, etc., formulae in mathematical texts can be regarded as words from a linguistic point of view.