Category sortkeys are not handled properly if they contain '' or '''
Open, Low · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Style apostrophes are not treated as raw text if included in category sortkeys.

What happens?:

  • For example, [[Category:some category|''abc'']] will sort as <i>abc</i>.

What should have happened instead?:

  • They should be treated as raw text. This could feasibly come up on Wiktionary if '' legitimately appears within a term, since the parser would wrongly treat them as unclosed opening apostrophes.

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

Event Timeline

Theknightwho renamed this task from Category sortkeys are not handled properly in certain edge cases to Category sortkeys are not handled properly if they contain '' or '''. Dec 31 2023, 6:04 PM

I'm not sure I agree entirely with the bug summary. I suspect that the "right" wikitext for this case would use an ampersand-escape for the category sort key:

[[Category:&amp;&amp;foo&amp;&amp;]]

which is the standard way to flag to the parser that this is not intended to be wikitext markup, and avoids having /some/ [[ link captions parsed as wikitext and /some/ [[ link captions parsed as raw text. That would also ensure that the name of the category displays properly on the category list page.

But in order for that to work, the category code should probably html-decode the sort key before doing the actual sort. Not sure whether that is done or not; if not then that's a legit bug/feature request that could be addressed.
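As a rough illustration of the decoding step described above, here is a minimal Python sketch of what HTML-decoding a sort key before sorting could look like. It uses Python's standard `html.unescape`, not MediaWiki's own code; the helper name `decode_sortkey` and the sample keys (including the `&apos;` escapes) are assumptions made for illustration only.

```python
import html


def decode_sortkey(raw: str) -> str:
    """Hypothetical helper: decode HTML entities in a sort key and nothing else."""
    return html.unescape(raw)


# Made-up sample keys; the first one escapes its apostrophes as entities.
keys = ["&apos;&apos;foo&apos;&apos;", "bar", "&amp;baz"]

# Decode first, then sort, so the sort sees the literal characters.
decoded = sorted(decode_sortkey(k) for k in keys)
print(decoded)
```

With this ordering of steps, the escaped apostrophes reach the sort as literal `''` rather than as markup or as entity text.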

So I did think about this, but two things made me think it was a bug:

  • It's really unintuitive, since most people would not assume this processing is applied to sortkeys (since the average user doesn't think of category links as being the same as ordinary links). There's no comment in the code acknowledging this is intended behaviour, so I assume it's an oversight.
  • It doesn't happen when used with DEFAULTSORT, since parser functions are processed at an earlier stage. This is an inconsistency that's even less intuitive to ordinary users.

To me, the sortkey field for categories should be treated as nowiki text. As with nowiki, HTML entities would be decoded (i.e. &amp; --> &) , but that's it.

Or at least sort of like parameters to modules: templates expanded, but wikitext syntax not converted to HTML tags.

In English Wiktionary, we have a sortkey-generating module Module:Hrkt-sortkey that outputs sortkeys with varying numbers of ' that are ultimately put into sortkeys and are intended to be passed verbatim to categorylinks.cl_sortkey. I grabbed all the sortkeys containing < from the categorylinks dump for English Wiktionary, and we have 129022 sortkeys with italics and bold tags (regex <[iIbB]>), all or mostly from entries for Japanese and other Japonic languages. The module could output &apos; instead of a literal ' (and we could fix any cases of manually specified sortkeys with literal '' and '''), but on the other hand, I don't see why anyone would ever want '' and ''' to become <i> and <b> in a sortkey because a sortkey isn't displayed as HTML anywhere, so why not make the parser pass it through literally?
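For anyone wanting to repeat that count, a minimal Python sketch might look like the following. It assumes the sortkeys have already been extracted from the categorylinks dump into an iterable of strings (the extraction itself is not shown), and the sample list is made up. `escape_apostrophes` is a hypothetical stand-in for the module-side workaround mentioned above, not code from Module:Hrkt-sortkey.

```python
import re

# Regex quoted in the comment above: an opening italics or bold tag, any case.
TAG_RE = re.compile(r"<[iIbB]>")


def count_tagged(sortkeys):
    """Count sort keys that contain an opening <i> or <b> tag."""
    return sum(1 for key in sortkeys if TAG_RE.search(key))


def escape_apostrophes(sortkey: str) -> str:
    """Hypothetical workaround: emit &apos; instead of a literal apostrophe."""
    return sortkey.replace("'", "&apos;")


# Made-up sample data standing in for keys pulled from the dump.
sample = ["ABC", "<I>ABC</I>", "たすとほっくす</I>\nダストボックス", "X<b>Y"]
print(count_tagged(sample))
print(escape_apostrophes("a''b"))
```

Note that the regex only matches opening tags, so a key containing nothing but a closing `</I>` is not counted.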

It would not be satisfactory if the parser removed all HTML tags, as it does in |alt= in an image link. That would prevent anyone from putting literal HTML tags in a sortkey, which is a plausible scenario. We have a sortkey with what could be an HTML tag if the wikitext parser didn't escape it in the entry for <g>. If we had a reason to write an entry on <i>, we would probably want a sortkey with a literal <i>: [[Category:Translingual lemmas|<i>]]. Ideally, this would put <i> in categorylinks.cl_sortkey and the HTML sanitizer would not "correct" this and behave as if we had written [[Category:Translingual lemmas|<i></i>]] or [[Category:Translingual lemmas|]].

A sortkey with unintentional literal HTML tags is TORTULA <SPAN CLASS="GLOSS-BRAC">(</SPAN><SPAN CLASS="GLOSS-CONTENT"><SPAN CLASS="LATN" LANG="EN">POTTIACEAE</SPAN></SPAN><SPAN CLASS="GLOSS-BRAC">)</SPAN>\nBRODEK in the entry brodek. Someone put an HTML-generating template in another template's parameter, and that template put the HTML in a category link.

An oddity is that the italics can span two category links so the opening and closing tag aren't both in the sortkey. The sortkey たすとほっくす</I>\nダストボックス (newline escaped for convenience) with only a closing italics tag was generated from [[Category:Japanese terms with usage examples|たすとほっくす'']][[Category:Japanese terms with usage examples|たすとほっくす'']] (duplicate category links) in the entry for ダストボックス. Apparently, the parser interprets the '' in the first link as <i> and the '' in the second link as </i>, and the sortkey from the second link ends up in the database.
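The apparent behaviour can be mimicked with a toy model in Python. This is only a sketch of the observed effect, not MediaWiki's actual quote handling: it assumes a single open/closed italics state shared by all sort keys on the same line, handles only `''` (not `'''`), and does not reproduce the uppercasing seen in the stored key. The function name `convert_quotes` is hypothetical.

```python
def convert_quotes(sortkeys):
    """Toy model: turn each '' into <i> or </i>, sharing state across keys."""
    italic_open = False
    converted = []
    for key in sortkeys:
        parts = key.split("''")
        rebuilt = parts[0]
        for part in parts[1:]:
            # Each '' toggles italics: open if closed, close if open.
            rebuilt += "</i>" if italic_open else "<i>"
            italic_open = not italic_open
            rebuilt += part
        converted.append(rebuilt)
    return converted


# The two duplicate category links from the ダストボックス entry.
links = ["たすとほっくす''", "たすとほっくす''"]
result = convert_quotes(links)
print(result)
```

Under this model the first key ends with an opening tag and the second with a closing tag, which is consistent with the stored sortkey containing only a closing tag if the second link's key is the one that is kept.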

This only applies to category links; italics can't span across pagelinks: [[abc|ab''c]] [[def|d''ef]] generates the equivalent of [[abc|ab<i>c</i>]] [[def|<i>d</i>ef]]. Maybe this is because the italics tags actually end up in the potentially visible part of the HTML and a sanitizer or something cleans them up, taking the equivalent of [[abc|ab<i>c]] [[def|d</i>ef]] and adding the missing closing and opening italics tags. Image links like [[File:Example.svg|thumb|ab''c]][[File:Example.svg|thumb|d''ef]] have their captions' HTML similarly sanitized. Image alt text on the other hand has HTML tags removed. [[File:Example.svg|thumb|alt=ab''c]][[File:Example.svg|thumb|alt=d''ef]] generates the alt attributes alt="abc" and alt="def".

It looks like the behavior of italics and bold syntax in sortkeys in category links was borrowed from its behavior in the display text in wikilinks, minus the sanitizing of the HTML tags. It seems like an oversight because in English Wiktionary entries, we want literal double and triple apostrophes, and I can't think of why anyone would want HTML tags instead. But if they did, they could manually input <i></i> or <b></b> into the category link.

Just to reiterate Erutuon's point above: that's approximately 1.6% of all mainspace pages on the English Wiktionary, which is a lot! Obviously we can adjust the modules to account for this, but it's a situation that was difficult to spot, and it causes sorting problems that are difficult to debug since it's so unintuitive. It really should be treated as a bug, not a feature request.

Change #1031526 had a related patch set uploaded (by Theknightwho; author: Theknightwho):

[mediawiki/core@master] Don't convert quotes to HTML tags if used in category sortkeys

https://gerrit.wikimedia.org/r/1031526