
TitleKey should drop special characters, like diacritics
Open, Medium, Public

Description

+++ This bug is a superset of Bug #19882 +++

TitleKey normalizes titles only by comparing their uppercase values (Language::caseFold).
However, many redirects exist solely so that users who type an ASCII title arrive at the page whose title has diacritics.

The steps to take would be something like this:
*Create a normalizing function in Language.
*Make TitleKey::normalize use it before case folding.
*Add some basic scoring to TitleKey::exactMatch instead of just returning the first match, since the set of candidates is now much larger.
*Add a hook to also plug it into MediaWiki:Noarticletext.
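
To illustrate the first two steps, here is a minimal sketch in Python rather than the extension's PHP. The function names strip_diacritics and normalize_key are hypothetical and do not exist in TitleKey or Language; the approach (canonical decomposition, then dropping combining marks, then uppercasing) is one possible way to do it, not necessarily what any submitted patch does.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose to NFD and drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_key(title: str) -> str:
    """Hypothetical stand-in for TitleKey::normalize.

    Strips diacritics first, then uppercases, mirroring the
    "normalize before case folding" order proposed in the steps above.
    """
    return strip_diacritics(title).upper()


if __name__ == "__main__":
    for title in ("Kingmaker zarządzanie", "Café", "Łódź"):
        print(title, "->", normalize_key(title))
```

Note that this only removes marks that Unicode treats as combining characters. Letters such as "Ł" or "ø" have no canonical decomposition, so they pass through unchanged and would need an explicit mapping table if they should also match their ASCII look-alikes.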


Version: unspecified
Severity: normal

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 10:52 PM)
bzimport set Reference to bz20097.

Scoring could indeed be interesting... certainly we want a "more exact" match before a "less exact" one, especially for languages where diacritics and such frequently distinguish between minimal pairs.
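
A rough sketch of what such scoring could look like, again in Python for illustration only. The names match_score and best_match and the three score tiers are assumptions made here, not part of TitleKey::exactMatch.

```python
import unicodedata
from typing import Iterable, Optional


def _strip_marks(text: str) -> str:
    """Drop combining marks after NFD decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_score(query: str, candidate: str) -> int:
    """Score a candidate title: higher means "more exact".

    3 = identical, 2 = differs only in case,
    1 = differs only in case and diacritics, 0 = no match.
    """
    q = unicodedata.normalize("NFC", query)
    c = unicodedata.normalize("NFC", candidate)
    if q == c:
        return 3
    if q.upper() == c.upper():
        return 2
    if _strip_marks(q).upper() == _strip_marks(c).upper():
        return 1
    return 0


def best_match(query: str, candidates: Iterable[str]) -> Optional[str]:
    """Return the highest-scoring candidate, or None if nothing matches."""
    scored = [(match_score(query, c), c) for c in candidates]
    scored = [pair for pair in scored if pair[0] > 0]
    if not scored:
        return None
    # max() keeps the first candidate among equal scores.
    return max(scored, key=lambda pair: pair[0])[1]
```

With this ordering, a query of "resume" would prefer a page titled "Resume" over "Résumé", while a query of "résumé" would prefer "Résumé", so minimal pairs that differ only in diacritics stay distinguishable.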

Update component, assign to maintainer.

Removing assignment from some tasks I'm not actively working on. Volunteers welcome, I'm happy to help if pinged!

Change 286580 had a related patch set uploaded (by Luis Felipe Schenone):
Further normalise keys to find titles with special characters

https://gerrit.wikimedia.org/r/286580

Change 286580 abandoned by Sophivorus:
Further normalise keys to find titles with special characters

Reason:
I haven't been able to rebase this, so I'm submitting another change for it.

https://gerrit.wikimedia.org/r/286580

Change 382968 had a related patch set uploaded (by Sophivorus; owner: Sophivorus):
[mediawiki/extensions/TitleKey@master] Further normalise keys to find titles with special characters

https://gerrit.wikimedia.org/r/382968

@brion Hi brion! Many months ago I submitted a patch that fixed this issue, but it went unnoticed. Now I submitted another patch because the previous one couldn't be merged any longer. It doesn't do everything that is requested in the task description, but it does at least part of it. Could you check it, please? Thanks!

Sophivorus removed Sophivorus as the assignee of this task.

Another year gone... The lack of normalization contributes to "Illegal mix of collations" errors, which result in user-facing database query error messages due to search term encoding.

Mon Dec 10 23:07:08 UTC 2018    test    wikidb-mw_      TitleKey::exactMatch    localhost       1267    Illegal mix of collations (latin1_bin,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '=' (localhost)    SELECT  page_namespace,page_title  FROM `mw_titlekey`,`mw_page`   WHERE (tk_page = page_id) AND tk_namespace = '0' AND tk_key = 'KINGMAKER ZARZĄDZANIE'  LIMIT 1

Applying @Sophivorus's patch to the master branch avoids the error.

Change 382968 abandoned by Sophivorus:
Further normalise keys to find titles with special characters

Reason:
Per Brion

https://gerrit.wikimedia.org/r/382968