
Support collation by a certain locale (sorting order of characters)
Closed, ResolvedPublic

Description

National Wikipedias still have to put up with the English sorting order, which places accented characters outside the normal sequence; this is "not acceptable" for those languages. The problem is most visible in Categories, but it is present in every automatically generated list, such as Allpages or the list of registered editors.

I would guess either MySQL or PHP can utilise national locales/collation orders, which would simply require the sites to specify their locale and there we go. Or do I guess wrong? (I am only familiar with Perl and Postgres, where this works.)


Version: unspecified
Severity: major
URL: https://www.mediawiki.org/wiki/Bugzilla/Notes/164

Details

Reference
bz164

Event Timeline


I've added some feedback on the MySQL bug. I urge anyone here who cares about this bug to also pay attention to that one: perhaps by working on their code, voting for the bug, or providing further helpful feedback. Perhaps they will raise its importance if they see it matters to MediaWiki and Wikipedia, a top-5 website. I believe MediaWiki and MySQL have worked together in the past.

http://forge.mysql.com/worklog/task.php?id=1213

Well, I didn't want to chime in (you seem to be able to shout at each other without my assistance :-)), but please do not forget that the solution for MEDIAWIKI (as opposed to Wikipedia) ought not to be based on MySQL-specific features.

If there are two approaches, and one of them is database backend agnostic then that one should be preferred.

Of course I didn't want to mention that I believe code speed (resource hunger) is a real factor here, since this is accessed frequently.

By no means do I want to speak against helping MySQL, since it's a nice open source project. I just use PostgreSQL, that's all. :-)

Anyway, the backend compatibility layer that maps queries into the SQL dialect spoken by the SQL engine can still adapt itself: if the SQL backend offers true collation support, it can be used instead of the client-side (PHP) implementation in the MediaWiki code, provided it brings additional performance benefits and stronger security. If there's no real difference, then we can simply abandon SQL-based collation (even where it is supported) and use the PHP implementation everywhere.

This will simplify the design of the SQL-backend adaptation layer within the MediaWiki code.

This will allow easier integration of other backends, such as Postgres (suggested in Comment #150), or Oracle, Sybase, Informix, MSSQL, other backends available on Windows servers through ODBC, or backends accessible through JDBC via a PHP-to-Java adaptation layer, if the PHP code used in MediaWiki can also run from within a Java-based JSP application server, a .Net-based ASP application server, or one of the newer VMs already being developed to support both the Java and .Net environments simultaneously. There is very extensive research into making all these VMs compatible with each other (and into easing migration through mutual compatibility, and transparent deployment facilities in heterogeneous environments like computing grids and cloud computing).

Note that at some point in the future even the WMF projects could benefit a lot from such facilities, when their servers are virtualized to scale better at reduced cost, because we all know that the WMF projects need more and more money each year to support their ever-growing databases and audience, and the increased need for better internationalization. It is highly strategic (and should also be discussed on the WMF Strategy wiki) to reduce the technical dependencies (also because there is now a demonstrated need for open projects to work together, such as within the Open Alliance, which has lots of content to share with the WMF thanks to their compatible licensing schemes).

bugzilla wrote:

Peter, for PostgreSQL you can ignore all the hacks and workarounds and just use the built in collation.

psql (8.4.1)
Type "help" for help.

a=# create table category (title varchar(255) not null primary key);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "category_pkey" for table "category"
CREATE TABLE
a=# insert into category values ('Apple');
INSERT 0 1
a=# insert into category values ('Banana');
INSERT 0 1
a=# insert into category values ('aaa');
INSERT 0 1
a=# insert into category values ('banana');
INSERT 0 1
a=# insert into category values ('aáa');
INSERT 0 1
a=# insert into category values ('apple');
INSERT 0 1
a=# insert into category values ('Apple');
ERROR: duplicate key value violates unique constraint "category_pkey"
a=# insert into category values ('APPLE');
INSERT 0 1
a=# insert into category values ('aab');
INSERT 0 1
a=# select * from category order by title;

title

aaa
aáa
aab
apple
Apple
APPLE
banana
Banana
(8 rows)

The easiest solution would probably be to drop support for MySQL. Not that that's ever going to happen.

MySQL can do the collation that Anthony mentions for PostgreSQL above in #152 and much more besides.

PostgreSQL has all the problems of MySQL and more when it comes to MediaWiki's collation needs. It depends on the underlying OS, some of which don't do UTF-8 collation at all, and as yet it does not make use of ICU. See the page here: http://wiki.postgresql.org/wiki/Todo:ICU

I can't find any information on whether PostgreSQL supports Unicode beyond the Basic Multilingual Plane.

That's another good reason why collation should be supported directly within the MediaWiki software, which already depends completely on PHP, so it should really use the best possible integration via the dedicated ICU integration module for PHP.
This also means that it will be much better to simply store the computed collation keys directly within the database schema, unless the database adapter for PHP already supports ICU (and there's a commitment from the database vendor to integrate ICU as part of its design).
The Todo:ICU item in PostgreSQL will be fine when it is effectively implemented and that integration becomes fully supported.
But anyway, the first thing to do is to press the PHP developers for their own commitment to offer full support and integration of ICU within PHP, and if this is still not the case, to make sure that the ICU integration module for PHP comes from a viable project (otherwise MediaWiki will need an adaptation layer for collation that supports transparently switching to another PHP integration module, or to a later integration within PHP core itself).
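
For illustration, here is a minimal sketch of how locale-aware collation keys can already be computed from PHP, assuming the bundled intl extension (the ICU wrapper available since PHP 5.3) is installed; the locale and titles are only example values:

// Swedish tailoring as an example locale; any ICU locale id works here.
$collator = new Collator( 'sv_SE' );
$titles = array( 'Zulu', 'Äpfel', 'Anton', 'École supérieure' );
// getSortKey() returns an opaque binary key; comparing keys with strcmp()
// reproduces the locale's collation order without further ICU calls.
usort( $titles, function ( $a, $b ) use ( $collator ) {
    return strcmp( $collator->getSortKey( $a ), $collator->getSortKey( $b ) );
} );
print_r( $titles ); // with sv_SE, 'Äpfel' now sorts after 'Zulu'

These binary keys are exactly the kind of precomputed collation keys that could be stored in the database schema, as suggested above.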

The second thing to look for (and that is still missing) is support for an ICU-like project (or port) for JavaScript (for integration on the client side, in the browser), here also with a JavaScript-written adapter layer that allows replacing the JavaScript-written collator with some future API supported natively by browsers (because it will perform much better).

The best integration tools (for client-side collation) that I have seen, using JavaScript, fully depend on AJAX (i.e. on collaboration with server-side scripts that can provide precomputed collation data, or that can compute the collation keys from client-provided texts): some interesting demos use JSON or XML requests through AJAX, but this adds delays and increases the number of HTTP requests needed to sort lots of client-side data (for example when sorting rendered HTML table columns, which currently just uses the JavaScript "localeCompare" function, which seems to use only the DUCET or some locale-neutral collation, without taking the actual locale into account).

It would be much better if all major browser engines (IE, Mozilla for Firefox, WebKit for Safari/Chrome/KHTML) decided to extend the very poor support for Unicode and locales in JavaScript/ECMAScript strings, using ICU as a base foundation or at least for the services API that it can implement (even if those browsers use different integration strategies): they should still support the same collation rules, with the same syntax in a similar language, such as the languages already documented in the Unicode standard, including the possibility of using the collation tailoring data already coming from the CLDR project, and the possibility for these implementations to still support user-specified tailorings (without hardcoding them in a way that would completely depend on the implemented Unicode version and the limited list of locales already supported by CLDR).

There are two standard languages defined for collation tailorings: one is XML-based but extremely verbose (it is probably easier to use from a DOM-based object view, and most probably more efficient at run time); the other, equivalent one is much more compact, more readable, and much easier for users or scripts to specify. Both syntaxes are automatically and easily convertible into each other, with equivalent compilation times and complexities, but the compact form is easier to transmit in small scripts over HTTP (including through AJAX), and it is much faster to parse because it does not depend on a resource-hungry XML parser (with its required and complex conformance rules). For JavaScript clients, a JSON-based syntax for collation tailorings may be even more efficient without requiring additional complex code written in JavaScript itself.

Apparently non-BMP UTF-8 was in MySQL 6.0 which was around as alpha for a while and whose features are all due to be added into upcoming 5.X versions: http://lists.mysql.com/packagers/418

chriskwonlewis wrote:

I'm not sure if I'm posting in the right category, but I figured I should post it here before making a new one.
There's a problem for sorting with Korean page names.
If the page is 가, it will sort under ㄱ if you don't specify where to sort it (which is what it should do).

However, if I manually sort it to like this: [[Category:name|ㄱ]] it won't sort under the same ㄱ as before. Instead it will make another ㄱ to sort under.
So if a user manually sorts a page under ㄱ, and another user makes a page that starts with the letter ㄱ and lets MediaWiki sort it automatically, then even though they should be sorted under the same letter, they are not. My MediaWiki is the latest version, by the way (1.15.1).

In case you're confused:


ㄱ
- 가 (manually sorted page)
- (page)
.
.
.
- (page)

<<end of the Korean alphabet: there should be no more Korean letters, but pages that weren't sorted manually will appear after this point>>

ㄱ (appears again at the end; this is the identical letter as before and should not be recreated by MediaWiki)

  • 기 (a page that was sorted by MediaWiki; this should be in the same section as the page 가)

catlow wrote:

As I understand it, this bug was wrongly assigned to someone who is not working on it (at least, not in its full generality), so I'm boldly de-assigning it, in the hope that someone will finally take it up.

When setting up a new(!) wiki, and when using MySQL 4.1/5.0 UTF-8 (not: binary) a correct page title sort order can be achieved when applying this patch

in maintenance/tables.sql change

page_title varchar(255) binary NOT NULL,

to

page_title varchar(255) NOT NULL,

before you run the installer.

The column page_title will then become of encoding type

character set utf8 collate utf8_general_ci

and the sorting order of page titles is correct (at least for European languages). You then get page titles in Special:Pages listed in the order "Anton", "Äpfel", "École supérieure", "Zulu" - and no longer "Anton", "Zulu", "Äpfel", "École supérieure" as is currently the case on all WMF wikis.

This tip is only valid for fresh installations: you cannot change this important column when updating an existing database, not even by mysqldumping and re-importing. Attention: the tip has not yet been checked for unwanted side effects.

For collations see
[1] http://www.collation-charts.org/mysql60/ and
[2] http://www.collation-charts.org/mysql60/mysql604.utf8_general_ci.european.html

studerby wrote:

Re: comment 159

While the suggested patch clearly dramatically improves the page title sorting, it doesn't make it 100% correct for all "European languages" though it may make it so for some, and I think it would be an improvement for all but the Nordic languages.

The different European languages have different and conflicting collation sequences; no single order will work for all of them.

For example:

German speakers expect "Ö" and "O" to collate as equivalents; Danes, Norwegians, and Swedes expect "Ö" to be treated as a separate letter that sorts after "Z".

French has an unusual collation rule; the last accent on a word is more significant than the first, leading to this collation:
cote < côte < coté < côté. This is unique to French, so far as I know.

Hungarian sorts the combination "dzs" after the combination "dz" (i.e. "foodzsbar" should sort after "foodzbar" in Hungarian).

In Estonian, "Z" sorts as an equivalent to "S".

Etc. etc. etc.

The suggested patch may be very appropriate for individual Wikimedia installations, but is not so as part of the proper fix.

ayg wrote:

MySQL's utf8 support includes collations for a number of common languages, including Danish, Swedish, French, Hungarian, and Estonian:

http://dev.mysql.com/doc/refman/5.1/en/charset-unicode-sets.html

(In reply to comment #160)

Actually, I think the issues you raised with collations and languages have nothing to do with MediaWiki; if anything, MySQL is to blame. All we want here is for MediaWiki to use the full extent of the collation features offered by MySQL. Perhaps not all languages will benefit from it, but at least some of them do, and that's reason enough for a change like this.

(In reply to comments #159 and #160)

While the suggested patch clearly dramatically improves the page title sorting,
it doesn't make it 100% correct for all "European languages" though it may make it so for some, and I think it would be an improvement for all but the Nordic
languages.... Norwegians, and Swedes expect "Ö" to be treated as a separate letter that sorts after "Z".

Yes, I know this, too.

The suggested patch may be very appropriate for individual Wikimedia
installations, but is not so as part of the proper fix.

Thanks for pointing that out. I must admit that you are _fully_ right; I was also aware of the collation differences. We developers should think collaboratively to find a satisfying solution for the "collation problem".

This could be done in Berlin during the Wikimedia Developer Workshop 2010, held on April 14-16 in Berlin, Germany, as part of the Wikimedia Conference 2010. http://techblog.wikimedia.org/2010/03/registration-open-for-the-developer-workshop-in-berlin/

ayg wrote:

(In reply to comment #163)

Thanks for pointing out. I must admit, that you are _fully_ right, also knowing
of the collation differences. We developers should collaboratively think to
find a satisfying solution for the "collation problem".

The options I know of are:

  1. Use MySQL collation support, just using utf8 everywhere. This will mess up equality comparisons in ways we almost certainly don't want, doesn't support characters outside the BMP, etc.
  2. Roll our own sortkeys. Probably requires a lot more work, and a lot less efficient (need to keep extra column+index around in all relevant tables). Also might be tricky, since the default Unicode algorithm gives very long sort keys, which we'd have to squish somehow (AFAIK).
  3. Use some utf8 collation for cl_sortkey, but keep everything else binary. We could also add an extra page_sortkey using utf8 collation, which is no more expensive than (2). This sounds like the best option to me offhand. It would be as easy to implement as anything here, and wouldn't have such bad side effects. It would somewhat mess up pages with non-BMP characters in the name, though.

(In reply to comment #164)

(In reply to comment #163)
The options I know of are:

  1. Use MySQL collation support, just using utf8 everywhere. This will mess up equality comparisons in ways we almost certainly don't want, doesn't support characters outside the BMP, etc.
  2. Roll our own sortkeys. Probably requires a lot more work, and a lot less efficient (need to keep extra column+index around in all relevant tables). Also might be tricky, since the default Unicode algorithm gives very long sort keys, which we'd have to squish somehow (AFAIK).
  3. Use some utf8 collation for cl_sortkey, but keep everything else binary. We could also add an extra page_sortkey using utf8 collation, which is no more expensive than (2). This sounds like the best option to me offhand. It would be as easy to implement as anything here, and wouldn't have such bad side effects. It would somewhat mess up pages with non-BMP characters in the name, though.

Can you start writing an extension, which outputs the list of pages http://en.wikipedia.org/wiki/Special:AllPages in a user-definable collation such as utf8_general_ci; utf8_unicode_ci; utf8_swedish_ci ? This would be a good starting point in my view.

ayg wrote:

This shouldn't be an extension, we may as well have a page_sortkey added in core. Makes more sense there. However, categories are lower-hanging fruit, since we just need to change cl_sortkey's collation -- the column already exists. Plus, categories are more visible. This needs a sysadmin either way, however, not just developers.

(In reply to comment #166)

This shouldn't be an extension, we may as well have a page_sortkey added in
core. Makes more sense there. However, categories are lower-hanging fruit,
since we just need to change cl_sortkey's collation -- the column already
exists. Plus, categories are more visible. This needs a sysadmin either way,
however, not just developers.

Sounds reasonable, OK, but who comes up with a sustainable solution for the core code and this bugzilla?

ayg wrote:

Changing cl_sortkey to utf8 *is* a solution. It needs a sysadmin to do it on Wikimedia (if we want to do it), and we'd probably also want some core code to filter out non-BMP characters before they get to the sortkey.

[[mw:Extension:CategoryMultisort]] and [[mw:Extension:CategoryMultisortChinese]], written by me, are expected to resolve the problem on the Chinese Wikipedia.

I think using ICU sort keys, possibly with a pure PHP port to support non-WM installations, would be a good way to go. It would support all the languages we want to support, which is important.

ayg wrote:

I'll be working on this bug. I hope to have a solution coded up within a couple of weeks. I wrote a post to wikitech-l about it, and encourage people to respond there rather than here (since this involves several bugs):

http://lists.wikimedia.org/pipermail/wikitech-l/2010-July/048399.html

I'll also note that the enormous amount of bickering and repetition here meant that I wound up skimming it, so if I missed your suggestion, please repeat it on the wikitech-l thread.

But if the SQL engine does not have such support, this must be implemented in
the PHP code and collation keys can be stored in a new datacolumn (the extra
data column can be added or filled conditionnally : if the SQL engine supports
the needed collations, this column can remain NULL to save storage space).

If you sort this stuff in PHP, you need to grab the entire list before you can
reliably sort it. Doing that for [[Category:Living people]] has no chance of
staying within the memory limit.

And this was false, because you assume that the generation of sort keys has to happen during the database queries that list the content of a category, when instead the generation of sortkeys can safely be done only on database inserts/updates for each separate page.

What I wanted to say is that the computed sortkeys will have to be stored. But several sort keys for the same page in the same category are possible (one for each collation locale indicated by the target category).

There will be no memory limit, but of course this will have a storage cost, as the stored sort keys will have to be queried along with the list of pages to display.

The right question to ask is: where do we store these sortkeys? Currently we have an SQL relation table containing a unique index on (categoryID, sortkey, pageID), and this is enough to perform the join with the table of pages. However, there is still only one sortkey per page and per category.

That sortkey needlessly embeds the pageID within the generated sortkey (this is visible when crossing a page limit and navigating through pages), so in fact the unique index is on (categoryID, "augmented sortkey"). Conceptually wrong and bogus (I think this was just a quick patch made when there were uniqueness problems and multiple pages could be specified with the same sortkey).

What is stopping you from changing the relation table containing the list of pages in categories to use instead a unique index on:
(categoryID, sortkey, pageID, localeID)
where localeID is one of the locales supported by the project, specifying the language for which a collation is performed and a collation variant (for example, in Chinese, sort by radical/strokes with locale="zh-Hans", or by pinyin with locale="zh-Latn")?

The generation of the content of the sortkey column is the only major problem requiring a design decision. This is where it should not depend on the SQL engine at all, and where it can be implemented within PHP, using the PHP extension that exposes ICU functions. That string does not have to be extremely long and does not have to be human-readable.

It can safely be stored with a reasonable length limit, so ICU-generated sortkeys are still safe if they get truncated, notably because the unique index on:
(categoryID, sortkey, pageID, localeID)
is also unique on its restriction:
(categoryID, pageID, localeID)
And the sortkey generated by ICU, even if it is a string of binary bytes, can still safely be stored in a table index that does not support blobs but only accepts "VARCHAR(n)" types, by serializing the binary sortkey to a safe encoding (the most basic one that will work is hexadecimal) that does not even require support for Unicode or UCA collation. Just use an ASCII-only column to store the computed binary sortkey serialized as an ASCII-only string.

But if the database engine supports sortable strings of bytes, just don't serialize the blob; use the SQL type that can store it directly, for example VARBINARY(n), as long as it remains sortable in binary order.
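
As a hedged illustration of the VARCHAR(n) case: in PHP, a plain bin2hex() call already gives an ASCII, order-preserving serialization, because each byte becomes a fixed-width pair of hex digits and the hex digits are in ascending ASCII order; a denser Base-32 or Base-36 packing is an optimization of the same idea. Sketch, assuming the intl extension:

$collator = new Collator( 'de_DE' );      // example locale
$raw = $collator->getSortKey( 'Äpfel' );  // opaque binary sortkey from ICU
$ascii = bin2hex( $raw );                 // safe for an ASCII-only VARCHAR column
// Sorting the serialized strings byte-wise (e.g. ORDER BY on a column with a
// binary/ASCII collation) gives the same order as sorting the raw binary keys.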

With this design, you are completely independent of the SQL engine; it will work identically on MySQL, PostgreSQL, or others. And you'll have solved the problems of collation with multiple locales according to their rules, and possibly according to visitors' preferences.

Note: above, the localeID is not directly a VARCHAR(n) containing "en" or "zh-Hans". It is an arbitrary unique numeric identifier that in fact maps to a collation rule within a locale, and this collation rule may need to be updated from time to time: when upgrading a collation, you'll generate additional keys with a new localeID. Once this is done, the table of supported collations will indicate which localeID is the current one to use, and you'll easily be able to clean up the old sortkeys that were computed with the old rule.

It's not a complicated design, and it offers stability guarantees as well as the possibility of upgrading the collations.

The possibility of offering multiple sortkeys for the same page in the same category comes as a bonus, and you can assign "localeID=0" to store the user-specified sortkey assigned in a page with the second parameter of the [[category:pagename|sortkey]] link, or with the parameter of {{DEFAULTSORT:sortkey}} if the former is missing (this avoids having to reparse the pages just to retrieve this user-specified sortkey).

(In reply to comment #172)

And this was false. Because you assume that the generation of sort keys has to
be done on database queries for listing the content of a category, when instead
this generation os sortkeys can be done safely only on database inserts/updates
for each separate page.

What I wanted to say is that the computed sortkeys will have to be stored. But
several sort keys for the same page in the same category are possible (one for
each collation locale indicated by the target category).

[etc., etc., etc.]

I suggest you read comment #171 and the wikitech-l thread linked from there. That thread is an implementation proposal by Aryeh, who, unlike pretty much everyone else commenting on this bug (myself included), is actually gonna start implementing it; he suggested he might start today. This bug already has 173 comments, and posting more long comments about hypothetical implementations while not replying to the implementation plan of the one person actually being serious about getting this done is not useful and might just push this bug to 200 comments without getting anywhere. If we're gonna discuss anything, let's discuss the current implementation plan: it is the only relevant plan to discuss at this point.

If we're gonna discuss anything, let's

discuss the current implementation plan: it is the only relevant plan to
discuss at this point.

This is EXACTLY what I was discussing: proposing an implementation design which also considers the fact that collations will need to evolve over time (for example, the UCS repertoire is evolving, so the DUCET table is modified, and collation rules are being corrected for some languages in the CLDR project): each change will generate a new internal localeid to support it, and there may be different keys during the transition, even if (finally) an old collation rule is deleted (with its old sortkeys) after the new sortkeys have been fully recomputed.

So this is clearly not "blah blah". And this is certainly relevant given that you're considering implementing some or all of the suggestions (and you'll have to test your solutions, including their performance impact).

I propose a simple model that can be very space-efficient, and that also avoids reparsing all pages if ever a collation rule is changed, or if additional collation rules are added in a category to support multiple sort orders (notably within Chinese categories that could support different orders).

My proposal does not even depend on the backend SQL server capabilities (all it has to support is a binary order on ASCII-only VARCHAR(n) columns to store the computed and truncated sortkeys, which will be generated by the PHP front end (using ICU) and followed by an ASCII-only serialization). This means that the simplest "ORDER BY" queries to retrieve correctly ordered lists of pages will work independently of the backend.

The function used in PHP to generate the binary-ordered sortkey (that will finally be stored) should also be accessible in MediaWiki as a builtin parser function, which will take two parameters: the text and the locale code.

For example, as {{SORTKEY:text|locale}}, where the specified ''locale'' can be optional and will default to the {{CONTENTLANGUAGE}} of the project.

This builtin parser function could also be used to correctly sort the sortable MediaWiki tables inserted in articles, by using correct sortkeys generated by this function, provided the generated sortkey is readable and effectively serialized as ASCII-only; it does not necessarily have to be truncated by this function, even if it will be truncated when the same function result is used to store sortkeys in the schema.

This parser function should even be the first thing to implement, before changing the category-page indexes, because it can be applied and tested immediately in existing categories (for example by using categorizing templates in Wiktionary), without even upgrading the SQL schema.
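
To make the idea concrete, here is a rough, hypothetical sketch of how such a parser function could be wired up. The ParserFirstCallInit hook and Parser::setFunctionHook() are standard MediaWiki extension points, but the function body, the helper names and the omitted magic-word registration are illustrative only, and assume the PHP intl extension for the ICU part:

$wgHooks['ParserFirstCallInit'][] = 'efSortkeyParserInit';

function efSortkeyParserInit( $parser ) {
    // 'sortkey' would also need to be declared as a magic word; omitted here.
    $parser->setFunctionHook( 'sortkey', 'efRenderSortkey' );
    return true;
}

function efRenderSortkey( $parser, $text = '', $locale = 'en' ) {
    // Hypothetical body: compute an ICU sortkey and serialize it as
    // order-preserving ASCII so it can be used in wikitext (e.g. sortable tables).
    $collator = new Collator( $locale );
    return bin2hex( $collator->getSortKey( $text ) );
}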

(In reply to comment #174)

If we're gonna discuss anything, let's

discuss the current implementation plan: it is the only relevant plan to
discuss at this point.

This is EXACTLY what I was discussing: proposing an implementation design,

AFAICT you're proposing your own design without mentioning where it differs from or coincides with Aryeh's proposal. From briefly reading your latest comment, it looks like a lot of what you're proposing is already in Aryeh's proposal, such as the binary-ordered sortkey thing. Other things, such as the attention you're paying to the possibility that collation rules may change, are not covered in Aryeh's proposal, so those are valuable additions.

So you're right, it's certainly not "blah blah", but it would help a lot if you limited your comments to the differences between your and Aryeh's proposal, rather than leaving it up to the reader to find and filter the parts the two proposals have in common.

Anyway, Aryeh's proposal is not found or documented at the location you indicate. It just says that he will try to work on it today, and it still asks for solutions to questions that are unanswered in his comment on the wikitech-l list.

In a similar spirit, the generation of the section heading for categories could
also be a different builtin parser function such as:

: {{COLLATIONMAP:text|locale|level|clusters}}

with the similar parameters, and clusters=1 be default (more below). It will
return a non-opaque string that can be displayed in category pages, and could
even become a valid anchor for searching starting at some text position in the
list (ordered using the specified locale). It should be only a level-1
collation mapping by default (only level 1 will be considered for displaying
headings, however some categories could later specify another default collation
level for such mapping).

Note that collation-based mappings is a VERY different concept from the concept
of collation-based sortkeys (read the Unicode UCA specification): sortkeys are
opaque and intended to generate a full order of texts, mappings are readable
but only intended to produce a partial order.

Another optional parameter could be given after the collation level, to
indicate how many locale grapheme clusters should be included in the heading.
The default
headings in categories should just use 1 grapheme cluster from the text given
to the first parameter of {{COLLATIONMAP:text|locale|level|clusters}} (more
below).

In a category you could *LATER* specify additional sort orders (and collation
mappings at the same time) using a syntax like:
{{SORTORDER:locale|level}} (the default collation level will be 1 for
categories).

Example 1: in a Chinese category page, you could specify:
; {{SORTORDER:zh-hans}}
: to indicate that pages in that category will be also available using the
radical/stroke order of sinograms.
; {{SORTORDER:zh-latn}}
: to indicate that pages in that category will be also available using the
Pinyin Latin order.

Example 2: in a Korean cateogry, where the primary order is based on the
decomposition into "jamos", it will be enough to specify:
; {{SORTORDER:ko}}
: (for the default South Korean order of jamos)
; {{SORTORDER:ko-kp}}
: (for the default North Korean order of jamos)

Indicating the collation level with a value different from 1 could generate sub
headings for level 2, but I think it should only display them with the
specified level, all using the same heading from the level-2 collation mapping.

Indicating clusters=2 (or more) in the builtin parser function {{COLLATIONMAP:}}
may be used to generate more precise headings (for example in heavily English
or French populated categories, use 2 grapheme clusters to generate headings on
the 2 first letters, but keep the collation level to 1). By default the builtin
parser function will not limit the number of grapheme clusters (so it will
remap all the text), but a category could still specify this maximum number of
clusters to consider for headings.

By default a category will consider either:

  • clusters=1 : section headings will be generated only for the first cluster (this is what is currently used in categories), or
  • clusters=0 : section headings will not be generated at all.

(This default could be a per-project default.)

The generation of section headings (using the same PHP function used by
{{COLLATIONMAP:}}) does not require any modification of the schema. Headings
can be computed and generated on the fly, from the retrieved list of pages.
Headings should not even influence the sort order, they are just convenient
groupings for display.

The {{COLLATIONMAP:}} function described here would in fact join the list of
builtin string functions, as it falls in the same category as other mapping
functions like {{UC:}} or {{LC:}}. This is just another mapping.

The generation of sort keys (using the same PHP function used by {{SORTKEY:}})
however requires active parsing to store them in the schema. So it can only be
done later. This development should be made *later* (when the SQL schema for
category indexes is adjusted to support multiple/upgradable collation
orders).

So yes, a builtin parser function such as {{COLLATIONMAP:text|locale|level|clusters}} should first be implemented and tested separately before being used to group items in category headings, but it can be implemented separately from the development and deployment of the SQL schema for indexing categories with {{SORTKEY:}}, and integrated later in the code that presents the list of pages to the users.

(In reply to comment #176)

Anyway, the Aryeh's proposal is not found or documented at the location you
indicate. It just says that he will try to work on it today and still asks for
solutions and asks for questions that are unanswered in his comment on the
wikitech-l list.

He does document a fair bit of what he wants to do (e.g. separating files, categories and regular pages with sortkey prefixes). It's not a complete proposal and he is asking for input, but he hasn't left everything open either.

Also, why keep the discussion in two places? Why not join the discussion on wikitech-l?

Because his specification is really incomplete, and he said that Bug#164 was useless (despite the fact that I had described my solution extensively in Bug#164 long before Aryeh started working on it).

And yes, before ever attempting to change the schema, I support the prior development and extensive testing of builtin parser functions backed by PHP functions which will later be shared to support the updated SQL schema.

This development alone will involve significant difficulties:

  • notably integrating ICU in a PHP module installation, or
  • rewriting the collation algorithms entirely with PHP;
  • having to support the DUCET updates caused by new Unicode versions or corrections;
  • having to support multiple collation orders via per-locale tailorings (coming from CLDR or from other sources).

The need to support upgraded collation orders is also an important decision factor for the schema, if sortkeys are stored in an SQL backend; that's why I bring it up so early:

  • collations supported by SQL backends have very strong limitations, and any upgrade would require shutting down the servers for hours or days to rebuild the collated indexes;
  • they also lack a full ISO 10646 "level 3 implementation" for the support of supplementary planes.

All this can be avoided completely by using ICU and not depending on SQL backends to support the many collation locales that we need in our international projects:

  • the schema just needs to be able to store multiple sortkeys, so that newer sortkeys (computed with the new rules) can be progressively computed in the background by a bot or server script, or by upgrades occurring on the fly when processing articles.
  • older sortkeys that were using an older generation rule can be deleted in a single DELETE operation after the new collation rule for a corrected locale has been made the default one, or can be deleted one by one each time a new-generation sortkey is recomputed and inserted (there's not even a need to perform the two successive operations in a transaction if the first INSERT with the new rule was successful).

Because we now have multiple sortkeys per indexed page in a category, we can conveniently support multiple sortkeys for different locales and offer a good experience for users who want alternate sort orders (notably Chinese users who will want presentations in radical/stroke order, or in pinyin order).


Another note about how to serialize the opaque sortkeys: the builtin function {{SORTKEY:text|locale|level}} described above will not limit the length of the generated binary sortkey; however, it should serialize it as valid Unicode text that can be used in tables.

A convenient serialization of bytes to characters that will sort correctly is Base-36 using the alphabet [0-9A-Z] (no padding necessary), or Base-32 (it avoids modular arithmetic but will serialize into longer strings).

If sortkeys are to be stored and retrieved in the SQL schema, and sorted by the SQL clause "ORDER BY ... sortkey ...", then:

  • either the SQL backend allows storing and sorting binary sequences of bytes as VARBINARY(N): then no extra serialization is needed; store that opaque sort key directly, after truncating it to the maximum length (N) indicated in the SQL type of the "sortkey" table column;

  • or the SQL backend does not support sortable binary sequences of arbitrary bytes and can only sort VARCHAR(N): then use a similar Base-32 or Base-36 conversion to create compatible sortkeys, and store the converted string after truncating it to the maximum length (N) indicated in the SQL type of the "sortkey" table column;

  • in both cases, the stored sortkeys will NEVER be exposed to users; their sole purpose is to make the SQL "ORDER BY" clause work properly.

To start listing a category from a given arbitrary Unicode text, use the "start=" HTTP query parameter and internally compute the sortkey associated with it to generate the value used in the SQL clause "WHERE sortkey >= 'value'".

  • Section headings in categories will never need to be stored; they are generated on the fly by reading the page names retrieved in the SQL result set with the {{COLLATIONMAP:}} function, using the locale specified in the "uselang=" HTTP query parameter and the specified (or default) "clusters=" parameter (whose default will be 1 or 0 as indicated above). They are directly readable by users and do not require decoding anything from the stored sortkey.

  • The readable collation mappings and the opaque sortkeys should be coherent for the same locale, but they are clearly different: pagenames that are collation-mapped should sort in the same natural order as the section headings generated from them, so it is absolutely not necessary to generate sort keys from collation-mapped headings computed on the fly.

(In reply to comment #178)

All this is something that can be avoided completely by using ICU and not
depending on SQL backends for their support of many more collation locales

This is exactly what Aryeh is proposing. I think everyone agrees that it's better to use binary sorting with munged sortkeys rather than SQL backends' half-baked collation support, so you don't need to argue that.

  • Section headings in categories will never need to be stored; they are generated on the fly by reading the page names retrieved in the SQL result set with the {{COLLATIONMAP:}} function, using the locale specified in the "uselang=" HTTP query parameter and the specified (or default) "clusters=" parameter (whose default will be 1 or 0 as indicated above). They are directly readable by users and do not require decoding anything from the stored sortkey.

This is not that simple, as was pointed out on wikitech-l: what if you've got a language where á sorts the same as a (that is, a and á are the same for sorting purposes), then your sorted result could look like:

Áa
Áb
Ac
Ád
Ae
Af
Ág
...

We humans understand that the proper heading for this is "A", not "Á", but how will the software know that? Even if we store the original, unmunged sortkeys (note that the sortkey is not necessarily equal to the page name: [[Albert Einstein]] sorts as "Einstein, Albert") so we can differentiate A from Á, we can't just take the first or even the most frequent letter: neither is accurate in this case.

Your issue *IS* addressed in my proposal:

*Both* {{COLLATIONMAP:Áa}} and {{COLLATIONMAP:Aa}} will be unambiguously "aa" in the primary collation level for that locale. They only differ at the secondary collation level (with accents).
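
For what it's worth, this primary-level equivalence can be demonstrated with PHP's intl Collator (a hedged sketch, assuming intl is available; the locale is only an example):

$coll = new Collator( 'fr_FR' );
$coll->setStrength( Collator::PRIMARY );  // ignore accents (secondary) and case (tertiary)
var_dump( $coll->compare( 'Áa', 'Aa' ) ); // int(0): equal at the primary level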

You did not understand that a collation-mapping is DIFFERENT from an opaque sort key, even if both are built using the same collation rules for the same locale.

The case of "Albert Einstein" sorting as "Einstein, Albert" will pass through the standard generation of the sortkey from the string "Einstein, Albert" explicitly given in MediaWiki source code as the parameter of the {{DEFAULTSORT:Einstein, Albert}} special function or as the second parameter of the [[category:...|Einstein, Albert]] link.


So here is a development and deployment map:

  1. Develop the PHP functions that will compute the following (a sketch follows this plan):

    function sortkey($text, $locale, $level=1)
    • it will return an opaque array of byte values
    • $locale may be given a default value from the project's content language, but this is not specified here; it is part of its integration in MediaWiki
    • $level may take the default value of 1.
    • the algorithm involves parsing steps to transform the $text parameter into normalized form, then parsing it by collation element, and then looking up each collation element in the tailored collation table, which is indexed by collation element value and returns an array of collation weights, one for each level.
    • it packs all the collation weights into the returned opaque array of byte values, by appending all non-zero collation weights for each collation element at the same collation level before appending the collation weights for the higher successive levels.

    function collationmap($text, $locale, $level=1, $clusters)
    • it will return a remapped text using the same $locale and $level parameters
    • it will use the same DUCET table and the same per-locale tailorings
    • the same parsing steps are performed
    • but instead of packing the collation weights, it scans the collation table in the reverse order, locating the first collation element (a small Unicode string, often limited to a single character) that has the same collation weights up to the specified level. When this smallest collation element is found, it is appended to the result string.

    function base36($bytes)
    • it packs the opaque binary array of bytes into plain ASCII that has a safe binary order and can safely be stored in a VARCHAR(N) table field, or that can be returned by a MediaWiki function.

    This module should use ICU and implement the locale tailorings, and should be able to support a full DUCET table, and allow lookups from a collation element to an array of collation weights, or the reverse (and ordered) lookup from a collation weight to a collation element for the function collationmap().

  2. Integrate these functions in a MediaWiki extension for builtin parser functions.

    {{SORTKEY:text|locale|level}}
    • this will return base36(sortkey($text, $locale, $level))
    • by default take $level=1
    • by default take $locale={{CONTENTLANGUAGE}} (or {{int:lang}} ?)
    • it can be used as a much better implementation of Wikipedia templates like [[Template:Sort]]

    {{COLLATIONMAP:text|locale|level|clusters}}
    • this will return collationmap($text, $locale, $level, $clusters)
    • it can be used to simulate the generation of headings in categories, as well as within MediaWiki tables
    • by default take $clusters=null (no limitation of length)
    • by default take $level=1
    • by default take $locale={{CONTENTLANGUAGE}} (or {{int:lang}} ?)

  3. Build a function for mapping category sortkeys into stored sort keys; this will depend on the SQL backend capabilities and on the length constraint of the sortkey data columns in the schema:

    function sqlsortkey($text, $locale, $level)
    • it will return either: substring(sortkey($text, $locale, $level), 0, $maxlength)
    • or: substring(base36(sortkey($text, $locale, $level)), 0, $maxlength)
    • the choice will depend on whether the SQL engine supports a sortable VARBINARY(N), or only VARCHAR(N)
    • the sortkey will not have to be UTF-8, and will not need any support for the same collation tailorings in the SQL backend.

  4. Update the schema to support the description of the supported internal collation ids.
    • add a mapping function from the human-readable $locale parameter to the collation id associated with the current version of the collation rule applicable to a locale.
    • support this mapping with a simple "collations" relational table

  5. Build a new category index based on: (categoryid, collationid, sortkey, pageid)
    • categoryid and pageid are standard MediaWiki page ids (in any modification version).
    • collationid will come from the previous mapping (using the locale identifier string, where the locale will be determined by HTTP query parameters like "uselang=", i.e. {{int:lang}}, or from the project's default {{CONTENTLANGUAGE}}).
    • the sortkey column will be computed using the PHP function sqlsortkey($text, $locale, $level) described above.
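
To make this plan concrete, here is a minimal, hedged sketch of the helpers from steps 1 and 3, assuming the PHP intl extension (the ICU wrapper) is available. The names and signatures follow the proposal above; bin2hex() stands in for the proposed Base-36 packing (it also preserves binary order, only less compactly), and collationmap() is omitted because it needs reverse lookups into the tailored collation table:

function sortkey( $text, $locale, $level = 1 ) {
    $coll = new Collator( $locale );
    // Map the proposal's collation "level" onto ICU strengths.
    if ( $level <= 1 ) {
        $coll->setStrength( Collator::PRIMARY );
    } elseif ( $level == 2 ) {
        $coll->setStrength( Collator::SECONDARY );
    } else {
        $coll->setStrength( Collator::TERTIARY );
    }
    return $coll->getSortKey( $text ); // opaque binary string of bytes
}

function base36( $bytes ) {
    // Stand-in serialization: fixed-width hex keeps the binary order and is
    // safe for an ASCII-only VARCHAR column; a real Base-36 packing would be denser.
    return bin2hex( $bytes );
}

function sqlsortkey( $text, $locale, $level = 1, $maxlength = 255 ) {
    // Truncation only drops the least significant differences, as argued above.
    return substr( base36( sortkey( $text, $locale, $level ) ), 0, $maxlength );
}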

....

But effectively, we could be even smarter for category indices by separating the namespaces, so that they can be paginated and navigated independently (this means that namespaces will have their own preferred sort order, specified in the project, but this should not require any complex computing of sortkeys: we already have namespace ids that sort naturally in numerical order).

This means that an extended category index would be:
(categoryid, collationid, sortkey, namespaceid, pageid)

Namespaces should never have been included in the computation of the default sortkeys, so instead of using the default sortkey from {{FULLPAGENAME}} (the way it is today), it should just use {{PAGENAME}}.

... Unless a category specifies specifically that it wants to mix the various namespaces, in which case the namespace name will be part of the default sort key, and the namespaceid will become 0 in all pages categorized in that category.

This is a decision factor for the last steps of the development plan above; it should not influence the development of the first two steps.

Note that my development plan does NOT imply an immediate change of the SQL schema. At first, if we only work on the first two steps, no schema change is necessary to effectively test the two functions.

Notably, you'll be able to create test pages containing lists of words formatted as table rows using a template. You'll be able to show the words, in one column, then the opaque (base-36 converted) sort keys in another column, and then the visible collation-mapped string in a third column.

Then you'll include this list within a page adding the table headers and using the existing "sortable" table class to see how they collate when sorting by the second column, and you'll also be able to sort again on the third column, to see that the collation-mapped string will not alter this sort order.

If all this passes, it can be deployed and immediately used on Wiktionary (notably the French Wiktionary) for sorting categories containing huge lists of words, or on the English Wiktionary for use in the template that generates sort keys for people's "Last name, first name".

This will already be considerable progress, even if there is no immediate support for multiple sort orders (in Chinese).

The schema change, in which the {{defaultsort:sortkey}} parameter or the second parameter of the [[category:page|sortkey]] link will be used effectively so that we no longer have to use a sortkey parameter in pages and templates, can be delayed a lot.

But the two developed functions (at least the first one, {{SORTKEY:text|locale|level}}, which is a bit simpler to implement than the second one, {{COLLATIONMAP:text|locale|level}}, which is trickier as it requires special handling for collation elements that are mapped to more than one collation weight per level) should be kept as simple as possible, and should be thoroughly tested before we even think about reusing them for sorting pages in categories and presenting the lists with headings.

There is a separate modification to develop for grouping category contents by namespace. This should be developed and deployed first before even trying to change the schema for sort keys.

In fact, the first integration for displaying categories will be to use {{COLLATIONMAP:}} internally for displaying the headings in category pages. This won't require any change to the SQL schema either.

Finally, the use of {{SORTKEY:}} and the addition of new keywords for use in category pages, which will allow a category to support multiple collations, will happen only at the last step, because it is the most sensitive one and the only step that may finally involve a schema change. There is more than enough work to do in building all the first integration steps, as extensive tests in various locales will be needed, so that all wikis can report their problems with the implemented collations and their supported tailorings.

ayg wrote:

Okay, look, if you make posts this long I'm not going to be able to read them. You have to state your points briefly and clearly, or else I'm just not going to have time to read through carefully and figure out exactly what you're trying to say. Walls of text are really not helpful.

Furthermore, you're focusing too narrowly on implementation details, like the exact interface a function should implement. Please try to state your concerns primarily in terms of use-cases or problems -- scenarios that the implementation will need to handle. Secondarily, it's fine if you propose specific solutions, but I can't evaluate your solutions unless you state clearly what problems they're trying to solve.

I'll comment on a few things you said. If I miss anything important, please restate it *briefly*, and I can address that too.

(In reply to comment #172)

What I wanted to say is that the computed sortkeys will have to be stored. But
several sort keys for the same page in the same category are possible (one for
each collation locale indicated by the target category).

If we store separate sortkeys per category link per locale, it would take vastly too much space, since there are likely to be dozens of locales. categorylinks is already a big table; we can't justify making it ten times bigger for this. Each category can have one locale and only one locale.

That sortkey is needlessly using the pageID within the generated sortkey (this
is is visible when crossing a page limit and navigating throught pages) so in
fact the unique index is on (categoryID, "augmented sortkey"). Conceptually
wrong and bogous (I think this was just a fast patch when there were unicity
problems and multiple pages could be specified with the same sortkey).

It is not needless. We need a unique key to sort by, or else paging doesn't work if many pages have the same sort key (e.g., if it was manually assigned). See bug 4912.

The generation of the concent of the sortkey column is the only major problem
requiring a design decision. This is where it should not even depend on the SQL
engine, and where it can be implemented within PHP, using the PHP extension
that allows using ICU functions. That string does not have to be extremely long
and does not have to be be humane readable.

Yes, we will certainly use ICU or something. GerardM has said he much prefers CLDR, so maybe we'll use that instead.

It can be safely be stored with a reasonnable length limit. So ICU-generated
sortkeys are still safe if they get truncated. Notably because the unique index
on:
(categoryID, sortkey, pageID, localeID)
is also unique on its restriction:
(categoryID, pageID, localeID)
And the sortkey generated by ICU, even if it's a string of binary bytes can
still safely be stored in a table index that does not support blobs but want
only "VARCHAR(n)" types, by serializing the binary sortkey to a safe encoding
(the most basic that will work is hexadecimal) that does not even require the
support of Unicode or UCA collation. Just use an ASCII only column to store the
computed binary sortkey serialized as an ASCII-only string.

But if the database engine supports strings of bytes, just don't serialize the
blob, use the supported SQL type that can store it directly, for example
VARBINARY(n), if it remains sortable in binary order.

With this design, you are completely independant of the SQL engine, it will
work identically on MySQL, PostrgresSQL, or others. And you'll have solved the
problems of collation with multiple locales according to their rules, and
possibly according to visitors preferences.

This was always the intended design. (But note that truncating these sort keys might not always be fully safe: it could cause tons of collisions, depending on the nature of the sort key generation algorithm. These would be resolved arbitrarily, by page_id, which is confusing to users.)

It's not a complicate design, and it offers stability warranties and supports
as well the possibility of upgrading the collations.

Upgrading the collation can be done in-place. The worst case is that categories sort weirdly for a few hours. Also, we would only realistically have to change the collation often on smaller wikis, since the largest wikis should have high-quality collations that cover almost all their pages to begin with. I don't think we need to adjust the schema to prepare for this.

(In reply to comment #183)

It is not needless. We need a unique key to sort by, or else paging doesn't
work if many pages have the same sort key (e.g., if it was manually assigned).
See bug 4912.

It is needless in the sense that the cl_from field is right there and is indexed. We should just be using that field rather than tacking its contents onto the sortkey because we're too lazy to rewrite IndexPager so it can handle paging by more than one field (which is not very hard).

ayg wrote:

Oh, I didn't get what he was saying. Yes, obviously we should use the actual cl_from field, not tack it onto cl_sortkey (is that what we do these days?). I'm going to have to make a lot of changes to the category pager anyway to support paging by multiple things, so I can do that while I'm at it.

(In reply to comment #183)

Upgrading the collation can be done in-place. The worst case is that
categories sort weirdly for a few hours. Also, we would only realistically
have to change the collation often on smaller wikis, since the largest wikis
should have high-quality collations that cover almost all their pages to begin
with. I don't think we need to adjust the schema to prepare for this.

That's a bad assumption: even the highest quality collations will need to be updated from time to time:

  • Unicode evolves and new characters get encoded (new versions are published about every 1-2 years after synchronization and final balloting at both ISO WG2 and the UTC).
  • The content of the Unicode DUCET is NOT stable: characters are inserted into the sequence, so the full list of collation weights needs to be offset where the new characters get inserted.
  • Collations for languages get corrected. We should be able to upgrade these rules when the CLDR project produces new tailorings (CLDR updates are published separately, about every 1-2 years.)

These corrections may be rare (every few months), but when they occur, an upgrade could take many hours and could force the site to go offline while recomputing sortkeys, or NO correction will be possible. Upgrading "in place" is effectively what I proposed, but how will you track which pages need to be reindexed?

A collation ID in the stored index really helps determine which collation rule was used to generate the stored sortkey; in addition, it allows supporting multiple collations. This is the means by which "in place" recomputation can safely be done.

Note: truncating the sortkeys will ALWAYS be needed, just because the database column will still have a length limit. Truncating is not so bad anyway, because:

  • the compact binary sequence of primary collation weights comes at the beginning of the sort key; further length is used to store the compacted sequence of secondary collation weights, then the sequence of ternary collation weights.
  • if truncation occurs, the effect will be that only the smallest differences will not be represented.

But if you accept storing only non-truncated sort keys, you'll still have the case where a page has a long name, plus the case where someone has specified for that page a very long {{DEFAULTSORT:sortkey}} or very long text in the second parameter of [[category:...|sortkey]]. To avoid this:

  • page names already have a length limit, which also limits the length of sort keys computed from them alone.
  • we should already truncate the string given in {{DEFAULTSORT:sortkey}} or [[category:...|sortkey]] so that the concatenation of this string and the page name can be used to compute the binary sortkey.

If you can accept arbitrary lengths, then go with it, but it will be unsafe and your schema will not be able to put that in a sortable column (you'll only be able to put it in a referenced BLOB, just like the text of articles, and databases can never sort external BLOBs).

Anyway, you did not reply to the idea of first developing the parser functions and testing them. Developing the SQL schema extension should not be attempted before at least the first function {{SORTKEY:text|locale|level}} has been fully developed and tested on specific pages (it can be tested easily in tables).

And with just this function, it should be possible on specific wikis to use it immediately to sort specific categories (for example by using templates using that function).

The second function {{COLLATIONMAP:text|locale|level|clusters}} is not needed immediately to develop the schema, but will be useful to restore the functionality of headings. Headings don't need to be stored as they can be computed on the fly, directly by reading sequentially the sorted result set from the SQL query:

You can compute headings from the returned page names, or from the existing stored "cl_sortkey" which should be used now ONLY to store the plain-text specified in articles with {{DEFAULTSORT:sortkey}} and [[category:...|sortkey]].
The existing cl_sortkey is just a forced "hint"; it does not make the sort order unique. Otherwise it should remain completely empty in the new schema. It will always be locale-neutral and will take precedence over the page name: to sort the pages effectively, the content of cl_sortkey and the page name should always be concatenated internally to compute the binary sortkey for various locales.

(In reply to comment #185)

Oh, I didn't get what he was saying. Yes, obviously we should use the actual
cl_from field, not tack it onto cl_sortkey (is that what we do these days?).
I'm going to have to make a lot of changes to the category pager anyway to
support paging by multiple things, so I can do that while I'm at it.

This does not require any change in the schema. It can be done immediately by MediaWiki developers and will not influence the development of the corrections needed for bug #164 itself.

The unique index just has to include the cl_from field (whatever it contains, it is already the unique ID of the page, and it is already stored). Only the SQL queries and the HTTP requests have to be able to specify this cl_from field separately.

Tacking it onto the sortkey was a bad decision made by "lazy" developers before they realized that storing a separate cl_from field was necessary. It just polluted the indexes and wasted precious space that could have been used to sort some pages correctly.

Drop this tacking immediately in the PHP code: cl_sortkey is only intended to store the clear-text "hint" specified in {{DEFAULTSORT:sortkey}} or [[category:...|sortkey]]; this clear text is not really a true sortkey but a hint that overrides the default sort order.

For example, take people's names whose page name is "FirstNames LastName" but that we want to sort as if they were "LastName, FirstNames" by indicating only {{DEFAULTSORT:LastName !}} (it should not be necessary to include the FirstNames in the wiki text, as this sort hint will not be unique and the group of pages using the same hint will still need to sort within this group in their natural order). Even in that case, there's no need to bogusly tack the cl_from field onto the stored field.

(In reply to comment #183)

Okay, look, if you make posts this long I'm not going to be able to read them.
You have to state your points briefly and clearly, or else I'm just not going
to have time to read through carefully and figure out exactly what you're
trying to say. Walls of text are really not helpful.

Furthermore, you're focusing too narrowly on implementation details, like the
exact interface a function should implement.

An "interface" function is DEFINITELY NOT an "implementation" detail. My comment #180 _correctly_ describes and summarizes what is really wanted.

It's not focusing on how it will be implemented (using ICU or not, how to integrate it in PHP if it's used, otherwise how to represent the DUCET and how to represent the language tailorings...), and it explains why the development can be planned in several clearly independent steps that can be tested separately.

It correctly explains the dependencies and why any change in the SQL schema can and should be delayed. In fact I consider step (1) in my plan to have priority over all the rest, and it does not imply any immediate change in the schema to support it.

Some time will be needed only to verify that the implemented collations are actually correct for various languages: users will be able to check this by looking at lists of terms in some wiki page, using a "sortable wikitable" whose rows use the new {{SORTKEY:text|locale|level}} parser function.

If this is validated by users, then two independent development phases can be started (in parallel):

  • to develop the new schema for stored sortkeys, based only on the internal PHP functions implementing {{SORTKEY:text|locale|level}}. The schema should NOT be deployed before the collations have been tested extensively by users and their internal data structures reflect the wanted collation order and tailorings.
  • to develop the {{COLLATIONMAP:text|locale|level|clusters}} parser function (for later inclusion in the generation of the correct "column headings" in the lists displayed by category pages, because for now these headings are almost useless for anything else than English, or in heavily populated categories).

The second function will be separately integrable in the display of category pages (before or after the schema has been updated for stored sort keys), but can be tested separately as well. And anyway, it will already be a very useful general clear-text function for use in wiki templates that are trying to equate "equivalent" strings (for example in a {{#switch:}}, even better than {{LC:text}}) and will be very convenient for some languages like Arabic, Hebrew and Russian that have optional diacritics in their orthographies (such as cantillation marks, dots for vowels, and tonic accents) or that may contain compatibility ligatures or display forms (e.g. narrow/wide kanas and square clusters of kanas in Japanese, "compatibility jamos" in Korean).

The name of this second function could be abbreviated as {{CMAP:}} (the name does not matter, but it should be implemented based on its functional description by Unicode in one of its UAXes) or could be localized as well, like other "magic" wiki keywords. It probably should use {{CONTENTLANGUAGE}} as the default value for the second parameter "locale" (because {{int:lang}} may cause rendering problems in pages that do not want to depend on the user's preferred language). By default the "level" parameter will be 1 (it could be up to 3; any higher value will return the same text without performing any mapping, because higher collation levels have to preserve all the collation differences), and the default "clusters" will be unlimited.

ayg wrote:

(In reply to comment #186)

That'a a bad assumption : even the highest quality collations will need to be
updated from time to time:

  • Unicode evolves and new characters get encoded (new versions are published

about every 1-2 years after synchronization and final balloting at both ISO WG2
and the UTC.

  • The content of the Unicode DUCET is NOT stable: characters are inserted in

the sequence so that the full list of collation weights needs to be offseted
where the new characters get inserted.

  • Collations for languages get corrected. We should be able to upgrade these

rules when the CLDR project produces new tailorings (CLDR updates are published
separately, about every 1-2 years.)

It's not critical for most wikis to use the very latest collations, though. The English Wikipedia, for example, will do fine with even very out-of-date collations, since the article names overwhelmingly use characters that have been in Unicode for an extremely long time. The same is true for most other large wikis; and on the other hand, changing collations on small wikis will be fast in any event.

However, Tim Starling independently suggested a cl_collation column, and my initial implementation does have one. The current one doesn't allow the same row to be stored multiple times with different collations, so sorting might still be wrong as the collation is updated, but this might not be a big problem in practice. If it is, we can go with your idea of extending the primary key to (cl_collation, cl_from, cl_to) instead of (cl_from, cl_to).

Anyway you did not reply to the idea of first developin the parser functions
and test them. Developping the SQL schema extension should not be attempted
before at least the first function {{SORTKEY:text|locale|level}} has been fully
developed and tested on specific pages (it can be tested easily in tables).

The second function {{COLLATIONMAP:text|locale|level|clusters}} is not needed
immediately to develop the schema, but will be useful to restore the
functionality of headings. Headings don't need to be stored as they can be
computed on the fly, directly by reading sequentially the sorted result set
from the SQL query:

I tried to figure out what you were talking about here, and eventually figured out that you just want me to expose Language::convertToSortkey() and Language::firstLetterForLists() (as they're currently called) as parser functions, for the benefit of lists. That might be useful, and should be easy to do, although it's not part of my assigned job here.

You can compute headings from the returned page names, or from the existing
stored "cl_sortkey" which should be used now ONLY to store the plain-text
specified in articles with {{DEFAULTSORT:sortkey}} and
[[category:...|sortkey]].

I've introduced cl_raw_sortkey to store that, while retaining cl_sortkey for actual sorting. Based on your feedback, it might make more sense to rename cl_raw_sortkey to cl_sortkey_prefix, put {{defaultsort:}} into it, and make it the empty string by default.

(In reply to comment #187)

This does not require any change in the schema. This can be made immediately by
MediaWiki developers and will not influence the developement of all corrections
needed for bug #164 itself.

Correct. I'll probably do it anyway in the course of the work I'm doing, though, since I'll be rewriting CategoryPage.php's sort logic in any case.

For example in people's names whose page name is "FirstNames LastName" but that
we want to sort as if they were "LastName, FirstNames" by indicating only
{{DEFAULTSORT:LastName !}} (it should not needed to include the FirstNames in
the wiki text, as this sort hint will not be unique and the group of pages
using the same hint will still need to sort within this group using their
natural order).

I can append the page title to cl_raw_sortkey before computing cl_sortkey. That shouldn't be a problem. As noted above, maybe renaming it to cl_sortkey_prefix would make more sense.

(In reply to comment #188)

An "interface" function is DEFINITELY NOT an "implementation" detail. My
comment #180 _correctly_ describes and summarizes what is really wanted.

Unfortunately, it's hard for me to understand what you're saying. However, I think I've got it now, at least mostly. I don't think parser functions are essential here, and they're not strictly part of this bug. They can be added after the initial implementation is finished.

It correctly explains the dependancies and why any change in the SQL schema can
be and should be delayed. In fact I consider the step (1) in my plan to have a
high priority on all the rest, and it does not imply any immediate change in
the schema to support it.

Writing those functions is not part of my job. I expect the i18n people will handle that. I'm just doing a prototype, which is agnostic about what sortkey function you give it. My current sortkey function is just strtoupper(), for testing. (This does improve English sorting when $wgCapitalLinks is off.)

  • to develop the new schema for stored sortkeys, based only on the internal PHP

functions implementing {{SORTKEY:text|locale|level}}. The schema should NOT be
deployed before the collations have been tested extensively by users and their
internal data structures reflect the wanted collations order and tailorings.

This is basically what I'm doing, except I'm not going to work out exactly what sortkey function to use.

  • to develop the {{COLLATIONMAP:text|locale|level|clusters}} parser function

(for later inclusion in the generation of the correct "column headings" in the
lists displayed by category pages, because for now these headings are almost
useless for anything else than English, or in heavily populated categories).

I'm not doing this part, but I'm setting up the framework so that i18n people can work on it (minus the parser function, which can be added later). At least, if your COLLATIONMAP is meant to be anything like my Language::firstLetterForLists(). I'm not sure if it is. If it's not, please explain better, because I don't follow.

Yes, Language::firstLetterForLists() maps more or less to COLLATIONMAP, but COLLATIONMAP is a more generic concept which reflects what is defined in the Unicode standard annexes, which speak about various mappings (including collation mappings, but also case mappings).

One bad thing about the name Language::firstLetterForLists() is that it implies that this should only be the first letter. In fact, for many locales, the significant unit is not the letter but the collation element (for example digraphs like "ch" or trigraphs like "c’h").

For some categories, it would also be convenient to be able to use longer substrings containing more than one grapheme cluster (in Wiktionary, for lists of terms belonging to a language, or in lists of people's names, a category may need to be indexed and anchored with section headings containing the first 2 or 3 grapheme clusters, because the first grapheme is not discriminating enough and will remain identical in all columns of the displayed list on one page, and sometimes even across several or many successive pages: the first-letter heading does not help, and is just unneeded visual pollution).

For other categories that have very little content, too many headings are frequently added, which also appear as pollution. Being able to suppress all of them, by specifying 0 grapheme clusters in that category, will better help readers locate the wanted item.

The collation map also has several implementation levels, which match exactly the collation levels used to generate sort keys.


About sort keys now:

As sort keys are intended to be opaque binary objects, they cannot be returned directly by a parser function without being serialized to safe Unicode text, even if that text means nothing to the reader. That's why I proposed a Base-36 mapping to plain ASCII, which will still sort correctly in binary order and can be used in sortable tables; but it could use any base that sorts correctly under binary ordering and uses ONLY valid Unicode characters.

The chosen base should be easy to compute, but the standard Base-64 variants do not qualify (there's no guarantee about the ordering of the last two "letters" across Base-64 alphabets). We could use Base-62 (using all 10 digits and the 26 pairs of Basic Latin letters), or Base-32 (simpler to compute but generating longer text). The intent is not really to make the string visible in pages, but to help generate accurate sort keys in sortable columns.
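As a minimal sketch of what such an encoding could look like (the helper name is hypothetical; the 32-character alphabet is simply chosen in ascending ASCII order so that byte-wise comparison of the encoded text matches the binary order of the raw sort key):

<?php
// Order-preserving "Base-32": split the binary sort key into 5-bit groups and
// map each group to an alphabet listed in ascending ASCII order. Zero-padding
// the final group keeps prefix ordering intact.
function sortkeyToSortableAscii( $binary ) {
	$alphabet = '0123456789ABCDEFGHIJKLMNOPQRSTUV'; // ascending ASCII order
	$out = '';
	$buffer = 0; // bit accumulator
	$bits = 0;   // number of bits currently held in the accumulator
	$len = strlen( $binary );
	for ( $i = 0; $i < $len; $i++ ) {
		$buffer = ( $buffer << 8 ) | ord( $binary[$i] );
		$bits += 8;
		while ( $bits >= 5 ) {
			$bits -= 5;
			$out .= $alphabet[ ( $buffer >> $bits ) & 0x1F ];
		}
	}
	if ( $bits > 0 ) {
		$out .= $alphabet[ ( $buffer << ( 5 - $bits ) ) & 0x1F ];
	}
	return $out;
}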

For now these sort keys are generated by templates as invisible text spans (using a CSS style="display:none" attribute), but ideally, the templates used in sortable tables that generate custom sortkeys should put them in some HTML attribute that can be specified on table cells and that the JavaScript will use directly. In my opinion, these opaque strings should be as compact as possible, but still safe for inclusion in pages, and directly usable by simple JavaScript without requiring any complex reimplementation of the UCA in the JavaScript code.

Why do I think that exposing these functions as parser functions will be useful? Because it allows the implementation to be tested extensively on lots of cases, within a limited set of pages, long before the schema is developed, finalized and finally deployed.

In other words, it will not block the development of the schema update, as long as we agree about what the essential functions to support are, i.e. the interface that will be exposed (partly) as parser functions.

Also, I'm convinced that the exposed parser functions will not have their syntax changed, and that what they return will be well known:

  • The {{COLLATIONMAP:}} function is very well described and will effectively return human-readable text. Its formal implementation should follow the standard Unicode definitions.

You can expose it at least in a test wiki where you'll be able to track very easily the result and progress of its implementation (just create a page containing test words in various languages, arranged in a Wikitable).

  • The {{SORTKEY:}} function can be exposed as well (and tested on the same test page for various languages, using the same list of words). Its result will be opaque to humans and compact. It will be easy to verify that it generates the expected sort order by using it in a sortable wikitable.

Both functions will be deployable rapidly, even on wikis that won't want to apply the schema change (so they will continue to use a single collation order for ALL their categories, and will anyway be able to sort specific categories using another supplied locale matching the category name).

If you think about it, changing the SQL schema may in the end be rejected by lots of people. Exposing the parser functions will provide a convenient alternative that can be deployed much more simply, and with MUCH LESS risk, using the existing facilities offered by [[category:...|sortkey]] and {{DEFAULTSORT:sortkey}}, except that their parameter will be computed using the exposed {{SORTKEY:}} function:

{{DEFAULTSORT:{{SORTKEY:text|locale|level}}}}

or:

[[category:...|{{SORTKEY:text|locale|level}}]]

both being generalizable through helper templates.

There is such an existing helper template, named [[Modèle:Clé de tri]], in the French Wiktionary; it will NO LONGER require us to pass the article name stripped of accents, extra punctuation or apostrophes (ignored at collation level 1), as this parameter becomes unnecessary. Currently we use the template like this:

{{clé de tri}}

when the article name contains nothing other than basic Latin letters or spaces; otherwise we have to pass:

{{clé de tri|Basic latin}}

And we constantly need to verify that the passed parameter is correct. Instead, the template would just generate this very simple code:

{{DEFAULTSORT:{{SORTKEY:{{PAGENAME}}}}}}

As the {{SORTKEY:text|locale|level}} will use by default:

locale={{CONTENTLANGUAGE}}|level=1

this will be sufficient for the French Wiktionary. In fact it may also be sufficient for the English Wikipedia.

But in Chinese Wikipedia, one may still want to be able to use:

{{DEFAULTSORT:{{SORTKEY:{{PAGENAME}}|{{int:lang}}}}}}

to support the preferred collation order of the user (traditional radical/stroke order, simplified radical/stroke order, Bopomofo order, Pinyin order).

(Note also that section headings ("first letter") will have to be "translated" to correctly report the "first letter" of the Pinyin romanization, because the page names listed will continue to display their distinctive ideographs ! The only way to do that is to use the collation mapping exposed by {{COLLATIONMAP:}})

But you'll note that it won't be possible to sort the categories using multiple locales, so the page will be stored and indexed by parsing it using {{int:lang}} forced to {{CONTENTLANGUAGE}}, which will just be "zh", using only the simplified radical/stroke order by default.

To support more collations, the categories will need to support them explicitly (but this would force the page to be reparsed multiple times, once for each additional locale specified in the category).

The alternative would be to create multiple parallel categories, one for each sort order, but then each article will have to specify all these categories.

My opinion is that the same category should be sortable using different locales, and that's why categories should support multiple sortkeys per indexed page, one for each specified locale. Some wikis will only sort on the {{CONTENTLANGUAGE}} by default, but the Chinese Wiktionary will benefit from automatically sorting all categories using at least the default "zh" locale, which is an alias for "zh-hans", plus the "zh-hant" locale for traditional radical/stroke order, "zh-latn" for the Pinyin order, and "zh-bpmf" for the Bopomofo order.

The exact locale to which "zh" corresponds will be a user preference, but one will be able to navigate by clicking the automatically generated links that will allow them to specify the other collation orders supported specifically by the category or by default throughout the wiki project.

For example, the Chinese Wiktionary will display links on the page showing the available choices as:

Sort by : [default] | [simplified radical/stroke] | [traditional radical/stroke] | [pinyin] | [bopomofo]

How can this be possible? Either the wiki project specifies that all categories will support these 4 orders, or the category page will explicitly list the additional sort orders supported by the category.

The [default] link will use the index prefixes specified in the existing syntaxes [[category:...|sortkeyprefix]] or {{DEFAULTSORT:sortkeyprefix}}

All the other links will display the list sorted using the additional locales specified, but will ignore the sortkeyprefixes specified in categorized pages or indirectly via templates.

To add the additional sort orders to a category, you'll just need to insert some code like this in the category page:

{{SORTAS:zh-hans}}
{{SORTAS:zh-hant}}
{{SORTAS:zh-latn}}
{{SORTAS:zh-bpmf}}

No article needs to be changed; these additional sort orders will just discard/ignore the sortkeyprefix (specified with {{DEFAULTSORT:sortkeyprefix}} or in [[category:...|sortkeyprefix]]) when generating the actual opaque sortkey.

However, if the wiki project offers several project-wide default locales, the sortkeyprefix specified in pages will be honored ONLY for these locales, and made immediately visible as the preselected [default] link in the choice of sort orders.

Let's say, for example, that the Chinese Wiktionary wants to support only the "zh-hans" and "zh-hant" collations by default.

This means that all categories will contain [default] sort keys computed for these two collations, from the text consisting in:

{{{sortkeyprefix}}}{{KEYSEPARATOR}}{{PAGENAME}}

if a sortkeyprefix is specified, or just {{PAGENAME}} if no sortkeyprefix is specified.

A constant {{KEYSEPARATOR}} will be used that should sort lower than every other valid text usable in {{{sortkeyprefix}}} or {{PAGENAME}}. Ideally, this should be a control character like U+000A (LF) or U+0009 (TAB), after making sure that:

  • this character will never appear in a valid {{{sortkeyprefix}}} or {{PAGENAME}} (MediaWiki already processes blanks and converts them to plain SPACE);
  • this character will have a NON-ZERO (non-ignorable) primary collation weight that is smaller than all other collation weights. Its primary collation weight should then be 1 (if needed, the collation weights coming from the DUCET or from localized tailorings will just have to be offset by 1, if they are non-zero);
  • this character will have a ZERO collation weight for all the remaining supported levels in each locale.

For all the additional sort orders specified in category pages, the {{{sortkeyprefix}}} will be ignored, as well as the {{KEYSEPARATOR}}, so the pages will just be indexed on {{PAGENAME}}, within the specified locale.

In the example Chinese Wiktionary, a category specifying:

{{SORTAS:zh-hans}}
{{SORTAS:zh-hant}}
{{SORTAS:zh-latn}}
{{SORTAS:zh-bpmf}}

will generate 4 additional (non-[default]) sort keys, on top of the two sortkeys already generated for "zh-hans" and "zh-hant", except that they will ignore the specified sortkeyprefixes.

This means that it will generate up to 6 sortkeys: 1 or 2 for "zh-hans", 1 or 2 for "zh-hant", and only 1 for each of "zh-latn" and "zh-bpmf".

In the English Wiktionary or on Commons, which will only use the "en" default collation order (identical to {{CONTENTLANGUAGE}}), it will be possible to specify, for specific categories, an additional sort order when the category is directly related to a specific language.

By default, that category will be sorted using the English collation rules, but it will be possible to select the additional specified collation order (in which case the default sort prefix specified in indexed pages will be ignored, and the list will just be shown using the natural order of page names in the manually selected sort order).

So [[Category:Chinese]] in the English Wiktionary will be able to specify at least:

{{SORTAS:zh-hans}}

And [[Category:German]] in the English Wiktionary will be able to specify at least:

{{SORTAS:de|2}}

And [[Category:French]] in the English Wiktionary will be able to specify at least:

{{SORTAS:fr|2}}

And this should be enough to be able to view the natural order of these languages (French will require collation level 2 for correct ordering by grouping letters with the same accents, and sorting them in backward order in this level)...

Note also that if the category is presented in any selected locale, the section headings will also be computed in that same locale, and with the same collation level. By default it will show only 1 grapheme cluster. But if needed you can specify:

{{SORTAS:fr|2|3}}

to indicate that the category should use the first three French grapheme clusters (determined from collation mappings) for the headings, if the category is heavily populated, so you'll get the following headings:

a, à, â, ... aa, aar, ab, aba, abc, ac, aca, ace, acé, ... ad, ae, aé, ... b, ba, bac, bad, baf, bag, bai, bal, bam, ban, bar, bas, bat, bât, ....

(note that at level 2, the same heading will contain all words sharing the same letters and accents; case will be ignored).

Headings don't need to be stored; they are generated on the fly from the index prefixes (returned in the result set, but only if this is one of the wiki's [default] sort orders, because they will be empty if the user selected a non-[default] sort order) and the page name (also present in the result set).

Note that if you still want to present a category ordered so that all lowercase letters sort together, separately from all uppercase letters, you'll need to indicate a separate collation order by specifying an additional non-default locale in that specific category. This will probably not look natural to most users, which is why it will be a distinct locale and not the default. For example, to use it in English or in French:

{{SORTAS:en-x-uc}}<!-- ALL uppercase before lowercase in collation level 1-->
{{SORTAS:fr-x-lc}}<!-- ALL lowercase before uppercase in collation level 1 -->

This variant means that the natural tailoring is modified so that case differences will be considered primary differences in that specific, distinct localized collation. There will no longer be any ternary difference, but there will be twice as many headings generated (here, as no maximum number of grapheme clusters is specified, only 1 grapheme cluster will be used in the headings), so you'll get the headings:

A, B, C, D, ... Z, a, b, c, d, ... z (with en-x-uc)
a, b, c, d, ... z, A, B, C, D, ... Z (with fr-x-lc)

And in both cases there will be no headings separating differences of accents (because we did not indicate the collation level, it remains 1).

Such options MUST use a standard BCP47 code, so the option needs to be at least 5 characters long after the language code, or must be placed after the standard BCP47 suffix "-x-" for private extensions that do not indicate a language.

From all this discussion it appears that the development can be split into two separate projects developed independently.

You can continue to develop the SQL schema extension, provided that:

  • you call a PHP function/method, developed separately in a "Collator" class, to compute the collation sortkeys for each locale and specified collation level;
  • you call a PHP function/method, developed separately in a "Collator" class, to compute the collation mappings for each locale, specified collation level and maximum number of grapheme clusters, in order to generate the headings on the fly in the category list;
  • you think about an HTTP query parameter that will allow changing the locale (this parameter exists, it's "uselang", but another one will eventually be needed to specify sort options like "lc" or "uc"; the "-x-" separator will be used implicitly, so the HTTP query may contain &uselang=fr&opt=lc). These parameters will be used later, when you'll be able to store multiple sortkeys.

Separately, another project will implement the Collator class, which will load the DUCET table, load the tailorings defined per locale or per locale+"-x-"+option, and prepare the table of collation elements and their associated collation weights for each level. Then it will implement the methods:

  • Collator::Collator(locale, level=1), which instantiates a Collator object for the specified locale string (language + optional options) and the collation level (it does not need to prepare the full table of collation weights, just those needed for the specified level). The default level will be 1.
  • Collator::sortkey(text), which returns the opaque collation key (a binary array of bytes) for the specified Collator instance.
  • Collator::map(text, maxelements=''), which returns the human-readable text after applying the collation mapping associated with the Collator instance. It will stop after returning at most 'maxelements' collation elements in the remapped string, or will process all the text if maxelements is null or not specified. For category headings, you'll pass maxelements=1 by default (unless another value was specified when preparing a specific locale with the extra option of {{SORTAS:locale|maxelements}} in the visited category page).

Separately, another simple project will use the Collator class to expose these methods as parser functions. In fact this third project will be very easy and fast to complete, even if the Collator class is not fully developed with all its options.

A very basic Collator class can be used to develop the parser function extension that will expose:

  • {{SORTKEY:text|locale|level}} defaults: locale={{CONTENTLANGUAGE}}|level=1
  • {{COLLATIONMAP:text|locale|level|maxelements}} defaults: locale={{CONTENTLANGUAGE}}|level=1|maxelements=<!--empty/null-->

You can immediately deploy it on a test wiki server, where you'll build test lists in a "sortable wikitable", and you can continue building the Collator class to support various locales and all the needed options.

Note that this is the minimum functional interface for the Collator class. Other methods will certainly be needed to cache the already-prepared collators, or to cache the preloaded tailoring tables and the common DUCET.

In other words, the Collator::Collator() constructor will in fact probably use the services of a CollatorFactory class that caches the needed shared data. This will save a lot of overhead and will be important for performance.

This factory should NOT be used directly by the SQL schema extension, which MUST only use the Collator::getCollator(locale, level) static function; that function will use the CollatorFactory in order to instantiate the Collator object through its constructor.

The CollatorFactory will most probably use ICU, via its already existing PHP extension interface module. But a basic PHP-only implementation is still possible without using ICU immediately, or if MediaWiki must be kept compatible with existing PHP servers that can't load such native extensions (using DLLs or shared libraries, or requiring PHP itself to be recompiled/relinked) in their installation.
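As a minimal sketch of that ICU route (assuming the server has PHP's intl extension loaded; the wrapper class and its method names are hypothetical, and only the built-in Collator class of the intl extension is an existing API):

<?php
// Hypothetical wrapper delegating sort key generation to ICU through PHP's
// intl extension.
class IcuBackedCollator {
	private $icu;

	public function __construct( $locale, $level = 1 ) {
		$this->icu = \Collator::create( $locale );
		// Map the requested collation level onto an ICU "strength".
		if ( $level <= 1 ) {
			$this->icu->setStrength( \Collator::PRIMARY );
		} elseif ( $level == 2 ) {
			$this->icu->setStrength( \Collator::SECONDARY );
		} else {
			$this->icu->setStrength( \Collator::TERTIARY );
		}
	}

	public function sortKey( $text ) {
		// Opaque binary key, suitable for a byte-wise ORDER BY.
		return $this->icu->getSortKey( $text );
	}

	public function compare( $text1, $text2 ) {
		return $this->icu->compare( $text1, $text2 );
	}
}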

The CollatorFactory should NOT prepare all supported locales, only those that are requested, on demand.

Example of interface for you:

<?php

require( '.../CollatorFactory.php' );
$wgCollatorFactory = new CollatorFactory(); // this is the only global PHP variable

// ...

// Prepare a Collator object
// (note that $locale is NOT case-sensitive in BCP47).
$locale = $wgContentLanguage; // use {{CONTENTLANGUAGE}} by default
if ( isset( $options ) && $options > '' ) {
	$locale .= '-x-' . $options; // append sort options if any (like 'uc', 'lc')
}
$level = 1; // use collation level 1 by default

$collator = $wgCollatorFactory->get( $locale, $level );

// ...

// Use the collator object to compute opaque sortkeys (opaque binary strings)
// to store in categories:
$sortkey = $collator->sortKey( $text );

// Optionally map it to plain text, if the database can't store and
// sort VARBINARY(N) fields, but only VARCHAR(N) fields:
$sortkey = base32( $sortkey );

// ...

// Use the collator object to compute a heading on the fly from a page name:
$collationElements = 1; // by default generate only 1 collation element
$heading = $collator->map( $pageName, $collationElements );

ayg wrote:

(In reply to comment #190)

Yes Language::firstLetterforList(s) maps more or less to COLLATIONMAP, but
COLLATIONMAP is a more generic concept which reflects what is defined in
Unicode standard annexes, which speaks about various mappings (including collan
mapppings, but also case mappings)

Which particular Unicode standards are relevant here? It would probably be a good idea for me to take a look at them.

For some categories, it should be convenient also to be able to use longer
substrings, containing more than one grapheme cluster (in Wiktionnary for lists
of terms belonging to a language, or in lists of people names, a category may
need to be indexed and anchored with section headings containing the first 2 or
3 grapheme clusters, because the first grapheme is not discriminant enough and
will remain identical an all columns of the disaplyed list on one page, and
even sometimes on several or many successive pages : the first letter heading
does not help, and is just an unneeded visual pollution)

For other categories that have very few contents, too many headings are
frequently added that also appear as pollution. Being able to suppress all of
them, by specifying 0 graphemeclusters in that category will better help
readers locate the wanted item.

This is all beyond the scope of what I'm currently doing, but it should be possible to add on without too much trouble as a separate project.

Why do I think that exposing the functions as parser functions will be useful ?
that's because it allows the implementation to be tested extensively on lots of
cases, but only within a limited set of pages, long before the schema is
developed, finalized and finally deployed.

This presupposes that the sortkey generation algorithm is settled upon before the database changes are. In fact, it's exactly the opposite: I've been contracted to do the backend changes, and other people will figure out how exactly to generate the sortkeys later. Really, we could do the changes in either order.

Both functions will be deployable rapidly, even on wikis that won't want to
apply the schema change (so they will continue to use a single collation order
for ALL their categories, and will anyway be able to sort specific categories
using another supplied locale matching the category name).

If you think about it, changing the SQL schema may be rejected at end by lots
of people.

The schema change will be part of the core software and will not be an optional update. Anyone who doesn't want to accept it will likely have to stick with 1.16 forever, because we're not going to support the old schema.

Exposing the parser functions will provide a convenient alternative
that can be deployed much more simply, and with MUCH LESS risks, using the
existing facilities offered by [[category:...|sortkey]] and
{{DEFAULTSORT:sortkey}}, except that their parameter will be computed using the
exposed {{SORTKEY:}} function:

{{DEFAULTSORT:{{SORTKEY:text|locale|level}}}}

or:

[[category:...|{{SORTKEY:text|locale|level}}]]

both being generalizable through helper templates.

It's not particularly less risky. It does encourage each wiki to hack around the problem locally instead of pushing for a proper fix in the software. You shouldn't have to do DEFAULTSORT on any pages where the right sorting is human-detectable -- it should only have to be for things like "Abraham Lincoln" being sorted as "Lincoln, Abraham", which the software can't do automatically.

(Note also that section headings ("first letter") will have to be "translated"
to correctly report the "first letter" of the Pinyin romanization, because the
page names listed will continue to display their distinctive ideographs ! The
only way to do that is to use the collation mapping exposed by
{{COLLATIONMAP:}})

Surely the most sensible idea is just to disable the section headings altogether for CJK?

My opinion is that the same category should be sortable using different
locales, and that's why they should support multiple sortkeys par indexed page,
one for each specified locale. Some wikis will only sort on the
{{CONTENTLANGUAGE}} by default, but the Chinese Wiktionnary will benefit of
sorting automatically all categories using at least the default "zh" locale
which is an alias for "zh-hans", plus the "zh-hant" locale for traditional
radical/stroke order, "zh-latn" for the Pinyin order, and "zh-bpmf" for the
Bopomofo order.

The exact locale to which "zh" corresponds will be a user preference, but one
will be able to navigate by clicking the automatically generated links that
will allow them to specify the other collation orders supported specifically by
the category or by default throughout the wiki project.

This is doable if it's desired, as long as the number of locales is very limited (like four, not a hundred). However, it will not be part of my initial implementation unless Tim asks me to do it.

In the English Wiktionary or on Commons, that will only use the "en" default
collation order (identical to {{CONTENTLANGUAGE}}), it will be possible to
specify, for specific categories, an additional sort order when the category is
directly related to a specific language.

This is certainly a useful feature for some wikis (especially Wiktionaries), and it could be added fairly easily. It might make it into my initial implementation.

(In reply to comment #191)

In all this discussion it appears that the development can be made in two
separate projects developped independantly.

You can continue to develop the SQL schema extension, provided that:

  • you call a PHP function/method that will be developped separately in a

"Collator" class, to compute the collation sortkeys for each locale and
specified collation level.

  • you call a PHP function/method that will be developped separately in a

"Collator" class, to compute the collation mappings for each locale and
specified collation level and maximum grapheme clusters, in order to generate
the headings on the fly in the category list

This is what I'm doing.

  • you think about a HTTP query parameter that will allow to change the locale

(this parameter exists, it's "uselang", but another one will be needed
eventually to specify sort options like "-x-lc" or "-x-uc" (the "-x-" separator
will be used impliclty. So the HTTP query may contain: &uselang=fr&opt=lc).
These parameters will be used later when you'll be able to store multiple
sortkeys.

These don't need to be developed until that feature is actually implemented, which will probably not be in my initial implementation, unless Tim Starling asks me to.

Note that the CollatorFactory may fail to locate the specified locale for which a collator is being requested. Additionally, several locales may share exactly the same collator.

In all cases, the CollatorFactory will return a valid Collator object whose locale can be identified: the Collator object returned by:

$collator = $wgCollatorFactory->get($locale, $level);

will have a property containing the effective locale code (normalized) and another property containing the collation level from which it was effectively built. You should be able to access them simply with something like:

$effectiveLocale = $collator->locale();
$effectiveLevel = $collator->level();

after just getting the collator instance from the factory.

This may be useful to avoid storing duplicate equivalent binary sortkeys, or simply to determine which effective locale to use in SQL select queries (to retrieve the sorted list of pagenames ordered by a specified locale), when the SQL schema will be able to store several sortkeys for the same page in the same category.

The factory will also instantiate a collator for a given effective locale and effective collation level only once, caching it in an internal array for repeated use.

This will save the complex preparation of tables, and will avoid building tables for all supported languages (for example on Commons, where lots of languages may be desirable, each one with possibly several sort options, or supported conversions to other scripts or script variants).
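A minimal sketch of such lazy, cached construction (the class name is hypothetical; it reuses PHP's intl Collator purely for illustration, and a real factory could just as well build a pure-PHP collator instead):

<?php
// Build a collator only the first time a (locale, level) pair is requested,
// then reuse the cached instance on every later call.
class CachingCollatorFactory {
	private $cache = array();

	public function get( $locale, $level = 1 ) {
		// Normalize the BCP47 tag so that "fr-FR" and "fr-fr" share one entry.
		$key = strtolower( $locale ) . '#' . (int)$level;
		if ( !isset( $this->cache[$key] ) ) {
			$collator = \Collator::create( $locale );
			if ( $level <= 1 ) {
				$collator->setStrength( \Collator::PRIMARY );
			} elseif ( $level == 2 ) {
				$collator->setStrength( \Collator::SECONDARY );
			} else {
				$collator->setStrength( \Collator::TERTIARY );
			}
			$this->cache[$key] = $collator;
		}
		return $this->cache[$key];
	}
}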

The factory, however, should probably load the DUCET table associated with the CLDR "root" locale completely and immediately when it is first instantiated and stored in the global variable (there's probably no need to test this each time, thanks to lazy initialization with null member fields); and it should most probably build the default collator (for $locale=$wgContentLanguage and $collationlevel=1) immediately, storing it in the first position of its member array of already-prepared Collator instances.

But you may think the opposite, in order to speed up server startup by a few (milli)seconds or reduce the initial CPU/memory pressure on PHP's garbage collector. However, I'm not convinced that the server will be ready faster, and the extra tests that will be performed at each use of the $wgCollatorFactory->get() method may impact performance at runtime...

Note also that ICU uses the same approach of a collator factory to build and cache reusable Collator instances, because it's a proven design pattern for implementing and using collators.

A collator object may also be used to compare two texts without even generating their sortkeys, or without mapping them, so it may help to include this method in the Collator interface:

$collator->compare($text1, $text2);

that will return an integer (in other words, a Collator also implements the Comparator interface), by parsing $text1 and $text2 collation element by collation element up to the end at level 1, comparing their collation weights only at this level, before restarting with the next level. When the collator was instantiated at level 1, the successive collation elements need not be stored, but for higher levels it helps if they are parsed only once and kept in an indexed array that allows faster lookup of the collation weights for the next levels.
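As a small usage sketch (shown here with PHP's intl Collator purely for illustration), such a compare() method can be plugged straight into usort() to order a list of titles without materializing any sort keys:

<?php
// Sort a list of page titles according to a locale's tailoring, using the
// collator's compare() method as the usort() callback.
$coll = \Collator::create( 'fr' );
$coll->setStrength( \Collator::SECONDARY ); // level 2: letters, then accents

$titles = array( 'côté', 'cote', 'coté', 'côte' );
usort( $titles, array( $coll, 'compare' ) );
// $titles is now in the order defined by the French tailoring at level 2.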

I note that you have started to define the Collator functionality as part of the Language class.

Well I'm not sure that this should be the same class: a language has several collators, even if a collator is for only one language (including a virtual language like "root" which is the base of all localizations including English, with lots of common properties that are shared in multiple languages derived from root).

Also, I note that the collator you define is a stub. That's fine for your development of the SQL schema, but someone else must work on implementing the CollatorFactory, using ICU, a full PHP implementation, or only a stub, depending on the PHP script path you'll require() to load this factory class.

Technically, collators also share lots of properties across several languages. This is why they should be separated from the Language class.

Maybe the Language class can query the CollatorFactory for the list of collators (i.e. their complete locale codes, including options) that are supported by that language, using an interface method of the loaded global CollatorFactory instance.

This way, a stub factory building a single Collator instance will still work immediately.

Someone with expertise in Unicode collations will help you build the Factory if you want a PHP-only implementation.

Someone else with ICU expertise could also work on its existing integration in PHP (which needs testing in the MediaWiki installation scripts for deployments) and on using it in the CollatorFactory and the Collator class.

The trickiest development is the Collator::map() function, because Unicode does not describe the algorithm completely.

I know the principles and could offer help if you give some pointers where it would be beneficial.

But really, to test a complete implementation of the CollatorFactory, you'll need to be able to expose it through the two optional built-in parser functions that I described. This can clearly be developed separately, even if you start with a stub for the factory, and tested with the very basic parser functions installed on a test server.

So I maintain my position on the non-risky parser functions (notably also because it will be simpler for an existing non-Wikimedia installation of MediaWiki to install the simple functions only, without upgrading to the new schema immediately, knowing that the factory can also be a basic stub on these installations).

ayg wrote:

(In reply to comment #187)

Drop this tacking immediately in the PHP code: the cl_sortkey is only intended
to store the clear-text sortkey "hint" specified in {{DEFAULTSORT:sortkey}} or
[[category:...|sortkey]] to override the sort order, and this clear-text
sortkey is not really a true sortkey but a sort hint to override the default
sort order.

For example in people's names whose page name is "FirstNames LastName" but that
we want to sort as if they were "LastName, FirstNames" by indicating only
{{DEFAULTSORT:LastName !}} (it should not needed to include the FirstNames in
the wiki text, as this sort hint will not be unique and the group of pages
using the same hint will still need to sort within this group using their
natural order). Even in that case, there's no need to bogously tack the cl_from
field in the stored field.

How do you propose this be implemented? We would need some character that sorts before all characters that can legitimately occur in a sort key.

(In reply to comment #197)

I note that you have started to define the Collator class as the Language
class.

This is completely provisional and can easily be changed later. As far as I understand, I'm not the one who will work on it.

But really, to test a complete implementation of the CollatorFactory, you'll
need to be able to expose it in the two optional builtin parser functions that
I described. As this can be clearly developped separately even if you start
with a stub for the factory, and tested with the very basic Parser Functions
installed on a test server.

So I maintain my position for the non-risky ParserFunctions (notably also
because it will be simpler for an existing non-Wikimedia installation of
MediaWiki to install the simple functions only, without upgrading to the new
schema immediately, knowing that the factory can also be a basic stub as well
on these installations).

Upgrading the schema will be handled by running maintenance/update.php, which must be run on every update anyway. We typically have a number of schema changes in every release, and those always must be applied when deploying the new code. So this is not an issue.

(Note also that section headings ("first letter") will have to be "translated"
to correctly report the "first letter" of the Pinyin romanization, because the
page names listed will continue to display their distinctive ideographs ! The
only way to do that is to use the collation mapping exposed by
{{COLLATIONMAP:}})

Surely the most sensible idea is just to disable the section headings
altogether for CJK?

There's no need to do that.

The Collator instance returned by the factory for $locale="zh-latn" (which sorts using Pinyin) just has to return an empty string from its map() method, as long as it is a stub which can't safely map ideographs to the Latin initial consonants of the Pinyin syllable.

Note that the syntax of the Latin Pinyin mapping should be quite similar to the syntax used in Minnan, as the Minnan syllables have more or less the same structure as the Pinyin syllables. For ideographs that are not supported in the Pinyin romanization, there will be no substitution, so they will sort at the end and will not start with a Latin letter.

Here also, it is possible to infer a common heading for all of them, such as their radical component, or just the first ideographic radical encoded in the DUCET with a non-zero primary weight, or even the first stroke.

The first CJK stroke is:
31C0 ; [*10E0.0020.0002.31C0] # CJK STROKE T

But I should look at the exact sequence in the "zh" tailoring of the DUCET, in the CLDR database.

There's a demo page for "zh-hans" collation here (in the ICU project which is used by the Unicode's CLDR project as a reference implementation):

http://demo.icu-project.org/icu-bin/locexp?d_=fr&_=zh_Hans

The interesting part is the data for "Rules", which orders the exemplar sinograms, whereas the data for "Set" just shows them in numeric code point order or as ranges of characters with ascending numeric code point values.
But in both cases they concentrate only on the most common basic subset used in GB2312; this is not the complete set defined in GB18030 and Unicode.

For actual transforms from Sinograms to Latin (Pinyin) there's this demo page:

http://demo.icu-project.org/icu-bin/translit

To see how the DUCET orders ideographs, look at:

http://unicode.org/charts/collation/

The first sinogram (non-stroke) defined with a non-zero primary weight in the DUCET sequence is U+4E00 (一), and it seems that it provides a very convenient heading for every sinogram that we can't sort or convert to Pinyin.

Note that in all cases the sinograms are at the end of the DUCET, just before the unsupported/reserved/unassigned characters.

The "zh-hans" is just moving all the other letters starting in the "Variable" subset upward (in fact it just moves upwards the letters starting at Latin), to fit the sinograms before them.

The "zh-Hant" collation is like "zh-Hans" but swaps some positions, according to their expected radical, but also because they differ in thei stroke count.

Only about 31,000 sinograms have known transcriptions to Pinyin (this number keeps growing); all the others will then appear under the remaining heading group starting with U+4E00 (一), except those in the CJK Extension blocks, which should be listed under the heading U+3400 (㐀).

Most non-extension CJK characters now have a Pinyin transcription; most CJK extension characters don't (but they are also the rarest characters used, so there should not be many pages indexed there)...

For example in people's names whose page name is "FirstNames LastName" but that
we want to sort as if they were "LastName, FirstNames" by indicating only
{{DEFAULTSORT:LastName !}} (it should not needed to include the FirstNames in
the wiki text, as this sort hint will not be unique and the group of pages
using the same hint will still need to sort within this group using their
natural order). Even in that case, there's no need to bogously tack the cl_from
field in the stored field.

How do you propose this be implemented? We would need some character that
sorts before all characters that can legitimately occur in a sort key.

This is exactly what UCA defines as a "tailoring". We have such characters available that we don't use in page names or in sortkey prefixes: control characters.

Note that control characters are present in the DUCET, and they are NOT ALL ignorable. The first of them is TAB (U+0009) and it is the first character that we DON'T USE, and that has a primary weight in the DUCET and that is not ignorable or null.

All the characters that have null weights are listed within the NULL group, which is immediately followed by the ignorable characters.

TAB has a primary weight of 0201 in the DUCET (it comes far later, after all the ignorables)

The first ignorable character has a primary weight 0021, so the primary weight 0001 is free to serve as the separator.

All we have to do, then, is tailor the primary weight of TAB, assigning it the weight 0001. In that case, the source text given to Collator::sortKey() just has to use the TAB character between the two fields.

If there's no sortkey prefix, don't generate the TAB; use the page name directly:

<?php

require( '.../CollatorFactory.php' ); // choose the script for the implementation
$wgCollatorFactory = new CollatorFactory();

// ...
$collator = $wgCollatorFactory->get( $locale, $level );

// ...
if ( $sortkeyprefix != '' ) {
	$text = $sortkeyprefix . "\t" . $pagename;
} else {
	$text = $pagename;
}

$sortkey = $collator->sortKey( $text );

// Optionally convert VARBINARY(N) to VARCHAR(N) (depends on the SQL backend),
// e.g. using Base-32:
$sortkey = varbinaryToVarchar( $sortkey );

// You may also need to truncate to N characters,
// to fit the SQL sortable field's maximum length constraint
// (according to the schema for this SQL backend)...

Note that if your collator stub does not really compute true sort keys (compacted in binary format and representing collation weights), it may return spaces (U+0020), which are represented by the byte 0x20 in the UTF-8 encoding. The separator used must sort lower than this encoded character, so the VARCHAR(N) database field has to accept it.

If it does not, you may offset all the UTF-8 bytes returned by your stub by 1 (UTF-8 never uses the byte value 0xFF, so this cannot overflow), so that you'll be able to use the byte value 0x20 for the separator (the UTF-8 encoded SPACE will be stored in the sortkey field as 0x21 after it has been offset).
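A tiny sketch of that byte-offset trick (the helper name is hypothetical; it assumes the stub's output is valid UTF-8, so no byte is ever 0xFF):

<?php
// Shift every byte of a stub sort key up by one. The shift is monotonic, so
// binary ordering is preserved, and the byte value 0x20 becomes free to use
// as a low-sorting field separator. Assumes the input never contains 0xFF.
function offsetSortkeyBytes( $stubKey ) {
	$out = '';
	$len = strlen( $stubKey );
	for ( $i = 0; $i < $len; $i++ ) {
		$out .= chr( ord( $stubKey[$i] ) + 1 );
	}
	return $out;
}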

This does not matter, because sortkeys are opaque, so this is simple to do in the stub. The stored sort key field DOES NOT have to use the UTF-8 encoding explicitly; it must just accept binary-encoded bytes that may not form valid Unicode characters. If the database wants you to specify a charset for this binary sortable field, use ISO-8859-1, as long as the database does not alter the binary order of ISO-8859-1 when evaluating ORDER BY clauses.

I really hope that we will always have the possibility of using VARBINARY(N) fields allowing compact random byte values, because it will be much more efficient for our use of opaque binary sortable sort keys (that are just streams of bytes).

Also be careful about the signedness of bytes (i.e. their value range); otherwise bytes 0x80..0xFF will sort before 0x00..0x7F in the ORDER BY clause. This should be handled correctly if the field is declared to use the ISO 8859-1 encoding with its binary order.

Which SQL backends (and dialects) do you support in MediaWiki? This may help me understand some development constraints. I know you support MySQL, but are there simpler backends, like BerkeleyDB, dBase files, or flat text files for small projects, supported only via some ODBC driver with a PHP interface?

v85.wikipedia wrote:

For those of us who are less technical, but keen to see this bug resolved, is there somewhere where we can submit alphabets, so that they may be added for collation? I.e. where we submit the sorting order of specific languages for use with {{sort:en}} (or whatever the mark-up will be)?

I would suggest instead making those proposals in the CLDR project.
It already contains lots of data for many languages, related to their expected "sort order" (in fact more generally about their collation).

Go to http://www.unicode.org/cldr and visit the existing resources browser, and the demo of ICU, which uses the same data.

Our collator should ultimately use the CLDR tailorings that are already defined, with very minor changes (if they are really needed).

May I even suggest that the name of the new column does not even use "sortkey"

i.e. "cl_sort_prefix" instead of "cl_sortkey_prefix"

Why? Because this column should NEVER need to convert that prefix itself to a binary sortkey; it should just remain the human-readable prefix that was specified in {{DEFAULTSORT:sort_prefix}} or in [[Category:...|sort_prefix]].

This will avoid confusion, but it will also allow simpler recomputing of actual sortkeys for various locales, without having to parse the page again for each locale:

Note that when evaluating and expanding the parameters of {{DEFAULTSORT:sort_prefix}} and of [[Category:...|sort_prefix]] in the wiki code of a page (or when transcluding templates), the result text should NEVER depend on the UI language (so if {{int:lang}} is used in that parameter, it should evaluate always as if it was {{CONTENTLANGUAGE}}).

ayg wrote:

(In reply to comment #200)

Which SQL backends (and dialects) do you support in MediaWiki ? This may help
me understanding some development constraints. I know you support MySQL, but
are there simpler backends like BerkeleyDB, dBase files, or flat text files for
small projects supported only via some ODBC driver with a PHP interface ?

We primarily support MySQL. We also have at least theoretical support for PostgreSQL, SQLite, MSSQL, Oracle, and DB2. However, if this doesn't work for all DB backends, it's not a big deal, as long as it works with MySQL.

(In reply to comment #201)

For those of us who are less technical, but keen to see this bug resolved, is
there somewhere where we can submit alphabets, so that they may be added for
collation? I.e. where we submit the sorting order of specific languages for use
with {{sort:en}} (or whatever the mark-up will be)?

We'll be using some kind of pre-rolled algorithm from CLDR or such, so you don't need to do this. If the deployed algorithm turns out to be deficient, and it's not a bug on our side, you should probably contact CLDR, not us.

(In reply to comment #203)

May I even suggest that the name of the new column does not even use "sortkey"

i.e. "cl_sort_prefix" instead of "cl_sortkey_prefix"

Why? because this column should NEVER need to convert that prefix itself to a
binary sortkey, it should just still be the humane-readable prefix that was
specified in {{DEFAULTSORT:sort_prefix}} or in [[Category:...|sort_prefix]].

Yes, this stores the plaintext, unconverted prefix. I don't think the name needs to be changed, though. Both names work well enough.

This will avoid confusion, but it will also allow simpler recomputing of actual
sortkeys for various locales, without having to parse the page again for each
locale:

Yes, that's intended to work.

Note that when evaluating and expanding the parameters of
{{DEFAULTSORT:sort_prefix}} and of [[Category:...|sort_prefix]] in the wiki
code of a page (or when transcluding templates), the result text should NEVER
depend on the UI language (so if {{int:lang}} is used in that parameter, it
should evaluate always as if it was {{CONTENTLANGUAGE}}).

If people want to put crazy stuff in sortkeys that changes based on who's viewing it, we can't stop them. Curly braces are evaluated at an earlier stage than category links, so we can't make them behave differently based on whether they're being used in a sortkey, I don't think. You could also put {{CURRENTTIME}} in sortkeys, or any other silly thing like that.

(In reply to comment #204)

If people want to put crazy stuff in sortkeys that changes based on who's
viewing it, we can't stop them. Curly braces are evaluated at an earlier stage
than category links, so we can't make them behave differently based on whether
they're being used in a sortkey, I don't think. You could also put
{{CURRENTTIME}} in sortkeys, or any other silly thing like that.

That's because the MediaWiki parser still operates on the wrong level, and always performs text substitutions while ignoring the context of use. Not only is this bad, it is also very inefficient, because each processing level converts a full page to another full page that needs to be reparsed again at the next level.

A much better scheme, based on a true grammatical analyser using a finite-state automaton, would really help define the state at which the parser (or its ParserFunctions extensions) operates, without ever having to create huge complete page buffers between each level (which uses costly appending operations with many memory reallocations).

In other words, when you start parsing "[[" you enter in a "start-of-link" state, which then parses "category:" until it has found a colon, in which case the whole prefix is case-folded and goes to the "in-category" state, then the parser scans up to the next "{" or "|" or "]". It can then correctly process all the text using such rules.

I have suggested for a long time that the MediaWiki syntax should be reformulated as a LALR(1) formal syntax, from which a finite-state automaton can be automatically built to cover all the state information, and then ported to PHP (Yacc, Bison or even better PCTCS could do that without difficulty). Then instead of calling parsing functions that process all the text and return the converted text for processing at the next level, it would do that in a much simpler (and faster) processing loop, calling much simpler generation methods and altering its state in a much cleaner and faster way (no more need to append small lexical items to various buffers; the atoms would be passed from level to level using a chained implementation).

This would also significantly speed up the expansion of templates, and would allow the parser to make distinctions when {{int:....}} is encountered in the context of a [[category:...]] (which would temporarily force the UI language to the CONTENTLANGUAGE, until the final "]]" token is encountered and restores the UI language within the parser's internal state). This would also have significant performance benefits (for example, no more need to convert and expand all the parameters of {{#if:...}} or {{#switch:...}}: they would be converted lazily, only when they are really needed for generating the effective output).

Yes, this comment is going too far out of the topic of this bug; it is architectural. PHP has all the tools needed to support the construction of tree-like data: instead of passing only strings to parser generation functions, you would pass an ordered array whose elements are the parsed parameters, themselves being either strings or subparsed arrays containing strings and other arrays. The builtin functions would then request back from the MediaWiki parser the evaluation/expansion of ONLY the parameters they need, and MediaWiki would still be able to call the expansions of ONLY these parameters, recursively. Most of the items in the wikicode would then be atomic and processed lazily, only when they are effectively needed.

The "crazy" things like {{CURRENTTIME}} or {{time:...}} or {{int:...}} found within [[category:...]] and whose result depends on time or on user preferences would be easily avoided. This would also simplify a lot the management of whitespaces (if they need ot be trimmed and/or compressed depends on the builtin expansion function called by the parser.

ayg wrote:

(In reply to comment #205)

That's because the Mediawiki parser still operates on the wrong level, and
performs text subtitutions always without ingoring the context of use. Not only
this is bad, but this is also very inefficient, because each processing level
is converting a fullpage to another fullpage that needs to be reparsed again at
the next level.

People have tried to fix this before, and failed miserably. It is not practical in the immediate future, if ever. Everyone likes to say how great it would be to write a "real" parser, but nobody ever seems interested in producing working code that actually replicates the current parser to a good enough level to maintain acceptable backward compatibility. If you're interested in that, please give it a try and get in touch with us when you have a working implementation. Otherwise, it's not worth discussing pointlessly yet again.

(In reply to comment #206)
Top-down analyzers are very easy to write, provided that you allow parsing to be made via TWO generation points instead of one: the first one converts and accumulates the tokens in an array as long as the whole construct cannot yet be processed; this requires a first parsing loop. Then the final step will ask for the expansion of the parsed tokens lazily, in a basic processing loop which will "eat" the tokens lazily, replacing them by their expansion, until there remains only one token, a basic string that can be used as an expanded parameter.

I may try working on it; it's not so complicated to formulate, even without using Yacc/Bison/PCTCS, and I have many ideas on how it will work: basically each non-terminal token in the formal syntax becomes TWO functions. One is used when parsing individual tokens, until the last one is reached; a second function will be used later to expand the result (it will be called BY the caller, after the parse returns, and ONLY when the caller needs the expansion within its OWN expansion function).

For example, the MediaWiki parser would become TWO functions: mediawiki_parse() (which returns an array of tokens) and mediawiki_expand(), which is stored as the first token in the array. All ..._expand() functions would have two parameters: one for the context in which they are to be evaluated (containing for example time properties, cache properties, language properties), and the array itself, to be expanded until it produces some simple text. The main concept is that tokens will expand lazily, but with full control by the non-terminal parsed token, which knows its context of expansion.

Basically the first step, for each non-terminal token in the formal grammar, performs a bottom-up conversion from a top-down analysis (performed recursively); this first step builds an AST (stored as a basic array of children). The second step performs a *conditional* top-down expansion of the AST array into flat text, ignoring nodes that don't need to be expanded because their effective expansion will not be needed for the expansion of the current AST node. Both steps are recursive, but in separate simple processing loops.
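To make the idea concrete, here is a minimal sketch (my own illustration, not actual MediaWiki code) of the lazy expansion step: a node is either a plain string, or an array whose first element names its expansion callback, and a callback only expands the children it really needs:

// Sketch only: nodes are strings (terminal tokens) or arrays of the form
// array( callbackName, child1, child2, ... ). expandNode() never expands a
// child unless the callback asks for it.
function expandNode( $context, $node ) {
    if ( is_string( $node ) ) {
        return $node; // terminal token: already flat text
    }
    $callback = array_shift( $node ); // remaining elements are the children
    return call_user_func( $callback, $context, $node );
}

// Example callback: a lazy {{#if:...}}-like construct; only one of the two
// result branches is ever expanded.
function expandIf( $context, $children ) {
    $condition = trim( expandNode( $context, $children[0] ) );
    $branch = ( $condition !== '' ) ? $children[1] : $children[2];
    return expandNode( $context, $branch );
}

// Usage: the inner node is expanded, the unused branch never is.
$ast = array( 'expandIf', 'yes', array( 'expandIf', '', 'a', 'b' ), 'never expanded' );
echo expandNode( array(), $ast ); // prints "b"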

OK this is still out of topic, but I don't know for now where to post these concepts. I'll stop discussing this here.

ayg wrote:

*** Bug 24953 has been marked as a duplicate of this bug. ***

*** Bug 25129 has been marked as a duplicate of this bug. ***
*** Bug 24142 has been marked as a duplicate of this bug. ***

On wikitech-l, Roan writes

This support is now in MW with the category collation code Aryeh
wrote, and developers can now proceed to write language-specific
collations if they want to.

See http://permalink.gmane.org/gmane.science.linguistics.wikipedia.technical/52572

(In reply to comment #211)

On wikitech-l, Roan writes

This support is now in MW with the category collation code Aryeh
wrote, and developers can now proceed to write language-specific
collations if they want to.

So other collations are possible for categories. But this bug is also about sorting in other places. Is there any support for sorting lists in special pages? For example duplicate bug 353 is about Danish letters in Special:Allpages.

ayg wrote:

Getting sorting working in other places would require work, but we could use the same infrastructure. For Special:Allpages, we'd need to add an extra column and index to the page table, which seems a bit excessive for such a minor special page.

There are lots of places where pages are listed - don't they also require or benefit from such an index?

ayg wrote:

Maybe. It depends. Given the number of comments here and the number of people being spammed on every post here, though, I'd strongly suggest that people open new bugs for each place they want to have improved collation, rather than continuing to use this bug. You can mark them as depending on this one so interested parties can find them.

Aryeh, can you point me to a link showing where your implementation was added in MW? I'd like to review it, and locate possible bugs or caveats.
Also, the announced integration does not give documentation details about how we can use it (wiki code syntax in pages, possible support templates, integration points, server settings, possible user preferences, possible parser functions for generating helper data in the HTML so that collation will also work on the browser side - because there's still no support anywhere for collation in Javascript, and because most Javascript code found to handle special sorts is bogus and does not implement true UCA, except some HTML/Javascript applications using Ajax requests to perform the sort on the server side).


Note: jQuery still does not have any "utility" function to support collation. It provides a ".sort()" method that takes a callback function as a parameter to perform the actual compare(), but inefficiently, as it cannot use precomputed collation keys from the actual table data, so the user-supplied compare function would recompute collations many times from the same data cell while performing the sort.

I have asked jQuery to extend its interface to allow precomputing collation keys in a first pass using a user-supplied callback, and then only supply these precomputed collation keys when comparing cells during the actual sort, but this is still a dead letter. I have also asked jQuery to start a project for implementing a $.CollatorFactory object from which collators can be prepared and instantiated, and a common interface for the collators returned by the factory.

jQuery should then supply at least a very basic CollatorFactory (a stub implementing only binary sort). Then a separate subproject could start creating a more complete CollatorFactory that would fully support multilevel UCA, integrate the DUCET table, and integrate support for locale-specific tailorings of the DUCET, before integrating a collection of tailorings (based on CLDR data, but not necessarily in the same format, and not necessarily allowing a dynamic instantiation of any kind of tailoring using the CLDR syntax for them).

Note that the collection of tailorings (there are many locales in the CLDR) need not always be fully loaded into the browser Javascript: a server-side Ajax request could as well provide the necessary data to the browser (most probably in a JSON structure), allowing tailored collators to be instantiated rapidly without having to parse the CLDR syntax.

The DUCET table dataset (most probably large) could also be loaded to the browser in a delayed way, through Ajax, only on demand (and possibly only a part of it, under the control of the CollatorFactory, solicited by specific collators that would just ask the CollatorFactory to preload some segments of it, to save page load time and volume, and only when the Javascript starts running some collation function that requires this data to have been initialized and loaded).

I'd like to start my own project for that, and integrate it into jQuery, in a much better way than all that I've seen on the web (all the implementations I've seen for supporting collation in Javascript are simply bogus, or at least very defective). Unfortunately, Javascript itself (or the most recent ECMAScript specification) has completely forgotten to develop the subject of collation, so it will take many years before correct browser-side sorting becomes standard in browsers: there's not even any ongoing project for that, or even just some preliminary specs, either at ECMA, at W3C, Apache, jQuery, or even at Oracle, Microsoft, Apple, Mozilla and Opera.

Such support, on the other hand, is already present in Adobe Flash, in Microsoft's Silverlight (based on the .Net library), and for many years in Java. This means that ALL web applications that want correct browser-side collation still need to run with browser plugins, and not with pure Javascript over the HTML/CSS DOM API. Given the ongoing efforts to deprecate these plugins in favor of HTML5, I don't understand why no such effort has ever been initiated.

ayg wrote:

The interesting code is here:

http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Collation.php?view=markup

That's the part that does the actual collation, which is really just handed off to the intl extension. The part that actually hooks into category pages is in CategoryPage.php, but it's relatively boring. I don't think there are any parser functions or anything else exposed to users right now.
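Roughly, a simplified sketch (not the exact Collation.php code) of what that hand-off to the intl extension amounts to:

// 'uca-default' corresponds to the ICU root locale; the prefix and the page
// name are joined with a line break (see the r80443 note further below).
$collator = Collator::create( 'root' );
$sortkey = $collator->getSortKey( $prefix . "\n" . $pageName );
// $sortkey is an opaque binary string, stored in cl_sortkey and compared bytewise.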

If you have questions or comments, I strongly encourage you to post to wikitech-l, not this bug. Bugzilla is used for bug tracking, not discussion, and this bug is fixed. If you have further feature requests, like the addition of new parser functions that expose the sortkey functionality, file new bugs. I am not going to answer any questions that are asked here.

OK, I immediately found a bug in the binary search function (which is definitely not binary, has slower than expected convergence, and may even stall on testing the same index, due to incorrect handling of the min/max indices after comparing):

function findLowerBound( $valueCallback, $valueCount, $comparisonCallback, $target ) {
    $min = 0;
    $max = $valueCount - 1;
    do {
        $mid = $min + ( ( $max - $min ) >> 1 );
        $item = call_user_func( $valueCallback, $mid );
        $comparison = call_user_func( $comparisonCallback, $target, $item );
        if ( $comparison > 0 ) {
            $min = $mid;
        } elseif ( $comparison == 0 ) {
            $min = $mid;
            break;
        } else {
            $max = $mid;
        }
    } while ( $min < $max - 1 );

    if ( $min == 0 && $max == 0 && $comparison > 0 ) {
        // Before the first item
        return false;
    } else {
        return $min;
    }
}

The author should have read more carefully how TESTED binary searches are implemented. This should really be:

function findLowerBound( $valueCallback, $valueCount, $comparisonCallback, $target ) {
    $min = 0;
    $max = $valueCount - 1;
    while ( $min <= $max ) {
        $mid = ($min + $max) >> 1;
        $item = call_user_func( $valueCallback, $mid );
        $comparison = call_user_func( $comparisonCallback, $item, $target );
        if ( $comparison < 0 ) {
            $min = $mid + 1;
        } elseif ( $comparison > 0 ) {
            $max = $mid - 1;
        } else {
            return $mid; // or: $max = $mid; break;
        }
    }
    // $target not found, now $max < $min (more exactly, $max = $min - 1).
    // $target is to be inserted between them: $max is the highest lower bound.

    if ( $max < 0 ) {
        // Before the first item
        return false; // Don't test with a simple if !
    } else {
        return $max; // May return 0 (lower bound on first item)
    }
}

Note that the final test to return false is unnecessary and causes confusion in usage (if you had written this in C/C++ instead of PHP, you would not be able to make the distinction between false (no lower bound found) and 0 (valid lower bound)). Your code should not depend on those PHP hacks, which consist in returning values of distinct types (with PHP's implicit promotion of basic types, it's really dangerous to do that)!

It's just simpler to let it return -1 (the content already in $max when we are before the first item), and then let the caller test the negative sign of the result, instead of comparing it then with false.

I note the following comment in r80443:

  • Changed Title::getCategorySortkey() to separate its parts with a line break instead of a null character. All collations supported by the intl extension ignore the null character, i.e. "ab" == "a\0b". It would have required a lot of hacking to make it work.

If you had read my comments above, I had already spoken long ago about the choice of the separator to use between the prefix and the rest of the full page name: it had to be a non-ignorable character that is clearly lower than any character usable in the full pagename.

I had also stated which character to use, namely the TAB character, which is the first non-ignorable one (i.e. with non-all-zeroes weights). See my Comment #190 for the rationale (one year ago, long before your recent fix to use LF; before that you had tried with NULL, which was wrong, and even without any separator, which was also wrong).

You could have avoided these bugs, as I had analyzed it long before.

OK, LF (U+000A) is OK (because it is not allowed in full page names), even if it is not the first one.


One word about performance and space: you have put too many fields in the table "categorylinks". Only one essential thing is needed there for referential constraints and counting: the UNIQUE index on (cl_from, cl_to), which is architectural. As this index is unique, it should share the same store as the table itself. But SQL engines will never use more than one index per table to read it, because that requires additional random-access cursors to access the table. As the current unique index uses fewer fields than the table, it implicitly creates a random-access reference from this index to the table (basically, using internal row ids stored in the index, instead of reading directly from the table).

To avoid this problem, you have to make sure that the table is formatted so that its first fields match exactly the same order as the unique index. This is the case for the first unique index created just after that. So the table and the index share the same store; however, the number of index rows per storage page will be limited (the table can use up to 1KB per row, so this does not give a very compact and fast search in the index, as it will load too many storage pages into memory, even when using a "very" selective search).

You may optimize searches (at the price of storage space) by indicating that the unique index should still use a separate store. Normally this is the role of the "PRIMARY" specifier, which indicates not to use a separate store (it is still UNIQUE). For now you don't have any PRIMARY index, only a UNIQUE index, which uses a separate index store.

Now comes the two main kind of requests performed by MediaWiki:

(1) getting the list of categories in which a page is currently listed (including when saving a page, because this list must be flushed out and replaced by the categories found when parsing the page). Ideally, this should require at least an index on cl_from. The UNIQUE index on (cl_from, cl_to) works perfectly there because it is in a separate store. Other attributes in the table will not even be read, unless the SQL query wants to get these attributes (in which case both the index and the table stores will be read, the index being read very rapidly and joined with the table using the internal table row id, stored in the index with the two keys). In all that I have seen in the MediaWiki code, this does not occur (including when rendering a page: to display the list of categories, it just needs this list, but it gets it from parsing the page's wiki code before saving it, and this gets saved in the page cache).

(2) getting the list of pages in a category. There you need to use collation as well, because this list is sorted. Ideally, this should use an index and table that contain just the necessary fields, and the SQL ORDER BY clause should use those same fields, in the same order (and either all in ASCENDING, or all in DESCENDING order). This is the second index:

CREATE INDEX /*i*/cl_sortkey ON /*_*/categorylinks (cl_to, cl_type, cl_sortkey, cl_from);

Obviously, it cannot use the same index store as the table. The introduction of the "cl_type" field is recent. It supposes that you will be paging the category views by page type, in the order set by cl_type, which is:
ENUM ('page', 'subcat', 'file').

I wonder if this is correct and if it's the expected order. But anyway, this design means that the current code will actually perform three separate requests, each one paged separately: one request to get the list of normal pages, another to get the list of subcats, and another to get the list of files. This will be inefficient, and hard to manage for the paging, unless these three types can effectively be paged separately in the MediaWiki HTML UI (which is not the case, as of today).

So if we can only do paging on all categorized pages, whatever their type, the "cl_type" field in the index causes the "ORDER BY cl_sortkey" clause in the SQL query to build a costly temporary index, instead of reading directly from the index where it will finally get the "cl_from" field to display.
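For illustration, the query shape in question is roughly the following (a sketch using MediaWiki's database wrapper, not the actual CategoryPage code); with cl_type sitting between cl_to and cl_sortkey in the index, this ORDER BY can no longer read the index rows in sorted order:

// Sketch only: listing one category, sorted, 200 entries per page.
$res = $dbr->select(
    'categorylinks',
    array( 'cl_from', 'cl_sortkey' ),
    array( 'cl_to' => 'Some_category' ),
    __METHOD__,
    array( 'ORDER BY' => 'cl_sortkey', 'LIMIT' => 200 )
);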

So as long as we cannot have separate paging by type, I really suggest that you either drop the "cl_type" field from this index, or that you place it after "cl_sortkey" needed for fast "ORDER BY". I.e. :

CREATE INDEX /*i*/cl_sortkey ON /*_*/categorylinks (cl_to, cl_sortkey, cl_from);

My opinion is that "cl_type" is not even needed there: the page id stored in "cl_from" already has this information, because it will always be used as a reference to the "pages" table to get the actual page name to display, as well as its namespace, neither of which is stored in the "categorylinks" table itself.

Note that because this is a secondary index (in a separate store), it is still good to include 'cl_from' there, even if it's not needed for the selectivity of this index, just to avoid an additional random-access cursor to the "categorylinks" table being used by the SQL engine via table row ids also stored in this index (SQL engines prefer to use referential cursors, if possible, instead of creating temporary table/index stores for cross joins between tables and/or indexes, except if both joined parts are not very selective; as this is specific to each SQL engine's internal query optimizer, I won't discuss the tricky details further; it's just simpler to avoid the selectivity problem here by including the small "cl_from" field in this index, which does not fill up much space in the encoded index row and still allows good compaction of many rows in the index pages).

So why "cl_type" is there (in the index) if we still cannot perform paging separately on these types when viewing a category content ? Even if you do so later, it will require separate SELECT requests, with separate sorts, and separate paging every 200 items. In this case only, "cl_type" will be specified as a constant and will become very selective, and probably better viewed with separate indexes (so even "cl_type" will not be needed if you have separate tables by splitting the "categorylinks" table for each type).

If you really want to maintain the "cl_type" field, for me it's just an informational denormalization of something really represented by "cl_from". Beware of the effects of such denormalizations (I hope that a page id will never change its type, including when a page is renamed to another namespace; but I won't bet on it; I think this may be the cause of future database inconsistencies). For me the rest of the data schema is not prepared to handle such separation (notably between cl_type="page" and cl_type="file").

And in fact, if you really want to maintain it in the index, as a convenience to simplify the MediaWiki client code performing the SQL queries, it should, for now, be only the last field of this index:

CREATE INDEX /*i*/cl_sortkey ON /*_*/categorylinks (cl_to, cl_sortkey, cl_from, cl_type /* denormalized: see cl_from */);

This will still avoid the ORDER BY clause of your queries creating an expensive temporary index when viewing and paging the (sorted) contents of a category (notably for very populated categories, for example on Wiktionaries that contain categories for large collections of words per language, with 200,000 entries or more).

(In reply to comment #218)

if ( $comparison < 0 ) {
    $min = $mid + 1;
} elseif ( $comparison > 0 ) {
    $max = $mid - 1;

The function is not a classical binary search, which only determines equality, instead it finds the lower bound of a range. When the test item sorts below the target ($comparison < 0), it can't be ruled out as a prospective lower bound. That's why you need $min = $mid, not $min = $mid + 1.

Maybe $max = $mid - 1 is possible in the $comparison > 0 case, with your suggested alteration to the test for the before-the-start case. But the existing code is tested and does work, and the improvement in convergence rate would be small.

OK, I immediately found a bug in the binary search function (which is definitely
not binary, has slower than expected convergence, and may even stall on
testing the same index, due to incorrect handling of the min/max indices after
comparing):

I assume by "stall" you mean an infinite loop. This does not occur because the midpoint is always different from either of the two endpoints until the difference $max - $min is reduced to 1 or 0, at which point the loop exits.

In a textbook binary search, offsetting the midpoint by 1 allows the loop to continue until $min > $max. But as I said, we can't offset the midpoint in the $comparison < 0 case, so a slightly earlier loop termination is required to avoid an infinite loop.

Note that the final test to return false is unnecessary and causes confusion on
usage (if you had written this in C/C++ instead of PHP, you would not be able
to make the distinction between false (no lower bound not found) and 0 (valid
lower bound). Your code should not depend on those PHP hacks, which consists in
returning values of distinct types (but with PHP's implicit promotion on basic
types, it's really dangerous do do that) !

It's just simpler to let it return -1 (the content already in $max when we are
before the first item), and then let the caller test the negative sign of the
result, instead of comparing it then with false.

It's conventional for PHP functions to return false to indicate an error or other unusual situation, see for instance http://php.net/array_search . It's not at all dangerous. If I wrote the function in C, I would indeed have used -1, and I would have called it an ugly hack to work around C's lack of weak typing.

If you're aware of some actual bug in findLowerBound(), you should file a separate bug report, or note it on http://www.mediawiki.org/wiki/Special:Code/MediaWiki/80443

@Philippe, InnoDB treats the first UNIQUE index as a clustered PRIMARY key, which stores all data with it (I assume you don't come from an InnoDB background, please correct me otherwise ;-)

The cl_sortkey structure allows having the dataset split by type (e.g. subcategories first, files last, whatever), and still maintains a decent sorted order. cl_from is included for cases where we need to page over multiple entries with the same sortkey.

do note, as secondary keys would include primary key values anyway, it is not much of a cost.

Back in the day both major reads used to be covering ones, without random lookups, I'm not sure if there's a regression nowadays with all the new collation code though.

You should know that a binary search for equality or for a lower bound is exactly the same algorithm. (If not convinced, look at the source of a C++ standard library, which uses the same function for both: finding exact matches and finding a lower bound for inserting a new item.)

The difference only occurs at end of the loop, about what you return when an exact equality is not found. In both cases, the binary search loop will terminate with min=max+1 exactly if no exact match is found, so that max is the lower bound you want.

My suggestion fixes a few things:

  • it uses a while()... instead of a do...while(), which is incorrect for the case where the search space operates with count<2.
  • it has faster convergence
  • it contains no special cases to handle (your code will be erroneous in those cases).
  • even when searching with count=0, it starts with min=0 and max=-1, which terminates the loop immediately, with the same condition for not found (max < min, and max is the lower bound)
  • it works with count=1 without creating an infinite loop like in your code when the target is higher than the only element [0] to check, because it ensures that the loop will ALWAYS reduce the search space at each iteration.

The function is not a classical binary search, which only determines equality,

instead it finds the lower bound of a range. When the test item sorts below the
target ($comparison < 0), it can't be ruled out as a prospective lower bound.

And here you're completely wrong: in that case the comparison says that the mid item is certainly not the lower bound. Don't forget that the item at max (and below) will also be tested: max will decrease as well, up to the point where it will either find an exact match, or fall 1 position below min.

That's why you need $min = $mid, not $min = $mid + 1.

That's why you don't need that !

I'm reopening this based on notes from bug 29788 etc.

  • The fixes applied for this bug apply only for category member sorting, but the problem exists in many other places as mentioned in the summary, such as [[Special:Allpages]], [[Special:ListUsers]], etc
  • There doesn't appear to be any straightforward way to set the collation to a locale-specific one!

The only available options are "uppercase" and "uca-default" -- sorting in a locale-specific way (say, Swedish per bug 29788) appears to require writing additional code

  • For some reason, site content language doesn't appear to be taken into account in any way; you must set $wgCategoryCollation manually.

This seems very strange to me, and is an obvious additional burden on site administrators to figure out how to code up (!) and then activate a language-specific collation.

(In reply to comment #224)

I'm reopening this based on notes from bug 29788 etc.

  • The fixes applied for this bug apply only for category member sorting, but

the problem exists in many other places

Don't forget sortable table sorting.

I'm not sure if this bug with 224 (225 now) comments is the best bug to keep tracking this. How about creating a tracking bug, and slicing this issue into smaller, more manageable issues?

(In reply to comment #224)

I'm reopening this based on notes from bug 29788 etc.

  • The fixes applied for this bug apply only for category member sorting, but

the problem exists in many other places as mentioned in the summary, such as
[[Special:Allpages]], [[Special:ListUsers]], etc

Related bug 24574. I suppose we would have to add a sortkey field to various other tables.

  • There doesn't appear to be any straightforward way to set the collation to a

locale-specific one!

The only available options are "uppercase" and "uca-default" -- sorting in a
locale-specific way (say, Swedish per bug 29788) appears to require writing
additional code

  • For some reason, site content language doesn't appear to be taken into

account in any way; you must set $wgCategoryCollation manually.

I think that was partially out of concern for PHP installs without the intl extension installed (so UCA collation would not work). Since changing the collation requires running a maintenance script, if we just checked that the relevant class existed to determine which collation to use, and then someone upgraded their PHP so intl was suddenly present, things would become screwed up. (Or that's what I assumed the reasoning was. I could be wrong.)

This seems very strange to me, and is an obvious additional burden on site
administrators to figure out how to code up (!) and then activate a
language-specific collation.

Perhaps make the value of $wgCategoryCollation consider anything past "uca-" to be the locale. For example, it would consider the value "uca-foo" to mean using IcuCollation( 'foo' ) as the Collation class.
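A sketch of that idea (hypothetical helper, not actual MediaWiki code):

// Treat everything after the "uca-" prefix as an ICU locale name,
// e.g. $wgCategoryCollation = 'uca-sv' would give IcuCollation( 'sv' ).
function collationFromSetting( $setting ) {
    if ( strncmp( $setting, 'uca-', 4 ) === 0 && $setting !== 'uca-default' ) {
        return new IcuCollation( substr( $setting, 4 ) );
    }
    return null; // keep the current behaviour for 'uppercase' / 'uca-default'
}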

(In reply to comment #224)

Don't forget sortable table sorting.

Sortable tables are sorted client-side. This might offer additional flexibility. If clients were enabled to choose collations for sortable tables at the time of actual sorting, we could offer them options beyond those available in server-side database sorts.

I am currently working on a Javascript implementation of UCA (i.e. with full support of the DUCET, at least 4 collation levels, and support for language-based tailorings). This is a very difficult task if we need performance. The trick is to use efficient lookup tables, but the difficulty is being able to update them (for tailorings); also, performance will highly depend on how the DUCET is represented in Javascript, so that it can be initialized sufficiently fast that you won't notice this init time when a collator object gets instantiated.

I'm still not sure that performance will be good enough (even with today's much faster Javascript implementations), and maybe some optional cooperation with a server-side script will be needed (or possibly a client-side native plugin interfaced with Javascript, such as Flash) to limit the number of JSON requests to a server providing the tailored lookup tables. How to detect and use client-side plugins in an open way is another open question.

I really hope that someday, the Javascript/ECMAScript standard will specify a common standard interface to collators. I've already looked for help from authors of jQuery, where such common interface would become part of its "tools" library (even if this has nothing in common with working in the HTML DOM, that is the main focus of jQuery).

For now there exists absolutely no reliable implementation of UCA in Javascript working only on the client side (but there do already exist some Javascript libraries that in fact use JSON or XML requests to a server, in order to build or retrieve the necessary lookup tables).

For practical reasons, a server-side JSON/XML request could be optimized to reduce the volume of data exchanged (for example by providing only some parts of the collation elements, these data being then requested on demand even after the collator has already been -partly- initialized, and then cached in the collator object). But this is a complication in the design that for now I don't want to investigate too early, before I get a good image of the performances effectively needed in practice.

Maybe it will be enough to initialize only a part of the DUCET for a particular language and its specific tailoring, sorting all the other many characters with default weights (not necessarily from the large DUCET).

My implementation will provide at least 3 interfaced functions:

  • instantiating a collator for a given language & sort order
  • computing collation keys from strings, for more efficient sorts of large tables (even if we use a QuickSort, the same strings are compared multiple times with other strings, so it may help to compute their collation keys only once); in addition, it may not be necessary to compute the full collation key, but only keys up to the level that's sufficient to perform a specific compare; collation keys can then be computed partly, on demand, and cached, instead of being fully computed for the full string at all collation levels; in addition, not all table sorts may require the maximum collation level supported, so collation keys don't necessarily have to be long (in memory);
  • the tradeoff for precomputing keys instead of comparing strings on the fly is highly dependent on the table size: for small numbers of rows to be sorted, you gain little with precomputed keys, you just waste memory and get lower performance. So the interface will also include a 3rd function: comparing strings using this collator, without necessarily having to compute the collation keys.

I estimate that the tradeoff limit between precomputed collation keys and direct compares is for tables of about 100 rows, for which it may be helpful to provide a very fast response time when sorting them, but I may be wrong, because these small tables will not require precomputing and storing a lot of collation keys (so this first step before the QuickSort loop may be insignificant). The real limit could be memory use in very large tables for precomputing, storing or caching the collation keys; but such very large tables will probably be very uncommon in tables generated from Wiki pages (most of these tables would not include more than 500 rows, and storing 500 collation keys is not really a problem in today's browsers, except possibly in smartphones with limited memory resources and slow processors, compared to laptops, notebooks and tablets).
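The same two strategies can be illustrated server-side with PHP's intl Collator (just an analogy to the Javascript collator discussed above; assume $rows is an array of strings to sort):

$collator = Collator::create( 'sv' ); // e.g. Swedish tailoring

// (a) direct compares: fine for small tables, but the collation work is
// redone every time the same row is involved in a comparison
usort( $rows, function ( $a, $b ) use ( $collator ) {
    return $collator->compare( $a, $b );
} );

// (b) precomputed sort keys: one binary key per row, then a cheap bytewise sort
$keys = array();
foreach ( $rows as $i => $row ) {
    $keys[$i] = $collator->getSortKey( $row );
}
array_multisort( $keys, $rows );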

@Philippe Verdy: I think work on the bigger Unicode problems for JavaScript is pretty important. I tried to file a bug or feature request somewhere a couple of years ago and the sentiment from the js community was that Unicode was too big and unneeded, would bloat js and be too slow. I couldn't convince whoever was my audience that you just need it and that, in any case, the js implementations should be hooking into Unicode APIs provided by their host OS or browser, which must have them anyway.

I have implemented some character/script property stuff from Unidata in js and for my purposes found it very fast indeed to do everything with binary search. And all the js implementations are much faster now.

I was doing stuff like looking up the script of every character in an HTML page.

I've also implemented internationalized sorting including UCA of page titles in en.wiktionary using the toolserver as a back end. All work was done on a first generation netbook (-: You have my support. Is there somewhere I can watch your project perhaps?

Until recently, it was too slow to initialize a large table like the DUCET. But now that Javascript starts being used for 2D/3D rendering with interactive speed, demonstrating complex animations, I don't think this is a performance issue. The only issue is the code size that must be loaded once (that's why I spoke about caching the DUCET and loading it by parts, on demand). But note that the DUCET is not as large as the full Unicode. You can easily exclude the Hangul syllables, and the sinograms, and can autobuild some parts of it for blocks of symbols. The core of the implementation is already written for the DUCET-based collation, I need more work to instantiate tailorings. But I've not determined the best way to handle on-demand loading of the DUCET. In addition, if I can finalize it, I will need a way to regenerate with some automatic tool the representation of the DUCET for each release of Unicode. For now I just use a static table, created with lots of manual operations: a code generator will be needed (just like in all existing full implementations of UCA).

Re JS sortable table stuff (this bug is about too many things...): well, client-side js sorting stuff would be nice, but since these tables aren't dynamic, wouldn't it be easier to add, on the server side, a sortkey as a data attribute for each cell of the table, and then just sort on that on the client side? (The only issues with this are that it might be complicated to implement in the parser [table code in the parser is scary imho, but I'm sure some other people find it less scary], and language preferences, but I think that just using the content language [or the page's language, I suppose, now that we have that] would be sufficient.)

Thoughts?

If table contents are static, it's true that their collation weights could be delivered as well with the visible table.

In a past message above, I had already proposed that the function that is already needed to sort categories, and that computes the UCA collation keys, should be exposed as a ParserFunction. This would then obviously allow any wiki page to compose web tables containing both the visible data and the associated collation key.

Many very complex templates that are used to generate sortable pseudo-collation keys would no longer be needed, or would be largely simplified. In fact, I even proposed initially to expose this function before even modifying the schema of categories: there was no immediate urgency to modify this schema, as long as we could experiment with assigning sort keys when categorizing a page, using that ParserFunction.
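For example, such a ParserFunction could be registered roughly like this (a hypothetical extension sketch, not existing code; the hex encoding is just one order-preserving way to make the binary key printable in wikitext):

// Hypothetical {{#sortkey:...}} parser function returning a printable
// encoding of the UCA sort key of its argument.
$wgHooks['ParserFirstCallInit'][] = function ( $parser ) {
    $parser->setFunctionHook( 'sortkey', function ( $parser, $text = '' ) {
        $collator = Collator::create( 'root' );
        return strtoupper( bin2hex( $collator->getSortKey( $text ) ) );
    } );
    return true;
};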

And probably we would have been able to experiment with a better schema, including one that would have allowed multiple collation keys (generated from different locale parameters), something that has been forgotten completely (and that will require a new change in the database schema in the future, to support multiple collations according to users' localisation preferences and expectations).

Anyway, my work on JS implementation of UCA is already well advanced, and in fact more advanced now on some points that MySQL still does not support correctly, notably contractions and expansions.

And very soon I'll add support for contexts (see UTR#35 for what I mean by those terms), because it's a trivial use of contractions and expansions combined.

It can also be implemented as a separate function, to generate shorter keys. I'm still not very concerned with choosing the most compact format for representing collation keys, but I have already worked a lot on how to automatically compact the sets of collation weights for each level, suppressing all the unnecessary gaps that appear in the standard DUCET table (gaps which are only provided for convenience, for very basic implementations of UCA not using tailorings, or only using very basic tailorings and no contractions/expansions).

For now, the most critical algorithm (the one that takes most of the time when computing collation keys, or when directly comparing two strings) is computing the NFD: I can save time when comparing pairs of strings (for example by processing only the beginning of the strings until a collation difference is found, which does not always require the strings to be fully converted to NFD), but not when computing a collation key (unless I also add a constraint limiting the length of the generated collation keys, allowing an early stop). I have looked at several implementations of Unicode normalizers in Javascript; I'm not satisfied with them as they are clearly not optimal.

In fact both the NFD transform and the generation of collation weights for each level are linked: if we sort only on the primary collation level, we lose too much time in computing the NFD transform, which provides too many unnecessary details that will finally be ignored. This means that I'm also trying to perform an NFD transform that removes these details. Such a transform is effectively what the Unicode standard calls a "collation mapping" (something more powerful than just a "case mapping", or even a "case folding").

Such "collation mapping" would be extremely useful for implementing the "first letter" classification in categories, or even to provide thiner classifications in very populated categories (for example allowing two letters). This needs is in fact exactly equivalent to searching for the smallest string that has the smallest collation keys containing only two collation weights per collation level, and with a collation set to only one level, and this also can be largely optimized so that the NFD transform will remove all collation-ignorable details that would never be needed to compute the collation key.

All this is an interesting subject of research (and even ICU does not provide such a general mechanism...).

I will also be innovative in how to provide a very compact representation of tailorings. I also have ideas on how to represent, store, query and cache the precomputed lookup tables for tailorings, and on a standard interface that will allow pluggable implementations in native code (if possible and needed), or for use with other languages, but also with other non-UCA based collations (including dummy collations, such as binary sorting, or sorting by standard case folding, or by standard case mappings), or complex collations requiring a lot more mapping data than the DUCET and a set of collation orders and contractions/expansions (for example for sorting Han ideographs by radical/stroke, or for sorting based on romanisations and other transliterations, which are all language-dependent; I have not developed anything for now about transliterators, not even for the standard romanizations of Chinese and Japanese, which require lookup in a large dictionary, such as the huge CJK properties file found in the Unicode database, as they cannot be performed algorithmically like the standard romanizations of Russian or Arabic).

If you think about it more closely, ALL the existing algorithms that transform any Unicode text into another can be thought of as "collation mappings", based on a language-dependent definition of a multilevel collation order. Collation is the central concept, and the distinctions between algorithms are not different from distinctions between collation tailorings (a tailoring may be language-neutral, or language-dependent).

I will expose my solutions later, first on the Unicode mailing list, the ICU mailing list, and the jQuery mailing list, expecting that there will be interesting contributions (or useful checks and corrections), before it can become a generic library that can be reused in other projects like MediaWiki. I'll use a licence that will be compatible both with free licences (GPL-like) and open-source licences (OSI), as well as with commercial projects. It will probably be BSD-like.

This bug has been split into 4 separate bugs to track the different issues raised:

bug 30672 -- improve sorting on other pages than category pages (tracking)
bug 30673 -- Support locale-specific, or tailored, sorting (tracking)
bug 30674 -- Support better client-side sorting (tracking)
bug 30675 -- Use allkeys_CLDR.txt - the CLDR tailored DUCET instead of allkeys.txt

Don't close this bug as long as it blocks other bugs that are not closed:

bug 3969 -- UTF-8 compatibility (tracking)
bug 6948 -- natural number sorting
bug 29788 -- sort Swedish letters correctly (at end of alphabet)

These bugs effectively depend on UCA (or UCA improvements)

My opinion is that bug 3969 should be detached, but the other two should be reattached to one of the 4 new bugs, on which they depend (most probably bug 30673).

And the new bugs should still be linked to this bug 164, which would become only a tracking bug (and should not be closed as long as the 4 new bugs are open).

(In reply to comment #234)

Don't close this bug as long as it blocks other bugs that are not closed

I thought it worked the other way around; bugs that are blocked by open bugs can't/shouldn't be closed...?

RESOLVED / FIXED per hexmode. Please leave this closed. We discussed this at length during the triage session.

*** Bug 30287 has been marked as a duplicate of this bug. ***