
Make updateCollation.php process categorylinks on a category-by-category basis
Closed, Resolved, Public


I'm not sure what order the script uses right now, but categories were certainly messed up during the migration to the uca-pl collation (bug 42413), so it's not category by category.

Apparently nobody but me noticed (or cared enough to report it), but the problem was real. The migration took ~25 hours.

Getting this done will be necessary if we ever want to get, say, Commons or English Wikipedia migrated with reasonably little disruption; processing that many categorylinks might take weeks, and somebody is bound to notice and get mad.

Version: 1.21.x
Severity: enhancement



Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:35 AM
bzimport set Reference to bz45970.
bzimport added a subscriber: Unknown Object (MLST).

Somebody already noticed it. :) See Change-Id: Ibcf789314d02a68157063e11dcdf81abfb9d61fb

Copying my comment from that change: I find that there's no index on cl_to,cl_from (there is cl_to,cl_type,cl_sortkey,cl_from, but cl_sortkey is not stable while this script is running), and I don't think it's worth adding a new index only for this reason.

Even though the second index is unstable, we could still use it when we know which cl_collation to ignore, at the cost of a couple of rows being looked at twice. (Specifically, that could work in the non --force case currently. I'm not sure if such a scheme could work with the new multi-collation stuff, though.)
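One way to read the suggestion above, as a rough sketch only: walk the categories in cl_to order and, within each category, keep fetching batches of rows whose cl_collation is not yet the new one. Converted rows fall out of the filter, so no stable ordering on cl_sortkey is needed. This is Python against SQLite rather than the real updateCollation.php code; the table is a cut-down stand-in for categorylinks, and `new_sortkey`, `update_by_category`, and `batch_size` are hypothetical names. It only covers the non --force case.

```python
import sqlite3

NEW_COLLATION = "uca-pl"  # target collation named in this task


def new_sortkey(prefix):
    # Hypothetical stand-in for the real collation's sort key function.
    return prefix.casefold()


def update_by_category(conn, batch_size=2):
    """Convert rows one category at a time.

    Returns the categories in the order they were finished.
    """
    finished = []
    # Materialise the list of categories that still have unconverted rows.
    categories = [row[0] for row in conn.execute(
        "SELECT DISTINCT cl_to FROM categorylinks "
        "WHERE cl_collation != ? ORDER BY cl_to",
        (NEW_COLLATION,))]
    for category in categories:
        while True:
            # Rows already in the new collation are ignored, so each batch
            # only contains work that is still outstanding.
            rows = conn.execute(
                "SELECT cl_from, cl_sortkey_prefix FROM categorylinks "
                "WHERE cl_to = ? AND cl_collation != ? LIMIT ?",
                (category, NEW_COLLATION, batch_size)).fetchall()
            if not rows:
                break
            for cl_from, prefix in rows:
                conn.execute(
                    "UPDATE categorylinks "
                    "SET cl_sortkey = ?, cl_collation = ? "
                    "WHERE cl_to = ? AND cl_from = ?",
                    (new_sortkey(prefix), NEW_COLLATION, category, cl_from))
            conn.commit()
        finished.append(category)
    return finished
```

With this shape, a category is only in a mixed state while its own batches are being processed. Links added to an already finished category while the script runs are not handled here.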

Will this actually work as expected? Having $wgCategoryCollation set to the old value would mean new additions get added with the old collation. Having it set to the new one would mean new members of a category may or may not match the collation of its current members.

Is this really any better/different/whatever to how we are proceeding now?

It would mean that only newly created category links would potentially be ordered incorrectly. Right now those are only a minority of the incorrectly ordered pages. Currently we go oldest pages first, which leaves categories almost entirely broken for as long as the script runs.
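To make the difference between the two orderings concrete, here is a toy model in Python. It is only an illustration: it assumes "oldest pages first" means processing links in page ID (cl_from) order, and `mixed_counts` is a hypothetical helper. After each converted link it counts how many categories contain a mix of converted and unconverted members.

```python
from collections import Counter


def mixed_counts(links, order_key):
    """links is a list of (cl_from, cl_to) pairs.

    Returns, after each converted link, the number of categories that
    are partially converted (some members old, some new).
    """
    total = Counter(cat for _, cat in links)
    converted = Counter()
    history = []
    for _, cat in sorted(links, key=order_key):
        converted[cat] += 1
        mixed = sum(1 for c in total if 0 < converted[c] < total[c])
        history.append(mixed)
    return history


# Four pages, each a member of the same three categories.
links = [(page, cat) for page in range(1, 5) for cat in ("A", "B", "C")]

by_page = mixed_counts(links, lambda link: (link[0], link[1]))
by_category = mixed_counts(links, lambda link: (link[1], link[0]))

print(max(by_page))      # peak number of mixed categories, page order
print(max(by_category))  # peak number of mixed categories, category order
```

In this toy data set, page order leaves all three categories mixed for most of the run, while category order never has more than one mixed category at a time.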