
cmstartsortkey: DEPRECATED! Use starthexsortkey instead
Closed, Resolved | Public

Description

Site method using Query list=categorymembers needs to be updated.

See help https://www.mediawiki.org/wiki/API:Categorymembers

  • cmstartsortkey - DEPRECATED! Use starthexsortkey instead
  • cmendsortkey - DEPRECATED! Use endhexsortkey instead
python pwb.py newitem -catr:"údržba:Wikidata|random" -lang:cs -simulate
...
WARNING: API warning (categorymembers): The gcmstartsortkey parameter has been deprecated.

Version: core-(2.0)
Severity: normal

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:57 AM
bzimport set Reference to bz72101.
bzimport added a subscriber: Unknown Object (????).

Bump

pagegenerators.CategorizedPageGenerator(workcat, namespaces=6, start=u'Ambrosius')

WARNING: API warning (categorymembers): The gcmstartsortkey parameter has been deprecated.

Okay there are two questions to be answered:

  1. When was this introduced? I found a 1.19.17 wiki which hasn't deprecated it and doesn't have the new parameter.
  2. What is the actual replacement? The documentation says cmstarthexsortkey, but we currently don't use the cmprop=sortkey value, so I'd say cmstartsortkeyprefix instead.

Haha, looking at it, if cmstartsortkeyprefix is the actual value we wanted to use, then we don't need to determine when this was added, as it was already present. This of course contradicts the current usage, though, so the question is: what do we actually want to do there?

Okay API:Categorymembers states that the hex variants were added in 1.24. And the hex sort key is basically:

sortkey = page.title()
if custom_sortkey:
    sortkey = custom_sortkey + '\n' + page.title()
sortkey = ''.join('{0:02x}'.format(b) for b in sortkey.upper().encode('utf-8'))
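
Wrapped as a function, the snippet above can be tried out directly. This is only a sketch: hex_sortkey and its arguments are hypothetical names, it needs Python 3 (where iterating over bytes yields integers), and it assumes a wiki whose category collation simply uppercases the sort key. Wikis with other collations may produce different keys.

```python
def hex_sortkey(title, custom_sortkey=None):
    """Return a hex sort key, assuming a plain 'uppercase' collation.

    Hypothetical helper: `title` stands in for page.title() and
    `custom_sortkey` for an explicitly set sort key prefix.
    """
    sortkey = title
    if custom_sortkey:
        # The custom prefix and the title are joined with a newline.
        sortkey = custom_sortkey + '\n' + title
    # Uppercase, encode as UTF-8, then write every byte as two hex digits.
    return ''.join('{0:02x}'.format(b) for b in sortkey.upper().encode('utf-8'))


print(hex_sortkey('Abc'))                      # 414243
print(hex_sortkey('Abc', custom_sortkey='x'))  # 580a414243
```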

Unfortunately cmstartsortkeyprefix only works if a sortkey has been specified explicitly, so in theory we could convert the start and end sortkeys manually in APISite.categorymembers.

jayvdb set Security to None.

Okay, looking further into it, I can only get it to work when I use one letter in cmstartsortkey, while cmstartsortkeyprefix seems to work even if the sortkey has not been specified explicitly (not sure what happened yesterday).

So it seems we actually want to use cmstartsortkeyprefix, which unfortunately was only implemented in 1.18, so we still need to figure out how to use `cmstartsortkey` on older wikis. My question there is actually: what is a binary string?

Smashing idea from the MediaWiki developers. Let's deprecate this thing that people use, and force them instead to use this other thing. Oh, and let's not describe anywhere what this other thing is or how it works. Nobody needs that, right?

Okay, it seems I found out. Each letter has to be changed to the hexadecimal ASCII code of its capital form. So to start at 'ABC' or 'Abc', one has to use cmstarthexsortkey=414243. For non-ASCII characters, I assume it's the Unicode representation (which for the ASCII set is the same). So far, so good. But it seems there are exceptions to this too: on the Dutch wiki, to have the same start, one has to use cmstarthexsortkey=27292B instead. On the Russian wiki it does not work at all: 0878087904 starts at the beginning, 08768087905 is beyond the end. On the Spanish, Swedish and Italian wikis, 00 is already beyond the end, so you cannot even get the full list this way. It's a mess.
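
The values quoted above can be inspected by decoding them back into bytes. A minimal sketch (decode_hex_sortkey is a hypothetical helper name) shows that 414243 is plain ASCII for 'ABC', while the Dutch value 27292B is not text for 'ABC' at all. That would be consistent with that wiki using a different collation, though this is an assumption and not something the observations above confirm.

```python
def decode_hex_sortkey(hex_key):
    """Turn a hex sort key back into the raw bytes it encodes."""
    return bytes.fromhex(hex_key)


print(decode_hex_sortkey('414243'))  # b'ABC'
print(decode_hex_sortkey('27292B'))  # b"')+"
```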

Change 284637 had a related patch set uploaded (by Xqt):
[T4101] Use starthexsortkey instead of startsortkey

https://gerrit.wikimedia.org/r/284637

Change 284637 had a related patch set uploaded (by Xqt):
[T74101] Use starthexsortkey instead of startsortkey

https://gerrit.wikimedia.org/r/284637

Dalba triaged this task as High priority.Aug 12 2016, 5:11 AM

The relation between the old cmstartsortkey and the new cmstarthexsortkey is as follows:

(I'm using Python 3 here)

import binascii
cmstarthexsortkey = '26589262ff277e03042692ff274d217e032627017e88010e'
cmstartsortkey = binascii.unhexlify(cmstarthexsortkey).decode('utf-8', 'ignore')

These are the sort keys for the article "چرخه رنگ‌ها" on fawiki. cmstartsortkey is the actual value stored in the database. Look at the cl_sortkey field in the categorylinks table.

To convert the old cmstartsortkey values to cmstarthexsortkey:

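cmstartsortkey = "&Xb'~\x03\x04&'M!~\x03&'\x01~\x01\x0e"
# hexlify() returns bytes; bytes dropped by decode(..., 'ignore') above are not restored,
# so this gives '265862277e030426274d217e032627017e010e', a lossy form of the hex key.
binascii.hexlify(cmstartsortkey.encode()).decode('ascii')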

Actually, we probably don't need any of these conversions. Both cmstartsortkey and cmstarthexsortkey are values that are usually obtained via the API. The old API used to both give and take cmstartsortkey and the new API uses cmstarthexsortkey. All we have to do is to change the old cmstartsortkey to cmstarthexsortkey for mw1.24+ and hopefully the users won't notice anything.
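
The switch described above could look roughly like the following sketch. sortkey_param_names is a hypothetical helper, not an existing pywikibot function, and the version is assumed to be available as a tuple of integers. The 1.24 threshold comes from the API:Categorymembers note mentioned earlier.

```python
def sortkey_param_names(mw_version):
    """Pick the start/end sort key parameter names for a MediaWiki version.

    Hypothetical helper: `mw_version` is assumed to be a tuple such as
    (1, 24) or (1, 19, 17).
    """
    if mw_version >= (1, 24):
        # The hex variants were added in MediaWiki 1.24.
        return 'cmstarthexsortkey', 'cmendhexsortkey'
    # Older wikis only know the deprecated raw sort key parameters.
    return 'cmstartsortkey', 'cmendsortkey'


print(sortkey_param_names((1, 19, 17)))  # ('cmstartsortkey', 'cmendsortkey')
print(sortkey_param_names((1, 24)))      # ('cmstarthexsortkey', 'cmendhexsortkey')
```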

If we want a human-readable way of setting the start value for the categorymembers method, we probably have to add two new arguments and pass them to the API as cmstartsortkeyprefix and cmendsortkeyprefix.
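
As a rough illustration of that idea, the request parameters could be built like this. categorymembers_params, startprefix and endprefix are hypothetical names chosen for the sketch; only the API parameter names themselves are taken from API:Categorymembers, and the prefix parameters require cmsort=sortkey.

```python
def categorymembers_params(category_title, startprefix=None, endprefix=None):
    """Build query parameters for list=categorymembers (illustrative only)."""
    params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': category_title,
        'cmsort': 'sortkey',  # the sort key prefix parameters need this sort mode
        'format': 'json',
    }
    # Only send the prefixes the caller actually supplied.
    if startprefix is not None:
        params['cmstartsortkeyprefix'] = startprefix
    if endprefix is not None:
        params['cmendsortkeyprefix'] = endprefix
    return params


params = categorymembers_params('Category:Example', startprefix='Ambrosius')
print(params['cmstartsortkeyprefix'])  # Ambrosius
```

Because the prefixes are ordinary readable strings, callers would not need to know anything about the stored sort key format.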

Just noticed that if the user is trying to pass the cl_sortkey from the database to the bot, then the change I proposed above will break the bot's operation... :/

Maybe we can find a method to detect binary string values and convert them to hexadecimal sortkeys...

I don't know why we have to use the hex codes anymore. Why couldn't we just use cmstartsortkeyprefix, which always overrides `cmstarthexsortkey`, and use sortkeyprefix as the cmprop argument?

The problem is that in the current implementation the startsort passed to site.categorymembers should be a string like &Xb'~\x03\x04&'M!~\x03&'\x01~\x01\x0e obtained from the API or database. This is only valid as a cmstartsortkey, but not as a cmstartsortkeyprefix or cmstarthexsortkey. Therefore I think it would not be backward compatible to just pass the current sortkey to cmstartsortkeyprefix.

As a test case consider the following:

import pywikibot as b
s = b.Site('fa', 'wikipedia')
c = b.Category(s, 'رده:رنگ')
a = c.articles(startsort="&Xb'~\x03\x04&'M!~\x03&'\x01~\x01\x0e")
p = next(a)
assert p.title() == 'چرخه رنگ\u200cها'

Any change should be able to pass the above test to be backward compatible.

pywikibot-compat removed due to the decommissioning of compat

Change 428148 had a related patch set uploaded (by Dalba; owner: Dalba):
[pywikibot/core@master] CategorizedPageGenerator: Use startprefix parameter of category.articles

https://gerrit.wikimedia.org/r/428148

Change 428148 merged by jenkins-bot:
[pywikibot/core@master] CategorizedPageGenerator: Use startprefix parameter of category.articles

https://gerrit.wikimedia.org/r/428148

Dalba claimed this task.

Change 284637 abandoned by Xqt:
[Fix] Use startsortkeyprefix instead of startsortkey for mw >= 1.18

Reason:
Already solved

https://gerrit.wikimedia.org/r/284637

Change 662081 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [cleanup] Drop startsort/endsort parameter for site.categorymembers method

https://gerrit.wikimedia.org/r/662081

Change 662081 merged by jenkins-bot:
[pywikibot/core@master] [cleanup] Drop startsort/endsort parameter for site.categorymembers method

https://gerrit.wikimedia.org/r/662081