
Dump the article titles lists (all-titles-in-ns0.gz) unsorted
Closed, Declined (Public)

Description

For some purposes it's useful to have the list of page titles in its natural, unsorted order.

It's trivial for anybody to sort these lists if they are distributed unsorted.

However, it's impossible for anybody to restore these lists to their original order once they have been sorted.

The sort is probably by byte or by code point, which is useful for English but wrong for most other languages. Sorting an already sorted list into a similar but different order is also close to the worst case for quicksort, which most text file sorters use.
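
To illustrate the difference between a code point sort and a more language-friendly order, here is a minimal Python sketch. The sample titles and the `folded_key` helper are made up for this example, and the helper is only a rough stand-in for real locale-aware collation (it strips accents and ignores case), not what any dump script actually does.

```python
import unicodedata


def codepoint_sorted(titles):
    # Python compares str values code point by code point.
    return sorted(titles)


def folded_key(title):
    # Hypothetical helper: decompose accented letters, drop the combining
    # marks, and ignore case. This only approximates a collation key.
    decomposed = unicodedata.normalize("NFKD", title)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Illustrative titles; "\u00c9clair" is "Éclair" with a precomposed É.
titles = ["Zulu", "apple", "\u00c9clair", "Eclipse"]

by_codepoint = codepoint_sorted(titles)
by_folded_key = sorted(titles, key=folded_key)

print(by_codepoint)   # uppercase ASCII, then lowercase ASCII, then accented
print(by_folded_key)  # accents and case no longer separate the titles
```

In the code point order, "Éclair" lands after every unaccented title, which is the kind of result a reader of a non-English wiki would not expect.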

Retrieving the page titles from the full dump is a lot more work than the sorting would be, and is more error prone.

Skipping the sort would also save a bit of work on the servers, which currently sort the list for every wiki each time a new dump is made.


Version: unspecified
Severity: enhancement

Details

Reference
bz14415

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 21 2014, 10:08 PM
bzimport set Reference to bz14415.

There's no such thing as unsorted, just different possible sort orders.

They come out of the database in raw index order (namespace, title). If you want some other order, you can trivially sort them yourself.

Please look at one of these files. It's "raw index order" I'm asking for.

The files are not provided in "raw index order" but in fact are provided sorted.

They come out in the natural index order. Here's the query:

"select page_title from page where page_namespace=0;"

That would follow the (page_namespace,page_title) index.
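
A small sketch of why the output looks sorted even though the query has no ORDER BY. It uses an in-memory SQLite database as a stand-in for the production database, with a cut-down `page` table and an index named `name_title` chosen for illustration; the rows are invented. Without an ORDER BY no database promises a particular order, so this only shows what an index-driven scan typically returns.

```python
import sqlite3


def build_demo_db():
    # Cut-down stand-in for the page table; only the columns the query needs.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page ("
        " page_id INTEGER PRIMARY KEY,"
        " page_namespace INTEGER NOT NULL,"
        " page_title TEXT NOT NULL)"
    )
    # Index on (page_namespace, page_title); the name is illustrative.
    conn.execute(
        "CREATE UNIQUE INDEX name_title ON page (page_namespace, page_title)"
    )
    # Rows are inserted in page ID order, which is not title order.
    rows = [
        (1, 0, "Zebra"),
        (2, 1, "Zebra"),
        (3, 0, "apple"),
        (4, 0, "\u00c9clair"),
        (5, 0, "Apple"),
    ]
    conn.executemany("INSERT INTO page VALUES (?, ?, ?)", rows)
    return conn


def dump_titles(conn):
    # The query quoted above, with no ORDER BY clause.
    cursor = conn.execute("select page_title from page where page_namespace=0;")
    return [row[0] for row in cursor]


conn = build_demo_db()
titles = dump_titles(conn)
print(titles)
```

Because the scan walks the index, the titles come back in the index's binary order rather than in insertion or page ID order, which is why the dumped file looks sorted.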

Note that if you want them in some other order, like say page ID order, you can get that by pulling the stub XML dumps. These include page & revision metadata, ordered by page ID then revision ID. (The current-version only ones would only include the latest revision, which you can easily discard.)
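
For anyone who wants the page ID order, pulling titles out of a stub dump could look roughly like the sketch below. It assumes the usual export layout of `<page>` elements with direct `<title>`, `<ns>` and `<id>` children, ignores the XML namespace URI so it is not tied to one schema version, and runs against a tiny inline sample rather than a real dump. The function names and the file name in the final comment are placeholders.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    # "{http://...}page" -> "page"; tags without a namespace pass through.
    return tag.rsplit("}", 1)[-1]


def iter_titles(stream, namespace="0"):
    """Yield (page_id, title) in file order for pages in one namespace."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        # Only direct children are read, so the revision's own <id> is ignored.
        fields = {local_name(child.tag): child.text for child in elem}
        if fields.get("ns") == namespace:
            yield int(fields["id"]), fields["title"]
        elem.clear()  # drop the page's children to keep memory use down


# Tiny invented sample shaped like a stub dump.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>Zebra</title><ns>0</ns><id>1</id><revision><id>10</id></revision></page>
<page><title>Talk:Zebra</title><ns>1</ns><id>2</id><revision><id>11</id></revision></page>
<page><title>Apple</title><ns>0</ns><id>3</id><revision><id>12</id></revision></page>
</mediawiki>"""

pages = list(iter_titles(io.BytesIO(SAMPLE)))
print(pages)

# With a real file one would pass a gzip stream instead, for example:
#   with gzip.open("stub-dump.xml.gz", "rb") as f:   # placeholder file name
#       for page_id, title in iter_titles(f): ...
```

Since the stub dumps are already ordered by page ID, no sort step is needed; the titles simply come out in file order.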