Page MenuHomePhabricator

Add methods for getting an array of titles from a category
Closed, ResolvedPublic

Description

Author: robert

Description:
Adds two method to Category, getMemberIds() and getMemberTitles().

Several extension deal with all the members of a category in some way, it would be useful to have methods that allow one to get an array containing titles for all members in a category.

The attached patch adds two methods to the Category class:

  • getMemberIds() for getting an array of category page ids,
  • getMemberTitles() for getting an array of title objects for these page ids.

The arrays behind these are only generated when the methods are actually called (but are cached in member variables), to save on memory usage; and are ordered by the category sort key.

While this solution is clearly only adequate on smaller categories, most of the extensions that require such methods are aimed at smaller wikis - and this is therefore not a significant issue. It is often the case that certain extensions would break or need redesigning if faced with thousands, rather than hundreds, of pages in a category (this is not necesserily a bad point however, due to the primary target of such extensions). In the future if large categories need to be scanned, heavy duty progressive methods could be introduced in some way - but this is not yet necessary.

I'd appreciate it if someone could take a look over the patch, give comments, and possibly advice as to whether or not it would be suitable for core integration - after which case I will commit it if there is agreement.


Version: unspecified
Severity: enhancement

attachment categories.diff ignored as obsolete

Details

Reference
bz14923

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 10:15 PM
bzimport set Reference to bz14923.

And what if one of these functions is run on [[Category:Living people]]? There must be some kind of limit.

robert wrote:

As I pointed out in the bug description, this is aimed at extension being used on smaller wikis - but placed in core due to the number of extensions that could use it. It would be silly to use this in core at all, due to the very reason you pointed out, and thus I see no reason to place any artificial limitation on it. As I was careful to indicate, any traversal of a larger category would require a progressive function. If this feature is ever commited I will, however, place a comment pointing out the danger of using this in core and how it would not be suitable for anything aimed at large wikis.

ayg wrote:

Thoughts:

  • If this is going to be done, I'd add offset and limit parameters and incorporate it into CategoryPage.php, replacing doCategoryQuery(). This would also allow third-party extensions to iterate through categories in a reasonable fashion.
  • What's the point of getting the ID's? ID's are practically useless by themselves. I'd ditch that method and just return a title array.
  • I would *not* cache the result. There's a pretty good chance that it's large, so caching is a bad idea. The caller can cache it if necessary. If it's not, we don't want a ton of memory being hogged.
  • Look over your whitespace. You seem to have lots and lots of blank lines for no particular reason. Don't make the first or last line of a function blank, and use blank lines within a function to separate groups of lines. Don't put a blank line between every line.
  • To get the titles, you're:
    1. querying the database for ID's,
    2. copying all those ID's into a PHP array (which is much less efficient than a MySQL result, according to Tim),
    3. caching that array even though nobody ever asked for it,
    4. querying the database *again* to get the info you need from the ID's (in Title::newFromIDs),
    5. copying those into an array of Title objects (which is horribly memory-inefficient: how big is each Title? how much overhead per array item),
    6. caching *that* array,
    7. returning it. I would strongly suggest that instead you do the full query to retrieve all necessary fields directly, then return the result wrapper. Or better yet, an efficient TitleArray class akin to the UserArray class we now have would not go amiss.

robert wrote:

Revised method adding only the Category::getMembers() function with limit and offset parameters

I have taken into account your advice here and on IRC and have created a revised method Category::getMembers() that accepts limit and offset parameters. It uses the TitleArray class you developed and contains only one blank line (this seperates the actual querying and the construction of the limit\offset query options, which have to be added to the ORDER BY if needed). No caching is done as this would no longer be sensible given the limit and offset parameters.

attachment categories.diff ignored as obsolete

ayg wrote:

+ /**
+ * @param integer $limit
+ * @param integer $offset
+ * @return TitleArray object for category members.
+ */

More documentation, please. Give a one-sentence description of the method's purpose, at least, like "Fetch a TitleArray of up to $limit category members, beginning after the category sort key $offset." (See remarks below on changing $offset to a string.)

Also, I'm told we're supposed to be using "@param $name type", not the reverse:

http://www.mediawiki.org/wiki/Manual:Coding_conventions#Inline_documentation

+ public function getMembers( $limit = false, $offset = 0 ) {
+ $options = array( 'ORDER BY' => 'cl_sortkey' );
+ if( $limit ) $options += array( 'LIMIT' => $limit, 'OFFSET' => $offset);

Do not use OFFSET if you can possibly avoid it (except if you know that it's going to be acceptably small, which you don't). What you want to do is something like

		$dbr = wfGetDB( DB_SLAVE );
		$conds = array( 'cl_to' => $this->getName() );
		$options = array( 'ORDER BY' => 'cl_sortkey' );
		if( $limit ) {
			$options['LIMIT'] = $limit;
		}
		if( $offset !== '' ) {
			$conds []= 'cl_sortkey > ' . $dbr->addQuotes( $offset );
		}

LIMIT m OFFSET n is inefficient: it basically just calculates the first m + n rows and throws away the first n. The technique of using your sort value as an offset (which is used in Pager.php, for instance, and all pages that use it) is much more efficient for very large offsets, since it only scans the rows that are needed. Pagination can be achieved by passing the last sort key in the previous batch as the offset in the new batch. Note how category pages have offsets like "from=Bob_Smith", not "from=200", and the same is true for most other pages.

+ $dbr = wfGetDB( DB_SLAVE );
+ $members = TitleArray::newFromResult(
+ $dbr->select(
+ array( 'page', 'categorylinks' ),
+ array( 'page_id', 'page_namespace', 'page_title', 'cl_to', 'cl_from', 'cl_sortkey' ),

This is wrong. You're never going to use cl_to, cl_from, cl_sortkey. Make it

				array( 'page_id', 'page_namespace', 'page_title', 'page_len', 'page_is_redirect', 'page_latest' ),

as documented in the comment for TitleArray::newFromResult().

+ array( 'cl_to' => $this->getName() ),
+ METHOD,
+ $options,
+ array( 'categorylinks' => array( 'INNER JOIN', 'cl_from = page_id' ) )

Using the join thing here should work, but it's clunky. Really this is only needed for left/outer joins. You could drop this line and add 'cl_from = page_id' to the conditions, for the same effect. But either way works.

+ )
+ );
+ return $members;

It's not a problem, but it would be a little simpler (one line less) if you just replaced "$members = ..." above with "return ...", and skipped this line.

robert wrote:

Revised version with better comments and string offsets.

Takes into account all of the comments.

Attached:

ayg wrote:

Looks good. If you've tested it and it works, I say go ahead and check it in.