Page MenuHomePhabricator

Change AutoCreateCategoryPages to not have to grab entire namespace
Closed, ResolvedPublic

Description

. Change line 13 to $res = $dbr->select( 'page', 'page_title', array( 'page_namespace' => NS_CATEGORY, 'page_title' => $existing_cats) );

User @labster is busy and does not have time to create an upstream task at the moment so I am doing it for him

diff --git a/AutoCreateCategoryPages.body.php b/AutoCreateCategoryPages.body.php
index f0646d3..637e3b0 100644
--- a/AutoCreateCategoryPages.body.php
+++ b/AutoCreateCategoryPages.body.php
@@ -2,15 +2,15 @@
 
 class AutoCreateCategoryPages {
 		/**
-	 * Get an array of existing categories, with the name in the key and sort key in the value.
+	 * Get an array of existing categories on this page, with the unprefixed name
 	 *
 	 * @return array
 	 */
-	static function getExistingCategories() {
-		// TODO: cache this. Probably have to add to said cache every time a category page is created,
-		// by us or manually
+	static function getExistingCategories( $page_cats ) {
 		$dbr = wfGetDB( DB_SLAVE );
-		$res = $dbr->select( 'page', 'page_title', array( 'page_namespace' => NS_CATEGORY ) );
+		$res = $dbr->select( 'page', 'page_title',
+			array( 'page_namespace' => NS_CATEGORY, 'page_title' => $page_cats )
+		);
 
 		$categories = array();
 		foreach ( $res as $row ) {
@@ -37,7 +37,7 @@ class AutoCreateCategoryPages {
 		// array keys will cast numeric category names to ints
 		// so we need to cast them back to strings to avoid potentially breaking things!
 		$page_cats = array_map( 'strval', array_keys( $page_cats ) );
-		$existing_cats = self::getExistingCategories();
+		$existing_cats = self::getExistingCategories( $page_cats );
 
 		// Determine which categories on page do not exist
 		$new_cats = array_diff( $page_cats, $existing_cats );

"The logic here is that there is already an index on (page_namespace, page_title) so lookups would be very fast, given the number of categories typically on a page. But as it stands it grabs the entire namespace, which could be tons of database network traffic on a large wiki.

select sum( length(page_title) ) from page where page_namespace = 14;
[sum( length(page_title) )] => 1091582
1MB of internal db traffic per page save would be not cool." -@labster

Event Timeline

Hi, I'm here, a little. Just going to say that caching the entire category namespace sounds like a nightmare, if only because of the cache invalidation problem. So I just removed the TODO, because that seems like the Wrong Thing.

Hmm, yes, that makes a *lot* of sense :-)
I'll update the code soon. Thank you very much!

Sure, this is my responsiblity to maintain, after all. @labster, any particular form of credit? Currently it's just a line in the changelog:

  • Version 1.0.2, 2017-03-25: A much improved query of pre-existing categories, by Brent Laabs (labster).

(BTW, when looking at this I noticed I actually meant to switch over to use MW's LinkBatch to check for existing categories instead of using a direct query - although I never checked if it works for category links or just for regular links).

That credit works for me.

I guess on thinking about it a little further, at this stage of the parse, shouldn't all of the category links have been rendered? I'm thinking they should already be in the LinkCache by then. So instead of doing any DB query, maybe just use Title->exists() on all of the categories? Skipping the ones that do exist, of course.

I don't 100% know the internals on this, but maybe try it and see if it results in less queries?

Change 352352 had a related patch set uploaded (by FreedomFighterSparrow; owner: FreedomFighterSparrow):
[mediawiki/extensions/AutoCreateCategoryPages@master] A much improved query of pre-existing categories, by Brent Laabs (labster)

https://gerrit.wikimedia.org/r/352352

Sorry for taking so long on this... I have uploaded a patchset with your changes and merged it, @labster, and will merge it as soon as gerrit allows me.
I'm busy with other things ATM, so I'll leave the improved existence-check to another time.

Change 352352 merged by FreedomFighterSparrow:
[mediawiki/extensions/AutoCreateCategoryPages@master] A much improved query of pre-existing categories, by Brent Laabs (labster)

https://gerrit.wikimedia.org/r/352352

@FreedomFighterSparrow Can we assume by the merge above that the issue is now resolved?