Page MenuHomePhabricator

Entries for non-existent categories with no members should be deleted from the 'category' table
Closed, ResolvedPublic

Description

As far as I can tell, entries are never deleted from the category table. This means that if a category is added to a page and removed, it will be in the category table forever, even if a category description page never existed.

Its probably safe to say that if a category has no pages in it and has no corresponding description page, that it ceases to be a category, since it doesn't exist in any visible place on the wiki.

On the English Wikipedia, more than 37% of the category table are these non-existent categories.

mysql> SELECT COUNT(*) FROM category LEFT JOIN page ON page_namespace=14 AND page_title=cat_title WHERE page_id IS NULL AND cat_pages=0 AND cat_files=0 AND cat_subcats=0;
+----------+
| COUNT(*) |
+----------+
|   419270 |
+----------+

This is out of 1112968 entries in the category table.

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 11:14 PM
bzimport set Reference to bz26411.
bzimport added a subscriber: Unknown Object (MLST).

I'm not sure that this is actually a problem. Is it?

Its not a user-visible problem, but it is completely unnecessary clutter in the database, and could be potentially annoying to toolserver users and people using the SQL dumps.

EN.WP.ST47 wrote:

-> enhancement

If not an actual change to the category handling code, perhaps there could be a maintenance script to this effect?

I have to agree with Alex and Dan on this...data that serves no purpose and is unrelated to anything is usually considered A Bad Thing in database administration. It causes bloat, potential slow-downs (not that that's a big concern when you've got a bazillion servers <g>), and is just that much more data that might cause problems in a corruption or large-scale-change scenario. Something should definitely be implemented to clean up the unused entries.

EN.WP.ST47 wrote:

I figured that it would be nice to completely embarrass myself by writing some actual code. Attached patch adds a function to Category.php that deletes a category from the category table, and adds a maintenance script which calls that on all categories which do not have any subcats, pages, files, or a description page. I don't have a wiki to test this on, at least not one I have shell to, and I really don't know how things work, so it would be nice to know what I screwed up.

EN.WP.ST47 wrote:

Add a deleteFromTable to Category.php, new maintenance script

One of the things that I screwed up is that I forgot to attach the actual patch.

attachment CategoryUnPersistence.patch ignored as obsolete

EN.WP.ST47 wrote:

Updated patch

There was a syntax error in that maintenance script, and apparently my !== was too specific for PHP to handle. I also added the necessary code to a deleteIfUnused() function in Category.php, although the maintenance script doesn't use it since it needs the --dry-run functionality. This script is tested against r92419 with categories containing only pages, though the logic is obviously the same for subcats and files. Also tested with both extant and non-extant category description pages. Only categories with no members and no description page are removed from the database, and there is no bad behavior when viewing old revisions or anything like that.

Attached:

(In reply to comment #1)

I'm not sure that this is actually a problem. Is it?

A user hit this bug today because UploadWizard relies on MediaWiki's allcategories API module to list category suggestions. The allcategories API module reads directly from the category table, so users end up selecting category suggestions for bogus categories (categories without any members or a description page). This is most certainly a problem.

This table is used in several places to get a list of categories. One of those places is ApiQueryAllCategories. This api is used to get suggestions for category titles. And is now returning suggestions for categories that have only existed by accident in the past and have no category-namespace page and no members.

Raising priority and adding 1.19 blocker (hopefully sooner, but it's hitting UploadWizard and causing bug 30718)

sumanah wrote:

Dan, is there any reason that we should keep a record of what categories used to exist (that is, archive that list somewhere while deleting them from the Category table)?

EN.WP.ST47 wrote:

Well, none that I can think of, and using that script on my test wiki provided no ill effect. These categories are deleted anyway, looking at old versions of pages that used to be in them will still show a redlink in the categories list either way, and I can't think of any use for a list of all categories past and present. Our lists of pages and files also don't include deleted items, and as others have said it's causing problems that categories refuse to go away when you delete them. UploadWizard had to add a workaround to avoid suggesting nonexistent categories.

(In reply to comment #9)

This table is used in several places to get a list of categories. One of those
places is ApiQueryAllCategories. This api is used to get suggestions for
category titles. And is now returning suggestions for categories that have only
existed by accident in the past and have no category-namespace page and no
members.

Raising priority and adding 1.19 blocker (hopefully sooner, but it's hitting
UploadWizard and causing bug 30718)

Note, uploadWizard seems to have been fixed independently of this.

If we really care about this sort of thing, we could delete the category entries every time cat_pages gets down to 0 and the description page doesn't exist. But then we could have a lot of deleted and then recreated and then deleted entries in the category table.


Re: comment 7 (the patch)

Looks mostly good I think (I didn't take a super in-depth look, but from my quick read over, it looks pretty good). Two issues I see so far:
*When getting the entries from the category table in the maintinace script, you should sort by cat_pages. There's an index on this so it should be fast, and its a lot quicker to delete everything with 0 entries if you start with the things with 0 entries instead of going through all entries
*If you already have a category row available, do Category::newFromRow( $row ); (or even just directly look at the values in $row). Doing Category::newFromName will result in a second query for the category table which you already have.
*I don't like the idea of introducing methods that aren't used.

switch to milestone, remove release tracking dep

pushing to future since release is imminent.

Is it my imagination or has this not yet been published, despite being noticed and corrected three years ago? Certainly Wikipedia's "List of" categories show that the problem still exists. Out of the first 100, it looks like one is a redirect (List of Albanians) and none of the others exist at all.

https://en.wikipedia.org/w/api.php?action=query&list=allcategories&acprefix=List_of&aclimit=100&acprop=size

vs.

https://en.wikipedia.org/w/api.php?action=query&list=allpages&apfrom=List%20of&apnamespace=14&aplimit=2

I've tagged this for the next release, but this is my first time doing that, so if I did anything wrong, please correct me!

(In reply to Robert Morley from comment #15)

Is it my imagination or has this not yet been published, despite being
noticed and corrected three years ago?

It was noticed. It was not corrected. Dan proposed making a script that could be run manually to remove categories without any entries. I had some mild objections at the time. My biggest objection is that if we don't want such categories, we should have something automatic to remove them (ie not a maintenance script), where the comment 7 solution would require a human to press a button every so often to get rid of the extra categories.

I've tagged this for the next release, but this is my first time doing that,
so if I did anything wrong, please correct me!

Those tags aren't treated too seriously. Generally though I would only recommend setting target milestones if you are in a position to fix the bug.


So possible solutions to this bug:
*Have allcategories default to having the acmin parameter be 1. This is closer to the probable intent of users. Downside - will exclude pages in the category namespace that don't have any category members. However it should be noted that this is already the case. [[Category:*NSYNC members]] is a thing, but https://en.wikipedia.org/w/api.php?action=query&list=allcategories&acprefix=*N&aclimit=100&acprop=size doesn't include it.

Another possible solution (perhaps more proper):
Change line 3384 on includes/WikiPage.php to delete the entry in the category table iff. cat_pages = 0 and there is no description page for that category. If we did that, we would probably want to ensure that creation of a page in the category namespace creates an entry in the category table even if it has no entries.

I hadn't looked closely at the patch and thought he'd added both the automatic deletion functionality and the maintenance script. I see now that that was incorrect.

For your first solution, I think your point about *NSYNC means that a default of acmin=1 would be an easy-to-write interim solution.

I really feel that the second solution is the way to go in the long run, though. The category table never getting cleaned up is basically a resource leak, which is bad in its own right, but could also conceivably be used to expose unwanted profanity or sensitive data with no easy way to get rid of it.

I was thinking about sometimes-empty categories, but even if a category is empty, it'll almost always have a category page associated with it (in which case it should be kept, as you mentioned). Even if it doesn't, I don't think deleting it will really matter. I suppose you might want it left as a category suggestion or what have you, but really, if that's what you want, you should be creating a page for the thing anyway.

This is an urgent issue.

Our app lets people upload pictures to Commons and choose categories for them. This bug often makes people's categorization efforts worthless.

It also makes people think "Proposed categories are trash anyway, so let's not care about this whole useless categorization thing".

This is an urgent issue.

Our app lets people upload pictures to Commons and choose categories for them. This bug often makes people's categorization efforts worthless.

It also makes people think "Proposed categories are trash anyway, so let's not care about this whole useless categorization thing".

So then the app should be fixed...

@Bawolff It is a joke right? Our app launches a category search and gives the results to the user as category suggestions. Because of the present bug the server gives us trash categories, which means the server needs fixing.

Are you suggesting that after doing a category search, for every single result we should launch additional requests to check whether that category is actually really in use?

I agree with Nicolas_Raoul here. There's nothing wrong with the app, the server should not be storing useless, deleted categories that have neither pages on the wiki, nor any category tags associated with them. As I said almost two years ago, this is a resource leak, and a malicious user could theoretically add to the table indefinitely, not that that's likely.

Another issue that I hadn't considered at the time, but have come across since in another context, is that it also presents a security issue. A malicious user could publish sensitive information as, for example, Category:RobsHomePhone800-555-1212, which would then be permanently available to anyone with the know-how to find it, even if the relevant pages were deleted. (Oops, apparently I had thought of that a couple of years ago...just noticed it in my earlier comment.)

@Bawolff It is a joke right? Our app launches a category search and gives the results to the user as category suggestions. Because of the present bug the server gives us trash categories, which means the server needs fixing.

Are you suggesting that after doing a category search, for every single result we should launch additional requests to check whether that category is actually really in use?

No, im suggesting you should add &acmin=1 to your category query (assuming you are using list=allcategories modules. If your doing some other sort of query then you might have to use some other parameter depending on your needs)

The situation with empty categories is certainly non ideal...but ultimately if an api doesnt offer what you need, and there exists apis that do, you should use the one that meets your needs.

I still think this bug should be addressed, but the fact there is an app doesnt really make it any more or less urgent imo

matmarex renamed this task from Entries for non-existent categories should be deleted from the 'category' table to Entries for non-existent categories with no members should be deleted from the 'category' table.Jul 13 2016, 4:12 PM
matmarex updated the task description. (Show Details)
matmarex set Security to None.
matmarex removed a subscriber: wikibugs-l-list.

This doesn't look very difficult to fix. The relevant code is in WikiPage::updateCategoryCounts(). It could be easily extended to check, for every removed category, if the 'categorylinks' entry being removed is the last one; and if so, delete the 'category' table entry too. A simple maintenance script would also be needed to delete any existing 'category' table entries for empty non-existent categories in batches (look at maintenance/deleteSelfExternals.php for an example).

The best way to have things fixed quickly is to fix them yourself ;) I'll be happy to review any patches.

Change 298791 had a related patch set uploaded (by Anomie):
Only store currently-existing categories in the categories table

https://gerrit.wikimedia.org/r/298791

Ha, I put this on my todo list after closing T139911 because fixing it would be a nice thing to do, and now that I wrote a patch I see there was a lot of activity here since.

Related: https://en.wikipedia.org/w/index.php?title=Wikipedia_talk:Database_reports&oldid=730539128#Zombies_in_the_database.

MariaDB [enwiki_p]> select count(*) from category left join page on cat_title = page_title and page_namespace = 14 where page_title is null and cat_pages = 0;
+----------+
| count(*) |
+----------+
|  1040481 |
+----------+
1 row in set (24.60 sec)

Assuming this query is mostly correct, we have over 1 million orphaned category entries on the English Wikipedia alone. :-/

Change 298791 merged by jenkins-bot:
Only store currently-existing categories in the categories table

https://gerrit.wikimedia.org/r/298791

Marking this resolved now. Use T140811: Run maintenance/cleanupEmptyCategories.php to track the need to run the maintenance script to clean up the existing data on WMF wikis.