Automatic category redirects
Open, LowPublic

Description

Author: p_simoons

Description:
Supposing category:A redirects to category:B.
Would it be feasible to automatically move all articles placed in cat:A into
cat:B instead?
Alternateively, would it be possible to create a Specialpage that lists all
categories that are redirects, so that a bot can do the moving?


Version: unspecified
Severity: enhancement
See Also: T7346

There are a very large number of changes, so older changes are hidden. Show Older Changes

ayg wrote:

*** Bug 8685 has been marked as a duplicate of this bug. ***

ayg wrote:

I'm hardly an SQL expert, I'm afraid, but any particular reason you added an extra query rather than joining? I doubt it makes much difference, though, performance-wise.

I'm a bit alarmed that the change has to be made separately for the API, rather than both calling a general-purpose public method of something, but I guess that's a separate issue.

I'll take a look at this and hopefully commit something today or tomorrow. Although I notice a few bugs assigned to me that I've totally forgotten about, so let's hope this doesn't become one. :D

(In reply to comment #24)

A solution was proposed using the redirects table in bug 8685 ...

(In reply to comment #27)

I'm hardly an SQL expert, I'm afraid, but any particular reason you added an
extra query rather than joining? I doubt it makes much difference, though,
performance-wise.

The JOIN suggestion suggested in bug 8685 didn't work (selected the wrong data from the page table), and since I'm not particularly good at writing complex SQL queries either, I decided to do it this way. I think my way may actually be faster, since the latter query is still a regular category lookup which is indexed. A complex JOIN statement wouldn't have the indexing benefit (correct me if I'm wrong).

I'm a bit alarmed that the change has to be made separately for the API, rather
than both calling a general-purpose public method of something, but I guess
that's a separate issue.

This is partly due to the fact the API provides much more information and filtering options than you'll ever need in a regular page. Also, the current code mixes DB code with UI code, which makes a lot of functions unusable for the API. Article.php and EditPage.php are good examples.

I'll take a look at this and hopefully commit something today or tomorrow.
Although I notice a few bugs assigned to me that I've totally forgotten about,
so let's hope this doesn't become one. :D

We're all humans, we all need breaks ;) take your time.

ayg wrote:

Checking EXPLAIN shows that the query will use a filesort due to replacement of simple equality with a check for IN. I got the same trying the one-query join technique, adjusted to give correct results. Domas will probably kill me if I add a gratuitous filesort to every category, so I (we) will have to ask him why it's filesorting and how to stop it.

(By the way, more easily fixed but also significant, your check for rd_title='title' alone can't use the redirect table's indexes, because the index is on (rd_namespace, rd_title). You need to add 'rd_namespace' => NS_CATEGORY to the conditions for that query to be efficient.)

(In reply to comment #29)

(By the way, more easily fixed but also significant, your check for
rd_title='title' alone can't use the redirect table's indexes, because the
index is on (rd_namespace, rd_title). You need to add 'rd_namespace' =>
NS_CATEGORY to the conditions for that query to be efficient.)

By all means do so. I know just enough about MySQL to get by, and have no idea how all those optimizations work.

(In reply to comment #29)

Checking EXPLAIN shows that the query will use a filesort due to replacement of
simple equality with a check for IN. I got the same trying the one-query join
technique, adjusted to give correct results. Domas will probably kill me if I
add a gratuitous filesort to every category, so I (we) will have to ask him why
it's filesorting and how to stop it.

I don't really understand any of that (for instance, my query doesn't use IN AFAIK), but I understand it's a performance problem. How can that be solved?

ayg wrote:

Your code contains an IN because you have 'cl_to' => $titles as a condition, with $titles an array, and that translates to (cl_to IN ($titles)). I don't know how it can be solved, try asking Domas or someone.

ayg wrote:

After discussion with Domas, it seems that any attempt to check for redirects in the current schema will *probably* cause a filesort, or at least all the ones suggested did. We probably need a new field, cl_real_to or something, that has the redirect pre-resolved. When adding a category to a page, the actual target would be put in cl_to as now; then if it's a redirect, the redirect target would be put in cl_real_to, otherwise that would be a copy of cl_to (or it would be NULL, depending on which works better). Then cl_real_to would be used for displaying category pages in place of cl_to. Whenever a category is changed to a redirect, or the target of a category redirect is changed, categorylinks would be updated appropriately.

River pointed out that if cl_real_to is an id instead of a title, it will persist across renames of the category. But Rob pointed out that that only works if the category has an associated page. River then suggested a cat_id, which may or may not be going too far for this exercise. We can always stick updates for cl_real_to in the job queue, basically mimicking the current bot-update situation.

(In reply to comment #33)
I think cl_real_to is the way to go. Queries would be indexed, you'd have something like WHERE cl_to='title' OR cl_real_to='title';. Updating cllinks would be the simple (but potentially massive) query UPDATE categorylinks SET cl_real_to='redirtarget' WHERE cl_to='redirname';.

Eugene.Zelenko wrote:

*** Bug 15742 has been marked as a duplicate of this bug. ***

catlow wrote:

Sorry, the technical stuff is over my head, but can someone explain what the chances are of this being fixed? What effect do the patches referred to have? Can we expect members of redirected categories to show up on the target category page?

Sebastian wrote:

Just came here from http://en.wikipedia.org/wiki/Wikipedia_talk:Categories_for_discussion#Category_redirects and wanted to add a vote for this bug.

Disclaimer: It's a long time that I last was here, and I didn't find a "vote" feature, as the Mozilla db has. Also, I didn't spend much time to understand the discussion of this and related bugs, and I don't understand the difference to bug 710, which is supposedly fixed.

ayg wrote:

(In reply to comment #37)

I didn't find a "vote"
feature, as the Mozilla db has.

Bottom right, "Vote for this bug" (Ctrl-F for "vote" would have found it).

I don't understand the difference
to bug 710, which is supposedly fixed.

Bug 710 is about redirects to a category working when you navigate directly to the page, the same way they work for other pages. Prior to that bug's resolution, I guess "#REDIRECT [[:Category:XYZ]]" would do nothing, or be buggy or something (it was before my time). This is about redirects actually including one category's contents in another.

PhiLiP.NPC wrote:

may fixed in r46706.

catlow wrote:

This seems not to be working (at least, I just tried it on English Wikipedia and it didn't work - I don't know if it's supposed to be live there yet).

Wikipedia is still on r46424, see Special:Version.

catlow wrote:

I'm glad this is going to be solved soon, but there is the problem of potential exploitation by vandals. I've filed a new bug (bug 17461) to address this.

catlow wrote:

Don't like to sound impatient, but when can we expect this fix to actually come live?

catlow wrote:

OK, it is live, thanks. There still seems to be a slight problem, though, in that you can't get a list of members of the redirected category specifically. I've raised this in a new bug (bug:17571).

ipatrol6010 wrote:

It has been decided that the change will be made in the next mediawiki full release ( http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/RELEASE-NOTES?view=markup ) , so just be patient.☺

Reverted, see CodeReview r46706.

catlow wrote:

Why has this potentially helpful change been reverted? It seemed to be working well; the only problem was bug 17571, which surely can't be difficult to fix. We know that the tables don't get updated straight away when a category is changed to/from a redirect, and I presume we wouldn't want them to. Bots would handle emptying existing categories when they get redirected, exactly as they do now.

(In reply to comment #47)

Why has this potentially helpful change been reverted? It seemed to be working
well; the only problem was bug 17571, which surely can't be difficult to fix.
We know that the tables don't get updated straight away when a category is
changed to/from a redirect, and I presume we wouldn't want them to. Bots would
handle emptying existing categories when they get redirected, exactly as they
do now.

The tables indeed were not updated straight away, in fact they were not updated at all, ever. You'd have to have a bot go through and edit every page in the category, every time the redirect status or redirect target changed.

It's possible to do these updates immediately, with negligible performance loss, and to retire the bots. But it would be much more difficult to implement that feature if the categorylinks table was significantly polluted with spurious links from r46706.

catlow wrote:

I don't think it was ever envisaged that the tables would be updated automatically (I didn't think that was desirable anyway, since inappropriate redirects of large categories, and subsequent reversions, would cause lots of extra processing, of the sort that doesn't seem to happen when e.g. templates with categories get updated). But if you say it can be done, then we'll wait in eager anticipation...

(In reply to comment #49)

I don't think it was ever envisaged that the tables would be updated
automatically (I didn't think that was desirable anyway, since inappropriate
redirects of large categories, and subsequent reversions, would cause lots of
extra processing, of the sort that doesn't seem to happen when e.g. templates
with categories get updated). But if you say it can be done, then we'll wait in
eager anticipation...

Templates with categories don't cause immediate updates because those updates are put in the job queue and executed later. Presumably, updating for category redirect changes would also use the job queue.

ayg wrote:

Templates with categories don't cause immediate updates because those updates require reparsing of large numbers of pages. Category redirects don't, I don't see any reason why they should need the job queue. Except for really giant categories, maybe, where you'd want to batch the updates to not lag the slaves.

Making a "normal" Category a category can be done straight away, but unredirecting a category requires reparsing all category members.

ayg wrote:

Or adding an extra column to categorylinks. That seems like a better idea, unless un-redirecting is expected to be very rare.

That's probably the way to go. What would be that column?

ayg wrote:

cl_to_original or such, an unredirected variant of cl_to. Then if a redirect chain changes, you could do UPDATE categorylinks SET cl_to='New_redirect_target' WHERE cl_to_original IN ('Original_category1', 'Original_category2');. You'd want an index on cl_to_original, of course, so this is a pretty heavyweight addition to the table.

ipatrol6010 wrote:

I think that the best solution is you place [[A]] into [[Category:Foo]] and Foo redirects to Bar so you see [[A]] in [[Category:Bar]] and clicking on the catlink to Foo leads you to Bar. For commons they can have "co-categories" where a member of one co-category is visible in all other co-categories. This can be done by having all the categories have [[;Category:Fu]] [[;Category:Fuz]] [[;Category:Faz]] [[;Category:(...)]]

catlow wrote:

Hello, is anyone still working on this? Any progress lately? It all seemed to be going so well at one point...

Sorry my following observation is probably noted above, but I didn't check.

On [[Page A]] put "[[Category:C1]]".
Now on [[Category:C1]] put
"#REDIRECT [[Category:C2]]".

Note how Page A is not listed on Category:C2.
Instead the only way to hunt down Page A in the categories is to
visit Category:1&redirect=no !

(In reply to comment #58)

visit Category:1&redirect=no !

I meant Category;C1&redirect=no. The redirect=no part is not something the average user will know to try. So the category entry is effectively lost in this sense.

*Bulk BZ Change: +Patch to open bugs with patches attached that are missing the keyword*

  • Bug 32262 has been marked as a duplicate of this bug. ***

sumanah wrote:

Adding the keywords that seem right -- if the patches still need reviewing, please change "reviewed" to "need-review".

Qgil added a comment.Mar 23 2013, 6:24 PM

This feature request is being proposed at

http://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Automatic_category_redirects

and I'm considering whether to add it or not to

https://www.mediawiki.org/wiki/Summer_of_Code_2013#Project_ideas

Question:

Is there a potential mentor willing to help potential students interested in
this project?

Is there a reasonable support from the MediaWiki core maintainers to incorporate this feature if it's developed and meets the quality criteria?

Without these qualifications in place we can't even consider the proposal for
GSOC 2013.

(In reply to comment #63)

Question:

Is there a potential mentor willing to help potential students interested in
this project?

Yes me :)

Is there a reasonable support from the MediaWiki core maintainers to
incorporate this feature if it's developed and meets the quality criteria?

I think so. Would require schema changes which is the only bit that could potentially be sticky.

The more you know...

The current query for getting category members is:
SELECT ...
FROM page
INNER JOIN categorylinks

FORCE INDEX (cl_sortkey)
ON ((cl_from = page_id))

LEFT JOIN category

ON ((cat_title = page_title) AND page_namespace = '14')

WHERE cl_to = 'Test' AND cl_type = 'page'
ORDER BY cl_sortkey
LIMIT 201

And, true enough, if you change the cl_to check from a comparison to an IN operator, it triggers a filesort. *However*, if you instead move the contents of the WHERE clause into the INNER JOIN condition, then the filesort disappears. The resulting query is:

SELECT ...
FROM page
INNER JOIN categorylinks

FORCE INDEX (cl_sortkey)
ON ((cl_from = page_id) AND (cl_to IN ('Test')) AND (cl_type = 'page'))

LEFT JOIN category

ON ((cat_title = page_title) AND page_namespace = '14')

ORDER BY cl_sortkey
LIMIT 201

Now I'm not too much of an expert on databases, but theoretically this should produce the exact same results (since it's an INNER JOIN) but still be efficient (because the cl_sortkey index includes the cl_from and cl_to columns).

This would eliminate the need for any new columns and whatnot.

Qgil added a comment.Apr 23 2013, 9:08 PM

Just a note to say that Liangent has applied to GSoC with a proposal related to this report. Good luck!

https://www.mediawiki.org/wiki/User:Liangent/cat-redir

Re comment 66:

If I have more than a single category in the IN condition when doing that, I get a filesort:

mysql> describe SELECT /* CategoryViewer::doCategoryQuery Bawolff */ page_id,page_title,page_namespace,page_len,page_is_redirect,cl_sortkey,cat_id,cat_title,cat_subcats,cat_pages,cat_files,cl_sortkey_prefix,cl_collation FROM page INNER JOIN categorylinks FORCE INDEX (cl_sortkey) ON ((cl_from = page_id) AND cl_to in ('Foo', 'se') and cl_type = 'page') LEFT JOIN category ON ((cat_title = page_title) AND page_namespace = '14') ORDER BY cl_sortkey LIMIT 2\G

  • 1. row ******* id: 1 select_type: SIMPLE table: categorylinks type: range

possible_keys: cl_sortkey

    key: cl_sortkey
key_len: 258
    ref: NULL
   rows: 559
  Extra: Using where; Using filesort
  • 2. row ******* id: 1 select_type: SIMPLE table: page type: eq_ref

possible_keys: PRIMARY

    key: PRIMARY
key_len: 4
    ref: wikidb.categorylinks.cl_from
   rows: 1
  Extra:
  • 3. row ******* id: 1 select_type: SIMPLE table: category type: eq_ref

possible_keys: cat_title

    key: cat_title
key_len: 257
    ref: wikidb.page.page_title
   rows: 1
  Extra:

3 rows in set (0.00 sec)

Hmm, damn databases.

Success! So the issue is that the cl_sortkey index on categorylinks puts the cl_to column before the cl_sortkey column, so when you add the "cl_to IN ...", it can no longer use the index to sort by cl_sortkey (from the ORDER BY clause).

After adding the following index:

ALTER TABLE categorylinks
ADD UNIQUE cl_newsort ( cl_type, cl_sortkey, cl_to, cl_from )

And then running the following query:

EXPLAIN EXTENDED SELECT cl_from
FROM categorylinks
INNER JOIN page ON
page_id = cl_from
LEFT JOIN category ON
cat_title = page_title AND
page_namespace = 14
WHERE
cl_type = 'page' AND
cl_to IN ( 'Foo', 'Test' )
ORDER BY cl_sortkey

I finally got no more filesort. (I was even able to get rid of the FORCE INDEX usage.) If somebody could please check this and make sure I'm still sane, and that MySQL isn't just inventing things to trick my mind, that'd be great.

I havent tested this, but I would guess that unless it is doing something very fancy with merging indecies, this would cause very large scans of the categorylinks table. (Since it wouldn't be able to skip to only results in the relavent category). filesort isnt the only way that a db query can be inefficient.

(In reply to comment #71)

I havent tested this, but I would guess that unless it is doing something
very
fancy with merging indecies, this would cause very large scans of the
categorylinks table. (Since it wouldn't be able to skip to only results in
the
relavent category). filesort isnt the only way that a db query can be
inefficient.

Hmm, you're right. Now that I realize it, this would require scanning the entire cl_sortkey index (I think).

Qgil added a comment.May 3 2013, 9:56 PM

Just a note to say that Andre Saboia has submitted a GSoC proposal related to this report: https://www.mediawiki.org/wiki/User:Anboia/Automatic_category_redirects

Related URL: https://gerrit.wikimedia.org/r/65176 (Gerrit Change I29a629a514f9568d0ee4d967c516dfd599dc11ba)

Tyler: The patch received a -1, do you plan to rework it?

If I ever have free time again (read: probably not for a while), I offer to help Tyler address some of the issues with the patch.

(In reply to comment #76)

If I ever have free time again (read: probably not for a while), I offer to
help Tyler address some of the issues with the patch.

That would be great. I don't have much free time myself, although once I do I'll definitely work on it.

(In reply to comment #77)

(In reply to comment #76)
> If I ever have free time again (read: probably not for a while), I offer to
> help Tyler address some of the issues with the patch.

That would be great. I don't have much free time myself, although once I do
I'll definitely work on it.

yeah, somebody should make a graph of number of commits to mediawiki by volunteers vs when school semester starts.

Qgil added a comment.Oct 24 2013, 8:35 PM

I'm delisting this project from https://www.mediawiki.org/wiki/Mentorship_programs/Possible_projects#Automatic_category_redirects since it looks like you are almost there.

Remove milestone 1.22 - Given that this has somewhat stalled due to lack of time on the part of interested parties, seems unlikely it could possibly make it to 1.22.

T77903 isn't a duplicate of this task, it's more of a relatively easy fix until we get those long standing difficult core problems solved, which may not be before a very long time.

This task was mentioned in https://www.mediawiki.org/w/index.php?title=Outreach_programs/Possible_projects&oldid=1404823#Very_raw_projects as a possible candidate for Google Summer of Code or similar programs. Do you think it is a good candidate?

Qgil added a comment.Feb 11 2015, 1:44 PM

Wikimedia will apply to Google Summer of Code and Outreachy on Tuesday, February 17. If you want this task to become a featured project idea, please follow these instructions.

Qgil added a comment.Feb 17 2015, 9:39 AM

@Parent5446, this task is assigned to you. Do you want to work on it or propose it to the next GSoC/Outreachy round?

@Bawolff @Parent5446 Has this already been done? If not, is there interest in pushing this for upcoming GSoC round?

Nemo_bis added a comment.EditedMar 6 2015, 9:32 AM

Has this already been done?

No.

There is a partial patch from way back. See comments on gerrit.

(In reply to comment #63)

Question:

Is there a potential mentor willing to help potential students interested in
this project?

Yes me :)

Does this still hold true @Bawolff?

In T5311#1095413, @NiharikaKohli wrote:

(In reply to comment #63)

Question:

Is there a potential mentor willing to help potential students interested in
this project?

Yes me :)

Does this still hold true @Bawolff?

Not really. Like if there was a student who really super wanted to do this idea and no other, maybe, but generally I'd rather not mentor this idea, this round.

Qgil added a comment.Sep 23 2015, 9:10 AM

This is a message posted to all tasks under "Re-check in September 2015" at Possible-Tech-Projects. Outreachy-Round-11 is around the corner. If you want to propose this task as a featured project idea, we need a clear plan with community support, and two mentors willing to support it.

Qgil added a comment.Sep 23 2015, 9:36 AM

This is a message sent to all Possible-Tech-Projects. The new round of Wikimedia Individual Engagement Grants is open until 29 Sep. For the first time, technical projects are within scope, thanks to the feedback received at Wikimania 2015, before, and after (T105414). If someone is interested in obtaining funds to push this task, this might be a good way.

This is the last call for Possible-Tech-Projects missing mentors. The application deadline for Outreachy-Round-11 is 2015-11-02. If this proposal doesn't have two mentors assigned by the end of Thursday, October 22, it will be moved as a candidate for the next round.

Interested in mentoring? Check the documentation for possible mentors.

As previously mentioned, this task is moved to 'Recheck in February 2016' as it doesn't have two mentors assigned to it as of today, October 23 - 2015. The project will be included in the discussion of next iteration of GSoC/Outreachy, and is excluded from #Outreachy-11. Potential candidates are discouraged from submitting proposals to this task for #Outreachy-11 as it lacks mentors in this round.

Billghost removed a subscriber: Billghost.
Sumit added a subscriber: Sumit.Feb 18 2016, 1:33 PM
NOTE: This task is a proposed project for Google-Summer-of-Code-2016 and Outreachy-Round-12 : GSoC 2016 and Outreachy round 12 is around the corner, and this task is listed as a Possible-Tech-Projects for the same. Projects listed for the internship programs should have a well-defined scope within the timeline of the event, minimum of two mentors, and should take about 2 weeks for a senior developer to complete. Interested in mentoring? Please add your details to the task description, if not done yet. Prospective interns should go through Life of a successful project doc to find out how to come up with a strong proposal for the same.
Niharika removed a subscriber: Niharika.Feb 19 2016, 3:21 AM
Meno25 removed a subscriber: Meno25.Feb 22 2016, 7:15 PM
Zppix moved this task from Unsorted to Working on on the Editing-Department board.Tue, Apr 26, 2:31 PM

Add Comment