
Hierarchical category system is urgently needed
Closed, DeclinedPublic

Description

Author: kleine_matthias

Description:

  1. Categories in Wikipedia are chaos.
  2. The reason is that the system does not work hierarchically.
  3. Example: When I add an article to category "Cat", it should also _automatically_ belong to categories "mammal", "animal" and "creature". When I then browse through the category "animal", I should find the article. This is not the case in the current system. The result is chaos.
  4. Much work is now spent on sorting out the chaos in the current category system. Much work could be saved if there were a sound technical foundation for a _true_ category system.
  5. I discussed this issue with several engaged Wikipedia authors and administrators in the German Wikipedia. They all agree that this would be a desirable feature.

Best regards
Matthias Kleine


Version: unspecified
Severity: enhancement

Details

Reference
bz1497

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 8:13 PM
bzimport set Reference to bz1497.

Discussed this with Matthias a bit on IRC, will implement his plans once we've got a firmer idea how best to do this.

Please see http://meta.wikimedia.org/wiki/Category_flatten

I am no PHP coder, but I think it's not really tough to get that done. Opinions are most welcome. And yes, we badly
need that.

The hard part is not proposing a flattened membership table to speed up reads,
but implementing it efficiently. Writes must be taken into account as well as
reads: if a major category hierarchy is rearranged (and this can be done
with a simple edit to a single page), this must be handled without killing the
wiki for an hour while the flattened membership table is rewritten.
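
To make the write cost concrete, here is a minimal Python sketch of a flattened membership table. It is not MediaWiki code: the names `ancestors`, `flatten`, `parents` and `page_categories`, and the sample categories and pages, are all hypothetical, and the table is modelled as an in-memory set of (page, category) rows rather than a database table.

```python
def ancestors(category, parents):
    """Return every category reachable upward from `category`."""
    seen = set()
    stack = [category]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


def flatten(page_categories, parents):
    """Build the flattened set of (page, category) membership rows."""
    rows = set()
    for page, direct in page_categories.items():
        for category in direct:
            rows.add((page, category))
            rows.update((page, a) for a in ancestors(category, parents))
    return rows


# Hypothetical sample data: category -> set of parent categories.
parents = {"Cat": {"Mammal"}, "Mammal": {"Animal"}, "Animal": {"Creature"}}
page_categories = {"Felix": {"Cat"}, "Garfield": {"Cat"}}
before = flatten(page_categories, parents)

# A single edit to the "Mammal" category page: re-parent it under "Vertebrate".
rearranged = {**parents, "Mammal": {"Vertebrate"}, "Vertebrate": {"Animal"}}
after = flatten(page_categories, rearranged)

# Rows that would have to be inserted or deleted in the flattened table.
changed_rows = before ^ after
print(len(before), len(after), len(changed_rows))
```

In this toy example the single edit changes one row for every page below "Mammal". On a real wiki, a category near the top of the hierarchy can have a very large number of pages beneath it, which is where the write cost comes from.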

kleine_matthias wrote:

Anybody who is interested in finding an efficient solution for this problem may also take a look at

http://en.wikipedia.org/wiki/User_talk:Brion_VIBBER#Categories

Regards, Matthias

richholton wrote:

Does this presume a change to a tree-structure for categories? If not it seems
like you could end up with a situation where adding an article to a category
could add that article to virtually every category on the system. Is this what
we want?

Or, if we are talking about a true tree-structured category system, is ''that''
what we want? It would be a significant change to the current system's behavior.
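
One way to make the question concrete is to check whether a given category graph is actually a tree (strictly, a forest). The following is only a sketch under stated assumptions: the function name `is_forest` and the sample data are hypothetical, and the graph is assumed to be a plain mapping from each category to the set of its direct parents.

```python
def is_forest(parents):
    """True if every category has at most one parent and no cycle exists."""
    for start in parents:
        seen = {start}
        current = start
        while True:
            direct = parents.get(current, set())
            if len(direct) > 1:
                return False  # more than one parent: not a tree
            if not direct:
                break  # reached a top-level category
            (current,) = direct
            if current in seen:
                return False  # walked back onto our own path: a cycle
            seen.add(current)
    return True


# Hypothetical sample graphs: category -> set of direct parent categories.
tree = {"Cat": {"Mammal"}, "Mammal": {"Animal"}}
multi_parent = {"Computers and Society": {"Computer Science", "Social Science"}}
cyclic = {"A": {"B"}, "B": {"A"}}

print(is_forest(tree), is_forest(multi_parent), is_forest(cyclic))
```

If the check fails, automatic inheritance follows every parent link, so an article can end up in far more categories than the person adding it intended.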

kleine_matthias wrote:

This is just how categories work in the human mind. All kinds of cognitive science support that view of categories: start
with Piaget's studies of cognitive development, take a look at modern cognitive psychology, look at how artificial
intelligence research deals with the problem ... categories are structured like trees, not like lists. Look at how
scientific fields are structured. How are the books in your library sorted? (I hope rather differently from articles in
Wikipedia ...) It's just a natural way of dealing with topics, saying "this topic belongs to this broader topic, and
this broader topic itself belongs to a more general topic ...".

I admit that this would not be a minor change in how things are done in Wikipedia. Therefore, I appreciate the
discussion. We should be aware that even if we keep the category system as a list, as it is now, users will continue
to treat it like a tree, not knowing that the system behaves differently from what they expect.

Have you ever observed people creating very specific categories like [[categorie:mysmallhometown]] and changing the links
in dozens of articles? It's only a question of time until somebody even weirder creates
[[categorie:mysmallhometown (westside)]], changing the links again, so that [[categorie:mysmallhometown]] loses quite a lot of
its articles. In fact, this is what happens every day in the current system ...

Regards Matthias Kleine

joern.schimmelpfeng wrote:

Ever thought about creating an ontology?

One major problem I see is detecting whether a subcategory or an article in a category belongs to a
top-level category. For example, say you have top-level categories "A" and "B", subcategories "A1"
and "B1", and also "AB". Let's assume "AB" is a subcategory of both "A" and "B", and there is a loose
containment relationship from "AB" to "B1".

  A     B
   \   /
    AB
      \
  A1   B1

So logically this means "B1" is a subcategory of "A". But semantically that is not necessarily true. For
articles the problem is much worse, because the relationships there are sometimes very loose.
Examples of this are "Computer Science", "Social Science" and "Computers and Society".

So one idea is to create an ontology. This means that relationships are semantically well defined (e.g.
containment relationship, similarity relationship, "is part of" relationship, ...). You are then able to
"understand" what kind of relationship two categories or articles have, and whether it is strong or merely
informational.

We could adopt the Semantic Web approach (RDF/OWL) for that. I don't think that we should use it
directly because of the complexity of RDF.

What do you think?

(In reply to comment #0)


kleine_matthias wrote:

>   A     B
>    \   /
>     AB
>       \
>   A1   B1
>
> So logically this means "B1" is a subcategory of "A". But semantically that is not necessarily true.

In my eyes, this is clearly a problem at the user level. No architecture will prevent a user from "editing" the category
tree in a way that is semantic nonsense (e.g. classifying a car as an animal). Sure enough, there are a
couple of models for knowledge representation which might be even better than a category tree (in my opinion, Minsky's
frame logic would be quite fine, but this relies on a tree structure, too). However, that goal is too far off. A
simple tree would be three steps forward and might be realizable in the foreseeable future.

joern.schimmelpfeng wrote:

> In my eyes, this is clearly a problem at the user level. No architecture will prevent a user from "editing" the category
> tree in a way that is semantic nonsense (e.g. classifying a car as an animal).

The point is that I don't believe it is always nonsense. There are good reasons why a category may
belong to multiple top-level categories. But there are different types of relationships that you
cannot model today.

> Sure enough, there are a couple of models for knowledge representation which might be even better than a category tree
> (in my opinion, Minsky's frame logic would be quite fine, but this relies on a tree structure, too). However, that goal is
> too far off. A simple tree would be three steps forward and might be realizable in the foreseeable future.

I think giving a relationship a semantic definition is not hard to implement and not too confusing
to use. Just two different types of relationships (isPartOf and isRelatedTo) would help a lot. One of
them must be strictly hierarchical; the other is a graph. This makes it possible to classify
articles and categories automatically. One interesting use case, for instance, is to use a clustering algorithm to
detect whether a category makes sense at all or should be split.
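
A rough Python sketch of the two relationship types might look like the following. Only the names isPartOf and isRelatedTo come from the comment above; the `CategoryGraph` class, its methods, and the choice of which links in the A/B/AB/B1 example are strict and which are loose are assumptions made purely for illustration.

```python
from collections import defaultdict

IS_PART_OF = "isPartOf"        # strict hierarchy: at most one parent, no cycles
IS_RELATED_TO = "isRelatedTo"  # loose, symmetric link: an arbitrary graph


class CategoryGraph:
    """Hypothetical container for typed category relationships."""

    def __init__(self):
        self.part_of = {}                # child -> its single parent
        self.related = defaultdict(set)  # category -> loosely related ones

    def inherited(self, category):
        """Categories implied by following isPartOf links upward only."""
        chain = []
        while category in self.part_of:
            category = self.part_of[category]
            chain.append(category)
        return chain

    def add(self, source, relation, target):
        if relation == IS_PART_OF:
            if source in self.part_of:
                raise ValueError(f"{source!r} already has a parent")
            if source == target or source in self.inherited(target):
                raise ValueError("isPartOf must not form a cycle")
            self.part_of[source] = target
        elif relation == IS_RELATED_TO:
            self.related[source].add(target)
            self.related[target].add(source)
        else:
            raise ValueError(f"unknown relation {relation!r}")


# Assumed reading of the earlier diagram: the AB-B1 link is only a loose one.
graph = CategoryGraph()
graph.add("AB", IS_PART_OF, "A")
graph.add("B1", IS_PART_OF, "B")
graph.add("B1", IS_RELATED_TO, "AB")

print(graph.inherited("B1"))  # membership follows isPartOf only
```

Because automatic membership follows isPartOf only, "B1" is not pulled into "A" through its loose link to "AB", while the isRelatedTo graph remains available for browsing or clustering.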

Is there a way to discuss offline?

I do think that discussions probably should not run in Bugzilla. Why not move it to http://meta.wikimedia.org/w/index.php?title=Talk:Category_flatten ?

My apologies for not understanding, but why was this changed from LATER to FIXED?

Does MediaWiki currently have lists of "pages in a category and its subcategories"? How can that be used? Specifically: how was the problem exemplified in item (3) of comment #0 fixed?

(In reply to comment #0)

> 3. Example: When I add an article to category "Cat", it should also _automatically_ belong to categories
> "mammal", "animal" and "creature". When I now browse through the category "animal", I will find the
> article. This is not the case in the current system. The result is chaos.

This would be very compelling for Special:RandomInCategory, as one could essentially get the same enjoyable variety one gets from one's favorite television or radio station, being exposed to, say, new Science articles without having to specify exactly which field one is interested in.

(Speaking of radio, it would be interesting if one could ask for random sound files in a category, and get the pages to load, play, and then load another random one in sequence; likewise for videos; scrolling through random images in a category à la Google Images would be cool too.)

Considering that

  • this feature hasn't been developed in almost 10 years
  • MediaWiki users with heavy categorization needs use Semantic MediaWiki
  • in Wikimedia, categorization is increasingly a task for Wikidata / Wikibase
  • @brion is working on completely different stuff nowadays

we could put this task up for grabs with a Needs Volunteer priority, which would reflect the current reality better. Or we could decline it and place our bets on Wikidata categorization, almost 10 years from now (or maybe sooner).

PS: currently this is the oldest open task in Phabricator assigned to someone.

To be honest I think we should close this bug. Wikidata will make this all easier. It's a matter of time. Keeping this bug open doesn't seem useful.

Agreed. Be bold. Go Wikidata!