
Categories are metadata
Closed, Invalid · Public

Description

Idea 01:
Categories are metadata. Why edit in the page when you can edit outside it? See how HotCat (or VE) has led the way. Put them on Wikidata? (How many things could be made into metadata?) Are a lot of them not just Wikidata query results already?

Challenges:
Cats are an old system.
Use tags instead? (Non-curated), and have categories as a way of editorially picking out the intersections of tags which make sense for humans?
Do this and the distinction becomes ambiguous. It's already very confusing in WordPress, which does this.
Previous discussion about page properties in general: https://phabricator.wikimedia.org/T55508
You can't (yet) kill them with fire though, since they're still quite instrumental for /editors/, many of whom base their entire workflow on them.
You could probably kill them more easily if watchlists were more flexible and allowed people to track groups of articles they're interested in in a way which makes more sense.

[Note: I copied and pasted this from an Etherpad during a Wikimedia hackathon in 2015 and these are not my words.]

Event Timeline

Jdlrobson raised the priority of this task from to Needs Triage.
Jdlrobson updated the task description.
Jdlrobson moved this task to Hacking proposals on the Wikimedia-Hackathon-2015 board.
Jdlrobson subscribed.
Qgil triaged this task as Low priority. (Feb 2 2015, 1:18 PM)
Qgil subscribed.

This is the mix of many largely orthogonal issues:

  • should categories be based on content? If you dissociate them from content, you create a new type of meta-content, which will need its own curation and patrolling workflows and interfaces, export/import mechanism etc. A pretty large task.
    • if they are dissociated from content, should categories be centralized across projects? (A prerequisite to using Wikidata.) A non-trivial social issue as currently even the different language editions of the same project can have very different categorization practices; using the same category system for Wikipedia and Wikibooks sounds awkward.
    • if they are associated with content (ie. given a revision, the parser can always recreate the category list), should they live in the wikitext, outside it (some sort of revision props) or both? Templates rely very strongly on the ability to create categories via wikitext transclusion.
  • should categories be hierarchic or shallow with easy intersection (ie. tags)? Tags are much easier to add but harder to organize.
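As a rough illustration of the "associated with content" option above, here is a minimal sketch of recreating a category list from a revision's wikitext. It assumes English namespace naming and only sees literal `[[Category:...]]` links; categories added through template transclusion would need the real parser, which this does not attempt. The function name is hypothetical.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]] in raw wikitext.
# Assumption: English "Category" prefix only; template-generated categories are invisible here.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)

def categories_from_wikitext(wikitext: str) -> list[str]:
    """Return category names in order of first appearance, without duplicates."""
    seen: list[str] = []
    for match in CATEGORY_LINK.finditer(wikitext):
        name = match.group(1).strip()
        if name not in seen:
            seen.append(name)
    return seen
```

The gap between this list and what the parser actually produces is exactly the template-transclusion dependency mentioned in the bullet above.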

What @Tgr said. Categories as they are are associated with pages and not topics. They are also completely different across languages/cultures. So I'd say no to moving them to Wikidata. That being said there is a case to be made for storing them as structured data with the page. That doesn't have to be Wikidata/Wikibase though.

I've bent Lydia's ear a couple of times in the past on this, once in Amsterdam and once in Paris on the way back from the boat reception.

Categories have developed for a reason. There's a real value in having groupings of content (whether articles or Commons images) into groups of a human-manageable size of say 20 to 200 items, with the groups arranged in a curated hierarchical structure. IMO the category view may be particularly valuable for images, where there is real value being able to scroll down a group of about that size (so a degree of specificity giving a group of about that size) of images on a particular topic. But it also goes for articles too, to find related material: there is value in being able to see together a group of a particular size of possible related content. Too big a group (too little specificity) and it's overwhelming; but too small a group (too much specificity) and it becomes too 'bitty', and you don't see enough options to find the article you want or get an idea of the level of context and coverage all in one place, without it being broken into endless little bits. So there is value in the category system, which has tended to refine the degree of specificity to a particular sweet spot, that is an appropriate sized group -- not too big, not too little -- to be of most value to a human reader.

There's also great value in the knowledge that is stored in that hierarchical structure -- you won't go to a conference with researchers who have used Wikipedia without there being at least someone who has mined our category system for groups and relationships. And we're using it ourselves for Wikidata, as one of the key sources people mine to systematically give P31 values and key properties to items, far far too many of which are still currently not specified.

But the category system currently has great weaknesses too. First, because addition is manual, inclusion and coverage can be haphazard; and because the structure is organic and somewhat arbitrary, even finding the correct category to put something in can be unpredictable, time-consuming and onerous. Second, from an information-mining perspective, there is a difficulty because of a lack of transitivity: if A is a member of B, and B is a member of C, it does not necessarily follow that A matches the inclusion criteria of C. As a result, a downward exploration of the hierarchy doesn't have to go very far to find category contents wheeling off in all manner of strange and unexpected directions, quite incompatible with what was originally sought to be harvested.

These are general problems of the category system, as applicable to Wikipedia as to Commons. So perhaps I should not be adding this to a Commons-specific Phabricator item. I tend to agree with Lydia that structured items for Commons categories should mostly stay on Commons, attached to a particular Commons page, or however the Commons wikibase is structured, rather than the main Wikidata. But I think the issues are the same for both -- IMO the Commons and Wikipedia category issues exactly parallel each other -- and also the value of what Wikidata (or structured data) can bring, to preserve category-like views, but to make them actually work much better -- both for humans and for machines.

Since they are so parallel, in what follows I'll discuss in terms of items on Wikidata and corresponding articles and categories on Wikipedias, but the translation to Commons should be straightforward.

So here goes.

If we're going to make categories work better, the first thing to do is to work out what is in them at the moment, and document it.

It turns out that we actually already have the Property to do it: Property 360 "is a list of", as for example demonstrated in action on Q15832361, List of women engineers.

The syntax for P360 exactly mirrors the properties that items to be included in the list or category should have, as inclusion criteria -- and in particular highlights the P31 which defines what sort of fundamental object they are.

Filling out a P360 for each category solves the problem of transitivity -- because with the explicit inclusion criteria, it is easy for a crawler to identify eg when the downward sequence of categories ceases to be ever more refined groups with eg P31=battle, and instead turns into a category about a specific battle, with its commanders etc -- where the battle in question has become a property of the inclusants, rather than their fundamental P31 being a battle.
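To make the crawler idea concrete, here is a minimal sketch over toy data. The category names, the reduction of P360 criteria to a single expected P31 value, and the function name are all assumptions for illustration, not anything that exists on Wikidata in this form.

```python
from collections import deque

# Hypothetical toy data: each category carries the P31 its members are expected
# to have (a stand-in for full P360 inclusion criteria) and its sub-categories.
CATEGORIES = {
    "Battles": {"p31": "battle", "subcats": ["Battles of the Napoleonic Wars"]},
    "Battles of the Napoleonic Wars": {"p31": "battle", "subcats": ["Battle of Waterloo"]},
    # A category about one specific battle: members are commanders etc., not battles.
    "Battle of Waterloo": {"p31": "human", "subcats": []},
}

def consistent_subtree(root, categories):
    """Breadth-first walk that stops descending once the expected P31 changes."""
    wanted = categories[root]["p31"]
    kept, queue, seen = [], deque([root]), {root}
    while queue:
        name = queue.popleft()
        if categories[name]["p31"] != wanted:
            continue  # criteria changed: neither harvest nor descend further
        kept.append(name)
        for child in categories[name]["subcats"]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return kept
```

With this toy data the walk from "Battles" keeps the first two categories and stops at "Battle of Waterloo", which is the cut-off point the paragraph above describes.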

Having P360 in place also means that (if desired) the category-view could be auto-augmented, with objects whose items match the criteria, even if they haven't got the category line in their wikitext. (Magnus's Reasonator already does this, to predict what items ought to be in list articles.) Allowing users to turn this on, on a per-category basis, would hugely help the problem of comprehensiveness. Category views would then be as comprehensive as the Wikidata data (which is much easier to systematically interrogate, and to compute intersections for), without all the fiddly business of having to add category names by hand.
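A minimal sketch of what auto-augmentation could look like, assuming items are plain dictionaries of property to list of values and criteria are property/value pairs. The item identifiers, the readable string values and both function names are made up for illustration; a real implementation would go through whatever query service is available.

```python
def matches(item, criteria):
    """True if the item carries every property/value pair in the criteria."""
    return all(value in item.get(prop, ()) for prop, value in criteria.items())

def augmented_view(criteria, manual_members, items):
    """Manually categorised members first, then items that match the criteria anyway."""
    inferred = sorted(
        qid for qid, item in items.items()
        if qid not in manual_members and matches(item, criteria)
    )
    return list(manual_members) + inferred

# Hypothetical items; values are readable strings rather than real Q-ids.
ITEMS = {
    "item_a": {"P31": ["human"], "P21": ["female"], "P106": ["engineer"]},
    "item_b": {"P31": ["human"], "P21": ["female"], "P106": ["engineer", "writer"]},
    "item_c": {"P31": ["human"], "P21": ["male"], "P106": ["engineer"]},
}
CRITERIA = {"P31": "human", "P21": "female", "P106": "engineer"}

view = augmented_view(CRITERIA, ["item_a"], ITEMS)
```

Here only "item_a" has the category line in its wikitext, but "item_b" also satisfies the criteria and so would be added to the generated view.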

Also, as the direct converse, one could organise a constraint violation report for all items included in a category (on a particular wiki) that apparently did *not* match the inclusion criteria defined in its P360 (or P360s -- there might be alternative sets of criteria acceptable). Such constraint violations might efficiently indicate either a missing property on the item, eg a missing P31 (still a big problem for us); or an additional set of acceptable criteria; or an incorrect item sitelink to identify the category from that particular language; or an item that should not be in the category -- the machine interpretability would allow these mismatches to be highlighted straightforwardly.
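The converse check can be sketched in the same style. This assumes a category may have several alternative criteria sets and that membership lists have already been pulled from the wiki; the data and function names are hypothetical.

```python
def matches(item, criteria):
    """True if the item carries every property/value pair in the criteria."""
    return all(value in item.get(prop, ()) for prop, value in criteria.items())

def violations(criteria_sets, members, items):
    """Members of the category that satisfy none of the acceptable criteria sets."""
    return sorted(
        qid for qid in members
        if not any(matches(items.get(qid, {}), criteria) for criteria in criteria_sets)
    )

# Hypothetical data: item_b lacks its P31, item_c is not a painter at all.
ITEMS = {
    "item_a": {"P31": ["human"], "P106": ["painter"]},
    "item_b": {"P106": ["painter"]},
    "item_c": {"P31": ["human"], "P106": ["sculptor"]},
}
CRITERIA_SETS = [{"P31": "human", "P106": "painter"}]

report = violations(CRITERIA_SETS, ["item_a", "item_b", "item_c"], ITEMS)
```

The report cannot tell by itself which of the causes listed above applies (missing property, missing alternative criteria, wrong sitelink, or wrong categorisation); it only surfaces the mismatch for a human to look at.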

I should at this point note that P360 is currently labelled "is a *list* of". However, GerardM has now already filled it in for 2000 categories, from a recent start, and discussions at Project chat and this PfD appear to agree that this is appropriate, and a far better model for taking this further than the vague P971 "category combines topics".

There are a few more wrinkles about what does and does not tend to get included in categories, that are worth thinking about, to make the system above work.

a) Sometimes a category contains key articles that do not match the inclusion criteria.

For example, if there is a survey article directly on the topic of the category, that will usually be included, typically at the top of the category listing, eg under the alphabetisation "*", separated off from the regular articles. So a category "20th-century painters" might include a category lead article "Painters of the 20th century" -- even though that is not an article about a painter.

Fortunately we already have a property to identify such a lead article for a category -- P301, "category's main topic".

However, there may be other such articles -- eg "List of 20th century painters" -- that are typically included in such a category.

If these were indicated as values of a new property "Category auxiliary article", then they could be included appropriately in an automatically generated category view, despite not meeting the main criteria, or excluded from constraint violation reports.

b) One other thing to recall is that categories typically do not directly include objects that are included in sub-categories.

So one other new property is also needed, a property "category is a sub-category of", to record that the current category A is a sub-category of Category X; so that in any auto-generated (or auto-augmented) view of the parent category X, all sub-categories with this property can be identified, and all items satisfying the inclusion criteria of any of the sub-categories can be excluded from the generated category view. (Which is a slightly involved search, but shouldn't be beyond whatever query engine ultimately gets specified.)

One further wrinkle is that different languages have different category structures -- so, as a qualifier on the property, one would also need to record in which languages Category A was a sub-category of Category X -- in other languages it might be a sub-category of something else. But that is fine. This I think bothered you Lydia, when you noted that categories are "completely different across languages/cultures" -- but what is recorded on Wikidata does not have to be *normative* -- it's not telling every language how things ought to be subcategorised, the one true revealed way -- rather, the subcategory property would be *descriptive* -- recording how things actually have been categorised in all the various different languages, without any judgement at all.

In this way the view of the contents of a particular category would be wiki-specific (just as it is now). Each language would be different, and Commons different again. But that's not a problem. And having the union of all those patterns stored on Wikidata (or rather, the information behind the different categorisation structures) would actually be a boon to researchers, and to crawlers, which could then actually crawl the category tree in all languages at once -- one more example of the internationalisation achieved by Wikidata.
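A minimal sketch of wrinkle (b) combined with the per-wiki qualifier, under stated assumptions: the "sub-category of" statements are modelled as (child, parent, wiki) triples, criteria are property/value pairs, and "century" is an invented key standing in for whatever date criterion would really be used. None of these names exist as such on Wikidata.

```python
def matches(item, criteria):
    """True if the item carries every property/value pair in the criteria."""
    return all(value in item.get(prop, ()) for prop, value in criteria.items())

# Hypothetical "sub-category of" statements, qualified by the wiki they describe.
SUBCATEGORY_OF = [
    ("20th-century French painters", "20th-century painters", "enwiki"),
    ("20th-century French painters", "French painters", "frwiki"),
]

# Hypothetical inclusion criteria; "century" is an invented stand-in key.
CRITERIA = {
    "20th-century painters": {"P106": "painter", "century": "20"},
    "20th-century French painters": {"P106": "painter", "century": "20", "P27": "France"},
}

ITEMS = {
    "item_es": {"P106": ["painter"], "century": ["20"], "P27": ["Spain"]},
    "item_fr": {"P106": ["painter"], "century": ["20"], "P27": ["France"]},
}

def direct_view(parent, wiki, relations, criteria, items):
    """Items matching the parent's criteria but none of its sub-categories' on this wiki."""
    subs = [child for child, p, w in relations if p == parent and w == wiki]
    result = []
    for qid, item in sorted(items.items()):
        if not matches(item, criteria[parent]):
            continue
        if any(matches(item, criteria[s]) for s in subs if s in criteria):
            continue  # already shown inside a sub-category on this wiki
        result.append(qid)
    return result
```

Because the relation is only descriptive, the same parent category yields different direct views per wiki: with this toy data the enwiki view omits "item_fr", while the frwiki view, which has no such sub-category under that parent, lists both items.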

So that's how I think categories should go forward.

What would that translate into, in practical terms?

  • Filling out the P360s probably needs to be done by hand -- but GerardM is already showing what can be achieved.
  • Adding the information as to what is a sub-category of what probably should be a big central batch update -- with ongoing mechanisms to keep it synchronised with categorisation changes on the different wikis in the different languages. But there are real advantages to the sub-category appearing to be "just another property", and queryable as such, so that people can include it in a query like anything else, with the same syntax for anything else -- even if in reality it was actually a "virtual property", being generated on the fly from the main SQL tables.
  • Constraint-violation reports -- should not be a problem to generate, even right away, as soon as P360s are in place; though they might need some scripting, to pull together Wikidata properties with wiki data that is currently sitting in SQL tables. Can GerardM already do this?
  • Auto-augmentation of category views -- needs a bit of thought, but shouldn't be prohibitive, even soon?
  • Other category goodies -- looking again at how categories are generated and presented, going to a more dynamic model, might allow other goodies to be reasonably easily incorporated -- eg alternative possible sort orders for the results (the sequence of item-properties for each one specified e.g. as the value of a property of a category item; or by that category-item-property holding a reference to a standard order, defined on another item, eg Property:Sort_option = Q{Sort category by date}).
  • One could also imagine adding filtering options to a category view -- eg to only show the best image, if different images all have the same statements on them (suppress alternates/duplicates).
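The last two bullets (alternative sort orders and suppressing alternates) can be sketched together. This assumes each image has a flat dictionary of statements and some numeric quality score; the "score" field, the file names and the function names are invented for illustration.

```python
def suppress_alternates(images):
    """Keep one image per distinct set of statements: the one with the highest score."""
    best = {}
    for image in images:
        key = frozenset(image["statements"].items())
        if key not in best or image["score"] > best[key]["score"]:
            best[key] = image
    return list(best.values())

def sort_view(images, sort_properties):
    """Order images by the given sequence of statement properties."""
    return sorted(
        images,
        key=lambda image: tuple(image["statements"].get(p, "") for p in sort_properties),
    )

# Hypothetical images; "score" stands in for whatever defines the best image.
IMAGES = [
    {"file": "a.jpg", "score": 2, "statements": {"depicts": "bridge", "date": "1900"}},
    {"file": "b.jpg", "score": 5, "statements": {"depicts": "bridge", "date": "1900"}},
    {"file": "c.jpg", "score": 1, "statements": {"depicts": "bridge", "date": "1850"}},
]

view = sort_view(suppress_alternates(IMAGES), ["date"])
```

The sequence passed as `sort_properties` plays the role of the sort option stored on the category item; how "best" is decided between alternates is left open here, as it is in the bullet.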

So I think there's a lot that Wikidata could bring to category views; but the start is to understand what is already there, by getting P360s in place.

Just checking, is there someone working on this task in Wikimedia-Hackathon-2015? If not, please remove the project.

T107595: [RFC] Multi-Content Revisions may be of relevance to this task and its apparent child.

Why edit in the page when you can edit outside it?

Wow. That comment is so backwards. I really wish we had more overlap between the dev community and the editing community.

Why would we want to turn a wiki into a big complicated hard-to-use app when we can simply edit inside the page? A wiki is just a bunch of pages, and a page is just a text file with a name and history. Dead simple. All of the complexity lies in rendering the page.... all of the complexity lies in learning what you can write in the text file. I can 100% write an article in notepad.exe, CTRL-V it in the edit window, and save. Why would I want a complicated app glued onto the side of the page, forcing me to do categories separately?

For the record I am not the original author of this text.

@Alsee to be fair, this was copied and pasted from an Etherpad during an idea-generation session I ran at the Wikimedia dev summit back in January 2015, and it has been taken out of context.

I'm not sure who the original author was, but I think you should assume good faith here. Some ideas are not necessarily good ones :).

@Jdlrobson: I definitely didn't assume any bad faith here, and I didn't mean to "blame" anyone in particular. Consider this my attempt to stir concepts into the idea sessions :)

The reason I wound up here was from Multi-Content_Revisions (MCR), which cites this as a possible use case. I do understand why MCR (and use cases) seem like good ideas, but I think MCR is floating on a swirl of bad ideas. The key is why it's a good or bad idea. From the developer point of view it's very tempting to look at the various stuff editors do and think "I can build an app for that!". You can build a great app that does the exact task great, and efficiently, with a great interface for that exact task, and having separate structured data makes things much easier on the software side. But that turns the wiki into a big complicated pile of complicated apps.

Any powerful system is going to have complexity. The question is where that complexity is located, and how it's presented. The wiki model is that everything is a page, and pages are dead-simple text files. The journey is learning the various neat things you can write in that text file. A six year old can click EDIT and start typing away. A six year old can accomplish virtually anything with a blind copy-paste from an existing page. They don't need to understand what they copied. Maybe the community has a weird attachment to that approach, but it's the approach that made wiki so successful.

From the developer point of view it's very tempting to look at the various stuff editors do and think "I can build an app for that!".

The Wikipedia community did in fact build an app for that, so I am not sure why you are talking as if this idea was somehow forced on the community by some developers against their will.

To what extent the editability of metadata should be preserved under MCR is a reasonable question to ask but this task is a terrible place to ask it - it is about some random idea written down at a hackathon and reservations noted here will almost certainly be missed. You should use the MCR wiki page or the main task.

What is the goal of this task? Does it still serve a purpose? It seems to be something old from 2015 without a clear goal. I would just close this story as invalid.

It was made in 2015 during a Wikimedia tech conference. I don't think we need to leave it lingering in Phabricator.