
Support the creation and use of volunteer tools that help to convert information in Commons categories to structured data
Open, Low, Public

Description

As soon as SDC General is rolled out in 2018, volunteer tools will be very helpful to assist in translating current unstructured or semi-structured data to structured data in file pages on Wikimedia Commons.

Commons categories are a specific use case. A lot of information is contained in them; some categories are connected to Wikidata, but many are not (both very specialized subcategories, and intersection categories that combine two or more separate concepts).

  • Help take inventory of how such tools would ideally work
  • Support volunteer developers when developing such tools
  • Inform the Commons and Wikidata communities about the tools and how to use them (documentation)

Event Timeline

A first rough breakdown of some types of categories we are dealing with. Corrections and additions welcome.

Without fully structured data, categories on Wikimedia Commons have been the best vehicle to 'tag' media files on Commons and to organize them. The Commons category system is multihierarchical (i.e. it is a tree-like structure in which each 'node/branch' can have multiple parents and multiple children).
A lot of (often detailed) information is stored in Commons categories. We want to lose as little of this informational value as possible and want to work towards the best transition of this information to structured data.

Wikimedia Commons contains roughly 6,066,000 categories. (source 1) (source 2) (checked on November 9, 2017)

1. Categories with purely informational value

1.1. Simple categories about single (or combined) topics, with already (some) connection to structured data

Examples:

1.2. Very specific categories, not connected to structured data

1.2.1 Probably notable enough to deserve a Wikidata item

1.2.2 Probably not notable enough to deserve a Wikidata item

1.3 Intersection categories (combining various topics)

2. Categories with (or including) administrative and maintenance functions

Low priority; this is definitely on my radar, but not something I will spend many hours on in Q2 of 2017-18 (Oct-Dec 2017).

Magnus added a subscriber: Magnus. Nov 10 2017, 2:54 PM

Trying to decompile all these into statements and/or checks:

The year 1191 AD

  • depicts

or

  • creation date

or

  • creation date of depicted object (?)

Ethanol

  • depicts

George Washington

Could be

  • depicts
  • creator
  • owner

Maps of the United Kingdom

  • instance of:map
  • depicts(?):UK

Bibliothèque Nationale MS Fr. 2646

  • depicts: [new item]
  • instance of:manuscript

Pol Fruit

Could be [new item "Pol Fruit"]:

  • depicts
  • creator
  • owner

The New Orleans Bee May 1874

  • publication (or something): [new item "The New Orleans Bee"]
  • publication date: May 1874

Composers from Denmark

  • Too abstract to add direct statements
  • Check that all files have creators, and that all those creators are from Denmark

Addax nasomaculatus in Jerusalem Biblical Zoo

  • depicts: Addax nasomaculatus
  • location: Jerusalem Biblical Zoo

Disease incidence maps of the United Kingdom

  • instance of: Disease incidence map
  • depicts(?): United Kingdom

Files from Internet Archive Book Images Flickr stream

  • imported from: [new item "Internet Archive Book Images Flickr stream"]

Photographs taken on 2016-08-01, Uploaded with Mobile/Web

  • creation date: 2016-08-01
  • upload path(?): Mobile/Web

CC-BY-SA-2.0

  • licence (of file): CC-BY-SA-2.0

Media with locations

  • obsolete, convert locations from template and/or EXIF to statements for all files
  • maybe useful as one-off check: Once all location statements are created, all these files should have one, highlight if not

Mérimée with PA parameter

  • obsolete, should become statement using a Mérimée property (or some generic "external ID" prop)
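
The breakdown above could be approximated by a small rule table that maps category-name patterns to candidate statements. The following is only a minimal sketch under stated assumptions: the patterns, the `decompose` helper, and the plain-text property labels are hypothetical illustrations taken from the examples in this comment, not existing properties or an existing tool.

```python
import re

# Hypothetical rules: each entry pairs a category-name pattern with a
# function that turns the regex match into candidate (property, value) pairs.
# Property labels are plain strings from the examples above, not real IDs.

def _maps(m):
    # "Disease incidence maps of ..." -> "Disease incidence map"; bare "Maps of ..." -> "map"
    kind = f"{m.group(1)} map" if m.group(1) else "map"
    return [("instance of", kind), ("depicts", m.group(2))]

RULES = [
    (re.compile(r"^Photographs taken on (\d{4}-\d{2}-\d{2})$"),
     lambda m: [("creation date", m.group(1))]),
    (re.compile(r"^(?:(.+) )?[Mm]aps of (?:the )?(.+)$"), _maps),
    (re.compile(r"^(.+) in (.+)$"),
     lambda m: [("depicts", m.group(1)), ("location", m.group(2))]),
]

def decompose(category_name):
    """Return candidate statements for a category name.

    An empty list means no rule applied and a human should decide
    (e.g. "George Washington" could be depicts, creator, or owner).
    """
    for pattern, build in RULES:
        match = pattern.match(category_name)
        if match:
            return build(match)
    return []
```

Rules are tried in order, so the more specific "maps of" pattern runs before the generic "X in Y" pattern. Ambiguous names deliberately fall through to an empty result instead of guessing.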

I would humbly suggest the following approach to resolve these:

  • Create a "category" property (plain string) on Commons SD (and maybe a "sort order" qualifier statement as well)
  • Add a (partially) filled (or even blank) template to each file on Commons, that renders each category statement value as a [[Category:]] link
  • For each file, add all categories (that are not in templates) as a statement, then remove the category from the wikitext
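
A rough sketch of these three steps, assuming category statements are held as simple dictionaries with a hypothetical `category` value and an optional `sort_order` qualifier. The function names are made up for illustration, and the regular expression only handles direct `[[Category:...]]` links in the wikitext, not categories added by templates.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]], plus one trailing newline.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|([^\]]*))?\]\]\n?"
)

def extract_categories(wikitext):
    """Return (statements, wikitext with the direct category links removed)."""
    statements = []
    for m in CATEGORY_LINK.finditer(wikitext):
        statement = {"category": m.group(1)}
        if m.group(2):
            statement["sort_order"] = m.group(2)  # hypothetical qualifier
        statements.append(statement)
    return statements, CATEGORY_LINK.sub("", wikitext)

def render_categories(statements):
    """What the template would emit: one [[Category:]] link per statement."""
    links = []
    for statement in statements:
        if "sort_order" in statement:
            links.append(f"[[Category:{statement['category']}|{statement['sort_order']}]]")
        else:
            links.append(f"[[Category:{statement['category']}]]")
    return "\n".join(links)
```

Rendering the extracted statements yields the same category links that were removed from the wikitext, which is what would keep the page output unchanged.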

The file description page should now render exactly as before, and all categories should work as before, but

  • categories and other statements (on Commons SD and Wikidata alike) can now be queried together via SPARQL
  • a tool/JavaScript can perform a single action (e.g. "remove all category statements with this name", "add creator:Michelangelo") to the results of a SPARQL query (maybe via QuickStatements)
  • [on second thought, one can already do that in PetScan, but maybe an "official" way would be nicer...]
  • that way, information contained in the category name can be added as new statements
  • once the information is accessible via SPARQL query, the category statements are no longer necessary and can be removed (or at least deprecated)

my 2 eurocent

SandraF_WMF removed SandraF_WMF as the assignee of this task. Nov 13 2017, 10:17 AM

Clogging up my backlog now :-) we'll get to it when it becomes relevant.

Jheald added a subscriber: Jheald. Nov 21 2017, 6:00 PM

Hi Magnus,

I am intrigued by the idea of the categorisation information being directly accessible in the file's wikibase page; and I presume the template hack to add a category statement on the File page would also keep the SQL tables up to date, which so many tools, as well as the category presentation infrastructure depend on. Code would need to be written to intercept new categories being added/changed/rewritten on the page, either by humans or tools, to make sure that this was routed appropriately to the wikibase.

However, I don't buy the idea of 'draining' the categories as information becomes accessible by SPARQL. I think this would go down very badly with Commons. At the minimum I think there's going to have to be a long period of parallel running between the category system and SPARQL-driven searches, during which the category system will need to be kept intact. Indeed, I suspect they will still continue to have some important roles even when SPARQL is fully implemented and well populated.

So, rather than removing category statements on the file items, I think it would be better to add a qualifier indicating that the categorisation entry is accounted for by statements on the file item. It would be good if, to some extent, this could be updated by bot as categorisations are added or revised.

As you have noted above, the translation of the meaning(s) of a category into statements can be very varied. I don't know whether you would agree, but I believe it would be *extremely* helpful to be able to store the main "machine meanings" of categories in some accessible place, where it could be easily edited by all-comers (humans and machines) and accessed by all-comers. (The "category combines" statement on Wikidata is a good example of how this information might be modelled).
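
To illustrate how stored "machine meanings" could drive such a qualifier, here is a minimal sketch. Everything in it is an assumption for illustration: the `CATEGORY_MEANINGS` lookup, the plain-text property labels, and the `is_accounted_for` helper do not correspond to any existing data model.

```python
# Hypothetical store of machine meanings: category name -> statements it implies.
CATEGORY_MEANINGS = {
    "Addax nasomaculatus in Jerusalem Biblical Zoo": frozenset({
        ("depicts", "Addax nasomaculatus"),
        ("location", "Jerusalem Biblical Zoo"),
    }),
}

def is_accounted_for(category, file_statements):
    """True if every statement implied by the category is already on the file.

    Categories without a recorded meaning are never marked as accounted for,
    so nothing is flagged as redundant by default.
    """
    meaning = CATEGORY_MEANINGS.get(category)
    if not meaning:
        return False
    return meaning <= set(file_statements)
```

A bot could re-run a check like this whenever a file's statements or categories change and set or clear the qualifier accordingly, leaving the category itself in place.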

I've suggested to Sandra that by far the best way to do this would be to have a wikibase entry for each category -- it would be easily accessible, easily writable, easily inspectable with tools we substantially already have. I think it would also be a very good platform for live-testing some of the Structured Data technology at scale -- eg multi-content revisions, federation, etc -- in a known environment, not subject to the progress with the more involved designs for the file pages. I'd be very interested to have your opinion on that. I know via Sandra that the project is very wary of adding anything to the roadmap, but it seems to me it might well pay for itself as a useful test platform down the line, and I'd be curious as to whether you'd think it would add that much of an additional requirement, given that all the enabling technology appears now to either already be in place, or to main-line for the project development.

Steinsplitter added a subscriber: Steinsplitter. Edited Nov 27 2017, 4:57 PM

During the IRC meeting we talked about tags, which will be in addition to categories.

I don't think the goal should be to remove categories, because there is a wide consensus and even a policy on how they should be used:

People spent years building category structures, etc., so it would be useful to re-use them imho.

So the goal should likely be to make categories structured (so we can re-use them) and to allow translation of their titles, etc.
:-)

FDMS added a subscriber: FDMS. Nov 28 2017, 8:59 PM
Perhelion added a subscriber: Perhelion. Edited Dec 13 2017, 12:05 PM

Categories on Commons are a really big deal (many gadgets exist to maintain them). Such a major change would be very critical; I'm very skeptical (but open to new technologies).

Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board. Dec 18 2017, 3:04 PM
Elitre added a subscriber: Elitre. Jul 5 2018, 12:50 PM
This comment was removed by Elitre.