Provide access to category information from WDQS SPARQL
Open, Normal · Public

Description

The WDQS SPARQL service is fantastic, but it is a shame that it sits in such a separate silo from the category information that is used so heavily on Wikipedia (including for maintenance, and to mark topics of interest to particular WikiProjects).

Yes, there are tools like Magnus's PetScan which allow one to combine the results of searches in the two silos.

But it would be nice if one could access category information more directly, from SPARQL itself.

Something I think would be a great addition to WDQS would be a SERVICE that would take a category name (or perhaps a Wikidata category item, or a list of them) and a Wikipedia language code, and replace it with a VALUES list of all the items in that category on that Wikipedia.

I could imagine this running like a preprocessor directive, doing a straight substitution before the query is passed on to Blazegraph (or at whatever stage would be most efficient for passing in such a list).
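
To make this concrete, here is a rough sketch of what such a call might look like. The catsvc: service name and its parameters are entirely made up for illustration; only bd:serviceParam and the label service are real WDQS features.

PREFIX catsvc: <http://example.org/category-service#>    # hypothetical namespace

SELECT ?item ?itemLabel WHERE {
  SERVICE catsvc:members {                               # hypothetical service
    bd:serviceParam catsvc:category "Physicists" .       # category name (hypothetical parameter)
    bd:serviceParam catsvc:site "enwiki" .               # wiki to resolve it on (hypothetical parameter)
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}

The preprocessor would then rewrite the catsvc:members block into something like:

VALUES ?item { wd:Q123 wd:Q456 wd:Q789 }   # every item whose enwiki article is in the category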

Refinements could include also looking at the categorisation of talk pages (since, e.g., on en-wiki it is the talk pages that are categorised to show the interest of a particular WikiProject in a page), or allowing a specified recursion depth.

Other services of a similar nature that I believe would also be useful:

  • a SERVICE to return the items for all pages that have a particular template transcluded on them in a particular wiki
  • a SERVICE to return a graph of all of the categorisations for a particular item across the different wikis, e.g. ?item some_prefix:has_categorisation ?cat . ?cat some_prefix:category ?item2 . ?cat schema:isPartOf ?wiki

    The latter might be used, e.g., as part of a query to try to identify categorisations that cannot (yet) be 'explained' by the properties currently on the item; see the sketch below.
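
For illustration, a query over such a graph might look like the following; the some_prefix: names are the placeholders from the bullet above, and wd:Q144 (dog) is an arbitrary example item.

SELECT ?cat ?item2 ?wiki WHERE {
  wd:Q144 some_prefix:has_categorisation ?cat .   # all categorisation nodes for the item
  ?cat some_prefix:category ?item2 .              # the category item
  ?cat schema:isPartOf ?wiki .                    # the wiki the categorisation comes from
}
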
Jheald created this task. · Feb 9 2017, 12:37 PM

There are two ways to approach it:

  1. Use the MediaWiki API (T148245); see the sketch below
  2. Export categories as a graph and load it into WDQS (see https://gerrit.wikimedia.org/r/#/c/327862/ for a preview of how that could look)
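
For the first option, a category lookup might eventually look roughly like this, based on how the proposed wikibase:mwapi service (T148245) is shaping up; treat the exact syntax as illustrative rather than final.

PREFIX mwapi: <https://www.mediawiki.org/ontology#API/>

SELECT ?item WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "en.wikipedia.org" .      # which wiki to query
    bd:serviceParam wikibase:api "Generator" .                  # use an API generator
    bd:serviceParam mwapi:generator "categorymembers" .         # list members of a category
    bd:serviceParam mwapi:gcmtitle "Category:Physicists" .
    ?item wikibase:apiOutputItem mwapi:item .                   # bind the linked Wikidata item
  }
}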
Bugreporter added a comment (edited). · Feb 10 2017, 6:15 AM

If you want to export category metadata as RDF in MediaWiki core, there is much more that could be exposed: size, number of links, last edits (and whether an edit is flagged or was made by a bot), redirect/disambiguation status, and even the pages linked to or transcluded. All of these are supported in PetScan.

Possible example:

# Prefix declarations added for completeness; the mediawiki: namespace URI is hypothetical.
@prefix mediawiki: <http://somewiki/ontology#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://somewiki/wiki/Animal> a mediawiki:Page ;
        rdfs:label "Animal" ;
        mediawiki:size "12345"^^xsd:integer ;
        mediawiki:numberoflinks "123"^^xsd:integer ;
        mediawiki:firstedit <http://somewiki/w/index.php?oldid=12345> ;
        mediawiki:lastedit <http://somewiki/w/index.php?oldid=7777777> ;
        mediawiki:revisions <http://somewiki/w/index.php?oldid=6666666> ;
        mediawiki:revisions <http://somewiki/w/index.php?oldid=4444444> ;
        mediawiki:linkto <http://somewiki/wiki/Bird> ;
        mediawiki:transcludes <http://somewiki/wiki/Template:Example> ;
        mediawiki:incategory <http://somewiki/wiki/Category:Animals> ;
        mediawiki:isredirect "false"^^xsd:boolean ;
        mediawiki:isdisambig "false"^^xsd:boolean .

<http://somewiki/w/index.php?oldid=7777777> a mediawiki:Revision ;
        mediawiki:size "12345"^^xsd:integer ;
        mediawiki:numberoflinks "123"^^xsd:integer ;
        mediawiki:by <http://somewiki/wiki/Special:Contributions/Example> ;
        mediawiki:time "2010-04-12T00:00:00Z"^^xsd:dateTime ;
        mediawiki:flagged "true"^^xsd:boolean .

<http://somewiki/wiki/Special:Contributions/Example> a mediawiki:User ;
        mediawiki:inusergroup <http://somewiki/wiki/Special:ListUsers/sysop> ;
        mediawiki:usercreated "2004-04-12T00:00:00Z"^^xsd:dateTime ;
        mediawiki:numberofedits "88888"^^xsd:integer .
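
A query over data in that shape might then look like this, reusing the same hypothetical mediawiki: vocabulary:

# Non-redirect pages in Category:Animals whose latest edit is flagged
SELECT ?page WHERE {
  ?page a mediawiki:Page ;
        mediawiki:incategory <http://somewiki/wiki/Category:Animals> ;
        mediawiki:isredirect "false"^^xsd:boolean ;
        mediawiki:lastedit ?rev .
  ?rev mediawiki:flagged "true"^^xsd:boolean .
}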

We may finally get rid of PetScan, and then rename WDQS to the Wikimedia Query Service.

I do not think we plan to represent all MediaWiki database contents in RDF just yet. Categories may make sense, since categories are a graph-like structure anyway, and may be useful for Structured Commons; see the sketch below. Anything else would require much more planning and, probably, a different update mechanism.
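
If categories were loaded as a graph along the lines of that patch, recursive membership would reduce to a property path. A minimal sketch, assuming a mediawiki:isInCategory subcategory predicate (the predicate name is an assumption, not a settled ontology):

# All subcategories of Category:Animals, to any depth
SELECT ?subcat WHERE {
  ?subcat mediawiki:isInCategory+ <http://somewiki/wiki/Category:Animals> .
}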

Smalyshev triaged this task as Low priority. · Feb 14 2017, 9:45 PM
Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board. · May 5 2017, 3:38 PM
schana added a subscriber: schana. · May 21 2017, 3:12 PM

Change 359055 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/vendor@master] Add Purtle library for RDF generation

https://gerrit.wikimedia.org/r/359055

Change 359055 merged by jenkins-bot:
[mediawiki/vendor@master] Add Purtle library for RDF generation

https://gerrit.wikimedia.org/r/359055

Smalyshev raised the priority of this task from Low to Normal. · Fri, Jul 14, 8:24 PM

Change 327862 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/core@master] Produce RDF dump of all categories and subcategories in a wiki.

https://gerrit.wikimedia.org/r/327862
