Page MenuHomePhabricator

Provide access to category information from WDQS SPARQL
Closed, ResolvedPublic

Description

The WDQS SPARQL service is fantastic; but it is a shame that it seems to be in such a separate silo from the category information that is used so heavily on Wikipedia (including for maintenance, and to mark topics of interest to given wikiprojects.

Yes there are tools like Magnus's Pet scan which allow one to combine the results of searches in the two silos.

But it would be nice if one could access category information more directly, from SPARQL itself.

Something I think would be a great addition to WDQS would be a SERVICE that would take a category name (or perhaps wikidata category item, or a list of them) and a wikipedia language code, and replace it with a VALUES list of all the items in that category in that wikipedia.

I could imagine this running like a preprocessor directive, doing a straight substitution before the query is passed on to BlazeGraph. (Or at whenever would be the most efficient stage to pass in such a list).

Refinements could include also looking at categorisation of talk pages (since eg on en-wiki it is the talk pages that are categorised to show the interest of a particular wiki-project in a page); or allowing a specified recurse depth.

Other services of a similar nature that I believe would also be useful would be

  • a SERVICE to return the items for all pages that have a particular template transcluded on them in a particular wiki
  • a SERVICE to return a graph of all of the categorisations for a particular item across the different wikis, eg ?item some_prefix:has_categorisation ?cat . ?cat some_prefix:category ?item2 . ?cat schema:isPartOf ?wiki

    The latter might be used eg as part of a query to try to identify categorisations that cannot (yet) be 'explained' by the properties currently on the object.

Related Objects

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

There are two ways to approach it:

  1. Use MW API (T148245)
  2. Export categories as graph and load into WDQS (see https://gerrit.wikimedia.org/r/#/c/327862/ for preview of how it could look like)

If you want to export categories metadata as RDF in MediaWiki core there're much more that can exposed: size, number of links, last edits and whether it is flagged/make by bot, the redirect/disambig status, and even pages links to or transcluded. All are supported in PetScan.

Possible example:

<http://somewiki/wiki/Animal> a mediawiki:Page ;
        rdfs:label "Animal" ;
        mediawiki:size "12345"^^xsd:integer ;
        mediawiki:numberoflinks "123"^^xsd:integer ;
        mediawiki:firstedit <http://somewiki/w/index.php?oldid=12345> ;
        mediawiki:lastedit <http://somewiki/w/index.php?oldid=7777777> ;
        mediawiki:revisions <http://somewiki/w/index.php?oldid=6666666> ;
        mediawiki:revisions <http://somewiki/w/index.php?oldid=4444444> ;
        mediawiki:linkto <http://somewiki/wiki/Bird> ;
        mediawiki:transcludes <http://somewiki/wiki/Template:Example> ;
        mediawiki:incategory <http://somewiki/wiki/Category:Animals> ;
        mediawiki:isredirect  "false"^^xsd:boolean ;
        mediawiki:isdisambig  "false"^^xsd:boolean .

<http://somewiki/w/index.php?oldid=7777777> a mediawiki:Revision ;
        mediawiki:size "12345"^^xsd:integer ;
        mediawiki:numberoflinks "123"^^xsd:integer ;
        mediawiki:by <http://somewiki/wiki/Special:Contributions/Example> ;
        mediawiki:time "+2010-04-12T00:00:00Z"^^xsd:dateTime ;
        mediawiki:flagged  "true"^^xsd:boolean .

<http://somewiki/wiki/Special:Contributions/Example> a mediawiki:User ;
        mediawiki:inusergroup  <http://somewiki/wiki/Special:ListUsers/sysop>; 
        mediawiki:usercreated "+2004-04-12T00:00:00Z"^^xsd:dateTime ;
        mediawiki:numberofedits "88888"^^xsd:integer .

We may finally get rid of PetScan, and than rename WDQS to Wikimedia Query Service.

I do not think we plan to represent all mediawiki database contents in RDF just yet. Categories may make sense since categories are a graph-like structure anyway, and may be useful for structured commons. Anything else would require much more planning. And, probably, a different update mechanism.

Change 359055 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/vendor@master] Add Purtle library for RDF generation

https://gerrit.wikimedia.org/r/359055

Change 359055 merged by jenkins-bot:
[mediawiki/vendor@master] Add Purtle library for RDF generation

https://gerrit.wikimedia.org/r/359055

Smalyshev raised the priority of this task from Low to Medium.Jul 14 2017, 8:24 PM

Change 327862 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/core@master] Produce RDF dump of all categories and subcategories in a wiki.

https://gerrit.wikimedia.org/r/327862

Change 327862 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/core@master] Produce RDF dump of all categories and subcategories in a wiki.

https://gerrit.wikimedia.org/r/327862

Change 373354 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] [WIP] Add RDF dumps for categories

https://gerrit.wikimedia.org/r/373354

Change 373395 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] Add script for creating new namespaces

https://gerrit.wikimedia.org/r/373395

Change 373395 merged by jenkins-bot:
[wikidata/query/rdf@master] Add script for creating new namespaces

https://gerrit.wikimedia.org/r/373395

Change 373404 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[operations/puppet@production] Enable access to arbitrary namespaces for WDQS

https://gerrit.wikimedia.org/r/373404

Change 327862 merged by jenkins-bot:
[mediawiki/core@master] Produce RDF dump of all categories and subcategories in a wiki.

https://gerrit.wikimedia.org/r/327862

Change 373404 merged by Gehel:
[operations/puppet@production] Enable access to arbitrary namespaces for WDQS

https://gerrit.wikimedia.org/r/373404

Change 377372 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] Add script for loading category data

https://gerrit.wikimedia.org/r/377372

Change 377372 merged by jenkins-bot:
[wikidata/query/rdf@master] Add script for loading category data

https://gerrit.wikimedia.org/r/377372

Change 379838 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] Add stored query for category traversal

https://gerrit.wikimedia.org/r/379838

Change 379838 merged by jenkins-bot:
[wikidata/query/rdf@master] Add stored query for category traversal

https://gerrit.wikimedia.org/r/379838

I somehow had the impression that this was complete. Is that wrong?

We have the category tree in Blazegraph (in a separate namespace), but I think this task was asking for the membership of individual pages in categories (i. e. not just subcategories), which we don’t have yet.

Yes we track category tree but not category membership for individual pages (much bigger data set, obviously).

Gehel claimed this task.
Gehel subscribed.

The initial work has been done, it is unclear what additional work needs to be done and what value this would bring.