
Deepcat search returns incomplete results
Open, Medium, Public

Description

An enWiki search for deepcat:"Musicals by topic" returns 87 results, but a visual inspection of the category tree shows that it contains well over 100 pages. As a more specific example, a search for deepcat:"Musicals by topic" intitle:"Kid" returns 1 result (Kid Boots), but Kid Victory is at least one other page that should have been returned. Both pages are in Category:LGBT-related musicals, which is a direct subcat of "Musicals by topic".

This category should not be hitting the documented limits (max depth 5, max 256 categories). It contains only 8 direct subcats and 1 grandchild category (https://en.wikipedia.org/wiki/Category:LGBT-related_musical_films), and nothing else. This category is in fact the one used as an example on Wikipedia's search help page.
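
To make the limits concrete, here is a minimal Python sketch of a breadth-first walk bounded by depth and category count. It only illustrates why a tree this small should fit well within the limits; it is not how deepcat is actually resolved (that goes through the SPARQL categoryTree service queried further down in this task). The function name `collect_subcategories` and the in-memory `CHILDREN` map are hypothetical. The category names are the ones that appear in this task, and placing the grandchild under LGBT-related musicals is an assumption.

```python
from collections import deque

ROOT = "Musicals by topic"

# Hypothetical in-memory copy of the tree described above: 8 direct
# subcategories and 1 grandchild (the grandchild's parent is an assumption).
CHILDREN = {
    ROOT: [
        "LGBT-related musicals",
        "Teen musicals",
        "Musicals about writers",
        "Musicals about World War I",
        "Musicals about World War II",
        "Musicals about the Great Depression",
        "Musicals set in the Roaring Twenties",
        "Plays and musicals about disability",
    ],
    "LGBT-related musicals": ["LGBT-related musical films"],
}


def collect_subcategories(children, start, max_depth=5, max_categories=256):
    """Breadth-first walk returning (category, depth) pairs within both limits."""
    seen = {start}
    order = [(start, 0)]
    queue = deque([(start, 0)])
    while queue:
        category, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not descend below the depth limit
        for child in children.get(category, ()):
            if child in seen:
                continue  # category graphs can contain repeats and cycles
            if len(order) >= max_categories:
                return order  # category limit reached, stop early
            seen.add(child)
            order.append((child, depth + 1))
            queue.append((child, depth + 1))
    return order


result = collect_subcategories(CHILDREN, ROOT)
print(len(result), max(depth for _, depth in result))  # prints: 10 2
```

With 10 categories at a maximum depth of 2, neither limit comes close to being reached, so truncation by the limits is an unlikely explanation for the missing results.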

Event Timeline

Restricted Application added a subscriber: Aklapper. · Nov 19 2019, 7:37 PM
Restricted Application added a project: Wikidata. · Nov 19 2019, 8:20 PM

The SPARQL query endpoint that provides the categories to search against doesn't appear to be returning all of the expected sub-categories:

ebernhardson@mwmaint1002:~$ curl -s -XPOST http://wdqs-internal.discovery.wmnet/bigdata/namespace/categories/sparql?format=json -d 'query=SELECT ?out WHERE {
      SERVICE mediawiki:categoryTree {
          bd:serviceParam mediawiki:start <https://en.wikipedia.org/wiki/Category:Musicals_by_topic> .
          bd:serviceParam mediawiki:direction "Reverse" .
          bd:serviceParam mediawiki:depth 5 .
      }
} ORDER BY ASC(?depth)
LIMIT 50' | jq '.results.bindings | map(.out.value)'
[
  "https://en.wikipedia.org/wiki/Category:Musicals_by_topic",
  "https://en.wikipedia.org/wiki/Category:Musicals_about_writers",
  "https://en.wikipedia.org/wiki/Category:Musicals_about_World_War_II",
  "https://en.wikipedia.org/wiki/Category:Musicals_set_in_the_Roaring_Twenties",
  "https://en.wikipedia.org/wiki/Category:Plays_and_musicals_about_disability",
  "https://en.wikipedia.org/wiki/Category:Musicals_about_World_War_I",
  "https://en.wikipedia.org/wiki/Category:Musicals_about_the_Great_Depression"
]

In particular, the result is missing:

  • Category:LGBT-related musicals
  • Category:Teen musicals
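
The comparison above can be reproduced offline with a small Python sketch. It mirrors the jq filter `.results.bindings | map(.out.value)` on a SPARQL JSON response and then lists which expected subcategories are absent. The helper names `binding_values` and `missing_categories` are hypothetical, and `sample_response` is rebuilt from the output pasted above rather than fetched from the endpoint.

```python
import json

PREFIX = "https://en.wikipedia.org/wiki/Category:"


def binding_values(response_text, var="out"):
    """Python equivalent of jq '.results.bindings | map(.out.value)'."""
    data = json.loads(response_text)
    return [b[var]["value"] for b in data["results"]["bindings"] if var in b]


def missing_categories(expected, returned):
    """Return the expected categories that are absent from the returned list."""
    returned_set = set(returned)
    return [category for category in expected if category not in returned_set]


# Names copied from the endpoint output above (the root category comes first).
returned_names = [
    "Musicals_by_topic",
    "Musicals_about_writers",
    "Musicals_about_World_War_II",
    "Musicals_set_in_the_Roaring_Twenties",
    "Plays_and_musicals_about_disability",
    "Musicals_about_World_War_I",
    "Musicals_about_the_Great_Depression",
]

# Sample response in the SPARQL JSON results shape, built from those names.
sample_response = json.dumps({
    "head": {"vars": ["out"]},
    "results": {
        "bindings": [
            {"out": {"type": "uri", "value": PREFIX + name}}
            for name in returned_names
        ]
    },
})

# The 8 direct subcategories expected from the category page.
expected = [
    PREFIX + name
    for name in returned_names[1:] + ["LGBT-related_musicals", "Teen_musicals"]
]

missing = missing_categories(expected, binding_values(sample_response))
print(missing)
```

Against a live endpoint, `sample_response` would be replaced by the body of the curl response.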

I checked the latest dump (which should be what is loaded into the SPARQL endpoint): https://dumps.wikimedia.org/other/categoriesrdf/20191116/enwiki-20191116-categories.ttl.gz

The RDF includes the statements:

<https://en.wikipedia.org/wiki/Category:Teen_musicals> mediawiki:isInCategory <https://en.wikipedia.org/wiki/Category:Musicals_by_topic>,
        <https://en.wikipedia.org/wiki/Category:Teens_in_fiction> .
<https://en.wikipedia.org/wiki/Category:LGBT-related_musicals> mediawiki:isInCategory <https://en.wikipedia.org/wiki/Category:LGBT_portrayals_in_media>,
        <https://en.wikipedia.org/wiki/Category:LGBT_theatre>,
        <https://en.wikipedia.org/wiki/Category:Musicals_by_topic> .
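
For checking more categories than can be eyeballed, statements of this shape can be pulled out of the dump with a few lines of Python. This is a rough sketch, not a Turtle parser: it only recognises the `subject mediawiki:isInCategory object, object .` layout quoted above, and `parent_categories` is a hypothetical helper name. A real check would stream the decompressed dump, or use a proper RDF library, instead of the inline `SAMPLE`.

```python
import re

# Matches "<subject> mediawiki:isInCategory <obj>, <obj> ." across line breaks.
STATEMENT = re.compile(
    r"<([^>]+)>\s+mediawiki:isInCategory\s+((?:<[^>]+>\s*,?\s*)+)\."
)
IRI = re.compile(r"<([^>]+)>")


def parent_categories(ttl_text):
    """Map each category IRI to the set of parent IRIs stated in the text."""
    parents = {}
    for match in STATEMENT.finditer(ttl_text):
        subject, objects = match.group(1), match.group(2)
        parents.setdefault(subject, set()).update(IRI.findall(objects))
    return parents


PREFIX = "https://en.wikipedia.org/wiki/Category:"

# The two statements quoted above, copied from the dump.
SAMPLE = """\
<https://en.wikipedia.org/wiki/Category:Teen_musicals> mediawiki:isInCategory <https://en.wikipedia.org/wiki/Category:Musicals_by_topic>,
        <https://en.wikipedia.org/wiki/Category:Teens_in_fiction> .
<https://en.wikipedia.org/wiki/Category:LGBT-related_musicals> mediawiki:isInCategory <https://en.wikipedia.org/wiki/Category:LGBT_portrayals_in_media>,
        <https://en.wikipedia.org/wiki/Category:LGBT_theatre>,
        <https://en.wikipedia.org/wiki/Category:Musicals_by_topic> .
"""

parents = parent_categories(SAMPLE)
for category, found in sorted(parents.items()):
    print(category, len(found))
```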

Oddly, if we ask Blazegraph directly about one of these categories, it returns no isInCategory triples at all:

ebernhardson@mwmaint1002:~$ curl -s -XPOST http://wdqs-internal.discovery.wmnet/bigdata/namespace/categories/sparql?format=json -d 'query=SELECT ?out WHERE {
>     <https://en.wikipedia.org/wiki/Category:Teen_musicals> mediawiki:isInCategory ?out
> } LIMIT 50'
{
  "head" : {
    "vars" : [ "out" ]
  },
  "results" : {
    "bindings" : [ ]
  }
}

Asking about a different category in the same way works fine:

ebernhardson@mwmaint1002:~$ curl -s -XPOST http://wdqs-internal.discovery.wmnet/bigdata/namespace/categories/sparql?format=json -d 'query=SELECT ?out WHERE {
    <https://en.wikipedia.org/wiki/Category:Musicals_about_writers> mediawiki:isInCategory ?out
} LIMIT 50' | jq '.results.bindings | map(.out.value)'
[
  "https://en.wikipedia.org/wiki/Category:Works_about_writers",
  "https://en.wikipedia.org/wiki/Category:Musicals_by_topic"
]

Summary: it seems the dumps aren't being imported into Blazegraph properly. Perhaps some of the triples are erroring out during the import?
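
One way to test that hypothesis at scale would be to compare the parent edges stated in the dump with what the endpoint returns for the same category. The sketch below shows only the comparison step. `find_missing_edges` is a hypothetical helper, and `lookup_parents` is a stand-in for a real query against the categories namespace; here it is backed by a small dict of the answers seen above. The dump entry for Musicals_about_writers is assumed to match the endpoint's answer, since that statement was not checked in the dump.

```python
PREFIX = "https://en.wikipedia.org/wiki/Category:"


def find_missing_edges(dump_parents, lookup_parents):
    """Return {category: parents stated in the dump but not returned by lookup}."""
    missing = {}
    for category, expected in dump_parents.items():
        absent = set(expected) - set(lookup_parents(category))
        if absent:
            missing[category] = absent
    return missing


# Parent edges as stated in the dump (the second entry is an assumption).
dump_parents = {
    PREFIX + "Teen_musicals": {
        PREFIX + "Musicals_by_topic",
        PREFIX + "Teens_in_fiction",
    },
    PREFIX + "Musicals_about_writers": {
        PREFIX + "Works_about_writers",
        PREFIX + "Musicals_by_topic",
    },
}

# Answers observed from the endpoint in the two queries above.
endpoint_answers = {
    PREFIX + "Teen_musicals": [],
    PREFIX + "Musicals_about_writers": [
        PREFIX + "Works_about_writers",
        PREFIX + "Musicals_by_topic",
    ],
}


def lookup_parents(category):
    """Stand-in for a live SPARQL lookup; returns the recorded answer."""
    return endpoint_answers.get(category, [])


missing_edges = find_missing_edges(dump_parents, lookup_parents)
for category, absent in missing_edges.items():
    print(category, "is missing", len(absent), "parent edge(s)")
```

Run over the whole dump, the set of affected categories might show a pattern (for example, a particular chunk of the import) that narrows down where triples are being lost.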

Hm, but metadata about the category is present:

$ curl https://query.wikidata.org/bigdata/namespace/categories/sparql -H 'Accept: text/tab-separated-values' -d query='SELECT * WHERE { <https://en.wikipedia.org/wiki/Category:Teen_musicals> ?p ?o. }'
?p	?o
<https://www.mediawiki.org/ontology#pages>	16
<https://www.mediawiki.org/ontology#subcategories>	0
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>	<https://www.mediawiki.org/ontology#Category>
<http://www.w3.org/2000/01/rdf-schema#label>	"Teen musicals"
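
To make the gap explicit, the TSV output can be parsed and checked for the membership predicate. This is a small Python sketch; `parse_tsv_bindings` and `has_predicate` are hypothetical helper names, and `SAMPLE_TSV` is the output pasted above. The `isInCategory` IRI is assumed to be the expansion of the `mediawiki:` prefix, which matches the namespace of the other predicates shown.

```python
MEDIAWIKI = "https://www.mediawiki.org/ontology#"

# The TSV response shown above, one row per line, tab-separated.
SAMPLE_TSV = "\n".join([
    "?p\t?o",
    "<https://www.mediawiki.org/ontology#pages>\t16",
    "<https://www.mediawiki.org/ontology#subcategories>\t0",
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>\t"
    "<https://www.mediawiki.org/ontology#Category>",
    '<http://www.w3.org/2000/01/rdf-schema#label>\t"Teen musicals"',
])


def parse_tsv_bindings(tsv_text):
    """Turn a SPARQL TSV response into a list of {variable: value} dicts."""
    lines = [line for line in tsv_text.splitlines() if line.strip()]
    header = [name.lstrip("?") for name in lines[0].split("\t")]
    return [dict(zip(header, line.split("\t"))) for line in lines[1:]]


def has_predicate(rows, predicate_iri):
    """True if any row uses the given predicate IRI."""
    return any(row.get("p") == f"<{predicate_iri}>" for row in rows)


rows = parse_tsv_bindings(SAMPLE_TSV)
print(has_predicate(rows, MEDIAWIKI + "pages"))
print(has_predicate(rows, MEDIAWIKI + "isInCategory"))
```

So the category node and its metadata made it into the store, while its isInCategory edges did not.
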
EBernhardson triaged this task as Medium priority. · Dec 9 2019, 11:20 PM
EBernhardson moved this task from needs triage to Wikidata Search on the Discovery-Search board.