Deepcat search returns incomplete results
Closed, Duplicate (Public)

Description

An enWiki search for deepcat:"Musicals by topic" returns 87 results, but a visual inspection of the category tree shows that it contains well over 100 pages. As a more specific example, a search for deepcat:"Musicals by topic" intitle:"Kid" returns 1 result (Kid Boots), but Kid Victory is at least one other page that should have been returned. Both pages are in Category:LGBT-related musicals, which is a direct subcat of "Musicals by topic".

This category should not be hitting the documented limits (a maximum depth of 5 and a maximum of 256 categories). It contains only 8 direct subcats and 1 grandchild category (https://en.wikipedia.org/wiki/Category:LGBT-related_musical_films), nothing else. It is in fact the category used as an example on Wikipedia's search help page.

Event Timeline

The SPARQL query endpoint that provides the categories to search against doesn't appear to be returning all of the expected sub-categories:

ebernhardson@mwmaint1002:~$ curl -s -XPOST http://wdqs-internal.discovery.wmnet/bigdata/namespace/categories/sparql?format=json -d 'query=SELECT ?out WHERE {
      SERVICE mediawiki:categoryTree {
          bd:serviceParam mediawiki:start <https://en.wikipedia.org/wiki/Category:Musicals_by_topic> .
          bd:serviceParam mediawiki:direction "Reverse" .
          bd:serviceParam mediawiki:depth 5 .
      }
} ORDER BY ASC(?depth)
LIMIT 50' | jq '.results.bindings | map(.out.value)'
[
  "https://en.wikipedia.org/wiki/Category:Musicals_by_topic",
  "https://en.wikipedia.org/wiki/Category:Musicals_about_writers",
  "https://en.wikipedia.org/wiki/Category:Musicals_about_World_War_II",
  "https://en.wikipedia.org/wiki/Category:Musicals_set_in_the_Roaring_Twenties",
  "https://en.wikipedia.org/wiki/Category:Plays_and_musicals_about_disability",
  "https://en.wikipedia.org/wiki/Category:Musicals_about_World_War_I",
  "https://en.wikipedia.org/wiki/Category:Musicals_about_the_Great_Depression"
]

In particular this is missing:

  • Category:LGBT-related musicals
  • Category:Teen musicals
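The check above can be scripted. What follows is a minimal Python sketch, not the code CirrusSearch actually runs: build_category_tree_query and missing_categories are hypothetical helper names, the query text simply mirrors the curl call above, and the sample response is rebuilt from the list pasted above. Sending the query to the endpoint is deliberately left out.

```python
import json

PREFIX = "https://en.wikipedia.org/wiki/Category:"


def build_category_tree_query(start_url, depth=5, limit=50):
    """Return the categoryTree query text used in the curl call above."""
    return (
        "SELECT ?out WHERE {\n"
        "  SERVICE mediawiki:categoryTree {\n"
        f"    bd:serviceParam mediawiki:start <{start_url}> .\n"
        '    bd:serviceParam mediawiki:direction "Reverse" .\n'
        f"    bd:serviceParam mediawiki:depth {depth} .\n"
        "  }\n"
        "} ORDER BY ASC(?depth)\n"
        f"LIMIT {limit}"
    )


def missing_categories(response_text, expected_urls):
    """Return the expected category URLs that a SPARQL JSON response lacks."""
    bindings = json.loads(response_text)["results"]["bindings"]
    returned = {binding["out"]["value"] for binding in bindings}
    return sorted(set(expected_urls) - returned)


# Category names copied from the endpoint output pasted above.
returned_names = [
    "Musicals_by_topic",
    "Musicals_about_writers",
    "Musicals_about_World_War_II",
    "Musicals_set_in_the_Roaring_Twenties",
    "Plays_and_musicals_about_disability",
    "Musicals_about_World_War_I",
    "Musicals_about_the_Great_Depression",
]
sample_response = json.dumps(
    {"results": {"bindings": [
        {"out": {"type": "uri", "value": PREFIX + name}}
        for name in returned_names
    ]}}
)

# The two sub-categories this comment reports as missing.
expected = [PREFIX + "LGBT-related_musicals", PREFIX + "Teen_musicals"]
missing = missing_categories(sample_response, expected)
print(missing)
```

With the sample data, both expected sub-categories are reported as missing, matching the manual comparison.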

Checked the latest dump (which should be loaded into SPARQL): https://dumps.wikimedia.org/other/categoriesrdf/20191116/enwiki-20191116-categories.ttl.gz

The RDF includes the statements:

<https://en.wikipedia.org/wiki/Category:Teen_musicals> mediawiki:isInCategory <https://en.wikipedia.org/wiki/Category:Musicals_by_topic>,
        <https://en.wikipedia.org/wiki/Category:Teens_in_fiction> .
<https://en.wikipedia.org/wiki/Category:LGBT-related_musicals> mediawiki:isInCategory <https://en.wikipedia.org/wiki/Category:LGBT_portrayals_in_media>,
        <https://en.wikipedia.org/wiki/Category:LGBT_theatre>,
        <https://en.wikipedia.org/wiki/Category:Musicals_by_topic> .
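To pull these parent links out of a dump excerpt programmatically, something like the following Python sketch could work. It is a regex over the simple "subject mediawiki:isInCategory object, object ." shape shown above, not a real Turtle parser (a library such as rdflib would be the safer choice for a full dump), and parent_categories is a hypothetical helper name.

```python
import re

# Matches "<child> mediawiki:isInCategory <p1>, <p2> ." as in the excerpt above.
STATEMENT = re.compile(
    r"<([^>]+)>\s+mediawiki:isInCategory\s+((?:<[^>]+>\s*,?\s*)+)\."
)


def parent_categories(turtle_text):
    """Map each child category IRI to the set of its parent category IRIs."""
    edges = {}
    for match in STATEMENT.finditer(turtle_text):
        child, objects = match.group(1), match.group(2)
        edges.setdefault(child, set()).update(re.findall(r"<([^>]+)>", objects))
    return edges


# The two statements quoted from the dump above.
DUMP_EXCERPT = """\
<https://en.wikipedia.org/wiki/Category:Teen_musicals> mediawiki:isInCategory <https://en.wikipedia.org/wiki/Category:Musicals_by_topic>,
        <https://en.wikipedia.org/wiki/Category:Teens_in_fiction> .
<https://en.wikipedia.org/wiki/Category:LGBT-related_musicals> mediawiki:isInCategory <https://en.wikipedia.org/wiki/Category:LGBT_portrayals_in_media>,
        <https://en.wikipedia.org/wiki/Category:LGBT_theatre>,
        <https://en.wikipedia.org/wiki/Category:Musicals_by_topic> .
"""

edges = parent_categories(DUMP_EXCERPT)
root = "https://en.wikipedia.org/wiki/Category:Musicals_by_topic"
for child, parents in sorted(edges.items()):
    print(child, "is in Musicals_by_topic:", root in parents)
```

Both categories list Musicals_by_topic as a parent in the dump, so the data the endpoint should have been loaded with is correct.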

Oddly, if we ask Blazegraph about one of these categories, it doesn't seem to know anything:

ebernhardson@mwmaint1002:~$ curl -s -XPOST http://wdqs-internal.discovery.wmnet/bigdata/namespace/categories/sparql?format=json -d 'query=SELECT ?out WHERE {
    <https://en.wikipedia.org/wiki/Category:Teen_musicals> mediawiki:isInCategory ?out
} LIMIT 50'
{
  "head" : {
    "vars" : [ "out" ]
  },
  "results" : {
    "bindings" : [ ]
  }
}

Asking about a different category in the same way works fine:

ebernhardson@mwmaint1002:~$ curl -s -XPOST http://wdqs-internal.discovery.wmnet/bigdata/namespace/categories/sparql?format=json -d 'query=SELECT ?out WHERE {
    <https://en.wikipedia.org/wiki/Category:Musicals_about_writers> mediawiki:isInCategory ?out
} LIMIT 50' | jq '.results.bindings | map(.out.value)'
[
  "https://en.wikipedia.org/wiki/Category:Works_about_writers",
  "https://en.wikipedia.org/wiki/Category:Musicals_by_topic"
]

Summary: It seems the dumps aren't being imported into Blazegraph properly. Perhaps some of the triples are erroring out during the import?
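One way to test that hypothesis more broadly would be to diff the isInCategory edges in the dump against what the endpoint returns. The sketch below is only an illustration in Python: missing_edges is a hypothetical helper, and the two dictionaries are filled in by hand from the dump excerpt and the empty query result shown above rather than fetched live.

```python
BASE = "https://en.wikipedia.org/wiki/Category:"


def missing_edges(dump_edges, endpoint_edges):
    """Return (child, parent) pairs present in the dump but not in the endpoint."""
    missing = set()
    for child, parents in dump_edges.items():
        loaded = endpoint_edges.get(child, set())
        missing.update((child, parent) for parent in parents - loaded)
    return missing


# Hand-copied from the dump excerpt above.
dump_edges = {
    BASE + "Teen_musicals": {
        BASE + "Musicals_by_topic",
        BASE + "Teens_in_fiction",
    },
}

# Hand-copied from the endpoint result above: no bindings for Teen_musicals.
endpoint_edges = {
    BASE + "Teen_musicals": set(),
}

for child, parent in sorted(missing_edges(dump_edges, endpoint_edges)):
    print("not loaded:", child, "->", parent)
```

Running the same comparison over every category in a dump would show whether the gap is limited to a few categories or is widespread.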

Hm, but metadata about the category is present:

$ curl https://query.wikidata.org/bigdata/namespace/categories/sparql -H 'Accept: text/tab-separated-values' -d query='SELECT * WHERE { <https://en.wikipedia.org/wiki/Category:Teen_musicals> ?p ?o. }'
?p	?o
<https://www.mediawiki.org/ontology#pages>	16
<https://www.mediawiki.org/ontology#subcategories>	0
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>	<https://www.mediawiki.org/ontology#Category>
<http://www.w3.org/2000/01/rdf-schema#label>	"Teen musicals"
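The same observation can be checked mechanically. This is a rough Python sketch over the TSV output above; predicates_from_tsv is a hypothetical helper, and the full IRI for mediawiki:isInCategory is an assumption based on the https://www.mediawiki.org/ontology# namespace visible in the output.

```python
import csv
import io

# Assumed expansion of mediawiki:isInCategory, based on the namespace above.
IS_IN_CATEGORY = "<https://www.mediawiki.org/ontology#isInCategory>"


def predicates_from_tsv(tsv_text):
    """Return the set of predicates in a two-column (?p, ?o) TSV result."""
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = list(reader)
    # Skip the header row ("?p", "?o") and any blank lines.
    return {row[0] for row in rows[1:] if row}


# The TSV result pasted above.
TSV_OUTPUT = (
    "?p\t?o\n"
    "<https://www.mediawiki.org/ontology#pages>\t16\n"
    "<https://www.mediawiki.org/ontology#subcategories>\t0\n"
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>\t"
    "<https://www.mediawiki.org/ontology#Category>\n"
    '<http://www.w3.org/2000/01/rdf-schema#label>\t"Teen musicals"\n'
)

predicates = predicates_from_tsv(TSV_OUTPUT)
print("has isInCategory:", IS_IN_CATEGORY in predicates)
```

The category node itself is loaded with its page count, type, and label; only the membership triples are absent.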

EBernhardson moved this task from needs triage to Wikibase Search on the Discovery-Search board.
SD0001 added a subscriber: SD0001.

Is this being fixed? This bug is affecting the work of bots on Wikipedia, so it would be great if it could be looked into. Thanks.