
Deepcategory returns only very few results
Open, Medium, Public · 8 Estimated Story Points

Description

And for the life of me I can't figure out why. I don't use it often, so I'm not sure if anything changed, though as I was searching for a solution I found T243101. Dunno if that's related.

For example: https://commons.wikimedia.org/w/index.php?search=deepcat%3A%22Mike+Pence+in+2020%22&title=Special%3ASearch&go=Go&ns6=1 returns 11 files. Some are from https://commons.wikimedia.org/wiki/Category:Donald_Trump_rally_in_Wildwood,_New_Jersey_(January_28,_2020), some from https://commons.wikimedia.org/wiki/Category:Mike_Pence_in_2020, and a few from elsewhere. It's unclear how these are selected, but https://commons.wikimedia.org/wiki/Category:Mike_Pence_in_January_2020 alone contains 188 files, so obviously 11 is not enough.

https://commons.wikimedia.org/w/index.php?sort=relevance&search=deepcat%3A%22F.+van+der+Kraaij+Collection%22&title=Special:Search&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns6=1 returns 248 results, but https://commons.wikimedia.org/wiki/Category:Images_donated_by_F._van_der_Kraaij alone contains 462 files.

Documentation at https://www.mediawiki.org/wiki/Help:CirrusSearch#Deepcategory is virtually non-existent. It says "The depth of the tree is limited by 5 levels currently (configurable)", but how, where, or by whom this can be configured is a complete mystery. Even if the issue could be resolved by adding undocumented parameters, that would mean the defaults are bad.

Event Timeline

Restricted Application added a project: Discovery-Search. · Mar 1 2020, 10:11 AM
Restricted Application added a subscriber: Aklapper.
AlexisJazz updated the task description. · Mar 1 2020, 11:24 AM
EBernhardson triaged this task as Medium priority. · Mar 12 2020, 7:10 PM
EBernhardson moved this task from needs triage to Current work on the Discovery-Search board.

This is the query being issued. It looks reasonable, but the results from blazegraph are basically empty. Perhaps something is wrong with the process that updates categories in blazegraph? @dcausse, perhaps you have ideas of where that process lives, or where to find error logs?

query issued to blazegraph for deepcat: https://w.wiki/KHv
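
For reference, the deepcat query has roughly the shape below. This is a sketch reconstructed from how the CirrusSearch deepcat feature is generally described, not a copy of the linked query, so the exact parameter values may differ: it walks mediawiki:isInCategory edges in reverse (parent towards children) starting from the given category, bounded by a depth limit and a visited-category limit.

PREFIX gas: <http://www.bigdata.com/rdf/gas#>
PREFIX mediawiki: <https://www.mediawiki.org/ontology#>

SELECT ?out WHERE {
  SERVICE gas:service {
    # breadth-first search over the category graph
    gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" .
    # follow child->parent edges backwards, i.e. expand parent -> children
    gas:program gas:linkType mediawiki:isInCategory .
    gas:program gas:traversalDirection "Reverse" .
    gas:program gas:in <https://commons.wikimedia.org/wiki/Category:Mike_Pence_in_2020> .
    gas:program gas:out ?out .
    # the "5 levels" and 256-category limits discussed elsewhere in this task
    gas:program gas:maxIterations 5 .
    gas:program gas:maxVisited 256 .
  }
}

If the isInCategory edges are missing from the graph, this traversal returns little more than the start category, which matches the behaviour reported above.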

dcausse claimed this task. Edited · Mar 17 2020, 10:50 AM

Had a quick look, and indeed we somehow missed this update.
I'm able to find the triple in the RDF dump:

<https://commons.wikimedia.org/wiki/Category:Mike_Pence_in_January_2020> mediawiki:isInCategory <https://commons.wikimedia.org/wiki/Category:Mike_Pence_in_2020> .

But this one is not found in blazegraph:

select (count(*) as ?cnt) where {
  <https://commons.wikimedia.org/wiki/Category:Mike_Pence_in_January_2020> ?p <https://commons.wikimedia.org/wiki/Category:Mike_Pence_in_2020>
}

The immediate fix is to reload the categories.
The root cause needs more investigation, but since the data loss is not isolated to a single blazegraph server, we should look at how the daily diffs are generated and transferred:

  • Category:Mike_Pence_in_2020 was created on 2020-01-07T22:14:04
  • Category:Mike_Pence_in_January_2020 on 2020-01-07T22:13:22

Looking closer at the logs, it seems that we fail to apply some updates:

07:34:14.638 [qtp226170135-38760] ERROR c.b.r.sail.webapp.BigdataRDFServlet - cause=java.util.concurrent.ExecutionException: java.lang.StackOverflowError, query=SPARQL-UPDATE: updateStr=prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
[sparql]
 req.requestURI=/bigdata/namespace/categories20190115/sparql, req.xForwardedFor=null, req.queryString=null, req.method=POST, req.remoteHost=localhost, req.requestURL=http://localhost:9990/bigdata/namespace/categories20190115/sparql, req.userAgent=curl/7.52.1
java.lang.StackOverflowError: null
        at com.bigdata.rdf.sail.sparql.ast.SyntaxTreeBuilderTokenManager.jjMoveNfa_0(SyntaxTreeBuilderTokenManager.java:2575)
Wrapped by: java.util.concurrent.ExecutionException: java.lang.StackOverflowError
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)

The problem is that the update strategy is to delete everything related to a set of entities and then, in a second update query, add back all the category data.
Another example of missing data is https://commons.wikimedia.org/wiki/Category:Norbert_of_Xanten, which has lost most of its subcategories.
Hopefully the error is just a limit hit at parse time; increasing the stack size or splitting the queries might solve the problem.

Problematic update query: https://people.wikimedia.org/~dcausse/T246568_broken_sparql_update.sparql.gz
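
To illustrate the failure mode: the daily diff arrives as a single request chaining many update statements with ';', and each chained statement nests one level deeper in the recursive-descent parse, which is what eventually overflows the stack. A minimal sketch of that shape, with placeholder category URIs (the real diff chains far more statements than this):

PREFIX mediawiki: <https://www.mediawiki.org/ontology#>

# statement 1: clear the old data for one category
DELETE WHERE { <https://commons.wikimedia.org/wiki/Category:Example_A> ?p ?o };
# statement 2: re-add it
INSERT DATA { <https://commons.wikimedia.org/wiki/Category:Example_A> a mediawiki:Category };
# statements 3, 4, ... repeat for every touched category; chain enough of
# them in one request and the parser recursion blows the stack
DELETE WHERE { <https://commons.wikimedia.org/wiki/Category:Example_B> ?p ?o };
INSERT DATA { <https://commons.wikimedia.org/wiki/Category:Example_B> a mediawiki:Category };

The two patches below attack this from both ends: one changes the grammar so chained statements are consumed iteratively instead of recursively, the other splits the daily script's output into smaller requests.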

Change 582829 had a related patch set uploaded (by DCausse; owner: DCausse):
[wikidata/query/blazegraph@master] SPARQL-update grammar: avoid recursion over update stmt

https://gerrit.wikimedia.org/r/582829

Change 582833 had a related patch set uploaded (by DCausse; owner: DCausse):
[wikidata/query/deploy@master] Split category daily sparql script

https://gerrit.wikimedia.org/r/582833

Change 582829 merged by jenkins-bot:
[wikidata/query/blazegraph@master] SPARQL-update grammar: avoid recursion over update stmt

https://gerrit.wikimedia.org/r/582829

What else do we need to do to close this up? It looks like the patch to wikidata/query/deploy needs to be deployed; do we also need a full import to get the state back in sync with mediawiki?

dcausse added a comment. Edited · Apr 14 2020, 10:17 AM

We deployed a fix to blazegraph, so the wikidata/query/deploy patch should no longer be needed. The last failure occurred around 2020-03-21, and the fix was only deployed last week, on 2020-04-09, so it's probably too early to claim victory.
The next step is to do a full reload of the category graph and monitor carefully that the daily dumps are properly applied.

> What else do we need to do to close this up? It looks like the patch to wikidata/query/deploy needs to be deployed; do we also need a full import to get the state back in sync with mediawiki?

It can be closed when it works again. https://commons.wikimedia.org/w/index.php?search=deepcat%3A%22Mike+Pence+in+2020%22&title=Special%3ASearch&go=Go&ns6=1 currently returns only one result (the only file that is directly in https://commons.wikimedia.org/wiki/Category:Mike_Pence_in_2020), so it's not working yet.

Change 582833 abandoned by DCausse:
Split category daily sparql script

https://gerrit.wikimedia.org/r/582833

Gehel closed this task as Resolved. · Jul 13 2020, 12:53 PM

The issue doesn't seem to be solved. For example, the query

select ?cat where {
  ?cat mediawiki:isInCategory <https://commons.wikimedia.org/wiki/Category:Mike_Pence_in_2020>.
}

currently only returns <https://commons.wikimedia.org/wiki/Category:Mike_Pence_in_February_2020>, but not <https://commons.wikimedia.org/wiki/Category:Mike_Pence_in_January_2020>.
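
For a quicker visual check, the same lookup can also pull labels. This variant is mine, not from the task; it assumes the standard mediawiki ontology prefix and the rdfs:label triples the dump script writes (see the INSERT DATA block below):

PREFIX mediawiki: <https://www.mediawiki.org/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?cat ?label WHERE {
  ?cat mediawiki:isInCategory <https://commons.wikimedia.org/wiki/Category:Mike_Pence_in_2020> .
  # OPTIONAL in case a label triple was lost along with the hierarchy
  OPTIONAL { ?cat rdfs:label ?label }
}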

CamelCaseNick reopened this task as Open. · Jul 26 2020, 4:12 PM
CBogen set the point value for this task to 8. · Jul 27 2020, 5:37 PM

Looking at the output of mwscript categoryChangesAsRdf.php --wiki commonswiki -s 2020-07-08T23:16:00 -e 2020-07-08T23:17:00, the window in which Category:Mike Pence in July 2020 was created:
The output shows that the category metadata is properly created and present in the graph.

# Additions
INSERT DATA {

<https://commons.wikimedia.org/wiki/Category:Mike_Pence_in_July_2020> a mediawiki:Category ;
        rdfs:label "Mike Pence in July 2020" ;
        mediawiki:pages "0"^^xsd:integer ;
        mediawiki:subcategories "0"^^xsd:integer .

<https://commons.wikimedia.org/wiki/Category:Mike_Pence_in_July_2020> mediawiki:isInCategory <https://commons.wikimedia.org/wiki/Category:Mike_Pence_in_2020> .

};

But later on, while handling edits, it removes everything about Category:Mike_Pence_in_2020:

# Changes
DELETE {
?category ?x ?y
} WHERE {
   ?category ?x ?y
   VALUES ?category {
 <https://commons.wikimedia.org/wiki/Category:Mike_Pence_in_2020>
   }
};

I think the maintenance script is removing the data it wants to add.
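
The patch referenced next ("Write category parent hierarchy when handling categorization") suggests the fix: pair every such DELETE with a re-INSERT that includes the mediawiki:isInCategory links. A sketch of the intended shape, assembled from the Additions block above rather than taken from the actual patch:

# Changes (sketch of the behaviour after the fix, not the patch's literal output;
# prefixes as declared elsewhere in the script output)
DELETE {
?category ?x ?y
} WHERE {
   ?category ?x ?y
   VALUES ?category {
 <https://commons.wikimedia.org/wiki/Category:Mike_Pence_in_July_2020>
   }
};
INSERT DATA {
<https://commons.wikimedia.org/wiki/Category:Mike_Pence_in_July_2020> a mediawiki:Category ;
        rdfs:label "Mike Pence in July 2020" ;
        mediawiki:pages "0"^^xsd:integer ;
        mediawiki:subcategories "0"^^xsd:integer ;
        # the parent link is written back instead of being silently dropped
        mediawiki:isInCategory <https://commons.wikimedia.org/wiki/Category:Mike_Pence_in_2020> .
};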

Change 617481 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/core@master] Write category parent hierarchy when handling categorization

https://gerrit.wikimedia.org/r/617481

Once the fix to the dump script is deployed, we will have to reload the categories again.

Change 617481 merged by jenkins-bot:
[mediawiki/core@master] Write category parent hierarchy when handling categorization

https://gerrit.wikimedia.org/r/617481

A kind of reverse bug now seems to be occurring. Deepcat searches on enwiki in Category:Articles needing expert attention give: "A warning has occurred while searching: Deep category query returned too many categories". No results are actually displayed (not even a limited set of results).

Is this the expected behaviour?

dcausse added a comment. Edited · Aug 31 2020, 8:56 AM

> A kind of reverse bug now seems to be occurring. Deepcat searches on enwiki in Category:Articles needing expert attention give: "A warning has occurred while searching: Deep category query returned too many categories". No results are actually displayed (not even a limited set of results).
>
> Is this the expected behaviour?

Yes, it is expected when the subcategory tree is too large (more than 256 categories currently). The category Articles needing expert attention seems relatively large and is probably hitting this limit. As to why it does not even return a partial result: the 256 categories selected by deepcat might not be the ones containing articles.

> As to why it does not even return a partial result: the 256 categories selected by deepcat might not be the ones containing articles.

The code actually explicitly returns nothing if the category tree is too large. The comment says:

> According to T181549 this means we fail the filter application

CBogen added a subscriber: CBogen. · Sep 21 2020, 5:34 PM

This is currently waiting on T259588.