
Category graph includes deleted categories
Open, Normal, Public

Description

e.g. https://query.wikidata.org/bigdata/namespace/categories/sparql?query=SELECT%20*%20WHERE%20{%0A%3Chttps%3A%2F%2Fen.wikipedia.org%2Fwiki%2FCategory%3ABreakthrough_Prize_winners%3E%20%3Fa%20%3Fb.%0A}&format=json

The category Category:Breakthrough_Prize_winners was deleted at 09:37 on 19 June 2019.

The data even includes categories deleted in March (e.g. Category:Recipients of the Jeton de Vermeil).
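
The check above can be reproduced for other categories by building the same query URL programmatically. The following is a minimal sketch, not part of the original report: the endpoint and query shape are taken from the URL in the description, while the helper names `category_uri` and `build_check_url` are hypothetical, and it assumes the category URI is simply the wiki URL with spaces replaced by underscores.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL in the task description.
ENDPOINT = "https://query.wikidata.org/bigdata/namespace/categories/sparql"


def category_uri(wiki_base: str, title: str) -> str:
    """Build a category URI such as .../wiki/Category:Foo_bar.

    Assumption: spaces become underscores and no further escaping is needed.
    """
    return f"{wiki_base}/wiki/Category:{title.replace(' ', '_')}"


def build_check_url(uri: str) -> str:
    """Return a GET URL that selects every triple with `uri` as its subject."""
    query = f"SELECT * WHERE {{\n<{uri}> ?a ?b.\n}}"
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


if __name__ == "__main__":
    uri = category_uri("https://en.wikipedia.org", "Breakthrough Prize winners")
    print(build_check_url(uri))
```

If the category has really been removed from the graph, the JSON response should contain no result bindings for this query; any returned rows mean triples for the deleted category are still stored.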

Event Timeline

Restricted Application added a project: Wikidata. Jul 17 2019, 10:58 PM
Restricted Application added a subscriber: Aklapper.
Smalyshev triaged this task as Normal priority. Aug 8 2019, 5:40 AM
Smalyshev moved this task from Backlog to Next on the User-Smalyshev board.

Looks like there's a problem with deletion handling. E.g. https://en.wikipedia.org/wiki/Category:Delaware_elections,_2006 has been deleted and is listed as deleted in the enwiki-20190826-daily.sparql.gz dump, but it is still present in the database. Strangely enough, the log shows the file was processed successfully, yet the results are not there. Will investigate further.

Looks like the DELETE SPARQL clauses that the daily dump generates are wrong... Weird that I hadn't noticed it.
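
For illustration only, since the task shows neither the faulty clause nor the fixed one: a minimal sketch of what a SPARQL 1.1 update removing all triples for a set of deleted categories could look like. The function name `build_delete_update` is hypothetical, and the sketch assumes that every triple to be removed has the category URI as its subject; it is not the clause MediaWiki actually generates.

```python
from typing import Iterable


def build_delete_update(category_uris: Iterable[str]) -> str:
    """Return a SPARQL update deleting every triple whose subject is one of the URIs.

    Assumption: URIs are already valid IRIs (no '<', '>' or whitespace).
    Returns an empty string when there is nothing to delete.
    """
    uris = list(category_uris)
    if not uris:
        return ""
    values = " ".join(f"<{u}>" for u in uris)
    return (
        "DELETE { ?category ?p ?o }\n"
        "WHERE {\n"
        f"  VALUES ?category {{ {values} }}\n"
        "  ?category ?p ?o .\n"
        "}"
    )


if __name__ == "__main__":
    print(build_delete_update(
        ["https://en.wikipedia.org/wiki/Category:Delaware_elections,_2006"]
    ))
```

Binding the subjects through `VALUES` keeps the template and the pattern in sync, so a clause of this shape either removes all triples for a listed category or matches nothing.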

Change 532824 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/core@master] Fix categories detele SPARQL clause

https://gerrit.wikimedia.org/r/532824

After the patch is merged and deployed, the categories DB needs to be reloaded according to the procedure here: https://wikitech.wikimedia.org/wiki/Wikidata_query_service#Categories_reload_procedure

I recommend doing it on wdqs1009 or wdqs1010 and then copying categories.jnl to the other servers. Since categories are updated daily (see the blazegraph cron), the procedure should be started early enough to leave time to copy the DB to all servers before the next daily update. Since the DB is small, copying it to all servers within a single day should not be a problem.

Smalyshev moved this task from Next to In review on the User-Smalyshev board. Tue, Aug 27, 11:24 PM

Change 532824 merged by jenkins-bot:
[mediawiki/core@master] Fix categories detele SPARQL clause

https://gerrit.wikimedia.org/r/532824

Smalyshev removed Smalyshev as the assignee of this task. Thu, Aug 29, 7:44 AM
Smalyshev added a subscriber: Smalyshev.