
rdf munger and hence wdqs-updater requires siteLinks to be formed using a specific articlePath
Open, Needs Triage · Public

Description

When using a non-standard $wgArticlePath (that is, not /wiki/$1), the automatic process of feeding Wikibase data into the query service will stop working.

The error message when running munge:

0:16:33.843 [main] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized subjects: [

Thousands of statements follow, all of the form https://artbase.rhizome.org/entity/statement/Q4198-xxxxxx, and then the error message ends with:

] while processing https://artbase.rhizome.org/entity/null.  Expected only sitelinks and subjects starting with https://artbase.rhizome.org/wiki/Special:EntityData/ and [https://artbase.rhizome.org/entity/]

This is for a wiki that is set up with the $wgArticlePath /$1, with matching information in the sites table and an .htaccess correctly adjusted so that Apache handles the URL routes.

When changing the $wgArticlePath back to the default /wiki/$1 (including restoring the information in the sites table and reverting to the default .htaccess), the Munger process completes without any issues.

I am unclear why the Munger would verify siteLinks in the first place (or do any data validation), but if it needs to do that, it should check the linked wiki's articlePath, which can be retrieved via the MediaWiki API.
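To illustrate the lookup suggested above, here is a minimal sketch of deriving a wiki's page URL prefix from the MediaWiki API's siteinfo query (action=query, meta=siteinfo, siprop=general), whose "general" block reports "server" and "articlepath". The function names are hypothetical, and this is not how the Munger currently works. It only shows that the information is available.

```python
import json
import urllib.parse
import urllib.request


def siteinfo_url(api_url: str) -> str:
    """Build the siteinfo request URL for a wiki's api.php endpoint."""
    query = urllib.parse.urlencode({
        "action": "query",
        "meta": "siteinfo",
        "siprop": "general",
        "format": "json",
    })
    return f"{api_url}?{query}"


def sitelink_prefix(siteinfo: dict) -> str:
    """Derive the URL prefix of wiki pages from a parsed siteinfo response."""
    general = siteinfo["query"]["general"]
    server = general["server"]
    article_path = general["articlepath"]  # e.g. "/wiki/$1" or "/$1"
    # Some wikis report a protocol-relative server such as "//example.org";
    # assuming https here is a simplification for this sketch.
    if server.startswith("//"):
        server = "https:" + server
    # Everything before the "$1" placeholder is the page URL prefix.
    return server + article_path.split("$1", 1)[0]


def fetch_sitelink_prefix(api_url: str) -> str:
    """Query a live wiki (needs network access) and return its page prefix."""
    with urllib.request.urlopen(siteinfo_url(api_url)) as response:
        return sitelink_prefix(json.load(response))
```

With a response reporting the articlepath /$1, sitelink_prefix returns the bare server URL followed by a slash; with the default /wiki/$1 it returns the server URL followed by /wiki/.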

My suggestion would be to use the existing command line switch --skipSiteLinks to at least not check for the formatting of siteLinks when they're not going to be exported to the query service anyway. At the moment, this switch will prevent siteLinks from being exported, but check their form regardless.

Even better would be an additional switch that makes the Munger accept siteLinks of any form.

Event Timeline

Addshore subscribed.

A few questions.

https://artbase.rhizome.org/entity/null: did this actually say null?

Could you provide an example from the [] of a subject that is not prefixed with either https://artbase.rhizome.org/wiki/Special:EntityData/ or [https://artbase.rhizome.org/entity/]?
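One way to pull such an example out of a long log is to filter the listed subjects against the two prefixes named in the message. The sketch below assumes "starting with" means a plain string-prefix match, which is an assumption about the message, not a description of the Munger's internals. The helper name and the sample page URL are hypothetical.

```python
EXPECTED_PREFIXES = (
    "https://artbase.rhizome.org/wiki/Special:EntityData/",
    "https://artbase.rhizome.org/entity/",
)


def unexpected_subjects(subjects, prefixes=EXPECTED_PREFIXES):
    """Return the subjects that start with none of the expected prefixes."""
    # str.startswith accepts a tuple and matches if any prefix matches.
    return [s for s in subjects if not s.startswith(tuple(prefixes))]


if __name__ == "__main__":
    sample = [
        "https://artbase.rhizome.org/entity/statement/Q4198-xxxxxx",
        "https://artbase.rhizome.org/Some_Page",  # hypothetical page URL under /$1
    ]
    for subject in unexpected_subjects(sample):
        print(subject)
```

Under this reading, the statement URIs quoted in the description would pass the check, since they start with https://artbase.rhizome.org/entity/, so any subject the filter returns is a candidate for the actual offender.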

Sounds like this is mainly something for the query service team to look at.

@despens could you provide a reproducible test case (a small RDF file that triggers the problem would be great). I don't see how site links could be involved in the problem you raise and a test case will definitely help. Thanks!

  • The URL was indeed https://artbase.rhizome.org/entity/null, with a literal null.
  • I am unable to reproduce the log at the moment. This can only be done on the production machine (site links do not work on a localhost-based Docker deployment, see T268231 and T264009), and the project deadline is too close right now. 😅