wdqs-updater configuration and data validation issues
Open, Needs Triage, Public

Description

Once a public Docker setup of Wikibase uses SiteLinks to connect local items with local wiki pages, the updater process that feeds changes to the query service stops working: it reports that it cannot update any item that contains sitelinks. Example error message:

artbase-query-updater | org.wikidata.query.rdf.tool.rdf.Munger$BadSubjectException: Unrecognized subjects:  [https://artbase.rhizome.org/entity/Q551, https://artbase.rhizome.org/entity/statement/Q551-FAEED5FD-26DC-4722-8C8A-5783FF55A0BD, https://artbase.rhizome.org/entity/statement/Q551-A66E3A8D-A2B1-49E4-80A0-E69590DC93B4, https://artbase.rhizome.org/entity/statement/Q551-92C4242A-16E1-4EB9-A8A7-7C766A763CF6, https://artbase.rhizome.org/entity/statement/Q551-C8CF6FB7-1DA3-4583-A01F-65EFA3E43C6C, https://artbase.rhizome.org/entity/statement/Q551-854800D6-FFB6-4A7B-BD1F-233226A1AF21, https://artbase.rhizome.org/entity/statement/Q551-48719F5B-28F4-44E6-9708-25C60F1BDBE6, https://artbase.rhizome.org/entity/statement/Q551-19EA4083-8A96-4ACA-9938-68FABF2D5741, https://artbase.rhizome.org/entity/statement/Q551-3EA1AA0C-FBC3-43DF-A574-A2F04F3D34D0].  Expected only sitelinks and subjects starting with http://wikibase.svc/wiki/Special:EntityData/ and [http://wikibase.svc/entity/]

This has side effects on the whole setup; most importantly, the SPARQL endpoint no longer reflects the contents of the local Wikibase.

In the default docker-compose.yaml, WIKIBASE_HOST is set to wikibase.svc. AFAIK it is not documented where exactly this variable is used. In any case, wikibase.svc can never be a meaningful SiteLink host, since this DNS name is only accessible from within the Docker network.

Relevant excerpt from default docker-compose.yaml:

wdqs-updater:
  image: wikibase/wdqs:0.3.10
  restart: unless-stopped
  command: /runUpdate.sh
  depends_on:
  - wdqs
  - wikibase
  networks:
    default:
      aliases:
       - wdqs-updater.svc
  environment:
   - WIKIBASE_HOST=wikibase.svc
   - WDQS_HOST=wdqs.svc
   - WDQS_PORT=9999

This appears connected to T264009, in which the main issue is the use of a URL variable for very different purposes, under the assumption that all URLs must be globally accessible.

Event Timeline

Restricted Application added a subscriber: Aklapper.

When changing WIKIBASE_HOST to artbase.rhizome.org, SiteLinks are still rejected, probably because the protocol is assumed to be http instead of https:

artbase-query-updater | 14:18:50.580 [update 2] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized subjects: [https://artbase.rhizome.org/entity/statement/Q1996-E4FAB500-453A-4896-B3BC-ED840ADB5DB0, https://artbase.rhizome.org/entity/statement/Q1996-cf52a70e-496a-b8f6-8ee8-979b9230ff42, https://artbase.rhizome.org/entity/statement/Q1996-3883cd45-4abd-52cc-fd01-40d9b1676948, https://artbase.rhizome.org/entity/statement/Q1996-40b62fbb-491d-9387-a353-9a960aa6f1ea, https://artbase.rhizome.org/entity/Q1996, https://artbase.rhizome.org/entity/statement/Q1996-76ee588f-404f-82ae-1514-bc11e61327bf, https://artbase.rhizome.org/entity/statement/Q1996-b270ab2e-424a-8e2e-1f69-d0946c26b429, https://artbase.rhizome.org/entity/statement/Q1996-07e62af6-4c07-7905-4d2a-ffffdf74c0da, https://artbase.rhizome.org/value/d49b98a030ea836c922381a07d3a55a5] while processing http://artbase.rhizome.org/entity/Q1996.  Expected only sitelinks and subjects starting with http://artbase.rhizome.org/wiki/Special:EntityData/ and [http://artbase.rhizome.org/entity/]

In general it seems weird that wdqs-updater does its own validation of data it is supposed to shove into wdqs: if there is an issue with any kind of data, it should be flagged on creation in Wikibase.

I was able to fix the local issue by adding WIKIBASE_SCHEME=https to the wdqs-updater environment.
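
For reference, the working configuration can be sketched as the following variation of the default excerpt. Only WIKIBASE_HOST and WIKIBASE_SCHEME differ from the default. artbase.rhizome.org is the public host of this particular installation, and the idea that WIKIBASE_SCHEME is combined with WIKIBASE_HOST to form the expected entity prefix is inferred from the log messages above, not taken from documentation.

```yaml
wdqs-updater:
  image: wikibase/wdqs:0.3.10
  restart: unless-stopped
  command: /runUpdate.sh
  depends_on:
    - wdqs
    - wikibase
  networks:
    default:
      aliases:
        - wdqs-updater.svc
  environment:
    # Public host name (installation-specific), so that it matches the
    # entity URIs Wikibase emits: https://artbase.rhizome.org/entity/...
    - WIKIBASE_HOST=artbase.rhizome.org
    # Without this, the updater appears to assume http and rejects
    # every subject that starts with https://.
    - WIKIBASE_SCHEME=https
    - WDQS_HOST=wdqs.svc
    - WDQS_PORT=9999
```

With this setup the updater presumably also reaches the Wikibase API through the public host name instead of the Docker-internal alias, since the same URL seems to serve both purposes.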

The issues remain:

  • Two variables (instead of one) define a URL that is used for both API access and data validation. This makes a setup non-portable and greatly hinders local testing.
  • Data should not be validated in the updater (I guess structural RDF validation happens as well, which cannot cause harm).

I guess this task has to be renamed 😉

despens renamed this task from "wdqs-updater stops working once SiteLinks are used" to "wdqs-updater configuration and data validation issues". Nov 21 2020, 3:13 PM