
[SUPPORT] Munging errors "Unrecognized subjects"
Closed, Resolved · Public

Description

I am following this excellent tutorial by @Addshore to reset my WDQS after migrating data to a new Wikibase.

Relevant images running: wikibase/wdqs:0.3.10 and wikibase/wikibase:1.30-bundle

The munging throws an error:

#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
23:57:52.108 [main] INFO  org.wikidata.query.rdf.tool.Munge - Switching to /tmp/db-dumps/mungedOut/wikidump-000000001.ttl.gz
23:58:02.401 [main] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized subjects: [https://artbase.rhizome.org/entity/statement/Q4198-2BB92CD9-DB2D-4482-87F1-115C209FE3A9, https://artbase.rhizome.org/prop/statement/value/P109, https://artbase.rhizome.org/entity/statement/Q1623-28C1990A-7E66-442C-8617-37D890C63B30, https://artbase.rhizome.org/prop/statement/value/P107, https://artbase.rhizome.org/prop/statement/value/P108, https://artbase.rhizome.org/prop/statement/value/P101, https://artbase.rhizome.org/prop/statement/value/P102, https://artbase.rhizome.org/entity/statement/ [...]

...then follows a list of each and every triple in my Wikibase as "Unrecognized subject". Finally, the output concludes with...

	at org.wikidata.query.rdf.tool.rdf.Munger$MungeOperation.finishCommon(Munger.java:965)
	at org.wikidata.query.rdf.tool.rdf.Munger$MungeOperation.munge(Munger.java:493)
	at org.wikidata.query.rdf.tool.rdf.Munger.munge(Munger.java:148)
	at org.wikidata.query.rdf.tool.rdf.Munger.munge(Munger.java:192)
	at org.wikidata.query.rdf.tool.Munge$EntityMungingRdfHandler.munge(Munge.java:255)
	at org.wikidata.query.rdf.tool.Munge$EntityMungingRdfHandler.endRDF(Munge.java:243)
	at org.wikidata.query.rdf.tool.rdf.DelegatingRdfHandler.endRDF(DelegatingRdfHandler.java:28)
	at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:223)
	at org.wikidata.query.rdf.tool.Munge.run(Munge.java:115)
	at org.wikidata.query.rdf.tool.Munge.main(Munge.java:76)
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><meta http-equiv="Content-Type" content="text&#47;html;charset=UTF-8"><title>blazegraph&trade; by SYSTAP</title
></head
><body<p>totalElapsed=188ms, elapsed=65ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms</p
><hr><p>COMMIT: totalElapsed=242ms, commitTime=1581813802754, mutationCount=6</p
></html
>Processing wikidump-000000001.ttl.gz
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><meta http-equiv="Content-Type" content="text&#47;html;charset=UTF-8"><title>blazegraph&trade; by SYSTAP</title
></head
><body<p>totalElapsed=92ms, elapsed=92ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms</p
><hr><p>COMMIT: totalElapsed=131ms, commitTime=1581813803605, mutationCount=6</p
></html
>File wikidump-000000002.ttl.gz not found, terminating

The munger creates a wikidump-000000001.ttl.gz that is 535 bytes long.

I wondered whether my exported TTL file is the problem, but it doesn't look problematic to me: all namespaces are defined at the top of the file with the correct base URI, https://artbase.rhizome.org, and the docker-compose file sets the environment variables that make this known to WDQS:

environment:
  - WIKIBASE_SCHEME=https
  - WIKIBASE_HOST=artbase.rhizome.org

I tried to examine the munger's code to see when it throws the "Unrecognized subjects" error, but the Dockerfile just loads compiled JVM binaries and I don't know where to find the source.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Are the environment variables set for the updater (rather than just the query service itself)?

Addshore renamed this task from Munging errors "Unrecognized subjects" to [SUPPORT] Munging errors "Unrecognized subjects". Feb 17 2020, 9:34 AM
Addshore added a project: Wikidata-Campsite.
Addshore moved this task from Incoming to Needs Tech Work on the Wikidata-Campsite board.

The environments are the same for the wdqs and wdqs-updater containers. Here are the full entries:

environment:
  - WIKIBASE_SCHEME=https
  - WIKIBASE_HOST=artbase.rhizome.org
  - WDQS_HOST=wdqs.svc
  - WDQS_PORT=9999

What is triggering the "Unrecognized subjects" error in the munger?

When you are munging, are you passing the concept URI?

./munge.sh -f /tmp/rdfOutput -d /tmp/mungeOut -- --conceptUri http://someFancyLocation.foo

What command exactly are you running?
You should be using "https://artbase.rhizome.org" as the conceptUri there.
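If it is unclear which value to pass, one way to double-check is to read it off the dump itself. The following is only a rough sketch: `concept_uri_from_dump` is a hypothetical helper, not part of WDQS, and it assumes the dump declares its entity namespace with a line such as `@prefix wd: <https://artbase.rhizome.org/entity/> .` near the top of the file.

```shell
# Hypothetical helper (not part of WDQS): print the concept URI implied by
# the wd: prefix of a Turtle dump.
# Assumption: the dump contains a line like
#   @prefix wd: <https://artbase.rhizome.org/entity/> .
concept_uri_from_dump() {
  # $1 is a plain .ttl file; decompress .ttl.gz dumps first.
  # Take the first wd: prefix line and strip everything except the base URI.
  grep -m1 '^@prefix wd:' "$1" \
    | sed -E 's|^@prefix wd:[[:space:]]*<(.*)/entity/>.*$|\1|'
}
```

The printed value can then be compared with what is passed after `--conceptUri`, for example `concept_uri_from_dump /tmp/db-dumps/2020-02-18.ttl`.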

OK, I am now specifying conceptUri, and all data is moving into the query service fine!

docker-compose exec wdqs ./munge.sh -f /tmp/db-dumps/2020-02-18.ttl -d /tmp/db-dumps/mungedOut -- --conceptUri https://artbase.rhizome.org

I was under the impression that the conceptUri was optional when you use the default 🤦‍♂️

Sorry for the trouble.