
Some http (not https) conceptUris seen in the query service of migrated wikis
Closed, Resolved (Public)

Description

https://enlightenedmedialities.wikibase.cloud/query/#select%20%2a%20where%20%7B%20%3Fa%20%3Fb%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3Fc%20%7D

This query, for example, returns http conceptUris for subjects, while predicates still use https URLs. We suspect the cause is the munging script used when migrating wikis.

Errors seen in the output of the munging step:

14:36:27.281 [main] WARN  org.wikidata.query.rdf.tool.Munge - Error munging Q1265
org.wikidata.query.rdf.tool.rdf.Munger$BadSubjectException: Unrecognized subjects:  [https://enlightenedmedialities.wikibase.cloud/entity/Q1265, https://enlightenedmedialities.wikibase.cloud/entity/statement/Q1265-40525DA0-7BD6-4F6B-B17A-34911BF8DD20, https://enlightenedmedialities.wikibase.cloud/reference/247af1696b04906e7f0bec2161115fb93596673e, https://enlightenedmedialities.wikibase.cloud/entity/statement/Q1265-0859a6c8-4fd1-bdde-eb24-54fbf779dac0, https://enlightenedmedialities.wikibase.cloud/entity/statement/Q1265-a3b755e8-4315-78c5-8c26-93646a943956, https://enlightenedmedialities.wikibase.cloud/entity/statement/Q1265-7a026671-4f85-4449-2e52-4ee621143030, https://enlightenedmedialities.wikibase.cloud/entity/statement/Q1265-95258d6b-491f-47f3-b4a8-ead4e2fd0570, https://enlightenedmedialities.wikibase.cloud/entity/statement/Q1265-aeec280e-4466-16f5-e2d1-952f820ddf56, https://enlightenedmedialities.wikibase.cloud/entity/statement/Q1265-335befa1-488b-f025-c71f-fba2596035db, https://enlightenedmedialities.wikibase.cloud/entity/statement/Q1265-26EA9DDD-2321-4D4A-9F59-9CC13AC3046D, https://enlightenedmedialities.wikibase.cloud/value/86b6caa718fb092cd744427aafe65837, https://enlightenedmedialities.wikibase.cloud/entity/statement/Q1265-3420bb7e-479c-f80c-a2b9-2ded723588ff, https://enlightenedmedialities.wikibase.cloud/entity/statement/Q1265-7eda0a63-4ee6-ffa7-d083-23a80ce95d44, https://enlightenedmedialities.wikibase.cloud/entity/statement/Q1265-7e5c19d9-4314-1dc7-6114-389d7c735a4f, https://enlightenedmedialities.wikibase.cloud/entity/statement/Q1265-674c1d38-4430-5b1d-6ebd-2fb9d592feb2, https://enlightenedmedialities.wikibase.cloud/entity/statement/Q1265-6a6b1bdb-4089-0f1c-9805-61d1fe4a714f, https://enlightenedmedialities.wikibase.cloud/entity/statement/Q1265-b590e359-40ae-f415-3ee8-d5cc99aecc80, https://enlightenedmedialities.wikibase.cloud/entity/statement/Q1265-4CA1A360-DD6C-4CC2-B082-12C796AB1595, 
https://enlightenedmedialities.wikibase.cloud/entity/statement/Q1265-367be39f-4a77-6a47-5916-0c7e33f836bc].  Expected only sitelinks and subjects starting with http://enlightenedmedialities.wikibase.cloud/wiki/Special:EntityData/ and [http://enlightenedmedialities.wikibase.cloud/entity/]
    at org.wikidata.query.rdf.tool.rdf.Munger$MungeOperation.finishCommon(Munger.java:941)
    at org.wikidata.query.rdf.tool.rdf.Munger$MungeOperation.munge(Munger.java:493)
    at org.wikidata.query.rdf.tool.rdf.Munger.munge(Munger.java:148)
    at org.wikidata.query.rdf.tool.rdf.Munger.munge(Munger.java:192)
    at org.wikidata.query.rdf.tool.Munge$EntityMungingRdfHandler.munge(Munge.java:346)
    at org.wikidata.query.rdf.tool.Munge$EntityMungingRdfHandler.handleStatement(Munge.java:298)
    at org.wikidata.query.rdf.tool.rdf.DelegatingRdfHandler.handleStatement(DelegatingRdfHandler.java:43)
    at org.wikidata.query.rdf.tool.rdf.NormalizingRdfHandler.handleStatement(NormalizingRdfHandler.java:62)
    at org.openrdf.rio.turtle.TurtleParser.reportStatement(TurtleParser.java:1194)
    at org.openrdf.rio.turtle.TurtleParser.parseObject(TurtleParser.java:530)
    at org.openrdf.rio.turtle.TurtleParser.parseObjectList(TurtleParser.java:453)
    at org.openrdf.rio.turtle.TurtleParser.parsePredicateObjectList(TurtleParser.java:424)
    at org.openrdf.rio.turtle.TurtleParser.parseTriples(TurtleParser.java:409)
    at org.openrdf.rio.turtle.TurtleParser.parseStatement(TurtleParser.java:259)
    at org.wikidata.query.rdf.tool.Munge$ForbiddenOk$HackedTurtleParser.parseStatement(Munge.java:665)
    at org.openrdf.rio.turtle.TurtleParser.parse(TurtleParser.java:214)
    at org.wikidata.query.rdf.tool.Munge.run(Munge.java:197)
    at org.wikidata.query.rdf.tool.Munge.main(Munge.java:113)
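
The exception message suggests a prefix check: each subject is expected to start with the entity or entity-data prefix of the configured concept URI. Here that prefix used http while the dump used https. A minimal Python sketch of the comparison follows. It is not the actual Munger code; `unrecognized_subjects` is a hypothetical helper that ignores sitelinks and every other special case the real tool handles.

```python
def unrecognized_subjects(subjects, concept_uri):
    """Return the subjects that match neither expected prefix.

    Simplified illustration only; the real Munger has more rules
    (for example, it also accepts sitelinks).
    """
    base = concept_uri.rstrip("/")
    expected = (base + "/wiki/Special:EntityData/", base + "/entity/")
    return [s for s in subjects if not s.startswith(expected)]


# Two subjects taken from the warning above.
subjects = [
    "https://enlightenedmedialities.wikibase.cloud/entity/Q1265",
    "https://enlightenedmedialities.wikibase.cloud/entity/statement/"
    "Q1265-40525DA0-7BD6-4F6B-B17A-34911BF8DD20",
]

# With an http concept URI, the https subjects do not match the prefix.
rejected_http = unrecognized_subjects(
    subjects, "http://enlightenedmedialities.wikibase.cloud"
)

# With an https concept URI, the same subjects are accepted.
rejected_https = unrecognized_subjects(
    subjects, "https://enlightenedmedialities.wikibase.cloud"
)
```

In this sketch the only difference between the two calls is the scheme of the concept URI, which matches the mismatch visible in the "Expected only ..." part of the message.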

Possible Fix: https://github.com/wbstack/migrate/pull/12

Event Timeline

Tried the fix on staging; there were no warnings or errors during munging:

Loading ttl into query service
+ kubectl --context=gke_wikibase-cloud_europe-west3-a_wbaas-2 exec -it queryservice-59b74c98dc-9sg66 -- bash -c 'java -cp lib/wikidata-query-tools-*-jar-with-dependencies.jar org.wikidata.query.rdf.tool.Munge --from /tmp/output-enlightenedmedialities.wikibase.dev.ttl --to /tmp/mungeOut-enlightenedmedialities.wikibase.dev/wikidump-%09d.ttl.gz --chunkSize 10000 -w enlightenedmedialities.wikibase.dev --conceptUri https://enlightenedmedialities.wikibase.dev'
#logback.classic pattern: %d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n
09:42:20.398 [main] INFO  org.wikidata.query.rdf.tool.Munge - Switching to /tmp/mungeOut-enlightenedmedialities.wikibase.dev/wikidump-000000001.ttl.gz
+ kubectl --context=gke_wikibase-cloud_europe-west3-a_wbaas-2 exec -it queryservice-59b74c98dc-9sg66 -- bash -c './loadData.sh -n qsns_ac01788ad2 -d /tmp/mungeOut-enlightenedmedialities.wikibase.dev/'
Processing wikidump-000000001.ttl.gz
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><meta http-equiv="Content-Type" content="text&#47;html;charset=UTF-8"><title>blazegraph&trade; by SYSTAP</title
></head
><body<p>totalElapsed=1805ms, elapsed=1801ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms</p
><hr><p>COMMIT: totalElapsed=6243ms, commitTime=1650620552953, mutationCount=121177</p
></html
>File wikidump-000000002.ttl.gz not found, terminating

The query service results also show the correct (https) scheme: https://enlightenedmedialities.wikibase.dev/query/#SELECT%20%2a%20WHERE%20%7B%3Fa%20%3Fb%20%3Fc%7D
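
One way to look for leftover http subjects after a reload is a query that filters on the subject's scheme. The Python sketch below builds such a query service link, assuming the same `/query/#<encoded SPARQL>` URL shape as the links in this task. The helper name and the `LIMIT` value are assumptions made for illustration.

```python
from urllib.parse import quote


def http_subject_check_url(host, limit=10):
    """Build a query UI link listing subjects that still use http."""
    sparql = (
        "SELECT ?s WHERE { ?s ?p ?o . "
        f'FILTER(STRSTARTS(STR(?s), "http://{host}/")) }} '
        f"LIMIT {limit}"
    )
    # The query UI reads the SPARQL text from the URL fragment.
    return f"https://{host}/query/#" + quote(sparql, safe="")


url = http_subject_check_url("enlightenedmedialities.wikibase.dev")
```

An empty result for this query would be consistent with all subjects having been loaded under the https concept URI.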

Tarrow claimed this task.