Running the following query (https://w.wiki/UMS) yields strange results for https://www.wikidata.org/wiki/Q1410578: for https://www.wikidata.org/wiki/Property:P5739 the result is https://www.wikidata.org/wiki/Special:EntityData/36259 instead of simply 36259, and for https://www.wikidata.org/wiki/Property:P19 the result is https://www.wikidata.org/wiki/Special:EntityData/Q1410578.ttl?flavor=dump&revision=1205531147 instead of simply wd:Q1410578. The results for https://www.wikidata.org/wiki/Property:P569, https://www.wikidata.org/wiki/Property:P570 and https://www.wikidata.org/wiki/Property:P20 are normal.
Details
Project | Branch | Lines +/- | Subject
---|---|---|---
mediawiki/extensions/Wikibase | master | +5 -1 | [rdf] check type of DataValue::getValue
wikidata/query/rdf | master | +36 -1 | Add debug code to investigate T255657
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | RKemper | T267927 Reload wikidata journal from fresh dumps
Resolved | | dcausse | T255657 Strange result in Wikidata query (full URLs given instead of identifiers)
Event Timeline
```
SELECT * WHERE { wd:Q1410578 wdt:P5739 ?pusc; wdt:P19 ?pob. }
```
Very strange. Wikibase RDF output looks normal. According to schema:dateModified the data was imported on 12 June 2020. I get the same response from wdqs1004, wdqs1005 and wdqs1006 (with various whitespace in the query text to bypass the cache), so it’s not just one server either (I believe wdqs1009 is supposed to test the new updater, either now or soon in the future).
Still trying to figure out what has happened.
- tested the munger locally and it does not produce this
- checked the dumps that were used to load the data and they looked OK
- manually re-synced the entity on wdqs1004 and the problem disappeared
This will need more investigation; lowering the priority as it does not seem like a widespread issue.
See also the query https://w.wiki/UUJ, which shows all the parts of the items (there are other properties which look strange).
This task has been put back in the backlog to reflect the fact that there's no active work on this.
I've recorded most of its triples to
for future investigation and will reload it to clean up the wdqs instances.
Please let us know when you encounter this problem again.
Realized that the file I uploaded does not contain the culprit data, even though I have not reloaded the item yet... Only wdqs1010 (non-public test server) seems to have the culprit data, which I uploaded here:
Change 618354 had a related patch set uploaded (by DCausse; owner: DCausse):
[wikidata/query/rdf@master] Add debug code to investigate T255657
Change 618354 merged by jenkins-bot:
[wikidata/query/rdf@master] Add debug code to investigate T255657
Looks like we might have another instance of this issue (T266211), time to see if the debug code reported anything useful?
All the revisions I manually checked were created on this same day, 2020-06-12, before mw1384 was depooled. I'm trying to extract a full list from one server, but I'm having a hard time keeping Blazegraph from failing:
```
SELECT ?s ?c ?date {
  ?s wdt:P31 ?c .
  FILTER (STRSTARTS(STR(?c), "https://www.wikidata.org/wiki/Special:EntityData"))
  ?s schema:dateModified ?date .
} LIMIT XX
```
where XX=20 is the maximum Blazegraph can answer without failing with com.bigdata.rwstore.sector.MemoryManagerOutOfMemory on wdqs1010.
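One way to work around this memory limit might be to page through the results in batches no larger than the limit observed above. This is only a hypothetical sketch (the template, function name, and batch size are illustrative, not from the actual tooling), and it just builds the query strings:

```python
# Hypothetical sketch: page through results with LIMIT/OFFSET so each
# request stays under the batch size Blazegraph could answer (20 here).
# For stable paging an ORDER BY clause would also be needed in practice.
QUERY_TEMPLATE = """SELECT ?s ?c ?date {{
  ?s wdt:P31 ?c .
  FILTER (STRSTARTS(STR(?c), "https://www.wikidata.org/wiki/Special:EntityData"))
  ?s schema:dateModified ?date .
}} LIMIT {limit} OFFSET {offset}"""

def paged_queries(total, batch=20):
    """Yield one query string per page of `batch` results."""
    for offset in range(0, total, batch):
        yield QUERY_TEMPLATE.format(limit=batch, offset=offset)

queries = list(paged_queries(60))
# Three pages: OFFSET 0, 20, 40
```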
Moving back to in progress, as the journals of most servers seem corrupted with such data. We need to either confirm or rule out T255282 (a single occurrence of this incoherence for a revision not created on 2020-06-12 before 17:00 UTC would rule it out), but more importantly we need to clean up the data.
Regarding how this could happen from the wdqs-updater perspective:
When parsing the item RDF data, the updater uses a URI of the form https://www.wikidata.org/wiki/Special:EntityData/Q15066632.ttl?flavor=dump&revision=1205546106 as the baseURI for the Sesame RIO parser.
```java
StatementCollector collector = new StatementCollector();
RDFParser parser = RDFParserSuppliers.defaultRdfParser().get(collector);
String baseUri = "https://www.wikidata.org/wiki/Special:EntityData/Q15066632.ttl?flavor=dump&revision=1205546106";
parser.parse(new StringReader("<uri:subject> <uri:pred> <> ."), baseUri);
RDFWriter writer = RDFWriterRegistry.getInstance().get(RDFFormat.TURTLE).getWriter(System.out);
writer.startRDF();
for (Statement st : collector.getStatements()) {
    writer.handleStatement(st);
}
writer.endRDF();
```
Will interpret the turtle:

```
<uri:subject> <uri:pred> <> .
```

as

```
<uri:subject> <uri:pred> <https://www.wikidata.org/wiki/Special:EntityData/Q15066632.ttl?flavor=dump&revision=1205546106> .
```
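This is standard RFC 3986 behavior rather than a parser quirk: an empty URI reference (which is what `<>` denotes in Turtle) resolves to the base URI itself. A quick stdlib check, purely as an illustration and not part of the updater code:

```python
from urllib.parse import urljoin

# Per RFC 3986 §5.4, the empty reference "" resolves to the base URI
# itself, query string included — which is exactly the corrupted object
# URI seen in the query results.
base = "https://www.wikidata.org/wiki/Special:EntityData/Q15066632.ttl?flavor=dump&revision=1205546106"
print(urljoin(base, ""))  # the full base URI comes back unchanged
```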
On the PHP side, the notice mentioned in T255282 indicates that trim() expects parameter 1 to be a string, object given.
trim() returns NULL when given an object; that NULL is then passed to \Wikimedia\Purtle\RdfWriter::is( $base, $local ), which outputs <> when both arguments are NULL:
```php
$writer = new \Wikimedia\Purtle\TurtleRdfWriter();
$writer->start();
$writer->about( "uri", "subject" );
$writer->say( "uri", "predicate" )->is( null );
$writer->finish();
print( $writer->drain() );
```
will output:
```
uri:subject uri:predicate <> .
```
It is very probable that the notices seen in T255282 caused such triples to be written into the ttl output of Special:EntityData.
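Putting the two halves together, the failure chain can be mimicked with a small toy sketch (the function names and logic here are mine for illustration; the real code paths are the PHP Purtle writer and the Java RIO parser):

```python
from urllib.parse import urljoin

def serialize_object(value):
    """Toy stand-in for the Purtle writer: a missing (None) value is
    emitted as the empty URI reference <>, like RdfWriter::is() when
    trim() has turned an object argument into NULL."""
    return "<>" if value is None else f"<{value}>"

def resolve_object(token, base_uri):
    """Toy stand-in for the RIO parser: <...> references are resolved
    against the base URI of the document being parsed."""
    return urljoin(base_uri, token.strip("<>"))

base = "https://www.wikidata.org/wiki/Special:EntityData/Q15066632.ttl?flavor=dump&revision=1205546106"
token = serialize_object(None)          # the writer emits "<>"
corrupted = resolve_object(token, base) # the parser expands it to the base URI
# corrupted == base: the full Special:EntityData URL ends up as the object,
# which is the corruption observed in the query results.
```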
Change 635855 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/Wikibase@master] [rdf] check type of DataValue::getValue
Change 635855 abandoned by DCausse:
[mediawiki/extensions/Wikibase@master] [rdf] check type of DataValue::getValue
Reason:
was just to illustrate the problem