Strange result in Wikidata query (full URLs given instead of identifiers)
Closed, Resolved · Public

Event Timeline

Query directly for this data:

SELECT * WHERE {
  wd:Q1410578 wdt:P5739 ?pusc;
              wdt:P19 ?pob.
}

Very strange. Wikibase RDF output looks normal. According to schema:dateModified the data was imported on 12 June 2020. I get the same response from wdqs1004, wdqs1005 and wdqs1006 (with various whitespace in the query text to bypass the cache), so it’s not just one server either (I believe wdqs1009 is supposed to test the new updater, either now or soon in the future).

Gehel triaged this task as High priority. (Jun 17 2020, 12:37 PM)
Gehel moved this task from Incoming to Small Tasks on the Wikidata-Query-Service board.
dcausse lowered the priority of this task from High to Medium. (Edited, Jun 17 2020, 7:09 PM)

Still trying to figure out what has happened.

  • tested the munger locally and it does not produce this
  • checked the dumps that were used to load the data and they looked OK
  • manually re-synced the entity on wdqs1004 and the problem disappeared

This will need more investigation; lowering the priority as it does not seem to be a widespread issue.

See also the query https://w.wiki/UUJ, which shows all the parts of the items (there are other properties which look strange).

Aklapper renamed this task from Strange result in Wikidata query to Strange result in Wikidata query (full URLs given instead of identifiers). (Jun 18 2020, 11:59 AM)
dcausse subscribed.

This task has been put back in the backlog to reflect the fact that there's no active work on this.
I've recorded most of its triples to


for future investigation and will reload it to clean up the wdqs instances.

Please let us know when you encounter this problem again.

I realized that the file I uploaded does not contain the culprit data, even though I have not reloaded the item yet... Only wdqs1010 (a non-public test server) seems to have the culprit data, which I uploaded here:

Please let us know when you encounter this problem again.

I just came across some for the property P21 - https://w.wiki/Yhn

Change 618354 had a related patch set uploaded (by DCausse; owner: DCausse):
[wikidata/query/rdf@master] Add debug code to investigate T255657

https://gerrit.wikimedia.org/r/618354

Change 618354 merged by jenkins-bot:
[wikidata/query/rdf@master] Add debug code to investigate T255657

https://gerrit.wikimedia.org/r/618354

The revision reported in T266211 was created on 2020-06-12T06:36:58Z, which coincides with the date of the problems identified in T264042.
Looking at the logs, we seem to have had trouble with a MW machine at that time: T255282, which relates to the opcache issue and the RDF code in Wikibase.

All the revisions I manually checked were created on this same day, 2020-06-12, before mw1384 was depooled. I'm trying to extract a full list from one server, but I'm having a hard time getting Blazegraph not to fail:

select ?s ?c ?date {
  ?s wdt:P31 ?c .
  FILTER (STRSTARTS(STR(?c), "https://www.wikidata.org/wiki/Special:EntityData"))
  ?s schema:dateModified ?date .
} limit XX

where XX=20 is the largest limit for which Blazegraph responds without failing with com.bigdata.rwstore.sector.MemoryManagerOutOfMemory on wdqs1010.
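
The check performed by the FILTER can also be applied client-side to values that have already been fetched. The following is a minimal Java sketch, not code from the updater: the class and method names (EntityDataUriCheck, looksCorrupted, revisionOf) are hypothetical, and the revision pattern simply assumes the URL shape seen in this task.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntityDataUriCheck {
    // Prefix taken from the STRSTARTS filter in the query above.
    static final String PREFIX = "https://www.wikidata.org/wiki/Special:EntityData";

    // Assumption: the revision appears as a "revision=<digits>" query parameter.
    private static final Pattern REVISION = Pattern.compile("[?&]revision=(\\d+)");

    /** True if an object value looks like a Special:EntityData URL rather than an entity IRI. */
    static boolean looksCorrupted(String objectValue) {
        return objectValue != null && objectValue.startsWith(PREFIX);
    }

    /** Extracts the revision id from a corrupted value, if one is present. */
    static Optional<Long> revisionOf(String objectValue) {
        if (!looksCorrupted(objectValue)) {
            return Optional.empty();
        }
        Matcher m = REVISION.matcher(objectValue);
        return m.find() ? Optional.of(Long.parseLong(m.group(1))) : Optional.empty();
    }

    public static void main(String[] args) {
        // Example value copied from this task.
        String bad = PREFIX + "/Q15066632.ttl?flavor=dump&revision=1205546106";
        System.out.println(looksCorrupted(bad));
        System.out.println(revisionOf(bad));
        System.out.println(looksCorrupted("http://www.wikidata.org/entity/Q5"));
    }
}
```

Extracting the revision id is only a convenience for cross-checking revision creation dates; it does not replace the server-side query.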

Moving back to in progress, as the journals of most servers seem to be corrupted with such data. We need to either confirm or rule out T255282 (a single occurrence of this inconsistency for a revision not created on 2020-06-12 before 17:00 UTC would rule it out), but more importantly we need to clean up the data.
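
The confirm-or-rule-out test above reduces to a timestamp comparison. A minimal Java sketch follows, assuming the suspect window runs from the start of 2020-06-12 UTC up to 17:00 UTC; the lower bound is an assumption, since the task only names the day and the 17:00 cutoff. The class and method names are hypothetical.

```java
import java.time.Instant;

final class SuspectWindow {
    // Assumed lower bound: start of the day named in the task.
    static final Instant START = Instant.parse("2020-06-12T00:00:00Z");
    // Cutoff named in the task: 17:00 UTC on the same day.
    static final Instant END = Instant.parse("2020-06-12T17:00:00Z");

    /** True if the revision timestamp falls inside [START, END). */
    static boolean inWindow(Instant revisionTimestamp) {
        return !revisionTimestamp.isBefore(START) && revisionTimestamp.isBefore(END);
    }

    public static void main(String[] args) {
        // Timestamp of the revision reported in T266211.
        System.out.println(inWindow(Instant.parse("2020-06-12T06:36:58Z")));
        // A corrupted revision with a timestamp like this one would rule out T255282.
        System.out.println(inWindow(Instant.parse("2020-06-13T09:00:00Z")));
    }
}
```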

Regarding how this could happen from the wdqs-updater perspective:

When parsing the item RDF data, the updater uses a URI of the form https://www.wikidata.org/wiki/Special:EntityData/Q15066632.ttl?flavor=dump&revision=1205546106 as the base URI for the Sesame Rio parser.

// Collect the parsed statements in memory.
StatementCollector collector = new StatementCollector();
RDFParser parser = RDFParserSuppliers.defaultRdfParser().get(collector);
// Base URI as built by the updater for this entity revision.
String baseUri = "https://www.wikidata.org/wiki/Special:EntityData/Q15066632.ttl?flavor=dump&revision=1205546106";
// The empty relative IRI <> is resolved against baseUri.
parser.parse(new StringReader("<uri:subject> <uri:pred> <> ."), baseUri);
// Write the collected statements back out as Turtle.
RDFWriter writer = RDFWriterRegistry.getInstance().get(RDFFormat.TURTLE).getWriter(System.out);
writer.startRDF();
for (Statement st : collector.getStatements()) {
    writer.handleStatement(st);
}
writer.endRDF();

This code will interpret the Turtle:

<uri:subject> <uri:pred> <> .

as

<uri:subject> <uri:pred> <https://www.wikidata.org/wiki/Special:EntityData/Q15066632.ttl?flavor=dump&revision=1205546106> .
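
This behaviour matches relative reference resolution in RFC 3986 (section 5.2.2), where an empty reference resolves to the base URI minus any fragment. The stdlib-only Java sketch below covers just that one case and is not how the Sesame parser is implemented; resolveEmptyReference is a hypothetical name. java.net.URI is documented against the older RFC 2396, so it is deliberately not used here.

```java
final class EmptyReferenceDemo {
    /**
     * Resolves an EMPTY relative reference against an absolute base,
     * per RFC 3986 section 5.2.2: scheme, authority, path and query are
     * taken from the base; the fragment is taken from the (empty)
     * reference, so any fragment on the base is dropped.
     */
    static String resolveEmptyReference(String base) {
        int hash = base.indexOf('#');
        return hash < 0 ? base : base.substring(0, hash);
    }

    public static void main(String[] args) {
        String base = "https://www.wikidata.org/wiki/Special:EntityData/"
                + "Q15066632.ttl?flavor=dump&revision=1205546106";
        // Object of "<uri:subject> <uri:pred> <> ." once <> has been resolved.
        System.out.println("<uri:subject> <uri:pred> <" + resolveEmptyReference(base) + "> .");
    }
}
```

Because the base URI used by the updater carries no fragment, the resolved object is the base URI itself, which is why full Special:EntityData URLs appear where entity identifiers were expected.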

On the PHP side, the notice mentioned in T255282 indicates that trim() expects parameter 1 to be string, object given.
trim() returns NULL when given an object; that NULL is then passed to \Wikimedia\Purtle\RdfWriter::is( $base, $local ), which outputs <> when given NULL for both arguments:

$writer = new \Wikimedia\Purtle\TurtleRdfWriter();
$writer->start();
$writer->about( "uri", "subject" );
$writer->say( "uri", "predicate" )->is( null );
$writer->finish();
print( $writer->drain() );

will output:

uri:subject uri:predicate <> .

It is very probable that the notices seen in T255282 have caused such triples to be written in the ttl output of Special:EntityData.

Change 635855 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/Wikibase@master] [rdf] check type of DataValue::getValue

https://gerrit.wikimedia.org/r/635855

Change 635855 abandoned by DCausse:
[mediawiki/extensions/Wikibase@master] [rdf] check type of DataValue::getValue

Reason:
was just to illustrate the problem

https://gerrit.wikimedia.org/r/635855