Page MenuHomePhabricator

Unrecognized subject messages on Updater
Closed, ResolvedPublic

Description

I discovered that we have a bunch of these messages in the log:

May 09 16:10:05 wdqs1003 bash[23933]: 16:10:05.348 [update 6] INFO  o.wikidata.query.rdf.tool.rdf.Munger - Unrecognized subjects: [http://www.wikidata.org/value/965deb8f70144795ef92696ba583be02] while processing http://www.wikidata.org/entity/Q52910742.  Expected only sitelinks and subjects starting with http://www.wikidata.org/wiki/Special:EntityData/ and http://www.wikidata.org/entity/

Something is clearly wrong with updating process.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

On further research, the immediate cause of the message is a small logical error in our code when calculating updates.

When we look which values and references we have in our database, we check it by mentions in other predicated, however we do not check that the actual reference/node exists.

If there is a situation where statement S refers to reference R, which refers to value V in TTL, but R does not actually exist in the database, then we will think R is already in the database since the mention is there, but when parsing V we will consider it unrecognized since R is not processed, presumably because it already exists.

The above situation should never happen in normal data, so I am trying to figure out now how it arose.

Thinking more about it, I see a flaw in the cleanup process. Right now we are using this query:

DELETE { ?s ?p ?o } WHERE {
  VALUES ?s { %values% }
  # Since values are shared we can only clear the values on them when they are no longer used
  # anywhere else.
  FILTER NOT EXISTS {
    ?someEntity ?someStatementPred ?s .
    FILTER(?someStatementPred != %wikibase:quantityNormalized%)
  }
  ?s ?p ?o .
};

However, if we have changed statement referring to reference R, which refers to value V, it is not clear that the query will delete both R and V, since there's still reference to V from R, even though R will be deleted at the end.
We may need to split it into two queries.

I don't think this flaw is what caused the above issue (since it is still not clear how we have S referring to R without having R - if S refers to R, we should have not deleted R) but it still needs to be fixed.

Noticed another weird thing:

wdref:3e56175e78d06d8aa0e98bcac8dcfa800d1d82d2 on some servers has only one statement (wdq3) while on others has proper set of statements, e.g. see: https://www.wikidata.org/wiki/Special:EntityData/Q15813111.ttl?nocache=1525896723343&flavor=dump

Not sure what's going on there... The reference seems to be created in one edit by quickstatements, not sure how it could happen that it got only half the data.

Change 432332 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] [WiP] Updater fixes

https://gerrit.wikimedia.org/r/432332

Change 432332 merged by jenkins-bot:
[wikidata/query/rdf@master] Updater fixes

https://gerrit.wikimedia.org/r/432332

Fixed the immediate problem, still no clue why the references became broken. It should never have partial data, but it is clear that this is what is happening. I will try to reset the broken items and see if I can see any pattern in where the breakage happens, but it seems to be completely random - for different items, different servers have broken data, with no prediction about which one would have it. Though wdq3 seems to have broken data most often, but for some data wdq3 is not broken. Will watch it and see if I can have any ideas what is going on there.

Smalyshev lowered the priority of this task from High to Low.May 30 2018, 8:46 PM

Looking at https://github.com/SuLab/WikidataIntegrator/issues/65 looks like "partial data" situation happens for statements too, not only for references. Still not sure why it could happen...

Smalyshev raised the priority of this task from Low to High.Jun 4 2018, 8:39 PM

I see more of the case of partial references in the data, no idea what is happening but it seems to be much more frequent than I have thought. There is definitely some bug either when generating RDF or when updating it - different servers have different data under the same reference ID. This is bad and should never happen, but seems to be happening a lot.

Mentioned in SAL (#wikimedia-operations) [2018-06-04T22:57:28Z] <smalyshev@deploy1001> Started deploy [wdqs/wdqs@825863f]: Potential mitigation for T194325

Mentioned in SAL (#wikimedia-operations) [2018-06-04T23:11:58Z] <smalyshev@deploy1001> Started deploy [wdqs/wdqs@825863f]: Potential mitigation for T194325

Mentioned in SAL (#wikimedia-operations) [2018-06-04T23:21:37Z] <smalyshev@deploy1001> Finished deploy [wdqs/wdqs@825863f]: Potential mitigation for T194325 (duration: 09m 39s)

Looks like it's not happening anymore.