WDQS missing some particular data (property P6885)
Closed, Resolved · Public · Bug Report

Description

For this query:

SELECT * WHERE {
  VALUES ?item { wd:Q48194 wd:Q470380 wd:Q470445 } .
  ?item wdt:P6885 wd:Q43266 .
}

it should yield three results but there is only one. For instance, Q470380 does have this property and it is in RDF export, too.

Previously, both items had this property in WDQS, as demonstrated by this ListeriaBot list (rows "Přerov" and "Prostějov").

Their disappearance coincides with the data loss described in T228569.

Event Timeline

Possibly related? We have found a similar problem with missing P6736 data: Q38192330 has P6736 (since 2019-07-03), but it cannot be found using WQS.

Hmm, looking at the date, it's close to when the dump was generated, so I wonder if there's some race condition between dumping, updating and loading, where some updates may fall through.
T228569 is likely unrelated (except both being consequences of data reload) since it was caused by missing Lexeme dumps.

Smalyshev triaged this task as Medium priority. Jul 22 2019, 8:00 PM

Could we maybe raise the priority of this? As I understand, WQS is one of the most popular ways to consume Wikidata, and it's returning incorrect data.

@AMDmi3 I see the query returning three results, as requested. Can you please provide a query that is still wrong?

@Smalyshev Here is one that returns incorrect results: https://w.wiki/6Zh It returns Q31773 but the correct result is Q3093304.

It seems that the set of missing entries varies from time to time. This is an example that currently reproduces the problem:

SELECT * WHERE {
  wd:Q3237690 wdt:P6931 ?repology_project .
}

PS. A bit clumsy, but here's how you can get working examples at any time (and also a demonstration of how this directly affects consumers).

  • Go to https://repology.org/repositories/updates
  • Scroll to bottom, click middle column (Last parse) for Wikidata
  • Check entries with WARNING: entry has packages, but not Repology project name. Some of these are valid and caused by P6931 missing, others are false positives caused by the bug.

ATM, some of the entries are: Q1765672, Q3337877, Q244140, Q11354, Q8189917, Q7886333, Q3093304, Q2243417, Q214743, Q3237690, Q617014

Looks like all affected items were updated on July 3, the same time the dump was generated, so I think we have some kind of timing issue between dumping, updating and loading that affects some items that were updated while the dump was being generated. I'll look into it.

In any case, I have updated the affected entities.

@Smalyshev It is not only July 3; here is one from July 2 that is still incorrect: https://w.wiki/6bb It produces two values but there should only be one.

There are still problems with entries I haven't mentioned. I don't see the point in fixing individual entries, as the problem is global and affects an unknown (probably much larger) number of entries. Probably all entries edited in a specific time period should be "updated".

Yes, I can confirm this: I accidentally discovered dozens of items edited on July 3rd, and all of them are missing in WDQS.

Looks like the earliest dump started at 2019-07-01T23:00:02Z and the latest at 2019-07-03T16:20:54Z. Between those dates, there might be updates missed due to T229617. I'll try to update them, but the problem is that the RC stream seems to be preserved for only 30 days, so I only have data since 2019-07-02T22:04:34. I'll try to see if I can find which items were updated between 2019-07-01T23:00:02Z and 2019-07-02T22:04:34, but that data may not be available anymore.
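
To put numbers on those windows, here is a minimal sketch of the arithmetic in Python. It assumes the timestamp quoted without a Z suffix is also UTC, and `parse_utc` is a helper made up for this illustration, not part of any WDQS tooling.

```python
from datetime import datetime, timezone


def parse_utc(ts: str) -> datetime:
    """Parse 'YYYY-MM-DDTHH:MM:SS[Z]', treating a missing zone as UTC (assumption)."""
    naive = datetime.strptime(ts.rstrip("Z"), "%Y-%m-%dT%H:%M:%S")
    return naive.replace(tzinfo=timezone.utc)


# Timestamps quoted in the comment above.
dump_start = parse_utc("2019-07-01T23:00:02Z")    # earliest dump started
dump_end = parse_utc("2019-07-03T16:20:54Z")      # latest dump started
stream_start = parse_utc("2019-07-02T22:04:34")   # oldest data still in the RC stream

# Part of the dump window that the stream no longer covers.
unrecoverable = stream_start - dump_start
# Part of the dump window that can still be replayed from the stream.
recoverable = dump_end - stream_start

print("not covered by stream:", unrecoverable)
print("covered by stream:    ", recoverable)
```

Under these assumptions, roughly the first 23 hours of the dump window fall outside what the stream still holds.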

I've updated affected items from 2019-07-02T12:59:28 to 2019-07-03. The items between 2019-07-01T23:00:02Z and 2019-07-02T12:59:28 still may be missing updates, but since all streams from that time seem to be already purged, I can't update them, so please tell me if any items are still missing, and I'll update them. Or just edit them. Also please tell me if any items outside that timeframe are wrong - that would be some other issue.

@Smalyshev Thank you! Do you estimate that all items edited in that remaining time window are affected?

If the item has been edited since that time, it is probably not affected. If not, then it depends - whether the modification has been made before dumping code got to it or after. There's no real way for me to know it for each item, at least I don't think I know any way.
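
The rule of thumb above could be written down roughly like this. It is only a sketch: `classify` and its labels are hypothetical, the window is the one still lacking replayed updates (2019-07-01T23:00:02Z to 2019-07-02T12:59:28, the latter assumed to be UTC), and where the last-edit timestamp comes from is left open.

```python
from datetime import datetime, timezone

# Window whose updates could not be replayed, as quoted in this thread.
WINDOW_START = datetime(2019, 7, 1, 23, 0, 2, tzinfo=timezone.utc)
WINDOW_END = datetime(2019, 7, 2, 12, 59, 28, tzinfo=timezone.utc)


def classify(last_edit: datetime) -> str:
    """Hypothetical helper: guess an item's state from its last edit time."""
    if last_edit > WINDOW_END:
        # Edited after the window, so a later update should have refreshed it.
        return "probably fine"
    if last_edit >= WINDOW_START:
        # Depends on whether the dump had already reached the item when it was edited.
        return "unknown"
    # Last edited before the dump started; wrong data here would be another issue.
    return "outside window"


print(classify(datetime(2019, 7, 2, 1, 0, 0, tzinfo=timezone.utc)))
```

Only the "unknown" group would need a forced update or a manual edit.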

I've run into an item that is still not updated: at least the last 2 revisions are not visible. I've updated it manually.

https://www.wikidata.org/w/index.php?title=Q773196
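
A check like this one could in principle be automated by comparing revision IDs. The sketch below assumes you already have two mappings from item ID to revision ID, one from the wiki and one as seen by the query service; how they are fetched is deliberately left out, `stale_items` is a hypothetical name, and the revision numbers in the usage line are invented for illustration.

```python
def stale_items(wiki_revs: dict[str, int], wdqs_revs: dict[str, int]) -> list[str]:
    """Return IDs whose query-service revision lags behind the wiki's.

    Both arguments map item ID -> revision ID and are assumed inputs.
    """
    stale = []
    for item_id, wiki_rev in sorted(wiki_revs.items()):
        seen = wdqs_revs.get(item_id)  # None if the item is missing entirely
        if seen is None or seen < wiki_rev:
            stale.append(item_id)
    return stale


# Invented revision numbers, for illustration only.
print(stale_items({"Q773196": 30, "Q48194": 10}, {"Q773196": 28, "Q48194": 10}))
```

This relies on revision IDs only ever increasing, so a lower number on the query-service side means newer revisions have not been loaded yet.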

"I can't update them, so please tell me if any items are still missing"

Can't this be done reliably by iterating over all entries last updated in the given timespan?