
Query service returns weird Commons links when asking for P50 (author) statement value
Closed, Resolved · Public

Description

Please feel free to rename the bug. Also, please excuse my layman's reporting of the error; I am a mere user of WQS.

This Query gives me very weird results:
https://w.wiki/eCa

Column "Autor" has some unexpected commons links which I cannot explain. This query should IMO return a list of statement values for P50 and has nothing to do with Commons AFAIK.

Adding Matěj Suchánek who encouraged me to report the error.

Event Timeline


The results, as displayed by the WDQS UI:

item	statement1	statement2	autor
wd:Q41190637	wds:Q41190637-17758AA9-341A-48E6-89B8-C043F035FC5B	wds:Q41190637-32FBD281-C583-4741-A71E-3FCA6231A01D	commons:
wd:Q54215360	wds:Q54215360-AA6A9EFC-6385-4105-8AAC-203D13784B9C	wds:Q54215360-8613EC86-ED66-4A78-87EF-06567D592CFE	commons:
wd:Q54215360	wds:Q54215360-8613EC86-ED66-4A78-87EF-06567D592CFE	wds:Q54215360-AA6A9EFC-6385-4105-8AAC-203D13784B9C	commons:
wd:Q48158290	wds:Q48158290-24BFA7F3-3761-4E53-9C94-6CB9556FD905	wds:Q48158290-85E131BD-4D3B-4B97-B406-44FDF89D2433	commons:
wd:Q54215360	wds:Q54215360-1BB1D70F-6A80-4329-9654-BFE85250678E	wds:Q54215360-8613EC86-ED66-4A78-87EF-06567D592CFE	commons:
wd:Q54215360	wds:Q54215360-EFBC0253-2684-45C3-B3A6-47E44111F775	wds:Q54215360-8613EC86-ED66-4A78-87EF-06567D592CFE	commons:
wd:Q54215360	wds:Q54215360-1BB1D70F-6A80-4329-9654-BFE85250678E	wds:Q54215360-AA6A9EFC-6385-4105-8AAC-203D13784B9C	commons:
wd:Q54215360	wds:Q54215360-EFBC0253-2684-45C3-B3A6-47E44111F775	wds:Q54215360-AA6A9EFC-6385-4105-8AAC-203D13784B9C	commons:
wd:Q63968867	wds:Q63968867-99DE3AC2-C2FB-4A49-BBB4-C93B8AFED280	wds:Q63968867-8B8B07DD-E2DA-43A5-83E7-297F81BD66B2	commons:
wd:Q63968867	wds:Q63968867-B6C665E4-FF83-4FFB-9A31-21D3FAC7E66B	wds:Q63968867-8B8B07DD-E2DA-43A5-83E7-297F81BD66B2	commons:

It’s not just a UI glitch either; here’s the first “bindings” JSON object:

{
  "statement1": {
    "type": "uri",
    "value": "http://www.wikidata.org/entity/statement/Q41190637-17758AA9-341A-48E6-89B8-C043F035FC5B"
  },
  "autor": {
    "type": "uri",
    "value": "http://commons.wikimedia.org/wiki/Special:FilePath/"
  },
  "statement2": {
    "type": "uri",
    "value": "http://www.wikidata.org/entity/statement/Q41190637-32FBD281-C583-4741-A71E-3FCA6231A01D"
  },
  "item": {
    "type": "uri",
    "value": "http://www.wikidata.org/entity/Q41190637"
  }
}

The autor really is the URI for a blank Commons media value.
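The same check can be scripted against a “bindings” object. This is only a minimal sketch, assuming the standard SPARQL JSON results shape quoted above; the helper name `blank_commons_vars` is hypothetical and not part of WDQS.

```python
# Prefix seen in the affected results; a value equal to the bare prefix
# means the file name part of the Commons media value is empty.
COMMONS_FILEPATH_PREFIX = "http://commons.wikimedia.org/wiki/Special:FilePath/"


def blank_commons_vars(binding):
    """Return the variable names in one SPARQL JSON binding whose value is
    exactly the bare Special:FilePath prefix (hypothetical helper)."""
    return sorted(
        name
        for name, term in binding.items()
        if term.get("type") == "uri"
        and term.get("value") == COMMONS_FILEPATH_PREFIX
    )


# The first binding quoted above, reduced to two of its variables.
binding = {
    "autor": {
        "type": "uri",
        "value": "http://commons.wikimedia.org/wiki/Special:FilePath/",
    },
    "item": {
        "type": "uri",
        "value": "http://www.wikidata.org/entity/Q41190637",
    },
}

print(blank_commons_vars(binding))
```

Comparing against the exact prefix, rather than using a prefix match, keeps legitimate Commons media values (which have a file name after the prefix) out of the result.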

In the Wikidata output, the author looks normal – or at least, it does now:

$ curl -s 'https://www.wikidata.org/wiki/Special:EntityData/Q41190637.ttl?flavor=dump&revision=1205887158' | grep -A5 Q41190637-17758AA9-341A-48E6-89B8-C043F035FC5B
wd:Q41190637 p:P50 s:Q41190637-17758AA9-341A-48E6-89B8-C043F035FC5B .

s:Q41190637-17758AA9-341A-48E6-89B8-C043F035FC5B a wikibase:Statement,
                wikibase:BestRank ;
        wikibase:rank wikibase:NormalRank ;
        ps:P50 wd:Q51888859 ;
        pq:P1545 "5" ;
        prov:wasDerivedFrom ref:be0bd403d0f9314c45bcc049994aeff9e61904e3 .

The affected properties are mostly “author”, but not exclusively:

SELECT ?p (COUNT(*) AS ?count) WHERE {
  ?s ?p <http://commons.wikimedia.org/wiki/Special:FilePath/>.
}
GROUP BY ?p
ORDER BY DESC(?count)
p	count
ps:P50	187
wdt:P50	96
ps:P106	23
wdt:P106	19
ps:P136	10
wdt:P136	8
ps:P275	1
pq:P512	1
pq:P106	1
wdt:P275	1

We will time box the investigation here to 1 day of work (plus monitoring time) and then reevaluate if it is complex.

Zbyszko added a subscriber: Zbyszko.

First look at the issue: the usual culprits don't seem to apply here:

  • The munged dump is correct for one of the affected entities
  • The query is fine, too, after loading the data into Blazegraph

Interestingly, every single affected entity I found was updated on 12 June 2020 (though with correct data and by different users). I'm not sure yet whether that is significant or just a freak coincidence.

A reload might be in order, since the update process currently works correctly.

According to this query:

SELECT DISTINCT ?item ?rev ?date WHERE {
      {
        ?st ps:P50|ps:P106|ps:P136|ps:P275|pq:P512|pq:P106 <http://commons.wikimedia.org/wiki/Special:FilePath/>.
        ?item ?p ?st 
      }
      UNION
      {
        ?item wdt:P50|wdt:P106|wdt:P136|wdt:P275 <http://commons.wikimedia.org/wiki/Special:FilePath/> .
      }
  ?item schema:version ?rev .
  ?item schema:dateModified ?date .
}

we have 119 entities affected, which I will refresh.

All the entities affected were refreshed and this:

SELECT ?p (COUNT(*) AS ?count) WHERE {
  ?s ?p <http://commons.wikimedia.org/wiki/Special:FilePath/>.
}
GROUP BY ?p
ORDER BY DESC(?count)

no longer returns any results.
All affected entities had been updated on 12 June 2020. @Lucas_Werkmeister_WMDE, are you aware of any issue around that day concerning Wikidata? I can't find anything on the WDQS side that could produce that kind of localized result.

I checked a few places but didn’t find anything noteworthy on that day. No idea where those came from.

Change 635547 had a related patch set uploaded (by ZPapierski; owner: ZPapierski):
[wikidata/query/rdf@master] Add logging of suspicious statements

https://gerrit.wikimedia.org/r/635547

Little bit more on what I found:

  • Both eqiad and codfw were affected, but the lists of affected Q-entities differed greatly
  • While almost all issues were reported for truthy and reified statements, there was a single case for a reference. Interestingly, that case was also for author (P50). The query for the broken reference:
SELECT  ?p ?o WHERE {
 
  <http://www.wikidata.org/reference/6cd4b812994d39bb1840bbc7ce8bb65446773b9a> ?p ?o
}

output:

p	o
<http://www.wikidata.org/prop/reference/P50>	<http://commons.wikimedia.org/wiki/Special:FilePath/>
<http://www.wikidata.org/prop/reference/P50>	<http://www.wikidata.org/entity/Q12480>
<http://www.wikidata.org/prop/reference/value/P813>	<http://www.wikidata.org/value/2dbc21b6cc76e660e513349edd4aad24>
<http://www.wikidata.org/prop/reference/P813>	2019-08-08T00:00:00Z
<http://www.wikidata.org/prop/reference/P854>	https://statbel.fgov.be/sites/default/files/Over_Statbel_FR/Nomenclaturen/NIS6nwithnamefrom01012019.xlsx
<http://www.wikidata.org/prop/reference/P854>	<https://statbel.fgov.be/sites/default/files/Over_Statbel_FR/Nomenclaturen/NIS6nwithnamefrom01012019.xlsx>

The reference was broken only in the eqiad DC.

  • There was a duplicated P854 property for the mentioned reference, one copy of which has a non-IRI value (which is not correct).

I added logging that we can parse later to see whether the issue repeats.
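The two oddities in that reference can be spotted mechanically. The following is a rough sketch of that kind of check, not the logging from Change 635547; the function `suspicious_predicates` and the `(predicate, object, is_iri)` row layout are assumptions made for illustration.

```python
BLANK_COMMONS = "http://commons.wikimedia.org/wiki/Special:FilePath/"
PR = "http://www.wikidata.org/prop/reference/"


def suspicious_predicates(rows):
    """rows: iterable of (predicate, object, is_iri) tuples (assumed layout).

    Flags predicates pointing at the blank Commons value, and predicates
    that carry both IRI and literal objects on the same node."""
    findings = []
    kinds = {}  # predicate -> set of is_iri flags seen for it
    for predicate, obj, is_iri in rows:
        if is_iri and obj == BLANK_COMMONS:
            findings.append(("blank-commons-value", predicate))
        kinds.setdefault(predicate, set()).add(is_iri)
    for predicate, seen in kinds.items():
        if len(seen) == 2:
            findings.append(("iri-and-literal-values", predicate))
    return findings


XLSX = (
    "https://statbel.fgov.be/sites/default/files/Over_Statbel_FR/"
    "Nomenclaturen/NIS6nwithnamefrom01012019.xlsx"
)

# The rows of the reference output above; the third field records whether
# the object was shown as an IRI (in angle brackets) or as a literal.
rows = [
    (PR + "P50", BLANK_COMMONS, True),
    (PR + "P50", "http://www.wikidata.org/entity/Q12480", True),
    (PR + "value/P813", "http://www.wikidata.org/value/2dbc21b6cc76e660e513349edd4aad24", True),
    (PR + "P813", "2019-08-08T00:00:00Z", False),
    (PR + "P854", XLSX, False),
    (PR + "P854", XLSX, True),
]

for reason, predicate in suspicious_predicates(rows):
    print(reason, predicate)
```

On these rows the check flags the reference's P50 for the blank Commons value and its P854 for carrying both an IRI and a literal.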

@Lucas_Werkmeister_WMDE a little follow-up on this: is it possible that this was the result of opcache issues, like those described in https://phabricator.wikimedia.org/T255282 ? The date matches, so we are wondering whether the two could be related.

I suppose that’s possible, but I can’t think of a way to confirm it (unless we have some log that records which appserver the entity data came from).

I wonder if this is related to T255657, a different “strange values” bug.

Change 635547 abandoned by ZPapierski:
[wikidata/query/rdf@master] Add logging of suspicious statements

Reason:
Since the original issue was probably caused by the opcache issue, it makes no sense to leave this log in Munger.

https://gerrit.wikimedia.org/r/635547