
Many files on Commons cannot be found in WCQS
Closed, Resolved | Public | BUG REPORT

Description

Many files on Wikimedia Commons that have structured data claims cannot be found in the Wikimedia Commons Query Service (WCQS).

One example is https://commons.wikimedia.org/wiki/File:Dux-Markt-1.jpg. Its concept URI is http://commons.wikimedia.org/entity/M47869727, and it has had structured data since 30 September 2020. But this query in WCQS returns no results:

SELECT *
WHERE
{
  sdc:M47869727 ?predicate ?object.
}

Many other files are missing from WCQS as well. This query fetches some images used on Wikidata and then tries to find the corresponding MediaInfo entities in WCQS:

SELECT (COUNT (DISTINCT ?image) AS ?images) (COUNT(DISTINCT ?file) AS ?files)
WITH
{
  SELECT ?image ?contentUrl
  WHERE
  {
    SERVICE <https://query.wikidata.org/sparql>
    {
      ?item wdt:P31 wd:Q5153359 .
      ?item wdt:P18 ?image .
    }
    BIND (REPLACE(wikibase:decodeUri(SUBSTR(STR(?image), 52)), " ", "_") AS ?filename)
    BIND (MD5(?filename) AS ?MD5)
    BIND (URI(CONCAT("https://upload.wikimedia.org/wikipedia/commons/",
                     SUBSTR(?MD5, 1, 1), "/", SUBSTR(?MD5, 1, 2), "/", ?filename)) As ?contentUrl)
  }
} AS %get_some_images_from_Wikidata
WHERE
{
  INCLUDE %get_some_images_from_Wikidata
  OPTIONAL { ?file schema:contentUrl ?contentUrl . }
}

The content URLs are constructed as described in https://www.mediawiki.org/wiki/Manual:$wgHashedUploadDirectory, and all the URLs I tested were correct. But the query gives this result:

images  files
6328    1013

So only 1013 out of 6328 files used in Wikidata claims can be found in WCQS.
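
For anyone who wants to spot-check individual URLs outside the query service, the construction used in the query above can be sketched in Python. This is only a rough equivalent of the BIND steps, not the query service's own implementation. It assumes that `?image` values start with the 51-character `Special:FilePath` prefix (which is what the offset 52 in the query implies) and that `urllib.parse.unquote` behaves like `wikibase:decodeUri` for these names. The function and constant names are made up for illustration.

```python
import hashlib
from urllib.parse import unquote

# Assumed prefix of Wikidata P18 values; it is 51 characters long,
# which matches SUBSTR(STR(?image), 52) in the query.
FILEPATH_PREFIX = "http://commons.wikimedia.org/wiki/Special:FilePath/"
UPLOAD_BASE = "https://upload.wikimedia.org/wikipedia/commons/"


def content_url_from_image(image_uri: str) -> str:
    """Mirror the BIND steps of the query: decode, underscore, hash, concatenate."""
    # SUBSTR(STR(?image), 52): drop the Special:FilePath prefix.
    encoded_name = image_uri[len(FILEPATH_PREFIX):]
    # wikibase:decodeUri(...) followed by REPLACE(..., " ", "_").
    filename = unquote(encoded_name).replace(" ", "_")
    # MD5 of the file name; the first one and two hex digits name the directories.
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return f"{UPLOAD_BASE}{digest[0]}/{digest[:2]}/{filename}"


# The example file from this ticket.
print(content_url_from_image(FILEPATH_PREFIX + "Dux-Markt-1.jpg"))
```

The printed URL can be compared by hand with the file's actual location on upload.wikimedia.org.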

Event Timeline

Timebox to a half-day investigation and re-evaluate.

The first M-entity mentioned in the ticket was missing because of a bug in the weekly reloads. That bug has now been fixed, so entries added before the reload should be available.

As for the second part of the ticket: the contentUrl values in SDC are URI-encoded (although the MD5 hash is apparently calculated from the decoded filename). I modified the query to account for that:

SELECT (COUNT (DISTINCT ?image) AS ?images) (COUNT(DISTINCT ?file) AS ?files)
WITH
{
  SELECT ?image ?contentUrl
  WHERE
  {
    SERVICE <https://query.wikidata.org/sparql>
    {
      ?item wdt:P31 wd:Q5153359 .
      ?item wdt:P18 ?image .
    }
    # File name used for hashing: percent-decoded, spaces replaced by underscores.
    BIND (REPLACE(wikibase:decodeUri(SUBSTR(STR(?image), 52)), " ", "_") AS ?filename)
    # File name used in the URL: left percent-encoded, as WCQS stores it; only "%20" becomes "_".
    BIND (REPLACE(SUBSTR(STR(?image), 52), "%20", "_") AS ?filenameEncoded)
    BIND (MD5(?filename) AS ?MD5)
    BIND (URI(CONCAT("https://upload.wikimedia.org/wikipedia/commons/",
                     SUBSTR(?MD5, 1, 1), "/", SUBSTR(?MD5, 1, 2), "/", ?filenameEncoded)) AS ?contentUrl)
  }
} AS %get_some_images_from_Wikidata
WHERE
{
  INCLUDE %get_some_images_from_Wikidata
  OPTIONAL { ?file schema:contentUrl ?contentUrl . }
}

This yields (at the time of writing):

images  files
6326    6318

That leaves 8 unaccounted for. Of those, 7 do not have structured data defined, and the last one has newly added structured data, so it possibly wasn't present in the latest dump we reload from.
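
To illustrate why the original query missed so many files, here is a small Python sketch of the two URL variants: one built from the decoded file name (as in the original query) and one that keeps the percent-encoding in the URL while still hashing the decoded name (as in the modified query). It assumes `urllib.parse.unquote` is a fair stand-in for `wikibase:decodeUri`. The helper name and the sample file name are hypothetical, picked only because the name contains a non-ASCII character.

```python
import hashlib
from urllib.parse import unquote

UPLOAD_BASE = "https://upload.wikimedia.org/wikipedia/commons/"


def candidate_urls(encoded_name: str) -> tuple[str, str]:
    """Return (decoded_variant, encoded_variant) for a percent-encoded file name."""
    # Name used for hashing in both queries: decoded, spaces as underscores.
    hashed_name = unquote(encoded_name).replace(" ", "_")
    digest = hashlib.md5(hashed_name.encode("utf-8")).hexdigest()
    directory = f"{UPLOAD_BASE}{digest[0]}/{digest[:2]}/"
    # Name used in the URL by the modified query: still percent-encoded,
    # with only "%20" turned into an underscore.
    url_name = encoded_name.replace("%20", "_")
    return directory + hashed_name, directory + url_name


# Hypothetical file name with a non-ASCII character.
decoded, encoded = candidate_urls("Caf%C3%A9%20du%20Monde.jpg")
print(decoded)  # variant built by the original query
print(encoded)  # variant built by the modified query
```

Under these assumptions the two variants are identical for plain ASCII names such as Dux-Markt-1.jpg and differ only when the name contains characters that get percent-encoded, which would be consistent with some, but not all, files matching in the original query.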

Please let me know if this resolves the issue.

Gehel added a subscriber: Gehel.

Looks like this is resolved. If you find more issues, feel free to re-open.