
item not returned in SPARQL query on geokb wikibase.cloud instance
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

What happens?:

The item I'm trying to retrieve doesn't show up in the query results. The "what links here" page for the classification item does list the item that SPARQL is not returning. The item in question also comes up with type-ahead search in the UI just fine.

What should have happened instead?:

Q138349 should be in the SPARQL results

Software version (skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

The work on the previous issues I reported, where items were not showing up for some reason, is all done, and the wikibase.cloud platform is now smokin'! It is really performant and working quite well. Type-ahead search is available almost immediately. I pushed another 85K items into the instance. While slow through the API, it is getting the job done and I'm tickled pink. I just have to figure out why this item is not coming up via SPARQL. There may also be others that are problematic; I just haven't come across them yet.

Event Timeline

I do have other things missing in SPARQL queries now. The following is supposed to pull about 83K items (a whole tranche of publication items I just brought in):

https://geokb.wikibase.cloud/query/#PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FindexId%0AWHERE%20%7B%0A%20%20%3Fclasses%20wdt%3AP2%20wd%3AQ11%20.%0A%20%20%3Fitem%20wdt%3AP1%20%3Fclasses%20.%0A%20%20%3Fitem%20wdt%3AP114%20%3FindexId%20.%0A%7D

There are only about 71K results from that query. The following is a query for one specific ExternalID that should have been returned but is missing:

https://geokb.wikibase.cloud/query/#PREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP114%20%22wri834142%22%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D

That query does return the item in question. So, the items are there, they should be returned via SPARQL, but a bunch of them are not.

The reason I'm pulling a huge batch of identifiers back like this is I need to build out claims from another source that has the internal identifier. I can do the matchup with my QIDs in a batch with the source data and then run a process to build and commit the claims.
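A minimal sketch of that matchup in Python, assuming the query results arrive in the standard SPARQL JSON results format and that the source records are dictionaries carrying the identifier under a key named `indexId`. The function names and the record layout are hypothetical, not part of any existing tooling.

```python
def index_id_to_qid(sparql_json):
    """Map each ?indexId value to the QID of the item that carries it."""
    lookup = {}
    for row in sparql_json["results"]["bindings"]:
        # Entity URIs end in the QID, e.g. .../entity/Q154148
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        lookup[row["indexId"]["value"]] = qid
    return lookup


def match_source_records(source_records, lookup, id_field="indexId"):
    """Split source records into (qid, record) matches and unmatched records.

    `id_field` is an assumed key name for the identifier in the source data.
    """
    matched, unmatched = [], []
    for record in source_records:
        qid = lookup.get(record[id_field])
        if qid is None:
            unmatched.append(record)
        else:
            matched.append((qid, record))
    return matched, unmatched
```

Records that land in `unmatched` are either genuinely absent from the Wikibase or present but missing from the query service, which is the situation described in this task.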

I thought this was something different at first, but perhaps it is related. I noticed I had at least one item where the "what links here" page is showing things linked but without labels.

https://geokb.wikibase.cloud/wiki/Special:WhatLinksHere/Item:Q44246

If you follow those, you'll see that they look like perfectly good items, complete with English language labels. Here's an example:

https://geokb.wikibase.cloud/wiki/Item:Q47728

So, I ran a query for all person items affiliated with that organization item:

https://geokb.wikibase.cloud/query/#PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP1%20wd%3AQ3%20.%0A%20%20%3Fitem%20wdt%3AP108%20wd%3AQ44246%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D

The SPARQL results are missing those items entirely. I also tried the query without running the label service just to get the IDs. I still don't turn up all of the people that should be affiliated with the organization. It seems like something is now clogged up somewhere in one or more of the components where all this information is supposed to be replicated.

The ingest rate appears to be very fast. I wonder if it could be adding items so fast that the WDQS updater can't read them before they scroll off recent changes (which, last I heard, is still the way it figured out what new items it needs to review for its own processing)?

There is another way of streaming updates to WDQS that is designed to be more scalable, but if Wikibase.cloud is using that I am not aware of it.

It does appear that certain triples may have been lost in the process of ingesting to Blazegraph. For a query like the following where I'm trying to get all of the items that have a particular instance of classification, I'm missing a bunch of items that are actually classified that way.

https://geokb.wikibase.cloud/query/#PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%20%3FindexId%0AWHERE%20%7B%0A%20%20%3Fclasses%20wdt%3AP2%20wd%3AQ11%20.%0A%20%20%3Fitem%20wdt%3AP1%20%3Fclasses%20.%0A%20%20%3Fitem%20wdt%3AP114%20%3FindexId%20.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D

However, I can run a different query for specific values of a foreign identifier in the items, and that query does return those items.

https://geokb.wikibase.cloud/query/#PREFIX%20wd%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fentity%2F%3E%0APREFIX%20wdt%3A%20%3Chttps%3A%2F%2Fgeokb.wikibase.cloud%2Fprop%2Fdirect%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%20%3FindexId%0AWHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP114%20%3FindexId%20.%0A%20%20VALUES%20%3FindexId%20%7B%20%22852%22%20%22wri034009%22%20%22wri034064%22%20%22wdr2009%22%20%22wdr2008%22%20%22wdr2007%22%20%22wdrPA051%22%20%22wdrPA052%22%20%22wdrHI051%22%20%22wdr2006%22%20%22wdrNJ053%22%20%22wdrAL051%22%20%22wdrVA051%22%20%22wdrCT051%22%20%22wdrMDDEDC052%22%20%22wdrAK051%22%20%22wdrNV051%22%20%22wri20034004%22%20%22wdrNJ051%22%20%22wdrNJ052%22%20%22wdrMDDEDC051%22%20%22wdrIN051%22%20%22wdrOH051%22%20%22wdrOH052%22%20%22wdrFL051B%22%20%22wdrWI051%22%20%22wdrVA052%22%20%22wdrGA05%22%20%22wdrLA051%22%20%22wdrMN051%22%20%22wdrNC051%22%20%22wdrNC052%22%20%22wdrSC051%22%20%22wdrND051%22%20%22wdrUT051%22%20%22wdrIL051%22%20%22wdrND052%22%20%22wdrNY052%22%20%22wdrNY053%22%20%22wdrAK041%22%20%22wdrFL053A%22%20%22wdrFL053B%22%20%22wdrHI041%22%20%22wdrNC042%22%20%22wdrFL054%22%20%22wdrAL041%22%20%22wdrIA051%22%20%22wdrNY041%22%20%22wdrNY042%22%20%22wdrNY043%22%20%22wdrNY051%22%20%22wdrGA041%22%20%22wdrGA042%22%20%22wdrPA041%22%20%22wdrPA042%22%20%22wdrPA043%22%20%22wdrSC041%22%20%22wdrLA041%22%20%22wdrCA041%22%20%22wdrNC041%22%20%22wdrUT041%22%20%22wdrMN041%22%20%22wdrTX043%22%20%22wdrTX044%22%20%22wdrND041%22%20%22wdrTX041%22%20%22wdrTX042%22%20%22wdrTX046%22%20%22wdrND042%22%20%22wdrTX045%22%20%22wdrIA041%22%20%22wdrIA042%22%20%22wri034249%22%20%22wdrHI031%22%20%22wdrOH-31%22%20%22wri034184%22%20%22wri20034312%22%20%22wdrAL031%22%20%22wri20034151%22%20%22wdrAK031%22%20%22wri034315%22%20%22wri034100%22%20%22wri034224%22%20%22wri034301%22%20%22wri034137%22%20%22wri034288%22%20%22wri034287%22%20%22wri024284%22%20%22wri034195%22%20%22wri034218%22%20%22wri034324%22%20%22wri034318%22%20%22wri034322%22%20%22wri034327%22%20%22wri034240%22%20%22wri034323%22%20%22wri034334%22%20%22wri034293%22%20%22wri034295%22%20%22wri034145%22%20%7D%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22%20.%20%7D%0A%7D

I'm struggling to get my head around all of the details of your query, but I wondered whether you intended to traverse your subclass network and are not doing so. Does this do what you wanted?

PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?indexId
WHERE {
  ?classes wdt:P2* wd:Q11 .
  ?item wdt:P1 ?classes .
  ?item wdt:P114 ?indexId .
}

Thanks for the input. I probably provided too much detail. That's a useful query, but it wasn't the issue. The bottom line is that I have items that are not showing up in SPARQL queries that absolutely should be, based on how they appear in the UI and how they are returned via the MediaWiki API. Here is a single concrete example:

Item representing a journal article: https://geokb.wikibase.cloud/wiki/Item:Q154148

Note the visual presence of the "USGS Publications Warehouse IndexID" (P114) with a value of 70026262

Note the presence of the P114 claim in the JSON rendering of the item: https://geokb.wikibase.cloud/wiki/Special:EntityData/Q154148.json

The API wbgetclaims response also shows the P114 claim: https://geokb.wikibase.cloud/w/api.php?action=wbgetclaims&format=json&entity=Q154148&property=P114

The problem is that this claim (and the ID I'm trying to verify as present in the Wikibase) does not show up in any of the SPARQL queries I'm using to pull these claims. Here is the simplest form of a query attempting to get every P114 identifier:

PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?indexId
WHERE {
  ?item wdt:P114 ?indexId .
}

Note: I use LIMIT and OFFSET to paginate and get all results without breaking something.
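One way to sketch that pagination in Python. The `run_query` callable is a hypothetical stand-in for whatever client sends the query to the endpoint and returns a list of rows, and `page_size` is an arbitrary choice. The `ORDER BY` clause is an addition to the query above: SPARQL does not guarantee a stable row order across LIMIT/OFFSET pages without one, so unordered pages could in principle skip or repeat rows.

```python
BASE_QUERY = """PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?indexId
WHERE {
  ?item wdt:P114 ?indexId .
}
ORDER BY ?item ?indexId"""


def page_query(base_query, page_size, page_number):
    """Append LIMIT/OFFSET for the given zero-based page number."""
    offset = page_size * page_number
    return f"{base_query}\nLIMIT {page_size}\nOFFSET {offset}"


def fetch_all(run_query, base_query, page_size=10000):
    """Collect rows page by page until a short (or empty) page is returned.

    `run_query` is assumed to take a query string and return a list of rows.
    """
    rows = []
    page_number = 0
    while True:
        page = run_query(page_query(base_query, page_size, page_number))
        rows.extend(page)
        if len(page) < page_size:
            return rows
        page_number += 1
```

Sorting a large result set adds load on the query service, so the page size may need tuning; that trade-off is not something this ticket measures.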

Here's a very specific query simply trying to get that P114 identifier using a VALUES statement.

PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?indexId
WHERE {
  ?item wdt:P114 ?indexId .
  VALUES ?indexId { "70026262" }
}

So, the claim is obviously there in the Wikibase instance. But it appears that the triple did not get written to the Blazegraph store.
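That comparison can be automated. A rough sketch, assuming string-valued claims (such as external IDs), the JSON shape returned by `wbgetclaims`, and SPARQL JSON results; the function names are hypothetical.

```python
def claim_values(wbgetclaims_json, prop):
    """Collect the main-snak values for one property from a wbgetclaims response.

    Assumes string-valued claims; structured values (dicts) would need
    converting to something hashable first.
    """
    values = set()
    for claim in wbgetclaims_json.get("claims", {}).get(prop, []):
        snak = claim["mainsnak"]
        if snak.get("snaktype") == "value":
            values.add(snak["datavalue"]["value"])
    return values


def sparql_values(sparql_json, var):
    """Collect the bound values of one variable from SPARQL JSON results."""
    return {
        row[var]["value"]
        for row in sparql_json["results"]["bindings"]
        if var in row
    }


def missing_from_query_service(wbgetclaims_json, sparql_json, prop, var):
    """Values present on the item via the API but absent from the SPARQL results."""
    return claim_values(wbgetclaims_json, prop) - sparql_values(sparql_json, var)
```

For Q154148, a non-empty result from `missing_from_query_service(api_json, sparql_json, "P114", "indexId")` would flag the situation described here.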

One other thing I tried doing is replacing and then recreating a P114 claim. If I simply use wikibaseintegrator with action_if_exists=REPLACE_ALL, this doesn't have any effect. However, if I first remove the P114 claim entirely and then run another operation to add it back in using wikibaseintegrator, a query on the ID turns up the expected result. This one, P114:70235876, was missing before I did that.

PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?indexId
WHERE {
  ?item wdt:P114 ?indexId .
  VALUES ?indexId { "70235876" }
}
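For reference, the remove-then-recreate workaround might look like this at the level of the Wikibase Action API rather than through wikibaseintegrator. This sketch only builds the request parameters for `wbremoveclaims` and `wbcreateclaim`; it does not send them. The helper names are hypothetical, the entity ID, claim GUID, and token in any call are placeholders supplied by the caller, and it assumes string-valued claims.

```python
import json


def remove_claims_params(claim_guids, csrf_token):
    """Parameters for a wbremoveclaims request (GUIDs are pipe-separated)."""
    return {
        "action": "wbremoveclaims",
        "claim": "|".join(claim_guids),
        "token": csrf_token,
        "format": "json",
    }


def create_claim_params(entity_id, prop, value, csrf_token):
    """Parameters for a wbcreateclaim request; the value is sent as JSON."""
    return {
        "action": "wbcreateclaim",
        "entity": entity_id,
        "property": prop,
        "snaktype": "value",
        "value": json.dumps(value),
        "token": csrf_token,
        "format": "json",
    }


def recreate_claim_payloads(wbgetclaims_json, entity_id, prop, csrf_token):
    """Build one removal payload, then one creation payload per existing value.

    Caution: recreating a claim this way drops its qualifiers and references.
    """
    claims = wbgetclaims_json.get("claims", {}).get(prop, [])
    guids = [claim["id"] for claim in claims]
    values = [
        claim["mainsnak"]["datavalue"]["value"]
        for claim in claims
        if claim["mainsnak"].get("snaktype") == "value"
    ]
    payloads = []
    if guids:
        payloads.append(remove_claims_params(guids, csrf_token))
    payloads.extend(
        create_claim_params(entity_id, prop, value, csrf_token) for value in values
    )
    return payloads
```

The payloads would be sent as two separate write operations, matching the remove-first, add-second sequence described above.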

This can obviously be used to fix my problem, but I first have to know that there's a problem. If I have claims (triples) missing and can't turn them up with a SPARQL query, that's obviously a major issue for the functionality of the system. At this point, I don't know if this is something about the rate at which I'm trying to introduce items and claims or the complexity of the "item packages" I'm sending in. I can try slowing things way down, introducing barebones items and then claims one at a time. But operations of any significant magnitude are already pretty dang time consuming.

Interestingly enough, I've been posting some one-off things to Wikidata today as part of some research into Cherokee Nation Tribal leaders. SPARQL queries there are not immediately showing the results of adding things like start time/end time qualifiers on claims. Here's the query that is missing the information I'm attempting to contribute. Maybe there's an inherent delay in getting data into Blazegraph?

SELECT ?person ?personLabel (YEAR(?startTime) AS ?start) (YEAR(?endTime) AS ?end)
?replaces ?replacesLabel ?replacedBy ?replacedByLabel ?article
WHERE {
  ?person p:P39 ?statement.
  ?statement ps:P39 wd:Q7245055 .
  OPTIONAL {
    ?statement pq:P580 ?startTime;
               pq:P582 ?endTime;
               pq:P1365 ?replaces;
               pq:P1366 ?replacedBy.
  }
  OPTIONAL {
    ?article schema:about ?person ;
             schema:inLanguage "en" .
    FILTER (SUBSTR(str(?article), 1, 25) = "https://en.wikipedia.org/")
  }
  SERVICE wikibase:label { 
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" .
  }
}

I added start time and end time qualifiers for these two items, which I would have expected to show up in the query now.

This seems like a fairly major problem or at least a limitation we need to be aware of in Wikibases. In major data-to-knowledgebase-representation workflows, this means we need to keep a log of exactly what's been done and any identifiers that have resulted.

I still have information that is not being retrieved via SPARQL queries when I can see it visibly on items in the geokb.wikibase.cloud instance. Here's a query:

PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>
PREFIX p: <https://geokb.wikibase.cloud/prop/>
PREFIX pq: <https://geokb.wikibase.cloud/prop/qualifier/>

SELECT ?item ?itemLabel ?profile_url ?retrieved ?status_code
WHERE {
  ?item wdt:P1 wd:Q3 ;
        wdt:P31 ?profile_url ;
        p:P31 ?ref_url_statement .
  OPTIONAL {
    ?ref_url_statement pq:P151 ?status_code ;
                       pq:P139 ?retrieved .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

If you take out the optional, you'll get several hundred fewer results. Here's a case in point item that is not returned unless you make P151 and P139 optional:

https://geokb.wikibase.cloud/wiki/Item:Q45859

You can see visibly that the P31 claim has qualifiers for P151 and P139. If you pull the item json, you'll see the same thing:

https://geokb.wikibase.cloud/wiki/Special:EntityData/Q45859.json
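A small Python sketch for listing the affected items from the OPTIONAL form of the query, assuming SPARQL JSON results. Variables left unbound by an OPTIONAL block are simply absent from a result row, so a row without `status_code` or `retrieved` points at a statement whose qualifier triples were not found. Because both qualifiers sit in one OPTIONAL block in the query above, they are bound together or not at all. The function name is hypothetical.

```python
def items_missing_qualifiers(
    sparql_json,
    item_var="item",
    qualifier_vars=("status_code", "retrieved"),
):
    """Return {QID: [unbound qualifier variables]} for rows lacking any of them.

    An item with several matching statements yields several rows; the last
    row with unbound variables wins in this simple version.
    """
    missing = {}
    for row in sparql_json["results"]["bindings"]:
        absent = [var for var in qualifier_vars if var not in row]
        if absent:
            qid = row[item_var]["value"].rsplit("/", 1)[-1]
            missing[qid] = absent
    return missing
```

The resulting QIDs can then be checked against the item JSON to confirm that the qualifiers exist in the Wikibase but not in the query service.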

It appears something went haywire between the API route I used to commit this information and the data store that drives Blazegraph. This is a programmatic process I'm building to keep many of our items up to date with source material. In this case, it's a web scraping routine that grabs data from a structured page, caches it in the Item_talk wiki page as YAML of the subject item, and then uses that information to build/update claims. I want to know when the URL was last retrieved and what the HTTP status code was at that time, so I can drive the update processing. If I can't get it back in the SPARQL query, it makes things a bit more difficult.

Tarrow renamed this task from item not returned in SPARQL query on wikibase.cloud instance to item not returned in SPARQL query on geokb wikibase.cloud instance. Feb 23 2024, 5:19 PM
Tarrow changed the task status from Open to Stalled. Feb 23 2024, 5:31 PM

Hi @Skybristol could you take a look and see if these issues appear to be fixed for you?

We've been working hard on the query service, both improving the process by which it is updated and rebuilding all of them from the primary data. Unfortunately I'm not able to tell whether things are working as expected in your case. The most recent queries you pasted don't seem to behave the way you describe wanting them to, and I'm not sure if that's because the underlying data has changed or because an issue remains.

Cheers!

Thanks to the team for all the work on this issue. I ended up changing some things on the items I had posted about here (different property for the URL associated with USGS staff). The problems I noted no longer seem to be an issue as those entities/claims/triples were all updated. I am, however, seeing a problem in a different area this morning. I was doing some work to resolve names of organizational units to identifiers instantiated in the Wikibase instance. I'm not turning up entities in that query that should be there. Here's a query:

PREFIX ge: <https://geokb.wikibase.cloud/entity/>
PREFIX gp: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?org ?orgLabel ?org_alt_label
WHERE {
  ?org gp:P1/gp:P2* ge:Q50862 ;
       skos:altLabel ?org_alt_label .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}

The Idaho Water Science Center is one of a number of entities that should be showing up in that query. Other instances of USGS Water Science Center, which is pulled in through the transitive "subclass of" path, do show up in the SPARQL query (e.g., New England Water Science Center). I've tried to manually tweak the item in question, including removing and recreating the "instance of" claim, but I cannot get it to turn up in a query result. Another example of an org entity not showing up in the query is the New Mexico Water Science Center.

Ignore my last note on this. That was a problem in my query. I hadn't realized I didn't have all the alt labels in the data yet.

Tarrow claimed this task.

Great, in that case I'm going to close this ticket. If you think I did this in error, please feel free to reopen it.