Queries involve finding strings (e.g. labels, descriptions or aliases) in a language are slow
Open, Medium, Public

Description

There are often cases where I (and other people) want to make queries which involve finding items which have (or don't have) a label (or description or alias) in a particular language. In my experience, these queries are often slow or time out. It would be useful to either improve the speed of the existing queries somehow, or provide another way to filter by language which is faster.

For example, this recent request wanted all disambiguation items with an English description which have a description other than the usual description for such items. There I posted this query which works for English. I was also able to adapt it to produce this query for German. I then wanted to query for Austrian German, but the following just times out:

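select * {
  # Blazegraph hint: turn off the join-order optimizer so the patterns run in the order written.
  hint:Query hint:optimizer "None" .
  # All instances of Wikimedia disambiguation page (Q4167410).
  ?item wdt:P31 wd:Q4167410 .
  # Drop items that have the usual Austrian German description.
  minus { ?item schema:description "Wikimedia-Begriffsklärungsseite"@de-at }
  # Keep any remaining description whose language tag is de-at.
  ?item schema:description ?desc filter (lang(?desc) = "de-at") .
}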

I've tried various things and the one thing that stands out to me is that queries for simple triples are fast even when huge amounts of data are involved, e.g. select * { ?item schema:description "Wikimedia-Begriffsklärungsseite"@de-at } gives over 800 thousand results in just over 5 seconds and select * { ?item wdt:P31 wd:Q4167410 } gives over 1 million results in under 9 seconds. That suggests to me that one option would be to add something like ?item someprefix:hasLabelInLanguage "de-at".

More examples of when I've wanted something like this:

A few days ago I wanted to select all humans with Japanese labels, and then filter for labels which looked like they needed fixing (e.g. those which included disambiguation information). I ended up having to download and parse a data dump because I couldn't find a way to make a query that didn't time out.

A while back, I wanted to find all labels for a small language (only a few thousand labels) so that I could check and fix the capitalisation (many items should be lowercase, but are currently capitalised since they get copied from the automatically capitalised MediaWiki page names). I ended up having to use one of Magnus's tools to fetch the labels because the queries I tried timed out.

Event Timeline

Yes, the problem here is that while scanning the index is fast, the language itself is not indexed: the literal is string + language, so it is indexed as a whole. So a query like "get me the label in this language" would not be fast. Adding a triple for it may work; it would double the number of triples for labels/descriptions, but maybe that's OK. I'll think about what else it is possible to do.
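
The indexing point above can be illustrated with a toy model. This is a minimal Python sketch, assuming a dictionary keyed on the full (text, language) literal as a stand-in for the triple store's index. It is not how Blazegraph is actually implemented, and every name and value in it is made up for illustration.

```python
from collections import defaultdict

# Toy data: (item, text, language) label triples. All values are made up.
LABELS = [
    ("Q1", "Berlin", "de"),
    ("Q1", "Berlin", "de-at"),
    ("Q1", "Berlin", "en"),
    ("Q2", "Wien", "de-at"),
    ("Q3", "Vienna", "en"),
]

# Index keyed on the whole literal (text + language), as described above.
literal_index = defaultdict(set)
for item, text, lang in LABELS:
    literal_index[(text, lang)].add(item)

# Hypothetical extra index: one entry per (item, language) marker triple.
language_index = defaultdict(set)
for item, _text, lang in LABELS:
    language_index[lang].add(item)


def items_with_exact_literal(text, lang):
    """Exact literal lookup: a single key access."""
    return literal_index.get((text, lang), set())


def items_with_language_by_scan(lang):
    """Language-only lookup without a language index: visits every key."""
    found = set()
    visited = 0
    for (_text, key_lang), items in literal_index.items():
        visited += 1
        if key_lang == lang:
            found |= items
    return found, visited


def items_with_language_by_marker(lang):
    """Language-only lookup using the hypothetical marker index."""
    return language_index.get(lang, set())
```

In this model the exact-literal lookup and the marker lookup each touch one key, while the language-only scan has to visit every distinct literal. The price of the marker index is one extra entry per label.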

Here's another case which appears to be affected by this: https://www.wikidata.org/wiki/Wikidata:Request_a_query/Archive/2017/06#Alias_.3D_Label (continued in German at https://www.wikidata.org/wiki/Wikidata:Forum#Bezeichnung_.3D_Alias).

I can get 100,000 results before it times out if I don't filter by language, so if we could filter by language more efficiently, I would expect it to be able to find all the results.

Smalyshev triaged this task as Medium priority. Feb 12 2018, 8:05 AM
Nikki renamed this task from "Queries involve finding labels, descriptions or aliases for a language are slow" to "Queries involve finding strings (e.g. labels, descriptions or aliases) in a language are slow". Oct 29 2020, 10:20 AM

This is also a problem for monolingual text statements, lemmas, forms and senses.

Other things I've tried to do which have timed out:

  • Find all statements using the language code mis and check whether they have a qualifier specifying the actual language.
  • List language codes and the number of times they've been used for forms, grouped by language (like this table but for forms, not lemmas).
  • Find items which have a label in a particular language and a P279 (subclass of) statement, where the label starts with a capital letter so I can check whether they should be fixed.

Maybe we should move from
  • ?item rdfs:label "text"@en
to
  • ?item wikibase:label "text"
  • "text" schema:inLanguage "en"
This may look like it would double the number of triples, but the same "text" is often shared by several languages.
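
The triple-count argument above can be checked on a small example. This is a minimal Python sketch, assuming made-up labels and modelling triples as plain tuples in a set. The predicate names are copied from the proposal, and the helper names `current_model` and `proposed_model` are hypothetical.

```python
# Toy labels: (item, text, language). All values are made up.
LABELS = [
    ("Q1", "Berlin", "de"),
    ("Q1", "Berlin", "de-at"),
    ("Q1", "Berlin", "en"),
    ("Q2", "Wien", "de"),
    ("Q2", "Wien", "de-at"),
    ("Q2", "Vienna", "en"),
]


def current_model(labels):
    """One triple per label: ?item rdfs:label "text"@lang."""
    return {(item, "rdfs:label", (text, lang)) for item, text, lang in labels}


def proposed_model(labels):
    """Two kinds of triples per label; identical triples collapse in the set."""
    triples = set()
    for item, text, lang in labels:
        triples.add((item, "wikibase:label", text))
        triples.add((text, "schema:inLanguage", lang))
    return triples


current_count = len(current_model(LABELS))
proposed_count = len(proposed_model(LABELS))
```

On this sample the current model has 6 triples and the proposed one has 9, rather than the 12 that straight doubling would give, because "Berlin" and "Wien" are each stored once per item. How much sharing the real data has is not measured here. The sketch also only counts triples: standard RDF does not allow a literal in subject position, so a real implementation would presumably need some intermediate node for "text".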

Also, this would simplify analysis of unique labels, e.g. T286257#7212146