Queries involve finding strings (e.g. labels, descriptions or aliases) in a language are slow
Open, Medium, Public

Assigned To

None

Authored By

	Nikki
	Jun 7 2017, 9:52 PM

Description

There are often cases where I (and other people) want to make queries which involve finding items which have (or don't have) a label (or description or alias) in a particular language. In my experience, these queries are often slow or time out. It would be useful to either improve the speed of the existing queries somehow, or provide another way to filter by language which is faster.

For example, this recent request wanted all disambiguation items with an English description which have a description other than the usual description for such items. There I posted this query which works for English. I was also able to adapt it to produce this query for German. I then wanted to query for Austrian German, but the following just times out:

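select * {
  # Disable the query optimizer so the patterns run in the order written.
  hint:Query hint:optimizer "None" .
  # All items that are an instance of Wikimedia disambiguation page.
  ?item wdt:P31 wd:Q4167410 .
  # Exclude items that have the usual de-at description.
  minus { ?item schema:description "Wikimedia-Begriffsklärungsseite"@de-at }
  # Keep only descriptions whose language tag is de-at.
  ?item schema:description ?desc filter (lang(?desc) = "de-at") .
}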

I've tried various things and the one thing that stands out to me is that queries for simple triples are fast even when huge amounts of data are involved, e.g. select * { ?item schema:description "Wikimedia-Begriffsklärungsseite"@de-at } gives over 800 thousand results in just over 5 seconds and select * { ?item wdt:P31 wd:Q4167410 } gives over 1 million results in under 9 seconds. That suggests to me that one option would be to add something like ?item someprefix:hasLabelInLanguage "de-at".

More examples of when I've wanted something like this:

A few days ago I wanted to select all humans with Japanese labels, and then filter for labels which looked like they needed fixing (e.g. those which included disambiguation information). I ended up having to download and parse a data dump because I couldn't find a way to make a query that didn't time out.

A while back, I wanted to find all labels for a small language (only a few thousand labels) so that I could check and fix the capitalisation (many items should be lowercase, but are currently capitalised since they get copied from the automatically capitalised MediaWiki page names). I ended up having to use one of Magnus's tools to fetch the labels because the queries I tried timed out.

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		Igorkim78	T235759 [TRACKING] WDQS / Blazegraph optimization / bug fixes
		Open		None	T167361 Queries involve finding strings (e.g. labels, descriptions or aliases) in a language are slow

Event Timeline

Nikki created this task. Jun 7 2017, 9:52 PM

Restricted Application added projects: Wikidata, Discovery-ARCHIVED. Jun 7 2017, 9:52 PM

Restricted Application added a subscriber: Aklapper.

Yes, the problem here is that while it is fast to scan the index, the language itself is not indexed: the literal is string + language, so it is indexed as a whole. So a query like "get me the label in this language" would not be fast. Adding a triple for it may work, though it would double the number of triples for labels/descriptions, but maybe that's OK. I'll think about what else it is possible to do.
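
To illustrate the indexing point, here is a minimal Python sketch. It is a toy model, not Blazegraph's actual index structure: literals are modelled as (text, language) pairs, and the index is a plain dictionary keyed on predicate and whole object. The item IDs and descriptions are invented, and the has_label_in_language predicate name is hypothetical, standing in for the extra triple suggested in the description.

```python
from collections import defaultdict

# Toy data: (subject, predicate, object). Literal objects are (text, lang) pairs.
TRIPLES = [
    ("Q1", "schema:description", ("Wikimedia-Begriffsklärungsseite", "de-at")),
    ("Q1", "schema:description", ("Wikimedia disambiguation page", "en")),
    ("Q2", "schema:description", ("Begriffsklärung", "de-at")),
    ("Q3", "schema:description", ("Wikimedia disambiguation page", "en")),
]


def build_index(triples):
    """Index subjects by (predicate, object); a literal is one opaque key."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        index[(predicate, obj)].add(subject)
    return index


def subjects_with_literal(index, predicate, text, lang):
    """Exact literal match: a single key lookup."""
    return set(index.get((predicate, (text, lang)), set()))


def subjects_with_language_scan(index, predicate, lang):
    """Language-only match: every key for the predicate must be inspected."""
    found = set()
    for (pred, obj), subjects in index.items():
        if pred == predicate and isinstance(obj, tuple) and obj[1] == lang:
            found |= subjects
    return found


def add_language_markers(triples, predicate, marker="has_label_in_language"):
    """Add one hypothetical marker triple per (subject, language) pair."""
    markers = {(s, marker, o[1]) for s, p, o in triples if p == predicate}
    return triples + sorted(markers)


index = build_index(add_language_markers(TRIPLES, "schema:description"))
exact = subjects_with_literal(
    index, "schema:description", "Wikimedia-Begriffsklärungsseite", "de-at"
)
scanned = subjects_with_language_scan(index, "schema:description", "de-at")
# With the marker triple, the language is an ordinary object: one key lookup.
marked = set(index.get(("has_label_in_language", "de-at"), set()))
print(exact, scanned, marked)
```

In this model the scan and the marker lookup return the same items, but the scan touches every description key while the marker lookup touches one. The cost is the extra marker triples, which matches the trade-off described above.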

Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board. Jun 11 2017, 4:25 PM

Here's another case which appears to be affected by this: https://www.wikidata.org/wiki/Wikidata:Request_a_query/Archive/2017/06#Alias_.3D_Label (continued in German at https://www.wikidata.org/wiki/Wikidata:Forum#Bezeichnung_.3D_Alias).

I can get 100,000 results before it times out if I don't filter by language, so if we could filter by language more efficiently, I would expect it to be able to find all the results.

Smalyshev added a subscriber: daniel. Jul 13 2017, 10:30 PM

Lucie subscribed. Aug 15 2017, 11:00 AM

Smalyshev triaged this task as Medium priority. Feb 12 2018, 8:05 AM

Nikki mentioned this in T197161: Gather information on users of wb_terms replicas on WMF cloud infrastructure. Jun 20 2018, 10:49 AM

Ijon subscribed. Aug 8 2018, 1:32 PM

Gehel added a parent task: T235759: [TRACKING] WDQS / Blazegraph optimization / bug fixes. Oct 17 2019, 12:43 PM

Gehel moved this task from Incoming to Small Tasks on the Wikidata-Query-Service board. Feb 13 2020, 2:16 PM

Nikki renamed this task from "Queries involve finding labels, descriptions or aliases for a language are slow" to "Queries involve finding strings (e.g. labels, descriptions or aliases) in a language are slow". Oct 29 2020, 10:20 AM

This is also a problem for monolingual text statements, lemmas, forms and senses.

Other things I've tried to do which have timed out:

Find all statements using the language code mis and check whether they have a qualifier specifying the actual language.
List language codes and the number of times they've been used for forms, grouped by language (like this table but for forms, not lemmas).
Find items which have a label in a particular language and a P279 (subclass of) statement, where the label starts with a capital letter so I can check whether they should be fixed.

MPhamWMF moved this task from Small Tasks to Blazegraph on the Wikidata-Query-Service board. Jul 1 2021, 2:36 PM

Maybe we should move from
* ?item rdfs:label "text"@en
To
* ?item wikibase:label "text"
* "text" schema:inLanguage "en"
This may look like it would double the number of triples, but it happens that "text" is the same for several languages.

Also, this would simplify analysis of unique labels, e.g. T286257#7212146
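
A rough way to check the triple-count claim is sketched below in Python. The labels are invented for illustration, and wikibase:label / schema:inLanguage are used only as written in the proposal above, not as an existing export format. Triples are modelled as plain tuples in a set, so identical triples are counted once.

```python
# Invented labels: (item, text, language).
LABELS = [
    ("Q1", "Paris", "en"),
    ("Q1", "Paris", "fr"),
    ("Q1", "Paris", "de"),
    ("Q2", "Berlin", "en"),
    ("Q2", "Berlin", "de"),
    ("Q2", "Berlín", "es"),
]


def current_model(labels):
    """One triple per label: ?item rdfs:label "text"@lang."""
    return {(item, "rdfs:label", (text, lang)) for item, text, lang in labels}


def proposed_model(labels):
    """Split model: ?item wikibase:label "text" plus "text" schema:inLanguage "lang"."""
    triples = set()
    for item, text, lang in labels:
        triples.add((item, "wikibase:label", text))
        triples.add((text, "schema:inLanguage", lang))
    return triples


def items_in_language(triples, lang):
    """Items whose label text is tagged with the given language in the split model."""
    texts = {s for s, p, o in triples if p == "schema:inLanguage" and o == lang}
    return {s for s, p, o in triples if p == "wikibase:label" and o in texts}


current = current_model(LABELS)
proposed = proposed_model(LABELS)
print(len(current), len(proposed))
print(sorted(items_in_language(proposed, "es")))
```

With this sample the split model needs more triples than the current one but fewer than twice as many, because the item-to-text triple is shared whenever the same text is used for several languages. How large the saving is on real data depends on how often that happens, which this sketch does not measure.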

Esc3300 added a project: Language codes. Jul 16 2021, 11:31 AM

Esc3300 moved this task from Backlog to Monitoring on the Language codes board.

Manuel subscribed. Mar 24 2022, 5:41 PM

Nikki mentioned this in T326720: Provide a way to find lexemes with a given lemma/form. Jan 11 2023, 10:59 AM

QLever is able to handle certain queries like these, e.g. https://qlever.cs.uni-freiburg.de/wikidata/nF1iKM (one of the queries mentioned in the description) gave me the results in under a second, whereas the same query in Blazegraph times out.

It only supports the syntax filter (lang(?variable) = "...") though.
