Efficient API for querying language links for multiple pages
Open, Needs Triage, Public

Description

I have found two ways to get language links:

  1. action=wbgetentities
  2. prop=langlinks

The first does not support any filtering by language. The second supports filtering by language, but only one language at a time.

I would need an API that can return whether a list of pages exists in two languages.

The simplest way seems to be to extend prop=langlinks so that lllang accepts multiple values. Thoughts?
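Until something like that exists, a client-side workaround is to send one prop=langlinks request per language and intersect the results. The following is only a rough sketch: the endpoint URL, the helper names, and the assumed formatversion=2 response shape (a "pages" list whose entries carry "title" and an optional "langlinks" list) are illustrative assumptions, and continuation handling is left out.

```python
from urllib.parse import urlencode

# Assumed endpoint; substitute the wiki you are querying.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_langlinks_url(titles, lang):
    """Build a prop=langlinks request URL for one language (hypothetical helper)."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": "|".join(titles),
        "lllang": lang,  # only a single value is accepted today
        "lllimit": "max",
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urlencode(params)


def titles_with_link(response):
    """Return titles of pages that have at least one langlink in a parsed response.

    Assumes the formatversion=2 shape described in the lead-in.
    """
    pages = response.get("query", {}).get("pages", [])
    return {page["title"] for page in pages if page.get("langlinks")}


def titles_in_all_languages(responses_by_lang):
    """Intersect per-language results: pages that exist in every requested language."""
    sets = [titles_with_link(resp) for resp in responses_by_lang.values()]
    return set.intersection(*sets) if sets else set()
```

This costs one request per language per batch of titles, which is the overhead a multi-valued lllang would remove.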

Restricted Application added a subscriber: Aklapper. Nov 4 2016, 1:20 PM
Anomie added a subscriber: Anomie. Nov 4 2016, 2:41 PM

The existing code generates a query of one of these three general forms:

-- No lllang or lltitle
SELECT ll_from,ll_lang,ll_title  FROM `langlinks` WHERE ll_from IN ( /* up to 5000 quoted integers */ ) ORDER BY ll_from, ll_lang LIMIT 5001;

-- lllang, no lltitle
SELECT ll_from,ll_lang,ll_title  FROM `langlinks` WHERE ll_from IN ( /* up to 5000 quoted integers */ ) AND ll_lang = '...' ORDER BY ll_from LIMIT 5001;

-- lllang and lltitle
SELECT ll_from,ll_lang,ll_title  FROM `langlinks` WHERE ll_from IN ( /* up to 5000 quoted integers */ ) AND ll_lang = '...' AND ll_title = '...' ORDER BY ll_from LIMIT 5001;

The request here would add these additional possibilities:

-- multiple lllang, no lltitle
SELECT ll_from,ll_lang,ll_title  FROM `langlinks` WHERE ll_from IN ( /* up to 5000 quoted integers */ ) AND ll_lang IN ( /* up to 500 short strings */ ) ORDER BY ll_from, ll_lang LIMIT 5001;

-- multiple lllang and lltitle
SELECT ll_from,ll_lang,ll_title  FROM `langlinks` WHERE ll_from IN ( /* up to 5000 quoted integers */ ) AND ll_lang IN ( /* up to 500 short strings */ ) AND ll_title = '...' ORDER BY ll_from, ll_lang LIMIT 5001;

For the first of these new queries (no lltitle), some test queries against enwiki don't seem too bad. Even in the worst cases, where I convinced it to hit 1570000 rows (all Handler_read_key), it still completes in under 3 seconds.

For the second (with lltitle), an EXPLAIN for some reason takes a long time (15-30 seconds) to give the same plan as for the first. Actually executing the query also takes much longer, even though the Handler stats are the same.

Based on these results, perhaps we could get away with allowing multiple lllang values without lltitle, while only allowing a single lllang when lltitle is used.
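To make that rule concrete, here is a minimal sketch of the parameter check and the resulting query shape. It is not the actual MediaWiki implementation: the function name, the `?` placeholder style, and the default limit are assumptions chosen for illustration, and only the column names and ORDER BY choices are taken from the queries above.

```python
def build_langlinks_query(page_ids, lllang=(), lltitle=None, limit=5000):
    """Build (sql, params) for the proposed rule (hypothetical helper).

    Multiple lllang values are allowed only when lltitle is absent;
    lltitle requires exactly one lllang value.
    """
    page_ids = list(page_ids)
    langs = list(lllang)
    if not page_ids:
        raise ValueError("at least one page ID is required")
    if lltitle is not None and len(langs) != 1:
        raise ValueError("lltitle requires exactly one lllang value")

    where = ["ll_from IN (" + ", ".join(["?"] * len(page_ids)) + ")"]
    params = list(page_ids)
    order = "ll_from, ll_lang"

    if len(langs) == 1:
        # Single language: ll_lang is constant, so ordering by ll_from suffices.
        where.append("ll_lang = ?")
        order = "ll_from"
    elif langs:
        where.append("ll_lang IN (" + ", ".join(["?"] * len(langs)) + ")")
    params.extend(langs)

    if lltitle is not None:
        where.append("ll_title = ?")
        params.append(lltitle)

    # Fetch one extra row so the caller can tell whether to continue.
    sql = (
        "SELECT ll_from, ll_lang, ll_title FROM langlinks WHERE "
        + " AND ".join(where)
        + " ORDER BY " + order
        + " LIMIT " + str(int(limit) + 1)
    )
    return sql, params
```

For example, `build_langlinks_query([1, 2], ["de", "fr"])` produces the multi-language form without lltitle, while passing `lltitle` together with two languages raises an error instead of running the slow query.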