
API:Search maxes out at 10000
Closed, Declined · Public

Description

When using API:Search, it returns the first 10000 results and then stops. Reported by 3 editors on the documentation talk page.

https://www.mediawiki.org/wiki/API_talk:Search#Limit

The documentation says nothing about a limit. Is this a feature or a bug?

Event Timeline

Restricted Application added a subscriber: Aklapper. · Oct 3 2017, 4:37 AM
Anomie added a project: CirrusSearch.
Anomie added a subscriber: Anomie.

There's nothing in the API that's limiting things here. Most likely it's an intentional limit in CirrusSearch, but I'll let someone familiar with that extension comment on that.

Restricted Application added projects: Discovery, Discovery-Search. · Oct 3 2017, 1:20 PM
dcausse added a subscriber: dcausse. · Oct 3 2017, 1:30 PM

Yes, this is a limit imposed by Elasticsearch to prevent digging too deeply into the search result set (which is extremely expensive, as it has to keep the result set in memory for sorting).
We could evaluate raising this limit, but in the end there will always be a limit.
We believe that 10000 is a sane trade-off (it is the Elasticsearch default), and I strongly suggest that users who want more results consider using dumps if possible.
Also, I'd be curious to know the use case behind this; maybe we have other tools that could satisfy such use cases?

@dcausse thanks for the info, did not know this.

In my case, the use case is a Unix command-line tool I wrote called wikiget:

https://github.com/greencardamom/Wikiget

It's a Unix command-line interface to some API functions. Example:

./wikiget -a "insource:/en[.]wikipedia[.]org/" -n 3

This will print a list of pages in namespace 3 (talk pages) containing that regex. I could add a warning sent to stderr when it reaches 10000, unless there's another option to go beyond that. Database dump scans are possible, such as with AWB, but they are very slow compared to the API and not possible via wikiget.
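The shell command above maps onto a plain list=search query against the action API. Below is a minimal Python sketch of how one page of that request could be built; the srsearch/srnamespace/srlimit/sroffset/srinfo parameter names come from the standard list=search module, but the helper function itself is illustrative and not wikiget's actual code:

```python
def build_search_params(query, namespace=0, offset=0, limit=500):
    """Build the query-string parameters for one API:Search page.

    srsearch/srnamespace/srlimit/sroffset are the documented list=search
    parameters; srinfo=totalhits asks for the overall hit count.
    """
    return {
        "action": "query",
        "list": "search",
        "format": "json",
        "srsearch": query,
        "srnamespace": namespace,
        "srlimit": limit,    # per-request page size
        "sroffset": offset,  # continuation offset; cannot pass 10000
        "srinfo": "totalhits",
    }

# One page of the example search from the wikiget invocation above:
params = build_search_params("insource:/en[.]wikipedia[.]org/", namespace=3)
```

Each response carries a continuation offset for the next page; paging stops once the offset reaches the backend's 10000-document ceiling, which is where the truncation described in this task occurs.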

@Green_Cardamom thanks for pointing to your tool.
Perhaps instead of duplicating the hard limit on your side, you could check whether the totalhits value you get in searchinfo matches the number of results you've actually fetched?
We may re-evaluate this limit at some point, but it's very unlikely that we'll remove it completely; it would be very easy for malicious API users to overload the search cluster.
We also have cirrus dumps, which may be easier to use, but that involves setting up an Elasticsearch installation on your side, which is probably not what you want.
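The totalhits suggestion can be implemented as a small check after the paging loop finishes: compare searchinfo.totalhits from the final response against the number of results actually collected. A hedged sketch follows; the response shape matches the standard list=search JSON (when srinfo=totalhits is requested), while the function name and warning text are illustrative:

```python
import sys

def warn_if_truncated(last_response, fetched_count):
    """Emit a stderr warning when totalhits exceeds what was fetched.

    `last_response` is the decoded JSON of the final list=search reply,
    which carries query.searchinfo.totalhits when srinfo=totalhits is set.
    Returns True when the result set was cut off by the backend limit.
    """
    total = last_response["query"]["searchinfo"]["totalhits"]
    if fetched_count < total:
        print(f"warning: search reported {total} hits but only "
              f"{fetched_count} could be fetched (backend limit)",
              file=sys.stderr)
        return True
    return False
```

This avoids hard-coding 10000 on the client side: if the backend limit is ever raised or lowered, the check still reports truncation correctly.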

Agreed that checking totalhits against the total results is a good idea anyway. Understood about the limit and why it exists. I'll add something to the API doc page so future editors are aware. I didn't know about the cirrus dumps; that's probably more than I want to do, but I'll keep it in mind for the future.

The other thing you can do when working with external links is to use the linksearch feature (be sure to check for both the http and https versions of the URL). I've been using it for years, and it should be a lot more reliable than using the search index.
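For the external-links case specifically, linksearch is exposed in the action API as list=exturlusage, and covering both protocols is just two queries. A sketch of the parameter sets, using the euquery/euprotocol/eulimit/eunamespace names from the standard exturlusage module; the helper function is illustrative:

```python
def build_linksearch_params(domain, protocol, namespace=None, limit=500):
    """Parameters for one list=exturlusage request.

    euquery matches the URL (typically a domain); euprotocol selects
    http or https, which is why both variants must be queried.
    """
    params = {
        "action": "query",
        "list": "exturlusage",
        "format": "json",
        "euquery": domain,
        "euprotocol": protocol,
        "eulimit": limit,
    }
    if namespace is not None:
        params["eunamespace"] = namespace
    return params

# Check both http and https variants, as advised above:
requests_to_make = [
    build_linksearch_params("en.wikipedia.org", proto, namespace=3)
    for proto in ("http", "https")
]
```

Unlike list=search, this walks the externallinks table directly rather than the search index, which is why it is not subject to the 10000-result search ceiling discussed here.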

debt closed this task as Declined. · Oct 3 2017, 5:46 PM
debt added a subscriber: debt.

Closing this as declined, since changing the API limit is probably not something we want to do, and there are a few workarounds for @Green_Cardamom to look at.

@Betacommand - yes wikiget supports linksearch. I was using a link in the example search but it could be anything.
@debt - from what I understand the limit is an Elasticsearch configuration unrelated to the API. There aren't really any API workarounds, except a local install of Elasticsearch and fresh copies of the index (or the dump).

debt added a comment. · Oct 4 2017, 4:04 PM

from what I understand the limit is an Elasticsearch configuration unrelated to the API. There aren't really any API workarounds, except a local install of Elasticsearch and fresh copies of the index (or the dump).

Correct, and because of the issues @dcausse mentioned in T177270#3653657, we won't be extending this limitation, due to the very valid concern about overloading the server clusters.

we won't be extending this limitation

Right, I said "Understood about the limit"; the limit is understood. Just clarifying, since you initially said it was an "API limit": the limit comes from the search engine itself, not the API, and there are no API workarounds.

How about limiting access to >10000 results to those with the bot flag?

@Headbomb I'm not sure it's a great idea, but to at least think about it: how many results do you think you need? "All of them" is unfortunately not possible, as it would mean returning tens of millions of results from the search shards to the coordinator.

@EBernhardson Personally, I don't run into this issue; I'm just offering a possible "compromise": increase that limit for trusted users (i.e. bots). Those searches could be allowed up to 100,000 / 1,000,000 / 10,000,000 results, or whatever makes sense, perhaps with a limit of one such query per 10 / 100 / 1000 seconds, or whatever makes sense.

At least this limitation should be mentioned in the API description.

Generally, I wonder why there are no limits for search via the website but there are for the API. With some effort, one could easily use that to get the complete results.

Agreed on the documentation.
The limit applies to the website as well.

Generally, I wonder why there are no limits for search via the website but there are for the API.

This limit is imposed by the search backend whether it's being accessed via the web UI or the action API. The action API imposes no limitation here, as I already said in T177270#3653483.

At least this limitation should be mentioned in the API description.

To be able to put it into the API's auto-generated documentation, first someone would have to add a method to the search backend interface to fetch that limit. Feel free to open a new task for that.