
Set a hard byte or character limit for queries
Closed, Resolved (Public)

Description

We're currently seeing some incredibly long queries that are impractical to test and run. Find out an appropriate maximum query length and configure cirrus/mediawiki to reject or truncate queries longer than that.

There are two main reasons to limit the maximum query length:

  1. Manual analysis of the queries above a length of a few hundred characters shows almost all of them to be meaningless gibberish. Limiting that improves the quality of our statistics.
  2. Longer queries have a much larger impact on performance. Limiting them improves performance, with little downside considering point 1 about many of them being gibberish.

Many search engines, such as Google, implement extremely aggressive maximum query lengths for the above reasons.

Event Timeline

Ironholds claimed this task.
Ironholds raised the priority of this task from to Needs Triage.
Ironholds updated the task description.
Ironholds subscribed.

@TJones What's your recommendation for the maximum query length here?

Deskana triaged this task as Medium priority. Aug 6 2015, 4:43 PM

FWIW, I tried a query of 461 words in Google, and I got a 400 Bad Request.

Deskana set Security to None.

Everything over 200 characters in my zero-results sample is crap, but I need to check for queries that get results and see if any of them are that long.

TL;DR:
Generous: 256 characters or 400 bytes
Strict: 200 characters or 300 bytes

Is this for enwiki, or all wikis? Is it bytes or Unicode characters? The reason I ask is that counting bytes is a little rough on non-ASCII character sets (particularly Thai).

If it's all wikis and bytes, 500 is a very generous cutoff. 300 would be a bit aggressive.

If it's all wikis and characters, 300 is reasonable and 256 is aesthetically and computationally pleasing.
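To make the difference concrete, here is a minimal sketch (PHP, with purely illustrative strings) of how byte and character counts diverge for non-ASCII text such as Thai, where each character takes three bytes in UTF-8:

```
<?php
// Illustrative only: Thai characters are 3 bytes each in UTF-8, so a
// byte-based limit is roughly three times stricter for Thai queries
// than a character-based one.
$ascii = 'music in berlin';
$thai  = 'ดนตรีในเบอร์ลิน';

// strlen() counts bytes; mb_strlen() counts Unicode characters.
printf( "ASCII: %d bytes, %d characters\n", strlen( $ascii ), mb_strlen( $ascii, 'UTF-8' ) );
printf( "Thai:  %d bytes, %d characters\n", strlen( $thai ), mb_strlen( $thai, 'UTF-8' ) );
```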

The first things I see that look like real queries are around 550 and 275 characters: One is 28 intitle: clauses (557 characters), and the other is 10 phrases OR'd together (273 characters). There are shorter |'d searches in Thai that are fewer characters but lots of bytes.

Pretty much everything over 200 characters is crap, and everything over 100 characters is crap except for these multi-phrase advanced queries. Can we distinguish advanced queries (intitle:, OR, etc)?
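One way to tell them apart, sketched below purely as an illustration (this is not the actual CirrusSearch parser, and the keyword list and function name are my own), is to check for the special keywords and boolean operators before applying a stricter plain-text limit:

```
<?php
// Rough illustration: treat a query as "advanced" if it uses special search
// keywords or OR/AND-joined phrases, so it could be exempted from a stricter
// plain-text length limit.
function looksLikeAdvancedQuery( string $query ): bool {
	return (bool)preg_match(
		'/\b(intitle|incategory|prefix|insource):|"[^"]+"\s+(OR|AND)\s+"/',
		$query
	);
}

var_dump( looksLikeAdvancedQuery( 'intitle:foo intitle:bar' ) ); // bool(true)
var_dump( looksLikeAdvancedQuery( 'just some words' ) );         // bool(false)
```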

The longest prefix search is 256 bytes. That doesn't seem random.

If we want to try being very strict, limit it to 256 bytes and see who complains.

Change 230646 had a related patch set uploaded (by Tjones):
Set hard character limit for searchText queries

https://gerrit.wikimedia.org/r/230646
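For context, a limit like this would normally surface as an ordinary MediaWiki configuration setting that a wiki can tune in LocalSettings.php. A minimal sketch follows; the variable name is an assumption used for illustration, so check the merged change for the actual setting:

```
<?php
// LocalSettings.php -- sketch only. The setting name below is an assumed
// name for illustration; the real name is defined by the patch above.
wfLoadExtension( 'CirrusSearch' );

// Reject full-text search queries longer than this many characters.
$wgCirrusSearchMaxFullTextQueryLength = 300;
```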

Considering that the maximum legal title is 256 bytes (IIRC) and that one can combine prefix and incategory operators, each with an argument approaching that limit, perhaps 300 is a bit low.


Given that we looked at the queries that users are actually running, and found almost no such queries of such length, this would seem to be more of a theoretical problem than a practical one.

I've re-checked my sample. I found 72,261 examples of 150+ byte searches in one day's logs.

When I limited it by characters (i.e., 150+ characters), it dropped to 49,077 examples.

There is one intitle: search over 300 characters (557 characters).

There are nineteen distinct incategory: searches over 300 characters (378-576 characters): 16 are variations on 4 basic searches.

There are five distinct prefix: searches over 300 bytes (395-671 characters): 3 are variations on each other.

There are no additional well-formed queries over 300 characters with AND or OR in them.

Below is a table with the number of queries we'd be excluding per day for a given length limit. So, there are 7,463 queries that are 301 characters or longer.

300: 7463
400: 5241
500: 4020
600: 3142
700: 2478
800: 2025
900: 1590
1000: 1330

Increasing the limit to, say, 600, would allow us to pick up 17 of the 25 special search syntax queries at the cost of processing an additional 4000+ queries of length 300-600 characters.
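For anyone reproducing these numbers, here is a small sketch of the counting (the helper function and sample lengths are made up for illustration): a limit of N excludes every query strictly longer than N characters.

```
<?php
// Sketch: given one day's query lengths (in characters), count how many
// queries a given limit would exclude, i.e. queries strictly longer than it.
function countExcluded( array $lengths, array $limits ): array {
	$excluded = [];
	foreach ( $limits as $limit ) {
		$excluded[$limit] = count( array_filter(
			$lengths,
			static function ( int $len ) use ( $limit ) {
				return $len > $limit;
			}
		) );
	}
	return $excluded;
}

// Example with made-up lengths; the real numbers come from the search logs.
print_r( countExcluded( [ 120, 350, 650, 1200 ], [ 300, 600, 1000 ] ) );
// 300 => 3, 600 => 2, 1000 => 1
```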

Change 230646 merged by jenkins-bot:
Set hard character limit for searchText queries

https://gerrit.wikimedia.org/r/230646

Hi there,

We at Wikimedia Germany developed a gadget [1] for our community that allows users to do subcategory searches, meaning you can search for pages in a category and its subcategories with the help of CatGraph. The gadget works as follows:

We parse the search string for the keyword, e.g. deepcat:Music Berlin, run a JSONP request to get the subcategories of Music from CatGraph, and replace the term deepcat:Music with incategory:id:235486|id:235470|id:260000|id:273452|id:333958... As you can see, the whole search term gets quite long that way. The limit would drastically reduce the number of subcategories we can include.

Currently we are even looking for ways to work around other upper bounds, e.g. the maximum GET parameter length. [2] [3]

[1] https://github.com/wmde/DeepCat-Gadget/
[2] https://phabricator.wikimedia.org/T101984
[3] https://phabricator.wikimedia.org/T105328
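To illustrate why the expanded query grows so quickly, here is a rough sketch of the substitution the gadget performs, using the category IDs from the comment above (the helper code is illustrative PHP, not the gadget's actual JavaScript):

```
<?php
// Illustration only: roughly what the gadget's expansion does to a query.
// The category IDs are the ones quoted in the comment above.
$query     = 'deepcat:Music Berlin';
$subcatIds = [ 235486, 235470, 260000, 273452, 333958 ];

$incategory = 'incategory:' . implode( '|', array_map(
	static function ( int $id ) {
		return "id:$id";
	},
	$subcatIds
) );

$expanded = str_replace( 'deepcat:Music', $incategory, $query );
echo $expanded, "\n";
// incategory:id:235486|id:235470|id:260000|id:273452|id:333958 Berlin

// Each additional subcategory adds roughly 10 characters ("|id:NNNNNN"),
// so 30 subcategories alone already push the query past a 300-character limit.
echo strlen( $expanded ), " characters\n";
```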


Hi!

Do you have any data on:

  1. How many users are actively using this gadget?
  2. How many of the queries that the gadget is running are over the limit?
  3. What percentage of the queries that the gadget is running are over the limit?

Right now there are a lot of options on the table, but we'll need more information in order to make a determination as to what we should do.

Thanks!

Hi!
Thanks for the fast reaction. The gadget was recently developed and it went through one iteration and testing round with the German community so far. We were just about to communicate the updated version to the community this week so that they can start using it and decide whether to include it into the gadget list as default or opt-in. That's why we do not have usage numbers yet - it's just too early.

However, intersection and subcategory search (which are the core features of the gadget) are something the community has strongly wished for for many years. It was one of the most upvoted wishes on the (German) Technical Wishlist. We therefore strongly believe that it will be heavily used.

We estimate that the limit would negatively affect approximately 80-90% of the search queries, thus making the gadget unusable. This is because we think most people tend to search for and combine general terms, e.g. the category "music" (and not some very specific subcategories), which will almost always result in a query for at least 30 (sub)categories and exceed the limit you want to set. As @WMDE-Fisch mentioned, we even intended to raise the limits to optimize the search results for the users in the next iteration.

Bottom line is: if the limit gets deployed, the DeepCat gadget will be of no use to anybody. We would very much appreciate it if we could look for ways to work around this together. Thank you in advance!

As Kasia stated, the gadget is really wanted by our community, and it would be great if we could still give them this feature.

But if the character limit is needed for the above reasons, we could think of ways to keep the gadget working, e.g. allowing longer search terms when certain keywords are included, or introducing an extra parameter that allows bigger search queries.

Some more insight into what the gibberish / 300+ character queries actually are could also give us ideas for how to keep them from polluting search without limiting valid queries.

I hope we can find a way to satisfy both needs here.
Thanks, Fisch

@WMDE-Fisch @KasiaWMDE The only objective way to evaluate this is to gather data on its usage and figure out how much of a problem the query limit is. As we did not know about your impending release and therefore did not factor that into our analysis, I would be happy to increase the query length limit to something not prohibitive for you while you gather the data I requested.

What kind of timeline are we looking at for you to be able to provide the data I requested in my above comment?

Change 231437 had a related patch set uploaded (by Deskana):
Temporarily increase maximum search query length to 2500.

https://gerrit.wikimedia.org/r/231437

Change 231437 merged by jenkins-bot:
Temporarily increase maximum search query length to 2500.

https://gerrit.wikimedia.org/r/231437

@Deskana Thanks! I will gather more info next week, but I can already say that we might need at least 3 months to give you reliable data (among other things, the word needs to be spread among the community first). Also, thank you for raising the limit, but would it be possible to make it 5000 bytes? We want to raise the limits for the gadget, and 5000 bytes would fit what we are planning for (otherwise the data you are requesting would automatically be restricted by the 2500).
And in the meantime: have a great weekend!

I've added @Tfinc to this since I will be out all of next week and he will be standing in for me.

@KasiaWMDE I'm sorry, but given the performance implications of these extremely long queries, we cannot hold off deploying a fix for it for three months. I understand that this must be quite frustrating, but I must weigh the needs of the hundreds of thousands of users of our search against the small handful of users that requested this feature. We can keep the limit of 2500 characters for two or three weeks to gather the basic statistics I requested, but after that we will have to change the limit back to something more restrictive for performance reasons.

Since incategory operations are extremely expensive, chaining a lot of them together will likely cause the exact kind of performance issues that the query length limit is intended to prevent. Have you investigated other ways of solving this use case? The discussion that took place in T105328 sounds to me like an indication that the architecture of the proposed solution needs work; I would be happy to get a search engineer from my team to help you think through alternative solutions, if that would be helpful.


I agree that T105328: [DeepCat] Switch from GET to POST requests is a better path forward here.

There have been a few mentions of performance issues in this task, but I'm not sure they're appropriately substantiated. That can be discussed elsewhere, though, as it's a bit tangential to this task. Even without the performance issues, it's reasonable to have a limit on search input. We already have a number of similarish limits elsewhere (maximum article size, maximum paginated results, etc.).

@Deskana @Tfinc
I work on this gadget and the supporting infrastructure at WMDE. I do not believe we currently have meaningful usage data. The reason is that the gadget is not yet installed in a way that it can be easily enabled for an average user. Only a few people have been trying it out.

We are already now limiting the search depth and number of categories to not exceed the GET request limit. I agree that the gadget solution is not perfect, but other approaches that we looked at, such as creating a new 'flattened' category index in Elastic, turned out not to be feasible.

Switching to POST will not help if the request is capped on the server side.

Users have been requesting recursive category search for years; check out the various tools created for the purpose, like Catscan, Catscan2, Catscan3 and related ones, and how popular they are. This feature is one of the most requested ones in our (German) Technical Wishlist survey. Therefore I think it is very important that we find some way to solve this. One possibility might be to create an exception for a specific user agent that the gadget sets, given that the long nonsense queries you see are likely just random junk rather than deliberate DoS attempts (?).

BTW, the suggestion to implement this feature in this way came from Nik Everett who also helped us with modifications on the Cirrus side.

Creating an exception like this would just allow a third party to take advantage of a loophole and would increase our exposure to DoS.
We need to make sure that at the end of the day we protect the cluster and maintain its performance. I'd like to explore other potential options for how to move forward.

@EBernhardson, when we decided on query size limit did we rule out a query runtime limit?

Well, we're already in the state where a third party can do that, since there isn't _any_ protection at the moment. This is primarily being done for testing reasons, not because of security concerns.

Queries already have a runtime limit, but it is very generous: 20s by default and 40s for regexes.

Change 235140 had a related patch set uploaded (by Deskana):
Decrease maximum query length back to 300.

https://gerrit.wikimedia.org/r/235140

@Deskana: We would really appreciate it if you could hold off on the patch until we find a solution for the DeepCat gadget (as you know, @jkroll is on it). Thank you in advance!


As I stated above, this is about cluster stability and performance. Very long queries have a disproportionate impact on performance and can slow down queries for other users. We've held off on deploying the limit for three weeks already, and I can't hold off for an indeterminate time period while this gadget is improved. I cannot prioritise the request for this gadget from a few users over the stability and performance of the entire search cluster across all Wikimedia wikis.

I'm sorry that the work you've invested into this might be wasted. I wish we knew about it much earlier so that we could have helped you sooner. But, as I said, cluster stability and performance must come first. If you have users that are unhappy with this outcome, then I would be happy to explain it to them.


Your colleagues at WMF that worked with us on this did know about it.

So we discussed two possible solutions here and via email: allowing long queries when the user agent is set to some special value; or allowing a long query when it contains only incategory:id:... keywords. The latter was proposed as an alternative by @EBernhardson. From a user perspective it would not be optimal because it prevents using deepcat searches with other keywords, but it would be better than the gadget not working at all.

@Deskana, which of those do you prefer? If none, do you have any other suggestions?


Your colleagues at WMF that worked with us on this did know about it.

To elaborate on this: it was discussed in detail with @Manybubbles.

When? Nik is great but also hasn't worked for us for a good month.

@Deskana @Tfinc @Ironholds

So what are your opinions on this? Do you think either of these would work? If not, do you see an alternative?


I'm an analyst, not a product manager; I think the user-agent answer is entirely unusable, but the incategory:id approach sounds viable, and I don't have a strong preference there.

Change 235140 merged by jenkins-bot:
Decrease maximum query length back to 300.

https://gerrit.wikimedia.org/r/235140

Change 236195 had a related patch set uploaded (by EBernhardson):
Bypass query length limit for incategory search

https://gerrit.wikimedia.org/r/236195

Change 236195 merged by jenkins-bot:
Bypass query length limit for incategory search

https://gerrit.wikimedia.org/r/236195
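For reference, the idea behind the bypass can be sketched roughly as follows: measure the query length after stripping incategory clauses, so that long chains of incategory:id:... terms do not count against the limit while ordinary text still does. This is only an illustration of the approach (the function name is made up), not the code from the merged change.

```
<?php
// Sketch of the bypass idea only -- not the code from the merged change.
// Apply the length limit to the query *after* removing incategory clauses,
// so gadget-generated chains of incategory:id:... terms are exempt.
function effectiveQueryLength( string $query ): int {
	$withoutInCategory = preg_replace( '/incategory:\S+/', '', $query );
	return mb_strlen( trim( $withoutInCategory ), 'UTF-8' );
}

$gadgetQuery = 'incategory:id:235486|id:235470|id:260000|id:273452|id:333958 Berlin';
var_dump( effectiveQueryLength( $gadgetQuery ) <= 300 ); // bool(true)
```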