
Search query length limit in UI is wrong, short for non-Latin alphabets
Open, Medium | Public | BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

  • Open https://en.wikipedia.org/wiki/Special:Search.
  • Copy this (with a space at the end): абвгдеёжзийклмнопрстуфхцчшщъыьэюя (it's the Russian alphabet).
  • Paste into the search form as many times as you can.

What happens?:
At the 4th paste, you will hit a length limit. But it's the wrong one: 129 characters, which is evidently 255 bytes (in UTF-8, Cyrillic characters are 2 bytes each and the spaces 1 byte each).
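
The arithmetic behind the 129-character figure can be checked with a short sketch. It assumes the field enforces a 255-byte UTF-8 limit and never splits a character; the names `paste`, `BYTE_LIMIT` and `byteLength` are made up for illustration and are not taken from the MediaWiki code.

```javascript
// Sketch only: reproduce the 129-character figure from an assumed 255-byte UTF-8 limit.
const paste = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя '; // 33 letters plus a trailing space
const BYTE_LIMIT = 255; // assumption taken from the report, not read from the widget

const encoder = new TextEncoder(); // always encodes to UTF-8
const byteLength = (s) => encoder.encode(s).length;

// Walk the repeated text one character at a time until the byte budget runs out.
let chars = 0;
let bytes = 0;
for (const ch of paste.repeat(4)) {
  const size = byteLength(ch);
  if (bytes + size > BYTE_LIMIT) break;
  bytes += size;
  chars += 1;
}

console.log(paste.length, byteLength(paste)); // 34 67 (characters and bytes per paste)
console.log(chars, bytes);                    // 129 255 (what fits in the field)
```

Three full pastes take 201 bytes, and the remaining 54 bytes hold 27 more Cyrillic letters, which gives 102 + 27 = 129 characters.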

What should have happened instead?:
The real limit is 300 characters, see T107947 and a 305 character query. Actually, you will be able to reach it if you type in the address bar, not the search bar.

Other information
I noticed mw.widgets.TitleWidget is used there with its 255 byte limit. So perhaps this limit is wrongly inherited by mw.widgets.SearchInputWidget.

Came across this bug when people started renaming long categories because their names don't fit in the search query.

Event Timeline

Some odd data points: I can put in 255 1-byte Latin characters (ABCDE, repeating), but only 127 2-byte Cyrillic characters (АБВГД, repeating). I can also put in 126 4-byte smiley faces (😀😃😄😁😆) or 4-byte CJK characters (丽丸𠄢乁你). However, I can only put in 85 3-byte Devanagari (कखगघङ) or Hangul (가각갂갃간) characters. Smiley faces are not a big concern, but Cyrillic, Devanagari, and Hangul are—something fishy is going on.

@DLynch and I looked into this, and it seems to be caused by mw.widgets.TitleInputWidget.cleanUpValue using trimByteLength to enforce the length limit. That is useful for strings that need to be stored in the database, such as titles, but in this case trimCodePointLength would be more appropriate, as is done in mw.widgets.TextInputWidget.cleanUpValue.
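
To illustrate the difference between the two strategies, here is a minimal sketch. The helpers `trimToBytes` and `trimToCodePoints` are hypothetical stand-ins written for this example; they are not MediaWiki's trimByteLength or trimCodePointLength, and the 300 limit is assumed to be counted in code points.

```javascript
// Hypothetical helpers for illustration; these are NOT MediaWiki's implementations.
const utf8 = new TextEncoder();

// Keep whole code points while the UTF-8 size stays within maxBytes.
function trimToBytes(text, maxBytes) {
  let out = '';
  let used = 0;
  for (const cp of text) { // iterates by code point, so surrogate pairs stay intact
    const size = utf8.encode(cp).length;
    if (used + size > maxBytes) break;
    used += size;
    out += cp;
  }
  return out;
}

// Keep at most maxCodePoints code points, regardless of their byte size.
function trimToCodePoints(text, maxCodePoints) {
  return Array.from(text).slice(0, maxCodePoints).join('');
}

const countCodePoints = (s) => Array.from(s).length;

const samples = {
  latin: 'ABCDE',
  cyrillic: 'АБВГД',
  devanagari: 'कखगघङ',
  emoji: '😀😃😄😁😆',
};

for (const [name, unit] of Object.entries(samples)) {
  const input = unit.repeat(100); // 500 code points, longer than either limit
  const byBytes = countCodePoints(trimToBytes(input, 255));          // byte-based limit
  const byCodePoints = countCodePoints(trimToCodePoints(input, 300)); // assumed back-end limit
  console.log(name, byBytes, byCodePoints);
}
```

Under these assumptions the byte-based trim allows 255 Latin, 127 Cyrillic, 85 Devanagari and 63 emoji characters, while the code-point trim allows 300 of each.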

I can also put in 126 4-byte smiley faces (😀😃😄😁😆) or 4-byte CJK characters (丽丸𠄢乁你)

I think you can only put in 63 of them (tested in Chrome and Firefox); it's just that '😀'.length and '丽'.length each return 2.
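
A quick way to see where 126 comes from, as a sketch with illustrative variable names: JavaScript's `.length` counts UTF-16 code units, and each of these characters takes two of them.

```javascript
// 63 smileys, counted three different ways.
const smileys = '😀'.repeat(63);

const utf16Units = smileys.length;                          // UTF-16 code units
const codePoints = Array.from(smileys).length;              // Unicode code points
const utf8Bytes = new TextEncoder().encode(smileys).length; // UTF-8 bytes

console.log(utf16Units, codePoints, utf8Bytes); // 126 63 252
```

So 63 such characters already take 252 bytes, and a 64th would need 256, which is over a 255-byte limit.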

Hah! You are correct! The editor I used to count them reported the string lengths incorrectly, too. Apparently this is a common problem. 😆

OK, now I have run into the length limit myself, but from a different angle. Even if you use English Wikipedia and type Latin characters, the UI length limit and the actual length limit are different (though not by as much: 248 vs 300). So, for power users who are interested in making complex queries using incategory:, intitle: and the like, this difference is confusing. They can hit the UI limit and then have to edit the query in the address bar to get the higher limit.

And it's not only about power users: for example, we can't easily exclude discussion pages from "General Help" searches in English Wikipedia. To do that, you would need to add something hellish-looking like this to your query (which I just mention in the help for regular users). That is already 205 characters, so the user does not have much left.

I believe the allowed query length in the UI should match the allowed query length in the back-end. Having a back-end limit that the UI does not support doesn't make sense.