Search query length limit in UI is wrong, short for non-Latin alphabets
Open, Medium · Public · BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

  • Open https://en.wikipedia.org/wiki/Special:Search.
  • Copy this (with a space at the end): абвгдеёжзийклмнопрстуфхцчшщъыьэюя (it's the Russian alphabet).
  • Paste into the search form as many times as you can.

What happens?:
At the 4th paste, you will hit a length limit, but it's the wrong one: the input stops at 129 characters, which is exactly 255 bytes (each Cyrillic letter takes 2 bytes in UTF-8, each space takes 1).
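The arithmetic behind those numbers can be checked with a short sketch. It assumes the limit is applied to the UTF-8 encoding of the input; the variable names are illustrative and not taken from MediaWiki.

```javascript
// Sketch: why a 255-byte cap stops the repro string at 129 characters.
// Assumes the cap is measured on the UTF-8 encoding of the input.
const encoder = new TextEncoder();

const alphabet = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя '; // 33 letters + trailing space
const LIMIT_BYTES = 255; // byte limit reported in this task

const charsPerPaste = [...alphabet].length;             // 34
const bytesPerPaste = encoder.encode(alphabet).length;  // 33 * 2 + 1 = 67

// Whole pastes that fit under the cap, then the partial fourth paste.
const fullPastes = Math.floor(LIMIT_BYTES / bytesPerPaste);     // 3
const leftoverBytes = LIMIT_BYTES - fullPastes * bytesPerPaste; // 54
const leftoverChars = Math.floor(leftoverBytes / 2);            // 27 two-byte letters

const totalChars = fullPastes * charsPerPaste + leftoverChars;
console.log(totalChars); // 129
```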

What should have happened instead?:
The real limit is 300 characters (see T107947 and a 305-character query). You can in fact reach it by typing the query into the address bar instead of the search bar.

Other information
I noticed that mw.widgets.TitleWidget is used there, with its 255-byte limit, so perhaps mw.widgets.SearchInputWidget wrongly inherits that limit.

I came across this bug when people started renaming long categories because their names didn't fit in the search query.

Event Timeline

MPhamWMF triaged this task as Medium priority. Dec 6 2021, 4:33 PM

Some odd data points: I can put in 255 1-byte Latin characters (ABCDE, repeating), but only 127 2-byte Cyrillic characters (АБВГД, repeating). I can also put in 126 4-byte smiley faces (😀😃😄😁😆) or 4-byte CJK characters (丽丸𠄢乁你). However, I can only put in 85 3-byte Devanagari (कखगघङ) or Hangul (가각갂갃간) characters. Smiley faces are not a big concern, but Cyrillic, Devanagari, and Hangul are—something fishy is going on.

@DLynch and I looked into this, and it seems to be caused by mw.widgets.TitleInputWidget.cleanUpValue using trimByteLength to enforce the length limit. That is useful for strings that need to be stored in the database, such as titles, but in this case trimCodePointLength would be more appropriate, as is done in mw.widgets.TextInputWidget.cleanUpValue.
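To illustrate the difference, here is a minimal sketch of the two trimming strategies. `trimToBytes` and `trimToCodePoints` are hypothetical stand-ins written for this example, not MediaWiki's `trimByteLength` or `trimCodePointLength`, whose signatures may differ. The 255 and 300 limits are the figures quoted in this task, and treating "300 characters" as 300 code points is an assumption.

```javascript
// Sketch only: stand-in helpers, not the MediaWiki implementations.
const utf8 = new TextEncoder();

// Keep whole code points while the UTF-8 size stays within maxBytes.
function trimToBytes(text, maxBytes) {
  let used = 0;
  let out = '';
  for (const ch of text) { // for...of iterates by code point
    const size = utf8.encode(ch).length;
    if (used + size > maxBytes) break;
    used += size;
    out += ch;
  }
  return out;
}

// Keep at most maxCodePoints code points, regardless of their byte size.
function trimToCodePoints(text, maxCodePoints) {
  return [...text].slice(0, maxCodePoints).join('');
}

const samples = {
  latin: 'ABCDE',       // 1 byte per character in UTF-8
  cyrillic: 'АБВГД',    // 2 bytes per character
  devanagari: 'कखगघङ',  // 3 bytes per character
  hangul: '가각갂갃간',   // 3 bytes per character
};

const results = {};
for (const [name, unit] of Object.entries(samples)) {
  const longInput = unit.repeat(100); // 500 code points, longer than either limit
  results[name] = {
    byBytes: [...trimToBytes(longInput, 255)].length,
    byCodePoints: [...trimToCodePoints(longInput, 300)].length,
  };
}
console.log(results);
// latin:      byBytes 255, byCodePoints 300
// cyrillic:   byBytes 127, byCodePoints 300
// devanagari: byBytes 85,  byCodePoints 300
// hangul:     byBytes 85,  byCodePoints 300
```

With byte-based trimming, the number of characters a user can enter depends on the script; with code-point-based trimming it is the same for every script.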

I can also put in 126 4-byte smiley faces (😀😃😄😁😆) or 4-byte CJK characters (丽丸𠄢乁你)

I think you can put in only 63 of them (tested in Chrome and Firefox); it's just that '😀'.length and '丽'.length each return 2.

Hah! You are correct! The editor I used to count them reported the string lengths incorrectly, too. Apparently this is a common problem. 😆
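For reference, the miscount can be reproduced with a short sketch. It relies only on standard JavaScript behaviour: `.length` counts UTF-16 code units, while spreading a string iterates by code point. The variable names are illustrative.

```javascript
// Sketch: three different "lengths" of the same five-emoji string.
const smileys = '😀😃😄😁😆';

const utf16Units = smileys.length;                          // UTF-16 code units
const codePoints = [...smileys].length;                     // Unicode code points
const utf8Bytes = new TextEncoder().encode(smileys).length; // UTF-8 bytes

console.log(utf16Units, codePoints, utf8Bytes); // 10 5 20

// Under a 255-byte cap, only floor(255 / 4) four-byte characters fit,
// but .length reports twice that number.
const maxFourByteChars = Math.floor(255 / 4);
const reportedLength = '😀'.repeat(maxFourByteChars).length;
console.log(maxFourByteChars, reportedLength); // 63 126
```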