Feature summary (what you would like to be able to do and where):
Search results for Japanese kana queries should be filtered and sorted by whether (or how much of) the query appears as a consecutive substring of the result. Ideally, one or two transpositions, substitutions, deletions or insertions should be forgiven when searching, but the resulting forms (the queries obtained by applying these four transformations once or twice) should still be required to appear "mostly" consecutively within the pages or their titles. "Mostly", because an intermediate kanji should sometimes be forgiven: searching for "ひさし振り" should still yield 久し振り, 久しぶり or ひさしぶり even though the query is not an exact substring of any of them. So when the query and a page share some consecutive runs, that should still count for something.
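To make the request concrete, here is a minimal Python sketch of one way such a ranking could be scored. It is only an illustration and not a description of how the search backend works: the function names (`longest_common_run`, `fuzzy_substring_distance`, `rank_key`) are made up for this ticket, and the edit distance used is the "optimal string alignment" variant, which counts an adjacent transposition as a single edit.

```python
def longest_common_run(query: str, text: str) -> int:
    """Length of the longest run of characters that appears consecutively in both strings."""
    best = 0
    prev = [0] * (len(text) + 1)
    for qc in query:
        cur = [0] * (len(text) + 1)
        for j, tc in enumerate(text, start=1):
            if qc == tc:
                # Extend the run that ended at the previous character of both strings.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def fuzzy_substring_distance(query: str, text: str) -> int:
    """Fewest edits (insert, delete, substitute, adjacent transposition)
    needed to turn the query into some substring of text."""
    m, n = len(query), len(text)
    # d[i][j]: cheapest way to match query[:i] against a substring of text ending at j.
    # Row 0 stays all zeros, so a match may start anywhere in text for free.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if query[i - 1] == text[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if (i > 1 and j > 1
                    and query[i - 1] == text[j - 2]
                    and query[i - 2] == text[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    # Taking the minimum over the last row lets the match end anywhere in text.
    return min(d[m])


def rank_key(query: str, title: str) -> tuple[int, int]:
    """Hypothetical sort key: fewer edits first, then longer shared runs first."""
    return (fuzzy_substring_distance(query, title), -longest_common_run(query, title))


titles = ["あさいち", "いさちる", "ちいさい"]
ranked = sorted(titles, key=lambda t: rank_key("ちいさ", t))
```

With such a key, a cutoff of two edits would implement the "one or two transformations" tolerance, while the shared-run length would still reward partial matches such as "ひさし振り" against 久し振り.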
Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):
When searching for Japanese kana text, the advanced search returns many spurious results that merely contain the requested kana individually somewhere in the article, with no regard for whether they appear consecutively. To give some examples:
When searching for the string "ちいさ" (https://en.wiktionary.org/w/index.php?title=Special:Search&limit=500&offset=0&profile=default&search=%E3%81%A1%E3%81%84%E3%81%95&ns0=1) which is the stem of ちいさい / 小さい, a real word, we observe the following in the search result page:
- 小さい is the first result (good)
- The 3 entries whose titles contain "ちいさ" as a substring appear as the 2nd, ≈250th and ≈800th result respectively. They should all be close to the top.
- Already the 3rd (あさいち) and 4th (いさちる) results do not contain the string "ちいさ" at all, but merely a permutation of its characters.
When searching for the nonsensical string "るよふぃ" (https://en.wiktionary.org/w/index.php?limit=500&offset=0&profile=default&search=%E3%82%8B%E3%82%88%E3%81%B5%E3%81%83&title=Special:Search&ns0=1), we observe that, of the 131 search results, only 7 contain the mora "ふぃ" at all. Note that the small "ぃ" does not stand on its own: it attaches to the preceding character, and the two together represent a single mora. (It is not a combining character in the Unicode sense, but it behaves like one for the purposes of search.) Digraphs of a normal character followed by such a small kana should be given extra weight in terms of consecutiveness.
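The point about "ふぃ" could be handled by grouping each small kana with the character before it prior to comparing query and page. The following Python sketch shows the idea; `SMALL_KANA` and `split_morae` are hypothetical names, and the set of small kana is an assumption for illustration (っ/ッ is deliberately left out because it counts as a mora of its own).

```python
# Small kana that attach to the preceding character (assumed set, not exhaustive).
SMALL_KANA = set("ぁぃぅぇぉゃゅょゎァィゥェォャュョヮ")


def split_morae(text: str) -> list[str]:
    """Split text into units, keeping each small kana glued to the character before it."""
    units: list[str] = []
    for ch in text:
        if ch in SMALL_KANA and units:
            units[-1] += ch  # e.g. "ふ" followed by "ぃ" becomes the single unit "ふぃ"
        else:
            units.append(ch)
    return units


query_units = split_morae("るよふぃ")
```

Matching on these units instead of on single characters means that a page containing ふ and ぃ far apart no longer counts as containing "ふぃ".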
When searching for a long, randomly generated string (https://en.wiktionary.org/w/index.php?search=%E3%81%A0%E3%82%89%E3%81%91%E3%81%AB%E3%81%A9%E3%81%A8%E3%82%80%E3%82%89%E3%81%82%E3%81%8A%E3%81%84%E3%81%BB%E3%81%97%E3%82%8A%E3%81%86%E3%81%BF%E3%81%8B%E3%81%9F+%E3%81%82%E3%82%8B%E3%81%8F%E3%81%82%E3%81%BE%E3%81%84%E3%81%AA%E3%81%A4&title=Special:Search&profile=advanced&fulltext=1&ns0=1), there are still hundreds of results.
Related discussion: https://en.wiktionary.org/wiki/Wiktionary:Grease_pit/2022/July#Search_result_ordering?
Benefits (why should this be implemented?):
Search results that are almost identical to the query are buried under hundreds or even thousands of spurious hits that merely contain a random permutation of the search query. Everybody who looks up Japanese words using kana would benefit from being presented with the near hits first, because those are more likely to be what they were looking for.