
Add a new keyword to filter pages based on their "length"
Closed, Resolved | Public | 3 Estimated Story Points

Description

A new keyword textbytes should be added to allow filtering pages based on the value of the text_bytes field.
The text_bytes field is populated from Content::getSize(), which is documented as:

Returns the content's nominal size in "bogo-bytes".

What lies behind "bogo-bytes" may remain mysterious in general, but for a wikitext page it is the number of bytes of the wikitext source encoded in UTF-8.
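To illustrate the wikitext case: the size is the length of the UTF-8 encoding, not the number of characters, so non-ASCII text counts for more than its character count. A minimal Python sketch (the helper name utf8_size is hypothetical and is not part of MediaWiki or CirrusSearch):

```python
def utf8_size(wikitext: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of the given text."""
    return len(wikitext.encode("utf-8"))


# Hypothetical sample strings, for illustration only.
ascii_text = "abc"
accented_text = "Zo\u00eb"  # "Zoë": the last character takes 2 bytes in UTF-8

print(len(ascii_text), utf8_size(ascii_text))        # 3 characters, 3 bytes
print(len(accented_text), utf8_size(accented_text))  # 3 characters, 4 bytes
```

This only mirrors the description above for wikitext; other content models may define their size differently.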

The keyword will be usable in the same way as the other numeric keywords we support (see File measure).

  • comparison: textbytes:<1500 or textbytes:>1500 matches all pages with text_bytes less than or greater than 1500, respectively
  • ranges: textbytes:1500,10000 matches all pages with text_bytes between 1500 and 10000
  • exact matches are possible but probably useless, e.g. textbytes:10
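The three forms listed above could be parsed roughly as follows. This is a sketch in Python, not the CirrusSearch implementation (which is PHP); the names ByteRange and parse_textbytes are hypothetical, and the sketch does not decide whether bounds are inclusive or exclusive.

```python
import re
from typing import NamedTuple, Optional


class ByteRange(NamedTuple):
    """Parsed bounds; None means the side is unbounded."""
    lower: Optional[int]
    upper: Optional[int]


# Accepts "<N", ">N", "N" or "N,M" where N and M are non-negative integers.
_VALUE = re.compile(r"(?:(?P<op>[<>])(?P<n>\d+)|(?P<lo>\d+)(?:,(?P<hi>\d+))?)")


def parse_textbytes(value: str) -> ByteRange:
    """Parse the value part of a textbytes: keyword into a pair of bounds."""
    m = _VALUE.fullmatch(value)
    if m is None:
        raise ValueError(f"unsupported textbytes value: {value!r}")
    if m.group("op") == "<":
        return ByteRange(None, int(m.group("n")))
    if m.group("op") == ">":
        return ByteRange(int(m.group("n")), None)
    lo = int(m.group("lo"))
    # A single number is treated as an exact match: lower == upper.
    hi = int(m.group("hi")) if m.group("hi") is not None else lo
    if lo > hi:
        # Assumption of this sketch: reversed ranges are rejected.
        raise ValueError(f"lower bound exceeds upper bound: {value!r}")
    return ByteRange(lo, hi)


print(parse_textbytes("<1500"))
print(parse_textbytes("1500,10000"))
```

Keeping both bounds optional lets one structure represent comparisons, ranges and exact matches alike.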

AC: A search query can be issued that filters based on the number of bytes in the source text (the text_bytes field of documents in Elasticsearch).
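On the Elasticsearch side, a filter like this is typically expressed as a range query on the numeric field. The sketch below builds such a query body as a plain Python dict. The function name text_bytes_filter is hypothetical, the inclusive gte/lte bounds are an assumption, and this is not necessarily the exact query CirrusSearch emits.

```python
import json
from typing import Any, Dict, Optional


def text_bytes_filter(lower: Optional[int] = None,
                      upper: Optional[int] = None) -> Dict[str, Any]:
    """Build an Elasticsearch range query on the text_bytes field."""
    bounds: Dict[str, int] = {}
    if lower is not None:
        bounds["gte"] = lower  # assumption: lower bound is inclusive
    if upper is not None:
        bounds["lte"] = upper  # assumption: upper bound is inclusive
    if not bounds:
        raise ValueError("at least one bound is required")
    return {"range": {"text_bytes": bounds}}


# Pages whose source is between 1500 and 10000 bytes.
print(json.dumps(text_bytes_filter(1500, 10000)))
```

Such a clause would normally sit in the filter context of a bool query, since it restricts the result set without affecting scoring.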

Event Timeline

"maxlen" or "maxbytes" or "max<something>" if we don't want to deal with parsing <, >, or = in the value.

EBernhardson renamed this task from implement some kind of charlengthlesesthan:1500 search keyword to implement some kind of charlengthlessthan:1500 search keyword. Jan 30 2023, 4:47 PM
EBernhardson added a project: CirrusSearch.

I think it should be consistent with what we do for filesize:

Search for file of given size, in kilobytes (kilobyte means 1024 bytes). The syntax is:

filesize:{number} or filesize:>{number} - file with size at least given number
filesize:<{number} - file with size no more than given number
filesize:{number},{number} - file with size between given numbers

Gehel set the point value for this task to 3.

Change 889074 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/CirrusSearch@master] Index the text_bytes field

https://gerrit.wikimedia.org/r/889074

Change 889075 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/CirrusSearch@master] Add new textbytes keyword

https://gerrit.wikimedia.org/r/889075

Sadly, I had to change the Elasticsearch mapping to allow this, so enabling the new textbytes keyword will have to wait for a full re-index after the first patch is merged.

dcausse renamed this task from implement some kind of charlengthlessthan:1500 search keyword to Add a new keyword to filter pages based on their "length". Feb 14 2023, 10:49 AM
dcausse updated the task description.

Change 889074 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Index the text_bytes field

https://gerrit.wikimedia.org/r/889074

Waiting for a re-index to enable this feature.

The re-index is currently running; mappings look as expected for indices that have already completed reindexing. It's up to r, will probably take a few more days, and with any luck we can ship the keyword in next week's train.

The reindex is done for the production wikis; cloudelastic is not yet done, but that should not be a blocker for this patch.

Change 889075 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add new textbytes keyword

https://gerrit.wikimedia.org/r/889075

Thanks for the patches and reindexing work. Can we close this task?

@kostajh sure, @Gehel should take care of closing soon (once the feature is active on group2 wikis, hopefully today)