Sortable search results
Open, Low | Public | 5 Estimated Story Points | Feature

Description

The search results show some data about each article. It would be great to be able to sort the results by that data, especially by date, but size and alphabetical order would also be useful.


Version: unspecified
Severity: enhancement
See Also:
T18237: Sort results by date
T64879: CirrusSearch: Allow users to search for pages modified "recently"

Event Timeline

@TheDJ Note the original request (and the request I made last year) referred to article space.

With that in mind, can you please provide further details on this statement: "Implementing sort filters (both backend and in UI) is going to be costly."

Just a small note concerning backend limitations.

From the backend perspective a real sort is going to be costly. What we usually do is a partial re-ranking. Re-ranking only the top N results lets us control how many pages we sort, which prevents huge sort operations that could consume a non-negligible share of cluster resources or run forever.
I believe that in practice a partial re-ranking is sufficient for most queries. As of today we re-rank the top 8196 results per shard (a shard is a partition). For instance, English Wikipedia runs with 7 shards, so (assuming the shards are well balanced) the sort operation is partial for any query that returns more than 8196*7 = 57372 results.
In short we can't really make the promise that the sort operation will be exact, but I believe that we can come up with a reasonable approach that would work for many cases.
This is why I prefer the sort + filter option over the sort-only option: if a given query returns too many results to be sorted accurately, one could set a time filter to restrict the sort operation to pages modified in the last X days.
More broadly, our search engine is not a DB. We can make it look like a DB with many search options and filters, but we have to keep in mind that there will always be trade-offs (frequently for perf reasons) that make the results less than 100% accurate in some cases.
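
The arithmetic and the effect of the per-shard window described above can be illustrated with a toy model. This is only a sketch: the function names, the hit dictionaries, and the use of a relevance score to pick each shard's window are assumptions made for illustration, not a description of how CirrusSearch is implemented. The figures 8196 and 7 are the ones quoted in the comment.

```python
import heapq

RESCORE_WINDOW_PER_SHARD = 8196  # per-shard window quoted in the comment
SHARD_COUNT = 7                  # English Wikipedia shard count quoted in the comment


def exact_sort_limit(window_per_shard: int, shards: int) -> int:
    """Largest result count for which a per-shard top-N window still covers
    every hit, assuming the hits are spread evenly across the shards."""
    return window_per_shard * shards


def partial_sort(shard_hits, window, sort_key):
    """Toy model of a partial sort (hypothetical, for illustration only).

    Each shard keeps only its `window` best hits by relevance score; the
    kept hits are then merged and sorted by `sort_key`, largest first.
    Hits outside a shard's window never reach the final sort.
    """
    kept = []
    for hits in shard_hits:
        kept.extend(heapq.nlargest(window, hits, key=lambda h: h["score"]))
    return sorted(kept, key=sort_key, reverse=True)
```

With these definitions, `exact_sort_limit(RESCORE_WINDOW_PER_SHARD, SHARD_COUNT)` gives the 57372 threshold mentioned above. Calling `partial_sort` with a window smaller than a shard's hit count shows how a recently edited but low-scoring page can be missing from a "sort by date" result, which is the inexactness the comment warns about.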

Implementing sort filters (both backend and in UI) is going to be costly.

Sure. As a volunteer, for me anything that takes more than a week of one person's time tends to translate into a problem that is costly.

But a quick summary of the actual costly elements that I can identify as a volunteer developer would be:
1: As far as I know (and I might be mistaken), we currently have no filtering at all. That means that this 'new' information needs to be passed through various layers of programming interfaces, without breaking existing stuff.
2: We have no idea about the performance cost and the impact this might cause. See also @dcausse's comment, which already describes more than I suspected. :)
3: It's going to require a UI change. Any UI / workflow change has proven to be HIGHLY costly over the last couple of years for WMF. (aka, we have a critical audience)
4: Will require very active engagement with the community through liaisons etc.
This assumes a very limited implementation of 'last edited' and 'alpha sorting'. It does not include things like 're-indexing all content'. For instance, people assume that metadata that is presented after search is actually available during search, but it might not (yet) be available. A reindex of all content takes, if I remember correctly, on the order of months, not days (for English Wikipedia). Nor does it take into account any increase in search size, adding old revisions etc.

And it presumes the people who can do this are available to work on it to begin with, which is not a given. The biggest reason things often don't happen, is because people are simply stuck working on more important things (for instance keeping search running to begin with) or we might have to wait 3 months before they become available, or a year before they are recruited.

At the scale of Wikipedia/Wikimedia, almost anything becomes complex and complexity is expensive. I'm confident that this particular solution will take months of work by multiple people.

@Jytdog I want to clarify a few things here if you don't mind.

Polarised starting points to simplify the next discussion:
1: Developers don't understand the editing process
2: Editors don't understand the intricacies of our technology stack

What you see in the discussion above (and what you describe) is an assortment of desires and wishes (usecases) mixed with technology solutions chosen by editors and developers. This caused a logical response by developers, of evaluating these two, and developers telling editors: 'the solution you picked is not actually solving your problem'. Consequently, developers either have ignored the original usecases, or come up with alternative solutions, which have proven to be prohibitive as well for a multitude of reasons (at this point in time).

A short summary:
1: Yes, having sort filters on the results would be very nice
2: Implementing sort filters (both backend and in UI) is going to be costly.
3: As such, priorities will have to be set. By asking to show usecases that can be met, developers (and managers in this case) try to ascertain if these costs will be offset by potential gain/benefits.
4: Editors present several usecases, but the usecases don't match the technical solution detailed in this ticket. (at least not in an effective way)
5: Inertia sets in, and frustration builds.

Your own use case is a perfect example: Talk archives.
1: We only index the contents of the current version of a page (to index everything would be an increase of 21x of the current infrastructure, which would likely also add some logarithmic increase in complexity of the overall process and technology). This is expensive and still several years out (We are NOT a search company, anything we have will be some 10-15 years behind what Google is able to do).
2: As such any notion of date and 'recent' is only as recent as the last edit to a page
3: Talk archives are edited all the time (and thus their recent indicator that we currently have isn't useful to determine their age).
4: So while your use case is valid, the chosen technical solution detailed in this ticket probably won't help you solve this particular problem in any foreseeable timeframe.

Volunteer time is the lifeblood of Wikipedia and the search engine wastes it.

I think we all agree that volunteer time is valuable. But that doesn't negate the cost of a potential solution.

I have always thought it would be amazing if there was a way to get the diff underlying a comment in an archive by right clicking or something. I imagine that is impossible, but wanted to note that.

That would be amazing, but also non-trivial since a comment can be made up of multiple diffs, pages can be copy pasted and moved etc etc.. There are some tools like wikiblame, which do this to some degree, but to roll those out more widely is probably not realistic at this time. But it has nothing to do with this ticket.

I think the general answer to this is still: You need structured discussions if you want to do that (aka, something based on similar principles as LQT/Flow).

But I can't imagine how daunting and off-putting the process of finding and linking to, or getting diffs for, archived comments is, for new editors who are interested in joining the community. I reckon this is one reason why new editors so often fail to successfully raise issues on behaviour noticeboards, where diffs are essential, and complain that community noticeboards are byzantine.

I think developers and editors alike agree that our discussion system is not targeted at including newcomers. But again, that has little to do with this particular ticket.

One of the problems that I often see in discussion tickets like these is the mixing of usecases (desires) and technical solutions (abilities/possibilities) within the same ticket. I personally think it would be much better to have usecases and technical solutions in separate tickets for these larger problems, for the sole reason that otherwise you get this back-and-forth between the two that is not constructive to solving problems (especially because the audiences are so different as well).

Regardless, I think that @MZMcBride's summary is really clear and correct (because he understands enough about both sides of the problems, to know what to exclude from his statement). We have 'recentism/last edited' information, we have the technology to order by that parameter, we have an advanced tab in the UI, so it should be possible to expose this information. And that is actually exactly what the German team is considering working on: T143310: Implement a way to access keywords such as "incategory", "intitle" via the Search special page and T154911: Redesign the special:search page (Session). But in the strictest sense, that is just a subset of the request that this ticket started with and the usecases mentioned within it, which likely would require indexing the entire history of each and every page.

In order to solve YOUR problem, with the current technology, we should probably look in the direction of: "Is there some way that we can special case a subset of Talk pages in the search engine, to become more discoverable and useful to editors trying to find older discussion, without blowing up our current search solution". I'm thinking of a way to put a special parser key or something on an archive page/template (as we do for disambiguation pages), like for instance {{#TALKARCHIVE|index|date-range}}. But that would be a totally different ticket, because it's not directly related to this particular technical solution that this ticket deals with. I welcome anyone to create more tickets :)

Look I have no idea what this "phabricator" thing is actually for, what the norms are here, etc. I am not part of your world.

It is obvious that there is a long history of bad communication between editors and people who work on code. This thread is another example, I clearly did this wrong and posted things that wasted your time and are off topic.

I wanted to communicate to folks who work on code, that the search engine sucks for finding things in archives, which is an important thing for working in community.

I only mentioned the "newcomers" thing because I understand that some of the people who work on code are WMF employees, and making WP easier to navigate for newcomers is something I understand is important to WMF - so that is a hook you all could use if you need to convince your bosses to give you more resources to address this. Guess that was a waste of time for me to even think about.

The search engine sucking is not MY problem and I can't believe that was even stated. Out of here.

Wow... like I said, two different worlds colliding and clearly not understanding each other. I did my best to communicate that this is a complicated process and you clearly communicated "your complexity is not my problem" I guess.

The search engine sucking is not MY problem and I can't believe that was even stated. Out of here.

Hmm, I wonder if this means that Jytdog interpreted my line of "In order to solve YOUR problem" as referring to the search engine, whereas it was of course referring to his description of talk archives, and was meant precisely to set it apart from the complexity of the other search engine problems. So hard to communicate online at times.. :(

@Jytdog: Regarding the initial version of your comment T40403#2982342: There is no reason to get confrontational or even personal. Please respect https://www.mediawiki.org/wiki/Phabricator/Etiquette - thanks!

As I mentioned on-wiki, I hope that we can all work collaboratively together. So far, this hasn't really been the case here at all. I applaud @TheDJ and @dcausse for remaining level-headed throughout this, and explaining the difficulties. I also thank @CKoerner_WMF for correcting me and pointing out I was wrong about Google not having strict date sorting in a level-headed, rational, and kind manner; people defaulted to hostility rather than explaining to me calmly I was incorrect, which is disappointing.

I stand by my comments about this potentially degrading the experience for readers if done wrong; there is a reason we have relevance algorithms rather than sorting things by date, and it's to give people the most optimal results. The feature is buried deep in Google's UI to avoid people inadvertently triggering it for this very reason. This does not mean the feature does not have use cases or value for people, or that the Search Team should not work on it. Three years ago, in T40403#477041, I said that it did mean we should not work on it and that we should never add such a feature to the UI, and in retrospect I believe that I was wrong. Sorry about that. People change their minds over time. However, finding an appropriate solution that balances the needs of different user groups is costly (as @TheDJ eloquently explains in T40403#2981844), which affects prioritisation.

In summary, this is something that I think is useful for some users, but the use cases are smaller relative to the greater mission of improving relevance for readers, so this is not prioritised for the Search Team in Discovery to work on right now. We'd like to take a good long look at power user features like this someday, but to be clear that is not happening soon.

I am going to try to get help to solve this through a user script. I'll report back with a link to it, perhaps the usefulness can be seen then.

For what it's worth, the IFLA Statement of Principles 7.2 state:

When searching retrieves a large number of [results], results should be displayed in some logical order convenient to the [search] user [...]. The user should be able to choose among different criteria: date of publication, alphabetical order, relevance ranking, etc.

Hi @Ainali - were you able to solve this issue with a user script? If so, we might be able to put that information somewhere for others to use.

In the API this is now partially working!

@EBernhardson How complicated is it to make these URL parameters available for the default website search? (ideally for eventual inclusion in the Advanced Search interface)

@Quiddity I already hacked in a query parameter, sort, that is usable in Special:Search. I'm not sure how much work it would be to wire up AdvancedSearch with this, but there is a ticket already: T197525

@Quiddity I already hacked in a query parameter, sort, that is usable in Special:Search. I'm not sure how much work it would be to wire up AdvancedSearch with this, but there is a ticket already: T197525

Fabulous! Thanks. I've mentioned it in the docs, and will mention it explicitly in that other task.
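
For readers who want to try the two entry points mentioned above, here is a small sketch that only builds the URLs and makes no network request. The `sort` parameter on Special:Search comes from the comment above; the `srsort` parameter of the `list=search` API module and the example value `last_edit_desc` are taken from the MediaWiki/CirrusSearch API documentation as I understand it and should be checked against the wiki's own API help page. The helper names and the default host are hypothetical.

```python
from urllib.parse import urlencode


def api_search_url(query, sort="last_edit_desc", host="en.wikipedia.org"):
    """Build an action API search URL that asks for a non-default sort order.

    `srsort` and its value are assumptions to verify on the target wiki.
    """
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srsort": sort,
        "format": "json",
    }
    return f"https://{host}/w/api.php?{urlencode(params)}"


def special_search_url(query, sort="last_edit_desc", host="en.wikipedia.org"):
    """Build a Special:Search URL using the `sort` query parameter
    described in the comment above."""
    params = {"title": "Special:Search", "search": query, "sort": sort}
    return f"https://{host}/w/index.php?{urlencode(params)}"
```

Opening the result of `special_search_url("talk archive")` in a browser is a quick way to see whether a given sort value is accepted by a particular wiki.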

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:14 AM
Aklapper removed a subscriber: bzimport.

Sorting by alphabetical would be great.

@karl-police: See previous comments here why that is not technically feasible.

@karl-police: See previous comments here why that is not technically feasible.

which one? and does it talk about alphabetical sort order?

Sorting by alphabetical would be great.

If what you're trying to do is something like "give me all the pages that contain x, and order them alphabetically" then you're better off using a query engine like Quarry. Then you have even more power than search can provide you. The site search is mostly intended for simple user searches, with some advanced syntax thrown in to serve use cases before easily accessible query engines like Quarry existed.

Gehel set the point value for this task to 5.

Copying comment from merged task:

I suspect the underlying technology is now sufficient to support alphabetical sorts (although we would have to evaluate it to be sure). The main sticking point in Cirrus today is going to be that the way we define keyword fields does not allow doc_values to be enabled. We would need to migrate all the existing fields with the keyword tokenizer to instead use the keyword type along with normalizers, which then allows us to enable doc_values on appropriate keyword fields. Once the index mapping is in place the new sort is only a few lines of configuration in Cirrus.
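
As a rough illustration of the migration described above, the following sketch builds an index body in which a sortable field uses the `keyword` type with a normalizer and `doc_values` enabled, plus the matching sort clause. The field name `title_sort`, the normalizer name `sort_normalizer`, and the choice of `lowercase` and `asciifolding` filters are hypothetical; the actual CirrusSearch mapping and normalizer chain may differ.

```python
def keyword_sort_index_body(field="title_sort", normalizer="sort_normalizer"):
    """Return a minimal index body for a sortable keyword field.

    A normalizer (unlike an analyzer with the keyword tokenizer) can be
    attached to a `keyword` field, and keyword fields support doc_values,
    which is what sorting reads from. All names here are hypothetical.
    """
    return {
        "settings": {
            "analysis": {
                "normalizer": {
                    normalizer: {
                        "type": "custom",
                        "filter": ["lowercase", "asciifolding"],
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                field: {
                    "type": "keyword",
                    "normalizer": normalizer,
                    "doc_values": True,
                }
            }
        },
    }


def sort_clause(field="title_sort", order="asc"):
    """Return the `sort` part of a search request for the field above."""
    return {"sort": [{field: {"order": order}}]}
```

The point of the sketch is the shape of the change: once a field like this exists in the index, the sort itself is a short clause, which matches the remark that the new sort is only a few lines of configuration.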

Change #1197705 had a related patch set uploaded (by DCausse; author: DCausse):

[search/extra@master] Add truncate_norm

https://gerrit.wikimedia.org/r/1197705

Change #1198210 had a related patch set uploaded (by DCausse; author: DCausse):

[search/extra@master] Make the term_freq token filter normalizer compatible

https://gerrit.wikimedia.org/r/1198210

Change #1197705 merged by jenkins-bot:

[search/extra@master] Add truncate_norm

https://gerrit.wikimedia.org/r/1197705

Change #1198210 merged by jenkins-bot:

[search/extra@master] Make the term_freq token filter normalizer compatible

https://gerrit.wikimedia.org/r/1198210

Change #1202084 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/CirrusSearch@master] Simplify natural sort settings

https://gerrit.wikimedia.org/r/1202084

Change #1202084 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Simplify natural sort settings

https://gerrit.wikimedia.org/r/1202084

Echoing the wish in T403775 (merged here) for sort by page name.

Change #1205130 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/mediawiki-config@master] cirrus: index field to sort on title

https://gerrit.wikimedia.org/r/1205130

Change #1198219 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/CirrusSearch@master] Make keyword fields actual opensearch keyword fields

https://gerrit.wikimedia.org/r/1198219

Change #1198219 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Make keyword fields actual opensearch keyword fields

https://gerrit.wikimedia.org/r/1198219

Change #1205130 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: index field to sort on title

https://gerrit.wikimedia.org/r/1205130

Mentioned in SAL (#wikimedia-operations) [2025-11-19T08:04:11Z] <dcausse@deploy2002> Started scap sync-world: Backport for [[gerrit:1205130|cirrus: index field to sort on title (T40403)]]

Mentioned in SAL (#wikimedia-operations) [2025-11-19T08:09:32Z] <dcausse@deploy2002> dcausse: Backport for [[gerrit:1205130|cirrus: index field to sort on title (T40403)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-11-19T08:17:52Z] <dcausse@deploy2002> Finished scap sync-world: Backport for [[gerrit:1205130|cirrus: index field to sort on title (T40403)]] (duration: 13m 42s)