Sortable search results
Open, Low, Public

Description

When getting search results, it is possible to see some data about each article. It would be great to be able to sort the articles based on this data, especially by date, but sorting by size and alphabetically would also be useful.


Version: unspecified
Severity: enhancement
See Also:
T18237: Sort results by date
T64879: CirrusSearch: Allow users to search for pages modified "recently"

Details

Reference
bz38403

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 1:11 AM
bzimport added a project: CirrusSearch.
bzimport set Reference to bz38403.
bzimport added a subscriber: Unknown Object (MLST).
Ainali created this task. Jul 14 2012, 7:26 PM

Oh, I meant especially on 'date'.

[Merging "MediaWiki extensions/Lucene Search" into "Wikimedia/lucene-search2", see bug 46542. You can filter bugmail for: search-component-merge-20130326 ]

May be worth including in CirrusSearch.

Would need some product/design team involvement mostly because sorting on anything other than score is likely to give the user garbage results. Better might be allowing the user to deem recently changed articles more important and give them a boost. This will bring more recent articles before those that would otherwise better match. The problem is, any other requested sort key would need similar design.
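A recency boost of the kind suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `boosted_score`, the 30-day half-life and the 0.5 weight are invented here and are not taken from CirrusSearch.

```python
import math


def boosted_score(relevance, age_days, half_life_days=30.0, weight=0.5):
    """Blend a text relevance score with an exponential recency boost.

    All parameters are hypothetical; a page edited today gets the full
    boost, and the boost halves every `half_life_days`.
    """
    recency = math.exp(-math.log(2) * age_days / half_life_days)
    return relevance * (1.0 + weight * recency)


# Made-up results: one strong but stale match, one weaker but fresh match.
results = [
    {"title": "Old but relevant", "relevance": 10.0, "age_days": 900},
    {"title": "Fresh and decent", "relevance": 8.0, "age_days": 1},
]

ranked = sorted(
    results,
    key=lambda r: boosted_score(r["relevance"], r["age_days"]),
    reverse=True,
)
for r in ranked:
    print(r["title"])
```

With these made-up numbers the fresher page moves ahead of the better text match, which is the trade-off a product/design decision would have to tune.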

misc2006 wrote:

(In reply to Nik Everett from comment #4)

Would need some product/design team involvement mostly because sorting on
anything other than score is likely to give the user garbage results.

I don't see how sorting by date (i.e. last modified date) can give garbage results if the user explicitly chooses to sort by date. Similar for title and size. Technically, this should be easy to implement, as ElasticSearch should be able to sort on any field in the documents in its index.
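For illustration, a request body that asks Elasticsearch to sort on a document field could be built as below. This is a sketch only: the field names `text` and `timestamp` and the helper `build_search_body` are assumptions made for the example, not the actual CirrusSearch mapping or code.

```python
import json


def build_search_body(term, sort_field=None, descending=True):
    """Build an Elasticsearch search body, optionally sorted on a field.

    The "text" field name is an assumption for this example.
    """
    body = {"query": {"match": {"text": term}}}
    if sort_field is not None:
        order = "desc" if descending else "asc"
        # Sort on the requested field first, then break ties by relevance.
        body["sort"] = [{sort_field: {"order": order}}, "_score"]
    return body


# "timestamp" is a hypothetical last-modified field.
print(json.dumps(build_search_body("languagetool", "timestamp"), indent=2))
```

Without a `sort_field` the body is an ordinary relevance-ranked query, so the default behaviour stays unchanged.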

What is the use case for wanting to order search results by last modified date? Knowing that would help us figure out what the best solution to that problem is.

misc2006 wrote:

Use case: I'm the author of LanguageTool (a style and grammar checker) and I'd like to know how people use it for Wikipedia. As this information can be in any namespace I can only find it with a search. But there are 50 or so matches, and I run this search (just "languagetool") regularly. Thus without a sort by date, I have to walk through all the matches to find the new ones.

Thanks for the explanation, but I'm a little unclear on how you're using CirrusSearch to find individual edits. Could you provide me with the URL of the search query you use to do this?

misc2006 wrote:

I don't - it would be great of course if that was possible. I'm using https://de.wikipedia.org/w/index.php?title=Spezial:Suche&search=languagetool&fulltext=Suche&profile=all&redirs=1 which is good enough, only that I'm too lazy to look through all the results as most are old. Having them sortable I'd only have to look at the first 2-5 or so results.

Well, my use case was simpler. I am an engineer. I see search results with meta data. I want to sort it based on the meta data to see what pops up. It would be awesome to find outliers, such as pages referring to new stuff but with old dates or extremely large or small pages. It would also be great to create an alphabetical list based on a search for use on an edit-a-thon.

(In reply to Daniel Naber from comment #9)

I don't - it would be great of course if that was possible. I'm using
https://de.wikipedia.org/w/index.php?title=Spezial:
Suche&search=languagetool&fulltext=Suche&profile=all&redirs=1 which is good
enough, only that I'm too lazy to look through all the results as most are
old. Having them sortable I'd only have to look at the first 2-5 or so
results.

Search fundamentally can't satisfy your use case. Say, for example, someone added "LanguageTool" in a JS file in their user space five years ago, then made a copyedit to a comment yesterday; in this case, the search would show up with this as the most recent result.

A better thing for you to do would be to make the LanguageTool insert some comment into the edit summary of anyone who used it (e.g. "Copy edited using LanguageTool"), then query a database dump to find what you're looking for. [[WP:AWB]] does something similar to this.

(In reply to Jan Ainali from comment #10)

Well, my use case was simpler. I am an engineer. I see search results with
meta data. I want to sort it based on the meta data to see what pops up. It
would be awesome to find outliers, such as pages referring to new stuff but
with old dates or extremely large or small pages.

You've told me you want to sort by date because you want to sort by meta-data. That isn't a use case, because it doesn't actually help me understand the problem you're trying to solve.

It would also be great to
create an alphabetical list based on a search for use on an edit-a-thon.

Due to the way search works, for most queries this would generate a list of articles with nothing in common. For example, searching for [[Barack Obama]] and sorting alphabetically would generate nonsensical results such as [[Aaron McGruder]], an American cartoonist, followed shortly thereafter by [[Abbottabad]], a city in northeastern Pakistan.

In this case, browsing a category would be a better idea if you're looking for related articles. In fact, for editathons, I've personally found much more success by giving participants a list of articles where the topic is notable by definition but the articles don't exist. For example, a Fellow of the Royal Society is unquestionably notable (see point 3 in [[Wikipedia:Notability_(academics)#Criteria]]), yet many Fellows do not have articles.

misc2006 wrote:

I know I might get false positives. But still, having 'sort by date' means I only have to look at the first few results instead of walking through a list of 50 matches. Searching a database dump doesn't seem like a viable alternative, considering how huge it is and how much time it would take to set this up.

(In reply to Dan Garry from comment #11)

You've told me you want to sort by date because you want to sort by
meta-data. That isn't a use case, because it doesn't actually help me
understand the problem you're trying to solve.

Yes, you are right that this is not the problem I am trying to solve. My trouble is that I become very frustrated when I see that the search results are presented in a way that I do not expect on a website in 2014. I expect to be able to sort results in some ways, and I am given visual clues that the underlying data to perform the sorting is there.

As an editor, what this could actually be used for is finding outliers or odd articles, as I explained before, without having to rely on external tools like catscan.

Adding in a feature to sort by date or alphabetically by title will, for the reasons explained above, result in degraded performance for the vast majority of users. It's for this reason that search engines like Google don't allow you to sort by date or alphabetically by title; it degrades the quality of the service. I'm WONTFIXing this bug accordingly, as I cannot justify adding features to CirrusSearch that degrade the experience for the vast majority of its users.

That said, there may be a use case in here that does make sense, namely the ability to discard results from the search that have not been edited "recently" (for some definition of "recently"). It's possible to do that in such a way that the scoring algorithm isn't totally ignored, and therefore may not degrade user performance. See bug 62879.
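A minimal sketch of such a recency filter, assuming each hit carries a last-edited timestamp, is shown below. The helper `filter_recent` and the `last_edited` key are hypothetical names for this example; the point is that hits are only discarded, so the order produced by scoring is preserved.

```python
from datetime import datetime, timedelta


def filter_recent(hits, now, max_age):
    """Drop hits not edited within `max_age`, keeping the score order."""
    cutoff = now - max_age
    return [hit for hit in hits if hit["last_edited"] >= cutoff]


# Made-up hits, already ordered by relevance score (best first).
hits = [
    {"title": "Best match", "last_edited": datetime(2013, 1, 5)},
    {"title": "Second match", "last_edited": datetime(2014, 3, 18)},
    {"title": "Third match", "last_edited": datetime(2014, 3, 1)},
]

recent = filter_recent(hits, now=datetime(2014, 3, 20), max_age=timedelta(days=30))
print([hit["title"] for hit in recent])
```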

misc2006 wrote:

The reason Google doesn't offer sort by date is that they usually don't even know the date of last modification. The reason Google doesn't offer alphabetical sort is that they are indexing the web, not an encyclopedia. Anyway, I'm looking forward to a fix for bug 62879.

(In reply to Dan Garry from comment #14)

Adding in a feature to sort by date or alphabetically by title will, for the
reasons explained above,

I am sorry, but you have not stated any reasons at all. In what way will this degrade performance for users? My wish is that the search is presented as it is today, with the option to sort it *after* the first results have been shown. So for users not wanting to sort it, there will be no difference at all.

Here is another use case one user came up with during a discussion. You search for a term and you want to fix something in all these articles. By being able to sort it by last modified, all the ones you just fixed will go to the end of the list and it is easy to see what is left to do.

(In reply to Daniel Naber from comment #15)

The reason Google doesn't offer sort by date is that they usually don't even
know the date of last modification.

That's incorrect. I will unashamedly admit that I got the idea for bug 62879 from looking through Google's search settings to see how they address this problem. The definitions of "recent" that they allow are anytime, past 24 hours, past week, past month, and past year.

The reason Google doesn't offer
alphabetical sort is that they are indexing the web, not an encyclopedia.

That doesn't make a difference, honestly. Sorting a search alphabetically does not make sense for a search engine, whether it is a MediaWiki search engine or general search engine, as it simply makes the search engine provide less relevant results.

(In reply to Dan Garry from comment #17)

That doesn't make a difference, honestly. Sorting a search alphabetically
does not make sense for a search engine, whether it is a MediaWiki search
engine or general search engine, as it simply makes the search engine
provide less relevant results.

Your assumption here is that there is only one hit that might be interesting for the user. For many editors, lists of articles that are starting points for their editing behavior make perfect sense. Sure, editors are a very small number of users compared to readers. That is why sorting should be a secondary action, after the first results have been shown.

misc2006 wrote:

(In reply to Dan Garry from comment #17)

(In reply to Daniel Naber from comment #15)

The reason Google doesn't offer sort by date is that they usually don't even
know the date of last modification.

That's incorrect. I will unashamedly admit that I got the idea for bug 62879
from looking through Google's search settings to see how they address this
problem. The definitions of "recent" that they allow are anytime, past 24
hours, past week, past month, and past year.

Only that it doesn't work properly because they need to do a lot of guesswork (which Wikipedia wouldn't, thanks to proper meta data). (Example for where it doesn't work properly: search for >site:de.wikipedia.org "languagetool"< on google.de and filter by 'last year' and see how the 'Apache OpenOffice' result disappears from the list although it was modified recently).

BTW, "sort by date" and "sort by relevance" links *do* appear on Google once you have limited results by date. Anyway, I don't think we should care what Google does; the use case of web search is too different from a site-wide search.

(In reply to Jan Ainali from comment #16)

Here is another use case one user came up with during a discussion. You
search for a term and you want to fix something in all these articles. By
being able to sort it by last modified, all the ones you just fixed will go
to the end of the list and it is easy to see what is left to do.

Implementing search sorting for this use case is implementing a solution to a problem that does not exist. In CirrusSearch, the search index is updated within seconds of changes to articles, so any articles you've fixed will be removed from the search results if you were to rerun the query.

(In reply to Jan Ainali from comment #18)

Your assumption here is that there is only one hit that might be interesting
for the user. For many editors, lists of articles that are starting points
for their editing behavior make perfect sense. Sure, editors are a very
small number of users compared to readers. That is why sorting should be a
secondary action, after the first results have been shown.

I make the assumption that the user expects to be presented with results relevant to the query that is typed in. If you are not expecting that, then you should not be trying to change the intended function of the search engine; you should be using some other tool (like database dumps). I'm genuinely sorry if that's harder and more inconvenient for you, because I do not like making things inconvenient for people, but that does not change my stance about the product that is search.

(In reply to Jan Ainali from comment #16)

I am sorry, but you have not stated any reasons at all. In what way will
this degrade performance for users? My wish is that the search is presented
as it is today, with the option to sort it *after* the first results have
been shown. So for users not wanting to sort it, there will be no
difference at all.

I've already outlined that sorting by date will, for the vast majority of users, generate meaningless results. Putting it behind a button and expecting people not to press that button does not make that okay.

Just to be clear, you reopening the bug does not change our work priorities, or change that any patch which attempts to implement this functionality will be met with a -2 by the engineers that work on the extension. It just leaves the bug in a status that does not represent our ongoing priorities, which is confusing and nothing more.

I will close the bug once more to rectify that confusion, but after that I'll just leave it because I don't want to spend my time in a bug status revert war. Any ramifications for the incorrect status of the bug will be your responsibility.

Thanks a lot for elaborating here, Dan!

As per comment 20, "In CirrusSearch, the search index is updated within seconds of changes to articles" is a convincing reason why nobody should actively work on fixing this ("WONTFIX"). Still, anybody is free to hack the MediaWiki code to implement such a feature for themselves on their own wiki, if wanted.

In general, resolutions and priorities in bug reports are supposed to reflect reality but do not cause it; also see [[mw:Bug management/Bugzilla etiquette]].
Thanks everybody for your understanding and I'm sorry that Wikimedia developers will not fulfil the request in this ticket.

demon added a comment. Edited Mar 21 2014, 2:04 AM

[removed ancient comment that doesn't make sense 2 years later. I was dumb]

(In reply to Dan Garry from comment #20)

(In reply to Jan Ainali from comment #16)

Here is another use case one user came up with during a discussion. You
search for a term and you want to fix something in all these articles. By
being able to sort it by last modified, all the ones you just fixed will go
to the end of the list and it is easy to see what is left to do.

Implementing search sorting for this use case is implementing a solution to
a problem that does not exist. In CirrusSearch, the search index is updated
within seconds of changes to articles, so any articles you've fixed will be
removed from the search results if you were to rerun the query.

No, you are assuming I am removing what I search for. I could very well search for one string because I know that articles with this string have something else that needs to be fixed. So they will still show up, but with a sorting function, I could easily move them to the end of the results.

(In reply to Jan Ainali from comment #18)

Your assumption here is that there is only one hit that might be interesting
for the user. For many editors, lists of articles that are starting points
for their editing behavior make perfect sense. Sure, editors are a very
small number of users compared to readers. That is why sorting should be a
secondary action, after the first results have been shown.

I make the assumption that the user expects to be presented with
results relevant to the query that is typed in. If you are not expecting
that, then you should not be trying to change the intended function of the
search engine, you should be using some other tool (like database dumps).
I'm genuinely sorry if that's harder and more inconvenient for you, because
I do not like making things inconvenient for people, but that does not
change my stance about the product that is search.

Hmm, perhaps this should not be in the search product? Perhaps this should be a new special page, Special:SortArticles, where you can sort articles in different ways, perhaps even through categories, templates or search. I would also find this sorting useful in other places, such as Special:WhatLinksHere (whose sort order, by the way, is not clearly conveyed on that special page). Having to go to catscan for that is, as you say, inconvenient for people, but even worse, it might hinder a new interested user from becoming a 100+ edits/month user because they do not know catscan exists and find it too tedious to dig for the articles they want to improve.

(In reply to Jan Ainali from comment #16)

I am sorry, but you have not stated any reasons at all. In what way will
this degrade performance for users? My wish is that the search is presented
as it is today, with the option to sort it *after* the first results have
been shown. So for users not wanting to sort it, there will be no
difference at all.

I've already outlined that sorting by date will, for the vast majority of
users, generate meaningless results. Putting it behind a button and
expecting people not to press that button does not make that okay.
Just to be clear, you reopening the bug does not change our work priorities,
or change that any patch which attempts to implement this functionality will
be met with a -2 by the engineers that work on the extension. It just leaves
the bug in a status that does not represent our ongoing priorities, which is
confusing and nothing more.
I will close the bug once more to rectify that confusion, but after that
I'll just leave it because I don't want to spend my time in a bug status
revert war. Any ramifications for the incorrect status of the bug will be
your responsibility.

I apologize for my reopening before, I was not aware of that meaning of the status.

Regarding the placement, I would suggest the sorting is hidden under the advanced menu, where power users and the curious ones will find it when they need it without it disturbing the casual reader (who probably will be scared away by all the namespace boxes if they dare open the advanced menu :) ). But then again, if it is a separate special page, it might even be prominently displayed, because the user going there is only expecting to be able to sort.

Would it be better if I created a new enhancement ticket and link to this one from the description or should this be updated to reflect the change from search to something else?

TheDJ added a subscriber: TheDJ. Jul 28 2015, 7:26 PM

@Ainali, I think what you are describing, is what I have been calling a "work queue" or "work list", others have called it a todo list. It's a list with a subset of articles fulfilling certain criteria, that you "work through" in order to complete a task. That's not what search is for though, since it's not what it is good at. You can however envision that a search result could "fill" such a list, be a generator for it, but it requires a totally different technology stack to efficiently go through a list for that purpose, and the technology for that should not be put INTO search, because that's not what search is for.

Traditionally, we have used categories and manually built lists on a wiki page a lot for this kind of work. Bot-generated lists are another form.

See also: https://meta.wikimedia.org/wiki/Community_Tech_project_ideas#Work_queues

Restricted Application added a project: Discovery. Jul 28 2015, 7:26 PM
demon removed a subscriber: demon. Jul 28 2015, 10:21 PM
Restricted Application added a project: Discovery-Search. Jan 28 2017, 3:20 AM
MZMcBride reopened this task as Open. Jan 28 2017, 7:41 AM

I'm re-opening this task for further consideration. The primary argument against adding sort options seems to be that Google search doesn't do this. Who cares and how is that relevant?

YouTube, Amazon, Kayak, and a million other sites with search forms have sort options. Typically the default behavior is to sort by score, usually labeled as relevance in the user interface. Then there's often a drop-down menu or links provided to change the sort order. The options themselves are dependent on the type of results; e.g., Kayak and Amazon allow sorting by price, while YouTube allows sorting by view counts.

In our case, we want to support at least sorting by page size and last edited date, both of which seem to be readily available.

If the concern is user interface clutter, we already have an "Advanced" tab at Special:Search with an ugly set of form controls: https://en.wikipedia.org/w/index.php?title=Special:Search&profile=advanced&search=&fulltext=1. Adding a sort order drop-down menu or series of links is no big deal.

I don't think people asking for sort options need to be the ones justifying them. Query languages include "order by" functionality to sort results in ascending or descending order based on field values. On the Web, having sort options in a search form is an incredibly common design pattern. Given the widespread prevalence of sort options, if someone wants to argue that this is a "we won't even accept patches" type of request, he or she is going to need to provide a clear and rational case for why this search form is so special that it should defy expected behavior and actively not provide sort ordering options.

I'm re-opening this task for further consideration. The primary argument against adding sort options seems to be that Google search doesn't do this. Who cares and how is that relevant?

Ah, looking at T156493: Make search results filterable (or sortable) by date, it turns out even Google search provides sort options.

The repeated "degrade the experience for readers" comments are unhelpful and untrue. It would be like someone saying "if we provided a namespace search selector, a user could select only the "Help talk" namespace to search and would then suffer a degraded user experience because they would have no search results." What? This line of reasoning makes no sense. Having advanced search options that are not activated by default doesn't degrade the experience for readers. It's actually the opposite: the people who want to do an advanced search currently have a degraded user experience because we don't provide sort options.

Qgil removed a subscriber: Qgil. Jan 28 2017, 9:48 AM
NeilN added a subscriber: NeilN. Jan 28 2017, 3:49 PM
Jytdog added a subscriber: Jytdog. Edited Jan 28 2017, 7:30 PM

I recently opened a thread on this at Village Pump technical. I have over 100K edits in en-wiki since 2008.

I've read the thread above and clearly devs commenting here do not understand the problem.

Wikipedia is built and maintained by a community of editors. Working in community requires finding, reading, and linking to old discussions. Old discussions are often in archives.

Look at the following search result, which comes from entering "search engine" into the search box at the top of Village pump (technical): search results

Those search results are entirely random. Not even sorted by archive number. Some pages have hundreds of pages of archives, and the time it takes to find things, is ridiculous. Volunteer time is the lifeblood of Wikipedia and the search engine wastes it.

And again, working in community means that finding, reading, and linking to old discussions is essential; something we have to do.

There are often times when I know I should go find and read old discussions about something - where providing a link to a section or diff would help resolve a dispute, or provide evidence in a discussion on a noticeboard, and don't, because of the time it will take and the frustration of dealing with these search results.

I'll further note that finding a section in a page's archive and reading it is one thing; once it is found, it is not hard to link to the section. But finding and providing the diffs that generated specific comments is then an entirely separate process, where you have to go back to the page and search through its history to find the relevant diff. I have always thought it would be amazing if there was a way to get the diff underlying a comment in an archive by right clicking or something. I imagine that is impossible, but wanted to note that.

But I can't imagine how daunting and off-putting the process of finding and linking to, or getting diffs for, archived comments is, for new editors who are interested in joining the community. I reckon this is one reason why new editors so often fail to successfully raise issues on behavior noticeboards, where diffs are essential, and complain that community noticeboards are byzantine.

TheDJ added a comment. Edited Jan 30 2017, 10:23 AM

@Jytdog I want to clarify a few things here if you don't mind.

Polarised starting points to simplify the next discussion:
1: Developers don't understand the editing process
2: Editors don't understand the intricacies of our technology stack

What you see in the discussion above (and what you describe) is an assortment of desires and wishes (usecases) mixed with technology solutions chosen by editors and developers. This caused a logical response by developers, of evaluating these two, and developers telling editors: 'the solution you picked is not actually solving your problem'. Consequently, developers either have ignored the original usecases, or come up with alternative solutions, which have proven to be prohibitive as well for a multitude of reasons (at this point in time).

A short summary:
1: Yes, having sort filters on the results would be very nice
2: Implementing sort filters (both backend and in UI) is going to be costly.
3: As such, priorities will have to be set. By asking to show usecases that can be met, developers (and managers in this case) try to ascertain if these costs will be offset by potential gain/benefits.
4: Editors present several usecases, but the usecases don't match the technical solution detailed in this ticket. (at least not in an effective way)
5: Inertia sets in, and frustration builds.

Your own use case is a perfect example: Talk archives.
1: We only index the contents of the current version of a page (to index everything would be an increase of 21x of the current infrastructure, which would likely also add some logarithmic increase in complexity of the overall process and technology). This is expensive and still several years out (We are NOT a search company, anything we have will be some 10-15 years behind in what Google is able to do).
2: As such any notion of date and 'recent' is only as recent as the last edit to a page
3: Talk archives are edited all the time (and thus their recent indicator that we currently have isn't useful to determine their age).
4: So while your use case is valid, the chosen technical solution detailed in this ticket probably won't help you solve this particular problem in any foreseeable timeframe.

Volunteer time is the lifeblood of Wikipedia and the search engine wastes it.

I think we all agree on that volunteer time is valuable. But that doesn't negate the cost of a potential solution.

I have always thought it would be amazing if there was a way to get the diff underlying a comment in an archive by right clicking or something. I imagine that is impossible, but wanted to note that.

That would be amazing, but also non-trivial since a comment can be made up of multiple diffs, pages can be copy pasted and moved etc etc.. There are some tools like wikiblame, which do this to some degree, but to roll those out more widely is probably not realistic at this time. But it has nothing to do with this ticket.

I think the general answer to this is still: You need structured discussions if you want to do that (aka, something based on similar principles as LQT/Flow).

But I can't imagine how daunting and off-putting the process of finding and linking to, or getting diffs for, archived comments is, for new editors who are interested in joining the community. I reckon this is one reason why new editors so often fail to successfully raise issues on behaviour noticeboards, where diffs are essential, and complain that community noticeboards are byzantine.

I think developers and editors alike agree that our discussion system is not targeted at including newcomers. But again, that has little to do with this particular ticket.

One of the problems that I often see in discussion tickets like these is the mixing of usecases (desires) and technical solutions (abilities/possibilities) within the same ticket. I personally think it would be much better to have usecases and technical solutions in separate tickets for these larger problems, for the sole reason that otherwise you get this back-and-forth between the two that is not constructive to solving problems (especially because the audiences are so different as well).

Regardless, I think that @MZMcBride's summary is really clear and correct (because he understands enough about both sides of the problem to know what to exclude from his statement). We have 'recentism/last edited' information, we have the technology to order by that parameter, we have an advanced tab in the UI, so it should be possible to expose this information. And that is actually exactly what the German team is considering working on: T143310: Implement a way to access keywords such as "incategory", "intitle" via the Search special page and T154911: Redesign the special:search page (Session). But in the strictest sense, that is just a subset of the request that this ticket started with and the usecases mentioned within it, which likely would require indexing the entire history of each and every page.

In order to solve YOUR problem with the current technology, we should probably look in the direction of: "Is there some way that we can special-case a subset of Talk pages in the search engine, to become more discoverable and useful to editors trying to find older discussions, without blowing up our current search solution?" I'm thinking of a way to put a special parser key or something on an archive page/template (as we do for disambiguation pages), like for instance {{#TALKARCHIVE|index|date-range}}. But that would be a totally different ticket, because it's not directly related to the particular technical solution that this ticket deals with. I welcome anyone to create more tickets :)

NeilN added a comment. Jan 30 2017, 1:42 PM

@TheDJ Note the original request (and the request I made last year) referred to article space.

With that in mind, can you please provide further details on this statement: Implementing sort filters (both backend and in UI) is going to be costly.

Just a small note concerning backend limitations.

From the backend perspective, a real sort is going to be costly. What we usually do is a partial re-ranking. Re-ranking the top-N allows us to control how many pages we sort, which gives us the ability to prevent huge sort operations that could consume non-negligible cluster resources or run forever.
I believe that in practice a partial re-ranking is sufficient for most queries. As of today we re-rank the top 8196 results per shard (a shard is a partition). For instance, on English Wikipedia we run with 7 shards, so (assuming that shards are well balanced) for a query that returns more than 8196*7 results (57372) the sort operation is partial.
In short, we can't really make the promise that the sort operation will be exact, but I believe that we can come up with a reasonable approach that would work for many cases.
That's why I prefer the sort + filter option over the sort-only option: if a given query returns too many results to be sorted accurately, one could set a time filter to limit the sort operation to pages modified in the last X days.
More broadly, our search engine is not a DB. We can make it look like a DB with many search options and filters, but we have to keep in mind that there will always be trade-offs (frequently for perf reasons) making the results not 100% accurate in all cases.
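
To make the partial re-ranking described above concrete, here is a minimal Python sketch. It is not CirrusSearch or Elasticsearch code: the function names and the sample hits are hypothetical, and only the window (8196) and the shard count (7) come from the comment. It shows how a per-shard top-N cut can drop the newest page from a date sort, and how filtering first can make the sort exact again.

```python
RESCORE_WINDOW = 8196  # per-shard window quoted in the comment above
SHARD_COUNT = 7        # shard count quoted for English Wikipedia


def exact_sort_limit(window=RESCORE_WINDOW, shards=SHARD_COUNT):
    """Result count above which the sort becomes partial (balanced shards)."""
    return window * shards


def partial_rerank(shards, window, sort_key):
    """Take the top `window` hits of each shard, then sort only those.

    Each shard is a list of hits already ordered by relevance, best first.
    """
    candidates = []
    for hits in shards:
        candidates.extend(hits[:window])
    return sorted(candidates, key=sort_key, reverse=True)


def filter_shards(shards, predicate):
    """Apply a filter on every shard before the top-N cut is made."""
    return [[hit for hit in hits if predicate(hit)] for hits in shards]


def by_last_edit(hit):
    """Sort key: year of last edit (third field of a hypothetical hit)."""
    return hit[2]


# Hypothetical hits: (title, relevance score, year of last edit).
shards = [
    [("A", 9.0, 2015), ("B", 8.0, 2012), ("C", 1.0, 2017)],
    [("D", 7.0, 2016), ("E", 6.0, 2010), ("F", 0.5, 2014)],
]

# Sort only: "C" is the newest page but sits outside the window, so it is lost.
sort_only = [h[0] for h in partial_rerank(shards, 2, by_last_edit)]

# Sort + filter: restricting to recent pages lets every match fit the window.
recent = filter_shards(shards, lambda hit: hit[2] >= 2014)
sort_and_filter = [h[0] for h in partial_rerank(recent, 2, by_last_edit)]

print(exact_sort_limit())  # 57372
print(sort_only)           # ['D', 'A', 'B', 'E']
print(sort_and_filter)     # ['C', 'D', 'A', 'F']
```

With a window of 2 per shard, the sort-only run never sees page "C" even though it is the most recently edited; once the filter narrows each shard to what fits in the window, the date order is exact.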

TheDJ added a comment.EditedJan 30 2017, 2:29 PM

Implementing sort filters (both backend and in UI) is going to be costly.

Sure. As a volunteer, for me anything that takes more than a week of one person's time tends to translate into a problem that is costly.

But a quick summary of the actual costly elements that I can identify as a volunteer developer would be:
1: As far as I know (and I might be mistaken), we currently have no filtering at all. That means that this 'new' information needs to be passed through various layers of programming interfaces, without breaking existing stuff.
2: We have no idea about the performance cost and the impact this might cause. See also @dcausse's comment: already more than I suspected. :)
3: It's going to require a UI change. Any UI / workflow change has proven to be HIGHLY costly over the last couple of years for WMF. (aka, we have a critical audience)
4: Will require very active engagement with the community through liaisons etc.
This assumes a very limited implementation of 'last edited' and 'alpha sorting'. It does not include things like 're-indexing all content'. For instance, people assume that metadata that is presented after search is actually available during search, but it might not (yet) be available. A reindex of all content takes, if I remember correctly, on the order of months, not days (for English Wikipedia). Nor does it take into account any increase in search size, adding old revisions etc.

And it presumes the people who can do this are available to work on it to begin with, which is not a given. The biggest reason things often don't happen is that people are simply stuck working on more important things (for instance keeping search running to begin with), or we might have to wait 3 months before they become available, or a year before they are recruited.

At the scale of Wikipedia/Wikimedia, almost anything becomes complex and complexity is expensive. I'm confident that this particular solution will take months of work by multiple people.

Jytdog added a comment.EditedJan 30 2017, 4:40 PM

@Jytdog I want to clarify a few things here if you don't mind.
Polarised starting points to simplify the next discussion:
1: Developers don't understand the editing process
2: Editors don't understand the intricacies of our technology stack
What you see in the discussion above (and what you describe) is an assortment of desires and wishes (use cases) mixed with technology solutions chosen by editors and developers. This caused a logical response by developers: evaluating these two, and telling editors: 'the solution you picked is not actually solving your problem'. Consequently, developers have either ignored the original use cases, or come up with alternative solutions, which have proven to be prohibitive as well for a multitude of reasons (at this point in time).
A short summary:
1: Yes, having sort filters on the results would be very nice
2: Implementing sort filters (both backend and in UI) is going to be costly.
3: As such, priorities will have to be set. By asking to show use cases that can be met, developers (and managers in this case) try to ascertain if these costs will be offset by potential gain/benefits.
4: Editors present several use cases, but the use cases don't match the technical solution detailed in this ticket. (at least not in an effective way)
5: Inertia sets in, and frustration builds.
Your own use case is a perfect example: Talk archives.
1: We only index the contents of the current version of a page (to index everything would be a 21x increase over the current infrastructure, which would likely also add some logarithmic increase in complexity of the overall process and technology). This is expensive and still several years out (We are NOT a search company, anything we have will be some 10-15 years behind what Google is able to do).
2: As such any notion of date and 'recent' is only as recent as the last edit to a page
3: Talk archives are edited all the time (and thus their recent indicator that we currently have isn't useful to determine their age).
4: So while your use case is valid, the chosen technical solution detailed in this ticket probably won't help you solve this particular problem in any foreseeable timeframe.

Volunteer time is the lifeblood of Wikipedia and the search engine wastes it.

I think we all agree that volunteer time is valuable. But that doesn't negate the cost of a potential solution.

I have always thought it would be amazing if there was a way to get the diff underlying a comment in an archive by right clicking or something. I imagine that is impossible, but wanted to note that.

That would be amazing, but also non-trivial, since a comment can be made up of multiple diffs, pages can be copy-pasted and moved, etc. There are some tools like wikiblame which do this to some degree, but to roll those out more widely is probably not realistic at this time. But it has nothing to do with this ticket.
I think the general answer to this is still: You need structured discussions if you want to do that (aka, something based on similar principles as LQT/Flow).

But I can't imagine how daunting and off-putting the process of finding and linking to, or getting diffs for, archived comments is, for new editors who are interested in joining the community. I reckon this is one reason why new editors so often fail to successfully raise issues on behaviour noticeboards, where diffs are essential, and complain that community noticeboards are byzantine.

I think developers and editors alike agree that our discussion system is not targeted at including newcomers. But again, that has little to do with this particular ticket.
One of the problems that I often see in discussion tickets like these is the mixing of use cases (desires) and technical solutions (abilities/possibilities) within the same ticket. I personally think it would be much better to have use cases and technical solutions in separate tickets for these larger problems, for the sole reason that otherwise you get this back-and-forth between the two that is not constructive to solving problems (especially because the audiences are so different as well).
Regardless, I think that @MZMcBride's summary is really clear and correct (because he understands enough about both sides of the problem to know what to exclude from his statement). We have 'recentism/last edited' information, we have the technology to order by that parameter, we have an advanced tab in the UI, so it should be possible to expose this information. And that is actually exactly what the German team is considering working on: T143310: Implement a way to access keywords such as "incategory", "intitle" via the Search special page and T154911: Redesign the special:search page (Session). But in the strictest sense, that is just a subset of the request that this ticket started with and the use cases mentioned within it, which likely would require indexing the entire history of each and every page.
In order to solve YOUR problem with the current technology, we should probably look in the direction of: "Is there some way that we can special-case a subset of Talk pages in the search engine, to become more discoverable and useful to editors trying to find older discussions, without blowing up our current search solution?" I'm thinking of a way to put a special parser key or something on an archive page/template (as we do for disambiguation pages), for instance {{#TALKARCHIVE|index|date-range}}. But that would be a totally different ticket, because it's not directly related to the particular technical solution that this ticket deals with. I welcome anyone to create more tickets :)

Look I have no idea what this "phabricator" thing is actually for, what the norms are here, etc. I am not part of your world.

It is obvious that there is a long history of bad communication between editors and people who work on code. This thread is another example, I clearly did this wrong and posted things that wasted your time and are off topic.

I wanted to communicate to folks who work on code, that the search engine sucks for finding things in archives, which is an important thing for working in community.

I only mentioned the "newcomers" thing because I understand that some of the people who work on code are WMF employees, and making WP easier to navigate for newcomers is something I understand is important to WMF - so that is a hook you all could use if you need to convince your bosses to give you more resources to address this. Guess that was a waste of time for me to even think about.

The search engine sucking is not MY problem and I can't believe that was even stated. Out of here.

Jytdog removed a subscriber: Jytdog.Jan 30 2017, 4:40 PM
TheDJ added a comment.EditedJan 30 2017, 8:00 PM

Wow... like I said, two different worlds colliding and clearly not understanding each other. I did my best to communicate that this is a complicated process and you clearly communicated "your complexity is not my problem" I guess.

TheDJ added a comment.Jan 30 2017, 8:38 PM

The search engine sucking is not MY problem and I can't believe that was even stated. Out of here.

Hmm, I wonder if this means that Jytdog interpreted my line of "In order to solve YOUR problem" as referring to the search engine, whereas it was of course referring to his description of talk archives, and was meant precisely to set it apart from the complexity of the other search engine problems. So hard to communicate online at times.. :(

Aklapper added a subscriber: Jytdog.EditedJan 30 2017, 8:59 PM

@Jytdog: Regarding the initial version of your comment T40403#2982342: There is no reason to get confrontational or even personal. Please respect https://www.mediawiki.org/wiki/Phabricator/Etiquette - thanks!

As I mentioned on-wiki, I hope that we can all work collaboratively together. So far, this hasn't really been the case here at all. I applaud @TheDJ and @dcausse for remaining level-headed throughout this, and explaining the difficulties. I also thank @CKoerner_WMF for correcting me and pointing out I was wrong about Google not having strict date sorting in a level-headed, rational, and kind manner; people defaulted to hostility rather than explaining to me calmly I was incorrect, which is disappointing.

I stand by my comments about this potentially degrading the experience for readers if done wrong; there is a reason we have relevance algorithms rather than sorting things by date, and it's to give people the most optimal results. The feature is buried deep in Google's UI to avoid people inadvertently triggering it for this very reason. This does not mean the feature does not have use cases or value for people, or that the Search Team should not work on it. Three years ago, in T40403#477041, I said that it did mean we should not work on it and that we should never add such a feature to the UI, and in retrospect I believe that I was wrong. Sorry about that. People change their minds over time. However, finding an appropriate solution that balances the needs of different user groups is costly (as @TheDJ eloquently explains in T40403#2981844), which affects prioritisation.

In summary, this is something that I think is useful for some users, but the use cases are smaller relative to the greater mission of improving relevance for readers, so this is not prioritised for the Search Team in Discovery to work on right now. We'd like to take a good long look at power user features like this someday, but to be clear that is not happening soon.

Jytdog removed a subscriber: Jytdog.Jan 31 2017, 3:29 AM

I am going to try to get help to solve this through a user script. I'll report back with a link to it, perhaps the usefulness can be seen then.

For what it's worth, the IFLA Statement of Principles, section 7.2, states:

When searching retrieves a large number of [results], results should be displayed in some logical order convenient to the [search] user [...]. The user should be able to choose among different criteria: date of publication, alphabetical order, relevance ranking, etc.

debt added a subscriber: debt.Feb 3 2017, 7:44 PM
Scott added a subscriber: Scott.Feb 11 2017, 12:17 AM
debt added a comment.May 30 2017, 8:06 PM

Hi @Ainali - were you able to solve this issue with a user script? If so, we might be able to put that information somewhere for others to use.

Njk added a subscriber: Njk.Oct 16 2017, 1:36 PM
Quiddity updated the task description.Feb 6 2019, 12:42 AM
Quiddity edited subscribers, added: Quiddity; removed: wikibugs-l-list, Manybubbles.

In the API this is now partially working!

@EBernhardson How complicated is it to make these URL parameters available for the default website search? (ideally for eventual inclusion in the Advanced Search interface)

@Quiddity I already hacked in a query parameter, sort, that is usable in Special:Search. I'm not sure how much work it would be to wire up AdvancedSearch with this, but there is a ticket already: T197525
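
As a rough illustration of the sort query parameter mentioned above, the following Python sketch builds a Special:Search URL that carries it. The helper name, the base URL, and in particular the value last_edit_desc are assumptions for illustration only; this thread names the parameter but does not list its accepted values, so check the CirrusSearch documentation before relying on any of them.

```python
from urllib.parse import urlencode

# Assumed base URL; any wiki running CirrusSearch would have its own.
BASE_URL = "https://en.wikipedia.org/w/index.php"


def search_url(query, sort=None, base=BASE_URL):
    """Build a Special:Search URL, optionally adding the `sort` parameter."""
    params = {"title": "Special:Search", "search": query}
    if sort is not None:
        # The accepted values are not listed in this thread; the one used
        # in the example below is an assumption.
        params["sort"] = sort
    return base + "?" + urlencode(params)


# Hypothetical usage: ask for the most recently edited matches first.
url = search_url("talk archive", sort="last_edit_desc")
print(url)
```

Leaving sort unset produces an ordinary relevance-ranked search URL, so the helper can be used with or without the new parameter.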

TheDJ added a comment.Feb 7 2019, 8:36 AM

Possible issue with this new functionality logged as T215487: search sorted by creation date missing some items

@Quiddity I already hacked in a query parameter, sort, that is usable in Special:Search. I'm not sure how much work it would be to wire up AdvancedSearch with this, but there is a ticket already: T197525

Fabulous! Thanks. I've mentioned it in the docs, and will mention it explicitly in that other task.