Provide a list of unsuccessful searches
Open, Low, Public

Description

Author: dejan.papez

Description:
I'd like to propose the addition of a new special page that would display all the
searches that were unsuccessful (failed searches, no results searches, zero results searches). This would be a log of items that were recently searched for but
not found. By sorting them by the number of times they were searched for and not
found, we could more easily fill the gaps that still remain in the databases. This
would be complementary to Special:Wantedpages and Requested articles that are part of
some Wikimedia projects.

The log available to all users would contain the following data:

  • terms that have been looked for
  • their frequency (how many times they have been looked for)
  • the date of the first and of the last search for each unfound term.

It would also be very useful to be able to limit the displayed time period and to set
the sorting order (ascending/descending) and whether to include all results or only
those that have been searched for more than once in this time.
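
As a rough illustration of the log described above, the following Python sketch aggregates zero-result searches into rows of term, frequency, first search and last search, with an optional time period, a minimum count and a sort direction. The `FailedSearch` record and the `summarize` function are hypothetical names invented for this example and are not part of MediaWiki; the strip/lowercase normalization of terms is likewise an assumption.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical record of one search that returned zero results.
FailedSearch = namedtuple("FailedSearch", ["term", "when"])


def summarize(searches, since=None, until=None, min_count=1, descending=True):
    """Aggregate failed searches into (term, count, first, last) rows."""
    stats = {}  # normalized term -> (count, first seen, last seen)
    for s in searches:
        # Limit the displayed time period, if one was requested.
        if since is not None and s.when < since:
            continue
        if until is not None and s.when > until:
            continue
        key = s.term.strip().lower()  # naive normalization (an assumption)
        if key in stats:
            count, first, last = stats[key]
            stats[key] = (count + 1, min(first, s.when), max(last, s.when))
        else:
            stats[key] = (1, s.when, s.when)
    # Keep only terms searched for at least min_count times.
    rows = [(t, c, f, l) for t, (c, f, l) in stats.items() if c >= min_count]
    # Sort by frequency; the term itself breaks ties deterministically.
    rows.sort(key=lambda r: (r[1], r[0]), reverse=descending)
    return rows


searches = [
    FailedSearch("Foo", datetime(2006, 6, 1)),
    FailedSearch("foo ", datetime(2006, 6, 10)),
    FailedSearch("bar", datetime(2006, 6, 5)),
]
rows = summarize(searches)
```

Calling `summarize(searches, min_count=2)` would keep only terms searched for more than once, which matches the "more than once in this time" option proposed above.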

This would be very useful for smaller projects to know where to focus their limited
manpower, but also for larger ones to know which areas are not covered well and which
redirects should be created but have not been yet.

We've had a discussion here:

http://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_%28proposals%29&oldid=59410881#List_of_recently_not_found_results

Three problems were mentioned:

  • possible server overload or impracticality due to server setup
  • privacy issues - in my opinion, if the log does not show IPs or usernames, privacy is protected
  • someone would have to take the time to code this ;)

Version: unspecified
Severity: enhancement

Details

Security
None
Reference
bz6373
bzimport added a subscriber: Unknown Object (MLST).
bzimport set Reference to bz6373.
bzimport created this task.Jun 19 2006, 9:43 AM
brion added a comment.Dec 28 2008, 9:31 PM
*** Bug 7969 has been marked as a duplicate of this bug. ***
demon added a comment.Dec 16 2010, 4:50 PM
*** Bug 26308 has been marked as a duplicate of this bug. ***

dejan.papez wrote:

I think, in addition to the reasons posted to the duplicates (using the system to improve the search system etc.), the feature would contribute most to the development not only of Wikipedia but particularly of Wiktionaries.

Tgr added a comment.Apr 2 2012, 6:58 PM

There was some discussion about that here:
http://www.gossamer-threads.com/lists/wiki/wikitech/186831

Also, Wikistics provided a list of top missing search queries at one point (by rewriting search URLs so that they were collected in the squid logs, then removing queries that matched existing pages), but it is broken now:
http://wikistics.falsikon.de/2008/wikipedia/en/wanted/

For the time being, User:West.andrew.g's list can be used to get this information: https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_redlinks

Nemo_bis edited the task description.Nov 28 2015, 8:27 PM
Nemo_bis edited projects, added MediaWiki-Search; removed MediaWiki-Special-pages.
Nemo_bis set Security to None.
Restricted Application added a subscriber: Aklapper.Nov 28 2015, 8:27 PM
Restricted Application added a project: Discovery.
Deskana added a subscriber: Deskana.Dec 5 2015, 9:52 PM

This was actually one of the very first things that the Discovery Department discussed when we began our work on search.

Search data, and in particular the search queries that users enter, is assumed to contain personally identifying information unless proven otherwise. This is because we're storing arbitrary text input by users, and if that arbitrary input is surfaced publicly then there are a variety of ways that malicious people could game the system and do nefarious things. Even stripping all metadata, and simply presenting a flat list of queries, does not resolve this issue. In light of this, the information is subject to the privacy policy unless it can be anonymised, which is very difficult to do in an automated fashion and very time consuming to do manually.

In summary, this is a good idea, but it isn't going to happen any time soon because of the above complexities.

Deskana moved this task from Needs triage to Search on the Discovery board.Dec 5 2015, 9:57 PM

I believe that privacy concerns should not and do not prohibit the use of failed search query data for the benefit of the projects searched.

As for spamming and the like, repeated requests that cluster by time, by originator, and so on can be collapsed into a single counted request. That would considerably reduce the impact of such bad-faith use.

We could also introduce a threshold below which data is not made public. That guarantees that at least several users have been searching for the same term, which provides at least some degree of anonymity.

Last but not least, older searches have to be re-evaluated (are they successful now?) and may be dropped from the statistics, or their relevance discounted if they were not repeated.
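
The three mitigations above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `publishable_counts`, its parameters, and the opaque originator IDs are hypothetical, and nothing here reflects how MediaWiki or the Wikimedia search logs actually work. Originators are used only internally to collapse clusters and are never part of the output.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def publishable_counts(events, window, min_users, resolved=frozenset()):
    """events: iterable of (term, originator, when) for zero-result searches.

    Returns {term: collapsed request count} for terms that are safe to list
    under the assumptions of this sketch.
    """
    last_seen = {}             # (term, originator) -> time of last request
    counts = defaultdict(int)  # term -> number of collapsed requests
    users = defaultdict(set)   # term -> distinct originators (hypothetical IDs)
    for term, originator, when in sorted(events, key=lambda e: e[2]):
        key = (term, originator)
        prev = last_seen.get(key)
        last_seen[key] = when
        if prev is not None and when - prev < window:
            continue  # same cluster of requests: already counted once
        counts[term] += 1
        users[term].add(originator)
    return {
        term: count
        for term, count in counts.items()
        # Threshold: publish only if enough distinct originators searched.
        if len(users[term]) >= min_users
        # Re-evaluation: drop terms whose search succeeds by now.
        and term not in resolved
    }


t0 = datetime(2015, 12, 6, 10, 0)
events = [
    ("foo", "a", t0),
    ("foo", "a", t0 + timedelta(minutes=5)),   # collapsed into the first
    ("foo", "b", t0 + timedelta(minutes=10)),
    ("foo", "c", t0 + timedelta(minutes=20)),
    ("bar", "a", t0),
    ("bar", "a", t0 + timedelta(hours=2)),     # only one originator
    ("baz", "a", t0), ("baz", "b", t0), ("baz", "c", t0),
]
result = publishable_counts(
    events, window=timedelta(hours=1), min_users=3, resolved={"baz"}
)
```

Note that a threshold of this kind limits exposure but does not by itself show that a published term is free of personal information.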

Peachey88 added a subscriber: Peachey88.EditedDec 6 2015, 10:32 AM

I believe that privacy concerns should not and do not prohibit the use of failed search query data for the benefit of the projects searched.

The privacy aspect is probably more of a separate bug report about whether this feature could/should be enabled on the WMF cluster (which will probably need some comment from WMF-Legal), whereas this bug is more about implementing the feature in MediaWiki core.

I believe that privacy concerns should not and do not prohibit the use of failed search query data for the benefit of the projects searched.

Understandable. That said, your belief does not change the privacy policy which we are bound to follow, and therefore also does not change what I have written above. :-)

As you pointed out, there are ways of reducing the risk, but we're not ready to work on this yet because those questions have not been fully resolved and this isn't bubbling up in priority for us.

The privacy aspect is probably more of a separate bug report about whether this feature could/should be enabled on the WMF cluster (which will probably need some comment from WMF-Legal)

The information I gave above about when this data could be released was the result of a consensus between WMF-Legal and Security when I asked them about it, so that bit is (fortunately!) already done.

whereas this bug is more about implementing the feature in MediaWiki core.

That works for me. At this stage though, I should point out that the Discovery Department's primary focus is the users of the Wikimedia sites; we would not work on a feature unless we thought it had a reasonable chance of being used on the Wikimedia wikis. Of course, the above does not preclude someone else working on this, but it would mean that the support that Discovery could give would be minimal at best.

When I heard about this extension, the first thing that came to my mind was that if we could get the list of the most-searched keywords on my wiki, we would be able to work on the missing articles. That would be a huge help in getting more readers.

If we store all the search data along with the users, then it might be a privacy issue. But if we only store the keywords, then I do not think it should be a problem for anyone. To be safe, we can add an option in the preferences to 'exclude myself from this feature'.

If we store all the search data along with the users, then it might be a privacy issue. But if we only store the keywords, then I do not think it should be a problem for anyone.

Keywords could contain anything, including sensitive information. I've seen it in the logs with my own eyes. Unfortunately, your expectation does not map to the complicated reality of the situation. We should respect the advice of the legal and security experts that caution us about this, because they're right. :-)

As I've already said above, there are strategies that can mitigate this risk, but the Discovery Department cannot prioritise working on them at the expense of other work right now.

To be safe, we can add an option in the preferences to 'exclude myself from this feature'.

That barely improves safety at all. More preferences buried in our complex preference system might make us feel better and let us pat ourselves on the back, but they wouldn't help editors who don't notice the preference, or any readers who choose not to log in. Besides, building out such a feature would represent even more development effort, which, as I've said, Discovery cannot prioritise right now.
