Implement personalized search for logged-in wiki users
Closed, Declined, Public

Description

Personalized search (https://en.wikipedia.org/wiki/Personalized_search) is becoming more and more common. AFAIK we haven't implemented anything in that direction for any of our wikis (and I couldn't find a task). We should offer this option to our logged-in users. You would build a (search) profile for every user, and based on several strategies you should be able to provide better search results.

Of course this might have privacy implications if done wrong. There should probably be options in the user preferences to enable/disable it and to enable/disable collecting search data, plus a big delete button to delete the gathered profile data. Data shouldn't be exposed, or only at a very high level of aggregation.
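
As a rough illustration of the controls described above, here is a minimal sketch of a per-user profile with the two opt-in switches and a delete action. Everything in it is an assumption made for illustration: `SearchProfile`, its fields, and its methods are hypothetical names, not existing MediaWiki or CirrusSearch structures, and a real design would also need storage, access control, and retention handling.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SearchProfile:
    """Hypothetical per-user search profile (illustrative only)."""

    personalization_enabled: bool = False  # opt-in: off by default
    collection_enabled: bool = False       # opt-in: off by default
    queries: List[str] = field(default_factory=list)

    def record_query(self, query: str) -> bool:
        """Store a query only if the user opted in to data collection."""
        if not self.collection_enabled:
            return False
        self.queries.append(query)
        return True

    def can_personalize(self) -> bool:
        """Personalize only when enabled and some data has been gathered."""
        return self.personalization_enabled and bool(self.queries)

    def delete_all(self) -> None:
        """The 'big delete button': drop all gathered profile data."""
        self.queries.clear()
```

With both switches off by default, a new profile stores nothing until the user opts in to collection, and `delete_all()` leaves the preferences untouched while removing the gathered data.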

The user benefit would be better search results and maybe suggestions for users about new content (articles/files) in areas of interest. For comparison, quite a few Spotify users like the "discover" menu in its app.

This is a (very) high level task. It should probably have a bunch of subtasks if people are actually going to work on this feature.

Event Timeline

debt subscribed.

Hi @Multichill - this is a very interesting idea, but it goes against all of our privacy concerns with the Wikipedia community and the movement in general. I believe that this idea has come up several times over the years, but because of privacy concerns, nothing has been done with it (marking as declined).

That's just FUD. Can you provide links where this has been discussed before? Can you explain exactly what the privacy concerns are? I'm not buying a declined without a decent conversation about this.

Hi @Multichill - here are several items to read through. They don't all necessarily exactly match what you're proposing, but they're all pretty close.

It's true that publishing zero-result searches (or any searches) is a more obvious potential privacy leak than personalization, because the data for personalization should be hidden away internally. However, if there ever were a data breach of any kind, personalization data would be a concentrated source of information on users. I'm not an ops engineer, but storing and securing personalized data properly for hundreds of thousands to millions of users could entail more hardware, and seems likely to require more general complexity in our infrastructure.

Good personalization would require a fair amount of data for each user, and might have to go beyond our 90-day data retention policy to be effective. Users could opt in, but it's not clear that we want to even give people the option to violate the data retention policy. Given the importance of privacy (otherwise why not use site search from Google?), it seems like this kind of data retention (even within the 90-day limit) would require opt-in, which could significantly limit its use.

As you mentioned, this is a very high level task that would require a lot of work, which means it has a high opportunity cost and would require prioritization over other ongoing projects, which currently include supporting search for Structured Data on Commons, improving our new machine learning–based ranking (T174064), improving search in multiple languages (T174065), paying off a lot of technical debt for our query parser (T185108) as prelude for general improvements there, and supporting Wikidata and Wikibase search (T189736, etc.). You can see our goals (and goals for the whole Technology Department) on MediaWiki—our team's goals are here and our support for Structured Data on Commons is here.

So, the conversation I'd like to see would be something like what happens during the Community Wishlist Survey (though the next one is a ways away) or a proposal on a relevant Village Pump (I checked English Wikipedia's VP and couldn't find anything relevant). It would require a lot of conversation to come up with a set of desirable use cases that give the right balance of privacy and features. I fear additional complications when trying to plan for something across communities, too. Sometimes features that seem generically and obviously beneficial to one group can become contentious within a different community, and something like this would require broad consensus across many wiki communities. Users being able to individually disable it may not be enough for some communities; I can't give their arguments because I don't know them, but I've come to expect them, and for something as ambitious and broad as this, I assume there could be significant pushback in some quarters.

As it stands, I have concerns about privacy, infrastructure, usage uptake, community acceptance, consensus on scope, and opportunity costs, and that's before getting into the actual technical implementation details, which are also potentially worrisome. We are constrained in our ability to gather search quality data (Google and Bing pay big bucks for it), and we already have difficulty assessing some kinds of changes, especially on smaller wikis. Assessing personalized search results could be very difficult, and building something so complex without proper feedback is risky.

Personalized search would be a huge undertaking and it needs much more than one phab ticket to set it underway.

EBjune triaged this task as Lowest priority. Apr 26 2018, 5:27 PM
EBjune moved this task from needs triage to search-icebox on the Discovery-Search board.
MPhamWMF subscribed.

Closing out low/est priority tasks over 6 months old with no activity within the last 6 months in order to clean out the backlog of tickets we will not be addressing in the near term. Please feel free to reopen if you think a ticket is important, but bear in mind that given current priorities and resourcing, it is unlikely that the Search team will pick up these tasks for the indefinite future. We hope that the requested changes have either been addressed by or made irrelevant by work the team has done or is doing -- e.g. upgrading Elasticsearch to a newer version will solve various ES-related problems -- or will be subsumed by future work in a more generalized way.

RhinosF1 removed a project: Discovery-Search.
RhinosF1 subscribed.

Re-opening tasks and removing from team workboard per IRC feedback given yesterday and discussion with MPham.

Declining this after re-reading the above comments regarding privacy, etc.