What can the Search API do for you?
Closed, ResolvedPublic

Description

Wikimedia currently has a very powerful search API (https://www.mediawiki.org/wiki/API:Search_and_discovery) that does some things well and could do some things better. The Discovery department would like to have a user-focused session to hear about what works well, what works poorly, and what confuses our consumers.

We'd especially like to hear from the Reading, Editing, Android, and iOS teams to better understand how we can improve relevance, query complexity, and search performance.

Agenda: Gather feedback and requests for the search API from those who consume the data it provides. Brief introduction (~1 minute), then open discussion.

See also T89733: Allow ContentHandler to expose structured data to the search engine.

Event Timeline

Tfinc raised the priority of this task from to Needs Triage.
Tfinc updated the task description.
Tfinc edited subscribers, added: Wwes; removed: Krenair.

@bearND @dr0ptp4kt @BGerstle-WMF @Jdlrobson @Catrope - would love to get feedback from you guys about what improvements discovery can make to the search API to make development of products easier and faster

This could range from fewer calls to get something done, performance improvements, or other.

@Tfinc Awesome! @dr0ptp4kt do you think it would be a good idea to make this an agenda item for the next reading tech lead sync?

Let's discuss, for real.

Congratulations! This is one of the 52 proposals that made it through the first deadline of the Wikimedia-Developer-Summit-2016 selection process. Please pay attention to the next one: > By 6 Nov 2015, all Summit proposals must have active discussions and a Summit plan documented in the description. Proposals not reaching this critical mass can continue at their own path out of the Summit.

@dr0ptp4kt: I'd like to meet this week to discuss. Can you make sure you have your team discussion before?

@Jdlrobson @Catrope - do you have any blocking bugs around search? Eager to hear from any community devs as well

The Collaboration team doesn't need anything related to search this quarter. The VisualEditor team, however, may have bugs/requests related to the opensearch API (used for link target suggestions). @Jdforrester-WMF would be the person to ask.

One VisualEditor request that jumps to mind is T113110: Replace VE's use of prefix search for link/template/image suggestions with full text search; @Jdforrester-WMF and I have discussed that and agreed it's not really high priority, and the work to do it on Discovery's end is ongoing anyway, so it's also not super interesting to discuss. I'm curious to hear if there are any other needs either from a product or engineering perspective.

Another VisualEditor request I was just reminded of is improving the media search API so T95223: VisualEditor media search results often include irrelevant PDFs due to full-text search can be fixed.

+1 media search. It is currently impossible to search for a specific media type (e.g. images/audio/video only)

@Catrope @Esanders (and perhaps also @neilpquinn): What proportion of editing sessions in VisualEditor involve the user trying to add media?

@Deskana I'm not sure how we would calculate that.

I'm hoping this summit topic will also include some details on the long-term planning of the discovery team.

Added T119644, which talks a little bit about configurable handling of disambiguation pages and redirects. There are a few user stories that would benefit from exclusion of both redirects & disambiguation pages:

  • Random article
  • Recommendations

Basically, anything more in the "discovery" category (as opposed to search). AFAICT disambiguation pages are handled only via an extension, not core (according to T8754).

@Deskana pinged me on IRC about this a couple of hours ago. I think it's best that I respond here.

I'm not clear about what the goals of this session are. Furthermore, at this point, one of the area leaders would need to be a champion for this. The area this most cleanly fits in is T119029: WikiDev 16 working area: Content access and APIs, so I think the best reason why this session might happen is for @GWicke to clearly articulate why this session should happen instead of the multitude of others that this is competing against. That debate might best happen on T119029.

We'll host this as an unconference session at 2pm on Monday 4th January.

Could we get a rough agenda for the session? I'm already double-booked for the 2pm session, I suspect others might be as well. Good agendas for sessions help us evaluate where our time is best spent.

@cscott Good point. I'll add that to the description.

Etherpad copy


Search API Discussion

Quick welcome from Dan Garry: Hoping to hear what people would like the search API to do, or to do better.

I have been very excited about improvement to autocomplete. Question: What is the timeframe for adding the new functionality as a generator to the API? Could we add that feature to the apps and do AB Tests on it?
Answer: Dan's hope is to get the existing feature into production by March 2016. Getting it into a generator requires a bit more thought. We need proper API integration.

Q: For the mobile case, how important is it to support other namespaces?
A: Not important.

Q: Do you have/plan to have support for searching image files and other media?
A: Not on current roadmap (which looks out 9-12 months). We have been collecting ideas, but it's very complex.
Follow-up: One older idea was to translate category keys, and use those as a keyword search.
Q: Can we filter search by media type (Audio/Image/Document) (T95223)?
A: We could look at small, specific improvements sooner, whereas a big overhaul will have to wait.
Dan: Using wikidata to assist searching is also something we're looking at (eventually)
Follow-up: File descriptions could have any/many language(s).

Q: When we search, are we searching in a specific language?
A: What does it mean to search in French on Commons? Searching a language in wikipedia is clear, but commons less so.
A: I would expect searching tags to be more effective.

Q: For search across languages, could you translate the search query and run the results?
A: Discovery is already working on cross-wiki searches based on language. Some experiments were promising. But the tests only measured that we were serving more results--not that they were great results.
A: Short query strings make it difficult to detect which language was used. Detection tends to be stats-based, and a short query gives it very little to work with.
A: But language detection is different from translating the query into multiple languages, and searching that.
Dan: Would that be useful on wikipedia, or just on commons?
Response: both.
A: Would need translation library with appropriate licensing.
A: We're focused on A/B tests, to ensure that any changes actually have a positive effect

Dan: In addition to tracking "did the user click on something in the search results", we also track bounces (where they clicked back out right away, implying it wasn't actually a good result).
We are continuing to refine that relevancy testing, combined with zero results rates.

Q: What about analyzing the articles that a user has read, and serving articles that relate to those?
A: Isn't that a "more like.."?
A: "similar" in terms of word frequencies would be easier than "similar" based on concepts
A: Might make sense to create a new API for "like" queries.
A: Discovery is focused on the search box in wikipedia. Perhaps mobile could do richer analysis of user behavior?
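
As a rough illustration of the "similar in terms of word frequencies" idea above, here is a minimal sketch that compares two texts by the cosine similarity of their word counts. The function names and the tokenization rule are assumptions made for this example; this is not how the search backend implements "more like" queries.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase word occurrences in a piece of text (naive tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    freq_a = term_frequencies(text_a)
    freq_b = term_frequencies(text_b)
    # Only words present in both texts contribute to the dot product.
    shared = freq_a.keys() & freq_b.keys()
    dot = sum(freq_a[word] * freq_b[word] for word in shared)
    norm_a = math.sqrt(sum(count * count for count in freq_a.values()))
    norm_b = math.sqrt(sum(count * count for count in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A score near 1 means the two texts use much the same words in similar proportions; a score of 0 means they share no words. Concept-level similarity, as noted above, would need something more elaborate.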

Q: Do we have a way to understand what is common between 2 articles, and quantify/qualify that? And could we use that to offer related content?
A: Ellery built a model that does that, for the translation recommender. If you give it your edit history, it can offer similar pages. Gave up on that, and just used a google search instead.
Simple "similar" seemed anecdotally better than the complex analysis.

Looking at the top 10 articles in a given week is interesting. Sometimes there are single things like "death" or "facebook". Other times you'll see a cluster of actors in a hit movie, or similar.

Q: Using the Android app, search results (on enwiki) aren't always helpful. "Obarta" might offer something like "Obajma", which doesn't help because we don't have a page for that either. It sometimes helps, but rarely.
A: Dan is building up a list of queries that give weird results, so please send him additional examples.
A: The suggestions are based on words, not phrases. It's difficult to tweak parameters because each one will have lots of side effects.
Q: Would this be fixed by the new auto-complete suggestions?
A: Not entirely. Typos of 2 or more characters probably won't be corrected. We display "did you mean" from text, not from queries. Earlier searches by anyone can affect later results.
A: Privacy concerns about revealing searches to other people, but can be mitigated by requiring multiple instances from different IP addresses.
A: Searches on small wikis like office can produce odder results because the text on the site is specialized and small.
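
The privacy mitigation mentioned above (only using a query once it has been seen from several different IP addresses) could look roughly like the sketch below. The log format, the function name, and the threshold of 3 are all assumptions for illustration, not a description of any existing pipeline.

```python
from collections import defaultdict


def shareable_queries(search_log, min_distinct_ips=3):
    """Return queries seen from at least `min_distinct_ips` different addresses.

    `search_log` is an iterable of (ip_address, query) pairs. Both this shape
    and the default threshold are assumptions made for the example.
    """
    ips_by_query = defaultdict(set)
    for ip_address, query in search_log:
        # Normalize lightly so "Obama" and "obama " count as the same query.
        ips_by_query[query.strip().lower()].add(ip_address)
    return sorted(
        query for query, ips in ips_by_query.items() if len(ips) >= min_distinct_ips
    )
```

A query repeated many times from a single address never passes the filter, which is the point: one person's searches are not revealed to others.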

Q: Some "did you mean" aren't even meaningful words
A: There is a confidence parameter, which could be looser...but returning "more results" is not always a good thing.

Q: For cross-wiki searches, will there be php API endpoints, or some other API?
A: Industry word is "federated search". Current inter-wiki search is search, then detect language, then search the other wiki.
A: Elastic allows us to send multiple searches at once, but load is a concern. We're looking at ways to do federated search on a smaller scale, but that's a ways out. Just ideating now.
A: Pre-computing causes storage concerns, as well as when to refresh.

Q: Are you doing anything with natural language search?
A: Not actively. Some volunteers have developed prototypes to do natural language searches on Wikidata, by translating to SPARQL. We are interested in it, at some point.
A: Another option is to strip out "stop words" like is, or, etc. We did a test, but the particular attempted tweak wasn't effective.
A: Any search that includes a question mark ends up as a regexp, so the ? causes odd effects.
A: Would be nice to give users a specific answer to a specific question, rather than only offering article links.
A: Maps is another case where a quick answer could help, but we are intentionally moving slowly, plus only have limited developers.

Q: How does mobile search?
A: Prefix first, and if that fails, then try a full text search <--- Not sure I got this right
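
If the answer above is right (the note-taker was unsure), the flow is "prefix first, full text as a fallback". A minimal sketch of that control flow follows. The two search functions are in-memory stand-ins invented for the example, not real API calls.

```python
# Hypothetical in-memory "index" used only to make the sketch runnable.
TITLES = ["Paris", "Paris, Texas", "Barack Obama"]


def prefix_search(query):
    """Stand-in for a prefix search: titles that start with the query."""
    return [t for t in TITLES if t.lower().startswith(query.lower())]


def full_text_search(query):
    """Stand-in for a full-text search: titles that contain the query anywhere."""
    return [t for t in TITLES if query.lower() in t.lower()]


def search_with_fallback(query, prefix=prefix_search, full_text=full_text_search):
    """Try the prefix search first; use full-text search only if it finds nothing."""
    results = prefix(query)
    if results:
        return results
    return full_text(query)
```

Here a query like "par" is answered by the prefix search alone, while "obama" matches no title prefix and falls through to the full-text stand-in.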

Q: A while ago, Discovery was looking at zero results. One plan was to ask "was this helpful". Did you start, and what did you learn?
A: Discovery had hoped to do that last quarter, using quick surveys, but it didn't work out. We will be doing it this quarter.
A: We hope to use that data to fine-tune the arbitrary 10-second bounce threshold
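
To make the bounce idea concrete, here is a small sketch that classifies a click as a bounce when the user returns to the results within a threshold, using the 10-second figure mentioned above as the default. The data shape (pairs of click and return timestamps in seconds) and the function names are assumptions for this example.

```python
BOUNCE_THRESHOLD_SECONDS = 10  # the "arbitrary" threshold mentioned in the notes


def is_bounce(click_time, return_time, threshold=BOUNCE_THRESHOLD_SECONDS):
    """True if the user came back to the results within `threshold` seconds.

    Times are in seconds; `return_time` is None if the user never came back.
    """
    if return_time is None:
        return False
    return (return_time - click_time) < threshold


def bounce_rate(clicks, threshold=BOUNCE_THRESHOLD_SECONDS):
    """Fraction of (click_time, return_time) pairs that count as bounces."""
    clicks = list(clicks)
    if not clicks:
        return 0.0
    bounces = sum(is_bounce(click, ret, threshold) for click, ret in clicks)
    return bounces / len(clicks)
```

Passing a different `threshold` makes it easy to see how sensitive the measured bounce rate is to that choice, which is the kind of tuning survey data could inform.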

Q: Geosearch: Would be really useful if it worked at various zoom levels
A: Max hopes to use the pageviews/pagerank-driven results to improve that. It's hard to know what to show/hide as you zoom out.
A: Currently, if you search "near", and then zoom out, it still only shows things near you, all clustered in a tiny area on the screen
Q: Google Earth and Wikiwand show Wikipedia results from all areas of the map. How are they doing this? We don't know.

Q: Does mobile plan to add a search box to maps?
A: Reading was just thinking about that. Combining zoom level with completion suggester seems very promising.

Completion suggester has experimental features allowing mixing in geosearching
If you're in Texas, a search for "Paris" should probably favor "Paris Texas" over "Paris France"
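
One simple way to picture the "Paris" example is to order matching titles by their distance from the user. The sketch below does only that, using the haversine great-circle formula. The coordinates are approximate, and a real ranking would presumably blend distance with other signals such as popularity; none of this reflects the completion suggester's actual scoring.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def rank_by_proximity(candidates, user_lat, user_lon):
    """Order (title, lat, lon) candidates from nearest to farthest from the user."""
    return sorted(
        candidates,
        key=lambda c: haversine_km(user_lat, user_lon, c[1], c[2]),
    )


# Approximate coordinates, for illustration only.
PARIS_CANDIDATES = [
    ("Paris, France", 48.86, 2.35),
    ("Paris, Texas", 33.66, -95.55),
]
```

For a user located in Austin, Texas (roughly 30.27, -97.74), `rank_by_proximity(PARIS_CANDIDATES, 30.27, -97.74)` puts "Paris, Texas" first.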

Q: On mobile, what zoom levels are common?
A: Currently goes to 10km. People probably aren't zooming out super far. Probably mostly at a city level.

Q: How are you going to compute pageranks?
A: Elasticsearch already has all the outgoing links, so we can run the computation through Hadoop.
Q: Will it include cross-wiki links?
A: Not for now, no.
Dan A wants to bring cross-wiki data into hadoop, so would like to work with Discovery.

Q: In new search API, if I search around a specific point, like categories and files, and limit the result set to 10 (of 100 available). Are the results the best? They aren't ordered within themselves.
A: Results are all ordered by distance from the center. The GeoData API has no continuation, to encourage making a single request instead of several.
Q: Using it as a generator, there is a distance field. Within the 10 results, the distances are not ordered
A: When used as a generator, order is not preserved. This is a known issue. But the 10 will be the most relevant, just not ordered within themselves.
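
Until the ordering issue is addressed, a client can re-sort the returned set itself. The sketch below assumes each result has already been unpacked into a dict carrying a numeric distance under a key named "distance"; that key name and the result shape are assumptions, so adjust them to whatever the response actually contains.

```python
def sort_by_distance(results, distance_key="distance"):
    """Return results ordered nearest-first by their distance field.

    `results` is a list of dicts; `distance_key` is a hypothetical field name.
    """
    return sorted(results, key=lambda result: result[distance_key])


# Made-up sample in the unordered form described above.
sample = [
    {"title": "C", "distance": 420.5},
    {"title": "A", "distance": 12.0},
    {"title": "B", "distance": 97.3},
]
nearest_first = sort_by_distance(sample)
```

This only restores the order within the returned set; it cannot recover results that were cut off by the limit.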

Q: Are there things users were asking for that couldn't be done on the server and should be on the client?
A: Most of the feedback we get is "I'm using insource (or intitle) and I'm not getting results I want". The rest are odd results for searches, like the examples above.
When people search on wikipedia, they think our search is as smart as google, so they are asking questions (not just entering keywords).
Some search queries are non-encyclopedic, like "what's a good romantic date?" They might think they are searching in Google but entered the query in the wrong text box.
Ideally for encyclopedic questions, we would provide answers. In some cases, we might have the info, like a list of lists of romantic places.

Examples of bad/difficult queries:

  • Lots of queries for specific media (songs, movies), including in other languages
  • Repeated searches for an excerpt from a book

We should use query stats to generate lists of potential articles that perhaps should be written.
But there are a lot of searches that really aren't meaningful. Can we pull out the gems from among all those?

"Jurrasic World" was a common zero-results query. The completion suggester would fix that now.

What if, when the user searches a few times in a row, we try to figure out what they might have meant?
Zero result once isn't frustrating, but repeated zero results is.

Q: Can we leverage our community to find gems among the common zero results?
A: The tail is very long, and the most common often come from bots.

Part of the goal of the completion suggester is to reduce the need for redirects just to fix typos.
Human-curated redirect pages have value. Completion suggester works whether they exist or not.

Q: Has anyone experimented with running queries through our search, and through google's? Anecdote: Supposedly at one point Bing was displaying Google results directly.
A: Trey ran a bunch of queries where page views came from google, and re-ran them through our enwiki search. Turns out google is a bit tricky, so some results were hard to replicate in google.
A: Around half of the queries through our search got the same page that the user had actually landed on from google.
Q: Could we set up a test cluster where searches are served by calls to google? For test purposes, we might learn from what they have done.
Q: We could analyze referrer logs to see what search terms got users to which pages, and use that to train.
A: Google says there is no "secret sauce" that you could detect from a black box. There is just a large collection of small features that combine into a whole.
A: Google says in Europe, searches for trains are for where they are going and when. In the US, searches are where people want to know when the train will actually arrive. Just one example of a small feature.

Discovery has made some changes which help, but each change is likely to only affect a small number of users/queries. Back to the long tail.

Q: Are redirects used to power completions?
A: 2 different cases: "Did you mean" (which works on a per-word basis) vs. the completion suggester, which would look at redirect pages, but might just get you there directly <--- not sure I got this right

Goal is to apply completion suggester logic to prefix search (which mobile uses).
Completion suggester doesn't reorder words

Q: For mobile, with small screen, would we still display results, or would we switch to suggestions?
A: In a perfect world, it should be fast enough that the autocorrected suggestions are the search results. "Results" should be "Suggestions". Hopefully even on mobile.
A: It would still use prefix search.
A: Full-text search is really slow, so prefix is necessary.

Mobile shows images in search results.
Discovery is about to launch an A/B test on the Wikipedia portal which shows images, very similar to what mobile already does.

Q: Would the completion suggester actually be more useful than "did you mean?"
A: We don't know yet. Over time, some of our older stuff (workarounds) should go away.

Wikimedia Developer Summit 2016 ended two weeks ago. This task is still open. If the session in this task took place, please make sure 1) that the session Etherpad notes are linked from this task, 2) that followup tasks for any actions identified have been created and linked from this task, 3) to change the status of this task to "resolved". If this session did not take place, change the task status to "declined". If this task itself has become a well-defined action which is not finished yet, drag and drop this task into the "Work continues after Summit" column on the project workboard. Thank you for your help!

Deskana claimed this task.

This session took place as an unconference session. Etherpad notes are above. The session was primarily meant to be informative for Discovery, and was successful at that. The one action item that comes to mind is to look into T95223, as there may be a quick win to be had for it that doesn't involve substantially reworking media search from the ground up.