Page MenuHomePhabricator

Reader searches for a topic
Closed, ResolvedPublic

Description

"As a Reader, I want to get a list of pages that match a search term, so that I can find pages about a topic I’m interested in."

GET /search/page?q={search term}

Searches for pages in the main namespace by a text search term. Returns pages that match the term either in their title or in their text. Pages that are unreadable by the current user are not returned.

Request headers: none
Request body: none

Status:
200 – OK

Headers: none

Body: JSON
object including a single property, "pages".

  • pages: an array of pages in relevance order, as with search in the Web interface. Maximum of 50 results. Each page result includes:
    • id: id of the page
    • key: prefixed DB key of the page, like "Main_Page"
    • title: title for display, like "Main Page"
    • excerpt: excerpt that shows the search term in the text of the page, in HTML, or null if only a title match

Event Timeline

@EvanProdromou is there any other status code that might make sense? For example if there are no matches for the search?

@kchapman I think the result would be a 200 OK, with an empty list as the body of the results.

So, there's a question of whether we use ?q={search term} or /search/page/{search term}

Here's what I see for other APIs:

O'Reilly's "Designing Web APIs" calls out the q={search term} style in particular (page 12). I haven't found an example of using the search term in the URL if it's not a search

Another question that came up during our kickoff was what to do with empty search results. I'm going to try to catalog what different search APIs do in this situation.

Twitter returns a status code 200 OK, with an empty array for the "statuses" property.

version_id should be revision_id, or {"revision": {"id": ...}}.

The SearchResult interface does provide a timestamp, which may be the revision timestamp, so I suppose it would be possible to fetch the revision IDs from the database given that information. If the revision is missing, I suppose it would be most consistent to filter out the result, although it's unclear whether getTimestamp() would be robust enough to allow that for all search engines.

Regarding the excerpt, SearchResult provides the following methods:

  • getTextSnippet()
  • getTitleSnippet()
  • getRedirectSnippet()
  • getSectionSnippet()
  • getCategorySnippet()

There are a bunch of them because the search term may match in various fields of the document. In order to leave space for change, and for consistent terminology with the rest of MediaWiki, perhaps instead of "excerpt" we could have an object called "snippets" containing an element called "text": {"snippets": {"text": "The highlighted text snippet"}}

SearchResult has isFileMatch(), reflecting the ability of CirrusSearch to index files with a text layer, such as PDFs. Should such results be filtered out, or should we find a way to display them? For file results, there is a title, but not a revision ID.

I did a quick review and couldn't find an API that returned anything but 200 OK for empty results. I'm going to stand firm that that's the right status, not 404.

eprodromou updated the task description. (Show Details)

I updated the output based on @tstarling 's suggestion that we follow the OWASP recommendation of only using objects for output.

Is revision_id meant to be the current revision ID of the page, or the revision ID at the time of indexing? My previous comment related to the latter since there are easier ways to get the current revision of a page.

@tstarling I removed the revision ID. I was trying to keep the schema for "short info about a page" about the same across endpoints. Definitely not worth worrying about here.

GET /search/page?q={search term}

It seems like at least in /revision/{id}/compare/{other_id} we have the structure of /resource_type/action while here we do wise versa. What's the standard?

title_key: prefixed DB key of the page, like "Talk:Main_Page"
display_title: title for display, like "Talk:Main Page"

It was noted in some other ticket, that it's better to have a 'title' and object. Also, display_title has a very specific meaning in MW and I don't think you mean it. Did you mean display title generated by https://www.mediawiki.org/wiki/Extension:Display_Title ?

perhaps instead of "excerpt" we could have an object called "snippets" containing an element called "text"

This comment from @tstarling was not incorporated.

SearchResult has isFileMatch(), reflecting the ability of CirrusSearch to index files with a text layer, such as PDFs. Should such results be filtered out, or should we find a way to display them? For file results, there is a title, but not a revision ID.

This question still holds.

Do we have a sense of whether we're standardizing on snake case or camel case? For example, the spec for this endpoint has properties in snake case, while the spec for the get media links endpoint has properties in camel case.

Do we have a sense of whether we're standardizing on snake case or camel case?

Snake case. I'll put it in the design principles document.

GET /search/page?q={search term}

It seems like at least in /revision/{id}/compare/{other_id} we have the structure of /resource_type/action while here we do wise versa. What's the standard?

I don't think we have a standard, but q={search term} is widely used, as I mentioned in a previous comment.

title_key: prefixed DB key of the page, like "Talk:Main_Page"
display_title: title for display, like "Talk:Main Page"

Fixed.

There are a bunch of them because the search term may match in various fields of the document. In order to leave space for change, and for consistent terminology with the rest of MediaWiki, perhaps instead of "excerpt" we could have an object called "snippets" containing an element called "text": {"snippets": {"text": "The highlighted text snippet"}}

I'm not sure I understand this. Could we maybe have something like a "best" snippet property, snippet, for the casual developer, and then we could add more_snippets or something similar for the more in-depth display?

SearchResult has isFileMatch(), reflecting the ability of CirrusSearch to index files with a text layer, such as PDFs. Should such results be filtered out, or should we find a way to display them? For file results, there is a title, but not a revision ID.

Yikes. So, my first instinct is to filter them out for now, but it would be a breaking change to add a different kind of search result later. I'll make sure we've got it clear by the time we kickoff.

Could we maybe have something like a "best" snippet property

How do we know what's the 'best' snippet? The longest? The shortest? The first non-empty one in a somehow ordered list of snippets? The possibilities are endless and totally depend on the use case. I have a feeling that hiding complexity here will just introduce more complexity.

key: prefixed DB key of the page, like "Talk:Main_Page"

I have commented this on the schema document, but will reiterate here since this will get to be implemented sooner. I think 'key' is way too non-specific a key, Also, having display title and key in separate properties go against our own design principle about combining connected properties into subject.

Could we maybe have something like a "best" snippet property

How do we know what's the 'best' snippet?

Let's start with the snippet that appears on the Web UI for search results. That might not be the "best" but it's probably good enough for a first pass.

Screen Shot 2019-10-30 at 10.44.52 AM.png (1×1 px, 336 KB)

key: prefixed DB key of the page, like "Talk:Main_Page"

I think 'key' is way too non-specific a key

I don't think we're going to have multiple keys, so I think it's probably fine.

Also, having display title and key in separate properties go against our own design principle about combining connected properties into subject.

I appreciate that, and it's a good point. I am reluctant to make something as intrinsic to the page as its title be part of a sub-object, though.

If it's bugging you a lot, I'd suggest we drop key from the schema, and just leave title.

Let's start with the snippet that appears on the Web UI for search results.

:) Replicating this would be quite a bit of complexity but in the end the 'snippet' that is shown is a text snippet from the article. However, as you can see, if the end goal is for the API client to build a page with user experience comparable to wiki, having a text snippet is not enough at all.

There's a couple of decisions still to think about:

  1. In MW 'SearchEngine' we have searchTitle and searchText methods. Action API allows you to specify what to search, the SpecialSearch apparently searches both and can indicate where the match was found - in title or in text with title matches being higher in the page. Do we want to search both like SpecialSearch? Do we want to maybe indicate what kind of a match that was in the result?
  2. If the user doesn't have the rights to read the page - what do we return? Omit the page from the result? Return a title match with empty snippet?

Let's start with the snippet that appears on the Web UI for search results.

:) Replicating this would be quite a bit of complexity

That does look complex! Am I wrong in thinking that there's one line that actually gets the text snippet?

Do we want to search both like SpecialSearch?

Yes. I'll put that in the user story. If we decide to build search only in the title, or only in the text, in the future, we'll either add a query parameter to this endpoint, or create a different endpoint.

  1. If the user doesn't have the rights to read the page - what do we return? Omit the page from the result?

Yes.

@eprodromou Should this endpoint be searching all namespaces? (Main, Talk: User:, User Talk:, etc) by default? Or, should it accept a param that indicates which to search (like the current "opensearch" endpoint does?) .

@eprodromou Should this endpoint be searching all namespaces? (Main, Talk: User:, User Talk:, etc) by default?

Let's say it searches only the main namespace by default. At some point in the future, we'll add a way to search other namespaces.

@eprodromou In the case of a match in title and text on a single page, my assumption is we only return 1 page for both matches. In that case, would we want to return only the title excerpt HTML or both title and text HTML tags?

Change 549238 had a related patch set uploaded (by Nikki Nikkhoui; owner: Nikki Nikkhoui):
[mediawiki/core@master] Basic Search endpoint

https://gerrit.wikimedia.org/r/549238

@eprodromou In the case of a match in title and text on a single page, my assumption is we only return 1 page for both matches. In that case, would we want to return only the title excerpt HTML or both title and text HTML tags?

Ideally, I'd like to get the same results as this page:

https://en.wikipedia.org/w/index.php?sort=relevance&search=Prodromou&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1

I don't think we should be manually determining that title matches should go above text matches. I think it's up to the search engine to do that, or interleave them.

It seems strange that we have to do two searches. Does the "text search" return text and title matches, or only text matches?

@nnikkhoui I just checked; SpecialSearch does in fact do two searches (text and title) and then interleaves them. Do it like SpecialSearch, please!

However, if there's no text match, just leave excerpt null.

Pages that are unreadable by the current user are not returned.

hey @eprodromou - Can you clarify whats meant by that requirement? I was learning a bit about page restrictions and @Clarakosi pointed me here which suggests per page restrictions aren't typical? https://www.mediawiki.org/wiki/Manual:Preventing_access#Restrict_viewing_of_certain_specific_pages.

Unless you meant more like restricting view to all pages? https://www.mediawiki.org/wiki/Manual:Preventing_access#Restrict_viewing

hey @eprodromou - Can you clarify whats meant by that requirement? I was learning a bit about page restrictions and @Clarakosi pointed me here which suggests per page restrictions aren't typical? https://www.mediawiki.org/wiki/Manual:Preventing_access#Restrict_viewing_of_certain_specific_pages.

Unless you meant more like restricting view to all pages? https://www.mediawiki.org/wiki/Manual:Preventing_access#Restrict_viewing

I think on private wikis like office.wikimedia.org you have to check if the user can read the page. There are other configurations of MW that can make it more complicated.

@Pchelolo did this for a few of the endpoints; maybe he can help out with code snippets?

I see this line:

http://phase3.link/includes/Rest/Handler/RevisionHandler.php#55

Does that help?

I did see that line in @Pchelolo 's code and used it in my current implementation, but was writing the tests and was just wondering use cases for when that would actually be possible. So if I understand correctly, this would only occur for completely private wikis then, not that certain pages within a wiki would be private"?

Been testing with CirrusSearch locally and have noticed different behavior between what is delivered for the text snippet from the default MediaWiki search engine vs. CirrusSearch.

If there is only a title match (no text match), CirrusSearch text snippet will return the first few sentences of the page, not in HTML. Default MediaWiki search engine will return null.

Since the behavior between these is inconsistent with what the task requirement is, i will modify endpoint to handle both cases and always spit out null if there is no match in text.

On second thought, might rethink the above ^^ @eprodromou. After talking with @Anomie it might be better to just default to spitting back whatever the search engine gives us from "getTextSnippet()". So if the search terms exists in the text snippet, the snippet will be html. Otherwise, a plain string of the beginning snippet of the page will be returned.

Change 549238 merged by jenkins-bot:
[mediawiki/core@master] Basic Search endpoint

https://gerrit.wikimedia.org/r/549238

I'm re-opening so I can go over the endpoint and make sure it meets the acceptance criteria. @nnikkhoui not a problem, but I try to do this for all the user stories.