Reader searches for a topic
Closed, Resolved (Public)

Description

"As a Reader, I want to get a list of pages that match a search term, so that I can find pages about a topic I’m interested in."

GET /search/page?q={search term}

Searches for pages in the main namespace by a text search term. Returns pages that match the term either in their title or in their text. Pages that are unreadable by the current user are not returned.

Request headers: none
Request body: none

Status:
200 – OK

Headers: none

Body: JSON
object including a single property, "pages".

pages: an array of pages in relevance order, as with search in the Web interface. Maximum of 50 results. Each page result includes:
- id: id of the page
- key: prefixed DB key of the page, like "Main_Page"
- title: title for display, like "Main Page"
- excerpt: excerpt that shows the search term in the text of the page, in HTML, or null if only a title match
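
A minimal client-side sketch of the request and response shape described above. The base URL `https://wiki.example.org/api` and the helper names `build_search_url` and `parse_search_response` are hypothetical and not part of the spec. The sample body is made up to match the listed properties.

```python
import json
from urllib.parse import quote, urlencode


def build_search_url(base_url: str, term: str) -> str:
    """Return the search URL, with the term percent-encoded in the q parameter."""
    query = urlencode({"q": term}, quote_via=quote)
    return f"{base_url.rstrip('/')}/search/page?{query}"


def parse_search_response(body: str) -> list:
    """Extract the page results from a JSON body of the form {"pages": [...]}."""
    pages = json.loads(body)["pages"]
    return [
        {
            "id": page["id"],
            "key": page["key"],
            "title": page["title"],
            # excerpt is null (None) when only the title matched
            "excerpt": page.get("excerpt"),
        }
        for page in pages
    ]


# Hypothetical response body with a single title-only match.
sample_body = json.dumps(
    {"pages": [{"id": 1, "key": "Main_Page", "title": "Main Page", "excerpt": None}]}
)
```

Passing the term through `urlencode` keeps spaces and reserved characters out of the raw URL, which is one practical reason to prefer the `q` parameter over putting the term in the path.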

Details

	Subject	Repo	Branch	Lines +/-
	Basic Search endpoint	mediawiki/core	master	+182 -2


Related Objects

Status	Assigned	Task
Invalid	None	T229661 Core REST API in MediaWiki
Resolved	None	T229662 Minimal client REST API
Resolved	• nnikkhoui	T230844 Reader searches for a topic
Invalid	None	T234392 Implement search endpoint
Invalid	apaskulin	T234393 Document search endpoint
Resolved	• nnikkhoui	T236168 Implement basic search endpoint

Event Timeline

EvanProdromou created this task. Aug 20 2019, 11:25 PM

EvanProdromou moved this task from Epic Backlog to User Stories Needing EM Sign Off on the Platform Team Workboards (Epics) board. Aug 20 2019, 11:31 PM

@EvanProdromou is there any other status code that might make sense? For example if there are no matches for the search?

@kchapman I think the result would be a 200 OK, with an empty list as the body of the results.

• eprodromou moved this task from Epics to Green on the Platform Team Workboards board. Aug 21 2019, 11:40 AM

• eprodromou edited projects, added Platform Team Workboards (Green); removed Platform Team Workboards (Epics).

So, there's a question of whether we use ?q={search term} or /search/page/{search term}

Here's what I see for other APIs:

Twitter - q parameter
Google Custom Search API - q parameter
Slack - query parameter
Facebook Pages Search - q parameter
Github Search - q parameter
Google Maps - input parameter
Dropbox - POST JSON, "query" parameter

O'Reilly's "Designing Web APIs" calls out the q={search term} style in particular (page 12). I haven't found an example of using the search term in the URL if it's not a search

• eprodromou updated the task description. Aug 21 2019, 1:31 PM

Another question that came up during our kickoff was what to do with empty search results. I'm going to try to catalog what different search APIs do in this situation.

Twitter returns a status code 200 OK, with an empty array for the "statuses" property.

WDoranWMF edited projects, added Platform Team Workboards (User Stories); removed Platform Team Workboards (Green). Sep 3 2019, 5:07 PM

version_id should be revision_id, or {"revision": {"id": ...}}.

The SearchResult interface does provide a timestamp, which may be the revision timestamp, so I suppose it would be possible to fetch the revision IDs from the database given that information. If the revision is missing, I suppose it would be most consistent to filter out the result, although it's unclear whether getTimestamp() would be robust enough to allow that for all search engines.

Regarding the excerpt, SearchResult provides the following methods:

getTextSnippet()
getTitleSnippet()
getRedirectSnippet()
getSectionSnippet()
getCategorySnippet()

There are a bunch of them because the search term may match in various fields of the document. In order to leave space for change, and for consistent terminology with the rest of MediaWiki, perhaps instead of "excerpt" we could have an object called "snippets" containing an element called "text": {"snippets": {"text": "The highlighted text snippet"}}

SearchResult has isFileMatch(), reflecting the ability of CirrusSearch to index files with a text layer, such as PDFs. Should such results be filtered out, or should we find a way to display them? For file results, there is a title, but not a revision ID.

I did a quick review and couldn't find an API that returned anything but 200 OK for empty results. I'm going to stand firm that that's the right status, not 404.

I updated the output based on @tstarling 's suggestion that we follow the OWASP recommendation of only using objects for output.
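
The two decisions above (200 OK for empty results, and an object rather than a bare array as the top-level JSON value) can be sketched as follows. The function names are hypothetical, and this is an illustration of the response shape, not the handler code in mediawiki/core.

```python
import json


def search_response_body(pages):
    """Wrap the result list in an object so the top-level JSON value is never an array."""
    return json.dumps({"pages": list(pages)})


def respond(pages):
    """Return a (status, body) pair for a completed search."""
    # An empty result set is still a successful search: 200 with an empty list, not 404.
    return 200, search_response_body(pages)
```

Wrapping the list in an object also leaves room to add sibling properties later without breaking clients that read `pages`.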

• eprodromou mentioned this in T234393: Document search endpoint. Oct 1 2019, 10:47 PM

• eprodromou moved this task from Backlog to Waiting to be Scheduled on the Platform Team Workboards (User Stories) board. Oct 1 2019, 10:59 PM

Is revision_id meant to be the current revision ID of the page, or the revision ID at the time of indexing? My previous comment related to the latter since there are easier ways to get the current revision of a page.

@tstarling I removed the revision ID. I was trying to keep the schema for "short info about a page" about the same across endpoints. Definitely not worth worrying about here.

• eprodromou claimed this task. Oct 8 2019, 4:00 PM

WDoranWMF moved this task from Waiting to be Scheduled to Scheduled in Upcoming Sprint on the Platform Team Workboards (User Stories) board. Oct 11 2019, 5:26 PM

WDoranWMF mentioned this in T236168: Implement basic search endpoint. Oct 22 2019, 2:13 PM

WDoranWMF added a subtask: T236168: Implement basic search endpoint.

GET /search/page?q={search term}

It seems like at least in /revision/{id}/compare/{other_id} we have the structure of /resource_type/action, while here we do it vice versa. What's the standard?

title_key: prefixed DB key of the page, like "Talk:Main_Page"
display_title: title for display, like "Talk:Main Page"

It was noted in some other ticket that it's better to make 'title' an object. Also, display_title has a very specific meaning in MW, and I don't think you mean it. Did you mean the display title generated by https://www.mediawiki.org/wiki/Extension:Display_Title ?

perhaps instead of "excerpt" we could have an object called "snippets" containing an element called "text"

This comment from @tstarling was not incorporated.

SearchResult has isFileMatch(), reflecting the ability of CirrusSearch to index files with a text layer, such as PDFs. Should such results be filtered out, or should we find a way to display them? For file results, there is a title, but not a revision ID.

This question still holds.

Do we have a sense of whether we're standardizing on snake case or camel case? For example, the spec for this endpoint has properties in snake case, while the spec for the get media links endpoint has properties in camel case.

CCicalese_WMF triaged this task as Medium priority. Oct 23 2019, 7:33 PM

In T230844#5595892, @apaskulin wrote:

Do we have a sense of whether we're standardizing on snake case or camel case?

Snake case. I'll put it in the design principles document.

• eprodromou updated the task description. Oct 25 2019, 6:59 PM

• eprodromou updated the task description.

In T230844#5595811, @Pchelolo wrote:

GET /search/page?q={search term}

It seems like at least in /revision/{id}/compare/{other_id} we have the structure of /resource_type/action, while here we do it vice versa. What's the standard?

I don't think we have a standard, but q={search term} is widely used, as I mentioned in a previous comment.

title_key: prefixed DB key of the page, like "Talk:Main_Page"
display_title: title for display, like "Talk:Main Page"

Fixed.

In T230844#5537227, @tstarling wrote:

There are a bunch of them because the search term may match in various fields of the document. In order to leave space for change, and for consistent terminology with the rest of MediaWiki, perhaps instead of "excerpt" we could have an object called "snippets" containing an element called "text": {"snippets": {"text": "The highlighted text snippet"}}

I'm not sure I understand this. Could we maybe have something like a "best" snippet property, snippet, for the casual developer, and then we could add more_snippets or something similar for the more in-depth display?

SearchResult has isFileMatch(), reflecting the ability of CirrusSearch to index files with a text layer, such as PDFs. Should such results be filtered out, or should we find a way to display them? For file results, there is a title, but not a revision ID.

Yikes. So, my first instinct is to filter them out for now, but it would be a breaking change to add a different kind of search result later. I'll make sure we've got it clear by the time we kickoff.

Could we maybe have something like a "best" snippet property

How do we know what's the 'best' snippet? The longest? The shortest? The first non-empty one in a somehow ordered list of snippets? The possibilities are endless and totally depend on the use case. I have a feeling that hiding complexity here will just introduce more complexity.

key: prefixed DB key of the page, like "Talk:Main_Page"

I have commented on this in the schema document, but will reiterate here since this will be implemented sooner. I think 'key' is way too non-specific a key. Also, having display title and key in separate properties goes against our own design principle about combining connected properties into a sub-object.

In T230844#5618087, @Pchelolo wrote:

Could we maybe have something like a "best" snippet property

How do we know what's the 'best' snippet?

Let's start with the snippet that appears on the Web UI for search results. That might not be the "best" but it's probably good enough for a first pass.

Screen Shot 2019-10-30 at 10.44.52 AM.png (1×1 px, 336 KB)

In T230844#5618087, @Pchelolo wrote:

key: prefixed DB key of the page, like "Talk:Main_Page"

I think 'key' is way too non-specific a key

I don't think we're going to have multiple keys, so I think it's probably fine.

Also, having display title and key in separate properties goes against our own design principle about combining connected properties into a sub-object.

I appreciate that, and it's a good point. I am reluctant to make something as intrinsic to the page as its title be part of a sub-object, though.

If it's bugging you a lot, I'd suggest we drop key from the schema, and just leave title.

• nnikkhoui claimed this task. Oct 31 2019, 6:27 PM

Let's start with the snippet that appears on the Web UI for search results.

:) Replicating this would be quite a bit of complexity but in the end the 'snippet' that is shown is a text snippet from the article. However, as you can see, if the end goal is for the API client to build a page with user experience comparable to wiki, having a text snippet is not enough at all.

There's a couple of decisions still to think about:

In MW's 'SearchEngine' we have searchTitle and searchText methods. The Action API allows you to specify what to search; SpecialSearch apparently searches both and can indicate where the match was found (in the title or in the text), with title matches placed higher on the page. Do we want to search both like SpecialSearch? Do we want to indicate what kind of match it was in the result?
If the user doesn't have the rights to read the page, what do we return? Omit the page from the result? Return a title match with an empty snippet?

In T230844#5624443, @Pchelolo wrote:

Let's start with the snippet that appears on the Web UI for search results.

:) Replicating this would be quite a bit of complexity

That does look complex! Am I wrong in thinking that there's one line that actually gets the text snippet?

Do we want to search both like SpecialSearch?

Yes. I'll put that in the user story. If we decide to build search only in the title, or only in the text, in the future, we'll either add a query parameter to this endpoint, or create a different endpoint.

If the user doesn't have the rights to read the page - what do we return? Omit the page from the result?

Yes.
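
A minimal sketch of the "omit unreadable pages" rule agreed above. The `can_read` callback stands in for whatever permission check the wiki provides; both it and `readable_results` are hypothetical names used only for illustration.

```python
from typing import Callable, Iterable


def readable_results(results: Iterable[dict], can_read: Callable[[str], bool]) -> list:
    """Keep only the results whose page the current user is allowed to read."""
    # Unreadable pages are dropped entirely, rather than returned with an empty excerpt.
    return [result for result in results if can_read(result["key"])]
```

Note that filtering after the search means a response can contain fewer results than the search engine returned.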

• eprodromou updated the task description. Nov 1 2019, 3:07 PM

• eprodromou moved this task from Scheduled in Upcoming Sprint to Doing on the Platform Team Workboards (User Stories) board. Nov 5 2019, 3:35 PM

@eprodromou Should this endpoint be searching all namespaces (Main, Talk:, User:, User Talk:, etc.) by default? Or should it accept a param that indicates which to search (like the current "opensearch" endpoint does)?

In T230844#5641345, @nnikkhoui wrote:

@eprodromou Should this endpoint be searching all namespaces (Main, Talk:, User:, User Talk:, etc.) by default?

Let's say it searches only the main namespace by default. At some point in the future, we'll add a way to search other namespaces.

• eprodromou updated the task description. Nov 6 2019, 5:41 PM

sounds good thanks!

@eprodromou In the case of a match in title and text on a single page, my assumption is we only return 1 page for both matches. In that case, would we want to return only the title excerpt HTML or both title and text HTML tags?

Change 549238 had a related patch set uploaded (by Nikki Nikkhoui; owner: Nikki Nikkhoui):
[mediawiki/core@master] Basic Search endpoint

https://gerrit.wikimedia.org/r/549238

gerritbot added a project: Patch-For-Review. Nov 7 2019, 1:21 AM

In T230844#5642767, @nnikkhoui wrote:

@eprodromou In the case of a match in title and text on a single page, my assumption is we only return 1 page for both matches. In that case, would we want to return only the title excerpt HTML or both title and text HTML tags?

Ideally, I'd like to get the same results as this page:

https://en.wikipedia.org/w/index.php?sort=relevance&search=Prodromou&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1

I don't think we should be manually determining that title matches should go above text matches. I think it's up to the search engine to do that, or interleave them.

It seems strange that we have to do two searches. Does the "text search" return text and title matches, or only text matches?

@nnikkhoui I just checked; SpecialSearch does in fact do two searches (text and title) and then interleaves them. Do it like SpecialSearch, please!

However, if there's no text match, just leave excerpt null.
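
A rough sketch of combining the two searches into one entry per page, with excerpt left null for title-only matches. The ordering used here (title hits first, then remaining text hits) is a placeholder assumption and is not claimed to reproduce how SpecialSearch interleaves results. `merge_results` and the hit dictionaries are hypothetical.

```python
def merge_results(title_hits, text_hits, limit=50):
    """Combine title and text matches so that each page appears only once."""
    merged = {}
    for hit in title_hits:
        # Title-only matches carry no excerpt.
        merged[hit["id"]] = {**hit, "excerpt": None}
    for hit in text_hits:
        # Reuse the existing entry if the page also matched by title.
        entry = merged.setdefault(hit["id"], {**hit, "excerpt": None})
        entry["excerpt"] = hit.get("excerpt")
    # dicts preserve insertion order, so this keeps the placeholder ordering.
    return list(merged.values())[:limit]
```

A page that matches in both title and text ends up as a single result carrying the text excerpt.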

• eprodromou updated the task description. Nov 7 2019, 2:46 PM

Thanks for clarifying @eprodromou !

Pages that are unreadable by the current user are not returned.

Hey @eprodromou - Can you clarify what's meant by that requirement? I was learning a bit about page restrictions and @Clarakosi pointed me here, which suggests per-page restrictions aren't typical? https://www.mediawiki.org/wiki/Manual:Preventing_access#Restrict_viewing_of_certain_specific_pages.

Unless you meant more like restricting view to all pages? https://www.mediawiki.org/wiki/Manual:Preventing_access#Restrict_viewing

• eprodromou closed subtask T234393: Document search endpoint as Invalid. Nov 14 2019, 9:10 PM

• eprodromou closed subtask T234392: Implement search endpoint as Invalid.

• eprodromou mentioned this in T229662: Minimal client REST API. Nov 14 2019, 9:34 PM

In T230844#5664170, @nnikkhoui wrote:

Hey @eprodromou - Can you clarify what's meant by that requirement? I was learning a bit about page restrictions and @Clarakosi pointed me here, which suggests per-page restrictions aren't typical? https://www.mediawiki.org/wiki/Manual:Preventing_access#Restrict_viewing_of_certain_specific_pages.

Unless you meant more like restricting view to all pages? https://www.mediawiki.org/wiki/Manual:Preventing_access#Restrict_viewing

I think on private wikis like office.wikimedia.org you have to check if the user can read the page. There are other configurations of MW that can make it more complicated.

@Pchelolo did this for a few of the endpoints; maybe he can help out with code snippets?

I see this line:

http://phase3.link/includes/Rest/Handler/RevisionHandler.php#55

Does that help?

I did see that line in @Pchelolo's code and used it in my current implementation, but I was writing the tests and was wondering about the use cases where that would actually be possible. So if I understand correctly, this would only occur for completely private wikis, not for certain private pages within a wiki?

I've been testing with CirrusSearch locally and have noticed that the text snippet delivered by the default MediaWiki search engine differs from the one delivered by CirrusSearch.

If there is only a title match (no text match), the CirrusSearch text snippet will return the first few sentences of the page, not in HTML. The default MediaWiki search engine will return null.

Since the behavior of these two is inconsistent with the task requirement, I will modify the endpoint to handle both cases and always return null if there is no match in the text.

On second thought, I might rethink the above ^^ @eprodromou. After talking with @Anomie, it might be better to just default to returning whatever the search engine gives us from "getTextSnippet()". So if the search term exists in the text snippet, the snippet will be HTML. Otherwise, a plain string of the beginning snippet of the page will be returned.

Great

Change 549238 merged by jenkins-bot:
[mediawiki/core@master] Basic Search endpoint

https://gerrit.wikimedia.org/r/549238

ReleaseTaggerBot added a project: MW-1.35-notes (1.35.0-wmf.8; 2019-11-26). Nov 20 2019, 2:00 PM

• nnikkhoui closed this task as Resolved. Nov 25 2019, 9:16 PM

I'm re-opening so I can go over the endpoint and make sure it meets the acceptance criteria. @nnikkhoui not a problem, but I try to do this for all the user stories.

• eprodromou closed this task as Resolved. Jan 10 2020, 2:44 PM

• eprodromou closed subtask T236168: Implement basic search endpoint as Resolved.

Aklapper removed a subscriber: Anomie. Oct 16 2020, 5:41 PM

Maintenance_bot removed a project: Patch-For-Review. Oct 16 2020, 6:38 PM