TextExtracts extension frequently slows down opensearch API by several seconds
Closed, Resolved, Public

Description

This API generally responds in under 50 ms, but there is a "slow case" that is consistently hit every single minute for dozens of queries, causing timeout fatals and other errors in production.

Grafana: https://grafana.wikimedia.org/d/000000559/api-requests-breakdown?orgId=1&refresh=5m&var-metric=p95&var-module=opensearch

  • 50th percentile (per minute): ~ 50ms
  • 75th percentile (per minute): ~ 70ms
  • 95th percentile (per minute): ~ 700ms (!)
  • 99th percentile (per minute): ~ 1.2 minutes (!)
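
To illustrate why the median can look healthy while the tail does not, here is a minimal sketch. The latency samples are made up to resemble the shape described above (they are not the production data behind the Grafana dashboard), and the nearest-rank percentile is a simple stand-in for whatever aggregation the dashboard actually uses.

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, clamped to at least 1.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical one-minute sample of 100 requests, in milliseconds:
# 94 fast responses, 4 moderately slow ones, and 2 that take over a minute.
latencies_ms = [50] * 74 + [70] * 20 + [700] * 4 + [72_000] * 2

for p in (50, 75, 95, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

With only 2 in 100 requests hitting the slow case, the p50 and p75 are unaffected while the p99 jumps to over a minute.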

I looked at Logstash fatals with ApiOpenSearch in their trace, and collected a few debug profiles in XHGui.

It seems that the main distinguishing factor between a fast response and a 10-20s+ slow response is whether TextExtracts is doing work.


Example query: https://commons.wikimedia.org//w/api.php?format=json&action=opensearch&search=Ice
XHGui profile: https://performance.wikimedia.org/xhgui/run/view?id=5df3ac043f3dfa6e273f3ba3

This took 19.9 seconds to respond, with 19.7 seconds spent in TextExtracts\ApiQueryExtracts::execute.

This method in turn spends most of its time synchronously parsing wikitext and executing Lua modules for dozens of pages in a row.

The slowest endpoints we have in terms of perf budget are the POST requests for saving edits, which do this once. Doing this dozens of times in a single request is far more than was ever meant to happen during a single web request, especially a GET request to a high-traffic endpoint like the OpenSearch API.

It is also unclear to me why this doesn't use ParserOutput objects from the ParserCache. Instead, it appears to be fetching raw wikitext and feeding it uncached to the Parser, and then caching the result in a custom Memcached key (no warmup in place, no WANCache protections in place).

Event Timeline

From what I can tell, this hook populates an (optional) property in the OpenSearch output which should make it fairly easy to disable without breaking clients.

Having said that, I could not find any major uses of it.

  • On desktop (Vector) the search suggestions don't use the caption (and can't realistically, because it's not a standard part of the OpenSearch output as core generates it).
  • On mobile (Minerva) a different API is used (prefixsearch, not opensearch) and it uses Wikibase descriptions as captions, not text extracts.

As such, I would recommend immediate undeployment, at least over the holidays, until there is time and resourcing to get a better understanding of:

  • What this was/is meant to be used for.
  • Whether it is meant to be included in all OpenSearch output by default (e.g. no prop= parameter, like we normally do for custom properties like this).
  • How we can accommodate it performance-wise. The current implementation is quite inefficient, but this is by no means required to yield its current result. Fundamentally, this feature is very doable on a tight budget (e.g. using existing ParserOutput, non-DOM fallback if cache miss, possibly a post-save primer towards page_props, or ParserCache).
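
As a rough illustration of the "existing cached output first, cheap non-DOM fallback on a miss" idea, here is a minimal sketch. Everything in it is hypothetical: a plain dict stands in for ParserCache, `load_wikitext` stands in for fetching page content, and the regex-based stripping is a crude approximation, not the logic TextExtracts or MediaWiki actually uses.

```python
import re


def plain_text_fallback(wikitext, max_chars=80):
    """Cheap approximation of an extract: strip simple markup with regexes, no parsing."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                  # drop non-nested templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)   # keep link labels only
    text = re.sub(r"'{2,}", "", text)                               # drop bold/italic markers
    return " ".join(text.split())[:max_chars]


def get_extract(title, cache, load_wikitext):
    """Return a cached extract if present; otherwise use the cheap fallback.

    `cache` is a hypothetical stand-in for already-parsed output; the
    expensive full parse is never triggered from this request path.
    """
    cached = cache.get(title)
    if cached is not None:
        return cached
    return plain_text_fallback(load_wikitext(title))


cache = {"Ice": "Ice is water frozen into a solid state."}
pages = {"Snow": "{{Infobox}}'''Snow''' is [[precipitation|frozen]] [[water]]."}

print(get_extract("Ice", cache, pages.__getitem__))   # served from the cache
print(get_extract("Snow", cache, pages.__getitem__))  # cheap fallback on a miss
```

The point of the design is that a cache miss costs a few regex passes rather than a full wikitext parse with Lua execution.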

Having said that, I could not find any major uses of it.

  • On desktop (Vector) the search suggestions don't use the caption (and can't realistically, because it's not a standard part of the OpenSearch output as core generates it).
  • On mobile (Minerva) a different API is used (prefixsearch, not opensearch) and it uses Wikibase descriptions as captions, not text extracts.

Note that the OpenSearch endpoint is also used by browsers' search bars. I have no idea whether any of them make use of that data either though.

I don't know why those skins are using it instead of prefixsearch, since prefixsearch provides better formatting and error reporting.

As such, I would recommend immediate undeployment, at least over the holidays, until there is time and resourcing to get a better understanding of:

I hope you don't mean undeployment of the whole extension. Just set $wgExtractsExtendOpenSearchXml = false;.

OTOH, is this something new, or just something you just happened to notice recently?

  • What this was/is meant to be used for.

OpenSearch in general is for supporting browser search bars and other things that use the OpenSearch protocol for generically searching "sites".

As for the extracts specifically, r40100 hints that IE 8 might have used them. No idea if Edge still does. No idea about any other clients.

  • Whether it is meant to be included in all OpenSearch output by default (e.g. no prop= parameter, like we normally do for custom properties like this).

The trick would be in updating the URL in whatever browser search plugins exist for browsers that actually use the data.

As such, I would recommend immediate undeployment, at least over the holidays, until […]

I hope you don't mean undeployment of the whole extension. Just set $wgExtractsExtendOpenSearchXml = false;.

Yeah, just removing it from the critical path of OpenSearch, which is a lot more widely exposed. TextExtract has its own endpoint and features where it is explicitly requested by end-users, which isn't as urgent.

OTOH, is this something new, or just something you just happened to notice recently?

I try to triage Logstash from time to time. I skipped October and November, during which train operators were mostly on their own (and they hadn't seen or reported it). I noticed it this week and don't recall having seen it before, at least not this commonly. We're constantly growing as a platform, and lots of small tinkering and optimisations in different places could have indirectly led to this becoming as prominent an issue as it is today.

  • What this was/is meant to be used for.

OpenSearch in general is […]

I was under the (false) impression that the extra "description" field was a non-standard field we added for an internal use case of our own (e.g. for mobile web or app search suggestions), but that perhaps we stopped using it and forgot about it.

Looking at the OpenSearch spec in more detail, I see that it is actually part of the spec but is simply set to false (and cast to empty string) by default without TextExtracts.

The spec actually comes in two parts. There is the "Discovery" spec, which is used for detecting that a site provides a search engine, and to which URL it should submit searches. E.g. it tells browsers like Chrome that if you want your address bar to be a Wikipedia search engine, pressing return will route you to https://en.wikipedia.org/w/index.php?title=Special:Search&search=<query>.

There is also a secondary spec for "Suggestions", which specifies that the Discovery document, in addition to providing a URL template for viewing a search result page, can also provide a URL template from which the browser itself can fetch XML or JSON data, along with the format of that data. The response includes the query, the completion suggestions, and an optional description to show below each suggestion.
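
For reference, the JSON form of a Suggestions response is a single array: the query string, a list of completions, and optionally a list of descriptions and a list of URLs. A minimal sketch of how a client might read it follows; the sample payload is illustrative (not captured from production), and the `Suggestion` type and `parse_suggestions` name are made up for this example.

```python
import json
from itertools import zip_longest
from typing import List, NamedTuple, Tuple


class Suggestion(NamedTuple):
    title: str
    description: str  # empty string when the server provides no description
    url: str


def parse_suggestions(body: str) -> Tuple[str, List[Suggestion]]:
    """Parse an OpenSearch Suggestions JSON array into (query, suggestions)."""
    data = json.loads(body)
    query, titles = data[0], data[1]
    # Descriptions and URLs are optional; treat missing entries as empty strings.
    descriptions = data[2] if len(data) > 2 else []
    urls = data[3] if len(data) > 3 else []
    rows = zip_longest(titles, descriptions, urls, fillvalue="")
    return query, [Suggestion(*row) for row in rows]


# Illustrative payload with empty descriptions.
sample = json.dumps([
    "Ice",
    ["Ice", "Iceland"],
    ["", ""],
    ["https://example.org/wiki/Ice", "https://example.org/wiki/Iceland"],
])

query, suggestions = parse_suggestions(sample)
for s in suggestions:
    print(s.title, repr(s.description), s.url)
```

A client written this way copes equally well with populated, empty, or absent descriptions.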

  • Whether it is meant to be included in all OpenSearch output by default (e.g. no prop= parameter, like we normally do for custom properties like this).

The trick would be in updating the URL in whatever browser search plugins exist for browsers that actually use the data.

My thinking was that the rare non-standard consumer that specifically wants this would opt-in (if we want it exposed through this API at all). But given it is part of the spec, I'm less certain. Depending on how common it is for browsers to actually use this (see below) and how valuable we think this is, it might make sense to not add it in general. Any specific consumer interested in these could still get it through prefixsearch+prop=extracts.
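
As a sketch of what such an opt-in request could look like, the snippet below builds an action=query URL that uses prefixsearch as a generator and asks for a page property. The helper name is made up, and the parameter set is an assumption to check against the target wiki's API help: prop=extracts in particular usually needs further ex* parameters to return something short, and prop=description is an alternative where Wikibase descriptions are wanted instead.

```python
from urllib.parse import urlencode


def prefixsearch_url(api_base: str, search: str, prop: str = "extracts", limit: int = 10) -> str:
    """Build a query URL using prefixsearch as a generator (hypothetical helper)."""
    params = {
        "format": "json",
        "action": "query",
        "generator": "prefixsearch",
        "gpssearch": search,   # generator form of the prefixsearch 'search' parameter
        "gpslimit": limit,
        "prop": prop,          # e.g. "extracts" or "description"
    }
    return f"{api_base}?{urlencode(params)}"


print(prefixsearch_url("https://en.wikipedia.org/w/api.php", "Ice"))
print(prefixsearch_url("https://en.wikipedia.org/w/api.php", "Ice", prop="description"))
```

This keeps the expensive property off the default OpenSearch path while leaving it available to consumers that explicitly ask for it.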

As for the extracts specifically, r40100 hints that IE 8 might have used them. No idea if Edge still does. No idea about any other clients.

TL;DR:

  • Firefox supports suggestions (titles only, no descriptions).
  • Safari supports suggestions (titles only, no descriptions).
  • Chrome supports suggestions (titles only, no descriptions).
  • IE 11 and Edge do not support any suggestions (only direct submission to Special:Search).
  • IE 8 and IE 10 (both end-of-life) did support suggestions and this included images and descriptions. IE 9 was broken.

Here goes…

Mozilla Firefox
  • Contains "Wikipedia (en)" by default as one of the 8 pre-installed opensearch templates (gecko source code)
  • Uses our Special:Search URL for direct submissions (text/html).
  • Type-ahead suggestions can be toggled from the browser preferences.
  • When enabled, it uses api.php?action=opensearch (JSON) for search suggestions.
  • Titles only. Descriptions not used.

Google Chrome
  • Does not ship with a Wikipedia by default.
  • Discovers the "Wikipedia (en)" search engine when browsing the site and registers it as an option in the preferences if and only if you've explicitly navigated to https://en.wikipedia.org/ at some point (e.g. going to a specific article directly from another website or search engine does not cause it to be added). – https://dev.chromium.org/tab-to-search
  • Uses our Special:Search URL for direct submissions (text/html).
  • Type-ahead suggestions can be toggled from the browser preferences.
    • While this preference exists, it is quite confusing to find because it is placed under "Google services". Also, even when this option is turned on, Chrome only shows at most three suggestions. Usually fewer because past searches and visited URLs take precedence and eat into that quota.
  • Titles only. Descriptions not used.

Apple Safari
  • Does not ship with a Wikipedia by default.
  • Has an explicit browser setting that controls whether it will pick up a website's search field ("Enable Quick Website Search")
  • These "quick search" options can't be used as the default for address bar queries. That mechanism is limited to an immutable list of four hardcoded general search engines (DuckDuckGo, Bing, Google, and Yahoo). Instead, the "quick search" options can be triggered by typing part of the website's title or domain name. It ignores our custom "Wikipedia (en)" title though, using the title of your past visit instead (e.g. "Wikipedia, the free encyclopedia"). Also, unlike Chrome and Firefox, Safari doesn't tell the user they have to press Tab to enter the search suggestion mode, but does require this. Once you press Tab, what you typed is removed in favour of en.wikipedia.org: and the user can continue to get suggestions.
  • Type-ahead suggestions are enabled by default for all "Quick Website Search" engines the browser has discovered. I couldn't find a way to opt-out.
  • Titles only. Descriptions not used.
  • See also: webkit bug #16030.

IE 8 on Windows 7 (end-of-life)
  • Does not ship with a Wikipedia by default.
  • Discovers the "Wikipedia (en)" engine while browsing the site, but it is not remembered by default (unlike Chrome/Firefox/Safari). Instead, the search box's arrow menu turns orange, inviting the user to remember it.
  • When choosing to remember it, the browser asks (on a per-website basis) whether to provide suggestions as-you-type.
  • Uses titles. Uses descriptions. And also uses images. The images are not part of the OpenSearch spec, but rather a non-standard Microsoft extension that is also specific to the XML format (and we implement that in core).

IE 9 on Windows 7 (end-of-life)
  • Does not ship with a Wikipedia by default.
  • Does not discover our engine while browsing the site. IE 9 joined other modern browsers in no longer having a dedicated search box. It seems that with this, it also removed support for OpenSearch.
  • Through some deep browser settings, there is an "Add-ons" feature which has a subsection for "Search Providers". There is an option there about allowing programs to suggest new search engines, but best I can tell this really is about "programs" (e.g. native apps) and not websites. However there is a Microsoft-owned list of popular search engines one can install. On that page, Wikipedia is promoted as well although it does not (directly) use our API. It seems to route through a microsoft.com url when submitting search queries. However, it only submits searches upon hitting return. From what I could find in a few minutes, it does not appear to support suggestions. The chosen option also did not appear in the "Search providers" list, which meant I couldn't see the full url it uses.
  • No search suggestions. Descriptions not used.

IE 10 on Windows 7 (end-of-life)
  • Does not ship with a Wikipedia by default.
  • Does not discover our engine while browsing the site (same as IE 9).
  • Through "Add-ons > Search Providers" one can add a "Wikipedia" option, provided by Microsoft. The full url it registers is api.php?action=opensearch (XML) for search suggestions.
  • The user interface asks for search suggestions upon installation (same as IE 8 originally did). It also tells you every time it is being used that suggestions are coming from Wikipedia with a contextual way to immediately opt-out.
  • The suggestions are off by default for Bing, with a contextual way to opt-in.
  • Uses titles. Uses descriptions. And also uses images (same as IE 8).

IE 11 on Windows 7 or Windows 8.1
  • Discovers our engine while browsing the site and remembers it automatically. This is new in IE 11 (it was absent in IE9/IE10).
  • However unlike IE8, there isn't a way within the toolbar to select it. Instead you have to make it the default via the "Search providers" settings. It correctly registers api.php?action=opensearch (XML) for search suggestions. But, the suggestions do not appear to actually work. No matter how much I try. They are configured, but don't appear. It behaves similar to IE 9: It submits the query upon pressing return to Special:Search.
  • Confirmed both with IE 11 on Windows 7 (end-of-life) and on Windows 8.1
  • Also confirmed that when deleting the auto-discovered search provider and then manually installing the one from the add-on store, it is similarly limited and without suggestions.
  • No search suggestions. Descriptions not used.
(Screenshots: IE11-Win7, IE11-Win8.1)

Edge 18 on Windows 10

(This is the latest stable release of Edge as of writing. Note that this version is not yet based on Chromium.)

  • Discovers our engine while browsing the site and remembers it automatically. (same as IE 11)
  • There isn't an (obvious) way to use it unless it is made the default.
  • Only submits to Special:Search.
  • No search suggestions. Descriptions not used.

Chrome seems to have supported this at some point (although it's not clear if it was really OpenSearch-Based) but now I can't see any behavior like that.

Firefox intends to deprecate OpenSearch, although it's not clear if they also plan to abandon the search bar support or just the method for installing custom search plugins (their stated reason makes more sense for the latter).

For the few browsers that do use it, I am skeptical that text extracts in their current form are useful. They are pretty low quality and have been replaced in our own products with the extracts in the RESTBase summary API (related: T213505: RfC: OpenGraph descriptions in wiki pages which is about making those available within MediaWiki). And in any case, descriptions (as returned by the query+description API or wikidata) are probably a better fit for a search API; our own mobile search interfaces also use that for suggestions.

Those come from page_props and should be reasonably fast, so how about replacing TextExtracts with those instead?

  • On mobile (Minerva) a different API is used (prefixsearch, not opensearch) and it uses Wikibase descriptions as captions, not text extracts.

That's a route we might go, actually: have MediaWiki-extensions-WikibaseClient (or whichever) use the ApiOpenSearchSuggest hook to provide the Wikibase description for this data.

Firefox intends to deprecate OpenSearch, although it's not clear if they also plan to abandon the search bar support or just the method for installing custom search plugins (their stated reason makes more sense for the latter).

... Sigh, Mozilla. I find the comment there interesting.

They are pretty low quality and have been replaced in our own products with the extracts in the RESTBase summary API (related: T213505: RfC: OpenGraph descriptions in wiki pages which is about making those available within MediaWiki).

IOW, people made "TextExtracts version 2" as a nodejs service, and now they're considering porting it back to PHP.

Jdlrobson added a subscriber: Jdlrobson.

FWIW I think we should sunset TextExtracts but that's a difficult decision to make given some gadget developers use it and I'm not sure who would make it. There is no product support in WMF for maintaining it.

Chrome seems to have supported this at some point (although it's not clear if it was really OpenSearch-Based) but now I can't see any behavior like that.

Firefox intends to deprecate OpenSearch, although it's not clear if they also plan to abandon the search bar support or just the method for installing custom search plugins (their stated reason makes more sense for the latter).

For the few browsers that do use it, I am skeptical that text extracts in their current form are useful. They are pretty low quality and have been replaced in our own products with the extracts in the RESTBase summary API (related: T213505: RfC: OpenGraph descriptions in wiki pages which is about making those available within MediaWiki). And in any case, descriptions (as returned by the query+description API or wikidata) are probably a better fit for a search API; our own mobile search interfaces also use that for suggestions.

Those come from page_props and should be reasonably fast, so how about replacing TextExtracts with those instead?

Big +1
FWIW the reading web team do not actively support TextExtracts.

IOW, people made "TextExtracts version 2" as a nodejs service, and now they're considering porting it back to PHP.

We are not planning to port this back to PHP as far as I'm aware. Using Node.js was a conscious decision to share logic with mobile apps and to make use of the complicated query selectors we use (e.g. flattening nodes). I think reimplementing this in PHP would not be the best of ideas. You'll have to rewrite a lot of the libraries and then keep them in sync.

The first heading in T213505 is "Porting the summary logic in the Page Content Service to MediaWiki"...

The headings in that post are just possible options ("Using Page Content Service data in MediaWiki page HTML" and "Add a "functional" mode to the PCS summary endpoint, where it takes all data" are the others)

There's been a lot of discussion there about whether porting to PHP is practically possible, and the current proposal is to use the job queue with the existing REST service (T213505#5713196).

AIUI PCS (or MCS back then) was written in node because of the uncacheability of the action API, and the poor state of HTML5 support in PHP. Those have to be fixed for the Parsoid port anyway, and largely have been; the two things that are still blockers are proper DOM standard support in PHP (T217867) and a RESTBase-like cache (T227776, I think?).
The summary endpoint in PCS shares a lot of logic with other endpoints, so porting might mean a lot of code duplication; OTOH, that code sharing is the reason for it being pretty slow (the way it is written heavily prioritizes maintainability over performance) so maybe a significantly rewritten PHP port makes sense at some point (but not now, per above). I think that is the current conclusion of the RfC as well.

In any case, the RfC would make extracts available in MediaWiki, in one form or another. At that point, the TextExtracts API should probably just drop its own extraction logic and start using them. I still wouldn't use them for opensearch, where IMO Wikidata descriptions make way more sense.

So, any thoughts about switching to Wikidata descriptions (via DescriptionLookup in Wikibase Client, see the current API) as a way of moving forward? Given the very limited support for opensearch descriptions, I think that could move forward without any product discussion.

FWIW I think we should sunset TextExtracts but that's a difficult decision to make given some gadget developers use it and I'm not sure who would make it. There is no product support in WMF for maintaining it.
[…] the reading web team do not actively support TextExtracts.

Can you clarify what you mean by "usage" here? Are you referring to API requests to action=query&prop=extracts, or the description field it injects into action=opensearch? This task is about the latter.

As steward, I need your team to make a call on this. Note that in terms of compatibility, it won't break any consumers because the field is optional and defaults to a string and we already return empty string in some cases. This would be about turning off the TextExtracts hook that enhances/replaces it with the PHP-backed extraction logic. Which is incurring a major performance cost right now for all search suggestions from Vector, browsers and other apps - which afaik don't use it, but we compute it regardless. I suppose maybe this was intended for use in Minerva, but that didn't happen given it uses action=prefixsearch now, with optional wikibase descriptions, not text extracts.

Krinkle triaged this task as High priority. Dec 18 2019, 8:43 PM

(Again, if any of our products use action=opensearch, that's a bug and should be fixed. It is specifically provided for OpenSearch clients such as web browsers, which expect a very specific response format. For any code that's specific to MediaWiki and can deal with a JSON response, prefixsearch should be superior.)

Okay, in the case of action=opensearch we don't currently use that for anything as far as I'm aware. Possibly it was added for an extension or gadget? Can we turn it off by default for queries that do not explicitly ask for it, for now? Would that suffice?

The opensearch API provides a single format. No client can or is asking for any field of it in particular. Its array format contains 3 values like [title, description, url] where description defaults to an empty string in MediaWiki. As far as I can tell, both the internal use of it by default in core for all search suggestions in skins, as well as external consumers, don't use that field, and either way can handle the empty string already.

Turning it off would mean opensearch produces the empty string for that field like it did before TextExtracts was deployed. The dedicated API for querying extracts would remain as-is.

Turning it off would mean opensearch produces the empty string for that field like it did before TextExtracts was deployed. The dedicated API about querying extracts isn't in question and would remain as-is.

That sounds fine to me.

It is specifically provided for OpenSearch clients such as web browsers, which expect a very specific response format. For any code that's specific to MediaWiki and can deal with a JSON response, prefixsearch should be superior.

  • Mar 2006 (86c655d7ab, r13335): add ajaxsearch.js, ajax.php, and wfSajaxSearch().
  • Aug 2006 (ab6222084d, r16137): add <link rel=application/opensearchdescription+xml> and opensearch_desc.php (registers Special:Search as submission endpoint for HTML search results page).
  • Oct 2006 (b56d23ed46, r17005): add ApiOpenSearch.
  • Oct 2006 (5b3ca07293, r17040): extend opensearch_desc to also register ApiOpenSearch as suggestions endpoint.
  • Apr 2008 (d6fd8e7c13, r33400): replace ajaxsearch.js by mwsuggest.js and use api.php/opensearch instead of ajax.php/wfSajaxSearch.
  • Apr 2014 (af6d9aba6d0f): add ApiQueryPrefixSearch.

Change 559224 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[operations/mediawiki-config@master] Disable wgExtractsExtendOpenSearchXml

https://gerrit.wikimedia.org/r/559224

Change 559224 merged by jenkins-bot:
[operations/mediawiki-config@master] Disable wgExtractsExtendOpenSearchXml

https://gerrit.wikimedia.org/r/559224

Mentioned in SAL (#wikimedia-operations) [2019-12-20T16:27:18Z] <krinkle@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Ia9190a4e5, T240691: Disable wgExtractsExtendOpenSearchXml (duration: 00m 55s)

Krinkle claimed this task.

Overall CPU load from this endpoint on the api cluster was cut by about 60% (from ~43s concurrent per second down to ~17s/s).

Latency reductions for end-users using suggestions on desktop:

  • 98% at the p99 (from 1min to 1sec).
  • 60% at the p95 (from 150ms to 60ms).
  • 40% at the p75 (from 75ms to 43ms).
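
For anyone wanting to check the arithmetic, the relative reduction is (before - after) / before. A minimal sketch using the before/after figures quoted above (the helper name is made up for this example):

```python
def reduction_pct(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# Before/after pairs as quoted in this task.
measurements = {
    "p99 latency (ms)": (60_000, 1_000),
    "p95 latency (ms)": (150, 60),
    "p75 latency (ms)": (75, 43),
    "CPU load (s/s)": (43, 17),
}

for name, (before, after) in measurements.items():
    print(f"{name}: {reduction_pct(before, after):.0f}% lower")
```

The figures in the task are rounded, so the computed values may differ from them by a few percentage points.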