There is currently a noticeable difference in performance when an image suggestion response includes MediaSearch results. While some difference is probably unavoidable, we should see if we can do better than we currently are. Investigate possibilities for improving this and make any resulting adjustments.
@BPirkle were you able to do any investigation into the performance of the actual request to MS itself vs. how long the API takes to process it?
I did a bit of profiling today.
I logged into Toolforge and used curl to hit MediaSearch. Using curl's built-in timing mechanisms (see https://curl.se/docs/manpage.html, search for "time_total") I was seeing total times ranging from 0.3 seconds to 0.7 seconds. The URLs I was hitting were of the form:
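A rough Node-flavoured equivalent of those curl commands, showing the general query shape (the search term and gsrlimit below are placeholders; the real calls substitute the page title and the image-matching result count):

```
// Sketch only: build a MediaSearch-style Action API query against Commons and
// time a single request, roughly what curl's time_total measures. Parameter
// values are illustrative, not the exact ones the service uses.
async function timeOne(): Promise<void> {
  const params = new URLSearchParams({
    action: 'query',
    format: 'json',
    generator: 'search',
    gsrsearch: 'fish',   // placeholder: the title of the page we want images for
    gsrnamespace: '6',   // File: namespace on Commons
    gsrlimit: '10',      // placeholder: varies with the number of image-matching results
    prop: 'imageinfo',
    iiprop: 'url'
  });
  const url = `https://commons.wikimedia.org/w/api.php?${params}`;

  const start = performance.now();
  const res = await fetch(url);   // Node 18+ global fetch
  await res.json();
  console.log(`total: ${((performance.now() - start) / 1000).toFixed(3)}s`);
}

timeOne();
```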
This is reasonably representative of our current usage in the service. The actual gsrlimit will vary depending on the number of results that were found by the image matching algorithm (or up to 10 in the cases where either there are no ima (image matching) results, or source filtering specified only ms (MediaSearch)).
This is not good news. Requesting results for 7 pages at half a second each is 3.5 seconds. We're already executing our calls asynchronously rather than sequentially (which is why we're not seeing 3.5 second response times), and maybe there's something we can improve in our code. But unless we change how we're using MediaSearch, this is still going to be slow.
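Concretely, the concurrent pattern looks something like the sketch below; fetchMediaSearch is a hypothetical stand-in for the service's actual request helper, not our real code.

```
// Sketch: fire the MediaSearch query for every candidate page at once and wait
// for them together.
async function fetchMediaSearch(title: string): Promise<unknown> {
  const params = new URLSearchParams({
    action: 'query', format: 'json', generator: 'search',
    gsrsearch: title, gsrnamespace: '6', gsrlimit: '10'
  });
  const res = await fetch(`https://commons.wikimedia.org/w/api.php?${params}`);
  return res.json();
}

async function suggestionsFor(titles: string[]): Promise<unknown[]> {
  // All requests are in flight concurrently, so wall-clock time is roughly the
  // slowest individual call rather than the sum - but that slowest call (~0.5s)
  // still bounds the overall response time.
  return Promise.all(titles.map((t) => fetchMediaSearch(t)));
}
```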
For fun, I tried cranking gsrlimit up to 100. That caused me to more consistently get total times in the 0.6-0.9 second range, so the number of results did impact the times. However, the times did not scale linearly with the number of results (or else we'd have seen much larger total times).
I see no way in the Action API to make multiple search requests in one API call. I can add additional keywords to the call, but then I get results that apply to that combination of keywords, not separate results for each keyword. I can use other tricks to add additional generators to the API call, so I could get additional info about the same pages in one call. That would be great if we were having to make one "search" API call to get the list of pages and then a separate API call to get additional details about those same pages. But that's not our use case.
I did notice that the MediaSearch response time dropped SIGNIFICANTLY if I added more keywords. So searching for "taco" took about half a second, while searching for "fish taco tamale cheese" took less than a tenth of a second. (I guess I was hungry while I was doing this...) However, the only info we currently have to search by is the page title, which doesn't give much opportunity for additional keywords.
This may point to a larger issue with our architecture. If we were using the recommendations flag in Search to determine which pages had recommendations (instead of using the .tsv files), we could potentially also get back, as part of the same Search query, more information that we could use as MediaSearch keywords. This might give us both better and faster results from MediaSearch. But that's not where we are today.
Will keep digging. Some things I've thought about exploring:
- see how much of the slowdown might be in our service code, and if there's any way to improve it (probably at best a small improvement)
- rework the entire PoC architecture to use Search flags as the source of unillustrated pages (almost certainly too big a change for right now)
- make an API call to some endpoint (which?) to get better keywords, then do the MediaSearch query with those keywords instead of the page title. If both calls were really fast, the total time for the two combined might be less than the current time for just the MediaSearch query
- pregenerate better keywords (how?) and store them somewhere (where?), so that when we make the MediaSearch call it is both good and fast
- make MediaSearch calls some other way than via the Action API (not sure how, or if that would actually help)
I'll see if I can do some profiling related to the first and last items. I suspect that won't be very helpful. But maybe I'm wrong. At a minimum, it'll confirm that the slowdown is where we think it is.
Let me know if you see something obvious I missed, some mistake I made in the analysis, or another path I should explore.
IDK whether this would be relevant for & fit into what you're building, but \MultiHttpClient supports concurrent requests.
There's no way that we'll be able to tweak or optimize what Elasticsearch does, so the best we can do is eliminate work on the MediaSearch side.
Part of what MediaSearch does is find Wikidata statements that match the search term (via an API call, which does another search - this should eventually end up being replaced (T268648) by a more efficient lookup)
Data for all relevant pages (statements, synonyms, ...) will then be added to the elastic query, so that it has more signals to go find relevant files with - but these additions also make the whole thing more complex and slower.
Unlike a random search term, we'd have page titles here, from which we can easily find exactly which Wikidata ids they're linked with (Wikibase\Lib\Store\SiteLinkLookup::getItemIdForLink).
This would replace an API & search call to find the relevant Wikidata entities, and would also reduce the number of relevant entities from potentially a dozen down to 1, making the final elastic query slimmer.
IDK exactly what the impact will be, but I expect it to be "significant enough", based on the observation that "more keywords" (which probably finds no, or drastically fewer, relevant Wikidata entities) greatly improves the response time.
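For what it's worth, the title-to-Wikidata-id mapping is also reachable from outside MediaWiki. A hedged sketch of that lookup using the Action API's pageprops (wikibase_item), rather than SiteLinkLookup itself, which is only available server-side:

```
// Sketch: resolve an enwiki page title to the Wikidata item it is linked with,
// via prop=pageprops / ppprop=wikibase_item. This approximates from the client
// side what SiteLinkLookup::getItemIdForLink does inside MediaWiki.
async function wikidataIdFor(title: string): Promise<string | undefined> {
  const params = new URLSearchParams({
    action: 'query', format: 'json', formatversion: '2',
    prop: 'pageprops', ppprop: 'wikibase_item', titles: title
  });
  const res = await fetch(`https://en.wikipedia.org/w/api.php?${params}`);
  const data = await res.json() as any;
  return data?.query?.pages?.[0]?.pageprops?.wikibase_item; // the linked item id, if any
}
```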
AIUI, that's work that the SD team was already planning to do after the PoC.
Thanks @matthiasmullie !
At least for this phase of the project, the Image Suggestion API is implemented as a nodejs service. So MultiHttpClient doesn't apply, because we're doing this outside MediaWiki. But we are making requests asynchronously per the usual node patterns. So it isn't like we're iteratively making multiple requests to MediaSearch one after the other.
I am not very familiar with MediaSearch, and I'm concerned that I'm not using it optimally in the Image Suggestion API. Are you and @Cparle the right people to answer questions? If so, do you prefer Phab, IRC, Slack, email, Google Meet, etc.? If not, who would you recommend?
For starters:
- is making an Action API request with generator=search the right way to call MediaSearch? Specifically, if I'm searching for images for a page titled "fish", I'm doing queries (from the node service) that look like this:
Is that the right way to invoke MediaSearch, or is there a better way?
- given that searching for "fish tacos" is faster than searching for "fish", and "fish street tacos" is faster still, are there established usage patterns or suggestions you may have on how best to get image recommendations for a page via MediaSearch? Right now, all we have to work with is a Page Title and a Page Id, but maybe there are other things we could work out. Part of the PoC is learning, improving, and iterating.
- are there any rate limits we should know about? For the PoC, at least, we expect there to be a lot of redundancy in the suggestions served, so we've talked about precaching some MediaSearch results. That could be bursty. I'd be surprised if the PoC traffic were problematic, but better to ask.
Thanks!
Yeah, myself and @matthiasmullie are probably the best people to talk to. Here is fine, or Slack, or wherever - we're not choosy :) There's nothing very special about MediaSearch really, it's just a CirrusSearch profile that we have tuned for images. So from the outside it's just the same as any other call to the Action API (which is the correct way to invoke it), and there are no rate limits ... it's interesting that "fish street tacos" is faster than just "fish", but I don't think there's anything you can do from your end to use this to speed up your queries. T268648 will, we hope, improve response times, and implementation is beginning, but it's not going to be done super-quick.
That API call looks good, it's exactly what it should look like.
I suspect that "fish street tacos" is faster simply because that term finds fewer (if any) relevant matching Wikidata entities than "fish" does, leaving MediaSearch with less data and thus a simpler/faster search query. We expect T268648 will optimize that a bit. There should be ways that we can use the page ids to optimize which Wikidata entities are used (faster & more relevant), but they don't yet exist. That might be something for us to build after the PoC. The best that we can do with what we have ATM is simply using the page titles as search terms, as you are already doing.
@Cparle and @matthiasmullie, thank you for your replies!
@sdkim, I'll do a bit more benchmarking to make sure there aren't any slowdowns on the Image Suggestion API side that we can improve upon. But it sounds like in the short term, responses that include MediaSearch results may not be quite as snappy as we'd all wish. This could affect our implementation choices for T276993: Filter out pages with no suggestions. It also has implications for the proposal to raise our maximum limit on pages returned.
It also sounds as if my thoughts about generating "better" keywords were at best misguided and at worst would actually do harm. Per Matthias' explanation, adding additional keywords may speed up MediaSearch precisely because it can't find any relevant matching Wikidata entities, which means we may get lower quality results. Bad data fast isn't exactly a win.
The best idea I have at the moment is caching on the Image Suggestion API side. Given that we expect a non-negligible percentage of the PoC traffic to be redundant in terms of pages requested, caching could offer significant gains. It probably isn't a sufficient long-term solution: once clients are mainly receiving randomized results, our cache hit percentage would probably drop dramatically. But given that T268648 may improve things post-PoC, the need for caching may decline as well.
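A minimal sketch of what that caching could look like - an in-memory TTL cache keyed by page title in front of the MediaSearch lookup (illustrative only; a real implementation would likely want size bounds and/or a shared external store, and fetchMediaSearch here is the hypothetical helper from the earlier sketch):

```
// Sketch: a small TTL cache in front of the MediaSearch lookup, keyed by page title.
declare function fetchMediaSearch(title: string): Promise<unknown>; // hypothetical helper

const CACHE_TTL_MS = 10 * 60 * 1000; // assumption: 10-minute expiry
const cache = new Map<string, { expires: number; value: unknown }>();

async function cachedMediaSearch(title: string): Promise<unknown> {
  const hit = cache.get(title);
  if (hit && hit.expires > Date.now()) {
    return hit.value; // cache hit: skip the ~0.5s MediaSearch round trip
  }
  const value = await fetchMediaSearch(title);
  cache.set(title, { expires: Date.now() + CACHE_TTL_MS, value });
  return value;
}
```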
I'll do a bit more investigation, then let's have a conversation about how to proceed.
Further investigation did not yield any short-term easy gains. Caching seems to be our best immediate solution for a speedup. Should we choose to add caching (and I can't imagine that we won't at some point), it would best be implemented under its own task.
I suggest we close the current task as resolved. I've created T278263: Consider caching for the Image Suggestions service to track that specifically.