morelike queries account for a disproportionate share of load on the cluster (~20%) relative to the number of queries we actually perform. Some initial analysis in Hive suggests that caching responses for 24 hours could cut the number served by the backend to roughly a quarter of the current volume.
A relatively naive Hive query suggests that, over a 24-hour span, we could cut morelike queries to the backend from 7.3M to 1.7M:
select sum(total), sum(deduplicated)
from (
  select count(1) as total,
         count(distinct requests[0].query) as deduplicated
  from wmf_raw.cirrussearchrequestset
  where year=2016 and month=1 and day=10
    and requests[0].querytype = 'more_like'
  group by wikiid
) x;
_c0 _c1
7331659 1726091
The next query tries to get a rough estimate of how that compares once variance in the way URIs are sent is taken into account, by counting distinct request URIs instead of distinct queries. I'm not sure how good an approximation this is, but the totals are similar enough that it might be a good guess:
select sum(total), sum(deduplicated)
from (
  select count(1) as total,
         count(distinct uri_query) as deduplicated
  from wmf.webrequest
  where year=2016 and month=1 and day=10
    and uri_query LIKE '%search=morelike%'
  group by uri_host
) x;
_c0 _c1
7383599 2214332
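For reference, the counts from the two queries above can be turned into ratios. This is a minimal Python sketch, assuming every repeat of a query (or URI) within the day would be served from cache, which ignores expiry boundaries and evictions. The function name `remaining_fraction` is illustrative only.

```python
def remaining_fraction(total, deduplicated):
    """Fraction of requests that would still reach the backend if every
    repeated request within the 24h window were a cache hit (assumption)."""
    return deduplicated / total

# Counts copied from the Hive output above (2016-01-10).
by_query = remaining_fraction(7331659, 1726091)  # distinct requests[0].query
by_uri = remaining_fraction(7383599, 2214332)    # distinct uri_query

print(f"by query: {by_query:.1%} reach the backend, {1 - by_query:.1%} cacheable")
print(f"by URI:   {by_uri:.1%} reach the backend, {1 - by_uri:.1%} cacheable")
```

Keyed on the query this leaves about 23.5% of requests for the backend; keyed on the raw URI it leaves about 30.0%.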
The typical API request looks like: https://en.m.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=pageimages%7Cpageterms&piprop=thumbnail&pithumbsize=80&wbptterms=description&pilimit=3&generator=search&gsrsearch=morelike%3ACome_Share_My_Love&gsrnamespace=0&gsrlimit=3
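Part of the gap between the two deduplicated counts (1.7M distinct queries vs 2.2M distinct URIs) may come from requests that differ only in how the URI is written, although some of it will be parameters that genuinely change the response. As a rough illustration, and not a description of how any existing cache layer works, the Python sketch below derives a cache key from a request URI by sorting its query parameters. The names `cache_key` and `morelike_title`, and the choice to key on host plus sorted parameters, are assumptions made for this example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

def cache_key(uri):
    """Return a key that is identical for URIs differing only in
    parameter order or percent-encoding style (hypothetical scheme)."""
    parts = urlsplit(uri)
    # parse_qsl decodes percent-escapes; sorting removes ordering differences.
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return f"{parts.netloc}{parts.path}?{urlencode(params)}"

def morelike_title(uri):
    """Extract the title from a gsrsearch=morelike:<title> parameter,
    or return None if the request is not a morelike search."""
    params = dict(parse_qsl(urlsplit(uri).query))
    search = params.get("gsrsearch", "")
    prefix = "morelike:"
    return search[len(prefix):] if search.startswith(prefix) else None

uri = ("https://en.m.wikipedia.org/w/api.php?action=query&format=json"
       "&formatversion=2&prop=pageimages%7Cpageterms&piprop=thumbnail"
       "&pithumbsize=80&wbptterms=description&pilimit=3&generator=search"
       "&gsrsearch=morelike%3ACome_Share_My_Love&gsrnamespace=0&gsrlimit=3")

# The same request with two parameters swapped and an unescaped colon.
reordered = (uri
             .replace("&gsrnamespace=0&gsrlimit=3", "&gsrlimit=3&gsrnamespace=0")
             .replace("morelike%3A", "morelike:"))

print(morelike_title(uri))
print(cache_key(uri) == cache_key(reordered))
```

Under this scheme the two spellings of the request share one cache entry, while a change to any parameter value (for example a different `gsrlimit`) still produces a distinct key.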