[Spike 1day] Morelike Query traffic estimation
Closed, Resolved | Public | 2 Estimated Story Points | Spike

Description

We are considering using the morelike api as part of our work on inter-article recommendations. This API is currently used for RelatedArticles but this is on-demand / requires people to scroll to the bottom of the article. Consequently, this API is not yet ready to be used on a wider scale. One potential treatment of this API is to have it activate when users activate search, which would result in significantly higher traffic. So! We need to get a better sense of current traffic as well as expected future traffic, that we can then pass on to the Search team to help them know how they might scale this API.

Questions we are trying to answer

  • What is the current level of traffic this API experiences?
  • What is the expected level of traffic it might experience? Investigate: how much traffic the search widget currently generates, how likely we are to have this content be cached / other caching implications, how many requests we experience per unit time and the characteristics of this traffic (eg are there surges / other interesting aspects?)
    • For Morelike results to show in the search bar for anonymous users
    • For Morelike results to show on pageload for anonymous users

Acceptance Criteria

  • Work with Data Analyst on obtaining information about current and future potential usage
  • Work with Search team where necessary to get additional context
  • Perform analysis and obtain concrete numbers of potential future traffic, keeping in mind the above factors

Event Timeline

ovasileva updated the task description.
ovasileva raised the priority of this task from Medium to High.
ovasileva subscribed.

After discussion, we decided it makes the most sense for this to be driven by an analyst with engineering supporting where necessary.

NBaca-WMF renamed this task from [Spike] Morelike API traffic estimation to [Spike] Morelike Query traffic estimation. Jul 12 2024, 5:53 PM
ovasileva renamed this task from [Spike] Morelike Query traffic estimation to [Spike 1day] Morelike Query traffic estimation. Jul 15 2024, 5:46 PM
ovasileva set the point value for this task to 2.

What is the current level of traffic this API experiences?
At a quick glance, we can use the webrequest table and Turnilo to query the number of hits with gsrsearch=morelike in the URL. The results show about 0.7m hits per day.

Screenshot 2024-07-16 at 11.56.46 AM.png (1×2 px, 420 KB)

Turnilo link
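
As a rough illustration of the filter described above, the sketch below classifies a request URL as a morelike request when its gsrsearch parameter starts with "morelike". The example URLs and the helper name `is_morelike_request` are made up for illustration; this is not the query Turnilo runs.

```python
from urllib.parse import urlsplit, parse_qs


def is_morelike_request(url: str) -> bool:
    """Return True if the URL carries a gsrsearch parameter starting with 'morelike'."""
    query = parse_qs(urlsplit(url).query)  # parse_qs also percent-decodes values
    return any(value.startswith("morelike") for value in query.get("gsrsearch", []))


# Hypothetical request URLs, for illustration only.
urls = [
    "https://en.wikipedia.org/w/api.php?action=query&generator=search&gsrsearch=morelike%3AEarth",
    "https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=Earth",
]
morelike_hits = sum(is_morelike_request(u) for u in urls)
print(morelike_hits)
```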

The morelike results are part of the Action API, which sends a cache-control: s-maxage=86400, max-age=86400, public header, so results can be cached for 24 hours both by shared caches (s-maxage) and on the client (max-age).
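
To make the header concrete, here is a minimal sketch that parses the cache-control value quoted above and converts the lifetimes to hours. The parser is a simplified illustration (it does not handle quoted values), and the function name is hypothetical.

```python
def parse_cache_control(header: str) -> dict:
    """Parse a Cache-Control header into a {directive: value} dict.

    Directives without a value (e.g. 'public') map to True.
    Simplified: quoted directive values are not handled.
    """
    directives = {}
    for part in header.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name] = int(value) if value.isdigit() else (value or True)
    return directives


header = "s-maxage=86400, max-age=86400, public"
directives = parse_cache_control(header)

# s-maxage applies to shared caches; max-age applies to the client.
shared_cache_hours = directives["s-maxage"] / 3600
client_cache_hours = directives["max-age"] / 3600
print(shared_cache_hours, client_cache_hours)
```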

Additionally, it would be useful to have some idea on the kind of queries we expect. For example: do we expect MoreLike requests to be based just on the article being viewed? Or do we want to further filter based on some user preference (their preferred topics for example)? This would help get an idea of the fragmentation of caching we can expect.

@Gehel Right now we're mainly interested in content related suggestions (suggestions based on the current article).

Reading over the morelike documentation, it is interesting that the API supports multiple titles, potentially enabling suggestions based on things like a browsing session (previous titles) or watchlist, but I can certainly see how such queries would substantially fragment the cache and increase load.
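
A back-of-the-envelope sketch of why multi-title queries fragment the cache: the number of distinct queries grows combinatorially with the number of seed titles. The article count below is an arbitrary illustrative number, and sorting the titles to normalise the cache key assumes that title order does not affect results, which has not been confirmed here.

```python
from math import comb


def distinct_cache_keys(num_articles: int, titles_per_query: int) -> int:
    """Upper bound on distinct morelike queries for unordered sets of seed titles."""
    return comb(num_articles, titles_per_query)


def morelike_query(titles: list[str]) -> str:
    """Build a morelike search string; titles are sorted so the same set
    always yields the same string (assumes order does not matter)."""
    return "morelike:" + "|".join(sorted(titles))


# Illustrative corpus size only, not a real article count.
for k in (1, 2, 3):
    print(k, distinct_cache_keys(1000, k))

print(morelike_query(["Moon", "Earth"]))
```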

What is the expected level of traffic it might experience? Investigate: how much traffic the search widget currently generates, how likely we are to have this content be cached / other caching implications, how many requests we experience per unit time and the characteristics of this traffic (eg are there surges / other interesting aspects?)

Since we're currently thinking of targeting the empty search state for some experiments, it would be interesting to know more about the search box behaviour itself. Some interesting metrics could be:

  • search-box activation (how many people click on the search box, this could be distinct from the amount of queries we receive)
  • search-box dwell time (how long before someone starts typing into the search box)
  • search-box abandonment (how many people activate the search box but don't type a query).

That could give us an idea of the use-case for an enhanced empty search state, since presumably we'd be targeting the use-case of "people looking for something, but not entirely sure what".
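
One way the three metrics above could be computed, as a sketch only. The `SearchSession` record and its fields are hypothetical; no such instrumentation schema is described in this task, and the sample data is invented.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class SearchSession:
    activated_at: float                   # seconds; when the search box was focused
    first_keystroke_at: Optional[float]   # None if the user never typed


def summarize(sessions: list, pageviews: int) -> dict:
    """Compute activation rate, abandonment rate, and median dwell time."""
    typed = [s for s in sessions if s.first_keystroke_at is not None]
    dwell = [s.first_keystroke_at - s.activated_at for s in typed]
    return {
        "activation_rate": len(sessions) / pageviews if pageviews else None,
        "abandonment_rate": (len(sessions) - len(typed)) / len(sessions) if sessions else None,
        "median_dwell_seconds": median(dwell) if dwell else None,
    }


# Invented sample: four activations across 40 pageviews, two of them abandoned.
sessions = [
    SearchSession(0.0, 1.5),
    SearchSession(10.0, None),
    SearchSession(20.0, 22.5),
    SearchSession(30.0, None),
]
metrics = summarize(sessions, pageviews=40)
print(metrics)
```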

What is the current level of traffic this API experiences?

Using the following Superset query, querying the sum of the num_morelike_req column from 2024-06-22 to 2024-07-22, we can see that the morelike API gets on average 97m requests per day.

num_morelike_req
SELECT
    date_trunc('day', CAST(CONCAT(CAST(year AS VARCHAR), '-', LPAD(CAST(month AS VARCHAR), 2, '0'), '-', LPAD(CAST(day AS VARCHAR), 2, '0')) AS TIMESTAMP)) AS "date",
    SUM(num_morelike_req) AS num_morelike_req
FROM "discovery"."webrequest_metrics"
-- The partition date is built as a zero-padded 'YYYY-MM-DD' string, so it is compared against date-only literals.
-- A literal with a time suffix sorts after the bare date and would shift the window by one day.
WHERE CONCAT(CAST(year AS VARCHAR), '-', LPAD(CAST(month AS VARCHAR), 2, '0'), '-', LPAD(CAST(day AS VARCHAR), 2, '0')) >= '2024-06-22'
  AND CONCAT(CAST(year AS VARCHAR), '-', LPAD(CAST(month AS VARCHAR), 2, '0'), '-', LPAD(CAST(day AS VARCHAR), 2, '0')) < '2024-07-22'
GROUP BY date_trunc('day', CAST(CONCAT(CAST(year AS VARCHAR), '-', LPAD(CAST(month AS VARCHAR), 2, '0'), '-', LPAD(CAST(day AS VARCHAR), 2, '0')) AS TIMESTAMP));

What is the expected level of traffic it might experience? Investigate: how much traffic the search widget currently generates...

Using the same approach, this time querying the autocomplete requests with num_ac_req, we can see that the autocomplete API generates about 77m requests per day.

num_ac_req
SELECT
    date_trunc('day', CAST(CONCAT(CAST(year AS VARCHAR), '-', LPAD(CAST(month AS VARCHAR), 2, '0'), '-', LPAD(CAST(day AS VARCHAR), 2, '0')) AS TIMESTAMP)) AS "date",
    SUM(num_ac_req) AS num_ac_req
FROM "discovery"."webrequest_metrics"
-- The partition date is built as a zero-padded 'YYYY-MM-DD' string, so it is compared against date-only literals.
-- A literal with a time suffix sorts after the bare date and would shift the window by one day.
WHERE CONCAT(CAST(year AS VARCHAR), '-', LPAD(CAST(month AS VARCHAR), 2, '0'), '-', LPAD(CAST(day AS VARCHAR), 2, '0')) >= '2024-06-22'
  AND CONCAT(CAST(year AS VARCHAR), '-', LPAD(CAST(month AS VARCHAR), 2, '0'), '-', LPAD(CAST(day AS VARCHAR), 2, '0')) < '2024-07-22'
GROUP BY date_trunc('day', CAST(CONCAT(CAST(year AS VARCHAR), '-', LPAD(CAST(month AS VARCHAR), 2, '0'), '-', LPAD(CAST(day AS VARCHAR), 2, '0')) AS TIMESTAMP));

These results are interesting because they show a higher volume of morelike traffic than autocomplete traffic.
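
For capacity planning it can help to express the daily figures above as average requests per second. The arithmetic below uses only the two daily averages reported in this comment (97m and 77m); it gives a mean rate and says nothing about peaks or surges.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rps(requests_per_day: float) -> float:
    """Convert a daily request count to an average requests-per-second rate."""
    return requests_per_day / SECONDS_PER_DAY


morelike_per_day = 97_000_000      # daily average reported above
autocomplete_per_day = 77_000_000  # daily average reported above

morelike_rps = average_rps(morelike_per_day)
autocomplete_rps = average_rps(autocomplete_per_day)
ratio = morelike_per_day / autocomplete_per_day

print(f"morelike:     ~{morelike_rps:,.0f} req/s")
print(f"autocomplete: ~{autocomplete_rps:,.0f} req/s")
print(f"morelike / autocomplete: {ratio:.2f}x")
```

On these averages, morelike comes to roughly 1,100 requests per second and autocomplete to roughly 900, so morelike traffic is about a quarter higher.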


The Search team's annual report (which is awesome, thank you @EBernhardson for putting this together!) supports these results. Related Articles on mobile accounts for a larger share of pageview actors than autocomplete suggestions do on either desktop or mobile:

Related articles:

As of 20-May-2024 we see that:

  • Around 10% of internal pageview actors (people who navigate to a page once on the site) are Related Articles users.
  • Around 6% of Related Articles presentations result in a click.

Autocomplete:

As of 20-May-2024 we see that:

  • About 0.8% of mobile web pageview actors are autocomplete searchers.
  • About 3.4% of desktop pageview actors are autocomplete searchers.

Some takeaways from a meeting with the search team:

morelike capabilities

  • The morelike API is highly flexible. It can receive a single page or multiple page titles as input, and even free form text. It ranks the results based on the similarity of the article text.

morelike performance characteristics:

  • The morelike API is generally CPU intensive. It requires several steps to produce results:
    • It extracts significant terms from a given document
    • It runs a full-text search across the corpus for documents containing similar words
    • It scores the results based on word frequency

Because of these multiple steps, this API can often be a bit slow. If multiple titles are given to morelike as inputs, it has to run this process for each title and then compare the results at the end. If the set of titles is unique to each request (e.g. personalized for users), that also drastically limits the ability to cache results. For this reason, personalized morelike results would require a significant investment in scaling this API. Results that are more generic and focused on article content (like Related Articles) are more easily cached.

Other performance optimizations could also be applied, such as limiting the number of terms to match, enabling "term vectors", or even pre-computing article results (certain performance optimizations may have an impact on quality).
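
The three steps listed above can be illustrated with a toy sketch. This is not the search backend's implementation: the TF-IDF weighting, the tiny corpus, and all function names are assumptions chosen only to show the shape of the work. The `max_terms` parameter mirrors the "limit the number of terms to match" optimization mentioned above.

```python
import math
from collections import Counter


def tokenize(text: str) -> list:
    return [w for w in text.lower().split() if w.isalpha()]


def significant_terms(doc: str, corpus_texts: list, max_terms: int = 3) -> list:
    """Step 1: pick the terms of `doc` with the highest TF-IDF weight."""
    tf = Counter(tokenize(doc))
    n = len(corpus_texts)

    def idf(term: str) -> float:
        df = sum(term in tokenize(text) for text in corpus_texts)
        return math.log((n + 1) / (df + 1)) + 1

    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(tf, key=lambda t: (-tf[t] * idf(t), t))
    return ranked[:max_terms]


def more_like(title: str, corpus: dict, max_terms: int = 3) -> list:
    terms = significant_terms(corpus[title], list(corpus.values()), max_terms)
    scores = {}
    for other, text in corpus.items():
        if other == title:
            continue
        counts = Counter(tokenize(text))
        score = sum(counts[t] for t in terms)  # Step 3: score by term frequency
        if score > 0:                          # Step 2: keep documents containing a term
            scores[other] = score
    return sorted(scores, key=lambda k: (-scores[k], k))


# Tiny invented corpus, for illustration only.
corpus = {
    "Earth": "earth is a planet orbiting the sun",
    "Mars": "mars is a planet orbiting the sun with two moons",
    "Jazz": "jazz is a music genre",
}
print(more_like("Earth", corpus))
```

Even in this toy version, every query rescans the whole corpus, which hints at why the real API is CPU intensive and why caching or pre-computing results is attractive.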

morelike limitations

Since the morelike results are based on term frequency, they can sometimes produce unexpected results. There are also language-specific issues if a language is written in different scripts (e.g. Serbian) or has language variants (e.g. Chinese).

For a more in-depth analysis of the morelike traffic estimates, the search team has performed their own high-level spike on the topic:
High level plan of how to scale MoreLike

Jdlrobson-WMF added subscribers: bwang, Jdlrobson-WMF.

@Jdrewniak @bwang let's talk about this before next sprint kick off since we will be increasing traffic in our next deploy.

Resolving. The traffic we're expecting with search suggestions is within our budget.