Page MenuHomePhabricator

PCS caching and pregeneration when restbase is decommissioned
Open, Needs TriagePublic

Description

Currently, PCS sits behind two layers of caching:

  • the edge caches, which get filled by repeated external requests
  • restbase, which gets its contents pregenerated via changeprop

Given we have the goal of removing restbase from the equation, we set out to figure out how much all of the above actually benefit our users. We would like to keep the system as simple as possible.

In this task we want to determine if:

  • We can just get rid of the restbase pregeneration
  • We can get rid of pregeneration and caching/invalidation
  • We need to preserve both in PCS

Reference:

See also:

Event Timeline

Summarizing my results:

Edge cache
From turnilo I extracted the hourly webrequests for /api/rest_v1/page/(mobile-html|summary|media-list|mobile-html-offline-resources|references/mobile-sections|definition), which is a good approximation of what mobileapps does.

My results:

  • We have a generally good overall cache hit ratio at the edge of 90% for all endpoints
  • The vast majority of requests, around 98.5% of all requests to PCS are for page/summary, which has a slightly higher cache hit ratio
  • About 10% of the remaining requests are uncachable

Restbase
At the restbase layer, we don't have good observability, and more importantly we can't really distinguish the requests we make for pregeneration and the requests that are "organic".

However, the data restbase offers, restbase.router.external.v1_page_summary_-title-.GET.200.sample_rate (not separated by site) reports about 3x the total number of requests we actually calculate from turnilo (2.2-3k vs 700-1k). I am not sure of why this happens and will need to investigate it with @Jgiannelos or @MSantos, but given it seems inconsistent with the rest of the data we have, Including the fact that trafficserver just reports around 1.2k to 1.9k requests to restbase overall from all text caches.

Given how this metric seems dubious, I will ignore metrics coming from restbase from the most part.

PCS
What we can get, though, is the number of requests we actually make from restbase to mobileapps, combining both pre-generation and actually serving requests that are uncached both in restbase and in the backend: using the following expression on thanos:

sum(increase(envoy_cluster_upstream_rq{envoy_cluster_name=~"mobileapps", cluster=~"restbase", instance=~".*"}[1h]))by (envoy_cluster_name)

we were able to extract the hourly request rate overall to PCS, which currently sits between 300 and 900 rps, so about half the requests we'd get if there was no cache nor pregeneration.

image.png (339×578 px, 25 KB)

This would be totally acceptable to me as a request rate, if it wasn't for a few factors:

  1. PCS makes a lot of requests to the mediawiki API; on average 2 requests per PCS request (including the page summary, which seems odd to me). This would result in PCS making about 5k rps to the mediawiki API, which would be a large majority of the internal API requests we make and more than half of the total requests.
  2. PCS has very slow response rates. According to envoy's telemetry, which we can generally trust, PCS has a p99 of ~ 8 seconds, and a p90 of ~ 2 seconds, which would seem unacceptable.

My current recommendation would be to:

  1. try to turn off pregeneration for a few days to check what is the effect on backend requests; # if that looks worrisome, we should probably invest time in converting page/summary to be precomputed by MediaWiki (possibly?) in the jobqueue, and then respond to it from a REST endpoint, and just let PCS take care of the rest of its endpoints which are low-volume without any caching or pregeneration.
  2. if that looks scary/inconvenient, we should add at the very least caching to PCS to make it faster to respond to external requests, without the need of fanning out 2x the requests to the MW API
daniel updated the task description. (Show Details)
daniel updated the task description. (Show Details)

Change 840097 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] changeprop: Disable pregeneration on PCS

https://gerrit.wikimedia.org/r/840097

[…] doing proper language converter redirects (T240068) and handling flagged revisions (T209936) become a lot easier if this code is moved to core (in PHP). It will also avoid a round trip, since both the parsoid and legacy html are cached in core. […]

Note that an (inefficient) PHP implementation of this already exists in the form of the TextExtracts extension. This was originally developed for the Popups extension, and after a performance review found latencies upward of a second (due to no use of Varnish cache and no use of ParserCache) it was then re-implemented in Node.js, although this wasn't more performant per-se. It's just that RESTBase's pre-caching helped hide the latency (T117082: Cached REST endpoint for extracts requests, and T117082, T70861, T123445, T118147).

The Popups extension (afaik) still supports the TextExtracts extension API as source for its Hovercards, and remains also what we use locally during development and for third-parties. I'm not suggesting anything regarding its current code per-se, but that extension might make for a good place to host the new REST endpoint, and then for someone to take ownership over that extension (and e.g. replace its Action API module with a call to the same optimised code, to support current batch consumers). See also: T256505: TextExtracts extension: Code stewardship review and T231797.

This was originally developed for the Popups extension,

I don't think this is true from what I recall. TextExtract predates Popups by quite a few years and was made by Max Semenik a while back for mobile.

The TextExtracts extension has a slightly different use case - and these days it's primarily used by user tools for obtaining plain text. The first version of Popups was built using it, but when web team inherited Popups we quickly determined it was not fit for the use case of page previews. The only reason we still link Page previews to TextExtracts is 3rd party support. If a PHP implementation existed, we would remove the API calls to TextExtracts and associated code and this would also save quite a few bytes on the client.