
Investigate moving summary endpoint outside of mobileapps service
Closed, Resolved · Public

Assigned To: Jgiannelos
Authored By: Jgiannelos, Oct 18 2022, 4:10 PM

Description

Currently the summary endpoint is implemented inside the mobileapps service, where it internally reuses the MCS codebase (without making outgoing requests to the MCS endpoints).
With RESTBase going away, it's worth investigating moving the summary endpoint outside of mobileapps.

Reasons:

  • It doesn't fall within the mobileapps scope
  • It's heavily used by web clients, not only the apps (web is probably the main consumer)
  • It makes outgoing requests to both the action API and Parsoid. This made sense in a RESTBase-centric architecture where services were designed to be closely interconnected, but it may be a performance overhead now

Regarding outgoing requests, the summary endpoint makes the following (sketched in code after this list):

  • Queries the action API for flagged revisions
  • Queries Parsoid for the page HTML
  • Queries the action API for summary metadata
  • Queries the action API for siteinfo
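
For reference, the sequence of those calls looks roughly like the Python sketch below. The URLs and action API parameters are illustrative assumptions, not the actual PCS code; the exact props PCS requests may differ.

import requests

API = "https://de.wikipedia.org/w/api.php"          # action API (example wiki)
PARSOID = "https://de.wikipedia.org/api/rest_v1"    # RESTBase-fronted Parsoid

def fetch_summary_inputs(title):
    # 1. Flagged revisions: which revision is the latest "stable" one?
    flagged = requests.get(API, params={
        "action": "query", "titles": title, "prop": "flagged|info",
        "format": "json", "formatversion": 2,
    }).json()

    # 2. Page HTML from Parsoid (possibly pinned to the stable revid)
    html = requests.get(f"{PARSOID}/page/html/{title}").text

    # 3. Summary metadata (description, page props, lead image, ...)
    meta = requests.get(API, params={
        "action": "query", "titles": title,
        "prop": "description|pageprops|pageimages",
        "format": "json", "formatversion": 2,
    }).json()

    # 4. Siteinfo (general wiki configuration)
    siteinfo = requests.get(API, params={
        "action": "query", "meta": "siteinfo",
        "format": "json", "formatversion": 2,
    }).json()

    return flagged, html, meta, siteinfo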

Event Timeline

For reference, here are some diagrams from the benchmarks.
I am running a local instance of PCS pointing to production RESTBase.
The dataset is the top visited pages for some high-traffic wikis (~7k pages).
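
A minimal sketch of the kind of harness used here (the local PCS URL, route, and function names are assumptions, not the actual benchmark code):

import statistics
import time
from collections import defaultdict

import requests

PCS = "http://localhost:8888"   # local PCS instance (assumed port)

def run(pages):
    # pages: iterable of (domain, title) pairs, e.g. the ~7k top visited pages
    latencies = defaultdict(list)
    for domain, title in pages:
        start = time.monotonic()
        requests.get(f"{PCS}/{domain}/v1/page/summary/{title}")
        latencies[domain].append(time.monotonic() - start)
    for domain, samples in sorted(latencies.items()):
        cuts = statistics.quantiles(samples, n=100)
        print(f"{domain}: n={len(samples)} "
              f"p95={cuts[94] * 1000:.0f}ms p99={cuts[98] * 1000:.0f}ms")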

With flagged revisions disabled:

benchmark - no flagged revisions enabled - count - per domain.png

benchmark - no flagged revisions enabled - p95 latency - per domain.png

benchmark - no flagged revisions enabled - p99 latency - per domain.png

With flagged revisions enabled:

benchmark - flagged revisions enabled - req count - per domain.png

benchmark - flagged revisions enabled - p95 latency - per domain.png

benchmark - flagged revisions enabled - p99 latency - per domain.png

After running some benchmarks for page summary performance, here are some findings:

  • When the page summary is not using flagged revisions and always fetches the latest revid for a given page from Parsoid, latency is low:
    • p95 < 100ms
    • p99 < 350ms
  • When the page summary is using flagged revisions, things get a bit more complicated:
    • PCS queries Parsoid for a page with a revid that's not the latest
    • That triggers a re-render every time we ask for a non-latest revid, because it is not stored in restbase storage (see the sketch after this list)
    • We rely on upstream caching to avoid re-rendering the page every time
    • Things get worse for wikis that have flagged revisions enabled and many unconfirmed revisions
    • Latency:
      • p95 < 2s
      • p99 < 8s
    • Eventually, for wikis that are not heavily using flagged revisions, caching kicks in so latency goes down
    • For wikis with a lot of flagged revisions, caching doesn't help that much
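
Roughly, the flagged revisions branch looks like this (Python sketch; field names follow the action API's prop=flagged|info output, but this is not the actual PCS code):

def pick_revid_for_parsoid(flagged_info):
    # flagged_info: action API response for prop=flagged|info (formatversion=2)
    page = flagged_info["query"]["pages"][0]
    latest = page["lastrevid"]
    stable = page.get("flagged", {}).get("stable_revid")

    if stable is None or stable == latest:
        # No flagged revisions, or the latest revision is already approved:
        # restbase has this render stored, so the request is cheap.
        return latest

    # Stable != latest: we have to request an older revid. restbase only
    # stores the latest render, so this triggers a Parsoid re-render unless
    # an upstream cache still holds the response.
    return stable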

I think I (mostly) figured out the problem we're observing, and it's a combination of restbase's behaviour and flagged revs.

When flaggedrevs is enabled on page Foo, /wiki/Foo will show the last "approved" revision, not the latest one.

This creates a problem with restbase, which does not respect flaggedrevs and will always show the latest revision at /page/html/Foo; this created problems for the page summary endpoint, see T279472.

This also means that restbase will never cache the "approved" revision, so when PCS requests /wiki/Foo/<approved-id> it will always trigger a re-render of the page, which can be very expensive.

Now this gets worse because restbase does three things that are, IMHO, very questionable when combined:

  • It sends must-revalidate in the Cache-Control of all its responses
  • It sends only an ETag for its URLs and NOT Last-Modified
  • When calculating the ETag for an asset, it will produce a different one every time, because IIRC the ETag is a combination of revision ID and a timestamp (so that we don't serve "stale renders"); the ETag is therefore consistent between requests only if restbase stores the asset in Cassandra

So for the latest revision of a page, restbase basically decides to be in control of all caching, including at the edge, but given we always return the same ETag (it's in Cassandra), the edges keep considering the object cached. When we ask for another revision though, we basically give up the ATS layer of caching entirely, because whenever it sends a revalidation request, restbase will (with high probability) see a non-matching ETag and recompute a full response.

This means that for articles with flaggedrevs enabled we have not only little to no edge caching, but also no internal caching whatsoever.
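
To make the revalidation failure concrete, here is a small sketch of the mechanism (assuming the restbase-style "revid/render-tid" ETag described above; an illustration, not restbase code):

import uuid

def restbase_etag(revid, stored_tid=None):
    # restbase-style ETag: "<revid>/<render-tid>". If the render is not in
    # Cassandra, a fresh render gets a fresh time-based tid on every request.
    tid = stored_tid or str(uuid.uuid1())
    return f'"{revid}/{tid}"'

def revalidate(if_none_match, revid, stored_tid=None):
    # Simulate ATS revalidating its cached copy against restbase.
    if if_none_match == restbase_etag(revid, stored_tid):
        return 304   # edge keeps serving its cached object
    return 200       # non-matching ETag: full re-render, full response

# Latest revision, stored in Cassandra: the tid is stable across requests,
# revalidation returns 304 and the edge cache stays warm.
# Non-latest (flagged-stable) revision, not stored: every revalidation sees
# a new tid, the ETag never matches, and we pay for a full re-render.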

I see two ways of fixing this:

  • Make restbase aware of flaggedrevs and behave like mediawiki does, caching the "verified" revision
  • Once we use ParserCache for storing Parsoid output on the MediaWiki side, we can completely bypass restbase for PCS; we'll get better caching at the edge and won't suffer from such edge cases

I ran everything once again, but instead of pointing PCS to RESTBase Parsoid, I used the core page HTML responses:

/v1/page/{title}/html
/v1/revision/{revision}/html
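
These are the MediaWiki core REST API routes served from /w/rest.php on each wiki; for example (domain, title, and revid below are placeholders):

import requests

base = "https://de.wikipedia.org/w/rest.php"
latest_html = requests.get(f"{base}/v1/page/Erde/html").text        # latest revision
pinned_html = requests.get(f"{base}/v1/revision/123456/html").text  # specific (hypothetical) revid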

Here are the results:

p95 parsoid latency
benchmark - flagged revisions enabled - core page html - p99 latency - per domain.png

p99 parsoid latency
benchmark - flagged revisions enabled - core page html - p95 latency - per domain.png

page summary endpoint throughput
benchmark - flagged revisions enabled - core page html - summary responses per sec.png

It looks like after the cache gets populated, responses get very fast and throughput is much higher.
Compared to Parsoid on RESTBase, latency is higher until caching kicks in.

Jgiannelos changed the task status from Open to In Progress. Nov 3 2022, 11:40 AM
Jgiannelos claimed this task.

From Quarry, some investigation on how many non-latest stable revisions we have in the wikis we ran the benchmarks against:

SELECT COUNT(*) FROM page, flaggedpages
WHERE (page_id = fp_page_id) AND (page_latest <> fp_stable);

  • enwiki: 4
  • dewiki: 12611
  • eswiki: no FlaggedRevs enabled
  • frwiki: 0
  • itwiki: no FlaggedRevs
  • jawiki: no FlaggedRevs
  • ptwiki: no FlaggedRevs
  • ruwiki: 350544

It looks like, for the scope of this ticket, we have enough information about the bottleneck in the summary endpoint: the problem is not necessarily the API calls but rather the flagged revisions handling.
Performance-wise, it wouldn't make much of a difference to move it to an extension.

We already have tickets for the flagged revisions handling in the MW REST API, so I am closing this one for now.