
Investigate moving summary endpoint outside of mobileapps service
Closed, Resolved · Public

Assigned To: Jgiannelos
Authored By: Jgiannelos, Oct 18 2022, 4:10 PM

Description

Currently the summary endpoint is implemented inside the mobileapps service, where it internally reuses the MCS codebase (without making outgoing requests to the MCS endpoints).
With RESTBase going away, it's worth investigating moving the summary endpoint outside of mobileapps.

Reasons:

  • It doesn't fall within the mobileapps scope
  • It's heavily used by web clients, not only the apps (web is probably the main consumer)
  • It makes outgoing requests to both the action API and Parsoid. This made sense in a RESTBase-centric architecture where services were designed to be closely interconnected, but it may be a performance overhead now

Regarding outgoing requests, the summary endpoint makes the following (sketched in code after this list):

  • Queries the action API for flagged revisions
  • Queries Parsoid for the page HTML
  • Queries the action API for summary metadata
  • Queries the action API for siteinfo
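
For reference, the sequence of those calls looks roughly like the Python sketch below. The URLs and action API parameters are illustrative assumptions, not the actual PCS code; the exact props PCS requests may differ.

import requests

API = "https://de.wikipedia.org/w/api.php"          # action API (example wiki)
PARSOID = "https://de.wikipedia.org/api/rest_v1"    # RESTBase-fronted Parsoid

def fetch_summary_inputs(title):
    # 1. Flagged revisions: which revision is the latest "stable" one?
    flagged = requests.get(API, params={
        "action": "query", "titles": title, "prop": "flagged|info",
        "format": "json", "formatversion": 2,
    }).json()

    # 2. Page HTML from Parsoid (possibly pinned to the stable revid)
    html = requests.get(f"{PARSOID}/page/html/{title}").text

    # 3. Summary metadata (description, page props, lead image, ...)
    meta = requests.get(API, params={
        "action": "query", "titles": title,
        "prop": "description|pageprops|pageimages",
        "format": "json", "formatversion": 2,
    }).json()

    # 4. Siteinfo (general wiki configuration)
    siteinfo = requests.get(API, params={
        "action": "query", "meta": "siteinfo",
        "format": "json", "formatversion": 2,
    }).json()

    return flagged, html, meta, siteinfo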

Event Timeline

For reference, here are some diagrams from the benchmarks.
I am running a local instance of PCS pointing to production RESTBase.
The dataset is the top visited pages for some high-traffic wikis (~7k pages).
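
A minimal sketch of the kind of harness used here (the local PCS URL, route, and function names are assumptions, not the actual benchmark code):

import statistics
import time
from collections import defaultdict

import requests

PCS = "http://localhost:8888"   # local PCS instance (assumed port)

def run(pages):
    # pages: iterable of (domain, title) pairs, e.g. the ~7k top visited pages
    latencies = defaultdict(list)
    for domain, title in pages:
        start = time.monotonic()
        requests.get(f"{PCS}/{domain}/v1/page/summary/{title}")
        latencies[domain].append(time.monotonic() - start)
    for domain, samples in sorted(latencies.items()):
        cuts = statistics.quantiles(samples, n=100)
        print(f"{domain}: n={len(samples)} "
              f"p95={cuts[94] * 1000:.0f}ms p99={cuts[98] * 1000:.0f}ms")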

With flagged revisions disabled:

benchmark - no flagged revisions enabled - count - per domain.png

benchmark - no flagged revisions enabled - p95 latency - per domain.png

benchmark - no flagged revisions enabled - p99 latency - per domain.png

With flagged revisions enabled:

benchmark - flagged revisions enabled - req count - per domain.png

benchmark - flagged revisions enabled - p95 latency - per domain.png

benchmark - flagged revisions enabled - p99 latency - per domain.png

After running some benchmarks for page summary performance, here are some findings:

  • When the page summary is not using flagged revisions and always fetches the latest revid for a given page from Parsoid, latency is low:
    • p95 < 100ms
    • p99 < 350ms
  • When the page summary is using flagged revisions, things get a bit more complicated:
    • PCS queries Parsoid for a page with a revid that's not the latest
    • That triggers a re-render every time we ask for a non-latest revid, because it is not stored in restbase storage (see the sketch after this list)
    • We rely on upstream caching to avoid re-rendering the page every time
    • Things get worse for wikis that have flagged revisions enabled and many unconfirmed revisions
    • Latency:
      • p95 < 2s
      • p99 < 8s
    • Eventually, for wikis that are not heavily using flagged revisions, caching kicks in so latency goes down
    • For wikis with a lot of flagged revisions, caching doesn't help that much
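
Roughly, the flagged revisions branch looks like this (Python sketch; field names follow the action API's prop=flagged|info output, but this is not the actual PCS code):

def pick_revid_for_parsoid(flagged_info):
    # flagged_info: action API response for prop=flagged|info (formatversion=2)
    page = flagged_info["query"]["pages"][0]
    latest = page["lastrevid"]
    stable = page.get("flagged", {}).get("stable_revid")

    if stable is None or stable == latest:
        # No flagged revisions, or the latest revision is already approved:
        # restbase has this render stored, so the request is cheap.
        return latest

    # Stable != latest: we have to request an older revid. restbase only
    # stores the latest render, so this triggers a Parsoid re-render unless
    # an upstream cache still holds the response.
    return stable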

I think I (mostly) figured out the problem we're observing, and it's a combination of restbase's behaviour and flagged revs.

When flaggedrevs is enabled on page Foo, /wiki/Foo will show the last "approved" revision, not the latest one.

This creates a problem with restbase, which does not respect flaggedrevs and will always show the latest revision at /page/html/Foo; this created problems for the page summary endpoint, see T279472.

This also means that restbase will never cache the "approved" revision, so when PCS requests /wiki/Foo/<approved-id> it will always trigger a re-render of the page, which can be very expensive.

Now this gets worse because restbase does three things that are, IMHO, very questionable when combined:

  • It sends must-revalidate in the Cache-Control of all its responses
  • It sends only an ETag for its URLs and NOT Last-Modified
  • When calculating the ETag for an asset, it will produce a different one every time, because IIRC the ETag is a combination of revision ID and a timestamp (so that we don't serve "stale renders"); the ETag is therefore consistent between requests only if restbase stores the asset in Cassandra

So for the latest revision of a page, restbase basically decides to be in control of all caching, including at the edge, but given we always return the same ETag (it's in Cassandra), the edges keep considering the object cached. When we ask for another revision though, we basically give up the ATS layer of caching entirely, because whenever it sends a revalidation request, restbase will (with high probability) see a non-matching ETag and recompute a full response.

This means that for articles with flaggedrevs enabled we have not only little to no edge caching, but also no internal caching whatsoever.
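
To make the revalidation failure concrete, here is a small sketch of the mechanism (assuming the restbase-style "revid/render-tid" ETag described above; an illustration, not restbase code):

import uuid

def restbase_etag(revid, stored_tid=None):
    # restbase-style ETag: "<revid>/<render-tid>". If the render is not in
    # Cassandra, a fresh render gets a fresh time-based tid on every request.
    tid = stored_tid or str(uuid.uuid1())
    return f'"{revid}/{tid}"'

def revalidate(if_none_match, revid, stored_tid=None):
    # Simulate ATS revalidating its cached copy against restbase.
    if if_none_match == restbase_etag(revid, stored_tid):
        return 304   # edge keeps serving its cached object
    return 200       # non-matching ETag: full re-render, full response

# Latest revision, stored in Cassandra: the tid is stable across requests,
# revalidation returns 304 and the edge cache stays warm.
# Non-latest (flagged-stable) revision, not stored: every revalidation sees
# a new tid, the ETag never matches, and we pay for a full re-render.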

I see two ways of fixing this:

  • Make restbase aware of flaggedrevs and behave like mediawiki does, caching the "verified" revision
  • Once we use ParserCache for storing Parsoid output on the MediaWiki side, we can completely bypass restbase for PCS; we'll get better caching at the edge and won't suffer from such edge cases

I ran everything once again, but instead of pointing PCS to RESTBase Parsoid, I used the core page HTML responses:

/v1/page/{title}/html
/v1/revision/{revision}/html
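
These are the MediaWiki core REST API routes served from /w/rest.php on each wiki; for example (domain, title, and revid below are placeholders):

import requests

base = "https://de.wikipedia.org/w/rest.php"
latest_html = requests.get(f"{base}/v1/page/Erde/html").text        # latest revision
pinned_html = requests.get(f"{base}/v1/revision/123456/html").text  # specific (hypothetical) revid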

Here are the results:

p95 parsoid latency
benchmark - flagged revisions enabled - core page html - p99 latency - per domain.png

p99 parsoid latency
benchmark - flagged revisions enabled - core page html - p95 latency - per domain.png

page summary endpoint throughput
benchmark - flagged revisions enabled - core page html - summary responses per sec.png

It looks like after the cache gets populated, responses get very fast and throughput is much higher.
Compared to Parsoid on RESTBase, latency is higher until caching kicks in.

Jgiannelos changed the task status from Open to In Progress. Nov 3 2022, 11:40 AM
Jgiannelos claimed this task.

From Quarry, some investigation on how many non-latest stable revisions we have in the wikis we ran the benchmarks against:

SELECT COUNT(*) FROM page, flaggedpages
WHERE (page_id = fp_page_id) AND (page_latest <> fp_stable);

  • enwiki: 4
  • dewiki: 12611
  • eswiki: no FlaggedRevs enabled
  • frwiki: 0
  • itwiki: no FlaggedRevs
  • jawiki: no FlaggedRevs
  • ptwiki: no FlaggedRevs
  • ruwiki: 350544

It looks like, for the scope of this ticket, we have enough information about the bottleneck in the summary endpoint: the problem is not necessarily the API calls but rather the flagged revisions handling.
Performance-wise, it wouldn't make much of a difference to move it to an extension.

We already have tickets for the flagged revisions handling in the MW REST API, so I am closing this one for now.