Maniphest T319365

PCS caching and pregeneration when restbase is decommissioned
Open, High, Public

Description

Currently, PCS sits behind two layers of caching:

the edge caches, which get filled by repeated external requests
restbase, which gets its contents pregenerated via changeprop
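
To make the two layers concrete, here is a minimal sketch of the lookup order described above. It is an illustration only: the function names, the dict-based caches and the `render` callback are hypothetical stand-ins, not the real edge, restbase or changeprop code, and it deliberately skips details such as TTLs, invalidation and storing on-demand renders.

```python
from typing import Callable, Dict, Tuple


def fetch(
    url: str,
    edge: Dict[str, str],
    storage: Dict[str, str],
    render: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (body, layer), where layer names the layer that answered."""
    if url in edge:          # layer 1: edge cache, filled by earlier external requests
        return edge[url], "edge"
    if url in storage:       # layer 2: pregenerated content (the role restbase plays today)
        body, layer = storage[url], "storage"
    else:                    # both layers missed: PCS renders on demand
        body, layer = render(url), "backend"
    edge[url] = body         # the edge keeps whatever it just served
    return body, layer


def pregenerate(url: str, storage: Dict[str, str], render: Callable[[str], str]) -> None:
    """Assumed shape of a changeprop-style job: render after an edit, before any request."""
    storage[url] = render(url)
```

In this model, removing pregeneration means the first request after every edit falls through to the "backend" branch, so what matters is how often the edge layer alone can answer.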

Given that our goal is to remove restbase from the equation, we set out to figure out how much each of the above actually benefits our users. We would like to keep the system as simple as possible.

In this task we want to determine if:

We can just get rid of the restbase pregeneration
We can get rid of pregeneration and caching/invalidation
We need to preserve both in PCS

Reference:

Details

Subject	Repo	Branch	Lines +/-
Orchestrate caching observability	mediawiki/services/mobileapps	master	+104 -6
changeprop: Enable PCS pregeneration without restbase	operations/deployment-charts	master	+30 -1
mobileapps: Re-enable caching after RESTBase sets the UA header	operations/deployment-charts	master	+1 -1
mobileapps: Re-enabling caching after adding missing credentials	operations/deployment-charts	master	+1 -1
mobileapps: Configure caching for production	operations/deployment-charts	master	+115 -1
Allow disabling caching per user-agent	mediawiki/services/mobileapps	master	+7 -1
caching: Enable middleware for media-list	mediawiki/services/mobileapps	master	+3 -2
changeprop: Disable restbase pregeneration on PCS	operations/deployment-charts	master	+2 -101


Related Objects

Status	Assigned	Task
Stalled	None	T324931 Clean up open RESTBase related tickets
In Progress	None	T262315 <CORE TECHNOLOGY> API Migration & RESTBase Sunset
Stalled	Dbrant	T328943 Replace PCS lazy-loading logic with standard "loading=lazy" attribute
Open	None	T314025 [EPIC] Migrate PCS service away from restbase
In Progress	None	T374135 Migrate RESTbase page content endpoints
Open	Jgiannelos	T319365 PCS caching and pregeneration when restbase is decommissioned
Resolved	cooltey	T322142 Measure network latency of page load requests (Android)
Resolved	SNowick_WMF	T327548 Report network latency of page load requests (Android)
Resolved	SNowick_WMF	T332769 pt.wiki latency P95 + P99 in apps pre- and post-disabling of pregeneration in PCS
Resolved	Eevans	T348993 Create new cassandra table data model for PCS
Resolved	Jgiannelos	T348995 Introduce PCS cache management layer
Stalled	hnowlan	T350507 Update mobileapps k8s deployment chart for Cassandra credentials
Open	Jgiannelos	T348996 Change changeprops rules to pre-generate/invalidate cache directly to PCS rather than in restbase
Resolved	Jgiannelos	T366819 Enable PCS to send resource change events to handle URL purges
Resolved	Jgiannelos	T368052 Allow connections from PCS to eventgate

Event Timeline


dr0ptp4kt subscribed.Oct 6 2022, 8:04 PM

Change 840097 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] changeprop: Disable pregeneration on PCS

https://gerrit.wikimedia.org/r/840097

gerritbot added a project: Patch-For-Review.Oct 7 2022, 10:43 AM

daniel added a subscriber: JMinor.Oct 7 2022, 12:46 PM

kostajh subscribed.Oct 7 2022, 1:03 PM

Tagging per conversation with Josh

Restricted Application added a project: Wikipedia-iOS-App-Backlog.Oct 18 2022, 6:46 PM

Seddon subscribed.Oct 18 2022, 6:53 PM

LGoto moved this task from Epics in Progress to Blocked/Waiting on the Wikipedia-Android-App-Backlog (Android Release - FY2023-24) board.Oct 20 2022, 4:08 PM

LGoto moved this task from Needs Triage to Engineering Backlog on the Wikipedia-iOS-App-Backlog board.Oct 24 2022, 6:38 PM

LGoto added a project: ios-app-v7.0.

LGoto moved this task from Tasks from Product Backlog to Blocked or Waiting on the ios-app-v7.0 board.

In T319365#8289873, @daniel wrote:

Jdlrobson just pointed me to T214000: Evaluate difficulty of porting PCS summary logic to PHP and T213505: RfC: OpenGraph descriptions in wiki pages for context and consideration

In T214000#8290521, @cscott wrote:

[…] doing proper language converter redirects (T240068) and handling flagged revisions (T209936) become a lot easier if this code is moved to core (in PHP). It will also avoid a round trip, since both the parsoid and legacy html are cached in core. […]

Note that an (inefficient) PHP implementation of this already exists in the form of the TextExtracts extension. This was originally developed for the Popups extension, and after a performance review found latencies upward of a second (due to no use of Varnish cache and no use of ParserCache), it was then re-implemented in Node.js, although this wasn't more performant per se. It's just that RESTBase's pre-caching helped hide the latency (T117082: Cached REST endpoint for extracts requests, and T70861, T123445, T118147).

The Popups extension (afaik) still supports the TextExtracts extension API as a source for its Hovercards, and it also remains what we use locally during development and for third parties. I'm not suggesting anything regarding its current code per se, but that extension might make for a good place to host the new REST endpoint, and then for someone to take ownership of that extension (and e.g. replace its Action API module with a call to the same optimised code, to support current batch consumers). See also: T256505: TextExtracts extension: Code stewardship review and T231797.

This was originally developed for the Popups extension,

I don't think this is true, from what I recall. TextExtracts predates Popups by quite a few years and was made by Max Semenik a while back for mobile.

The TextExtracts extension has a slightly different use case, and these days it's primarily used by user tools for obtaining plain text. The first version of Popups was built using it, but when the web team inherited Popups we quickly determined it was not fit for the page previews use case. The only reason we still link Page previews to TextExtracts is 3rd party support. If a PHP implementation existed, we would remove the API calls to TextExtracts and the associated code, which would also save quite a few bytes on the client.

Ladsgroup subscribed.Oct 28 2022, 8:22 AM

Dbrant added a subtask: T322142: Measure network latency of page load requests (Android).Nov 1 2022, 3:22 PM

akosiaris subscribed.Nov 1 2022, 3:24 PM

LGoto removed a project: ios-app-v7.0.Nov 7 2022, 7:45 PM

Dbrant closed subtask T322142: Measure network latency of page load requests (Android) as Resolved.Dec 8 2022, 4:53 PM

VirginiaPoundstone mentioned this in T329419: Architect potential API Gateway Patterns in preparation for services migrated off RESTbase.Feb 11 2023, 9:34 AM

Just an update after experimenting with PCS/Summary not using pregenerated content in prod (ticket for reference: https://phabricator.wikimedia.org/T314770)

We ran ptwiki for a few weeks without pregenerated content (RESTBase passed requests through to the backend, so the only caching was at the edge)
Here are all the findings (mostly focusing on latency and cache efficiency):

I think it's important to revisit the cache efficiency numbers:

Edge caching efficiency for less popular endpoints is lower than we expected when we started investigating how PCS/summary would perform without pregeneration
- page/summary has very good hit rates because it's heavily used by bots/previews, so it turns out that the lack of pregeneration is not a problem for it
- page/mobile-html on the other hand didn't perform well (our example in production was ptwiki)
  - There are more details in the tickets mentioned but roughly
    - enwiki ~50% hit ratio
    - ptwiki ~30% hit ratio
  - My assumption is that PCS traffic competing with the overall traffic on edge (cache_text) causes a lot of evictions on PCS side leading to poor cache efficiency on /page/mobile-html endpoints
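
The eviction assumption above can be illustrated with a toy simulation. This is a sketch under strong assumptions: it models the edge as a single plain LRU cache, whereas the real cache_text layer may use a different eviction policy, and the workload, key names and capacity are made up for illustration. It does not reproduce the production hit ratios quoted above.

```python
from collections import OrderedDict
from typing import Iterable


def lru_hit_ratio(requests: Iterable[str], capacity: int, prefix: str) -> float:
    """Replay requests through an LRU cache; return the hit ratio for keys starting with prefix."""
    cache: "OrderedDict[str, None]" = OrderedDict()
    hits = total = 0
    for key in requests:
        tracked = key.startswith(prefix)
        if tracked:
            total += 1
        if key in cache:
            cache.move_to_end(key)          # mark as most recently used
            if tracked:
                hits += 1
        else:
            cache[key] = None
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict the least recently used key
    return hits / total if total else 0.0


# Hypothetical workload: 4 PCS pages requested round-robin for 10 rounds.
pcs_only = [f"pcs/{i}" for _ in range(10) for i in range(4)]

# Same PCS requests, each followed by 2 unrelated one-off requests sharing the cache.
mixed = []
for n, key in enumerate(pcs_only):
    mixed.append(key)
    mixed.extend([f"other/{2 * n}", f"other/{2 * n + 1}"])

alone = lru_hit_ratio(pcs_only, capacity=6, prefix="pcs/")
shared = lru_hit_ratio(mixed, capacity=6, prefix="pcs/")
print(f"PCS alone: {alone:.2f}, PCS sharing the cache: {shared:.2f}")
```

With these made-up numbers the PCS keys always fit when they have the cache to themselves, but are evicted before they are requested again once unrelated traffic shares the same capacity. The effect is exaggerated on purpose; it only shows the mechanism, not its real-world size.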

There is one very promising optimization that we couldn't try in prod without causing issues in cassandra: reducing the round trips to the Action API for redirect information that is already available in the existing parsoid output (the reports above have more details).
I would be interested to hear from serviceops/traffic whether there are any improvements we can make in edge caching for the soon-to-be migrated restbase endpoints in order to improve cache efficiency.

Jgiannelos added a project: Traffic.Apr 14 2023, 1:54 PM

Maintenance_bot added a project: SRE.Apr 14 2023, 2:29 PM

Jgiannelos added a project: Content-Transform-Team-WIP.Apr 17 2023, 1:52 PM

Jgiannelos moved this task from Backlog to In Progress on the Content-Transform-Team-WIP board.

Jgiannelos claimed this task.Apr 24 2023, 3:14 PM

jijiki subscribed.May 2 2023, 1:27 PM

@KOfori, could you please have someone from your team help with consultation? Based on my chat with Frantz, his team is looking for someone from the traffic team to help guide them through.

@Kappakayala indeed. Had a quick chat earlier with @FJoseph-WMF and briefly with the team. We'll set something up to discuss further.

Jgiannelos moved this task from In Progress to Blocked on the Content-Transform-Team-WIP board.May 16 2023, 2:15 PM

@KOfori Is there a meeting scheduled to follow up on this? If not, when can we expect the discussion to start?

I've scheduled a follow-up meeting this week

Fabfur subscribed.May 18 2023, 2:59 PM

MSantos added a project: RESTBase Sunsetting.Jun 19 2023, 11:29 AM

dr0ptp4kt unsubscribed.Jul 28 2023, 4:07 PM

MSantos moved this task from Blocked to In Progress on the Content-Transform-Team-WIP board.Jul 31 2023, 3:14 PM

MSantos moved this task from Unsorted to PCS Service Pile on the RESTBase Sunsetting board.Aug 18 2023, 3:10 PM

Jgiannelos moved this task from In Progress to Blocked on the Content-Transform-Team-WIP board.Sep 21 2023, 11:33 AM

jijiki edited projects, added User-jijiki; removed SRE.Sep 28 2023, 9:54 AM

MSantos triaged this task as High priority.Oct 16 2023, 2:47 PM

MSantos added a project: Epic.

MSantos moved this task from Blocked to Current Epics on the Content-Transform-Team-WIP board.

MSantos added a subtask: T348993: Create new cassandra table data model for PCS.

MSantos added a subtask: T348995: Introduce PCS cache management layer.

MSantos added a subtask: T348996: Change changeprops rules to pre-generate/invalidate cache directly to PCS rather than in restbase.

MSantos added a parent task: T314025: [EPIC] Migrate PCS service away from restbase.

Change 840097 abandoned by Jgiannelos:

[operations/deployment-charts@master] changeprop: Disable restbase pregeneration on PCS

Reason:

https://gerrit.wikimedia.org/r/840097

Maintenance_bot removed a project: Patch-For-Review.Nov 22 2023, 2:30 PM

Jgiannelos closed subtask T348993: Create new cassandra table data model for PCS as Resolved.Dec 7 2023, 3:03 PM

Change 1005456 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] caching: Enable middleware for media-list

https://gerrit.wikimedia.org/r/1005456

gerritbot added a project: Patch-For-Review.Feb 21 2024, 10:51 AM

Change 1005456 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] caching: Enable middleware for media-list

https://gerrit.wikimedia.org/r/1005456

Maintenance_bot removed a project: Patch-For-Review.Feb 21 2024, 4:31 PM

jgiannelos merged https://gitlab.wikimedia.org/repos/content-transform/nodejs-cassandra-storage/-/merge_requests/7

middleware: Use headers to describe storage status

jgiannelos opened https://gitlab.wikimedia.org/repos/content-transform/nodejs-cassandra-storage/-/merge_requests/8

middleware: Allow purging URLs on invalidation

MSantos closed subtask T348995: Introduce PCS cache management layer as Resolved.Apr 8 2024, 3:36 PM

jgiannelos merged https://gitlab.wikimedia.org/repos/content-transform/nodejs-cassandra-storage/-/merge_requests/8

middleware: Implement storage update hook

Jgiannelos closed subtask T366819: Enable PCS to send resource change events to handle URL purges as Resolved.Jul 3 2024, 10:46 AM

Change #1053904 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] Allow disabling caching per user-agent

https://gerrit.wikimedia.org/r/1053904

JTannerWMF edited projects, added Wikipedia-Android-App-Backlog (Android Release - FY2024-25); removed Wikipedia-Android-App-Backlog (Android Release - FY2023-24).Jul 12 2024, 10:37 PM

JTannerWMF moved this task from Epics in Progress to Needs Eng. Manager on the Wikipedia-Android-App-Backlog (Android Release - FY2024-25) board.Jul 12 2024, 11:01 PM

Change #1053904 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Allow disabling caching per user-agent

https://gerrit.wikimedia.org/r/1053904

Tsevener moved this task from Engineering Backlog to Tracking on the Wikipedia-iOS-App-Backlog board.Jul 29 2024, 6:29 PM

Change #1063765 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Configure caching for production

https://gerrit.wikimedia.org/r/1063765

Change #1064013 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] changeprop: Enable PCS pregeneration without restbase

https://gerrit.wikimedia.org/r/1064013

Jgiannelos moved this task from Current Epics to Code Review on the Content-Transform-Team-WIP board.Aug 21 2024, 9:20 AM

Change #1063765 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Configure caching for production

https://gerrit.wikimedia.org/r/1063765

Jgiannelos moved this task from Code Review to To Deploy on the Content-Transform-Team-WIP board.Aug 26 2024, 10:34 AM

Reverted after high error rate:
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1068015

There must be something wrong in the cassandra config and authentication:

All host(s) tried for query failed. First host tried, 10.192.48.142:9042: AuthenticationError: Provided username REDACTED and/or password are incorrect
    at authResponseCallback (/srv/service/node_modules/cassandra-driver/lib/connection.js:426:29)
    at /srv/service/node_modules/cassandra-driver/lib/connection.js:532:7
    at OperationState._swapCallbackAndInvoke (/srv/service/node_modules/cassandra-driver/lib/operation-state.js:160:5)
    at OperationState.setResult (/srv/service/node_modules/cassandra-driver/lib/operation-state.js:154:10)
    at Connection.handleResult (/srv/service/node_modules/cassandra-driver/lib/connection.js:693:15)
    at ResultEmitter.emit (node:events:517:28)
    at ResultEmitter.each (/srv/service/node_modules/cassandra-driver/lib/streams.js:537:17)
    at ResultEmitter._write (/srv/service/node_modules/cassandra-driver/lib/streams.js:521:10)
    at writeOrBuffer (node:internal/streams/writable:392:12)
    at _write (node:internal/streams/writable:333:10)
    at Writable.write (node:internal/streams/writable:337:10)
    at Parser.ondata (node:internal/streams/readable:809:22)
    at Parser.emit (node:events:517:28)
    at addChunk (node:internal/streams/readable:368:12)
    at readableAddChunk (node:internal/streams/readable:341:9)
    at Readable.push (node:internal/streams/readable:278:10) {
  info: 'Represents an authentication error from the driver or from a Cassandra node.',
  additionalInfo: [ResponseError]
}. See innerErrors.

https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.08.28?id=vOtgmZEBizfZbaVPJjwy

Jgiannelos added a project: Data-Persistence.Aug 28 2024, 2:33 PM

Change #1068024 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Re-enabling caching after adding missing credentials

https://gerrit.wikimedia.org/r/1068024

Change #1068024 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Re-enabling caching after adding missing credentials

https://gerrit.wikimedia.org/r/1068024

Yet another revert after:
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1068032

Mobileapps serves uncached responses based on the user agent while we switch over, so that restbase stays consistent
We configured PCS to serve uncached traffic for RESTBase/WMF user agents
RESTBase doesn't set the right headers on its requests

We need to find another way to detect if traffic comes from RESTBase because the user agent is not consistent
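
For readers following along, the failure mode can be sketched as follows. This is not the mobileapps implementation: the function name, the marker list and its value are hypothetical, and the real matching rules live in the mobileapps deployment config.

```python
from typing import Mapping, Sequence

# Hypothetical marker; the real values are set in the deployment config.
UNCACHED_UA_MARKERS: Sequence[str] = ("restbase",)


def should_bypass_cache(
    headers: Mapping[str, str],
    markers: Sequence[str] = UNCACHED_UA_MARKERS,
) -> bool:
    """True when the request's User-Agent matches a marker for uncached traffic."""
    ua = ""
    for name, value in headers.items():
        if name.lower() == "user-agent":   # header names are case-insensitive
            ua = value.lower()
            break
    return any(marker.lower() in ua for marker in markers)


# A request that identifies itself is served uncached ...
print(should_bypass_cache({"User-Agent": "RESTBase/WMF"}))
# ... but one without the expected header is treated like any other client.
print(should_bypass_cache({}))
```

The second call shows why the rollout had to be reverted: a check like this only works if the caller reliably sends the header, so a request from RESTBase without the expected user agent gets the cached path.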

Change #1068798 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Re-enable caching after RESTBase sets the UA header

https://gerrit.wikimedia.org/r/1068798

Change #1068798 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Re-enable caching after RESTBase sets the UA header

https://gerrit.wikimedia.org/r/1068798

Jgiannelos moved this task from To Deploy to To Verify on the Content-Transform-Team-WIP board.Aug 29 2024, 4:02 PM

daniel mentioned this in T374135: Migrate RESTbase page content endpoints.Sep 5 2024, 4:50 PM

daniel added a parent task: T374135: Migrate RESTbase page content endpoints.Sep 5 2024, 4:52 PM

Seddon moved this task from Android Release - FY2024-25 to Tracking on the Wikipedia-Android-App-Backlog board.Sep 6 2024, 4:26 PM

Seddon edited projects, added Wikipedia-Android-App-Backlog; removed Wikipedia-Android-App-Backlog (Android Release - FY2024-25).

jijiki moved this task from Incoming🐅 to Misc on the User-jijiki board.Oct 15 2024, 10:43 AM

jijiki moved this task from Misc to Radar 📻 on the User-jijiki board.Oct 15 2024, 11:30 AM

jijiki moved this task from Radar 📻 to Misc on the User-jijiki board.Oct 15 2024, 12:54 PM

jijiki moved this task from Misc to Next up 🥌 on the User-jijiki board.Oct 15 2024, 1:12 PM

jijiki moved this task from Next up 🥌 to Incoming🐅 on the User-jijiki board.Oct 15 2024, 4:05 PM

jijiki moved this task from Incoming🐅 to Misc on the User-jijiki board.

Change #1064013 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: Enable PCS pregeneration without restbase

https://gerrit.wikimedia.org/r/1064013

Change #1082183 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] Orchestrate caching observability

https://gerrit.wikimedia.org/r/1082183

Change #1082183 merged by jenkins-bot: