Page MenuHomePhabricator

PCS caching and pregeneration when restbase is decommissioned
Open, HighPublic

Description

Currently, PCS sits behind two layers of caching:

  • the edge caches, which get filled by repeated external requests
  • restbase, which gets its contents pregenerated via changeprop

Given we have the goal of removing restbase from the equation, we set out to figure out how much all of the above actually benefit our users. We would like to keep the system as simple as possible.

In this task we want to determine if:

  • We can just get rid of the restbase pregeneration
  • We can get rid of pregeneration and caching/invalidation
  • We need to preserve both in PCS

Reference:

See also:

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 840097 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] changeprop: Disable pregeneration on PCS

https://gerrit.wikimedia.org/r/840097

[…] doing proper language converter redirects (T240068) and handling flagged revisions (T209936) become a lot easier if this code is moved to core (in PHP). It will also avoid a round trip, since both the parsoid and legacy html are cached in core. […]

Note that an (inefficient) PHP implementation of this already exists in the form of the TextExtracts extension. This was originally developed for the Popups extension, and after a performance review found latencies upward of a second (due to no use of Varnish cache and no use of ParserCache) it was then re-implemented in Node.js, although this wasn't more performant per-se. It's just that RESTBase's pre-caching helped hide the latency (T117082: Cached REST endpoint for extracts requests, and T117082, T70861, T123445, T118147).

The Popups extension (afaik) still supports the TextExtracts extension API as source for its Hovercards, and remains also what we use locally during development and for third-parties. I'm not suggesting anything regarding its current code per-se, but that extension might make for a good place to host the new REST endpoint, and then for someone to take ownership over that extension (and e.g. replace its Action API module with a call to the same optimised code, to support current batch consumers). See also: T256505: TextExtracts extension: Code stewardship review and T231797.

This was originally developed for the Popups extension,

I don't think this is true from what I recall. TextExtract predates Popups by quite a few years and was made by Max Semenik a while back for mobile.

The TextExtracts extension has a slightly different use case - and these days it's primarily used by user tools for obtaining plain text. The first version of Popups was built using it, but when web team inherited Popups we quickly determined it was not fit for the use case of page previews. The only reason we still link Page previews to TextExtracts is 3rd party support. If a PHP implementation existed, we would remove the API calls to TextExtracts and associated code and this would also save quite a few bytes on the client.

Just an update after experimenting with PCS/Summary not using pregenerated content in prod (ticket for reference: https://phabricator.wikimedia.org/T314770)

I think its important to revisit the numbers of cache efficiency:

  • Caching efficiency on edge for non popular endpoints is not as expected when we started investigating how PCS/summary would perform without pregeneration
    • page/summary has very good hit rates because its heavily used by bots/previews so it turns out that pregeneration is not a problem for it
    • page/mobile-html on the other hand didn't perform well (our example in production was ptwiki)
      • There are more details in the tickets mentioned but roughly
        • enwiki ~50% hit ratio
        • ptwiki ~30% hit ratio
      • My assumption is that PCS traffic competing with the overall traffic on edge (cache_text) causes a lot of evictions on PCS side leading to poor cache efficiency on /page/mobile-html endpoints

There is one very promising optimization that we couldn't try in prod without causing issues in cassandra, to reduce the roundtrips to Action API for redirect information that is already available in existing parsoid output (for more information there are details on the reports above).
I would be interested to hear from serviceops/traffic if there is any improvements we can do in edge caching for the soon-to-be migrated restbase endpoints in order to improve cache efficiency.

@KOfori, Could you please have someone from your team to help with consultation. Based on my chat with Frantz, his team is looking for someone from traffic team to help guide them through.

@Kappakayala indeed. Had a quick chat earlier with @FJoseph-WMF and briefly with the team. We'll set something up to discuss further.

@KOfori Is there a meeting scheduled to followup on this? If not, when can we expect the discussion to start?

I've scheduled a meeting this week for followup

Change 840097 abandoned by Jgiannelos:

[operations/deployment-charts@master] changeprop: Disable restbase pregeneration on PCS

Reason:

https://gerrit.wikimedia.org/r/840097

Change 1005456 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] caching: Enable middleware for media-list

https://gerrit.wikimedia.org/r/1005456

Change 1005456 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] caching: Enable middleware for media-list

https://gerrit.wikimedia.org/r/1005456

Change #1053904 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] Allow disabling caching per user-agent

https://gerrit.wikimedia.org/r/1053904

Change #1053904 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Allow disabling caching per user-agent

https://gerrit.wikimedia.org/r/1053904

Change #1063765 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Configure caching for production

https://gerrit.wikimedia.org/r/1063765

Change #1064013 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] changeprop: Enable PCS pregeneration without restbase

https://gerrit.wikimedia.org/r/1064013

Change #1063765 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Configure caching for production

https://gerrit.wikimedia.org/r/1063765

There must be something wrong in the cassandra config and authentication:

All host(s) tried for query failed. First host tried, 10.192.48.142:9042: AuthenticationError: Provided username REDACTED and/or password are incorrect
    at authResponseCallback (/srv/service/node_modules/cassandra-driver/lib/connection.js:426:29)
    at /srv/service/node_modules/cassandra-driver/lib/connection.js:532:7
    at OperationState._swapCallbackAndInvoke (/srv/service/node_modules/cassandra-driver/lib/operation-state.js:160:5)
    at OperationState.setResult (/srv/service/node_modules/cassandra-driver/lib/operation-state.js:154:10)
    at Connection.handleResult (/srv/service/node_modules/cassandra-driver/lib/connection.js:693:15)
    at ResultEmitter.emit (node:events:517:28)
    at ResultEmitter.each (/srv/service/node_modules/cassandra-driver/lib/streams.js:537:17)
    at ResultEmitter._write (/srv/service/node_modules/cassandra-driver/lib/streams.js:521:10)
    at writeOrBuffer (node:internal/streams/writable:392:12)
    at _write (node:internal/streams/writable:333:10)
    at Writable.write (node:internal/streams/writable:337:10)
    at Parser.ondata (node:internal/streams/readable:809:22)
    at Parser.emit (node:events:517:28)
    at addChunk (node:internal/streams/readable:368:12)
    at readableAddChunk (node:internal/streams/readable:341:9)
    at Readable.push (node:internal/streams/readable:278:10) {
  info: 'Represents an authentication error from the driver or from a Cassandra node.',
  additionalInfo: [ResponseError]
}. See innerErrors.

https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.08.28?id=vOtgmZEBizfZbaVPJjwy

Change #1068024 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Re-enabling caching after adding missing credentials

https://gerrit.wikimedia.org/r/1068024

Change #1068024 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Re-enabling caching after adding missing credentials

https://gerrit.wikimedia.org/r/1068024

Yet another revert after:
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1068032

  • Mobileapps is serving uncached responses based on a user agent (so we keep restbase consistent) while we switchover
  • We configured PCS to serve uncached traffic for RESTBase/WMF user agents
  • RESTBase doesn't set the right headers on its requests

We need to find another way to detect if traffic comes from RESTBase because the user agent is not consistent

Change #1068798 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[operations/deployment-charts@master] mobileapps: Re-enable caching after RESTBase sets the UA header

https://gerrit.wikimedia.org/r/1068798

Change #1068798 merged by jenkins-bot:

[operations/deployment-charts@master] mobileapps: Re-enable caching after RESTBase sets the UA header

https://gerrit.wikimedia.org/r/1068798

jijiki moved this task from Incoming🐅 to Misc on the User-jijiki board.

Change #1064013 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: Enable PCS pregeneration without restbase

https://gerrit.wikimedia.org/r/1064013

Change #1082183 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/mobileapps@master] Orchestrate caching observability

https://gerrit.wikimedia.org/r/1082183

Change #1082183 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Orchestrate caching observability

https://gerrit.wikimedia.org/r/1082183