
New Service Request memcached-wikifunctions
Closed, ResolvedPublic

Description

Description: A bespoke memcached micro-cluster for Wikifunctions results, per SRE Service Ops's advice.
Timeline: By end of March 2022.
Diagram: https://commons.wikimedia.org/wiki/File:Wikifunctions_-_Top-level_architectural_model.svg (Red box)
Technologies: Memcached. At first our needs will be small: e.g. each payload will be ~500 B, with maybe ~10,000 unique calls to each of perhaps ~1,000 functions, so that's perhaps ~5 GB (roughly 4.7 GiB) of storage? Load can be adjusted downwards if needed.
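The storage figure follows from multiplying the three estimates, reading the numbers as ~10,000 unique calls to each of ~1,000 functions (that reading is an assumption; with 10,000 calls in total the result would be only ~5 MB). A quick check, counting raw payload bytes only and ignoring keys and memcached's per-item overhead:

```python
# Back-of-the-envelope sizing using the rough estimates from the task description.
PAYLOAD_BYTES = 500          # ~500 B per cached function-call result
CALLS_PER_FUNCTION = 10_000  # ~10,000 unique calls per function (assumed reading)
FUNCTIONS = 1_000            # ~1,000 functions


def estimated_storage_bytes(payload: int, calls: int, functions: int) -> int:
    """Raw payload bytes only; keys and per-item overhead are not counted."""
    return payload * calls * functions


total = estimated_storage_bytes(PAYLOAD_BYTES, CALLS_PER_FUNCTION, FUNCTIONS)
# Decimal gigabytes vs binary gibibytes differ by about 7% at this size.
print(f"{total:,} bytes = {total / 1e9:.1f} GB = {total / 2**30:.2f} GiB")
```

This prints `5,000,000,000 bytes = 5.0 GB = 4.66 GiB`, so "about 5 GB" is the more precise way to state the estimate.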
WMF services this new service talks to: None
Which services will connect to this service and how:

  • Reads and writes from Wikifunctions orchestrator service (T297314)
  • Reads from MW app servers (WikiLambda extension); restricted at first to just wikifunctions.org

Will this service use our event platform? No
Does this service talk to an external service? No.
Point person: @Jdforrester-WMF

Event Timeline

Jdforrester-WMF moved this task from To triage to Backlog on the Abstract Wikipedia team board.

We have already procured the servers for this work, and they're set up already.

Yup, I just need to configure the entry point into the production config.

Sorry, somehow my message got cut in half. I was going to add: I think it would be a good idea to use mcrouter to connect to the memcached pool, which shouldn't be hard to add to your helm charts. I'll see if I can manage to do that tomorrow.
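mcrouter is driven by a JSON configuration that names the memcached pools and says how requests are routed to them. A minimal sketch of that configuration's general shape, built in Python, is below. The pool name `wikifunctions` and the hostnames are hypothetical placeholders, and this is not the configuration from the Gerrit changes referenced in this task.

```python
import json

# Hypothetical memcached hosts; the real hosts and ports belong to the
# production deployment configuration, not to this sketch.
WIKIFUNCTIONS_SERVERS = [
    "wf-memcached-1.example.internal:11211",
    "wf-memcached-2.example.internal:11211",
]


def build_mcrouter_config(servers: list[str]) -> dict:
    """Minimal mcrouter config: one pool, with all traffic routed to it."""
    return {
        "pools": {
            "wikifunctions": {"servers": servers},
        },
        # PoolRoute hashes each key onto one server in the named pool.
        "route": "PoolRoute|wikifunctions",
    }


config = build_mcrouter_config(WIKIFUNCTIONS_SERVERS)
print(json.dumps(config, indent=2))
```

The application then talks to its local mcrouter as if it were a single memcached server, and mcrouter decides which pool member holds each key.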

Another question: you mention that MediaWiki would have to access cache keys here. Am I reading it right that we're making shared use of memcached across applications?

For now, I'll just work on making the memcached instance reachable from the orchestrator. I have some reservations about using a datastore (ephemeral or not) as an integration point between different services.

Answering myself: the model seems to imply that is the case.

I strongly urge you to reconsider that model, mostly because using a database (even if it's a database of cached artifacts) as the integration point between services is generally not a recommended pattern.

I can see at least 2 main issues with this approach:

  • Making any change to the data model in the cache will require coordination across the two systems, and possibly some cache purging. It will make it harder to change how you cache your data, and will always require you to maintain backwards compatibility in the data store rather than in a programmable interface.
  • All of our observability tools, which assume that all services talk to each other via HTTP over our service mesh, will break down. In general, debugging issues with cached data is going to be much harder than debugging HTTP calls.
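The first concern can be made concrete with a hypothetical example in which one service writes cache entries and another reads them. The payload layouts, field names and version tag below are invented for illustration and are not the real Wikifunctions cache format.

```python
import json

# Hypothetical payload formats that a writer service might put in a shared cache.


def write_v1(value: str) -> bytes:
    return json.dumps({"v": 1, "value": value}).encode()


def write_v2(value: str, duration_ms: int) -> bytes:
    # A later version of the writer changes the layout and adds metadata.
    return json.dumps(
        {"v": 2, "result": {"value": value, "duration_ms": duration_ms}}
    ).encode()


def read_value(raw: bytes) -> str:
    """Reader in the *other* service. It must understand every format that
    may still be sitting in the cache, so an old branch can only be dropped
    once all old entries have expired or been purged."""
    entry = json.loads(raw)
    version = entry.get("v")
    if version == 1:
        return entry["value"]
    if version == 2:
        return entry["result"]["value"]
    raise ValueError(f"unknown cache payload version: {version!r}")


# During a rollout, both formats can coexist in the cache.
cache = {
    "old-key": write_v1("Hello"),
    "new-key": write_v2("World", duration_ms=12),
}
print([read_value(raw) for raw in cache.values()])
```

Every format change on the writer side has to be released in lockstep with a compatible reader, which is the cross-team coordination the comment warns about; behind an HTTP API, only the owning service would need to know the stored format.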

There's also the matter that this is a small memcached cluster with no failover, which we can't really expand in the next FY; that will limit our ability to scale up Wikifunctions if the cache is the integration point.

Thinking a bit more about this:
how do we ensure that the sharding of keys across servers is the same in the two systems? One way to do it is using mcrouter, but that is potentially complicated by the fact that we have a default prefix set in MediaWiki, so we'll need to look at the code on both sides to ensure that the cache keys are prefixed with the same path.
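Why sharding must agree can be illustrated with a generic consistent-hash ring. This is not mcrouter's hash function (mcrouter uses its own); it only shows that two clients find the same entry only if they hash the exact same key string, with the same algorithm, over the same server list. Hostnames and keys are hypothetical.

```python
import bisect
import hashlib


def _point(s: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")


class HashRing:
    """Illustrative consistent-hash ring with virtual nodes."""

    def __init__(self, servers: list[str], replicas: int = 100) -> None:
        self._ring = sorted(
            (_point(f"{server}#{i}"), server)
            for server in servers
            for i in range(replicas)
        )
        self._points = [p for p, _ in self._ring]

    def server_for(self, key: str) -> str:
        # First ring point clockwise from the key's position, wrapping around.
        idx = bisect.bisect(self._points, _point(key)) % len(self._ring)
        return self._ring[idx][1]


SERVERS = ["mc-a:11211", "mc-b:11211", "mc-c:11211"]  # hypothetical hosts

orchestrator = HashRing(SERVERS)
mediawiki = HashRing(SERVERS)

key = "WikiLambdaFunctionCall:abc123"  # hypothetical key
# Same key, same algorithm, same servers: both sides pick the same shard.
assert orchestrator.server_for(key) == mediawiki.server_for(key)

# If one side adds a default prefix, it looks up a different key, possibly on
# a different server; even on the same server it would be a cache miss.
for k in (key, "somewiki:" + key):
    print(k, "->", mediawiki.server_for(k))
```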

@Jdforrester-WMF can we configure a prefix path (something like /wikifunctions) for all key fetches in MediaWiki?

Change 944159 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] function-orchestrator: add mcrouter support

https://gerrit.wikimedia.org/r/944159

Change 944160 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] wikifunctions: enable mcrouter for orchestrator

https://gerrit.wikimedia.org/r/944160

Our initial plans are just to access this from the MW side and not the orchestrator (the latter is only needed once we provide non-MW API access to the orchestrator, which is not planned any time soon).

The current code calls MW's ObjectCache makeKey() method with 'WikiLambdaFunctionCall' as the first value. In config we can set the target cache to any entry in wgObjectCaches (it defaults to the main cache, which per this task we don't want to use). Configuring that is T342753, but it needs us to be able to add the WF cache to wgObjectCaches; it sounds like you'd rather we had that routed internally?

That's exactly what I wanted to do: add a new entry to wgObjectCaches with a specific routingPrefix. I wasn't sure we could just configure the target cache in this case.
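A simplified model of how a makeKey()-style key and an mcrouter routing prefix fit together is sketched below. The separator rules, the keyspace `wikifunctionswiki`, the prefix `/eqiad/wikifunctions/` and the route table are illustrative assumptions; the sketch does not reproduce BagOStuff's exact key encoding, and it strips the routing prefix before storage on the assumption that mcrouter does so (worth confirming against the mcrouter documentation).

```python
# Hypothetical routing prefix attached by the cache client for this pool.
ROUTING_PREFIX = "/eqiad/wikifunctions/"

# Hypothetical route table: routing prefix -> memcached pool name.
ROUTES = {
    "/eqiad/mw/": "main",
    "/eqiad/wikifunctions/": "wikifunctions",
}


def make_key(keyspace: str, collection: str, *components: str) -> str:
    # Roughly "keyspace:collection:component..." in the spirit of makeKey();
    # escaping of special characters is deliberately omitted.
    return ":".join((keyspace, collection, *components))


def with_routing_prefix(key: str, prefix: str = ROUTING_PREFIX) -> str:
    return prefix + key


def route(full_key: str) -> tuple[str, str]:
    """Return (pool, key as stored), selecting the pool by routing prefix."""
    for prefix, pool in ROUTES.items():
        if full_key.startswith(prefix):
            return pool, full_key[len(prefix):]
    raise KeyError(f"no route for {full_key!r}")


key = make_key("wikifunctionswiki", "WikiLambdaFunctionCall", "abc123")
pool, stored_key = route(with_routing_prefix(key))
print(pool, stored_key)
```

In this model, any other client that wants to share entries must build the same `stored_key` and send it with the same routing prefix, which is the coordination problem raised earlier in the thread.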

Change 944247 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mediawiki::wanrouter_cache: add wikifunctions

https://gerrit.wikimedia.org/r/944247

Change 944248 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mediawiki::wancache: add the wikifunctions pools and routes

https://gerrit.wikimedia.org/r/944248

Jdforrester-WMF changed the task status from Open to In Progress. Aug 2 2023, 5:02 PM
Jdforrester-WMF moved this task from Backlog to In Progress on the Abstract Wikipedia team board.

Change 944247 merged by Giuseppe Lavagetto:

[operations/puppet@production] mediawiki::wanrouter_cache: add wikifunctions

https://gerrit.wikimedia.org/r/944247

Change 945037 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mw-on-k8s: add wikifunction pools

https://gerrit.wikimedia.org/r/945037

Change 945037 merged by Giuseppe Lavagetto:

[operations/puppet@production] mw-on-k8s: add wikifunction pools

https://gerrit.wikimedia.org/r/945037

Change 945040 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] MediaWiki: add wikifunctions pool to mcrouter

https://gerrit.wikimedia.org/r/945040

Change 945040 merged by jenkins-bot:

[operations/deployment-charts@master] MediaWiki: add wikifunctions pool to mcrouter

https://gerrit.wikimedia.org/r/945040

Change 944248 merged by Giuseppe Lavagetto:

[operations/puppet@production] mediawiki::wancache: add the wikifunctions pools and routes

https://gerrit.wikimedia.org/r/944248

Change 945534 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/mediawiki-config@master] Add wikifunctions object cache

https://gerrit.wikimedia.org/r/945534

Change 945621 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] mediawiki::wanrouter_cache: add wikifunctions placeholder

https://gerrit.wikimedia.org/r/945621

Change 945621 merged by Giuseppe Lavagetto:

[operations/puppet@production] mediawiki::wanrouter_cache: add wikifunctions placeholder

https://gerrit.wikimedia.org/r/945621

Change 945534 merged by jenkins-bot:

[operations/mediawiki-config@master] Add wikifunctions object cache

https://gerrit.wikimedia.org/r/945534

Mentioned in SAL (#wikimedia-operations) [2023-08-10T13:36:00Z] <oblivian@deploy1002> Started scap: Backport for [[gerrit:945534|Add wikifunctions object cache (T297815)]]

Mentioned in SAL (#wikimedia-operations) [2023-08-10T13:37:38Z] <oblivian@deploy1002> oblivian: Backport for [[gerrit:945534|Add wikifunctions object cache (T297815)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)

Mentioned in SAL (#wikimedia-operations) [2023-08-10T13:45:10Z] <oblivian@deploy1002> Finished scap: Backport for [[gerrit:945534|Add wikifunctions object cache (T297815)]] (duration: 09m 09s)

Jdforrester-WMF claimed this task.

I'll call this Resolved for now. No doubt we'll want to extend and improve the service in future (including the potential for using it from the k8s/orchestrator service), but we're done as initially conceived.