New Service Request: function-orchestrator and function-evaluator (for Wikifunctions launch)
Closed, Resolved (Public)

Description

Description: The back-end 'function-orchestrator' and 'function-evaluator' services for the new Wikifunctions.org wiki and its WikiLambda extension
Timeline: Before 2022-03-31; pending finalisation of code, security and performance reviews, etc.
Diagram: https://commons.wikimedia.org/wiki/File:Wikifunctions_-_Top-level_architectural_model.svg
Technologies:

  • function-orchestrator: nodejs
  • function-evaluator: nodejs and python

WMF services this new service talks to: MW app servers (content fetch) and memcached (a new mini cluster: T297815)
Which services will connect to this service and how:

  • MW app servers (WikiLambda extension) via API; restricted at first to just wikifunctions.org
  • Direct usage by end-users on the Internet

Will this service use our event platform? No
Does this service talk to an external service? No.
Point person: @Jdforrester-WMF for now

Deployment checklist

  • Review charts: T295698
  • namespaces in k8s: <gerrit link>
  • puppet private tokens
  • Review helmfile.d files: <gerrit link>
  • LVS setup
  • Deploy to staging cluster and verify it works via curl
  • Deploy to prod clusters and verify it works via curl
  • Generate TLS certificates
  • Discovery DNS
  • Monitoring dashboard
  • Integration and Acceptance tests

Details

Repo                          Branch      Lines +/-
operations/puppet             production  +1 -1
operations/puppet             production  +1 -1
operations/mediawiki-config   master      +1 -3
operations/deployment-charts  master      +6 -0
operations/puppet             production  +4 -0
operations/dns                master      +3 -1
operations/puppet             production  +14 -0
operations/puppet             production  +75 -0
operations/deployment-charts  master      +6 -2
operations/deployment-charts  master      +21 -3
operations/deployment-charts  master      +3 -3
operations/deployment-charts  master      +10 -0
operations/deployment-charts  master      +497 -1 K
operations/deployment-charts  master      +11 -9
operations/deployment-charts  master      +1 -2
operations/deployment-charts  master      +7 -5
operations/deployment-charts  master      +6 -2
operations/deployment-charts  master      +31 -209
operations/deployment-charts  master      +8 -8
operations/deployment-charts  master      +4 -4
Reference: repos/abstract-wiki/wikifunctions/function-evaluator!30
Source Branch: wmf-certificates
Dest Branch: main
Author: jforrester
Title: build: Add wmf-certificates package for production TLS certs

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 933614 had a related patch set uploaded (by Stef Dunlap; author: Stef Dunlap):

[operations/deployment-charts@master] Wikifunctions: update image name; bump tag

https://gerrit.wikimedia.org/r/933614

Change 933614 merged by jenkins-bot:

[operations/deployment-charts@master] Wikifunctions: update image name; bump tag

https://gerrit.wikimedia.org/r/933614

Change 933618 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Add some more real sample values for limits

https://gerrit.wikimedia.org/r/933618

Change 933618 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Add some more real sample values for limits

https://gerrit.wikimedia.org/r/933618

Change 934536 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Drop all the comments and default values

https://gerrit.wikimedia.org/r/934536

Change 934537 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Add initial ENV values for orchestrator to talk to wiki and evaluator

https://gerrit.wikimedia.org/r/934537

Change 934536 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Drop all the comments and default values

https://gerrit.wikimedia.org/r/934536

Change 934537 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Add initial ENV values for orchestrator to talk to wiki and evaluator

https://gerrit.wikimedia.org/r/934537

Change 937969 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Specify Envoy URL and use image with Head:

https://gerrit.wikimedia.org/r/937969

Change 937969 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Specify Envoy URL and use image with Head:

https://gerrit.wikimedia.org/r/937969

Change 937972 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] [WIP] wikifunctions: Add network ability for orchestrator to talk to evaluator

https://gerrit.wikimedia.org/r/937972

Change 938295 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/puppet@production] [WIP] service, k8s: Add service definitions for function-orchestrator and function-evaluator

https://gerrit.wikimedia.org/r/938295

AIUI the only thing talking to the evaluator will be the orchestrator itself, and I wonder if we really need to go through the service mesh for that. If you're not expecting to ever go cross-DC (e.g. orchestrator in eqiad calling evaluator in codfw), it might be good enough to just use envoy as a TLS terminator on the evaluator side and connect to it using the cluster-internal service name. In that case the orchestrator would need to call the evaluator via https and take care of connection pooling.

Sounds good. So in practical terms, we'd only register function-orchestrator in the mesh, and for talking to the evaluator it would use localhost on some port (which?) instead of discovery.

More or less yes. There are some things from our end still to be done (namely create some TLS certificates for you) but after that you will be able to access the evaluator from the orchestrator via https://function-evaluator-main-evaluator.wikifunctions.svc.cluster.local:6927
The orchestrator can then be made accessible from outside k8s via ingress (https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#Add_a_new_service_under_Ingress - we will take care of that).
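
The connection-pooling requirement mentioned above can be illustrated with a minimal sketch. This is not the orchestrator's actual code (that service is nodejs); it is a Python illustration using only the standard library, and the class name `HttpsPool`, the pool size, and the timeout are assumptions made for the example.

```python
import http.client
import queue
import ssl


class HttpsPool:
    """Tiny fixed-size pool of reusable HTTPS connections to one host (illustrative)."""

    def __init__(self, host, port, size=4, context=None):
        self._host = host
        self._port = port
        self._idle = queue.LifoQueue(maxsize=size)
        self._context = context or ssl.create_default_context()

    def acquire(self):
        # Reuse an idle connection if one is available; otherwise create a
        # new object. HTTPSConnection opens its socket lazily on first use.
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            return http.client.HTTPSConnection(
                self._host, self._port, context=self._context, timeout=10
            )

    def release(self, conn):
        # Keep the connection for later reuse, or close it if the pool is full.
        try:
            self._idle.put_nowait(conn)
        except queue.Full:
            conn.close()
```

A caller would `acquire()` a connection, issue its request, read the full response, and then `release()` the connection so the next call can skip the TLS handshake.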

Change 938861 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Set evaluator local URLs per T297314#9019664

https://gerrit.wikimedia.org/r/938861

Change 938861 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Set evaluator local URLs per T297314#9019664

https://gerrit.wikimedia.org/r/938861

Change 939686 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] wikifunctions: Update orchestrator and evaluator

https://gerrit.wikimedia.org/r/939686

Change 939687 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] wikifunctions: Enable mesh and ingress

https://gerrit.wikimedia.org/r/939687

Change 939718 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] CI: TestOutcome for diffs requires stdout to not be empty

https://gerrit.wikimedia.org/r/939718

https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/939687/ is the change actually enabling the mesh for both of the services and ingress for the orchestrator. After deployment to staging you should be able to reach the latter (from within wmf network) via:

curl https://wikifunctions.k8s-staging.discovery.wmnet:30443/

Change 939718 merged by jenkins-bot:

[operations/deployment-charts@master] CI: TestOutcome for diffs requires stdout to not be empty

https://gerrit.wikimedia.org/r/939718

Change 939686 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Update orchestrator and evaluator

https://gerrit.wikimedia.org/r/939686

Change 939687 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Enable mesh and ingress

https://gerrit.wikimedia.org/r/939687

Change 940087 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] function-orchestrator: Fix service name and port for function-evaluator

https://gerrit.wikimedia.org/r/940087

Change 940087 merged by jenkins-bot:

[operations/deployment-charts@master] function-orchestrator: Fix service name and port for function-evaluator

https://gerrit.wikimedia.org/r/940087

More or less yes. There are some things from our end still to be done (namely create some TLS certificates for you) but after that you will be able to access the evaluator from the orchestrator via https://function-evaluator-main-evaluator.wikifunctions.svc.cluster.local:6927
The orchestrator can then be made accessible from outside k8s via ingress (https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#Add_a_new_service_under_Ingress - we will take care of that).

I made the wrong call here by pointing you to the non-TLS version of the evaluator service. I've fixed that in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/940087/, but I would say that this is runtime configuration that should go into helmfile.d/services/wikifunctions rather than be hardcoded in the chart.
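
As a rough illustration of treating the evaluator address as runtime configuration, a service could read it from its environment and reject anything that is not an explicit https endpoint. This is a sketch only: the variable name `FUNCTION_EVALUATOR_URL` is hypothetical, and the real services may be configured differently.

```python
import os
from urllib.parse import urlsplit

# Hypothetical environment variable name, chosen for this example only.
ENV_NAME = "FUNCTION_EVALUATOR_URL"


def evaluator_url(env=None):
    """Return the evaluator URL from the environment, validating its shape."""
    env = os.environ if env is None else env
    url = env.get(ENV_NAME)
    if not url:
        raise RuntimeError(f"{ENV_NAME} is not set")
    parts = urlsplit(url)
    # Require TLS and an explicit port so a misconfiguration fails at startup.
    if parts.scheme != "https" or parts.port is None:
        raise ValueError(f"{ENV_NAME} must be an https URL with a port: {url}")
    return url
```

Failing at startup on a plain-http or port-less value would surface the kind of mix-up described above before any request is made.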

I have just deployed the changes to staging - no idea how to check orchestrator<->evaluator interaction, though

deploy1002:~$ curl https://wikifunctions.k8s-staging.discovery.wmnet:30443/_info
{"name":"function-orchestrator","version":"0.0.1","description":"A Wikifunctions service to orchestrate WikiLambda function executors","home":"http://meta.wikimedia.org/wiki/Abstract%20Wikipedia"}

A basic check of the orchestrator should be:

curl https://wikifunctions.k8s-staging.discovery.wmnet:30443/1/v1/evaluate/ -X POST --data '{"zobject":{"Z1K1":"Z7","Z7K1":"Z801","Z801K1":"foo"},"doValidate":false}' --header "Content-type: application/json"

The response should start with {"Z1K1":"Z22","Z22K1":"foo","Z22K2":… (but right now it fails with "Orchestration generally failed.").

A basic check of the evaluator through the orchestrator should be:

curl https://wikifunctions.k8s-staging.discovery.wmnet:30443/1/v1/evaluate/ -X POST --data '{"zobject":{ "Z1K1": "Z7", "Z7K1": { "Z1K1": "Z8", "Z8K1": [ "Z17", { "Z1K1": "Z17", "Z17K1": "Z6", "Z17K2": { "Z1K1": "Z6", "Z6K1": "Z400K1" }, "Z17K3": { "Z1K1": "Z12", "Z12K1": [ "Z11" ] } }, { "Z1K1": "Z17", "Z17K1": "Z6", "Z17K2": { "Z1K1": "Z6", "Z6K1": "Z400K2" }, "Z17K3": { "Z1K1": "Z12", "Z12K1": [ "Z11" ] } } ], "Z8K2": "Z1", "Z8K3": [ "Z20" ], "Z8K4": [ "Z14", { "Z1K1": "Z14", "Z14K1": "Z400", "Z14K3": { "Z1K1": "Z16", "Z16K1": { "Z1K1": "Z61", "Z61K1": "javascript" }, "Z16K2": "function Z400( Z400K1, Z400K2 ) { return (parseInt(Z400K1) + parseInt(Z400K2)).toString(); }" } } ], "Z8K5": "Z400" }, "Z400K1": "5", "Z400K2": "8" } ,"doValidate":false}' --header "Content-type: application/json"

The response should start with {"Z1K1":"Z22","Z22K1":"13","Z22K2":…

(Same failure.)
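
The two checks above could be scripted roughly as follows. This is a sketch under assumptions: the helper names are made up for the example, only the simple echo call is reproduced, and it covers building the request body and inspecting a response, not sending it.

```python
import json

# The simple echo call from the first curl command above.
ECHO_CALL = {"Z1K1": "Z7", "Z7K1": "Z801", "Z801K1": "foo"}


def build_body(zobject, do_validate=False):
    """Serialise the request body expected by /1/v1/evaluate/."""
    return json.dumps({"zobject": zobject, "doValidate": do_validate})


def result_value(response_text):
    """Return Z22K1 from a Z22 response envelope, or None for anything else."""
    try:
        data = json.loads(response_text)
    except ValueError:
        # Not JSON at all, e.g. a plain-text failure message.
        return None
    if not isinstance(data, dict) or data.get("Z1K1") != "Z22":
        return None
    return data.get("Z22K1")
```

With these, the first check passes when `result_value(...)` is `"foo"` and the second when it is `"13"`.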

That error is from our top-level, last-line-of-defense try/catch statement. When that happens, we do log the stack trace, so I can try to track that down in the logs.

Change 940147 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/deployment-charts@master] wikifunctions: Both charts are required to use readOnlyRootFilesystem

https://gerrit.wikimedia.org/r/940147

Change 937972 abandoned by Jforrester:

[operations/deployment-charts@master] [WIP] wikifunctions: Add network ability for orchestrator to talk to evaluator

Reason:

Overtaken by SRE's work.

https://gerrit.wikimedia.org/r/937972

Change 940147 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Both charts are required to use readOnlyRootFilesystem

https://gerrit.wikimedia.org/r/940147

Calls to the orchestrator now work! When we try to call the evaluator, we get

{"Z1K1":"Z22","Z22K1":"Z24","Z22K2":{"Z1K1":{"Z1K1":"Z7","Z7K1":"Z883","Z883K1":"Z6","Z883K2":"Z1"},"K1":[{"Z1K1":"Z7","Z7K1":"Z882","Z882K1":"Z6","Z882K2":"Z1"},{"Z1K1":{"Z1K1":"Z7","Z7K1":"Z882","Z882K1":"Z6","Z882K2":"Z1"},"K1":"errors","K2":{"Z1K1":"Z5","Z5K1":"Z507","Z5K2":{"Z1K1":{"Z1K1":"Z7","Z7K1":"Z885","Z885K1":"Z507"},"Z507K1":"request to https://function-evaluator-main-evaluator-tls-service.wikifunctions.svc.cluster.local:4970/1/v1/evaluate/ failed, reason: unable to get local issuer certificate"}}},{"Z1K1":{"Z1K1":"Z7","Z7K1":"Z882","Z882K1":"Z6","Z882K2":"Z1"},"K1":"orchestrationMemoryUsage","K2":"97.91 MiB"},{"Z1K1":{"Z1K1":"Z7","Z7K1":"Z882","Z882K1":"Z6","Z882K2":"Z1"},"K1":"orchestrationCpuUsage","K2":"110.106 ms"},{"Z1K1":{"Z1K1":"Z7","Z7K1":"Z882","Z882K1":"Z6","Z882K2":"Z1"},"K1":"orchestrationStartTime","K2":"2023-07-21T20:17:08.639Z"},{"Z1K1":{"Z1K1":"Z7","Z7K1":"Z882","Z882K1":"Z6","Z882K2":"Z1"},"K1":"orchestrationEndTime","K2":"2023-07-21T20:17:09.141Z"},{"Z1K1":{"Z1K1":"Z7","Z7K1":"Z882","Z882K1":"Z6","Z882K2":"Z1"},"K1":"orchestrationDuration","K2":"502 ms"},{"Z1K1":{"Z1K1":"Z7","Z7K1":"Z882","Z882K1":"Z6","Z882K2":"Z1"},"K1":"orchestrationHostname","K2":"function-orchestrator-main-orchestrator-6ff9c65c97-4lqsp"}]}}

The issue is this part: unable to get local issuer certificate.
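
Because the failure reason is buried several levels deep in the response, a small recursive search can pull it out. The following is only a sketch; the function name is made up, and it assumes the message is stored under the `Z507K1` key as in the output above.

```python
def find_key(node, key):
    """Yield every value stored under `key` anywhere in nested dicts and lists."""
    if isinstance(node, dict):
        if key in node:
            yield node[key]
        for value in node.values():
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)
```

Running `list(find_key(response, "Z507K1"))` on the parsed response would return the error strings without having to walk the metadata map by hand.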

@JMeybohm, it appears that there's a certificate issue with the evaluator service. Can you advise?

I would assume you don't have "our" CAs in your trust store. Did you install wmf-certificates in your image? I'd guess that should already resolve this. If not, please ensure you're trusting the certs in /etc/ssl/certs/wmf-ca-certificates.crt

We have them in the orchestrator, but not in the evaluator; I'd have thought that was enough, but just in case I'm adding them to the latter above.
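
For illustration, this is roughly how a Python client could be told to trust the bundle path mentioned above. It is a sketch, not what the services actually do: the helper name is made up, and it assumes the file is a PEM bundle that exists inside the image.

```python
import os
import ssl

# Path taken from the comment above; installed by the wmf-certificates package.
WMF_CA_BUNDLE = "/etc/ssl/certs/wmf-ca-certificates.crt"


def build_tls_context(ca_bundle=WMF_CA_BUNDLE):
    """Create a verifying TLS context that trusts the given CA bundle."""
    if not os.path.isfile(ca_bundle):
        # Fail loudly instead of silently falling back to another trust store.
        raise FileNotFoundError(f"CA bundle not found: {ca_bundle}")
    return ssl.create_default_context(cafile=ca_bundle)
```

Note that when `cafile` is given, `ssl.create_default_context` loads only that file rather than the system default certificates.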

Jdforrester-WMF changed the task status from Open to In Progress. Jul 24 2023, 8:51 PM
Jdforrester-WMF claimed this task.
Jdforrester-WMF reassigned this task from Jdforrester-WMF to cmassaro.

Change 941312 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/dns@master] wmnet: Add cnames for'wikifunctions ingress

https://gerrit.wikimedia.org/r/941312

Change 941313 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] service::catalog: Add wikifunctions service

https://gerrit.wikimedia.org/r/941313

Change 941314 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] service::catalog: Switch wikifunctions to state production

https://gerrit.wikimedia.org/r/941314

Change 938295 abandoned by Jforrester:

[operations/puppet@production] [WIP] service, k8s: Add service definitions for function-orchestrator and function-evaluator

Reason:

Being done by Janis's patches instead.

https://gerrit.wikimedia.org/r/938295

Change 941313 merged by Alexandros Kosiaris:

[operations/puppet@production] service::catalog: Add wikifunctions service

https://gerrit.wikimedia.org/r/941313

Change 941775 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/puppet@production] service::catalog: Switch state to production

https://gerrit.wikimedia.org/r/941775

Change 941312 merged by Alexandros Kosiaris:

[operations/dns@master] wmnet: Add cnames for wikifunctions ingress

https://gerrit.wikimedia.org/r/941312

I've gone ahead and created https://grafana.wikimedia.org/d/FEkiKFqVk/wikifunctions?orgId=1

The panels are currently empty, as they require some tweaking, but the dashboard sets the stage for how this should look.

I've gone ahead and populated the Saturation panels. Traffic, Errors and Latencies will need more work, but I will not be able to help with that anytime soon.

That's great, thank you!

Change 941856 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/puppet@production] services_proxy: Add wikifunctions service

https://gerrit.wikimedia.org/r/941856

Change 941856 merged by Alexandros Kosiaris:

[operations/puppet@production] services_proxy: Add wikifunctions service

https://gerrit.wikimedia.org/r/941856

Change 941865 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/deployment-charts@master] wikifunctions: Configure service_proxy port

https://gerrit.wikimedia.org/r/941865

Change 941865 merged by jenkins-bot:

[operations/deployment-charts@master] wikifunctions: Configure service_proxy port

https://gerrit.wikimedia.org/r/941865

Change 941902 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/mediawiki-config@master] ProductionServices: Define the wikifunctions orchestrator access point

https://gerrit.wikimedia.org/r/941902

Change 941902 merged by jenkins-bot:

[operations/mediawiki-config@master] ProductionServices: Define the wikifunctions orchestrator access point

https://gerrit.wikimedia.org/r/941902

Mentioned in SAL (#wikimedia-operations) [2023-07-26T12:28:30Z] <jforrester@deploy1002> Started scap: Backport for [[gerrit:941902|ProductionServices: Define the wikifunctions orchestrator access point (T297314)]]

Mentioned in SAL (#wikimedia-operations) [2023-07-26T12:30:01Z] <jforrester@deploy1002> jforrester: Backport for [[gerrit:941902|ProductionServices: Define the wikifunctions orchestrator access point (T297314)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)

Mentioned in SAL (#wikimedia-operations) [2023-07-26T12:36:10Z] <jforrester@deploy1002> Finished scap: Backport for [[gerrit:941902|ProductionServices: Define the wikifunctions orchestrator access point (T297314)]] (duration: 07m 39s)

Change 941775 abandoned by Alexandros Kosiaris:

[operations/puppet@production] service::catalog: Switch state to production

Reason:

Per dupe comment above

https://gerrit.wikimedia.org/r/941775

Change 941314 merged by Alexandros Kosiaris:

[operations/puppet@production] service::catalog: Switch wikifunctions to state production

https://gerrit.wikimedia.org/r/941314