
[25Q2] Service platform improvements
Closed, Resolved (Public)

Description

Background
This project is about evolving the platform our production services run on. Our services currently run on the outdated service-runner / service-template-node platform, which is a maintenance, development, and performance concern for us. We want to replace it, initially with a simpler, more performant platform still based on Node, and ultimately with one written in a more suitable language such as Rust or Go.

Approach

  • We identify all the features of service-runner that we are using.
  • We replace service-runner with service-utils or similar in our production services (still running Node).
  • We create a wish-list of features we would like in the new service runner.
  • We decide whether Rust or Go is the appropriate programming language to move to, in liaison with colleagues across the Foundation, per Selena. The work product is documentation of the rationale for the choice.
  • We prototype the initial build-out of the replacement service platform in the evaluator (still in Node).
  • We begin porting the evaluator code, starting with the WASM executor interaction. A deliverable here would be a PoC demonstrating how we will run the executors in the new programming language.
  • We migrate the subset of tests that are not related to service-level concerns and make them pass in the port's repository.

Acceptance Criteria/Success Metrics

  • We are confident we understand the necessary parts of the service platform.
  • Our production routing, logs, and metrics continue to work as expected.
  • Our overhead load time in Node services is reduced by 5% (tentative figure).
  • We have decided whether to adopt Rust or Go as a new language for the backend services.
  • We have discussed with SRE all needed changes to container-level protections and sandboxing for the evaluator service and have redesigned our WASI management accordingly.
  • We have prototyped the WASI/executor interaction and are testing some Python/JS function calls in CI.
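
The load-time criterion above could be checked with a small comparison of timing samples. The following is only a minimal sketch under stated assumptions: the function names (`median`, `relativeReduction`, `meetsTarget`) are hypothetical and are not part of service-runner or service-utils, the samples are assumed to be load-time measurements in milliseconds collected elsewhere, and the default target of 0.05 simply mirrors the tentative 5% figure.

```typescript
// Hypothetical helpers for comparing load-time samples (milliseconds).
// None of these names come from service-runner or service-utils.

/** Median of a non-empty list of timings. */
function median(samples: number[]): number {
  if (samples.length === 0) {
    throw new Error("need at least one sample");
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

/** Reduction of the candidate's median relative to the baseline's, as a fraction. */
function relativeReduction(baselineMs: number[], candidateMs: number[]): number {
  const base = median(baselineMs);
  if (base <= 0) {
    throw new Error("baseline median must be positive");
  }
  return (base - median(candidateMs)) / base;
}

/** True when the candidate is at least `target` faster (0.05 means 5%). */
function meetsTarget(
  baselineMs: number[],
  candidateMs: number[],
  target: number = 0.05
): boolean {
  return relativeReduction(baselineMs, candidateMs) >= target;
}
```

Using the median rather than the mean is a design choice made for this sketch, so that a single slow start does not dominate the comparison; the ticket itself does not specify how the overhead is to be measured.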

Stretch Goal

  • At least a stub service is exposed in production.
