
Consider ways to improve the user-perceived performance and server cost of the WikiLambda API
Open, Needs Triage, Public

Description

General problems:

  • The current (March 2025) system design is quite complex, and has evolved based on SRE and Security input.
  • The original priorities were separability and security; the design relied on horizontal scaling to address performance concerns.
  • The experience for users can be fast, but the system is often oddly slow for relatively simple operations, or worse, fails with a timeout; the most taxing supported requests mysteriously work sometimes and not others.
  • The production load is slight, but still seems to run into memory leak/CPU load concerns in some cases. (T385859)
  • The execution of functions suffers from a lot of hurry-up-and-wait: actual CPU execution time is well under 100ms, but most of the elapsed time is spent just waiting for data or other services; see the sketch after this list. (T383806)
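
To make the hurry-up-and-wait concrete, here is a minimal sketch (TypeScript, Node 18+) of the difference between awaiting each dependency in turn and overlapping the waits; fetchObject() and its endpoint are hypothetical stand-ins for the real object-loading path, not the actual API:

```
// Sequential awaits add latencies up; overlapping them caps wall time at
// the slowest single fetch. fetchObject() and its URL are illustrative.

async function fetchObject(zid: string): Promise<unknown> {
  const res = await fetch(`https://example.invalid/object/${zid}`); // hypothetical endpoint
  return res.json();
}

// Sequential: total wall time ≈ sum of per-object latencies (N × ~120ms).
async function loadSequentially(zids: string[]): Promise<unknown[]> {
  const results: unknown[] = [];
  for (const zid of zids) {
    results.push(await fetchObject(zid)); // each await blocks the next request
  }
  return results;
}

// Concurrent: total wall time ≈ the single slowest latency, at the cost of
// more simultaneous load on the backing API.
function loadConcurrently(zids: string[]): Promise<unknown[]> {
  return Promise.all(zids.map(fetchObject));
}
```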

Acute issues:

  • Function calls can require up-to-date versions of tens of Wikifunctions and Wikidata objects; fetching them is not very fast (~120ms; T388683), and fetching them in parallel or in series, rather than as a batch, can be very slow or overload the APIs.
    • Can we load fewer of these objects somehow? (For now, we've started loading certain pre-defined Objects from disk for certain purposes, but that only speeds up some parts of some operations.)
    • Can we pre-compute and store parts of requests inside the orchestrator? (T287601)
    • Can we pre-process requests more, and so batch up the requests for needed objects into fewer, earlier calls? (See the batching sketch after this list.)
    • Can we make this cheaper or off-load the concern to a change-prop-aware caching service, maybe in a sidecar so it can be shared between orchestrator instances?
  • Transmission between the orchestrator and evaluator services seems to take a very long time (~2s extra) when the payload is large (over ~200KiB), for no obvious reason. (T389375)
    • Can we understand whether this effect is in our code or in the k8s network layer, and reduce or eliminate it? (A timing sketch follows after this list.)
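
On the pre-processing/batching question above, a sketch of walking a call to collect its referenced ZIDs up front and fetching them all in one round trip. The wikilambda_fetch action and pipe-separated zids parameter are assumptions about the MediaWiki API shape, and collectZids() is a simplification that treats every Z-number-shaped string value as a reference:

```
// Walk a ZObject and collect every string value shaped like a ZID.
// (Simplified: real resolution distinguishes keys, literals, and references.)
function collectZids(node: unknown, acc = new Set<string>()): Set<string> {
  if (typeof node === 'string' && /^Z[1-9]\d*$/.test(node)) {
    acc.add(node);
  } else if (node && typeof node === 'object') {
    for (const value of Object.values(node)) {
      collectZids(value, acc);
    }
  }
  return acc;
}

// Fetch the whole dependency set in one request instead of N.
async function fetchBatch(zids: Set<string>): Promise<unknown> {
  const url = new URL('https://www.wikifunctions.org/w/api.php');
  url.searchParams.set('format', 'json');
  url.searchParams.set('action', 'wikilambda_fetch'); // assumed module name
  url.searchParams.set('zids', [...zids].join('|'));
  const res = await fetch(url);
  return res.json();
}
```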
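
And on the large-payload slowdown, a small harness to check whether the ~2s penalty tracks bytes on the wire, by timing the same POST raw and gzip-compressed. The evaluator URL is a placeholder, and the evaluator would need to accept Content-Encoding: gzip for the second call to be meaningful:

```
import { gzipSync } from 'node:zlib';

async function timedPost(
  url: string,
  body: Uint8Array,
  headers: Record<string, string>,
): Promise<number> {
  const start = performance.now();
  await fetch(url, { method: 'POST', body, headers });
  return performance.now() - start;
}

async function comparePayloads(evaluatorUrl: string, payload: object): Promise<void> {
  const raw = new TextEncoder().encode(JSON.stringify(payload));
  const gz = gzipSync(raw);
  const plainMs = await timedPost(evaluatorUrl, raw, { 'content-type': 'application/json' });
  const gzMs = await timedPost(evaluatorUrl, gz, {
    'content-type': 'application/json',
    'content-encoding': 'gzip',
  });
  console.log(`plain ${raw.byteLength}B: ${plainMs.toFixed(0)}ms; gzip ${gz.byteLength}B: ${gzMs.toFixed(0)}ms`);
}
```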

Longer-term concerns:

  • How much can the current system design scale out, were we to get a hockey-stick load increase, e.g. as embeddable Wikifunctions calls roll out to more wikis, or as we add new features like rich text output?
  • Loading of Objects from Wikifunctions calls the full-fat Action API; can we move things over to a cheaper, cacheable API but have it do cache eviction somehow? (T362271)
  • Each API call takes up a PHP worker on Wikifunctions.org; can we make this asynchronous somehow, so that workers aren't held open for the duration of a call?
  • There's a Wikifunctions-specific memcached service we're using in the MW layer to cache call results (and share them cross-wiki, speeding up the embedded mode). Is there value in letting this be accessed from the orchestrator, so that it can read and write cached results directly? (Sketched below.)
  • Can we simplify the current architectural model so that the orchestrator ("linker") code runs on the MW appservers rather than a network call away?
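
For the shared-cache idea above, a read-through sketch of the orchestrator consulting the results cache before executing; the memjs client, server address, key scheme, and TTL are all illustrative rather than the real WikiLambda cache contract:

```
import { Client } from 'memjs'; // assumed client library

const cache = Client.create('wikifunctions-memcached:11211'); // placeholder address

async function cachedEvaluate(
  callKey: string,                  // canonical hash of the function call
  evaluate: () => Promise<string>,  // the existing execution path
): Promise<string> {
  const hit = await cache.get(callKey);
  if (hit.value !== null) {
    return hit.value.toString();    // served without a PHP worker round trip
  }
  const result = await evaluate();
  await cache.set(callKey, result, { expires: 3600 }); // illustrative TTL
  return result;
}
```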