General problems:
- The current (March 2025) system design is quite complex, and has evolved based on SRE and Security input.
- The original design prioritised separability and security, and relied on horizontal scaling to address performance concerns.
- The user experience can be fast, but is often of a system that is oddly slow for relatively simple operations, or worse, one that errors with a timeout on the most taxing of supported requests, mysteriously working sometimes and not others.
- The production load is slight, but the system still seems to run into memory-leak and CPU-load issues in some cases. (T385859)
- The execution of functions suffers from a lot of hurry-up-and-wait: actual CPU execution time is well under 100ms, but much of the wall-clock time is spent waiting for data or for other services. (T383806)
Acute issues:
- Function calls can require up-to-date versions of tens of Wikifunctions and Wikidata objects; fetching them is not very fast (~120ms; T388683), and fetching them in parallel or in series rather than as a batch can be very slow or overload the APIs.
- Can we load fewer of these objects somehow? (For now, we've started loading certain pre-defined Objects from disk for certain purposes, but that only speeds up some parts of some operations.)
- Can we pre-compute and store parts of requests inside the orchestrator? (T287601)
- Can we pre-process requests more and so batch up the requests for needed objects into fewer, earlier calls?
- Can we make this cheaper or off-load the concern to a change-prop-aware caching service, maybe in a sidecar so it can be shared between orchestrator instances?
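The batching and caching ideas above can be sketched together. This is a minimal illustration, not the orchestrator's actual code: `fetchObjects`, the ZID names, and the idea of passing in a `fetchBatch` function are all assumptions made for the example; the real batch API shape and TTL policy would differ.

```javascript
// Sketch: serve repeated object lookups from a small in-process TTL
// cache, and collapse all cache misses into ONE batched fetch rather
// than N serial/parallel API calls. Illustrative only.
const cache = new Map(); // zid -> { value, expiresAt }
const TTL_MS = 30_000;   // assumed TTL; change-prop invalidation would be better

async function fetchObjects(zids, fetchBatch) {
  const now = Date.now();
  const results = {};
  const misses = [];
  for (const zid of zids) {
    const entry = cache.get(zid);
    if (entry && entry.expiresAt > now) {
      results[zid] = entry.value; // cache hit, no network
    } else {
      misses.push(zid);
    }
  }
  if (misses.length > 0) {
    // One round trip for all misses instead of one call per object.
    const fetched = await fetchBatch(misses);
    for (const [zid, value] of Object.entries(fetched)) {
      cache.set(zid, { value, expiresAt: now + TTL_MS });
      results[zid] = value;
    }
  }
  return results;
}
```

Moving this cache into a change-prop-aware sidecar (as the last bullet suggests) would keep the same lookup shape but let the `cache` map be shared across orchestrator instances and evicted on edit rather than on a timer.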
- Transmission between the orchestrator and evaluator services seems to take a very long time (~2s extra) when the payload is large (~ >200KiB) for no obvious reason. (T389375)
- Can we work out whether this effect is in our code or in the k8s network layer, and reduce or eliminate it?
Longer-term concerns:
- How much can the current system design scale out, were we to get a hockey-stick load increase e.g. as embeddable Wikifunctions calls roll out to more wikis, or we add new features like rich text output?
- Loading of Objects from Wikifunctions calls the full-fat Action API; can we move things over to a cheaper, cacheable API that still handles cache eviction somehow? (T362271)
- Each API call ties up a PHP worker on Wikifunctions.org; can we make this asynchronous somehow?
- There's a Wikifunctions-specific memcached service we're using in the MW layer to cache call results (and share them cross-wiki, speeding up the embedded mode). Is there value in letting it be accessed directly from the orchestrator?
- Can we simplify the current architectural model so that the orchestrator ("linker") code runs in the MW appservers rather than a network call away?