Follow-up to a number of async conversations. This Phabricator task is meant to centralize input and collect requirements.
The Wikidata Platform team is leading an effort to migrate the WDQS backend away from Blazegraph.
As part of this migration, we want to move bespoke business logic out of Blazegraph into a dedicated service-layer proxy. This includes federation access control, query logging, and instrumentation. This task is independent of the new triplestore, which we will handle separately.
This service will act as a passthrough to triplestore databases deployed in eqiad and codfw.
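To make the federation access control piece concrete, here is a minimal sketch of an allow-list check in plain Java. It assumes the allow list is a text file with one endpoint URI per line and `#` comments; the actual file format, the class name `FederationAllowList`, and its methods are hypothetical and not taken from the design doc. Extracting `SERVICE` endpoints from a query (e.g. with Jena) is out of scope here.

```java
import java.net.URI;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: not part of Blazegraph or any existing WDQS code.
public final class FederationAllowList {
    private final Set<URI> allowed;

    private FederationAllowList(Set<URI> allowed) {
        this.allowed = allowed;
    }

    // Assumed format: one endpoint URI per line; blank lines and '#' comments are skipped.
    public static FederationAllowList parse(List<String> lines) {
        Set<URI> uris = lines.stream()
                .map(String::trim)
                .filter(line -> !line.isEmpty() && !line.startsWith("#"))
                .map(line -> URI.create(line).normalize())
                .collect(Collectors.toUnmodifiableSet());
        return new FederationAllowList(uris);
    }

    // Returns true only if the endpoint parses as a URI and is on the list.
    public boolean isAllowed(String endpoint) {
        try {
            return allowed.contains(URI.create(endpoint.trim()).normalize());
        } catch (IllegalArgumentException e) {
            // Malformed endpoints are rejected rather than passed through.
            return false;
        }
    }
}
```

The proxy would call `isAllowed` for each federated endpoint found in an incoming query and reject the request if any endpoint is not listed.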
We would like to deploy the service on Kubernetes and are looking for advice on which cluster would be the best fit for our use case and availability/reliability needs, and how to best onboard.
We have a design doc under internal review that we will circulate ASAP. These are the key points we would like to align on:
- Maintainers: @lerickson @gmodena @trueg.
- Launch: Development will start in April, with a target launch date in July 2026. At launch, we may not be feature-complete, but we will need to expose the service on the public internet to a selected set of stakeholders (authentication may apply).
- Deployments: We'll need internal- and external-facing deployments to front the internal and external database fleets. The external service will be put behind the REST Gateway and will respect Wikimedia traffic policies. We are already coordinating with MediaWiki Interfaces on this front. The internal-facing service won't be put behind the REST Gateway.
- Traffic: Post-migration, the external-facing deployment is expected to handle ~15–20k requests per minute (~250–333 RPS), with a worst-case concurrency of ~20k in-flight requests given a 60s timeout (333 RPS × 60s). We estimate that 4–10 proxy instances (pods) per datacenter will be sufficient under normal operating conditions, with additional headroom for failure scenarios. The internal-facing deployment is expected to serve ~15 RPS with predictable traffic patterns. This won't need to be overprovisioned for spikes, and we estimate that 2–3 pods per datacenter will be sufficient.
- Workload: minimal CPU is required for query inspection, request parsing, and bookkeeping (e.g. federation access), but the expected workload is predominantly I/O bound (network).
- State: the service will be stateless. However, we will need to handle a federation allow list (text file) currently stored in Puppet.
- Stack: Java (>= JDK 25), Quarkus web framework, Jena RDF library for parsing. We are aware that, depending on the cluster, additional work may be required to meet k8s onboarding requirements for a Java service. Based on our experience with Flink, we can take on the implementation work and coordinate with SRE. Java is a requirement both to simplify refactoring from Blazegraph (itself a Java codebase) and to leverage domain-specific features (RDF parsing). The team has experience operating Java services at scale.
- Security: we are aware that a security review might be required. We will follow up with the Security team, but need explicit approval that this stack is viable.
- Availability and reliability:
  - We would like to at least meet the availability and reliability requirements described in https://wikitech.wikimedia.org/wiki/SLO/WDQS, with the goal of meeting the Ideal Targets expressed in the doc (caveat: we are aware that the SLO might need to be renegotiated).
  - We would like a multi-DC deployment that matches the current WDQS fleet (eqiad + codfw).
  - We would like input from SRE on how our requirements would change if on-call support was required.
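
For reference, the worst-case concurrency in the Traffic item follows from Little's law (in-flight requests = arrival rate × time in system). The sketch below reproduces that arithmetic and the resulting per-pod load for the 4–10 pod range. It assumes every request is held for the full 60s timeout and that load is spread evenly across pods; the class and method names are illustrative only.

```java
// Back-of-the-envelope capacity estimate; names are hypothetical.
public final class CapacityEstimate {

    // Little's law with every request held for the full timeout (worst case).
    public static long worstCaseInFlight(long requestsPerMinute, long timeoutSeconds) {
        return requestsPerMinute * timeoutSeconds / 60;
    }

    // In-flight requests per pod, rounded up, assuming an even spread.
    public static long perPod(long inFlight, int pods) {
        return (inFlight + pods - 1) / pods;
    }

    public static void main(String[] args) {
        long inFlight = worstCaseInFlight(20_000, 60);
        System.out.println("worst-case in-flight: " + inFlight);             // 20000
        System.out.println("per pod with 4 pods:  " + perPod(inFlight, 4));  // 5000
        System.out.println("per pod with 10 pods: " + perPod(inFlight, 10)); // 2000
    }
}
```

Under these assumptions each pod would need to hold roughly 2k–5k concurrent connections in the worst case, which is the figure to weigh against per-pod connection and memory limits on the chosen cluster.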
Related