
New k8s service request: wdqs-proxy
Open, Needs Triage, Public

Description

Follow-up to a number of async conversations. This Phabricator task is meant to centralize input and collect requirements.

The Wikidata Platform team is leading an effort to migrate the WDQS backend away from Blazegraph.
As part of this migration, we want to move bespoke business logic out of Blazegraph into a dedicated service-layer proxy. This includes federation access control, query logging, and instrumentation. This task is independent of the new triplestore, which we will handle separately.

This service will act as a passthrough to triplestore databases deployed in eqiad and codfw.
We would like to deploy the service on Kubernetes and are looking for advice on which cluster would be the best fit for our use case and availability/reliability needs, and how to best onboard.

We have a design doc under internal review that we will circulate ASAP. These are the key points we would like to align on:

  • Launch: Development will start in April, with a target launch date in July 2026. At launch, we may not be feature-complete, but we will need to expose the service on the public internet to a selected set of stakeholders (authentication may apply).
  • Deployments: We'll need internal-facing and external-facing deployments, to front the internal and external database fleets. The external service will be put behind the REST Gateway and will respect Wikimedia traffic policies. We are already coordinating with MediaWiki Interfaces on this front. The internal-facing service won't be put behind the REST Gateway.
  • Traffic: Post-migration, the external-facing deployment is expected to handle ~15–20k requests per minute (~250–333 RPS), with a worst-case concurrency of ~20k in-flight requests given a 60s timeout. We estimate that 4–10 proxy instances (pods) per datacenter will be sufficient under normal operating conditions, with additional headroom for failure scenarios. The internal-facing deployment is expected to serve ~15rps/minute with predictable traffic patterns. This won't need to be overprovisioned for spikes, and we estimate 2–3 pods per datacenter will be sufficient.
  • Workload: minimal CPU is required for query inspection, request parsing, and bookkeeping (e.g. federation access), but the expected workload is predominantly I/O bound (network).
  • State: the service will be stateless. However, we will need to handle a federation allow list (text file) currently stored in Puppet.
  • Stack: Java (>= JDK 25), Quarkus web framework, Jena RDF library for parsing. We are aware that, depending on the cluster, additional work may be required to meet k8s onboarding requirements for a Java service. Based on our experience with Flink, we can take on the implementation work and coordinate with SRE. Java is a requirement both to simplify refactoring from Blazegraph (itself a Java codebase) and to leverage domain-specific features (RDF parsing). The team has experience operating Java services at scale.
  • Security: we are aware that a security review might be required. We will follow up with the Security team, but need explicit approval that this stack is viable.
  • Availability and reliability:
    • We would like to at least meet the availability and reliability requirements described in https://wikitech.wikimedia.org/wiki/SLO/WDQS, with the goal of meeting the Ideal Targets expressed in the doc (caveat: we are aware that the SLO might need to be re-negotiated).
    • We would like a multi-DC deployment that matches the current WDQS fleet (eqiad + codfw).
    • We would like input from SRE on how our requirements would change if on-call support were required.
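
For reference, the traffic figures above can be reproduced with a small back-of-the-envelope calculation. This is only a sketch: the function names are hypothetical, and it assumes the worst case where every request stays open for the full 60s timeout (Little's law: in-flight requests = arrival rate × time in system). The even split across pods is also an assumption, not a measured load-balancing behaviour.

```python
import math


def sustained_rps(requests_per_minute: float) -> float:
    """Convert a per-minute request rate into requests per second."""
    return requests_per_minute / 60.0


def worst_case_in_flight(requests_per_minute: float, timeout_s: float) -> float:
    """Upper bound on concurrent requests if every request runs until the timeout.

    Little's law: L = lambda * W, with lambda in requests/second and
    W bounded by the timeout in seconds.
    """
    return requests_per_minute * timeout_s / 60.0


def in_flight_per_pod(in_flight: float, pods: int) -> int:
    """Concurrent requests each pod would hold, assuming an even split."""
    return math.ceil(in_flight / pods)


if __name__ == "__main__":
    # Figures taken from the Traffic bullet: 15-20k requests/minute, 60s timeout,
    # 4-10 pods per datacenter.
    for rpm in (15_000, 20_000):
        print(f"{rpm} req/min = {sustained_rps(rpm):.0f} RPS")
    peak = worst_case_in_flight(20_000, 60)
    print(f"worst-case in-flight requests: {peak:.0f}")
    for pods in (4, 10):
        print(f"{pods} pods: up to {in_flight_per_pod(peak, pods)} in-flight per pod")
```

Under these assumptions, each pod would need to hold a few thousand open connections in the worst case, which is consistent with the workload being I/O bound rather than CPU bound.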

Event Timeline

This seems like a good candidate for the Wikikube k8s cluster, so tagging ServiceOps new for their input.

Historically, wikidata-query-gui has been running on wikikube (since there was no other cluster back then), but it does not seem to fit the goal of the wikikube cluster very well (I might be mistaken):

Applications that are deployed in these clusters SHOULD fall into at least one of the following categories:

  • MediaWiki itself (appservers, API servers, job runners, etc.)
  • Services that MediaWiki relies on internally (EventBus, session store, etc.)
  • Services that MediaWiki relies on publicly/client-side (Citoid, Maps, etc.)
  • Services that provide an API that fundamentally depends on MediaWiki (mobileapps, wikifeeds, etc.)

The goal of the DSE cluster might need some updating, since I don't think its current definition still fits. But the last part ("general Data Science and Engineering (DSE) workloads") seems to make sense for this.

One of the primary goals of this cluster is to train Machine Learning applications using Kubeflow. This cluster was also known by the project code name of Lift Wing, but the scope has since been broadened to encompass more general Data Science and Engineering (DSE) workloads, hence the renaming of the cluster to dse-k8s.

Minor nit, the API Gateway is going away, everything is being centralized behind REST Gateway, which may or may not change name once the API Gateway is completely gone. On that subject, the design of the REST Gateway specifically excludes internal calls (for instance from MediaWiki), can you confirm that's the case?

Minor nit, the API Gateway is going away, everything is being centralized behind REST Gateway, which may or may not change name once the API Gateway is completely gone.

Thanks for the heads up. I am still stuck with old naming conventions. I'll update to clarify.
@Clement_Goubert do I understand correctly, though, that the REST Gateway will still be a frontend for requests other than the REST API (e.g. Action API, SPARQL)?

On that subject, the design of the REST Gateway specifically excludes internal calls (for instance from MediaWiki), can you confirm that's the case?

This tracks.
I'll update the task to reflect this, but internal calls (wdqs-internal-main, wdqs-internal-scholarly) won't need rate limits. That traffic is predictable and access is gated.

MLechvien-WMF subscribed.

Removing the ServiceOps tag, as it seems this belongs more on the DSE cluster. Please add us back if we can help.

gmodena renamed this task from [WIP] New k8s service request: wdqs service layer proxy to New k8s service request: wdqs-proxy. Mar 31 2026, 12:38 PM

I agree. I think that we can target the dse-k8s cluster for this.