Helm chart for wdqs-qlever and wdqs-streaming-consumer
Open, In Progress, High, Public, 5 Estimated Story Points

Description

The next iteration of WDQS (migration to the new backend) is planned around an idea from @BTullis:

  • We add a kubelet to our existing wdqs nodes and add them to the dse-k8s cluster
  • We add a host taint so that regular workloads are not scheduled on these wdqs nodes.
  • We deploy qlever to these hosts using kubernetes, with a specific toleration so that it selects only the wdqs db nodes.
  • In this way, our wdqs db nodes become dedicated kubernetes worker nodes, instead of being dedicated bare-metal nodes.
  • In addition to this, we make the local storage on the wdqs nodes available to use as Kubernetes Volumes.
    • This means that we would not be compromising on I/O performance by moving WDQS to Kubernetes, as we would if we were to use Ceph volumes.
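The taint and toleration mechanism described above can be illustrated with a small model of the scheduler's matching rule. This is only a sketch of the standard Kubernetes semantics, not the actual chart values: the taint key `dedicated` and value `wdqs` are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str  # "NoSchedule", "PreferNoSchedule" or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""  # an empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches the given taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        # An empty key with "Exists" matches every taint key.
        return toleration.key == "" or toleration.key == taint.key
    return toleration.key == taint.key and toleration.value == taint.value


def schedulable(node_taints, tolerations) -> bool:
    """A pod may land on a node only if every hard taint is tolerated."""
    hard = [t for t in node_taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)


# Hypothetical taint on a wdqs node and the matching toleration for qlever.
wdqs_taint = Taint("dedicated", "wdqs", "NoSchedule")
qlever_toleration = Toleration("dedicated", "Equal", "wdqs", "NoSchedule")
```

With these values, `schedulable([wdqs_taint], [])` is false for a regular workload and `schedulable([wdqs_taint], [qlever_toleration])` is true for qlever. Note that a toleration only permits scheduling onto the tainted nodes; keeping qlever off all other nodes would typically also need a node selector or node affinity.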

An important part of this is the deployment of the backend and the consumer via Kubernetes (targeting DSE).

AC:

  • Helm chart for wdqs-qlever and wdqs-streaming-consumer as a sidecar container
  • Helmfile deployment for the chart which binds replicas to the wdqs nodes via taint and toleration values

Event Timeline

trueg changed the task status from Open to In Progress. Apr 30 2026, 3:41 PM
trueg triaged this task as High priority.
trueg added a parent task: Restricted Task.

A discussion with @BTullis resulted in the following:

  • A sidecar container is not applicable. Sidecar containers are by design init containers that simply run for the entire lifetime of the pod. Our "sidecar", the wdqs-streaming-consumer, however, depends on the backend to run, so ideally it should start after the main container. Since this is not possible, the wdqs-streaming-consumer will need to be able to "wait" on the backend. This could be achieved by simply allowing "endless" retries when applying an update for as long as the backend gives no HTTP response at all.
  • We will have two releases in the helmfile: wdqs-main and wdqs-scholarly, each with several replicas.
  • We need unique but stable names for the Kafka consumer groups. Ideally each node has its own group, matching the data in its local storage. One possibility would be a configmap that mounts a local file, which in turn contains the value. There might be a simpler solution though.
  • We want to avoid having to copy the index files on each restart of the pod. Thus, the idea is to use locally mounted storage rather than an emptyDir.
  • An init container will copy the index from an S3 bucket to the local storage, but only if the index in S3 has changed. This could be detected by recording the date of the copied index in a file.
  • Ben will create the necessary phab to add wdqs1028 into the DSE Kubernetes cluster, renaming it to something like dse-wdqs-1001 in the process.
  • Ben suggested adding the wdqs-proxy to the same chart for simplicity and ease of maintenance.
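The "wait on the backend" behaviour from the first point could look roughly like the following. This is a minimal sketch and not the actual wdqs-streaming-consumer code: the function names, the delays and the idea of probing a plain URL are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def backend_is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True as soon as the backend gives any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The backend answered, even if with an error status, so it is running.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: no HTTP response at all.
        return False


def wait_for_backend(probe, initial_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Retry without an upper bound until probe() succeeds.

    Returns the number of failed attempts. The delay doubles after each
    failure and is capped at max_delay.
    """
    delay = initial_delay
    failed_attempts = 0
    while not probe():
        failed_attempts += 1
        sleep(delay)
        delay = min(delay * 2, max_delay)
    return failed_attempts
```

A consumer could call `wait_for_backend(lambda: backend_is_up(qlever_url))` before applying each batch, where `qlever_url` is whatever endpoint the backend exposes. Passing `sleep` as a parameter keeps the loop testable without real waiting.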
lerickson set the point value for this task to 5. May 6 2026, 2:11 PM

Update on the development of the WDQS deployment:

Overview

  • The main deployment will live in DSE Kubernetes with dedicated wdqs nodes that are bound to the WDQS pods via taints and tolerations.
  • Each WDQS backend pod consists of several containers:
    • The main Qlever container (using the wdqs-qlever image) which is only concerned with actually running Qlever on an existing index.
    • The wdqs-streaming-consumer which applies RDF patches to the Qlever instance based on the Kafka changes stream
    • An init container (based on the wdqs-backend-init image) which prepares the Qlever index files and resets the Kafka consumer group (details below).
    • A sidecar container which monitors the state of Qlever and the consumer. This container will expose a readiness probe for Kubernetes, along with basic metrics. This allows the pod to be re-pooled automatically once the consumer has reduced the update lag below a certain threshold.
  • An additional pod hosts the wdqs-proxy
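The readiness decision made by the monitoring container could be sketched as below. The `PodState` fields, the function names and the 600 second default threshold are hypothetical; the only Kubernetes fact relied on is that an HTTP probe counts status codes from 200 to 399 as success.

```python
from dataclasses import dataclass


@dataclass
class PodState:
    qlever_responding: bool    # Qlever answers queries
    consumer_running: bool     # wdqs-streaming-consumer is alive
    update_lag_seconds: float  # how far the index is behind the change stream


def is_ready(state: PodState, max_lag_seconds: float = 600.0) -> bool:
    """The pod is ready only if both processes are healthy and the lag is low."""
    return (
        state.qlever_responding
        and state.consumer_running
        and state.update_lag_seconds < max_lag_seconds
    )


def probe_status(state: PodState, max_lag_seconds: float = 600.0) -> int:
    """HTTP status code to return from the readiness endpoint."""
    return 200 if is_ready(state, max_lag_seconds) else 503
```

Because readiness is re-evaluated on every probe, a pod that restarts with a large lag stays depooled and returns to service on its own once the consumer has caught up.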

Index Initialization
The index init script (running in the wdqs-backend-init container) will perform the following steps:

  • Copy the new index files from an S3 bucket. These index files are prepared by an Airflow pipeline (T422179).
  • Create a backup of the current index if necessary (handled via symlinks to avoid unnecessary file operations)
  • Reset the Kafka consumer offset to the timestamp of the index.
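The symlink-based backup step could be sketched as follows. The directory layout (one directory per index version, plus `current` and `previous` symlinks) and all names are assumptions for illustration; the S3 download and the Kafka offset reset are deliberately left out.

```python
import os
from pathlib import Path


def _repoint(link: Path, target: str) -> None:
    """Atomically make `link` a symlink to `target` (relative to its directory)."""
    tmp = link.with_name(link.name + ".tmp")
    if tmp.is_symlink():
        tmp.unlink()  # clean up after an interrupted earlier run
    os.symlink(target, tmp)
    os.replace(tmp, link)  # rename over the old symlink in one step


def activate_index(base: Path, new_version: str) -> bool:
    """Switch `current` to new_version and keep the old one as `previous`.

    Returns False if new_version is already active, so nothing is touched.
    No index files are copied or moved, only symlinks change.
    """
    current = base / "current"
    if current.is_symlink():
        old_version = os.readlink(current)
        if old_version == new_version:
            return False
        _repoint(base / "previous", old_version)
    _repoint(current, new_version)
    return True
```

With this layout Qlever would always be started on `base / "current"`, and rolling back to the backup means pointing `current` at the target of `previous`.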

The Qlever index files will always be accompanied by a wikidata_dump_meta.json file that contains at least the timestamp of the index, that is, the time at which the original Wikidata dump was started. The aforementioned Airflow pipeline takes care of creating this file. The timestamp makes it possible to compare indexes and gives easy access to the value needed for the Kafka consumer offset reset.
The offset reset will be done on every restart of the pod for now, because Qlever does not yet automatically commit online changes to the index to disk, so all changes applied by the wdqs-streaming-consumer are lost after a restart.
In a later iteration of this deployment we envision another sidecar/cronjob container which regularly re-indexes Qlever.
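How the metadata file could drive both the index comparison and the offset reset is sketched below. The JSON field name `timestamp` and the ISO 8601 format are assumptions, since the task only states that the file contains the timestamp; the conversion to epoch milliseconds reflects that Kafka record timestamps are expressed in milliseconds.

```python
import json
from datetime import datetime, timezone


def index_timestamp(meta_text: str, field: str = "timestamp") -> datetime:
    """Parse the dump start time from the contents of wikidata_dump_meta.json.

    The field name and the ISO 8601 format are hypothetical.
    """
    raw = json.loads(meta_text)[field]
    ts = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # treat naive values as UTC
    return ts


def is_newer(candidate_meta: str, installed_meta: str) -> bool:
    """True if the candidate index is based on a later dump than the installed one."""
    return index_timestamp(candidate_meta) > index_timestamp(installed_meta)


def kafka_reset_millis(meta_text: str) -> int:
    """Epoch milliseconds to which the consumer group would be reset."""
    return int(index_timestamp(meta_text).timestamp() * 1000)
```

The init container would call `is_newer` to decide whether to fetch the index from S3, and pass the result of `kafka_reset_millis` to whichever Kafka tooling performs the timestamp-based offset lookup.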

Change #1294315 had a related patch set uploaded (by Trueg; author: Trueg):

[operations/deployment-charts@master] Add wdqs namespace for the new deployment

https://gerrit.wikimedia.org/r/1294315

Change #1295068 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] dse-k8s: Create kubeconfigs for WDQS

https://gerrit.wikimedia.org/r/1295068

Change #1295068 merged by Bking:

[operations/puppet@production] dse-k8s: Create kubeconfigs for WDQS

https://gerrit.wikimedia.org/r/1295068

Change #1294315 merged by Bking:

[operations/deployment-charts@master] dse-k8s-eqiad: Add wdqs namespaces for the new deployment

https://gerrit.wikimedia.org/r/1294315

Change #1295465 had a related patch set uploaded (by Trueg; author: Trueg):

[operations/deployment-charts@master] dse-k8s-codfw: Add wdqs namespaces for the new deployment

https://gerrit.wikimedia.org/r/1295465

Change #1295465 merged by jenkins-bot:

[operations/deployment-charts@master] dse-k8s-codfw: Add wdqs namespaces for the new deployment

https://gerrit.wikimedia.org/r/1295465

Mentioned in SAL (#wikimedia-operations) [2026-06-02T04:49:51Z] <ryankemper> T425007 (k8s) created 4 wdqs namespaces on dse-k8s-codfw's admin_ng ns: wdqs-[internal,external] & wdqs-[internal,external]-next; certs issued

Change #1298817 had a related patch set uploaded (by Trueg; author: Trueg):

[operations/deployment-charts@master] dse-k8s: Allow the usage of ceph-rdb-ssd for wdqs namespaces

https://gerrit.wikimedia.org/r/1298817

Change #1298817 merged by jenkins-bot:

[operations/deployment-charts@master] dse-k8s: Allow the usage of ceph-rdb-ssd for wdqs namespaces

https://gerrit.wikimedia.org/r/1298817