
Use unified service urls for DPE services
Open, Medium, Public

Description

We have been using hardcoded URLs for most of the DPE services (Druid hosts, for example), which causes unexpected downtime during host maintenance or reimaging. We aim to introduce sturdier mechanisms, in the form of high availability and managed failover, to ensure better uptime in case of such events.

For example, we have several hardcoded Druid hosts that need to be changed manually every time maintenance work is done. This sometimes leads to human error, with hosts being left out of the update and causing downtime.

The druid_public cluster is already highly available (via LVS) under druid-public-broker.svc.eqiad.wmnet. The first step is to change all hardcoded instances to use the svc URL for the public cluster while we work on making the same available for the analytics_druid cluster. This was previously discussed in T288750, and it also gives us the chance to investigate further options for T360769.
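As a rough illustration of that first step, the sketch below builds the broker URL from the single service address rather than from a list of individual hosts. The port (8082, Druid's default broker port), the `http` scheme, and the helper names `broker_url` and `sql_endpoint` are assumptions made for this example; they are not taken from the actual DPE configuration.

```python
from urllib.parse import urlunsplit

# Service address named in the task description.
DRUID_PUBLIC_SVC_HOST = "druid-public-broker.svc.eqiad.wmnet"

# Assumption: 8082 is Druid's default broker port; the real deployment may differ.
ASSUMED_BROKER_PORT = 8082


def broker_url(host: str = DRUID_PUBLIC_SVC_HOST,
               port: int = ASSUMED_BROKER_PORT,
               scheme: str = "http") -> str:
    """Build the broker base URL from one service address (hypothetical helper)."""
    return urlunsplit((scheme, f"{host}:{port}", "", "", ""))


def sql_endpoint(base_url: str) -> str:
    """Append Druid's SQL API path to a broker base URL (hypothetical helper)."""
    return base_url.rstrip("/") + "/druid/v2/sql"


if __name__ == "__main__":
    # Clients reference one name; host maintenance no longer requires code changes.
    print(sql_endpoint(broker_url()))
```

With this shape, individual hosts can be depooled or reimaged behind the service address without any client-side edits, which is the failure mode described above.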

These are the services we need to update in our code:

  • Set the druid_public cluster to use the svc URL T404068
  • Set the analytics_meta MariaDB server to use high-availability and managed failover mechanisms T360769
  • Set the druid_analytics cluster to use a single URL

Event Timeline

Could we make the description a bit more specific, please?

The two cases that I can think of are:

  • The analytics_meta MariaDB server
  • The druid_analytics cluster

Both of these have specific challenges, but the Druid one is probably easier to handle, even though it won't be highly available.
We don't have a load-balancing option available for Druid because there is no LVS in the Analytics VLANs: T288750: LVS in Analytics VLANs

The MariaDB service address is the subject of: T360769: Investigate high-availability and managed failover mechanisms for the analytics_meta MariaDB instances

Gehel triaged this task as Medium priority. Sep 9 2025, 2:04 PM
Gehel moved this task from Incoming to Scratch on the Data-Platform-SRE board.

Thanks for opening this ticket and giving some attention to this issue. I just wanted to share my opinion on this topic as well: specifically, that this is a WMF-wide problem and we should probably have a unified service discovery solution that doesn't require:

  • directly touching LVS config
  • directly manipulating DNS records
  • running in K8s

It's possible that mainline SRE has something cooked up...or maybe we could do something with Istio? I started T404119 to discuss it further.