
Create a visual representation of where each service is active from at any given time
Closed, Resolved, Public

Description

During a recent incident, one of the issues that came up (though not an old issue) was that we had no immediate visibility into where each of our services in discovery is served from. That information is easily obtainable via confctl, but when everything is on fire, it is not easy for a human to consume the following output:

{"codfw": {"pooled": true, "references": [], "ttl": 300}, "tags": "dnsdisc=eventgate-analytics-external"}
{"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=kartotherian"}
{"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=parsoid-php"}
{"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=mw-web"}
{"codfw": {"pooled": true, "references": [], "ttl": 10}, "tags": "dnsdisc=api-ro"}
{"codfw": {"pooled": true, "references": [], "ttl": 10}, "tags": "dnsdisc=appservers-ro"}
{"codfw": {"pooled": true, "references": [], "ttl": 300}, "tags": "dnsdisc=echostore"}
{"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=mw-api-ext"}
{"codfw": {"pooled": true, "references": [], "ttl": 300}, "tags": "dnsdisc=eventstreams-internal"}
{"codfw": {"pooled": true, "references": [], "ttl": 300}, "tags": "dnsdisc=recommendation-api"}

<snip>

Moreover, the above data does not tell us whether a service is active/active or active/passive.
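
As an illustration of how the JSON lines above could be condensed into something easier to read, here is a minimal Python sketch. It assumes the input looks exactly like the sample (one JSON object per line, a "tags" value of the form "dnsdisc=<service>", and every other key being a datacenter name); the function names summarize and render are hypothetical and not part of confctl.

```python
import json


def summarize(lines):
    """Map each discovery service to {datacenter: pooled} from confctl-style JSON lines."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: "tags" looks like "dnsdisc=<service>" and every
        # remaining key is a datacenter name, as in the sample output.
        service = record.pop("tags").split("=", 1)[1]
        for dc, state in record.items():
            table.setdefault(service, {})[dc] = state["pooled"]
    return table


def render(table):
    """Render the mapping as a fixed-width text table, one row per service."""
    dcs = sorted({dc for states in table.values() for dc in states})
    header = f"{'Service':<32}" + "".join(f"{dc:<10}" for dc in dcs)
    rows = [header.rstrip()]
    for service in sorted(table):
        cells = []
        for dc in dcs:
            state = table[service].get(dc)
            # "?" marks a datacenter for which no record was seen.
            cells.append("?" if state is None else "pooled" if state else "depooled")
        rows.append((f"{service:<32}" + "".join(f"{c:<10}" for c in cells)).rstrip())
    return "\n".join(rows)


# Two lines copied from the sample above.
sample = [
    '{"codfw": {"pooled": true, "references": [], "ttl": 300}, "tags": "dnsdisc=echostore"}',
    '{"codfw": {"pooled": false, "references": [], "ttl": 300}, "tags": "dnsdisc=mw-web"}',
]
print(render(summarize(sample)))
```

This only reshapes the pooled state; like the raw output, it cannot say whether a service is active/active or active/passive.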

One solution could be to poll confd for that information and display it in Grafana. If we do so, it would be great to include it in the switchover dashboard.

Event Timeline

Change 886069 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] configmaster: Remove disc_desired_state.py

https://gerrit.wikimedia.org/r/886069

Change 886069 merged by Clément Goubert:

[operations/puppet@production] configmaster: Remove disc_desired_state.py

https://gerrit.wikimedia.org/r/886069

Change 886839 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] configmaster: Cleanup disc_desired_state

https://gerrit.wikimedia.org/r/886839

Just to add to the available options: listing the services, their A/A or A/P status, and the DCs in which they are pooled is also easily achievable with a cookbook using https://doc.wikimedia.org/spicerack/master/api/index.html#spicerack.Spicerack.service_catalog

Change 886839 merged by Clément Goubert:

[operations/puppet@production] configmaster: Cleanup disc_desired_state

https://gerrit.wikimedia.org/r/886839

Removed references to disc_desired_state from wikitech LVS and SwitchDC docs

Could we clarify the requirements further?

  • Scope: only the codfw and eqiad DCs? All deployed services (as available from the Spicerack service catalog)?
  • Expected metrics and, if you have an opinion, how they should be presented on the dashboard (heatmap, stat, gauge)
  • Freshness of data / time granularity
  • Data retention / length of history

Is anything still needed beyond the functionality in sudo cookbook -d sre.discovery.datacenter status all? That provides the following table:

Service                       Type           eqiad     codfw
=================================================================
apertium                      Active/Active  pooled    pooled    
api-gateway                   Active/Active  pooled    pooled    
apt                           Active/Passive pooled              
apus                          Active/Active  pooled    pooled    
citoid                        Active/Active  pooled    pooled    
config-master                 Active/Active  pooled    pooled    
cxserver                      Active/Active  pooled    pooled    
device-analytics              Active/Active  pooled    pooled    
docker-registry               Active/Passive           pooled    
echostore                     Active/Active  pooled    pooled    
eventgate-analytics           Active/Active  pooled    pooled    
[...]

That's nice and should solve the initial need captured by @Clement_Goubert @jijiki when incidents occur.

My understanding is that we may want to poll and store that status every X seconds and provide a visual dashboard version (current status and history for each service).

Let me know if that's the case, or if this is no longer needed.

The cookbook table is good enough for a quick check, but the idea of this task is to have a Grafana dashboard, or a small web page, that we can keep open, mostly during the switchover, to track services moving without having to eye-grep a long list of text.

Some of this information is exported by PyBal, but with the upcoming move to Liberica I don't think it's wise to rely on it.

However, config-master has a Prometheus export directly from conftool, giving this set of metrics.

As far as I can tell, this is a confd::file that holds a watch, so it will update in quasi-real-time with the status in conftool's etcd.

Now, that doesn't tell us whether the service is supposed to be A/P or A/A, but it should be a good starting point for a visual representation of pooled status.

As for *how* we should represent that visually, I'm open to suggestions. It may be that Grafana doesn't allow us to do what we want and we need to build something custom to host on config-master.

Thanks a lot @Clement_Goubert .

The new dashboard is here: https://grafana.wikimedia.org/goto/AyPMrxZDR?orgId=1

It displays the pooled state in each datacenter and the history per service, using the wmf_dnsdiscovery_service_pooled metric that you pointed to, Clement.

The A/A and A/P status does not seem to be exported, so I'll look into how to do that, either by modifying the current exporter of that metric (the Gerrit change where it was created) or by creating a new one.

MLechvien-WMF changed the task status from Open to In Progress. Dec 2 2025, 6:21 PM

Change #1216763 had a related patch set uploaded (by Matthieulec; author: Matthieulec):

[operations/puppet@production] Adds a new Python script to extract the Active/Active or Active/Passive configuration status for each service from the service catalog and exposes it as a Prometheus gauge metric. This will help during DC switchover operations.

https://gerrit.wikimedia.org/r/1216763

Tested the script today locally on cumin1003:

root@cumin1003:/home/matthieulec# python3 ./export_service_type.py --outfile=test.prom
root@cumin1003:/home/matthieulec# cat test.prom 
# HELP wmf_dnsdiscovery_service_active_active 1 if Active/Active, 0 if Active/Passive
# TYPE wmf_dnsdiscovery_service_active_active gauge
wmf_dnsdiscovery_service_active_active{service="apertium"} 1.0
wmf_dnsdiscovery_service_active_active{service="apus"} 1.0
wmf_dnsdiscovery_service_active_active{service="citoid"} 1.0
wmf_dnsdiscovery_service_active_active{service="cxserver"} 1.0
wmf_dnsdiscovery_service_active_active{service="echostore"} 1.0
wmf_dnsdiscovery_service_active_active{service="eventgate-analytics"} 1.0
wmf_dnsdiscovery_service_active_active{service="eventgate-logging-external"} 1.0
wmf_dnsdiscovery_service_active_active{service="eventgate-analytics-external"} 1.0
wmf_dnsdiscovery_service_active_active{service="eventgate-main"} 1.0
wmf_dnsdiscovery_service_active_active{service="eventstreams"} 1.0
wmf_dnsdiscovery_service_active_active{service="eventstreams-internal"} 1.0
wmf_dnsdiscovery_service_active_active{service="k8s-ingress-wikikube-ro"} 1.0
wmf_dnsdiscovery_service_active_active{service="k8s-ingress-wikikube-rw"} 0.0
wmf_dnsdiscovery_service_active_active{service="k8s-ingress-ml-serve"} 1.0
wmf_dnsdiscovery_service_active_active{service="kartotherian"} 1.0
wmf_dnsdiscovery_service_active_active{service="mathoid"} 1.0
wmf_dnsdiscovery_service_active_active{service="mobileapps"} 1.0
wmf_dnsdiscovery_service_active_active{service="mwdebug"} 1.0
wmf_dnsdiscovery_service_active_active{service="mwdebug-next"} 1.0
wmf_dnsdiscovery_service_active_active{service="mw-web"} 0.0
wmf_dnsdiscovery_service_active_active{service="mw-web-ro"} 1.0
wmf_dnsdiscovery_service_active_active{service="mw-web-next"} 0.0
wmf_dnsdiscovery_service_active_active{service="mw-web-next-ro"} 1.0
wmf_dnsdiscovery_service_active_active{service="mw-api-ext"} 0.0
wmf_dnsdiscovery_service_active_active{service="mw-api-ext-ro"} 1.0
wmf_dnsdiscovery_service_active_active{service="mw-api-ext-next"} 0.0
wmf_dnsdiscovery_service_active_active{service="mw-api-ext-next-ro"} 1.0
wmf_dnsdiscovery_service_active_active{service="mw-api-int"} 0.0
wmf_dnsdiscovery_service_active_active{service="mw-api-int-ro"} 1.0
wmf_dnsdiscovery_service_active_active{service="mw-jobrunner"} 0.0
wmf_dnsdiscovery_service_active_active{service="mw-parsoid"} 0.0
wmf_dnsdiscovery_service_active_active{service="proton"} 1.0
wmf_dnsdiscovery_service_active_active{service="proxoid"} 1.0
wmf_dnsdiscovery_service_active_active{service="push-notifications"} 1.0
wmf_dnsdiscovery_service_active_active{service="recommendation-api"} 1.0
wmf_dnsdiscovery_service_active_active{service="restbase"} 1.0
wmf_dnsdiscovery_service_active_active{service="restbase-async"} 1.0
wmf_dnsdiscovery_service_active_active{service="schema"} 1.0
wmf_dnsdiscovery_service_active_active{service="search"} 1.0
wmf_dnsdiscovery_service_active_active{service="search-omega"} 1.0
wmf_dnsdiscovery_service_active_active{service="search-psi"} 1.0
wmf_dnsdiscovery_service_active_active{service="sessionstore"} 1.0
wmf_dnsdiscovery_service_active_active{service="shellbox"} 1.0
wmf_dnsdiscovery_service_active_active{service="shellbox-constraints"} 1.0
wmf_dnsdiscovery_service_active_active{service="shellbox-media"} 1.0
wmf_dnsdiscovery_service_active_active{service="shellbox-syntaxhighlight"} 1.0
wmf_dnsdiscovery_service_active_active{service="shellbox-timeline"} 1.0
wmf_dnsdiscovery_service_active_active{service="shellbox-video"} 1.0
wmf_dnsdiscovery_service_active_active{service="tegola-vector-tiles"} 1.0
wmf_dnsdiscovery_service_active_active{service="thanos-query"} 1.0
wmf_dnsdiscovery_service_active_active{service="thanos-web"} 1.0
wmf_dnsdiscovery_service_active_active{service="thanos-swift"} 1.0
wmf_dnsdiscovery_service_active_active{service="termbox"} 1.0
wmf_dnsdiscovery_service_active_active{service="wcqs"} 1.0
wmf_dnsdiscovery_service_active_active{service="wdqs-internal-main"} 1.0
wmf_dnsdiscovery_service_active_active{service="wdqs-internal-scholarly"} 1.0
wmf_dnsdiscovery_service_active_active{service="wdqs-main"} 1.0
wmf_dnsdiscovery_service_active_active{service="wdqs-scholarly"} 1.0
wmf_dnsdiscovery_service_active_active{service="wikifeeds"} 1.0
wmf_dnsdiscovery_service_active_active{service="zotero"} 1.0
wmf_dnsdiscovery_service_active_active{service="api-gateway"} 1.0
wmf_dnsdiscovery_service_active_active{service="linkrecommendation"} 1.0
wmf_dnsdiscovery_service_active_active{service="inference"} 1.0
wmf_dnsdiscovery_service_active_active{service="puppetboard"} 1.0
wmf_dnsdiscovery_service_active_active{service="device-analytics"} 1.0
wmf_dnsdiscovery_service_active_active{service="pki"} 1.0
wmf_dnsdiscovery_service_active_active{service="rest-gateway"} 0.0
wmf_dnsdiscovery_service_active_active{service="rest-gateway-ro"} 1.0
wmf_dnsdiscovery_service_active_active{service="config-master"} 1.0
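
For readers unfamiliar with the format, the output above is the Prometheus text exposition format: a HELP line, a TYPE line, then one sample per service. The sketch below shows one way such a file could be rendered. It is not the export_service_type.py script from the patch; the input mapping (service name to a boolean) and the function name render_service_types are assumptions made for illustration.

```python
METRIC = "wmf_dnsdiscovery_service_active_active"


def render_service_types(service_types):
    """Render {service: is_active_active} as Prometheus text-format gauge samples."""
    lines = [
        f"# HELP {METRIC} 1 if Active/Active, 0 if Active/Passive",
        f"# TYPE {METRIC} gauge",
    ]
    for service, active_active in service_types.items():
        value = 1.0 if active_active else 0.0
        # Service names here contain only letters, digits and hyphens, so no
        # label escaping is done; arbitrary label values would need it.
        lines.append(f'{METRIC}{{service="{service}"}} {value}')
    return "\n".join(lines) + "\n"


# Hypothetical input, using two values visible in the output above.
print(render_service_types({"apertium": True, "mw-web": False}), end="")
```

A file in this shape is what gets checked in the next step, when the metric is scraped.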

The next stage is to test having Prometheus scrape it to ensure the metric is correct, and then wrap it as a maintenance script that runs regularly.

Change #1216763 merged by Kamila Součková:

[operations/puppet@production] Add new script to export A/A and A/P service types from Cumin hosts.

https://gerrit.wikimedia.org/r/1216763

A new metric wmf_dnsdiscovery_service_active_active (value = 1 for Active/Active, 0 for Active/Passive) is exported by Cumin hosts.

I added this metric to both the Pooled Status and the Datacenter Switchover dashboards.

@Clement_Goubert @jijiki could you confirm whether this visualization is good enough to consider this task complete? Color choices are open to discussion :)

Looks good to me. We may want to try to tune a more compact viz, but I haven't been able to find a form that would work.

On possible improvements/open questions, I'm actually wondering if we shouldn't also gather the discovery type for services excluded from the switchover in the exporter script. This would avoid having a bunch of "Not available" statuses in the visualization. We could export that switchover inclusion status to filter on for the switchover dashboard.
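
To make the "Not available" case concrete, here is a rough Python sketch of the join between pooled state and service type. It is an illustration only, not how the Grafana panels are built: the input shapes and the names describe and unexpected are hypothetical, and the rule that an Active/Passive service should be pooled in exactly one datacenter is an assumption that may not hold mid-switchover.

```python
def describe(pooled, types):
    """Join pooled state ({service: {dc: bool}}) with type ({service: is_active_active})."""
    rows = []
    for service in sorted(pooled):
        if service not in types:
            # Pooled state is known, but no type was exported for this service.
            kind = "Not available"
        elif types[service]:
            kind = "Active/Active"
        else:
            kind = "Active/Passive"
        dcs = sorted(dc for dc, is_pooled in pooled[service].items() if is_pooled)
        rows.append((service, kind, dcs))
    return rows


def unexpected(rows):
    """Assumed rule: an Active/Passive service is pooled in exactly one datacenter."""
    return [service for service, kind, dcs in rows
            if kind == "Active/Passive" and len(dcs) != 1]


# Sample data: apt and citoid follow the cookbook table quoted earlier in the
# thread; "example-service" is made up to show a missing type.
pooled = {
    "apt": {"eqiad": True, "codfw": False},
    "citoid": {"eqiad": True, "codfw": True},
    "example-service": {"eqiad": True, "codfw": True},
}
types = {"apt": False, "citoid": True}

rows = describe(pooled, types)
for service, kind, dcs in rows:
    print(f"{service:<20}{kind:<16}{', '.join(dcs)}")
```

Exporting the type for every service, as suggested above, would shrink the set of rows that end up as "Not available".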

Yes indeed, easy change and would be cleaner on the dashboard.
I'll test what the script gathers for those excluded services.

And exporting the switchover inclusion list for that dashboard can be tracked in another task which I can file.

It seems it's producing valid data for the missing services, giving us the types for about 20% more services.

We'll still have some gaps like restbase-backend, thumbor or codesearch, but it's a good improvement.

I'll send the PR over.

Change #1219595 had a related patch set uploaded (by Matthieulec; author: Matthieulec):

[operations/puppet@production] Keeping all services in the exported metrics. The switchover exclusion list should be applied on the final dashboard to filter out services data consistently.

https://gerrit.wikimedia.org/r/1219595

Change #1219595 merged by Clément Goubert:

[operations/puppet@production] export_service_type: Remove exclusion list

https://gerrit.wikimedia.org/r/1219595

New version looks good, I'll proceed to close this bug then. Thanks Clement and Raine for the help!