Create a dashboard to show depooled hosts
Open, MediumPublic

Description

Per my IRC chat with @FCeratto-WMF , let's try to create a dashboard to visualize hosts that are depooled.
Right now to check if a host is depooled we have to check:
https://noc.wikimedia.org/dbconfig/eqiad.json
https://noc.wikimedia.org/dbconfig/codfw.json

Alternatively we can do it via dbctl get (or with the wrapper), but it would be better to have a dashboard where we can get that list (and why not even depool and repool from there).
Let's start the discussion on what we want this to look like.

Event Timeline

Marostegui triaged this task as Medium priority.Jan 20 2025, 2:26 PM
Marostegui moved this task from Triage to Refine on the DBA board.

I'm not sure you'd need puppet data to do this. dbctl provides the list of pooled/depooled dbs: https://noc.wikimedia.org/dbconfig/eqiad.json. You can get the list of depooled ones by removing anything that's pooled from hostsByName. It's not super clean, but I think we can even provide a better list via dbctl (I don't know how to set up a Grafana dashboard for it, though).
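
A minimal sketch of the subtraction described above, in Python. The JSON layout used here (sectionLoads and externalLoads as lists of host-to-weight maps, groupLoadsBySection as nested maps, hostsByName keyed by host name) is an assumption about the dbconfig file and should be checked against the real output. The host names and addresses in SAMPLE are made up for illustration.

```python
import json


def depooled_hosts(dbconfig: dict) -> list[str]:
    """Return hosts present in hostsByName that carry no load anywhere."""
    pooled: set[str] = set()

    # Assumed shape: {section: [{host: weight}, {host: weight, ...}]}
    for key in ("sectionLoads", "externalLoads"):
        for groups in dbconfig.get(key, {}).values():
            for group in groups:
                pooled.update(group)

    # Assumed shape: {section: {group_name: {host: weight}}}
    for groups in dbconfig.get("groupLoadsBySection", {}).values():
        for hosts in groups.values():
            pooled.update(hosts)

    # Everything known by name minus everything pooled somewhere.
    return sorted(set(dbconfig.get("hostsByName", {})) - pooled)


# Hypothetical data standing in for the contents of eqiad.json.
SAMPLE = json.loads("""
{
  "sectionLoads": {"s1": [{"db-a": 0}, {"db-b": 100, "db-c": 200}]},
  "groupLoadsBySection": {"s1": {"vslow": {"db-d": 100}}},
  "hostsByName": {
    "db-a": "192.0.2.1", "db-b": "192.0.2.2", "db-c": "192.0.2.3",
    "db-d": "192.0.2.4", "db-e": "192.0.2.5", "db-f": "192.0.2.6"
  }
}
""")

if __name__ == "__main__":
    print(depooled_hosts(SAMPLE))
```

In this sketch a host that is pooled only in a group (such as vslow) still counts as pooled; whether the dashboard should treat it that way is a design choice.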

I am confused by this - where did we mention puppet?

Puppet could be used for T257814, see comment https://phabricator.wikimedia.org/T257814#10666643 - for an initial "MVP" implementation of a dashboard indeed we can get away without it.

It was a subtask (T389663: Fetch DB-related data from puppet), but it is not anymore.

I removed the direct subtask relation to prevent confusion

Screenshot 2025-03-26 at 13-00-54 Dash.png (1×1 px, 182 KB)
A working prototype that shows pooled and depooled hosts, with links to the host and section on orchestrator.

Change #1135382 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] hiera: Add zarcillo service to k8s

https://gerrit.wikimedia.org/r/1135382

Change #1135387 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] hiera: Add zarcillo k8s service on traffic server

https://gerrit.wikimedia.org/r/1135387

Change #1135414 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/deployment-charts@master] Add zarcillo k8s service

https://gerrit.wikimedia.org/r/1135414

Change #1135432 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] python-webapp: Update modules

https://gerrit.wikimedia.org/r/1135432

Change #1135438 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/dns@master] Add zarcillo (aux k8s) CNAME

https://gerrit.wikimedia.org/r/1135438

Change #1135382 merged by Federico Ceratto:

[operations/puppet@production] hiera: Add zarcillo service to k8s

https://gerrit.wikimedia.org/r/1135382

Change #1135696 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/deployment-charts@master] Add namespace for zarcillo

https://gerrit.wikimedia.org/r/1135696

Change #1135696 merged by Federico Ceratto:

[operations/deployment-charts@master] Add namespace for zarcillo

https://gerrit.wikimedia.org/r/1135696

Change #1135432 merged by jenkins-bot:

[operations/deployment-charts@master] python-webapp: Update modules

https://gerrit.wikimedia.org/r/1135432

Change #1135438 merged by Federico Ceratto:

[operations/dns@master] Add zarcillo (aux k8s) CNAME

https://gerrit.wikimedia.org/r/1135438

Change #1135414 merged by Federico Ceratto:

[operations/deployment-charts@master] Add zarcillo k8s service

https://gerrit.wikimedia.org/r/1135414

Change #1137314 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/deployment-charts@master] values.yaml: Update chart for zarcillo in aux-k8s

https://gerrit.wikimedia.org/r/1137314

Change #1137314 merged by jenkins-bot:

[operations/deployment-charts@master] values.yaml: Update deployment for zarcillo in aux-k8s

https://gerrit.wikimedia.org/r/1137314

Change #1138688 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/deployment-charts@master] zarcillo: values.yaml: Update container path

https://gerrit.wikimedia.org/r/1138688

Change #1138688 merged by Federico Ceratto:

[operations/deployment-charts@master] zarcillo: values.yaml: Update container path

https://gerrit.wikimedia.org/r/1138688

Change #1145127 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/deployment-charts@master] zarcillo: values.yaml: Fix typo, remove comment

https://gerrit.wikimedia.org/r/1145127

Change #1145127 merged by Clément Goubert:

[operations/deployment-charts@master] zarcillo: values.yaml: Fix typo, remove comment

https://gerrit.wikimedia.org/r/1145127

Change #1146018 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/deployment-charts@master] zarcillo: values.yaml: Add FQDN for SNI

https://gerrit.wikimedia.org/r/1146018

Change #1146018 merged by jenkins-bot:

[operations/deployment-charts@master] zarcillo: values.yaml: Add FQDN for SNI

https://gerrit.wikimedia.org/r/1146018

Change #1135387 merged by Federico Ceratto:

[operations/puppet@production] hiera: Add zarcillo k8s service on traffic server

https://gerrit.wikimedia.org/r/1135387

The initial development version is up: https://zarcillo.wikimedia.org/
It is just a prototype at the moment but any early feedback could be beneficial.

Nice work!
Some initial comments/ideas

  • Can you show masters in a different colour?
  • Can you give hosts belonging to the same DC the same colour and hosts in the other DC a different one? It would make it easier to see at first glance
  • The Zone column is probably not very useful, I'd simply skip it
  • Ideally, depooled hosts should show at least the section they belong to (currently they show None)

@Marostegui thanks. I added colors to highlight master hosts and the current primary datacenter, and removed Zone. Do we want to expose all weights, or just one row per host or one per instance?
To fetch depooled hosts from the zarcillo db I'm going to need a user with access rights. Ideally a dedicated user with subnet filtering.

I think having the weights there is useful. For the user, that's fine, we can create a user with just SELECT on that DB, restricted to the hosts you'd be connecting from. Which host will it fetch data from? Right now we normally have 10.64.% for eqiad, would that be enough?

I'm told by @Clement_Goubert that 10.194.0.0/16 and 10.67.0.0/16 are required for k8s.
Regarding DB users, ideally we could start with a read-only user to access different tables on zarcillo db for the dashboard.

Then, the process fetching data from the prod databases (independent from the dashboard) would need a read-only user to run SHOW REPLICA STATUS and access the heartbeats, plus a user that can write the output into zarcillo db.

That's fine, please create a username and a random password, commit it on the private puppet repo and I will grab it from there. Once done I will create the read only user.

@Marostegui I created the entries in the private puppet repo, also created a copy for you in your home directory as discussed on IRC.

| username | permissions | host |
| --- | --- | --- |
| zarcillo_ids_rw_preprod | GRANT SELECT, INSERT, UPDATE, DELETE, CREATE ON zarcillo_preprod | db1215 |
| zarcillo_ids_rw_prod | GRANT SELECT, INSERT, UPDATE, DELETE, CREATE ON zarcillo | db1215 |

The networks to allow are: 10.67.80.0/255.255.248.0
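
To make the netmask notation concrete, here is a small Python sketch using the standard ipaddress module. It converts the network above to prefix form and checks it against the two k8s ranges mentioned earlier in the thread. The variable names are illustrative only.

```python
import ipaddress

# Network given for the DB user, in address/netmask form.
allowed = ipaddress.ip_network("10.67.80.0/255.255.248.0")

# The k8s ranges mentioned earlier in this thread.
k8s_ranges = [
    ipaddress.ip_network("10.194.0.0/16"),
    ipaddress.ip_network("10.67.0.0/16"),
]

print(allowed.with_prefixlen)   # same network in CIDR form
print(allowed[0], allowed[-1])  # first and last address covered
for net in k8s_ranges:
    # subnet_of() tells whether the allowed network sits inside each range.
    print(net, allowed.subnet_of(net))
```

The netmask 255.255.248.0 is a /21, so the allowed network spans 10.67.80.0 to 10.67.87.255. That lies inside 10.67.0.0/16 and does not overlap 10.194.0.0/16.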

Edit: we discussed using Prometheus and Orchestrator instead of scraping replication data from the databases, and will review this in the future.

Commands for db2230 for the test database:

CREATE DATABASE `zarcillo_preprod`;
CREATE USER 'zarcillo_ids_rw_preprod'@'10.67.80.0/255.255.248.0' IDENTIFIED BY 'REPLACEME';
GRANT SELECT, INSERT, UPDATE, DELETE, CREATE ON `zarcillo_preprod`.* TO 'zarcillo_ids_rw_preprod'@'10.67.80.0/255.255.248.0';
FLUSH PRIVILEGES;

The user zarcillo_ids_rw_preprod on db1215 has been created as discussed on IRC.

FCeratto-WMF moved this task from Blocked to In progress on the DBA board.

I'm adding documentation for the Web UI at https://doc.wikimedia.org/data_persistence/zarcillo/README.html#_web_ui as a way to share progress here. I can also paste the documentation here if desired.

Thank you - can you use https://wikitech.wikimedia.org/wiki/MariaDB ? We tend to publish everything under there. Please complete it all under https://wikitech.wikimedia.org/wiki/Zarcillo