[EPIC][GRAFMIGR] Spruce up Wikidata Grafana Metrics
Open, Needs Triage · Public

Description

Context

Graphite has long been EOL, and statsd (the API we use to post metrics to the Grafana backend) will be replaced with statslib, a new API that supports Prometheus. We will therefore embark on a series of tasks to perform the migration and to clean up our metrics and dashboards in Grafana.
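
For illustration only, here is a minimal Python sketch of what the move from dotted statsd names to labelled Prometheus metrics can look like for a single counter. The metric names, label names and helper functions are hypothetical and are not statslib's API (statslib itself is PHP, in MediaWiki); only the statsd line format and the Prometheus text exposition format are standard.

```python
def statsd_line(name: str, value: int) -> str:
    """Format a counter increment in the plain statsd wire format."""
    return f"{name}:{value}|c"


def prometheus_sample(name: str, labels: dict[str, str], value: int) -> str:
    """Format one sample in the Prometheus text exposition format.

    Label values are not escaped here; this is a sketch, not a client library.
    """
    rendered = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{name}{{{rendered}}} {value}"


# Hypothetical metric: the dotted path segments become labels on one metric.
legacy = statsd_line("wikibase.repo.api.wbeditentity.success", 1)
modern = prometheus_sample(
    "wikibase_repo_api_requests_total",
    {"module": "wbeditentity", "status": "success"},
    1,
)
print(legacy)
print(modern)
```

During the interim phase described below, both forms would be emitted for the same event, so dashboards can be switched over one at a time.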

WMF: T228380 Tech debt: sunsetting of Graphite

Main Objectives

  • Clean up deprecated and unused tables in Graphite so that they are not migrated to Prometheus.
    • Because of the deadline, we just migrated all of it.
  • All API calls across the extensions supporting Wikidata are made to a Prometheus backend (and, for an interim phase, to Graphite alongside it) using the new statslib API.
  • 🚧 Ensure that Grafana dashboards are querying Prometheus instead of Graphite.
  • 🚧 Standardization of the presentation of and context for metrics across Grafana Dashboards.
  • 🚧 Remove all unused Prometheus/Graphite data processes given new Grafana dashboards.
  • 🚧 Mark historical dashboards that Product wants to keep as archived/deprecated (ex: WD co-editors)
  • Eventually deprecate the copying of stats to Graphite.
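
As a rough aid for the dashboard objective above, the sketch below scans an exported Grafana dashboard JSON for panels that still point at Graphite. It assumes the usual dashboard JSON model (a top-level `panels` list, a per-panel or per-target `datasource` that is either a name string or an object with a `type`). The panel titles and datasource names in the example are made up, and matching on the name string is only a heuristic.

```python
import json


def _is_graphite(source) -> bool:
    # Older exports store a datasource name (string); newer ones store
    # an object with "type" and "uid". The string check is a heuristic.
    if isinstance(source, dict):
        return source.get("type") == "graphite"
    if isinstance(source, str):
        return "graphite" in source.lower()
    return False


def graphite_panels(dashboard: dict) -> list[str]:
    """Return sorted titles of panels whose datasource still looks like Graphite."""
    found = []
    stack = list(dashboard.get("panels", []))
    while stack:
        panel = stack.pop()
        # Collapsed rows keep their child panels in a nested "panels" list.
        stack.extend(panel.get("panels", []))
        sources = [panel.get("datasource")]
        sources += [target.get("datasource") for target in panel.get("targets", [])]
        if any(_is_graphite(source) for source in sources):
            found.append(panel.get("title", "<untitled>"))
    return sorted(found)


# Hypothetical export with one migrated panel and two that are not.
export = json.loads("""
{"panels": [
  {"title": "Edits per minute", "datasource": {"type": "graphite", "uid": "abc"}},
  {"title": "API latency", "datasource": {"type": "prometheus", "uid": "def"},
   "targets": [{"expr": "up"}]},
  {"title": "Row", "panels": [{"title": "Dispatch lag", "datasource": "graphite-eqiad"}]}
]}
""")
print(graphite_panels(export))
```

A dashboard with an empty result would be a candidate for ticking off; anything else still needs its queries rewritten or the dashboard archived.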

Mitigated Risks

  • Graphite is EOL and achieving these objectives will allow us to stop using it.
  • Statsd uses abandoned libraries upstream, so migrating to statslib will leave us with one fewer unmaintained weak point in our dependencies (T326607).
  • Additional stakeholders from MediaWiki Ecosystem will not have to perform these tasks for us with little context.
  • Broken dashboards and tables in Grafana erode trust in the analytics data we present to users and engineers.

Potential Tasks

Related Objects

Status | Subtype | Assigned | Task
Open | | None |
Resolved | | HasanAkgun_WMDE |
Resolved | | colewhite |
Resolved | | AnnWF |
Resolved | | Lucas_Werkmeister_WMDE |
Resolved | | AndrewTavis_WMDE |
Resolved | | andrea.denisse |
Resolved | | HasanAkgun_WMDE |
Resolved | | HasanAkgun_WMDE |
Open | | None |
Open | BUG REPORT | None |
Resolved | | andrea.denisse |
Resolved | | HasanAkgun_WMDE |
Resolved | | fgiunchedi |
Resolved | | HasanAkgun_WMDE |
Open | | AndrewTavis_WMDE |
Open | | AndrewTavis_WMDE |
Resolved | | AndrewTavis_WMDE |
Resolved | | AndrewTavis_WMDE |
Resolved | | AndrewTavis_WMDE |
Resolved | | AndrewTavis_WMDE |
Resolved | | AndrewTavis_WMDE |
Resolved | | Jakob_WMDE |
Open | | None |

Event Timeline

Restricted Application added a subscriber: Aklapper.
Lucas_Werkmeister_WMDE renamed this task from [EPIC] [WD-ANALYTICS] Spruce up Wikidata Grafna Metrics to [EPIC] [WD-ANALYTICS] Spruce up Wikidata Grafana Metrics. · Aug 1 2024, 3:11 PM
karapayneWMDE renamed this task from [EPIC] [WD-ANALYTICS] Spruce up Wikidata Grafana Metrics to [EPIC][GRAFMIGR] Spruce up Wikidata Grafana Metrics. · Aug 29 2024, 8:12 AM

Hope it's ok that I added my Grafana Cleanup Notes to the epic description, @ItamarWMDE! :)

Note that I don't think we did the first point above ("Clean up some deprecated and unused tables in graphite so that they are not migrated to Prometheus"). We just jumped in and have been changing things over... Do we want to cross out that point and add another after "Standardization of the presentation of and context for metrics across Grafana Dashboards"? Something like "Remove all unused Prometheus/Graphite data processes given new Grafana dashboards"?

Hi!

I brought this up in the WMDE cross-engineering team chat, and I wanted to check whether this epic also includes migrating Grafana alerts, not just migrating dashboards on Grafana from one data source to another.

It seems to me that since T391793, some alerts that the Wikidata team may be relying on are now paused, on the assumption that they are going to be deleted in the near future anyway. An example is "Edits: below 30 per minute (for 3 minutes)". I noticed Lucas referenced this in T228380#10744144, but as far as I can tell the alerts would not have fired because they were already paused.
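
To make the example alert concrete, here is a small Python sketch of the condition its name describes: the edit rate staying below 30 per minute for 3 minutes. It is only a reading of the alert's title. The actual Grafana query is not shown in this task, and the function name and sample values are made up. In Prometheus terms this would typically be an alerting rule with a `for: 3m` clause rather than code like this.

```python
def below_threshold_for(samples: list[float], threshold: float, minutes: int) -> bool:
    """Return True if the last `minutes` per-minute samples are all below `threshold`.

    `samples` is assumed to hold one value per minute, oldest first.
    """
    if len(samples) < minutes:
        # Not enough history yet to say the condition held for the full window.
        return False
    return all(value < threshold for value in samples[-minutes:])


# Hypothetical edit rates (edits per minute), oldest first.
sustained_drop = [45, 28, 12, 9]
brief_dip = [45, 28, 35, 9]

print(below_threshold_for(sustained_drop, threshold=30, minutes=3))
print(below_threshold_for(brief_dip, threshold=30, minutes=3))
```

The point of the 3-minute window is that a single slow minute does not page anyone; only a sustained drop does.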

I wondered if you were planning to migrate these existing Grafana-based, Graphite-backed alerts to Alertmanager-based Prometheus alerts? Some comparable work could be these tickets from the performance world:

I could also imagine a world that you might decide you no longer want these alerts and are happy that they are sunset with Graphite or that I've missed some equivalent alerts already migrated to alertmanager/prometheus as part of another ticket. I'm no expert but after our sub 3min chat today I thought it would be helpful to try and put in writing some of my thoughts.

@Tarrow, coming back to this epic given priorities. I'd really support your suggestion above being included in here and would be happy to do what I can to support to get the alerts migrated. Do you want to add an objective to the task text? I'd be happy to add a summary of your points as well 😊

> @Tarrow, coming back to this epic given priorities. I'd really support your suggestion above being included in here and would be happy to do what I can to support to get the alerts migrated. Do you want to add an objective to the task text? I'd be happy to add a summary of your points as well 😊

In my opinion the alerts are something that the engineering team probably ought to define and own. That's what I was trying to get across in:

> I could also imagine a world that you might decide you no longer want these alerts and are happy that they are sunset with Graphite or that I've missed some equivalent alerts already migrated to alertmanager/prometheus as part of another ticket

I was a bit unclear here, but in this case "you" === "Wikidata engineers responsible for keeping Wikidata up and functional for users." I think it's quite important that alerts are tied to some tangible action, and that action probably needs to be defined by the engineering team. However, I'd be surprised if all of the alerts should now be sunset.

Totally agree that the alerts should be your all's main responsibility. Was just saying that I'm happy to help if I can :) I do think that the task should be assigned to this epic though and sub tasks can then be attached and assigned to engineers.

> Totally agree that the alerts should be your all's main responsibility. Was just saying that I'm happy to help if I can :) I do think that the task should be assigned to this epic though and sub tasks can then be attached and assigned to engineers.

I've made T397146 as a subtask.
I'm not certain if it should be under this task or not; I would leave that up to y'all (Wikidata engineers). I'm very aware that the title of this task is about Grafana Metrics and the subtask is about alerts that should probably not remain on Grafana at all. Maybe someone has already moved them off and I'm just not aware of it, but if not, we should probably do something so that the alerts still fire if something breaks.