We have an alarm that fires when too many Gearman functions are waiting. It is a Graphite-based check defined in Puppet via:
monitoring::graphite_threshold { 'zuul_gearman_wait_queue':
    ensure          => $ensure,
    description     => 'Work requests waiting in Zuul Gearman server',
    dashboard_links => ['https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1'],
    metric          => 'zuul.geard.queue.waiting',
    contact_group   => 'contint',
    from            => '10min',
    percentage      => 100,
    warning         => 90,
    critical        => 150,
    notes_link      => 'https://www.mediawiki.org/wiki/Continuous_integration/Zuul',
}
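To make it easier to compare this check with the Grafana alert, here is a rough Python sketch of how I read the parameters above. It assumes that `percentage` is the share of datapoints in the `from` window that must exceed a threshold, that the comparison is strictly greater-than, and that null datapoints are ignored. None of that has been verified against the actual check plugin, and the `classify` function is made up for illustration.

```python
from typing import Optional, Sequence


def classify(
    values: Sequence[Optional[float]],
    warning: float = 90,
    critical: float = 150,
    percentage: float = 100,
) -> str:
    """Classify one window of datapoints (e.g. the last 10 minutes).

    Assumption: the check alerts when at least `percentage` percent of
    the non-null datapoints are above the threshold.
    """
    # Graphite reports gaps in a series as null; drop them here.
    points = [v for v in values if v is not None]
    if not points:
        return "UNKNOWN"

    def share_above(threshold: float) -> float:
        # Percentage of datapoints strictly above the threshold.
        return 100.0 * sum(1 for v in points if v > threshold) / len(points)

    # Check the critical threshold first so it takes precedence.
    if share_above(critical) >= percentage:
        return "CRITICAL"
    if share_above(warning) >= percentage:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Hypothetical queue sizes sampled over a 10 minute window.
    print(classify([160, 170, 155]))  # every point above 150
    print(classify([100, 160, 95]))   # every point above 90, not all above 150
    print(classify([100, 80, 160]))   # one point below 90
```

Under this reading, `percentage => 100` means a single datapoint below the threshold within the 10 minute window keeps the check from firing, which could be one reason the Puppet check and the Grafana alert disagree.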
The related Grafana dashboard is https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10. It has its own alert defined, which is not necessarily consistent with what is defined in Puppet.
Guides from the Observability team:
https://wikitech.wikimedia.org/wiki/Alertmanager#Grafana_alerts
If our team hasn't been onboarded to Alertmanager yet:
https://wikitech.wikimedia.org/wiki/Alertmanager#I'm_part_of_a_new_team_that_needs_onboarding_to_Alertmanager,_what_do_I_need_to_do
https://wikitech.wikimedia.org/wiki/Alertmanager#Onboard