jmx_presto prometheus job down for some an-presto hosts
Open, Needs Triage, Public

Description

The following hosts are not pollable by prometheus eqiad on their jmx_exporter ports:

https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets?orgId=1&var-datasource=thanos&var-Filters=job%7C%3D%7Cjmx_presto

2023-01-24 09:48:33	an-presto1006:10281	jmx_presto	analytics	eqiad	0
2023-01-24 09:48:33	an-presto1007:10281	jmx_presto	analytics	eqiad	0
2023-01-24 09:48:33	an-presto1008:10281	jmx_presto	analytics	eqiad	0
2023-01-24 09:48:33	an-presto1009:10281	jmx_presto	analytics	eqiad	0
2023-01-24 09:48:33	an-presto1010:10281	jmx_presto	analytics	eqiad	0
2023-01-24 09:48:33	an-presto1011:10281	jmx_presto	analytics	eqiad	0
2023-01-24 09:48:33	an-presto1012:10281	jmx_presto	analytics	eqiad	0
2023-01-24 09:48:33	an-presto1013:10281	jmx_presto	analytics	eqiad	0
2023-01-24 09:48:33	an-presto1014:10281	jmx_presto	analytics	eqiad	0
2023-01-24 09:48:33	an-presto1015:10281	jmx_presto	analytics	eqiad	0
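For anyone who wants to turn the listing above into a plain host list (for example to cross-check against a downtime or silence), here is a minimal Python sketch. It assumes the rows are tab-separated and that the columns are timestamp, instance, job, cluster, site and the `up` value, which is inferred from the rows and not stated on the dashboard. The names `TargetSample`, `parse_targets` and `down_hosts` are made up for illustration.

```python
from typing import List, NamedTuple


class TargetSample(NamedTuple):
    # Column meanings are an assumption inferred from the pasted rows.
    timestamp: str
    instance: str  # host:port of the scraped exporter
    job: str
    cluster: str
    site: str
    up: int        # 0 = last scrape failed, 1 = last scrape succeeded


def parse_targets(text: str) -> List[TargetSample]:
    """Parse tab-separated target rows, skipping lines that do not fit."""
    samples = []
    for line in text.splitlines():
        fields = line.strip().split("\t")
        if len(fields) != 6 or not fields[5].isdigit():
            continue  # blank line or something that is not a target row
        ts, instance, job, cluster, site, up = fields
        samples.append(TargetSample(ts, instance, job, cluster, site, int(up)))
    return samples


def down_hosts(samples: List[TargetSample], job: str = "jmx_presto") -> List[str]:
    """Return the sorted, de-duplicated hostnames whose target reports 0."""
    return sorted(
        {s.instance.split(":")[0] for s in samples if s.job == job and s.up == 0}
    )


if __name__ == "__main__":
    # Two rows copied from the listing above.
    pasted = (
        "2023-01-24 09:48:33\tan-presto1006:10281\tjmx_presto\tanalytics\teqiad\t0\n"
        "2023-01-24 09:48:33\tan-presto1007:10281\tjmx_presto\tanalytics\teqiad\t0\n"
    )
    for host in down_hosts(parse_targets(pasted)):
        print(host)
```

This only reformats what was pasted; it does not query Prometheus or Thanos.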

cc @BTullis @Stevemunene

Event Timeline

Hi @fgiunchedi, these 10 servers were taken out of the cluster because of problems joining the Presto cluster, documented in T325809, T325331 and T323783.

The last message from @BTullis was: "Icinga downtime and Alertmanager silence (ID=4f65baff-bd05-4e2f-8578-eae4b972dc3b) set by btullis@cumin1001 for 30 days, 0:00:00 on 10 host(s) and their services with reason: Still not ready to add these new presto servers to the cluster - btullis"