Convert sge.py from a Diamond collector to a Python script that makes its metrics available to node-exporter.
Find another submit host to run it on instead of tools-services (so that the tools-services hosts can become non-submit hosts, which simplifies the infrastructure).
GTirloni | Dec 11 2018, 1:16 PM
F28005908: Screen Shot 2019-01-23 at 10.32.07.png | Jan 23 2019, 5:33 PM
Subject | Repo | Branch | Lines +/-
---|---|---|---
toolforge: Prometheus replacement for sge.py diamond collector | operations/puppet | production | +167 -0
Status | Assigned | Task
---|---|---
Resolved | fgiunchedi | T177195 Reduce technical debt in metrics monitoring
Resolved | fgiunchedi | T177196 Port non-deprecated Diamond collectors to Prometheus
Resolved | GTirloni | T207591 tools-services: Migrate to Stretch
Resolved | bd808 | T211684 Toolforge: Port sge.py stats to Prometheus
Resolved | Bstorm | T215845 Add monitoring for disabled grid nodes to the prometheus collector
Open | None | T213567 Toolforge: refresh grafana dashboard
Change 485372 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: Prometheus replacement for sge.py diamond collector
Change 485372 merged by GTirloni:
[operations/puppet@production] toolforge: Prometheus replacement for sge.py diamond collector
# curl http://tools-sgegrid-master.tools.eqiad.wmflabs:9100/metrics | grep ^sge
sge_hostjobs{host="tools-sgeexec-0901.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0902.tools.eqiad.wmflabs"} 2
sge_hostjobs{host="tools-sgeexec-0903.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0904.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0905.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0906.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0907.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0908.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0909.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0910.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0911.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0912.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0913.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0914.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0915.tools.eqiad.wmflabs"} 2
sge_hostjobs{host="tools-sgeexec-0916.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0917.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0918.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0919.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0920.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0921.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0922.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0923.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0924.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0925.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0926.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0927.tools.eqiad.wmflabs"} 2
sge_hostjobs{host="tools-sgeexec-0928.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0929.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0930.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0931.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0932.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0933.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0934.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0935.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0936.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0937.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0938.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0939.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0940.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0941.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0942.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-generic-0901.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-generic-0902.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-generic-0903.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-generic-0904.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0901.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0902.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0903.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0904.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0905.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0906.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0907.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0908.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0909.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0910.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0911.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0912.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0913.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0914.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0915.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0916.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0917.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0918.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0919.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0920.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0921.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0922.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0923.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0924.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0925.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0926.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0927.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0928.tools.eqiad.wmflabs"} 0
sge_jobseqnum 22173
sge_queuejobs{queue="continuous",state="r"} 20
sge_queuejobs{queue="task",state="r"} 6
sge_queuejobs{queue="webgrid-generic",state="r"} 1
sge_queuejobs{queue="webgrid-lighttpd",state="r"} 9
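For readers who have not seen this pattern, here is a minimal sketch of how a script could hand metrics like the ones above to node-exporter. It is not the merged patch. It assumes node-exporter's textfile collector is in use (a directory of `.prom` files, set with `--collector.textfile.directory`), and the function names `render_metrics` and `write_textfile`, as well as the input dictionaries, are hypothetical. How the job counts are gathered from the grid is left out entirely.

```python
import os
import tempfile


def render_metrics(host_jobs, queue_jobs, job_seqnum):
    """Render grid stats in the Prometheus text exposition format.

    host_jobs:  {hostname: running job count}
    queue_jobs: {(queue, state): job count}
    Label values are assumed to need no escaping (no quotes or backslashes).
    """
    lines = []
    for host in sorted(host_jobs):
        lines.append('sge_hostjobs{host="%s"} %d' % (host, host_jobs[host]))
    lines.append("sge_jobseqnum %d" % job_seqnum)
    for queue, state in sorted(queue_jobs):
        count = queue_jobs[(queue, state)]
        lines.append(
            'sge_queuejobs{queue="%s",state="%s"} %d' % (queue, state, count)
        )
    return "\n".join(lines) + "\n"


def write_textfile(path, text):
    """Write to a temp file, then rename, so a scrape never sees a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    # mkstemp creates the file as 0600; loosen it so another user can read it.
    os.chmod(tmp, 0o644)
    os.replace(tmp, path)


if __name__ == "__main__":
    # Illustrative values copied from the output above.
    sample = render_metrics(
        {"tools-sgeexec-0902.tools.eqiad.wmflabs": 2},
        {("continuous", "r"): 20, ("task", "r"): 6},
        22173,
    )
    print(sample, end="")
```

The write-then-rename step matters because node-exporter may read the directory at any moment; replacing the file in one step avoids exposing a half-written set of metrics.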
Mentioned in SAL (#wikimedia-cloud) [2019-01-23T17:44:46Z] <bd808> Added rules to default security group for prometheus monitoring on port 9100 (T211684)
The security group changes needed a bit of help from @Andrew, who figured out that we were out of quota in eqiad1-r for security groups/rules. After raising the limit, things started working.