
Toolforge: Port sge.py stats to Prometheus
Closed, Resolved (Public)

Description

Convert sge.py from a Diamond collector to a Python script that will make metrics available to node-exporter.

Find another submit host to run it on instead of tools-services, so that the tools-services hosts can become non-submit hosts, which simplifies the infrastructure.
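For illustration only, here is a minimal sketch of the "make metrics available to node-exporter" part. It assumes the script hands metrics over through node-exporter's textfile collector, which reads `*.prom` files from a configured directory. The helper names `render_metrics` and `write_atomically` are hypothetical, the metric names are the ones that appear in the output later in this task, and collecting the numbers from the grid engine is left out. This is not the code from the merged patch.

```python
import os
import tempfile


def _escape(value):
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(host_jobs, queue_jobs, job_seqnum):
    """Render grid stats as Prometheus text exposition lines.

    host_jobs:  {hostname: job count}
    queue_jobs: {(queue, state): job count}
    job_seqnum: latest job sequence number
    """
    lines = []
    for host in sorted(host_jobs):
        lines.append('sge_hostjobs{host="%s"} %d' % (_escape(host), host_jobs[host]))
    lines.append("sge_jobseqnum %d" % job_seqnum)
    for queue, state in sorted(queue_jobs):
        lines.append(
            'sge_queuejobs{queue="%s",state="%s"} %d'
            % (_escape(queue), _escape(state), queue_jobs[(queue, state)])
        )
    return "\n".join(lines) + "\n"


def write_atomically(text, path):
    """Write to a temp file in the target directory, then rename it over
    `path`, so a reader never sees a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        # mkstemp creates the file with mode 0600; make it world-readable.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except Exception:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The rename step matters because node-exporter may scrape the directory at any moment; writing the final file in place could expose a truncated set of metrics.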

Event Timeline

GTirloni created this task.
bd808 edited projects, added Toolforge; removed Goal, SRE.

Change 485372 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: Prometheus replacement for sge.py diamond collector

https://gerrit.wikimedia.org/r/485372

Change 485372 merged by GTirloni:
[operations/puppet@production] toolforge: Prometheus replacement for sge.py diamond collector

https://gerrit.wikimedia.org/r/485372

# curl http://tools-sgegrid-master.tools.eqiad.wmflabs:9100/metrics | grep ^sge
sge_hostjobs{host="tools-sgeexec-0901.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0902.tools.eqiad.wmflabs"} 2
sge_hostjobs{host="tools-sgeexec-0903.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0904.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0905.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0906.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0907.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0908.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0909.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0910.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0911.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0912.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0913.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0914.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0915.tools.eqiad.wmflabs"} 2
sge_hostjobs{host="tools-sgeexec-0916.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0917.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0918.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0919.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0920.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0921.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0922.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0923.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0924.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0925.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0926.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0927.tools.eqiad.wmflabs"} 2
sge_hostjobs{host="tools-sgeexec-0928.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0929.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0930.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0931.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0932.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0933.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0934.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0935.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0936.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0937.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0938.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0939.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0940.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0941.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0942.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-generic-0901.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-generic-0902.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-generic-0903.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-generic-0904.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0901.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0902.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0903.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0904.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0905.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0906.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0907.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0908.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0909.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0910.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0911.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0912.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0913.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0914.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0915.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0916.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0917.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0918.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0919.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0920.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0921.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0922.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0923.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0924.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0925.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0926.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0927.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0928.tools.eqiad.wmflabs"} 0
sge_jobseqnum 22173
sge_queuejobs{queue="continuous",state="r"} 20
sge_queuejobs{queue="task",state="r"} 6
sge_queuejobs{queue="webgrid-generic",state="r"} 1
sge_queuejobs{queue="webgrid-lighttpd",state="r"} 9
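As a rough way to sanity-check output like the above, the following sketch parses text exposition lines and totals running jobs per queue. The function names are hypothetical, and the parser is deliberately simplified: it assumes one sample per line with no trailing timestamp, and it silently skips anything that does not look like a sample.

```python
import re
from collections import defaultdict

# metric_name{label="value",...} value   (the label block is optional)
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each line that looks like a sample."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample, e.g. stray terminal output
        try:
            value = float(match.group("value"))
        except ValueError:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, value


def running_jobs_by_queue(text):
    """Sum sge_queuejobs samples with state="r", keyed by queue name."""
    totals = defaultdict(float)
    for name, labels, value in parse_samples(text):
        if name == "sge_queuejobs" and labels.get("state") == "r":
            totals[labels["queue"]] += value
    return dict(totals)
```

Fed the `sge_queuejobs` lines above, `running_jobs_by_queue` would report 20 jobs for `continuous`, 6 for `task`, 1 for `webgrid-generic`, and 9 for `webgrid-lighttpd`.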

Screen Shot 2019-01-23 at 10.32.07.png (106×1 px, 40 KB)

Looks like we might be having cross-region network issues between the eqiad1-r instances and https://tools-prometheus.wmflabs.org/tools/targets

Mentioned in SAL (#wikimedia-cloud) [2019-01-23T17:44:46Z] <bd808> Added rules to default security group for prometheus monitoring on port 9100 (T211684)

bd808 added a subscriber: Andrew.

The security group stuff needed a bit of help from @Andrew, who figured out that we were out of quota in eqiad1-r for security groups/rules. After raising the limit, things started working.
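For anyone debugging something similar, a quick TCP check run from the Prometheus host can show whether port 9100 on a target is reachable at all. This is a minimal sketch: `port_is_open` is a hypothetical helper, the default port is the 9100 mentioned above, and a successful connect only proves the network path is open, not that metrics are being served.

```python
import socket


def port_is_open(host, port=9100, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `port_is_open("tools-sgegrid-master.tools.eqiad.wmflabs")` returning False from the Prometheus host, while the same call succeeds from inside the region, would point at security group rules rather than at node-exporter itself.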