Toolforge: Port sge.py stats to Prometheus
Closed, Resolved, Public

Description

Convert sge.py from a Diamond collector to a Python script that will make metrics available to node-exporter.

Find another submit host to run it on instead of tools-services, so that the tools-services hosts can become non-submit hosts, which simplifies the infrastructure.
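
A minimal sketch of what the converted script's output side could look like, assuming the metrics are handed to node-exporter through its textfile collector (the task does not say which mechanism is used). The metric names are taken from the output pasted later in this task; the function names, the file name sge.prom, and the sample values are hypothetical, and gathering the numbers from the grid engine is not shown.

```python
import os
import tempfile


def render_metrics(host_jobs, queue_jobs, job_seqnum):
    """Render SGE stats as Prometheus text exposition lines.

    host_jobs:  {hostname: running job count}
    queue_jobs: {(queue, state): job count}
    job_seqnum: current job sequence number
    """
    lines = []
    for host in sorted(host_jobs):
        lines.append('sge_hostjobs{host="%s"} %d' % (host, host_jobs[host]))
    lines.append("sge_jobseqnum %d" % job_seqnum)
    for queue, state in sorted(queue_jobs):
        lines.append(
            'sge_queuejobs{queue="%s",state="%s"} %d'
            % (queue, state, queue_jobs[(queue, state)])
        )
    return "\n".join(lines) + "\n"


def write_textfile(text, directory, name="sge.prom"):
    """Write to a temp file, then rename, so a partial file is never read.

    The temp file does not end in .prom, so the textfile collector
    ignores it until the rename.
    """
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
    final_path = os.path.join(directory, name)
    os.replace(tmp_path, final_path)
    return final_path


if __name__ == "__main__":
    # Hypothetical sample values; a real script would query the grid engine.
    sample = render_metrics(
        {"tools-sgeexec-0902.tools.eqiad.wmflabs": 2},
        {("continuous", "r"): 20},
        22173,
    )
    # A throwaway directory stands in for the collector's textfile directory.
    print(write_textfile(sample, tempfile.mkdtemp()))
```

The script would then be run periodically (for example from cron or a timer) with the directory that node-exporter's textfile collector is configured to read.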

Event Timeline

GTirloni triaged this task as High priority. Dec 11 2018, 1:16 PM
GTirloni created this task.
Restricted Application added a subscriber: Aklapper. Dec 11 2018, 1:16 PM
Dzahn removed a subscriber: Dzahn. Jan 18 2019, 6:39 PM
bd808 claimed this task. Jan 19 2019, 7:33 PM
bd808 edited projects, added Toolforge; removed Goal, Operations.

Change 485372 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] toolforge: Prometheus replacement for sge.py diamond collector

https://gerrit.wikimedia.org/r/485372

Change 485372 merged by GTirloni:
[operations/puppet@production] toolforge: Prometheus replacement for sge.py diamond collector

https://gerrit.wikimedia.org/r/485372

# curl http://tools-sgegrid-master.tools.eqiad.wmflabs:9100/metrics | grep ^sge
sge_hostjobs{host="tools-sgeexec-0901.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0902.tools.eqiad.wmflabs"} 2
sge_hostjobs{host="tools-sgeexec-0903.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0904.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0905.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0906.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0907.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0908.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0909.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0910.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0911.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0912.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0913.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0914.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0915.tools.eqiad.wmflabs"} 2
sge_hostjobs{host="tools-sgeexec-0916.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0917.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0918.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0919.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0920.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0921.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0922.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0923.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0924.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0925.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0926.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0927.tools.eqiad.wmflabs"} 2
sge_hostjobs{host="tools-sgeexec-0928.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0929.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0930.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0931.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0932.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0933.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0934.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0935.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0936.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0937.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgeexec-0938.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0939.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0940.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0941.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgeexec-0942.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-generic-0901.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-generic-0902.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-generic-0903.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-generic-0904.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0901.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0902.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0903.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0904.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0905.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0906.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0907.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0908.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0909.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0910.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0911.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0912.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0913.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0914.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0915.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0916.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0917.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0918.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0919.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0920.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0921.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0922.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0923.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0924.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0925.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0926.tools.eqiad.wmflabs"} 1
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0927.tools.eqiad.wmflabs"} 0
sge_hostjobs{host="tools-sgewebgrid-lighttpd-0928.tools.eqiad.wmflabs"} 0
sge_jobseqnum 22173
sge_queuejobs{queue="continuous",state="r"} 20
sge_queuejobs{queue="task",state="r"} 6
sge_queuejobs{queue="webgrid-generic",state="r"} 1
sge_queuejobs{queue="webgrid-lighttpd",state="r"} 9
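
Output like the above can be sanity-checked by parsing the sample lines and totalling a metric. The following is a rough sketch only: the function names are hypothetical, the regular expressions cover just the simple `name{label="value"} number` lines seen here rather than the full exposition format, and anything that does not look like a sample (such as curl progress-meter text) is skipped.

```python
import re

# Matches simple samples such as: sge_hostjobs{host="example"} 2
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples from metrics text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line, e.g. progress-meter noise
        try:
            value = float(match.group("value"))
        except ValueError:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, value))
    return samples


def sum_metric(samples, name):
    """Total the values of every sample with the given metric name."""
    return sum(value for sample_name, _labels, value in samples
               if sample_name == name)


if __name__ == "__main__":
    # Three lines copied from the output above.
    text = (
        'sge_hostjobs{host="tools-sgeexec-0902.tools.eqiad.wmflabs"} 2\n'
        'sge_hostjobs{host="tools-sgeexec-0903.tools.eqiad.wmflabs"} 1\n'
        'sge_jobseqnum 22173\n'
    )
    print(sum_metric(parse_samples(text), "sge_hostjobs"))  # prints 3.0
```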
bd808 added a comment. Jan 23 2019, 5:33 PM

Looks like we might be having cross-region network issues between the eqiad1-r instances and https://tools-prometheus.wmflabs.org/tools/targets

Mentioned in SAL (#wikimedia-cloud) [2019-01-23T17:44:46Z] <bd808> Added rules to default security group for prometheus monitoring on port 9100 (T211684)

bd808 closed this task as Resolved. Jan 24 2019, 4:00 PM
bd808 added a subscriber: Andrew.

The security group changes needed a bit of help from @Andrew, who figured out that we were out of quota in eqiad1-r for security groups/rules. After raising the limit, things started working.