
monitoring of phabricator
Closed, Resolved, Public

Description

phabricator.wikimedia.org should have monitoring

we want to check at least a URL from external and some running processes

we do already monitor misc-web-lb in general, but not individual services on it

we can let a check talk to nginx and check https; if we use check_http -S we get free certificate monitoring as well.
we can also talk to the Apache backend on iridium to check if that is up, independent of misc-web
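A minimal Python sketch of what such an external HTTPS check does: open a TLS connection, read the HTTP status line, and report how many days remain on the certificate. It is not the check_http plugin itself. The names `check_https` and `days_until_expiry` are hypothetical, introduced only for illustration.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now_epoch):
    """Whole days between now_epoch and a certificate's notAfter string.

    not_after uses the format returned by SSLSocket.getpeercert(),
    e.g. "Jan 31 00:00:00 2015 GMT". Negative means already expired.
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return int((expires_epoch - now_epoch) // 86400)


def check_https(host, path="/", port=443, timeout=10.0):
    """Return (status_line, days_left) for one HTTPS HEAD request.

    The default context verifies the certificate chain and hostname,
    so an invalid certificate raises ssl.SSLError before any request is sent.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
            request = (
                f"HEAD {path} HTTP/1.1\r\n"
                f"Host: {host}\r\n"
                "Connection: close\r\n\r\n"
            )
            tls.sendall(request.encode("ascii"))
            # Only the first line of the response is needed for an up/down check.
            status_line = tls.recv(1024).split(b"\r\n", 1)[0].decode("latin-1")
    return status_line, days_until_expiry(cert["notAfter"], time.time())
```

Calling `check_https("phabricator.wikimedia.org")` would exercise the public path through nginx. A separate call against the backend host would cover the "independent of misc-web" case, assuming the backend accepts the same kind of request.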

T274 is similar, but for the legalpad instance

Event Timeline

Dzahn raised the priority of this task from to Needs Triage.
Dzahn updated the task description.
Dzahn added a project: acl*sre-team.
Dzahn changed Security from none to None.
Dzahn updated the task description.
Dzahn subscribed.

what we already had is a check that misc-web-lb is up in general, on IPv4 and IPv6

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=misc-web

suggested process monitoring of PhabricatorTaskmasterDaemon

https://gerrit.wikimedia.org/r/#/c/169585/

"currently we have around 20 processes running that match this

setting it to warn under 10 and over 40,
crit under 1 and over 50. let me know if you think i should
change these numbers

also, want me to additionally monitor processes matching these?

"PhabricatorGarbageCollectorDaemon"
"PhabricatorRepositoryPullLocalDaemon""

thresholds seem ok, a little trial and error here I guess.
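To make the proposed thresholds concrete, here is a small Python sketch of the range logic: count the processes whose command line matches a string, then map the count to a state. The function names are hypothetical. The defaults are the numbers quoted above (warn under 10 and over 40, crit under 1 and over 50). This illustrates the logic only, not the actual Icinga plugin.

```python
def count_matching(cmdlines, needle):
    """Count command lines that contain the given substring."""
    return sum(1 for cmdline in cmdlines if needle in cmdline)


def classify_process_count(count, warn_min=10, warn_max=40, crit_min=1, crit_max=50):
    """Map a process count to OK / WARNING / CRITICAL.

    Critical is checked first so that a count outside the critical range
    is never reported as a mere warning.
    """
    if count < crit_min or count > crit_max:
        return "CRITICAL"
    if count < warn_min or count > warn_max:
        return "WARNING"
    return "OK"


# Example: 20 matching daemons, roughly the number reported as currently running.
sample = ["php ./exec_daemon.php PhabricatorTaskmasterDaemon"] * 20
state = classify_process_count(count_matching(sample, "PhabricatorTaskmasterDaemon"))
```

With these defaults, the inclusive range 10 to 40 is OK. Counts from 1 to 9 and from 41 to 50 warn. Zero or more than 50 is critical.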

"also, want me to additionally monitor processes matching these?

"PhabricatorGarbageCollectorDaemon"
"PhabricatorRepositoryPullLocalDaemon""

I wouldn't; as long as phd is good, it controls the child procs.

commented on changeset :)

additional check for https URL from external:

https://gerrit.wikimedia.org/r/#/c/169604/

add check to LVS, hmm... not sure if we should

https://gerrit.wikimedia.org/r/#/c/169303/

and this one, i'll likely just abandon. https://gerrit.wikimedia.org/r/#/c/169265/2

^ works! can i resolve? now you want the same for legalpad too?

Dzahn mentioned this in Unknown Object (Diffusion Commit). Nov 15 2014, 1:05 AM
Dzahn mentioned this in Unknown Object (Diffusion Commit).
Dzahn mentioned this in Unknown Object (Diffusion Commit). Dec 10 2014, 8:37 PM
Dzahn mentioned this in Unknown Object (Diffusion Commit).

In T334250 I am wondering if we should remove the process monitoring part of this and only keep https monitoring.