
Capture metrics for success and failure of process-control jobs
Closed, Resolved (Public)

Description

We would like more visibility into the success / failure / status of the jobs run with process-control. Currently we get the count of jobs running and we get FailMail responses from run-job when jobs fail.

Currently, job status is stored in state files per job in: /var/lib/jenkins/state/$job.yaml
The file contains a history portion showing the last 10 execution runs, including start and finish times. It also records the status of each job, which is generally one of: completed, skipped, failed
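As a rough illustration of reading the latest status from one of these state files, here is a minimal sketch. It is not the collector that was deployed. The key names `history` and `status`, the oldest-to-newest ordering of the history entries, and the sample timestamps are all assumptions, since the exact schema is not shown in this task. The function works on an already-parsed mapping; in practice that would come from a YAML loader such as PyYAML's `yaml.safe_load`.

```python
from typing import Any, Mapping, Optional


def latest_status(state: Optional[Mapping[str, Any]]) -> Optional[str]:
    """Return the status of the most recent run, or None if unavailable."""
    # An empty state file typically parses to None, so check for it first.
    if state is None:
        return None
    # "history" is a hypothetical key name for the list of recent runs.
    history = state.get("history")
    if not history:
        # No history recorded yet; the caller can log this and move on.
        return None
    # Assumption: entries are ordered oldest to newest.
    # "status" is a hypothetical key name for the run outcome.
    return history[-1].get("status")


# Made-up sample data shaped like the description above.
example_state = {
    "history": [
        {"start": "2024-01-01T00:00:00", "finish": "2024-01-01T00:05:00",
         "status": "completed"},
        {"start": "2024-01-01T01:00:00", "finish": "2024-01-01T01:00:01",
         "status": "failed"},
    ]
}

print(latest_status(example_state))
```

Returning None for an empty file or an empty history keeps those edge cases distinct from a real status value, so they can be logged rather than reported as a failure.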

Initially I think we should start by capturing the latest status. Since we have the history, we may also want to move to a metric that reports the percentage of the last X runs that were not in a completed status, but that will need some thinking.
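One possible way to compute the "percentage of the last X runs not completed" idea is sketched below. The function name `non_completed_pct` and the `window` parameter are hypothetical, and the sketch assumes the statuses are supplied oldest to newest. Treating `skipped` as non-completed follows the wording above but may not be what is ultimately wanted.

```python
from typing import Iterable, Optional


def non_completed_pct(statuses: Iterable[str], window: int = 10) -> Optional[float]:
    """Percentage of the last `window` runs whose status is not 'completed'."""
    if window <= 0:
        raise ValueError("window must be a positive integer")
    # Keep only the most recent `window` entries (input is oldest to newest).
    recent = list(statuses)[-window:]
    if not recent:
        # No runs recorded, so there is nothing to report.
        return None
    # Anything other than 'completed' counts, including 'skipped' and 'failed'.
    not_completed = sum(1 for status in recent if status != "completed")
    return 100.0 * not_completed / len(recent)


runs = ["completed", "skipped", "failed", "completed"]
print(non_completed_pct(runs))            # all four runs considered
print(non_completed_pct(runs, window=1))  # only the most recent run
```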

Event Timeline

Have coded up the parsing of the state files. Just need to figure out how best to export this data so that it is useful for metrics/alerts.

Rolled out new collector and had some fun debugging edge cases. Data is being collected and tracked on the Process Control Jobs dashboard.

[frack::puppet] e967f501 New process_control job stats collector (Dallas Wisehaupt:Dallas Wisehaupt)
[frack::puppet] a3e136b3 Shift process_control_stats to a class and include on civicrm role (Dallas Wisehaupt:Dallas Wisehau
[frack::puppet] 60afb417 Missed on C for lowercasing (Dallas Wisehaupt:Dallas Wisehaupt)
[frack::puppet] 517f45f3 Was too agressive in my updating and forgot a class setting (Dallas Wisehaupt:Dallas Wisehaupt)
[frack::puppet] 653e5bca Missed another class assignment. fixing using mapping (Dallas Wisehaupt:Dallas Wisehaupt)
[frack::puppet] 05c41da0 Check and log if no history in the state file (Dallas Wisehaupt:Dallas Wisehaupt)
[frack::puppet] 8e563ca2 Change to is instead of == for NoneType check (Dallas Wisehaupt:Dallas Wisehaupt)
[frack::puppet] 896ed44d Check for empty input file (Dallas Wisehaupt:Dallas Wisehaupt)

Dwisehaupt triaged this task as Medium priority.
Dwisehaupt moved this task from In Progress to Done on the fundraising-tech-ops board.