While monitoring how the htmlCacheUpdate job backlog went down over the day, I noticed that the generic queue size and time delay don't really tell us anything interesting about the health of the htmlCacheUpdate queue. The metric that would actually show the queue health at a glance is the difference between the job.params.range.start page ID and the final page ID in the range, possibly as a percentage. This number would immediately give insight into how much work has actually been done in the queue.
I suppose that similarly useful metrics could be created for other job types that have proven to be problematic, but these cannot be made generic because they depend on the structure of the job. So I propose adding a mechanism to ChangeProp that makes it possible to add such custom metrics.
I envision it as something like a special per-rule config parameter that lists JS classes (functions basically) which receive a job event and possibly a couple more variables and then report a custom metric.
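To make the idea more concrete, here is a rough sketch of what one such metric function might look like for htmlCacheUpdate. Everything in it is hypothetical: the function name, the metric name, the `initialStart` argument (standing in for one of the "couple more variables") and the shape of the returned object are made up for illustration. It also assumes that the job event exposes numeric `params.range.start` and `params.range.end` page IDs, which would need to be checked against the real job.

```javascript
'use strict';

// Hypothetical custom metric for htmlCacheUpdate jobs.
// ASSUMPTION: the event carries numeric page IDs in params.range.start
// and params.range.end. This is not a confirmed job schema.
// initialStart is an optional, hypothetical extra variable: the page ID
// at which the whole range originally started.
function htmlCacheUpdateProgress(event, initialStart) {
    const range = event && event.params && event.params.range;
    if (!range || !Number.isFinite(range.start) || !Number.isFinite(range.end)) {
        // Job params are schema-less, so report nothing if the shape is off.
        return undefined;
    }
    const metric = {
        name: 'htmlCacheUpdate.range', // hypothetical metric name
        remaining: Math.max(0, range.end - range.start),
        percentDone: undefined
    };
    // A percentage only makes sense when the original start is known.
    if (Number.isFinite(initialStart) && range.end > initialStart) {
        const done = Math.min(Math.max(range.start - initialStart, 0), range.end - initialStart);
        metric.percentDone = 100 * done / (range.end - initialStart);
    }
    return metric;
}
```

A per-rule config parameter could then simply list functions like this one by name, and CP would call each of them for every matching job event.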
Issues I see with this approach are:
- We've been able to create generic metrics for all the jobs because the events have strict schemas, but job params are effectively schema-less, so these custom metrics will be broken by any change in the job structure. We can heavily try-catch these custom metric functions so that they don't break anything on the CP side. A much better solution would obviously be to add schemas for job events, but that's a separate story.
- Conceptually, these metric calculating functions will not belong in CP. We can move them to a separate npm module, though.
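The "heavily try-catch" part of the first issue could look roughly like the sketch below. The names `runCustomMetrics`, `report` and `onError` are hypothetical and do not correspond to anything that exists in CP today; the only point is that a throwing metric function is isolated from the others and from job processing.

```javascript
'use strict';

// Hypothetical runner: calls every custom metric function for a job event
// and isolates failures so that a broken metric cannot affect job execution.
// metricFns: array of functions (event) => metric or undefined
// report:    callback that receives each produced metric
// onError:   callback that receives (error, functionName) on failure
function runCustomMetrics(metricFns, event, report, onError) {
    for (const fn of metricFns) {
        try {
            const metric = fn(event);
            if (metric !== undefined) {
                report(metric);
            }
        } catch (err) {
            // Errors thrown by the metric function or by report() end up here;
            // the remaining metric functions still run.
            onError(err, fn.name);
        }
    }
}
```

With something like this in place, a change in the job structure would at worst make a custom metric disappear and produce a logged error, which is also a reasonable signal that the metric needs updating.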
I've created this task to collect opinions on whether you think this is a good idea; there's no rush to actually implement it.