
Wikibase DispatchChanges job potentially broken
Closed, Resolved · Public

Description

At 10:04 UTC, the wikidata-monitoring email address received an alert about “DispatchChanges Normal job backlog time (mean avg, 15min)”:

[1] Firing
Labels
alertname = DispatchChanges Normal job backlog time (mean avg, 15min) alert
alert_rule_uid = MF0FSjJ4z
contacts = "AlertManager","cxserver"
datasource_uid = 000000026
grafana_folder = Wikidata
ref_id = A
rule_uid = MF0FSjJ4z
severity = critical
team = wikidata
Annotations
alertId = 309
dashboardUid = TUJ0V-0Zk
orgId = 1
panelId = 28
grafana_state_reason = NoData
message = DispatchChanges job backlog is over 10 minutes! Normal values are between 0.5s and 1s
Source

According to another email received at 10:24 UTC, the alert was resolved, but the job’s graph in Grafana still doesn’t look good – the backlog time just cuts off:

[image.png: Grafana graph showing the backlog time abruptly cutting off]

We should figure out what’s going on here, and if anything is still broken.

Event Timeline

The cutoff in the 1h graph seems to be 45 minutes after the cutoff in the 15min graph, so if we speculate that the cause was 15 minutes before the cutoff in the 15min graph, that would give us roughly 9:42, which lines up pretty well with a SAL message about “changeprop-jobqueue” by @Clement_Goubert… do you have any idea?

Also, if I reload the Grafana tab, the graph starts to look quite different, which is very confusing:

[image.png: the same Grafana graph, looking quite different after a reload]

Ah, you have an alert on that metric, sorry :(
We switched the metric to a histogram because the aggregation was wrong; the job itself is OK.

@akosiaris is fixing the charts to fit the new metric type.
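(For illustration, here is a minimal sketch of what exposing the backlog as a histogram instead of a summary looks like at instrumentation time. It uses Python's prometheus_client and a made-up metric name; the actual changeprop/Wikibase instrumentation is not shown here.)

```python
# Illustrative only: hypothetical metric name, Python prometheus_client rather
# than the service's real instrumentation.
from prometheus_client import Histogram

# Explicit buckets covering the normal 0.5s-1s range up to the 10-minute alert
# threshold. Each bucket is a plain counter, so values from many instances can
# be summed before a quantile is computed, unlike a summary's pre-computed
# quantiles.
dispatch_changes_backlog = Histogram(
    "dispatch_changes_backlog_seconds",  # hypothetical name
    "Age of the change that a DispatchChanges job is processing",
    buckets=(0.5, 1, 2.5, 5, 10, 30, 60, 300, 600),
)

def record_backlog(seconds: float) -> None:
    """Called once per processed change with its current backlog time."""
    dispatch_changes_backlog.observe(seconds)
```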

From an end-user perspective, Wikidata edits on English Wikipedia recent changes still seem to arrive as usual, so I don’t think anything’s immediately on fire.

Ah, is that why it changed after a reload? :D

Yep, that was my (bad) attempt at fixing the graph. For what it's worth, these graphs aggregated Prometheus summaries, which are non-aggregatable, so they were wrong anyway.
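(To make the "non-aggregatable" point concrete, here is a small Python sketch with made-up numbers, not actual job data: averaging per-instance summary quantiles does not give the quantile of the combined data, whereas histogram buckets can be summed across instances first.)

```python
import statistics

# Made-up per-instance backlog samples (seconds) from two hypothetical job runners.
instance_a = [0.5, 0.6, 0.7, 0.8, 50.0]   # one slow outlier
instance_b = [0.5, 0.5, 0.6, 0.6, 0.7]

# A Prometheus summary exports a pre-computed quantile per instance:
p50_a = statistics.median(instance_a)      # 0.7
p50_b = statistics.median(instance_b)      # 0.6

# Averaging those quantiles -- roughly what the old panels did -- gives 0.65 ...
avg_of_medians = (p50_a + p50_b) / 2

# ... but the true p50 over the combined samples is 0.6.
true_p50 = statistics.median(instance_a + instance_b)

print(f"avg of per-instance p50s: {avg_of_medians:.2f}")   # 0.65
print(f"true p50 of combined data: {true_p50:.2f}")        # 0.60

# Histogram buckets, by contrast, are plain counters: they can be summed across
# instances, and histogram_quantile() then estimates the quantile from the
# summed buckets, which is why the dashboards switched to a histogram.
```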

Sorry about that. For what it's worth, we are approaching this piecemeal, and this is the first instance. There are more changeprop-related metrics that are wrongly summaries instead of histograms; we will ping you before changing the next few.

I think I have fixed the graphs now. They will definitely be more correct than before, when they were doing statistically wrong things (aggregating aggregates).

Fixed the alert too. It took me a bit to figure out how to find it; thanks for posting the link in the task.

Alright, thanks for making it make sense! Does that mean we can close this?

Clement_Goubert claimed this task.

I fixed your alert too; it will now fire if the 15-minute p50 goes over 10 minutes. We can resolve this, since you don't appear to have any other alerts on cp-jobqueue metrics.
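(For reference, a hedged Python sketch of how the new alert condition can be thought of: estimate the p50 from 15 minutes' worth of histogram buckets, roughly the way PromQL's histogram_quantile() does, and compare it against the 10-minute threshold. The bucket bounds and counts below are made up, and the real alert lives in Grafana, not in code like this.)

```python
# Made-up cumulative bucket counts (upper bound in seconds -> count) for a
# 15-minute window of DispatchChanges backlog observations.
BUCKETS = [(0.5, 120), (1.0, 340), (5.0, 380), (60.0, 390), (600.0, 395), (float("inf"), 400)]

def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative histogram buckets, using linear
    interpolation within a bucket, similar to PromQL's histogram_quantile()."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # can't interpolate into the +Inf bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return float("nan")

p50 = histogram_quantile(0.5, BUCKETS)
print(f"p50 backlog over 15min: {p50:.2f}s")                 # ~0.68s with these numbers
print("alert fires" if p50 > 600 else "alert stays quiet")   # threshold: 10 minutes
```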