
Wikibase DispatchChanges job potentially broken
Closed, Resolved · Public

Description

At 10:04 UTC, the wikidata-monitoring email address received an alert about “DispatchChanges Normal job backlog time (mean avg, 15min)”:

[1] Firing
Labels
alertname = DispatchChanges Normal job backlog time (mean avg, 15min) alert
alert_rule_uid = MF0FSjJ4z
contacts = "AlertManager","cxserver"
datasource_uid = 000000026
grafana_folder = Wikidata
ref_id = A
rule_uid = MF0FSjJ4z
severity = critical
team = wikidata
Annotations
alertId = 309
dashboardUid = TUJ0V-0Zk
orgId = 1
panelId = 28
grafana_state_reason = NoData
message = DispatchChanges job backlog is over 10 minutes! Normal values are between 0.5s and 1s
Source

According to another email received at 10:24 UTC, the alert was resolved, but the job’s graph in Grafana still doesn’t look good – the backlog time just cuts off:

[image.png: Grafana graph showing the backlog time abruptly cutting off]

We should figure out what’s going on here, and if anything is still broken.

Event Timeline

The cutoff in the 1h graph seems to be 45 minutes after the cutoff in the 15min graph, so if we speculate that the cause was 15 minutes before the cutoff in the 15min graph, that would give us roughly 9:42, which lines up pretty well with a SAL message about “changeprop-jobqueue” by @Clement_Goubert… do you have any idea?

Also, if I reload the Grafana tab, the graph starts to look quite different, which is very confusing:

[image.png: the same Grafana graph, looking quite different after a reload]

Ah, you have an alert on that metric, sorry :(
We switched the metric to a histogram because the aggregation was wrong; the job itself is OK.

@akosiaris is fixing the charts to fit the new metric type.
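(For illustration, here is a minimal sketch of what exposing the backlog as a histogram instead of a summary looks like at instrumentation time. It uses Python's prometheus_client and a made-up metric name; the actual changeprop/Wikibase instrumentation is not shown here.)

```python
# Illustrative only: hypothetical metric name, Python prometheus_client rather
# than the service's real instrumentation.
from prometheus_client import Histogram

# Explicit buckets covering the normal 0.5s-1s range up to the 10-minute alert
# threshold. Each bucket is a plain counter, so values from many instances can
# be summed before a quantile is computed, unlike a summary's pre-computed
# quantiles.
dispatch_changes_backlog = Histogram(
    "dispatch_changes_backlog_seconds",  # hypothetical name
    "Age of the change that a DispatchChanges job is processing",
    buckets=(0.5, 1, 2.5, 5, 10, 30, 60, 300, 600),
)

def record_backlog(seconds: float) -> None:
    """Called once per processed change with its current backlog time."""
    dispatch_changes_backlog.observe(seconds)
```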

From an end-user perspective, Wikidata edits on English Wikipedia recent changes still seem to arrive as usual, so I don’t think anything’s immediately on fire.

Ah, is that why it changed after a reload? :D

Yep, that was my (bad) attempt at fixing the graph. For what it's worth, these graphs aggregated Prometheus summaries, which are non-aggregatable, so they were wrong anyway.
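(To make the "non-aggregatable" point concrete, here is a small Python sketch with made-up numbers, not actual job data: averaging per-instance summary quantiles does not give the quantile of the combined data, whereas histogram buckets can be summed across instances first.)

```python
import statistics

# Made-up per-instance backlog samples (seconds) from two hypothetical job runners.
instance_a = [0.5, 0.6, 0.7, 0.8, 50.0]   # one slow outlier
instance_b = [0.5, 0.5, 0.6, 0.6, 0.7]

# A Prometheus summary exports a pre-computed quantile per instance:
p50_a = statistics.median(instance_a)      # 0.7
p50_b = statistics.median(instance_b)      # 0.6

# Averaging those quantiles -- roughly what the old panels did -- gives 0.65 ...
avg_of_medians = (p50_a + p50_b) / 2

# ... but the true p50 over the combined samples is 0.6.
true_p50 = statistics.median(instance_a + instance_b)

print(f"avg of per-instance p50s: {avg_of_medians:.2f}")   # 0.65
print(f"true p50 of combined data: {true_p50:.2f}")        # 0.60

# Histogram buckets, by contrast, are plain counters: they can be summed across
# instances, and histogram_quantile() then estimates the quantile from the
# summed buckets, which is why the dashboards switched to a histogram.
```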

Sorry about that. For what it's worth, we are approaching this piecemeal, and this is the first instance. There are more changeprop-related metrics that are wrongly summaries instead of histograms; we will ping you before changing the next few.

I think I have fixed the graphs now. They will definitely be more correct than before, when they were doing statistically wrong things (aggregating aggregates).

Fixed the alert too. It took me a bit to figure out how to find it; thanks for posting the link in the task.

Alright, thanks for making it make sense! Does that mean we can close this?

Clement_Goubert claimed this task.

I fixed your alert too; it will now fire if the 15-minute p50 goes over 10 minutes. We can resolve this, since you don't appear to have any other alerts on cp-jobqueue metrics.
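(For reference, a hedged Python sketch of how the new alert condition can be thought of: estimate the p50 from 15 minutes' worth of histogram buckets, roughly the way PromQL's histogram_quantile() does, and compare it against the 10-minute threshold. The bucket bounds and counts below are made up, and the real alert lives in Grafana, not in code like this.)

```python
# Made-up cumulative bucket counts (upper bound in seconds -> count) for a
# 15-minute window of DispatchChanges backlog observations.
BUCKETS = [(0.5, 120), (1.0, 340), (5.0, 380), (60.0, 390), (600.0, 395), (float("inf"), 400)]

def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative histogram buckets, using linear
    interpolation within a bucket, similar to PromQL's histogram_quantile()."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # can't interpolate into the +Inf bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return float("nan")

p50 = histogram_quantile(0.5, BUCKETS)
print(f"p50 backlog over 15min: {p50:.2f}s")                 # ~0.68s with these numbers
print("alert fires" if p50 > 600 else "alert stays quiet")   # threshold: 10 minutes
```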