
gerrit: replication monitoring improvement
Open, Medium, Public

Assigned To
Authored By
ABran-WMF
Feb 23 2026, 8:16 AM
Referenced Files
F72466885: gerrit_replication_metrics.png
Mar 1 2026, 7:52 PM
F72296220: GitHub_replication_gaps.png
Feb 23 2026, 10:53 PM
F72290409: image.png
Feb 23 2026, 1:40 PM
F72279291: gerrit_per_replica_retries.png
Feb 23 2026, 11:19 AM
F72277432: Gerrit_replication_counters.png
Feb 23 2026, 10:51 AM
F72277331: Gerrit_replication_rates_and_delta.png
Feb 23 2026, 10:51 AM

Description

With T416912: Replication to GitHub seems to have stalled, we saw that the replication plugin did not reload properly when its configuration was updated. None of the metrics exposed by Gerrit reflected the replication staleness that followed the buggy reload. This task tracks the monitoring tweaks to address that issue.

[] write mtail configuration to follow gerrit logs WIP

Event Timeline

Change #1238315 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] gerrit: add mtail monitoring on replication

https://gerrit.wikimedia.org/r/1238315

ABran-WMF triaged this task as Medium priority.Feb 23 2026, 9:06 AM
ABran-WMF moved this task from Incoming to Backlog on the collaboration-services board.

From my comment on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1238315/comments/a98b94a7_a77584ef , the Gerrit replication plugin exposes counts of replication-related events, e.g.:

Metric                                   Value
events_ref_replicated_total              51911.0
events_ref_replication_done_total        11808.0
events_ref_replication_scheduled_total   51911.0

I added new panels to the Gerrit replication dashboard. For Feb 9th, when the GitHub replication broke, that gives:

Rates of scheduled/replicated refs + rate of replication batches that have completed:

Gerrit_replication_rates_and_delta.png (368×1 px, 45 KB)

The delta of total scheduled ref replications minus total replicated refs:

Gerrit_replication_counters.png (360×775 px, 35 KB)

But because the GitHub replica was broken/improperly reloaded, the tasks were never scheduled and are thus not showing in the queue or number of scheduled replications.

I have also added the per-replica retries (whatever that means) count and rate:

gerrit_per_replica_retries.png (464×618 px, 47 KB)

Thanks @hashar for the graphs! The issue we had happened between 12:xx and 18:xx and I found nothing reflecting that in these graphs. At least, nothing that I could translate into a PromQL alerting query. I may have missed something though!
Do you have an example of a PromQL query that you think could reflect that issue without adding any additional metrics? Adding metrics from mtail would give us an easy way to create an alert, where the query would be something like irate(mtail_gerrit_replication_suceeded[10m]) != 0
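
To make the proposed query concrete, here is a minimal Python sketch of what an irate()-style expression computes from the last two samples of a counter. The sample values are made up and the mtail counter is only the proposal above, so this models the arithmetic rather than any deployed alert.

```python
from typing import Optional, Sequence, Tuple

# (unix timestamp in seconds, counter value); the values used with this are made up.
Sample = Tuple[float, float]


def irate(samples: Sequence[Sample]) -> Optional[float]:
    """Per-second rate from the last two samples, modelled on PromQL irate()."""
    if len(samples) < 2:
        return None  # not enough data points in the window
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    if t1 <= t0:
        return None
    # A counter that went down is treated as a reset: count from zero.
    increase = v1 - v0 if v1 >= v0 else v1
    return increase / (t1 - t0)


def has_recent_success(samples: Sequence[Sample]) -> bool:
    """True when the counter moved between the last two scrapes."""
    rate = irate(samples)
    return rate is not None and rate != 0
```

A zero rate only says that nothing succeeded between the last two scrapes; on its own it cannot tell a broken remote from a period with nothing to replicate.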

I think I found something interesting on the dashboard you linked:

https://grafana.wikimedia.org/goto/Ls1wWNOvR?orgId=1

We could just use that metric as an alerting query, because the last two replication incidents are visible:

image.png (1×1 px, 123 KB)

I'll send a patch soon with that alert

Change #1242399 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/alerts@master] gerrit: alert for broken replication

https://gerrit.wikimedia.org/r/1242399

Change #1238315 abandoned by Arnaudb:

[operations/puppet@production] gerrit: add mtail monitoring on replication

Reason:

https://gerrit.wikimedia.org/r/c/operations/alerts/+/1242399

https://gerrit.wikimedia.org/r/1238315

Awesome! For reference the metric is:

events_ref_replication_scheduled_total - events_ref_replicated_total

I could not explain those two down spikes and did not investigate, but I am happy to see that they correlate with the two broken replications 🎉
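
As a rough illustration of that expression, the Python sketch below computes the delta from two counter series and lists the samples where it jumps. The function names, the sample values and the tolerance are hypothetical; this is not the rule from the alerts patch.

```python
from typing import List, Sequence


def delta_series(scheduled: Sequence[float], replicated: Sequence[float]) -> List[float]:
    """events_ref_replication_scheduled_total - events_ref_replicated_total, per sample."""
    return [s - r for s, r in zip(scheduled, replicated)]


def jumps(delta: Sequence[float], tolerance: float) -> List[int]:
    """Indices where the delta moved by more than `tolerance` since the previous sample."""
    return [
        i for i in range(1, len(delta))
        if abs(delta[i] - delta[i - 1]) > tolerance
    ]


# Made-up samples: the two counters move together until the third scrape.
scheduled = [100, 110, 120, 130]
replicated = [100, 110, 135, 145]
delta = delta_series(scheduled, replicated)
suspicious = jumps(delta, tolerance=5)
```

With these made-up samples only the third scrape is flagged, since the delta stays flat afterwards.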

The events_* metrics correspond to events emitted by the plugin; they are exposed by Gerrit itself rather than by the plugin. The Gerrit metrics doc is at https://gerrit.wikimedia.org/r/Documentation/metrics.html . To find when those events are triggered I dug into the plugin source code ( https://gerrit.googlesource.com/plugins/replication ):

$ cd src/main/java/com/googlesource/gerrit/plugins/replication
$ git grep 'TYPE = "ref'
events/RefReplicatedEvent.java:  public static final String TYPE = "ref-replicated";
events/RefReplicationDoneEvent.java:  public static final String TYPE = "ref-replication-done";
events/ReplicationScheduledEvent.java:  public static final String TYPE = "ref-replication-scheduled";

ref-replication-scheduled is a single push being scheduled.

I wasn't sure what the difference between ref-replicated and ref-replication-done is, so I dug into the code:

PushResultProcessing.java
@Override
public void onRefReplicatedToOneNode(
    String project,
    String ref,
    URIish uri,
    RefPushResult status,
    RemoteRefUpdate.Status refStatus) {
  postEvent(new RefReplicatedEvent(project, ref, uri, status, refStatus));
}

So that is one push to one replica which emits RefReplicatedEvent and thus the ref-replicated metric.

And in the same file:

PushResultProcessing.java
@Override
public void onRefReplicatedToAllNodes(String project, String ref, int nodesCount) {
  postEvent(new RefReplicationDoneEvent(project, ref, nodesCount));
}

ref-replication-done means that the ref got pushed to all nodes. That is the Completed replications panel. I would have expected it to stall when the replication to GitHub did not work, but since the plugin bug caused the GitHub remote to be removed, that did not block replications to the other replicas and the counter kept increasing.

However the left over scheduled events had nowhere to be pushed to, and thus the queue kept increasing? Something like that.
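
The reading above can be put into a toy model. The Python sketch below is only an interpretation of the three events (one scheduled and one replicated event per ref and remote, one done event per ref), not the plugin's actual code; the Counters class, replicate_ref and the remote names other than github are made up for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Counters:
    scheduled: int = 0   # ref-replication-scheduled
    replicated: int = 0  # ref-replicated
    done: int = 0        # ref-replication-done


def replicate_ref(counters: Counters, remotes: Sequence[str]) -> None:
    """Model one ref update pushed successfully to every configured remote."""
    for _ in remotes:
        counters.scheduled += 1   # one push scheduled per remote
    for _ in remotes:
        counters.replicated += 1  # one push completed per remote
    # Emitted once the ref reached all remotes the plugin knows about.
    counters.done += 1


counters = Counters()

# Hypothetical remote names; only "github" comes from this task.
replicate_ref(counters, ["github", "replica-a", "replica-b"])

# After the buggy reload the github remote is simply missing from the list.
replicate_ref(counters, ["replica-a", "replica-b"])

print(counters)
```

In this model scheduled and replicated stay equal and done keeps increasing even though nothing reaches github, which matches the observation that tasks for the missing remote were never scheduled.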

Another thing I looked at tonight is the delay/latency graphs. They show the values for different quantiles, which remain unchanged when there is no replication:

plugins_replication_replication_delay_github{quantile="0.5",} 15000.0
plugins_replication_replication_delay_github{quantile="0.75",} 15000.0
plugins_replication_replication_delay_github{quantile="0.95",} 15000.0
plugins_replication_replication_delay_github{quantile="0.98",} 15000.0
plugins_replication_replication_delay_github{quantile="0.99",} 15000.0
plugins_replication_replication_delay_github{quantile="0.999",} 15000.0

The metric thus has a continuous value and the graph shows a flat bar. Historically that is how I detected that GitHub had a stalled replication: by noticing a flat graph compared to the other replicas. That is not very easy.

There is an associated counter:

plugins_replication_replication_delay_github_count 3509.0

My assumption is that a replication causes the counter to increase. Thus if the counter has a non-zero rate, a replication happened; otherwise nothing happened and the quantile values can be filtered out. How to do that? Well, hmm, a lot of heuristics and eventually:

plugins_replication_replication_delay_$target{site="eqiad",instance="$instance",}
and ignoring(quantile)
rate({"plugins_replication_replication_delay_${target}_count", instance="$instance", site="eqiad"}[$__rate_interval]) !=0
  • Pick the delay values, which carry an extra quantile label
  • AND them with the next metric, ignoring the extra quantile label so that the vectors match (the next metric does not have it)
  • Compute the rate of the counter; no value is returned when it is 0

The lack of a value for the rate causes all quantiles from the first metric to be discarded.
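
The matching can be mimicked outside Prometheus. The Python sketch below is a simplified model of `and ignoring(quantile)` combined with `!= 0`, using made-up label values; the helper names are hypothetical and it does not cover the full PromQL vector-matching rules.

```python
from typing import Dict, FrozenSet, Tuple

Labels = FrozenSet[Tuple[str, str]]
Vector = Dict[Labels, float]


def labels(**pairs: str) -> Labels:
    """Build a hashable label set, e.g. labels(instance="x", quantile="0.5")."""
    return frozenset(pairs.items())


def without(label_set: Labels, name: str) -> Labels:
    """Drop one label, as ignoring(<name>) does for matching purposes."""
    return frozenset((k, v) for k, v in label_set if k != name)


def nonzero(vector: Vector) -> Vector:
    """Model of `!= 0`: series whose value is 0 are dropped entirely."""
    return {l: v for l, v in vector.items() if v != 0}


def and_ignoring(left: Vector, right: Vector, ignored: str) -> Vector:
    """Keep left-hand series that have a match on the right, ignoring one label."""
    right_keys = {without(l, ignored) for l in right}
    return {l: v for l, v in left.items() if without(l, ignored) in right_keys}


# The instance value is made up; 15000.0 is the flat value quoted above.
delay = {
    labels(instance="gerrit", quantile="0.5"): 15000.0,
    labels(instance="gerrit", quantile="0.99"): 15000.0,
}
idle_rate = {labels(instance="gerrit"): 0.0}   # counter did not move
busy_rate = {labels(instance="gerrit"): 0.2}   # counter increased

hidden = and_ignoring(delay, nonzero(idle_rate), "quantile")
shown = and_ignoring(delay, nonzero(busy_rate), "quantile")
```

When the counter rate is 0 the right-hand side is empty, so every quantile series is dropped and the panel shows a gap instead of a flat line.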

Here is a view for Feb 9th which shows some gaps; I think that will be easier to see.

GitHub_replication_gaps.png (463×1 px, 45 KB)

Also, over the weekend I felt like we should have a view of the replication plugin WorkQueue. It does not emit metrics, which I think was forgotten when metrics were introduced later on.

https://gerrit-review.googlesource.com/c/plugins/replication/+/554781

That would give us the number of threads and the queue size instead of deriving it from the events (events_ref_replication_scheduled_total - events_ref_replicated_total).

There are two dashboards for them:

https://grafana.wikimedia.org/d/tllYBrhGz/queues , which I built by creating a panel for each queue. Then I found out upstream has some dashboards and imported them at https://grafana.wikimedia.org/d/Zh_ncGsWk/queues-upstream

I guess those can be slightly improved; that would benefit replication monitoring once its WorkQueue starts reporting metrics.

🎉

> Another thing I looked at tonight is the delay/latency graphs. They show the values for different quantiles, which remain unchanged when there is no replication:
>
> plugins_replication_replication_delay_github{quantile="0.5",} 15000.0
> plugins_replication_replication_delay_github{quantile="0.75",} 15000.0
> plugins_replication_replication_delay_github{quantile="0.95",} 15000.0
> plugins_replication_replication_delay_github{quantile="0.98",} 15000.0
> plugins_replication_replication_delay_github{quantile="0.99",} 15000.0
> plugins_replication_replication_delay_github{quantile="0.999",} 15000.0
>
> The metric thus has a continuous value and the graph shows a flat bar. Historically that is how I detected that GitHub had a stalled replication: by noticing a flat graph compared to the other replicas. That is not very easy.

Thanks for highlighting that metric. I'm wondering whether the quantiles follow the same pattern when no one is sending new things to Gerrit and replication is quiescent. That was the hard part when looking for a relevant metric to alert on. No replication does not necessarily mean we should be alerted; it could also reflect a quieter period for Gerrit.

> Here is a view for Feb 9th which shows some gaps; I think that will be easier to see.
>
> GitHub_replication_gaps.png (463×1 px, 45 KB)

That is what I meant above ↑: at 4am there is a gap, but we should not alert, because replication is working; there is simply nothing to replicate.
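
One way to separate a quiet period from a broken remote is to compare targets with each other, which is what eyeballing a flat graph against the other replicas amounts to. The Python sketch below is a hedged illustration of that idea, not the merged alert; the function name, the rates and every target name except github are hypothetical.

```python
from typing import List, Mapping


def stalled_targets(count_rates: Mapping[str, float]) -> List[str]:
    """Targets with no replication while at least one other target is active.

    `count_rates` maps a replication target to the rate of its delay counter.
    """
    if not any(rate > 0 for rate in count_rates.values()):
        # Everything is quiet: nothing is being replicated, nothing to flag.
        return []
    return sorted(t for t, rate in count_rates.items() if rate <= 0)


# Quiet period (e.g. the 4am gap): no target is flagged.
quiet = stalled_targets({"github": 0.0, "replica-a": 0.0})

# Another replica is busy while github is flat: github is flagged.
broken = stalled_targets({"github": 0.0, "replica-a": 0.3})
```

This assumes all targets replicate roughly the same projects; a target with a narrower project filter could be idle for legitimate reasons while the others are busy.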

> Also, over the weekend I felt like we should have a view of the replication plugin WorkQueue. It does not emit metrics, which I think was forgotten when metrics were introduced later on.
>
> https://gerrit-review.googlesource.com/c/plugins/replication/+/554781
>
> That would give us the number of threads and the queue size instead of deriving it from the events (events_ref_replication_scheduled_total - events_ref_replicated_total).
>
> There are two dashboards for them:
>
> https://grafana.wikimedia.org/d/tllYBrhGz/queues , which I built by creating a panel for each queue. Then I found out upstream has some dashboards and imported them at https://grafana.wikimedia.org/d/Zh_ncGsWk/queues-upstream
>
> I guess those can be slightly improved; that would benefit replication monitoring once its WorkQueue starts reporting metrics.
>
> 🎉

Thanks for the patch on Gerrit and the dashboards. I'll see if I can find something on these that would have caught the replication issue we had.

Nothing obvious on either of these for the Feb. 09 issue:

I think we're safe enough with that patch until the metrics you're adding upstream in the plugin are merged and we update it

Change #1242399 merged by jenkins-bot:

[operations/alerts@master] gerrit: alert for broken replication

https://gerrit.wikimedia.org/r/1242399

This can be considered done; follow-up tasks have been created.

Change #1243102 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/alerts@master] gerrit: limit GerritHAProxyServiceUnavailable scope

https://gerrit.wikimedia.org/r/1243102

I am reopening because I would like to:

  • Polish up the Gerrit > replication dashboard
  • document this task's findings on the dashboard (so we have them immediately available)
  • upstream the dashboard changes to https://gerrit.googlesource.com/gerrit-monitoring/
  • and look into the forged metric Total Scheduled - Total Replicated. The counters are off by one; there is no reason for the delta to become negative (why would more refs be replicated than scheduled?), but I guess it derails when a replica is not reachable. I don't think that can be recovered unless the counters are reset by restarting Gerrit.

Change #1243102 merged by jenkins-bot:

[operations/alerts@master] gerrit: limit GerritHAProxyServiceUnavailable scope

https://gerrit.wikimedia.org/r/1243102

Polish up the Gerrit > replication dashboard

I have changed the Latency and Delay panels to heatmaps per quantile. That is slightly nicer and the latency/delay spikes show up clearly compared to the timeseries panel, which used a logarithmic scale.

I am not entirely happy about it: the color scale goes Green, Yellow, Red, Purple and it is bound by the min/max values in the series. Thus:

  • with a range of 0ms to 2 hours (a problem), the 2 hours stands out in purple
  • with a range of 0ms to 90ms (a good latency), the 90ms bucket shows in purple as well, which is awkward.

Maybe that is fixable using thresholds or by somehow hardcoding the max value.
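
The effect of a hardcoded maximum can be shown with a small model. The Python sketch below maps a value to one of the four colors by its position between two bounds; it is a simplification for illustration, not Grafana's actual color interpolation, and the 2 hour fixed maximum is only an example value.

```python
from typing import Sequence

PALETTE = ("green", "yellow", "red", "purple")


def bucket_color(value_ms: float, lo: float, hi: float,
                 palette: Sequence[str] = PALETTE) -> str:
    """Pick a color by the position of value_ms between lo and hi."""
    if hi <= lo:
        return palette[0]
    # Clamp to [0, 1] so out-of-range values use the first or last color.
    fraction = min(max((value_ms - lo) / (hi - lo), 0.0), 1.0)
    index = min(int(fraction * len(palette)), len(palette) - 1)
    return palette[index]


TWO_HOURS_MS = 2 * 60 * 60 * 1000

# Scale bound by the series min/max: a healthy 90ms maximum still ends up purple.
auto_scaled = bucket_color(90, lo=0, hi=90)

# Scale with a hardcoded maximum: 90ms stays green, 2 hours is purple.
fixed_good = bucket_color(90, lo=0, hi=TWO_HOURS_MS)
fixed_bad = bucket_color(TWO_HOURS_MS, lo=0, hi=TWO_HOURS_MS)
```

With a fixed upper bound the same color always means the same latency, at the cost of losing detail among small values.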

I have split the retries rate and count into individual panels. It is easier to visualize this way.

document this task's findings on the dashboard (so we have them immediately available)

I have added metric descriptions to the panels (shown via the panel's info icon).

At the top of the panels are links to the replication plugin documentation: plugin configuration and Metrics documentation.

Last night I manually triggered a replication for mediawiki/extensions and mediawiki/skins with a URL filter for github. That generated a few hundred replications, which caused the two metrics to become disconnected.

https://grafana.wikimedia.org/d/d4a4da73-c27f-4ce6-a9e5-ab84dd7a4ebb/replication?orgId=1&from=2026-02-28T20:00:00.000Z&to=2026-03-01T01:00:00.000Z&timezone=utc&var-instance=gerrit.wikimedia.org:443&var-target=$__all

gerrit_replication_metrics.png (401×934 px, 41 KB)

The left panel is events_ref_replication_scheduled_total - events_ref_replicated_total.

The right panel is the individual breakdown for each of them:

events_ref_replication_scheduled_total   green
events_ref_replicated_total              yellow

We have roughly 1400 extensions and skins, which lines up with the delta. Those manually added replications are added to events_ref_replication_scheduled_total but are not taken into account in events_ref_replicated_total :-\ They track different things, respectively node replications and individual refs. There are some events/metrics missing and ideally they could use some renaming to clarify.

I'm handing you the task @hashar as it looks like all the remaining steps are yours!
Happy to help if needed; otherwise I'll follow up on T418216: Create alerts using the new replication metrics exposed in Gerrit once T418215: Update gerrit replication plugin with new metrics is over.