Review logging cluster merge pressure
Open, Stalled, Low, Public

Description

Following the theme of the parent task, I spent some time today experimenting with a performance row in the logstash cluster health dashboard at https://grafana-rw.wikimedia.org/d/oXH_v3rWk/logstash-cluster-health?forceLogin&orgId=1&refresh=30s&from=now-30d&to=now

One of the metrics added there is merge pressure, and it seems worth a closer look. At present the logging clusters appear to spend between 30% and 50% of their total merge time in a throttled state.
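For reference, the throttled share can be derived from two cumulative counters that Elasticsearch and OpenSearch expose in the `merges` section of their stats responses: `total_time_in_millis` and `total_throttled_time_in_millis`. A minimal sketch follows, assuming the dashboard panel computes something equivalent. The function name is hypothetical and the sample numbers are made up for illustration.

```python
def throttled_merge_ratio(merges: dict) -> float:
    """Return the fraction of total merge time that was spent throttled.

    `merges` is the "merges" section of an index or node stats response.
    """
    total = merges.get("total_time_in_millis", 0)
    throttled = merges.get("total_throttled_time_in_millis", 0)
    if total <= 0:
        # No merge activity recorded yet, so there is nothing to report.
        return 0.0
    return throttled / total


# Made-up sample values, not taken from the production cluster.
sample = {"total_time_in_millis": 120_000, "total_throttled_time_in_millis": 48_000}
print(f"{throttled_merge_ratio(sample):.0%}")
```

Because both counters are cumulative, a dashboard would normally compare their rates of change over a time window rather than the raw totals.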

Screenshot 2025-04-11 at 2.15.55 PM.png (534×3 px, 241 KB)

Creating a task to look into what can be done to reduce this merge throttling/pressure.

Some ideas off hand:

Event Timeline

Change #1136394 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: increase refresh_interval to 10s in index templates

https://gerrit.wikimedia.org/r/1136394
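For readers unfamiliar with the setting, here is a rough sketch of what a `refresh_interval` override in an index template body looks like. It is not the content of the patch: the `logstash-*` pattern and the helper name are assumptions for illustration, and the real templates live in operations/puppet.

```python
import json


def template_with_refresh_interval(pattern: str, interval: str) -> dict:
    """Build a minimal index template body that only sets refresh_interval."""
    return {
        "index_patterns": [pattern],  # hypothetical pattern, not from the patch
        "settings": {
            "index": {
                # How often newly indexed documents become visible to search.
                "refresh_interval": interval,
            }
        },
    }


body = template_with_refresh_interval("logstash-*", "10s")
print(json.dumps(body, indent=2))
```

A longer interval means fewer, larger segments are produced per unit of time, which is the mechanism expected to reduce merge work; the cost is that new events take longer to show up in searches.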

Thinking out loud: the Prometheus scrape interval of 30s, less the processing time through kafka+logstash, could be a rough guideline for when things would start to feel slow by comparison. We might be able to double the refresh_interval again, into the ballpark of 20s, if it helps.

Change #1136400 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: set "index.translog.durability": "async" as template default

https://gerrit.wikimedia.org/r/1136400
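As background on this setting: with `index.translog.durability` set to `async`, the translog is fsynced in the background every `index.translog.sync_interval` (5s by default) rather than on every request, so operations acknowledged since the last sync can be lost if a node crashes. A small sketch of the settings fragment, with a hypothetical helper name and not copied from the patch:

```python
def async_translog_settings(sync_interval: str = "5s") -> dict:
    """Return index settings that trade some durability for fewer fsyncs."""
    return {
        # "request" (the default) fsyncs before acknowledging each request;
        # "async" fsyncs in the background on a timer instead.
        "index.translog.durability": "async",
        # Upper bound on how much acknowledged data a crash could lose.
        "index.translog.sync_interval": sync_interval,
    }


settings = async_translog_settings()
print(settings)
```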

Change #1136394 merged by Herron:

[operations/puppet@production] logstash: increase refresh_interval to 10s in index templates

https://gerrit.wikimedia.org/r/1136394

Following https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136394/comments/9a5047f0_96b7ed56 , I lack the context and do not know what the index refresh interval is or why it had to be tuned from 5 seconds to 10 seconds. I can only assume it means indexing now runs every ten seconds, introducing a delay before events can be found via a search. Presumably that is done because the system is overloaded.

The scap logstash checker has a window of 20 seconds (canary_wait_time: 20). Previously, with an update every 5 seconds, we would at worst have seen 15 seconds of traffic. With the refresh window now at 10 seconds, the span of visible events could be down to 10 seconds. That might make the check less sensitive. Maybe it is not an issue; we shall see.
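The arithmetic behind that concern can be sketched as follows. This is a simplification that assumes the only delay is the refresh interval and ignores processing time through kafka+logstash; the function name is hypothetical.

```python
def worst_case_visible_seconds(window_s: float, refresh_interval_s: float) -> float:
    """Worst-case span of events searchable at the end of the check window.

    Events indexed after the last refresh inside the window are not yet
    visible, so up to one full refresh interval of traffic can be missing.
    """
    return max(window_s - refresh_interval_s, 0.0)


CANARY_WAIT_TIME_S = 20  # canary_wait_time from the scap config

for refresh_s in (5, 10, 20):
    visible = worst_case_visible_seconds(CANARY_WAIT_TIME_S, refresh_s)
    print(f"refresh_interval={refresh_s}s -> at least {visible:.0f}s of traffic visible")
```

Under this model, a refresh_interval equal to the canary window would leave the checker, in the worst case, with no visible events at all.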

The apifeature usage is very spammy, and we can certainly index it less often. I don't think there is any good reason for having near-real-time indexing for it. Raising the indexing interval to 30 seconds, or even 1 minute, might relieve some more pressure. A message to that effect could be added on https://meta.wikimedia.org/wiki/Special:ApiFeatureUsage.

Our MediaWiki config enables a lot of logging buckets. I am pretty sure we keep adding to the list and rarely revisit them. There are plenty of log buckets configured at debug level, and we have session logs which are super spammy. Maybe some of the pressure can be removed on the emitter side?

Is there any profiling of Logstash behavior? Maybe it runs a lot of regexes against MediaWiki messages which are not using contextualized messages?

Thanks for outlining this! I realize now that adjusting refresh_interval is big enough for a task of its own, so I made T392092: Review logging index refresh_intervals to help organize this, and we can discuss further there.

We've reverted this with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1137027

Revert "logstash: increase refresh_interval to 10s in index templates" Reason for revert: did not observe a noticeable improvement or change in performance with this, reverting to original values.

herron changed the task status from Open to Stalled. Jun 18 2025, 2:44 PM
herron triaged this task as Low priority.

Change #1136400 abandoned by Herron:

[operations/puppet@production] logstash: set "index.translog.durability": "async" as template default

https://gerrit.wikimedia.org/r/1136400