IPoid: Rate of "request handled" log events flattened
Closed, ResolvedPublic

Description

On Friday, 12 September, around 12:00 UTC, the rate of "request handled" events became suspiciously constant, far flatter than during the earlier daily cycles. Additionally, when zooming into the intervals, most entries are reported at the beginning of every third hour instead of being evenly distributed across the time window (a few events still arrive later, though).

The number of events in a three-hour interval is about one third of the number seen between 0:00 and 3:00 UTC (the daily minimum).
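To put a number on the "most entries arrive at the start of every third hour" observation, something like the following rough sketch could be run over timestamps exported from Logstash. The function name, the 10-minute "head" of each window, and the assumption that timestamps are timezone-aware `datetime` objects are all illustrative and not taken from the IPoid or Logstash setup.

```python
from datetime import datetime, timezone


def front_loaded_fraction(timestamps, window_hours=3, head_minutes=10):
    """Share of events that fall in the first `head_minutes` of their window.

    Windows are aligned to the Unix epoch, so 3-hour windows start at
    0:00, 3:00, 6:00, ... UTC. Timestamps must be timezone-aware;
    naive datetimes would be interpreted in local time.
    """
    if not timestamps:
        return 0.0
    window = window_hours * 3600
    head = head_minutes * 60
    in_head = sum(1 for ts in timestamps if int(ts.timestamp()) % window < head)
    return in_head / len(timestamps)


# Hypothetical sample: three events just after 3:00 UTC, one at 4:30 UTC.
sample = [
    datetime(2025, 9, 13, 3, 0, 5, tzinfo=timezone.utc),
    datetime(2025, 9, 13, 3, 1, 0, tzinfo=timezone.utc),
    datetime(2025, 9, 13, 3, 2, 0, tzinfo=timezone.utc),
    datetime(2025, 9, 13, 4, 30, 0, tzinfo=timezone.utc),
]
print(front_loaded_fraction(sample))
```

For evenly distributed events the result should be close to 10/180 (about 0.06) with these defaults; a value near 1 would match the batching seen in the chart.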

image.png (604×1 px, 51 KB)

Request rate in Grafana seems unchanged: https://grafana.wikimedia.org/d/6C9Bm6uVz/ipoid?forceLogin=true&from=now-7d&orgId=1&to=now&var-container_name=ipoid-production&var-dc=000000026&var-prometheus=k8s&var-service=ipoid&var-site=eqiad&timezone=utc&viewPanel=panel-196

Logstash URL: https://logstash.wikimedia.org/goto/cc47c299c9a57d7d43b6f0f45e8551ef

Event Timeline

kostajh updated the task description.
kostajh renamed this task from IPoid: Rate of "request handled" log events flatened to IPoid: Rate of "request handled" log events flattened. Sep 16 2025, 7:49 AM

The sawtooth pattern reminded me of T395899. It shouldn't be related, but maybe there was a similar throttle somewhere affecting these logs.

The log volume seems to have gone mostly back to normal on Monday afternoon. Here's the same search, but with a bigger chart: https://logstash.wikimedia.org/goto/35a757159907937d81e30015adb08663

image.png (233×1 px, 33 KB)

> The sawtooth pattern reminded me of T395899. It shouldn't be related, but maybe there was a similar throttle somewhere affecting these logs.

The post-Kafka throttle/rate limit described in T395899 is not configured to affect these logs. Before dropping logs, the throttler tags the events with throttle_warning, and those tagged events can still be found in Logstash. There are no throttle_warning events in that query.
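For anyone wanting to repeat the throttle check on an exported result set, a minimal sketch could look like the one below. It assumes each event is a dict with a `tags` list, which is the common Logstash convention but is not confirmed here for these particular logs; the function name and sample events are hypothetical.

```python
def count_throttle_warnings(events, tag="throttle_warning"):
    """Count events whose `tags` list contains the throttle marker.

    Assumes each event is a dict with an optional `tags` list; events
    without tags (or with tags set to None) are treated as untagged.
    """
    return sum(1 for event in events if tag in (event.get("tags") or []))


# Hypothetical exported events, not real IPoid log entries.
exported = [
    {"message": "request handled", "tags": ["es", "syslog"]},
    {"message": "request handled"},
    {"message": "request handled", "tags": None},
]
print(count_throttle_warnings(exported))
```

A count of zero is consistent with the throttler not having touched these events, matching the query result described above.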

We observed no logging system outages during that time. Just prior to recovery, however, there was a noticeable dip in proxy/httpd logs flowing towards logstash.

STran claimed this task.
STran subscribed.

Not sure if there's any action on PSI's end and it seems to have self-resolved:

image.png (538×1 px, 69 KB)

Given that, I'm going to be bold™ and close this, but if someone disagrees, feel free to re-open.