Page: cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80%
Closed, Resolved (Public)

Assigned To
Authored By
Eevans
Feb 10 2024, 3:40 AM
Referenced Files
F41835075: image.png
Feb 10 2024, 3:45 AM
F41835064: image.png
Feb 10 2024, 3:45 AM
F41835021: image.png
Feb 10 2024, 3:40 AM
F41835013: image.png
Feb 10 2024, 3:40 AM

Description

8:36 PM <+jinxer-wm> (Primary outbound port utilisation over 80%  #page) firing: Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
8:38 PM <+jinxer-wm> (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
8:41 PM <+jinxer-wm> (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads

image.png (470×1 px, 406 KB)

image.png (835×1 px, 124 KB): codfw
image.png (835×1 px, 239 KB): eqiad

Event Timeline

Looking at this briefly (it's Saturday and the moment has passed): the request rate goes up somewhat, which looks unusual but is not at a level I would expect to cause an issue. Both frontend and backend network utilisation are significantly elevated, though, which makes me wonder whether this was a lot of hits on an original rather than a thumb or similar.
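A rough way to check the originals-versus-thumbs hunch is to divide throughput by request rate and see whether the implied average response size jumped. This is only a sketch: the function name and all the numbers are made up for illustration and are not read from the dashboards attached to this task.

```python
def mean_response_bytes(throughput_bits_per_s: float, requests_per_s: float) -> float:
    """Average response size in bytes implied by a throughput and a request rate."""
    if requests_per_s <= 0:
        raise ValueError("requests_per_s must be positive")
    return throughput_bits_per_s / 8 / requests_per_s


# Hypothetical figures only (not taken from this incident).
baseline = mean_response_bytes(throughput_bits_per_s=8e9, requests_per_s=10_000)
spike = mean_response_bytes(throughput_bits_per_s=32e9, requests_per_s=12_000)

# Requests rose by 20% in this made-up case, but bytes per response rose far more.
growth = spike / baseline
print(f"baseline: {baseline:.0f} B/req, spike: {spike:.0f} B/req, growth: {growth:.2f}x")
```

If the growth factor is well above the growth in request rate, larger objects (such as originals) are a plausible explanation, although a small number of very large transfers would produce the same signature.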

Could "Too many eqiad mediawiki originals uploads" be a red herring? The traffic jumps are all in codfw. I'm not sure what's actionable in this ticket.

> Could "Too many eqiad mediawiki originals uploads" be a red herring? The traffic jumps are all in codfw.

Honestly, that alert is what prompted me to look at the Swift dashboards where network throughput corresponds with the increase on cr2-codfw:

image.png (835×1 px, 124 KB)

Looking back, I guess I didn't question why that alert had fired for eqiad (which seems pretty curious to me now that you bring it up).

> I'm not sure what's actionable in this ticket.

Maybe there isn't anything. I opened this ticket before heading off to bed, in case it wasn't the end of it, and someone else picked it up.

hnowlan claimed this task.

>> Could "Too many eqiad mediawiki originals uploads" be a red herring? The traffic jumps are all in codfw.
>
> Honestly, that alert is what prompted me to look at the Swift dashboards where network throughput corresponds with the increase on cr2-codfw:
>
> image.png (835×1 px, 124 KB)
>
> Looking back, I guess I didn't question why that alert had fired for eqiad (which seems pretty curious to me now that you bring it up).

To be honest, that probably needs its own investigation: checking back, it seems this alert has flapped at least once a day almost every day this year.
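For whoever picks up that investigation, one way to put a number on "almost every day" is to count the distinct days on which the alert fired. The sketch below assumes the firing timestamps have already been exported from somewhere (the export step is not shown, and the helper name and sample timestamps are invented for illustration).

```python
from collections.abc import Iterable
from datetime import date, datetime


def share_of_days_with_firing(
    firing_times: Iterable[datetime], start: date, end: date
) -> float:
    """Fraction of days in [start, end] on which the alert fired at least once."""
    total_days = (end - start).days + 1
    if total_days <= 0:
        raise ValueError("end must not be before start")
    # Collapse multiple firings on the same day into one entry.
    days_seen = {t.date() for t in firing_times if start <= t.date() <= end}
    return len(days_seen) / total_days


# Made-up timestamps, not real alert history.
example_firings = [
    datetime(2024, 2, 1, 3, 15),
    datetime(2024, 2, 1, 19, 40),
    datetime(2024, 2, 2, 8, 5),
    datetime(2024, 2, 4, 22, 30),
]
share = share_of_days_with_firing(example_firings, date(2024, 2, 1), date(2024, 2, 4))
print(f"alert fired on {share:.0%} of days in the window")
```

A share close to 1.0 over a long window would suggest the threshold is tuned too tightly for normal traffic, rather than pointing at anything specific to this incident.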

>> I'm not sure what's actionable in this ticket.
>
> Maybe there isn't anything. I opened this ticket before heading off to bed, in case it wasn't the end of it, and someone else picked it up.

Totally fair! For now I might resolve it.