
Reduce impact of Elastic snapshots
Closed, Resolved (Public)

Description

Per today's Wikimedia-Search IRC conversation with @BCornwall, we inadvertently triggered an LVS alert for high network bandwidth by running an Elastic snapshot/restore (see Grafana). Creating this ticket to:

  • Discuss ways to reduce strain on external services (LVS, Swift) during snapshot operations with owning teams.
  • Implement changes, if necessary.

Updated task description above to reflect IRC discussions re: alert detuning (Traffic/Infrastructure Foundations) vs. outbound traffic rate-limiting from ES hosts (Data Platform SRE).
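
For reference, one way the rate-limiting option could look on the Elastic side is a per-repository throttle. The sketch below only builds the request body for registering (or updating) a snapshot repository; `max_snapshot_bytes_per_sec` and `max_restore_bytes_per_sec` are standard Elasticsearch snapshot repository settings, while the repository type, bucket name, and the 20mb values are assumptions for illustration, not what is deployed.

```python
import json


def throttled_repo_settings(existing, snapshot_rate="20mb", restore_rate="20mb"):
    """Return a copy of the repository settings with throttle limits added.

    The rate values are hypothetical placeholders; pick them with the
    teams that own LVS and Swift.
    """
    settings = dict(existing)  # copy so the caller's dict is not mutated
    settings["max_snapshot_bytes_per_sec"] = snapshot_rate
    settings["max_restore_bytes_per_sec"] = restore_rate
    return settings


def repo_request_body(repo_type, settings):
    """Serialize the JSON body for a PUT to _snapshot/<repository name>."""
    return json.dumps({"type": repo_type, "settings": settings}, indent=2)


if __name__ == "__main__":
    # "s3" and the bucket name are assumptions, not the real configuration.
    current = {"bucket": "example-snapshot-bucket"}
    print(repo_request_body("s3", throttled_repo_settings(current)))
```

The limits apply per node, so aggregate traffic toward Swift still scales with the number of nodes taking part in the snapshot.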

Event Timeline

bking added a subscriber: CDanis.

Per today's IRC discussion in the security channel, @CDanis mentioned detuning or removing the LVS alerts for internal hosts, so I'll set this one to blocked for the moment. Chris and/or Brett, let us know what your teams decide.

bking renamed this task from Reduce network impact of Elastic snapshots to Reduce impact of Elastic snapshots. Nov 17 2023, 4:09 PM
bking updated the task description.

Just wanted to add that Envoy is deployed for Swift frontends per today's SRE meeting.

That being said, we (Search Platform/Data Platform SRE) would prefer not to implement Envoy across our Elastic fleet without confirmation from Traffic and/or Data Persistence that this internal LVS bandwidth alert actually represents a problem. Tagging @MatthewVernon for awareness.

I don't expect the change to make a difference to how anyone is using Swift - moving from nginx to Envoy for TLS termination was more about bringing Swift up to date in terms of TLS termination, and getting better observability and reliability.

Update: Traffic team merged a patch that makes these LVS high RX alerts non-paging. Thus, I believe we won't be inadvertently paging on-call SREs every time we take a snapshot. As such, I'm closing out this ticket.

Please feel free to reopen if Search Platform/Data Platform SRE does need to take further action.

bking claimed this task.