InterfaceSpeedError - mw2286
Closed, ResolvedPublic

Description

Common information

  • alertname: InterfaceSpeedError
  • cluster: api_appserver
  • instance: mw2286:9100
  • interface: eno1
  • job: node
  • operstate: up
  • prometheus: ops
  • severity: task
  • site: codfw
  • source: prometheus
  • team: dcops

Event Timeline

The cable or the 1G SFP might need to be replaced. Can we downtime the server for a small window to test the cabling?

Dzahn renamed this task from InterfaceSpeedError to InterfaceSpeedError - mw2286. May 14 2024, 4:37 PM
Dzahn added a project: serviceops.

Mentioned in SAL (#wikimedia-operations) [2024-05-14T16:39:17Z] <mutante> depooled mw2286.codfw.wmnet because of interface error / needed cable replacement T364863

Mentioned in SAL (#wikimedia-operations) [2024-05-14T16:40:46Z] <dzahn@cumin2002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mw2286.codfw.wmnet with reason: T364863

Mentioned in SAL (#wikimedia-operations) [2024-05-14T16:41:00Z] <dzahn@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mw2286.codfw.wmnet with reason: T364863

@Jhancock.wm cc: @RLazarus I depooled the server and set a downtime of 24 hours.

I reseated the SFP and the cable. The link looks fixed and has remained steady for an hour. It should be good to add back.
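For anyone repeating this check after a reseat, here is a minimal sketch of reading the negotiated link speed on the host. It assumes a Linux host that exposes the speed in Mb/s under `/sys/class/net/<interface>/speed`. The helper names `link_speed_mbps` and `speed_ok` are hypothetical, and the expected value of 1000 Mb/s is an assumption based on the 1G SFP mentioned above. It is not how the InterfaceSpeedError alert itself is evaluated.

```python
from pathlib import Path


def link_speed_mbps(interface, sysfs_root="/sys/class/net"):
    """Return the negotiated speed in Mb/s, or None if it cannot be determined.

    Hypothetical helper: reads the sysfs 'speed' attribute. The kernel may
    refuse the read (OSError) when the link is down, or report -1 when the
    speed is unknown; both cases are mapped to None.
    """
    speed_file = Path(sysfs_root) / interface / "speed"
    try:
        raw = speed_file.read_text().strip()
    except OSError:
        return None
    try:
        speed = int(raw)
    except ValueError:
        return None
    return speed if speed > 0 else None


def speed_ok(interface, expected_mbps, sysfs_root="/sys/class/net"):
    """True only if the interface reports exactly the expected speed."""
    return link_speed_mbps(interface, sysfs_root) == expected_mbps
```

On the host, `speed_ok("eno1", 1000)` would be the call for this interface. Running it periodically over the observation window gives a rough view of whether the link stays steady.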

Mentioned in SAL (#wikimedia-operations) [2024-05-15T14:38:26Z] <claime> Removing downtime on mw2286.codfw.wmnet - T364863

Mentioned in SAL (#wikimedia-operations) [2024-05-15T14:39:37Z] <claime> Repooling mw2286.codfw.wmnet - T364863

Clement_Goubert subscribed.

Host downtime removed and repooled, thanks @Dzahn @Jhancock.wm