@elukey moving this server to make room for an-workers. Can I do this Monday 8 Feb @ 15:15 UTC? It will be in the same rack, so network ports will stay the same; I just need 5 minutes of downtime to relocate it.
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | odimitrijevic | T255145 Analytics Hardware for Fiscal Year 2020/2021 |
| Resolved | | elukey | T255146 Put 24 Hadoop worker nodes in service (cluster expansion) |
| | | | Unknown Object (Task) |
| Resolved | | Cmjohnson | T273982 eqiad: move db1111 to rack A8 |
| Resolved | | Cmjohnson | T273983 eqiad: Move maps1001 same rack A4 |
| Resolved | | wiki_willy | T260445 (Need By: TBD) rack/setup/install an-worker11[18-41] |
| Resolved | Request | Jclark-ctr | T267065 eqiad: Server moves to free up space on 10g racks |
| Resolved | Request | Cmjohnson | T268810 decommission es1015.eqiad.wmnet |
| Resolved | Request | Cmjohnson | T268100 decommission es1011.eqiad.wmnet |
| Resolved | Request | Cmjohnson | T268101 decommission es1012.eqiad.wmnet |
| Resolved | Request | Cmjohnson | T268812 decommission es1016.eqiad.wmnet |
Event Timeline
Adding @hnowlan to confirm whether the time window is OK for the host (we briefly chatted about it on IRC).
The idea is to:
- shut down the node
- move it to a different rack within the same row (no IP change, no VLAN change)
- boot it up again
It should take 30 minutes at most, though that also depends on how busy Chris is in the DC, whether there are emergencies, etc. (a rough command-level sketch of the host-side steps follows below).
Since this will free up space for the Analytics Hadoop workers, thanks a lot!
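For reference, a minimal sketch of the host-side sequence, assuming the standard WMF conftool `pool`/`depool` wrapper scripts are available on the host; the Icinga downtime and the physical move itself are handled separately and aren't shown:

```
# Hypothetical operator-side sequence for the maps1001 move; exact
# tooling (cookbooks, downtime commands) may differ in practice.

# 1. Drain traffic from the host before the physical move.
ssh maps1001.eqiad.wmnet 'sudo depool'

# 2. Shut the node down cleanly so it can be unracked.
ssh maps1001.eqiad.wmnet 'sudo shutdown -h now'

# --- physical move to the new rack happens here ---

# 3. Once the host is booted in the new rack, verify it is healthy
#    and put it back into rotation.
ssh maps1001.eqiad.wmnet 'sudo pool'
```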
@hnowlan just a heads-up that it looks like the depool of maps1001 left maps@eqiad underprovisioned:
https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&from=1612777514651&to=1612819057362
The still-pooled servers have been at 100% CPU for some hours, and we also had a page due to some HTTP requests timing out.
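A quick way to check the pool state in a situation like this, assuming the standard `confctl` CLI for conftool (the selector tags here are illustrative):

```
# Inspect pooled state and weights for all maps hosts in eqiad.
# Prints one JSON object per matching host, roughly of the form:
#   {"maps1002.eqiad.wmnet": {"pooled": "yes", "weight": 10}, ...}
confctl select 'dc=eqiad,cluster=maps' get
```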
Thanks for the heads-up, @CDanis - I've repooled. It appears there were some issues with the weights of the other maps hosts that should have prevented this from having an impact; I've rectified that now too.
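For the record, the repool and weight fix amount to something like the following `confctl` invocations (the second host name and the weight value are hypothetical, for illustration only, not the exact ones used):

```
# Put maps1001 back into rotation.
confctl select 'name=maps1001.eqiad.wmnet' set/pooled=yes

# Bring the weight of another maps host back in line with the rest of
# the cluster (hypothetical host and value).
confctl select 'name=maps1002.eqiad.wmnet' set/weight=10

# Verify the end state across the cluster.
confctl select 'dc=eqiad,cluster=maps' get
```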