Monday Oct 30 2017, 14:30 UTC
T168584 - Reboot labsdb1001.eqiad.wmnet (aka c1.labsdb) for kernel updates
- There is a possibility of catastrophic hardware failure in this reboot. There will be no way to recover the server or the data it currently hosts if that happens.
Tuesday Nov 07 2017, 14:30 UTC
T168584 - Reboot labsdb1003.eqiad.wmnet (aka c3.labsdb) for kernel updates
- Cancelled due to hardware failure on labsdb1001.eqiad.wmnet and subsequent failover of all *.labsdb traffic to this host.
- *.labsdb service names switched to point at *.analytics.db.svc.eqiad.wmflabs equivalents.
- User-created tables will not be allowed on the new servers.
- DBAs will stop replication from production hosts to labsdb1003.eqiad.wmnet
- DBAs will make databases on labsdb1003.eqiad.wmnet read-only for all users
- labsdb1001.eqiad.wmnet removed from service permanently.
- labsdb1003.eqiad.wmnet removed from service permanently.
- c1.labsdb service name will be removed from DNS.
- c3.labsdb service name will be removed from DNS.
labsdb1001 and labsdb1003 are the last old servers from a particular batch still in use, and they are blocking the return of that batch.
The purchased hosts labsdb1009/10/11, intended as replacements, are in full production and available to be used instead. Because of the improved architecture there (allowing real high availability, load balancing and automatic failover), however, a conscious decision was made not to cover all use cases, in particular direct(?) writes to user databases (T156869), so the migration may not be 100% transparent and may impact users (some programming changes may be needed). In all other areas, however, the new hosts are more powerful, better managed and have better data quality.
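As one illustration of the kind of programming change that may be needed, a tool that hard-codes *.labsdb host names could derive the new service name from the old one. The following is only a minimal sketch in Python. It assumes that the suffix is the only part of the name that changes, as the timeline above describes. The function name `new_service_name` and the example name `enwiki.labsdb` are hypothetical and used purely for illustration.

```python
OLD_SUFFIX = ".labsdb"
NEW_SUFFIX = ".analytics.db.svc.eqiad.wmflabs"

# Service names that are removed from DNS rather than repointed
# (taken from the timeline; there is no new equivalent to map them to).
RETIRED_NAMES = {"c1.labsdb", "c3.labsdb"}


def new_service_name(old_name: str) -> str:
    """Translate an old *.labsdb service name to its new-style equivalent.

    Assumption: only the suffix changes; the prefix is kept unchanged.
    """
    if old_name in RETIRED_NAMES:
        raise ValueError(f"{old_name} is retired and has no replacement name")
    if not old_name.endswith(OLD_SUFFIX) or old_name == OLD_SUFFIX:
        raise ValueError(f"{old_name!r} is not a *.labsdb service name")
    prefix = old_name[: -len(OLD_SUFFIX)]
    return prefix + NEW_SUFFIX


if __name__ == "__main__":
    # Hypothetical example name, for illustration only.
    print(new_service_name("enwiki.labsdb"))
```

Tools that write to user databases would need more than a host name change, since user-created tables are not allowed on the new servers.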
The Cloud team should probably set up a roadmap to determine when the decommission can happen; otherwise, rather than a decommission process, we will have an unplanned outage. The current hosts are failing component by component, have multiple hw/IPMI alerts, and their storage is not redundant disk-wise (due to disk space constraints, which are still a growing issue). In general, it is unlikely they will survive more than a few months.