Migrate all users to new Wiki Replica cluster and decommission old hardware
Closed, Resolved, Public

Description

Timeline

Monday Oct 30 2017, 14:30 UTC
T168584 - Reboot labsdb1001.eqiad.wmnet (aka c1.labsdb) for kernel updates

  • There is a possibility of catastrophic hardware failure in this reboot. There will be no way to recover the server or the data it currently hosts if that happens.

Tuesday Nov 07 2017, 14:30 UTC
T168584 - Reboot labsdb1003.eqiad.wmnet (aka c3.labsdb) for kernel updates

  • Cancelled due to hardware failure on labsdb1001.eqiad.wmnet and subsequent failover of all *.labsdb traffic to this host.

Wednesday 2017-12-13

  • *.labsdb service names switched to point at *.analytics.db.svc.eqiad.wmflabs equivalents.
  • User created tables will not be allowed on the new servers.

Thursday 2017-12-14

  • DBAs will stop replication from production hosts to labsdb1003.eqiad.wmnet
  • DBAs will make databases on labsdb1003.eqiad.wmnet read-only for all users

Wednesday 2018-01-17

  • labsdb1001.eqiad.wmnet removed from service permanently.
  • labsdb1003.eqiad.wmnet removed from service permanently.
  • c1.labsdb service name will be removed from DNS.
  • c3.labsdb service name will be removed from DNS.

labsdb1001 and labsdb1003 are the last old servers from a particular batch that are still in use, and they are blocking the return of that batch.

The hosts purchased as a replacement, labsdb1009/10/11, are in full production and available to be used instead. Because of the improved architecture there (allowing real high availability, load balancing and automatic failover), however, there is a (conscious) decision not to cover all use cases, in particular direct(?) writes to user databases (T156869), so the migration may not be 100% transparent and may impact users (some programming changes may be needed). In all other areas, however, the new hosts are more powerful, better managed and have better data quality.

Cloud team should probably set up a roadmap to understand when the decommission can happen; otherwise, rather than a decommission process, we will have an unplanned outage: the current hosts are failing component by component, have multiple hw/IPMI alerts, their storage is not redundant disk-wise (due to disk space constraints, which are still a growing issue), and in general it is unlikely they will survive more than a few months.

Related Objects

Status     Assigned       Task
Resolved   jcrespo
Resolved   bd808
Resolved   bd808
Resolved   bd808
Resolved   bd808
Resolved   madhuvishy
Resolved   bd808
Resolved   bd808
Resolved   bd808
Resolved   zhuyifei1999
Open       bd808
Resolved   jcrespo
Resolved   bd808
Resolved   eranroz
Resolved   Marostegui
Declined   None
Open       None
Stalled    None
Open       None
Declined   None
Stalled    None
Stalled    None
Resolved   zhuyifei1999
Declined   None
Resolved   madhuvishy
Resolved   jcrespo
Resolved   bd808
Resolved   Marostegui
Resolved   Banyek
Resolved   Banyek
Resolved   Cmjohnson
bd808 renamed this task from Decommission labsdb1001 and labsdb1003 to Migrate all users to new Wiki Replica cluster and decommission old hardware. Sep 29 2017, 10:47 PM
bd808 updated the task description.
bd808 updated the task description. Oct 19 2017, 12:27 AM
bd808 updated the task description. Oct 19 2017, 12:33 AM

Change 397256 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] labsdb: Point DNS at equivalent web.db.svc.eqiad.wmflabs hosts

https://gerrit.wikimedia.org/r/397256

Change 397256 merged by Madhuvishy:
[operations/puppet@production] labsdb: Point DNS at equivalent web.db.svc.eqiad.wmflabs hosts

https://gerrit.wikimedia.org/r/397256

bd808 updated the task description. Dec 13 2017, 11:53 PM

@Marostegui and @jcrespo: Could one of you please stop replication on labsdb1003 and make the databases read-only (matching labsdb1001) at your earliest convenience?

I've updated the timelines a bit. I am now suggesting that we give the users a final 3 week period ending on 2018-01-03 before we take labsdb1001 and labsdb1003 offline for good. This will give tool maintainers who have managed not to get any of the other announcements a small amount of time to archive their user tables or migrate them to tools.db.svc.eqiad.wmflabs.

Death blow for GHEL coordinate extraction and WikiMiniAtlas. 🙁

Mentioned in SAL (#wikimedia-operations) [2017-12-14T07:18:15Z] <marostegui> Stop replication and set read-only on labsdb1003 - T142807

@Marostegui and @jcrespo: Could one of you please stop replication on labsdb1003 and make the databases read-only (matching labsdb1001) at your earliest convenience?

mysql:root@localhost [(none)]> stop all slaves;
Query OK, 0 rows affected, 5 warnings (0.04 sec)

mysql:root@localhost [(none)]> show warnings;
+-------+------+------------------------+
| Level | Code | Message                |
+-------+------+------------------------+
| Note  | 1938 | SLAVE 's2' stopped     |
| Note  | 1938 | SLAVE 's4' stopped     |
| Note  | 1938 | SLAVE 's7' stopped     |
| Note  | 1938 | SLAVE 's6' stopped     |
| Note  | 1938 | SLAVE 'db1095' stopped |
+-------+------+------------------------+
5 rows in set (0.00 sec)

mysql:root@localhost [(none)]> set global read_only=ON;
Query OK, 0 rows affected (0.00 sec)
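As a rough sketch only (not something that was run on labsdb1003), the `show warnings` output above could be checked programmatically to confirm that every expected replication connection was reported as stopped. The function names and the list of expected connections below are assumptions made for illustration; the row format is taken from the session output shown above.

```python
import re
from typing import List

# Matches rows such as: | Note  | 1938 | SLAVE 's2' stopped     |
STOPPED_ROW = re.compile(
    r"^\|\s*Note\s*\|\s*1938\s*\|\s*SLAVE '([^']+)' stopped\s*\|$"
)


def stopped_connections(warnings_output: str) -> List[str]:
    """Return the connection names reported as stopped, in output order."""
    names = []
    for line in warnings_output.splitlines():
        match = STOPPED_ROW.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


def missing_connections(warnings_output: str, expected: List[str]) -> List[str]:
    """Return the expected connection names that were NOT reported as stopped."""
    stopped = set(stopped_connections(warnings_output))
    return [name for name in expected if name not in stopped]


# The warnings table copied from the session above.
SAMPLE = """\
+-------+------+------------------------+
| Level | Code | Message                |
+-------+------+------------------------+
| Note  | 1938 | SLAVE 's2' stopped     |
| Note  | 1938 | SLAVE 's4' stopped     |
| Note  | 1938 | SLAVE 's7' stopped     |
| Note  | 1938 | SLAVE 's6' stopped     |
| Note  | 1938 | SLAVE 'db1095' stopped |
+-------+------+------------------------+
"""

if __name__ == "__main__":
    # Hypothetical expectation: the five connections seen in the output.
    expected = ["s2", "s4", "s6", "s7", "db1095"]
    print(stopped_connections(SAMPLE))
    print(missing_connections(SAMPLE, expected))
```

An empty list from `missing_connections` would mean all expected connections were stopped before `read_only` was enabled.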

I've updated the timelines a bit. I am now suggesting that we give the users a final 3 week period ending on 2018-01-03 before we take labsdb1001 and labsdb1003 offline for good. This will give tool maintainers who have managed not to get any of the other announcements a small amount of time to archive their user tables or migrate them to tools.db.svc.eqiad.wmflabs.

Makes sense, thanks!

@bd808 could we change the old service names to point to the analytics hosts instead? I think (I may be wrong) that they are pointing to the web one, and most of the non-migrated scripts seem to be long-running queries on crons. Thanks.

bd808 added a comment. Dec 14 2017, 5:24 PM

@bd808 could we change the old service names to point to the analytics hosts instead? I think (I may be wrong) that they are pointing to the web one, and most of the non-migrated scripts seem to be long-running queries on crons. Thanks.

Yes, we can certainly do that. I guessed arbitrarily that the web use case would be the dominant one, but if we are mostly seeing the opposite all we need to do is update the labsdb zone file in puppet to target the other cluster alias.
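To illustrate the alias switch described above, here is a minimal sketch that derives the new service name for a legacy `*.labsdb` name on either cluster. It is not the actual zone file or puppet code; the function name is hypothetical, and the one-to-one wildcard mapping is an assumption based on the `*.labsdb` to `*.analytics.db.svc.eqiad.wmflabs` pattern in the task description. The real zone file may map individual names differently.

```python
LEGACY_SUFFIX = ".labsdb"
NEW_DOMAIN = "db.svc.eqiad.wmflabs"
CLUSTERS = ("web", "analytics")


def new_service_name(legacy_name: str, cluster: str = "analytics") -> str:
    """Map a legacy '<name>.labsdb' alias to '<name>.<cluster>.db.svc.eqiad.wmflabs'.

    Assumes a simple wildcard substitution; this is an illustration,
    not the mapping used in the production zone file.
    """
    if cluster not in CLUSTERS:
        raise ValueError(f"unknown cluster: {cluster!r}")
    if not legacy_name.endswith(LEGACY_SUFFIX):
        raise ValueError(f"not a legacy labsdb name: {legacy_name!r}")
    prefix = legacy_name[: -len(LEGACY_SUFFIX)]
    if not prefix:
        raise ValueError("legacy name has no host part")
    return f"{prefix}.{cluster}.{NEW_DOMAIN}"


if __name__ == "__main__":
    # The same legacy alias resolved against each cluster.
    print(new_service_name("enwiki.labsdb", "web"))
    print(new_service_name("enwiki.labsdb", "analytics"))
```

Under this assumption, moving the default from one cluster to the other is a single parameter change, which matches the point that only the target cluster alias needs updating.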

We may want to wait to do that until after I've done T181492: sql command should point to the new labsdb servers where I will make sql enwiki connect to the analytics instance(s). That may change the mix for us as well.

@bd808 We really need the proposed change now, before the "web" server explodes: the web one is supposed to be fast and responsive, and it is hard to convince people to use it for quick queries if it is the slowest because it is the default. See: T182997 and T182995. Plus: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=12&fullscreen&orgId=1&var-server=labsdb1011&var-network=eth0&from=1513347373505&to=1513372272955 vs https://grafana.wikimedia.org/dashboard/file/server-board.json?orgId=1&var-server=labsdb1010&var-network=eth0&from=1513347373505&to=1513372272955&refresh=1m&panelId=12&fullscreen

Change 398551 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] wiki replicas: point *.labsdb to *.analytics.db.svc.eqiad.wmflabs

https://gerrit.wikimedia.org/r/398551

Change 398551 merged by Andrew Bogott:
[operations/puppet@production] wiki replicas: point *.labsdb to *.analytics.db.svc.eqiad.wmflabs

https://gerrit.wikimedia.org/r/398551

JJMC89 updated the task description. Dec 17 2017, 12:15 AM

Hi!

What's the status of labsdb1001 and 1003 regarding their decommissioning? Wednesday 2018-01-03 has already passed :-)

Hi!

What's the status of labsdb1001 and 1003 regarding their decommissioning? Wednesday 2018-01-03 has already passed :-)

This week's madness kind of stole our attention; I'll make a note to discuss this Mon/Tue next week. AFAIK nothing is pointing at the old stuff explicitly, and it's a small announcement and a formality at this point.

bd808 added a comment. Jan 5 2018, 7:43 PM

Hi!

What's the status of labsdb1001 and 1003 regarding their decommissioning? Wednesday 2018-01-03 has already passed :-)

It was on my list of things to do today to ping @Marostegui and @jcrespo here about this. :) The only outstanding issue that could be considered a blocker for full decomm is T183758.

bd808 added a comment. Jan 12 2018, 5:07 PM

I think we are ready to shut down labsdb1001 (which actually had another storage crash today) and labsdb1003. The _p databases there have been archived, which was the last blocker.

jcrespo updated the task description. Jan 12 2018, 5:10 PM

I think we are ready to shut down labsdb1001 (which actually had another storage crash today) and labsdb1003. The _p databases there have been archived, which was the last blocker.

Fine by me!
Feel free to proceed :)

bd808 updated the task description. Jan 12 2018, 9:36 PM

The RAID policy on labsdb1003 started to fail and it is now on WT (write-through) instead of WB (write-back).
Possibly the BBU is failing.
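For readers unfamiliar with this kind of alert, the following is a minimal sketch of the comparison such a policy check makes: the controller's current write policy is compared against the expected one, and a fallback from WB to WT is flagged. This is not the actual megacli policy check from puppet; the function name and message wording are assumptions for illustration.

```python
from typing import Tuple

# Abbreviations as used in the comment above.
POLICY_NAMES = {"WB": "write-back", "WT": "write-through"}


def check_write_policy(current: str, expected: str = "WB") -> Tuple[bool, str]:
    """Return (ok, message) for a RAID write policy given as 'WB' or 'WT'."""
    current = current.strip().upper()
    expected = expected.strip().upper()
    for value in (current, expected):
        if value not in POLICY_NAMES:
            raise ValueError(f"unknown write policy: {value!r}")
    if current == expected:
        return True, f"write policy is {POLICY_NAMES[current]} ({current})"
    return False, (
        f"write policy is {POLICY_NAMES[current]} ({current}), "
        f"expected {POLICY_NAMES[expected]} ({expected}); "
        "the BBU may be failing"
    )


if __name__ == "__main__":
    # The state reported for labsdb1003 in this comment.
    ok, message = check_write_policy("WT")
    print(ok, message)
```

A controller typically falls back to write-through when it no longer trusts its battery, which is why the comment suspects the BBU.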

Reminder: we would like to stop MySQL on both hosts (on 1001 it is actually already stopped/dead) on Wednesday the 17th so cloud-services-team can proceed with T184832.
@bd808 any objection to stop MySQL for good on 1003?

Change 404323 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Set as spares labsdb1001 and labsdb1003

https://gerrit.wikimedia.org/r/404323

@bd808 any objection to stop MySQL for good on 1003?

No objections.

MySQL has been stopped on labsdb1001 (it was already unavailable) and labsdb1003.

Change 405338 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Exclude labsdb1001,2,3 from megacli policy check

https://gerrit.wikimedia.org/r/405338

Change 405338 merged by Jcrespo:
[operations/puppet@production] mariadb: Exclude labsdb1001,2,3 from megacli policy check

https://gerrit.wikimedia.org/r/405338

This can probably be closed I assume?

bd808 added a comment. Mar 20 2018, 2:56 PM

This can probably be closed I assume?

There are a few cleanup tasks left and one big issue: creating some process for loading user curated data into the wiki replica instances. That should probably be broken out into its own Epic at this point. The original discussions are fragmented and stalled. I'll see if I can clean this up soon.

bd808 closed this task as Resolved. Jun 29 2018, 9:10 PM
bd808 claimed this task.

We have several follow up tasks still pending, but the major changes outlined here have been complete for months as @Marostegui pointed out (months ago) in T142807#4064108.