Migrate all users to new Wiki Replica cluster and decommission old hardware
Closed, Resolved, Public

Authored By

	jcrespo
	Aug 12 2016, 7:31 AM

Description

Timeline

Monday Oct 30 2017, 14:30 UTC
T168584 - Reboot labsdb1001.eqiad.wmnet (aka c1.labsdb) for kernel updates

There is a possibility of catastrophic hardware failure in this reboot. There will be no way to recover the server or the data it currently hosts if that happens.

~~Tuesday Nov 07 2017, 14:30 UTC~~
T168584 - Reboot labsdb1003.eqiad.wmnet (aka c3.labsdb) for kernel updates

Cancelled due to hardware failure on labsdb1001.eqiad.wmnet and subsequent failover of all *.labsdb traffic to this host.

Wednesday 2017-12-13

*.labsdb service names switched to point at *.analytics.db.svc.eqiad.wmflabs equivalents.
User created tables will not be allowed on the new servers.

Thursday 2017-12-14

DBAs will stop replication from production hosts to labsdb1003.eqiad.wmnet
DBAs will make databases on labsdb1003.eqiad.wmnet read-only for all users

Wednesday 2018-01-17

labsdb1001.eqiad.wmnet removed from service permanently.
labsdb1003.eqiad.wmnet removed from service permanently.
c1.labsdb service name will be removed from DNS.
c3.labsdb service name will be removed from DNS.

See https://wikitech.wikimedia.org/wiki/Wiki_Replica_c1_and_c3_shutdown for more information.
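The service-name switch in the timeline above is essentially a mechanical hostname rewrite. The following is a minimal Python sketch of that mapping, not the actual puppet change. It assumes legacy names have the form `<dbname>.labsdb`; the helper name `new_service_name` and the example database `enwiki` are illustrative. The `web` cluster alias appears in the related patches listed further down this task.

```python
LEGACY_SUFFIX = ".labsdb"
NEW_DOMAIN = "db.svc.eqiad.wmflabs"
CLUSTERS = ("analytics", "web")
# Per the timeline, these legacy aliases are removed from DNS rather than mapped.
REMOVED_NAMES = ("c1.labsdb", "c3.labsdb")


def new_service_name(legacy_name: str, cluster: str = "analytics") -> str:
    """Map a legacy '<dbname>.labsdb' name to its new-cluster equivalent.

    Hypothetical helper: it only mirrors the naming pattern described in
    the task, it does not query DNS.
    """
    if cluster not in CLUSTERS:
        raise ValueError(f"unknown cluster: {cluster!r}")
    if legacy_name in REMOVED_NAMES:
        raise ValueError(f"{legacy_name} has no equivalent on the new cluster")
    if not legacy_name.endswith(LEGACY_SUFFIX):
        raise ValueError(f"not a legacy service name: {legacy_name!r}")
    dbname = legacy_name[: -len(LEGACY_SUFFIX)]
    if not dbname:
        raise ValueError("missing database name")
    return f"{dbname}.{cluster}.{NEW_DOMAIN}"


if __name__ == "__main__":
    print(new_service_name("enwiki.labsdb"))
    print(new_service_name("enwiki.labsdb", cluster="web"))
```

Tools that hard-code a `*.labsdb` name keep working through the DNS aliases, but switching to the new names makes the choice between the analytics and web clusters explicit.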

labsdb1001 and labsdb1003 are the last old servers from a particular batch still in use, and they are blocking the return of that batch.

The purchased hosts labsdb1009/10/11, intended as replacements, are in full production and available to be used instead. However, because of the improved architecture there (allowing real high availability, load balancing and automatic failover), a conscious decision was made not to cover all use cases, in particular direct(?) writes of user databases (T156869), so the migration may not be 100% transparent and may impact users (some programming changes may be needed). In all other areas, however, the new hosts are more powerful, better managed and have better data quality.

The Cloud team should probably set up a roadmap to determine when the decommission can happen; otherwise, rather than a decommission process, we will have an unplanned outage. The current hosts are failing component by component, have multiple hw/IPMI alerts, their storage is not redundant disk-wise (due to disk space constraints, which are still a growing issue), and in general it is unlikely they will survive more than a few months.

Details

Subject	Repo	Branch	Lines +/-
mariadb: Exclude labsdb1001,2,3 from megacli policy check	operations/puppet	production	+1 -1
wiki replicas: point *.labsdb to *.analytics.db.svc.eqiad.wmflabs	operations/puppet	production	+922 -922
labsdb: Point DNS at equivalent web.db.svc.eqiad.wmflabs hosts	operations/puppet	production	+938 -913

Related Objects

Status	Subtype	Assigned	Task
Resolved		jcrespo	T140788 Labs databases rearchitecture (tracking)
Resolved		bd808	T166402 Program 7 Outcome 3: data services
Resolved		bd808	T142807 Migrate all users to new Wiki Replica cluster and decommission old hardware
Resolved		bd808	T172704 Promote initial use of new Wiki Replica servers
Resolved		bd808	T174860 Define naming scheme for connecting to new wiki replica cluster
Resolved		• madhuvishy	T168584 Labsdb* servers need to be rebooted
Resolved		bd808	T175086 Create and announce timeline for shutting down labsdb100[13]
Resolved		bd808	T175096 Identify tools hosting databases on labsdb100[13] and notify maintainers
Resolved		bd808	T176688 Update `sql` command to use new wiki replica servers
Resolved		zhuyifei1999	T176694 Switch Quarry to use *.analytics.db.svc.eqiad.wmflabs as replica database host
Open		None	T176886 Update meta_p database for new service names
Resolved		jcrespo	T177096 Some queries to new replica hosts are dramatically slower than labsdb; missing indexes?
Resolved		bd808	T177223 Determine schema differences between labsdb1001 and labsdb1009
Resolved		eranroz	T179227 Migrate copypatrol & plagiabot to use tools.labsdb
Resolved		Marostegui	T179464 labsdb1001 crashed - storage issue
Declined		None	T179628 Consider granting `CREATE TEMPORARY TABLES` to labsdbuser
Open		None	T180558 Include namespace IDs and their names to mysql wikireplicas (meta_p database)
Declined		None	T180636 Make Dispenser's principle_links table accessible in new Wiki replica cluster
Open	Feature	None	T173511 Implement technical details and process for "datasets_p" on wikireplica hosts
Declined		None	T173512 Create a phabricator project called "wikireplica-datasets"
Duplicate		None	T173513 Create a database on the wikireplica servers called "datasets_p"
Duplicate		None	T173514 Document the process for importing a new "datasets_p" table
Resolved		zhuyifei1999	T181492 sql command should point to the new labsdb servers
Declined		None	T182948 Create method for accessing user watchlists in database queries
Resolved		• madhuvishy	T183029 Stop managing account creation for labsdb1001 and 1003 through the maintain-dbusers script
Resolved		jcrespo	T186585 Review m5 backups
Resolved		bd808	T183651 centralauth database service name missing from new replicas
Resolved		Marostegui	T183758 Create backups of user tables from decommissioned database servers
Resolved		• Banyek	T183983 Re-institute query killer for the analytics WikiReplica
Resolved		• Banyek	T203674 Debian package or files managed my puppet for pt-kill-wmf
Resolved		• Cmjohnson	T184832 Decommission labsdb1001 and labsdb1003

Event Timeline

There are a very large number of changes, so older changes are hidden.

Ricordisamoa subscribed. Sep 27 2017, 2:52 AM

bd808 added a subtask: T176886: Update meta_p database for new service names. Sep 27 2017, 4:42 PM

bd808 added a parent task: T166402: Program 7 Outcome 3: data services. Sep 29 2017, 10:45 PM

bd808 renamed this task from Decommission labsdb1001 and labsdb1003 to Migrate all users to new Wiki Replica cluster and decommission old hardware. Sep 29 2017, 10:47 PM

bd808 updated the task description.

bd808 closed subtask T176688: Update `sql` command to use new wiki replica servers as Resolved. Oct 1 2017, 3:35 AM

bd808 added a subtask: T177096: Some queries to new replica hosts are dramatically slower than labsdb; missing indexes?. Oct 2 2017, 3:34 PM

jcrespo closed subtask T177096: Some queries to new replica hosts are dramatically slower than labsdb; missing indexes? as Resolved. Oct 4 2017, 5:42 PM

bd808 mentioned this in T178135: ores_classification table corrupt on enwiki labs replica labsdb1001. Oct 16 2017, 4:59 AM

bd808 mentioned this in T172567: Data missing from labs replica of enwiki.imagelinks. Oct 18 2017, 2:17 AM

bd808 updated the task description. Oct 19 2017, 12:27 AM

bd808 updated the task description. Oct 19 2017, 12:33 AM

bd808 closed subtask T175086: Create and announce timeline for shutting down labsdb100[13] as Resolved. Oct 19 2017, 12:48 AM

• madhuvishy reopened subtask T168584: Labsdb* servers need to be rebooted as Open. Oct 24 2017, 7:04 PM

• madhuvishy updated the task description. Oct 24 2017, 8:17 PM

Quiddity mentioned this in T179219: Generic hostnames for wiki database replicas?. Oct 28 2017, 12:56 AM

eranroz created subtask T179227: Migrate copypatrol & plagiabot to use tools.labsdb. Oct 28 2017, 6:53 AM

eranroz closed subtask T179227: Migrate copypatrol & plagiabot to use tools.labsdb as Resolved. Nov 1 2017, 10:45 PM

bd808 closed subtask T168584: Labsdb* servers need to be rebooted as Resolved. Nov 2 2017, 12:02 AM

bd808 added a subtask: T179464: labsdb1001 crashed - storage issue. Nov 2 2017, 11:53 PM

bd808 updated the task description.

bd808 added a subtask: T179628: Consider granting `CREATE TEMPORARY TABLES` to labsdbuser. Nov 3 2017, 10:30 PM

Krenair subscribed. Nov 13 2017, 1:38 AM

Dispenser created subtask T180558: Include namespace IDs and their names to mysql wikireplicas (meta_p database). Nov 15 2017, 4:22 AM

Dispenser created subtask T180636: Make Dispenser's principle_links table accessible in new Wiki replica cluster. Nov 15 2017, 8:59 PM

bd808 mentioned this in T181492: sql command should point to the new labsdb servers. Nov 29 2017, 8:57 PM

bd808 added a subtask: T181492: sql command should point to the new labsdb servers.

Change 397256 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] labsdb: Point DNS at equivalent web.db.svc.eqiad.wmflabs hosts

https://gerrit.wikimedia.org/r/397256

gerritbot added a project: Patch-For-Review. Dec 11 2017, 6:57 AM

bd808 closed subtask T175096: Identify tools hosting databases on labsdb100[13] and notify maintainers as Resolved. Dec 11 2017, 7:03 AM

bd808 closed subtask T179628: Consider granting `CREATE TEMPORARY TABLES` to labsdbuser as Declined. Dec 11 2017, 7:06 AM

Change 397256 merged by Madhuvishy:
[operations/puppet@production] labsdb: Point DNS at equivalent web.db.svc.eqiad.wmflabs hosts

https://gerrit.wikimedia.org/r/397256

bd808 updated the task description. Dec 13 2017, 11:53 PM

DNS switch announced: https://lists.wikimedia.org/pipermail/cloud-announce/2017-December/000013.html

@Marostegui and @jcrespo: Could one of you please stop replication on labsdb1003 and make the databases read-only (matching labsdb1001) at your earliest convenience?

I've updated the timelines a bit. I am now suggesting that we give the users a final 3 week period ending on 2018-01-03 before we take labsdb1001 and labsdb1003 offline for good. This will give tool maintainers who have managed not to get any of the other announcements a small amount of time to archive their user tables or migrate them to tools.db.svc.eqiad.wmflabs.
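For maintainers who still need to move user tables, the migration described above could look roughly like the following Python sketch. It only builds the command lines and does not run them. The old host alias `c3.labsdb` and the target `tools.db.svc.eqiad.wmflabs` come from this task; the credentials file `~/replica.my.cnf`, the helper name `migration_commands`, and the database name `s12345__example` are assumptions for illustration only.

```python
import shlex
from pathlib import Path

OLD_HOST = "c3.labsdb"                   # alias of labsdb1003, per the timeline
NEW_HOST = "tools.db.svc.eqiad.wmflabs"  # migration target named in this task


def migration_commands(database, defaults_file="~/replica.my.cnf"):
    """Build (dump, load) argv lists for copying one user database.

    Assumption: the same credentials file works on both hosts and the
    target database has already been created on the new host.
    """
    cnf = str(Path(defaults_file).expanduser())
    # --defaults-file must be the first option for the MySQL/MariaDB clients.
    dump = ["mysqldump", f"--defaults-file={cnf}", f"--host={OLD_HOST}", database]
    load = ["mysql", f"--defaults-file={cnf}", f"--host={NEW_HOST}", database]
    return dump, load


if __name__ == "__main__":
    dump, load = migration_commands("s12345__example")  # hypothetical database name
    # Shown as a shell pipeline; redirect the dump to a file instead to archive it.
    print(f"{shlex.join(dump)} | {shlex.join(load)}")
```

Redirecting the dump to a file covers the "archive" option, while piping it into `mysql` covers the "migrate" option.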

Death blow for GHEL coordinate extraction and WikiMiniAtlas. 🙁

Mentioned in SAL (#wikimedia-operations) [2017-12-14T07:18:15Z] <marostegui> Stop replication and set read-only on labsdb1003 - T142807

In T142807#3836277, @bd808 wrote:

@Marostegui and @jcrespo: Could one of you please stop replication on labsdb1003 and make the databases read-only (matching labsdb1001) at your earliest convenience?

mysql:root@localhost [(none)]> stop all slaves;
Query OK, 0 rows affected, 5 warnings (0.04 sec)

mysql:root@localhost [(none)]> show warnings;
+-------+------+------------------------+
| Level | Code | Message                |
+-------+------+------------------------+
| Note  | 1938 | SLAVE 's2' stopped     |
| Note  | 1938 | SLAVE 's4' stopped     |
| Note  | 1938 | SLAVE 's7' stopped     |
| Note  | 1938 | SLAVE 's6' stopped     |
| Note  | 1938 | SLAVE 'db1095' stopped |
+-------+------+------------------------+
5 rows in set (0.00 sec)

mysql:root@localhost [(none)]> set global read_only=ON;
Query OK, 0 rows affected (0.00 sec)
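
The state reached by the commands above can be double-checked afterwards. The following Python sketch is not part of the original session: it assumes the rows of MariaDB's `SHOW ALL SLAVES STATUS` are available as dictionaries (for example from a dict-style cursor) and that the value of `SELECT @@global.read_only` has been fetched separately. The helper names and the sample rows are hypothetical; the connection names are the ones shown in the warnings.

```python
def running_connections(slave_status_rows):
    """Return the names of replication connections that still have a thread running."""
    running = []
    for row in slave_status_rows:
        # Anything other than "No" (e.g. "Yes", "Connecting") counts as running.
        io_running = row.get("Slave_IO_Running", "No") != "No"
        sql_running = row.get("Slave_SQL_Running", "No") != "No"
        if io_running or sql_running:
            running.append(row.get("Connection_name", ""))
    return running


def is_frozen(slave_status_rows, read_only_value):
    """True when every replication connection is stopped and the server is read-only."""
    return not running_connections(slave_status_rows) and int(read_only_value) == 1


if __name__ == "__main__":
    # Illustrative rows only, not real output from labsdb1003.
    rows = [
        {"Connection_name": name, "Slave_IO_Running": "No", "Slave_SQL_Running": "No"}
        for name in ("s2", "s4", "s7", "s6", "db1095")
    ]
    print(is_frozen(rows, 1))
```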

I've updated the timelines a bit. I am now suggesting that we give the users a final 3 week period ending on 2018-01-03 before we take labsdb1001 and labsdb1003 offline for good. This will give tool maintainers who have managed not to get any of the other announcements a small amount of time to archive their user tables or migrate them to tools.db.svc.eqiad.wmflabs.

Makes sense, thanks!

@bd808 could we change the old servers to point to the analytics hosts instead? I think (I may be wrong) that they are pointing to the web one, and most of the non-migrated scripts seem to be long-running queries on crons. Thanks.

Marostegui mentioned this in T174569: Schema change for refactored comment storage. Dec 14 2017, 1:16 PM

In T142807#3836763, @jcrespo wrote:

@bd808 could we change the old servers to point to the analytics hosts instead? I think (I may be wrong) that they are pointing to the web one, and most of the non-migrated scripts seem to be long-running queries on crons. Thanks.

Yes, we can certainly do that. I guessed arbitrarily that the web use case would be the dominant one, but if we are mostly seeing the opposite all we need to do is update the labsdb zone file in puppet to target the other cluster alias.

We may want to wait to do that until after I've done T181492: sql command should point to the new labsdb servers where I will make sql enwiki connect to the analytics instance(s). That may change the mix for us as well.

• chasemp subscribed. Dec 14 2017, 7:14 PM

RobH added a subtask: T128821: reclaim and return all cisco servers. Dec 14 2017, 7:15 PM

RobH mentioned this in T128821: reclaim and return all cisco servers.

@bd808 We really need the proposed change now, before the "web" server explodes. The web one is supposed to be fast and responsive, and it is hard to convince people to use it for quick queries if it is the slowest because it is the default. See: T182997 and T182995. Plus: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&panelId=12&fullscreen&orgId=1&var-server=labsdb1011&var-network=eth0&from=1513347373505&to=1513372272955 vs https://grafana.wikimedia.org/dashboard/file/server-board.json?orgId=1&var-server=labsdb1010&var-network=eth0&from=1513347373505&to=1513372272955&refresh=1m&panelId=12&fullscreen

Change 398551 had a related patch set uploaded (by BryanDavis; owner: Bryan Davis):
[operations/puppet@production] wiki replicas: point *.labsdb to *.analytics.db.svc.eqiad.wmflabs

https://gerrit.wikimedia.org/r/398551

jcrespo awarded a token. Dec 15 2017, 9:27 PM

Change 398551 merged by Andrew Bogott:
[operations/puppet@production] wiki replicas: point *.labsdb to *.analytics.db.svc.eqiad.wmflabs

https://gerrit.wikimedia.org/r/398551

JJMC89 updated the task description. Dec 17 2017, 12:15 AM

JJMC89 mentioned this in T183066: MySQL errors for erwin85 relatedchanges, random article, and categorycount. Dec 17 2017, 12:31 AM

jcrespo removed a subtask: T128821: reclaim and return all cisco servers. Dec 21 2017, 10:23 AM

jcrespo added a parent task: T128821: reclaim and return all cisco servers.

bd808 added a subtask: T183651: centralauth database service name missing from new replicas. Dec 24 2017, 4:56 AM

bd808 closed subtask T183651: centralauth database service name missing from new replicas as Resolved. Dec 24 2017, 5:34 AM

Hi!

What's the status of labsdb1001 and 1003 regarding their decommissioning? Wednesday 2018-01-03 has already passed :-)

In T142807#3877442, @Marostegui wrote:

Hi!

What's the status of labsdb1001 and 1003 regarding their decommissioning? Wednesday 2018-01-03 has already passed :-)

This week's madness kind of stole our attention; I'll make a note to discuss this Mon/Tue next week. AFAIK nothing is pointing at the old stuff explicitly, and it's a small announcement and a formality at this point.

In T142807#3877442, @Marostegui wrote:

Hi!

What's the status of labsdb1001 and 1003 regarding their decommissioning? Wednesday 2018-01-03 has already passed :-)

It was on my list of things to do today to ping @Marostegui and @jcrespo here about this. :) The only outstanding issue that could be considered a blocker for full decomm is T183758.

bd808 closed subtask T183758: Create backups of user tables from decommissioned database servers as Resolved. Jan 9 2018, 6:39 PM

jcrespo mentioned this in T179464: labsdb1001 crashed - storage issue. Jan 12 2018, 12:56 PM

I think we are ready to shut down labsdb1001 (which actually had another storage crash today) and labsdb1003. The _p databases there have been archived, which was the last blocker.

jcrespo updated the task description. Jan 12 2018, 5:10 PM

In T142807#3897157, @bd808 wrote:

I think we are ready to shut down labsdb1001 (which actually had another storage crash today) and labsdb1003. The _p databases there have been archived, which was the last blocker.

Fine by me!
Feel free to proceed :)

bd808 updated the task description. Jan 12 2018, 9:36 PM

bd808 added a subtask: T184832: Decommission labsdb1001 and labsdb1003. Jan 12 2018, 10:00 PM

bd808 removed a parent task: T128821: reclaim and return all cisco servers. Jan 12 2018, 11:14 PM

labsdb1003 RAID policy started to fail and it is now on WT instead of WB.
Possibly the BBU is failing.

Reminder: we would like to stop MySQL on both hosts (on 1001 it is actually already stopped/dead) on Wednesday 17th so cloud-services-team can proceed with T184832.
@bd808 any objection to stopping MySQL for good on 1003?

Change 404323 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Set as spares labsdb1001 and labsdb1003

https://gerrit.wikimedia.org/r/404323

In T142807#3900321, @Marostegui wrote:

@bd808 any objection to stopping MySQL for good on 1003?

No objections.

MySQL has been stopped on labsdb1001 (it was already unavailable) and labsdb1003.

Change 405338 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] mariadb: Exclude labsdb1001,2,3 from megacli policy check

https://gerrit.wikimedia.org/r/405338

Change 405338 merged by Jcrespo:
[operations/puppet@production] mariadb: Exclude labsdb1001,2,3 from megacli policy check

https://gerrit.wikimedia.org/r/405338

jcrespo closed subtask T182948: Create method for accessing user watchlists in database queries as Declined. Feb 7 2018, 3:02 PM

• madhuvishy closed subtask T183029: Stop managing account creation for labsdb1001 and 1003 through the maintain-dbusers script as Resolved. Feb 14 2018, 6:02 PM

Tpt reopened subtask T180558: Include namespace IDs and their names to mysql wikireplicas (meta_p database) as Open. Feb 28 2018, 6:21 PM

zhuyifei1999 closed subtask T181492: sql command should point to the new labsdb servers as Resolved. Mar 5 2018, 7:22 PM

zhuyifei1999 mentioned this in T138954: enwiki_p DB corruption. Mar 7 2018, 5:43 PM

This can probably be closed I assume?

In T142807#4064108, @Marostegui wrote:

This can probably be closed I assume?

There are a few cleanup tasks left and one big issue: creating some process for loading user curated data into the wiki replica instances. That should probably be broken out into its own Epic at this point. The original discussions are fragmented and stalled. I'll see if I can clean this up soon.

jcrespo mentioned this in T140788: Labs databases rearchitecture (tracking). Mar 26 2018, 7:50 PM

jcrespo changed the status of subtask T180636: Make Dispenser's principle_links table accessible in new Wiki replica cluster from Open to Stalled. Jun 15 2018, 3:13 PM

We have several follow-up tasks still pending, but the major changes outlined here have been complete for months, as @Marostegui pointed out (months ago) in T142807#4064108.