
Failover DB masters in row D
Open, Normal, Public

Description

The following masters are in row D and need to be failed over to allow the eqiad row D switch upgrade (T172459):

  • s2: db1122
      • Candidate master in row B
  • s4: db1068
      • Candidate master in row A
  • s5: db1070
      • Candidate master in row C
  • s7: db1062
      • Candidate master in row B
  • s8: db1109
      • Candidate master in row B
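
For planning purposes, the list above can be grouped by the row that hosts each candidate master. The following is only a minimal Python sketch; the names `ROW_D_MASTERS` and `sections_by_candidate_row` are illustrative assumptions and not part of any existing tooling, and the data is copied from the list above.

```python
from collections import defaultdict

# Inventory copied from the task description:
# section -> (current master in row D, row of the candidate master).
ROW_D_MASTERS = {
    "s2": ("db1122", "B"),
    "s4": ("db1068", "A"),
    "s5": ("db1070", "C"),
    "s7": ("db1062", "B"),
    "s8": ("db1109", "B"),
}


def sections_by_candidate_row(masters):
    """Group section names by the row that hosts their candidate master."""
    grouped = defaultdict(list)
    for section, (_master, candidate_row) in masters.items():
        grouped[candidate_row].append(section)
    # Return a plain dict with rows and sections in a stable, sorted order.
    return {row: sorted(sections) for row, sections in sorted(grouped.items())}


if __name__ == "__main__":
    for row, sections in sections_by_candidate_row(ROW_D_MASTERS).items():
        print(f"row {row}: {', '.join(sections)}")
```

Grouping this way shows at a glance which sections would land in a given row after a failover, which helps when sequencing failovers against per-row maintenance.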

This requires read only time.

Event Timeline

Marostegui removed ayounsi as the assignee of this task. Feb 1 2018, 7:13 AM
Marostegui triaged this task as Normal priority.
Marostegui created this task.
Marostegui moved this task from Triage to Next on the DBA board.
Marostegui moved this task from Next to Backlog on the DBA board. Feb 9 2018, 12:18 PM
jcrespo moved this task from Backlog to Next on the DBA board. Apr 4 2018, 2:20 PM
jcrespo moved this task from Next to In progress on the DBA board. Apr 5 2018, 4:06 PM
jcrespo moved this task from In progress to Next on the DBA board.
jcrespo moved this task from Next to In progress on the DBA board.

We should draw up a plan for this, but for all rows.

Marostegui moved this task from In progress to Next on the DBA board. Apr 25 2018, 5:55 AM

Let's plan to do this during the next DC failover, or at least move a couple of them.

Marostegui changed the task status from Open to Stalled. Aug 20 2018, 9:29 AM

Stalling this, as it is currently unclear how the switch upgrade will proceed given the network issues found in T201145.

Marostegui updated the task description. Sep 5 2018, 3:49 PM

I have been syncing up with @ayounsi about the scheduled network maintenance and the switch issues.
So far they are still running some tests (T201145), and they should know more about how to proceed in a few days. Until then, everything is stalled.
If things go well, there is a chance that maintenance will happen in row B. Quoting his words: "This would mean up to 30min of downtime for the servers on that switches (worse case that we will probably shorten)."
I have let him know that we have misc masters in that row that will not be failed over to codfw (wikitech and phabricator).

The same maintenance would need to be applied to rows C and D, but according to Arzhel's comments that is unlikely to happen during the failover time.

My proposal in regards to this task is as follows:

  1. Wait for the unblock of T201145
  2. Coordinate with Arzhel to see if maintenance on row B happens during the failover
  3. If maintenance on row B happens, then move the s7 and s8 masters to row B so that row would be done.

I just synced with @ayounsi about the network maintenance. It is still blocked on the cables.

row A:
If the cables arrive on time, they expect to do maintenance on row A (no servers) later this week or early next week. Once that is done, we could move servers there (it is currently empty).

row B:
If maintenance on row A happens, then maintenance on row B could happen the week of 1st Oct.

We need to decide whether we want to physically move the servers to either of those rows, or just do a DC failover to the candidates (I have noted in the task description where the candidates are placed). If we decide to move them physically, that implies moving the candidate masters too.

As agreed during our meeting today, we will combine these failovers with the next hardware purchase for eqiad (I will link the task here once I create it).

Marostegui mentioned this in Unknown Object (Task). Sep 21 2018, 9:29 AM
Marostegui changed the task status from Stalled to Open. Jun 19 2019, 10:22 AM
Marostegui updated the task description.
Marostegui moved this task from Blocked external/Not db team to In progress on the DBA board.
Marostegui updated the task description. Tue, Sep 10, 5:46 AM
Marostegui updated the task description. Tue, Sep 17, 6:01 AM