Failover DB masters in row D
Closed, Resolved, Public

Description

The following masters are in row D and need to be failed over to allow the eqiad row D switch upgrade (T172459):

  • s2: db1122
      • Candidate master in row B
  • s3: db1123
      • Candidate master in row C
  • s4: db1068
      • Candidate master in row A
  • s5: db1070
      • Candidate master in row C
  • s7: db1062
      • Candidate master in row B
  • s8: db1109 (T239238)
      • Candidate master in row B
  • es5: es1023
      • Candidate master in row A

This requires read-only time.

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui created this task.
Marostegui moved this task from Triage to Pending comment on the DBA board.
jcrespo moved this task from In progress to Pending comment on the DBA board.
jcrespo moved this task from Pending comment to In progress on the DBA board.

We should put together a plan for this, but for all rows.

Let's plan for the next DC failover to do this or at least move a couple of them.

Marostegui changed the task status from Open to Stalled. Aug 20 2018, 9:29 AM

Stalling this, as it is not clear how the switch upgrade will proceed given the network issues found in T201145.

I have been syncing up with @ayounsi about the scheduled network maintenance and the switch issues.
So far they are still running some tests (T201145) and should know more about how to proceed in a few days. Until that happens, everything is stalled.
If things go well, there is a chance that maintenance will happen in row B. Quoting his words: "This would mean up to 30min of downtime for the servers on that switches (worse case that we will probably shorten)."
I have let him know that we have misc masters in that row that will not be failed over to codfw (wikitech and phabricator).

The same maintenance would need to be applied to rows C and D, but that is unlikely to happen during the failover time, as per Arzhel's comments.

My proposal regarding this task is as follows:

  1. Wait for T201145 to be unblocked.
  2. Coordinate with Arzhel to see if maintenance on row B happens during the failover.
  3. If maintenance on row B happens, then move the s7 and s8 masters to row B so that row would be done.

I just sync'ed with @ayounsi about the network maintenance. It is still blocked on the cables.

row A:
If cables arrive on time, they are expecting to do maintenance on row A (no servers) later this week or early next week. Once done, we could move servers there (it is currently empty).

row B:
If maintenance on row A happens, then maintenance on row B could happen the week of 1st Oct.

We need to decide whether we want to physically move the servers to any of those rows, or just do a DC failover to the candidates (I have noted in the task description where the candidates are placed). If we decide to move them physically, that implies moving the candidate masters too.

As agreed during our meeting today, we will combine these failovers with the next hardware purchase for eqiad (I will link the task here once I create it).

Marostegui mentioned this in Unknown Object (Task). Sep 21 2018, 9:29 AM
Marostegui changed the task status from Stalled to Open. Jun 19 2019, 10:22 AM
Marostegui updated the task description.
Marostegui moved this task from Blocked external/Not db team to In progress on the DBA board.
jcrespo changed the task status from Open to Stalled. Apr 29 2020, 7:29 AM

Stalling as per Marostegui's updates. Not really blocked on switchover anymore.

Marostegui changed the task status from Stalled to Open. Aug 28 2020, 10:35 AM
Marostegui moved this task from Backlog to Pending comment on the DBA board.

This needs review: we need to check how many masters we have per row and distribute them evenly once eqiad is on standby.

Current distribution:

root@db1115.eqiad.wmnet[zarcillo]> select instance,rack,section from masters inner join servers on masters.instance=servers.hostname where masters.dc='eqiad' order by rack;
+----------+------+---------+
| instance | rack | section |
+----------+------+---------+
| db1107   | A2   | m2      |
| db1080   | A2   | m1      |
| db1081   | A2   | s4      |
| db1103   | A3   | x1      |
| es1024   | A5   | es5     |
| pc1007   | A6   | pc1     |
| db1115   | A6   | tendril |
| db1077   | B1   | test-s4 |
| db1083   | B1   | s1      |
| db1086   | B3   | s7      |
| es1021   | B3   | es4     |
| db1132   | B8   | m3      |
| db1100   | C2   | s5      |
| pc1009   | C3   | pc3     |
| db1133   | C3   | m5      |
| pc1010   | D3   | pc2     |
| db1114   | D4   | test-s1 |
| db1122   | D6   | s2      |
| db1093   | D8   | s6      |
| db1109   | D8   | s8      |
| db1123   | D8   | s3      |
+----------+------+---------+
21 rows in set (0.001 sec)
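
The listing above can be tallied per row to make the balance easier to read. This is only a minimal sketch: it assumes the row is the first letter of the rack name (for example, D8 is in row D), and `MASTERS` and `masters_per_row` are illustrative names, not something that exists in zarcillo.

```python
from collections import Counter

# (instance, rack, section) rows copied from the query output above.
MASTERS = [
    ("db1107", "A2", "m2"), ("db1080", "A2", "m1"), ("db1081", "A2", "s4"),
    ("db1103", "A3", "x1"), ("es1024", "A5", "es5"), ("pc1007", "A6", "pc1"),
    ("db1115", "A6", "tendril"), ("db1077", "B1", "test-s4"),
    ("db1083", "B1", "s1"), ("db1086", "B3", "s7"), ("es1021", "B3", "es4"),
    ("db1132", "B8", "m3"), ("db1100", "C2", "s5"), ("pc1009", "C3", "pc3"),
    ("db1133", "C3", "m5"), ("pc1010", "D3", "pc2"), ("db1114", "D4", "test-s1"),
    ("db1122", "D6", "s2"), ("db1093", "D8", "s6"), ("db1109", "D8", "s8"),
    ("db1123", "D8", "s3"),
]


def masters_per_row(masters):
    """Count masters per row. Assumption: the row is the rack's first letter."""
    return Counter(rack[0] for _instance, rack, _section in masters)


if __name__ == "__main__":
    for row, count in sorted(masters_per_row(MASTERS).items()):
        print(f"row {row}: {count}")
```

For the 21 masters listed, this gives 7 in row A, 5 in row B, 3 in row C and 6 in row D.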

It is not too bad, and we have spread them quite well lately.
If we check just core, it is a bit less balanced:

+----------+------+---------+
| instance | rack | section |
+----------+------+---------+
| db1081   | A2   | s4      |
| db1083   | B1   | s1      |
| db1086   | B3   | s7      |
| db1100   | C2   | s5      |
| db1122   | D6   | s2      |
| db1093   | D8   | s6      |
| db1109   | D8   | s8      |
| db1123   | D8   | s3      |
+----------+------+---------+
8 rows in set (0.002 sec)

We can move s8 out from row D to row B as planned: T239238
We can also move s3 out from row D to row A (db1075 is in row A)

So we would have:

2 masters in row A
3 masters in row B
1 master in row C
2 masters in row D
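
The resulting distribution can be double-checked by applying the two proposed moves to the core listing. This is a rough sketch, again assuming the row is the first letter of the rack name; `CORE_MASTERS`, `PLANNED_MOVES` and `rows_after_moves` are hypothetical names used only for illustration.

```python
from collections import Counter

# section -> rack, copied from the core masters listing.
CORE_MASTERS = {
    "s4": "A2", "s1": "B1", "s7": "B3", "s5": "C2",
    "s2": "D6", "s6": "D8", "s8": "D8", "s3": "D8",
}

# section -> destination row for the two moves proposed in this comment.
PLANNED_MOVES = {"s8": "B", "s3": "A"}


def rows_after_moves(masters, moves):
    """Return masters per row once the given sections have been moved.

    Assumption: the row is the first letter of the rack name.
    """
    rows = {section: rack[0] for section, rack in masters.items()}
    rows.update(moves)  # moved sections take their destination row
    return Counter(rows.values())


if __name__ == "__main__":
    for row, count in sorted(rows_after_moves(CORE_MASTERS, PLANNED_MOVES).items()):
        print(f"row {row}: {count}")
```

Calling `rows_after_moves(CORE_MASTERS, {})` shows the starting point (1, 2, 1 and 4 masters in rows A to D), which makes the imbalance in row D explicit.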

We could try to physically move some other candidate masters to row C and try to get one out of row B. But that might require some time, as it needs on-site help.

This is no longer blocking the switch upgrade as it will be done while eqiad is passive, but it is a matter of having our MW masters a bit more balanced within rows.

We definitely need to move one out of D8; having 3 there is too many (T261454#6422998).

Marostegui claimed this task.

After failing over the s6 eqiad master (T263227) and the s8 eqiad master (T239238), the goal of this task, which was to balance the masters across rows for sX-x1, has been achieved:
2 masters in row A
3 masters in row B
2 masters in row C
2 masters in row D

Resolving this for now.