
Master and candidate master of s5 and s8 in eqiad are in the same row
Closed, Resolved (Public)

Description

According to https://fault-tolerance.toolforge.org/map?cluster=s8 and https://fault-tolerance.toolforge.org/map?cluster=s5:

s8 master is in A1 (db1126) and its candidate is in A6 (db1209)

s5 master is in B3 (db1130) and its candidate is in B1 (db1183)

I wonder if we could move the hosts during the eqiad full depool?

The s5 candidate can go to any rack in rows C, D, or E (there are no other s5 replicas there), and the s8 candidate can go to any rack in rows C, D, or E (there is only one s8 replica in each of those rows).
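The same-row condition described above can be sketched as a small check. This is only an illustration: it assumes that a rack label such as `A1` encodes its row as the leading letter, and the `placement` mapping is typed in by hand from the racks quoted in this task rather than read from the fault-tolerance tool.

```python
# Minimal sketch: flag sections whose master and candidate master share a row.
# Assumption: the row is the first character of the rack label ("A1" -> row "A").


def row_of(rack: str) -> str:
    """Return the row letter of a rack label such as 'A1'."""
    return rack[0]


# section -> (master rack, candidate master rack), as quoted in the description.
placement = {
    "s8": ("A1", "A6"),
    "s5": ("B3", "B1"),
}


def sections_sharing_row(placement: dict) -> list:
    """List sections whose master and candidate master sit in the same row."""
    return sorted(
        section
        for section, (master_rack, candidate_rack) in placement.items()
        if row_of(master_rack) == row_of(candidate_rack)
    )


print(sections_sharing_row(placement))
```

Both sections are reported, which matches the problem stated in the task title.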

Event Timeline

The s1 and s5 masters are both in B3. We should probably move s5's master instead (to any rack that does not already host a master: https://fault-tolerance.toolforge.org/map?cluster=db-masters)

Also, the candidate masters of s2, s3, and s6 are all in D3.

Not sure whether by "move" you mean physically or logically. The latter is the best and easiest way to do it: just convert a different host into the candidate master.

Logically is definitely preferred, but I fear that in some cases we might not have that option (all other eligible replicas being in a rack that already has a master). Just finding a new candidate master for s1 was quite a pain (T342284#9037553 onwards). I can come up with a list of suggestions and we can take it from there.

For s5, viable options are:

  • db1200 (F2)
  • db1161 (A8)

Every other replica is in the same rack as another master or candidate master. Note that the current s5 master is in the same rack as the x2 and s2 masters, and its candidate is in the same rack as the candidate master of s1, so we probably need to move both.

For s8, viable options are:

  • db1214 (B6)
  • db1167 (C3)
  • db1192 (E3)
  • db1193 (F1)
  • db1203 (F3)

Also, the s6 master is in the same rack as the candidate masters of s2 and s3 (D3).

For s3, viable options for new candidate masters are:

  • db1166 (C3)
  • db1198 (E3)

For s2, likewise:

  • db1129 (A8)
  • db1197 (E2)

The options here might conflict with each other once we decide on one, so let's make sure that doesn't happen. I haven't checked whether they are multi-instance hosts, sanitarium masters, have a history of issues, etc.
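To make the possible conflicts concrete, here is a rough sketch that takes the options listed above and looks for racks proposed for more than one section, then enumerates combinations in which no two new candidate masters share a rack. The `options` mapping is copied by hand from the lists in this thread, and the function names are made up for illustration. It ignores everything not checked yet (multi-instance, sanitarium masters, history of issues) as well as racks already occupied by existing masters.

```python
from itertools import product

# section -> {host: rack}, copied from the viable options listed in this thread.
options = {
    "s2": {"db1129": "A8", "db1197": "E2"},
    "s3": {"db1166": "C3", "db1198": "E3"},
    "s5": {"db1200": "F2", "db1161": "A8"},
    "s8": {"db1214": "B6", "db1167": "C3", "db1192": "E3",
           "db1193": "F1", "db1203": "F3"},
}


def rack_conflicts(options: dict) -> dict:
    """Return racks that are proposed for more than one section."""
    by_rack = {}
    for section, hosts in options.items():
        for host, rack in hosts.items():
            by_rack.setdefault(rack, []).append((section, host))
    return {
        rack: entries
        for rack, entries in by_rack.items()
        if len({section for section, _host in entries}) > 1
    }


def conflict_free_choices(options: dict):
    """Yield one host per section such that all chosen racks are distinct."""
    sections = sorted(options)
    for combo in product(*(options[s].items() for s in sections)):
        racks = [rack for _host, rack in combo]
        if len(set(racks)) == len(racks):
            yield {s: host for s, (host, _rack) in zip(sections, combo)}


for rack, entries in sorted(rack_conflicts(options).items()):
    print(rack, entries)
```

With the lists above, the overlapping racks are A8 (s2 and s5), C3 (s3 and s8), and E3 (s3 and s8), so picking one host can rule out an option for another section.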

And the candidate masters of s4 and s8 are in the same rack. We definitely need to move the s8 candidate.

I would hold off on all this, as there are some hosts that will go into production next quarter, so we can probably play with those too: T344036 T342176

s5 situation is fixed with the new hosts:
db1183 - master - B1
db1230 - candidate - C3

I will see what the possibilities are for s8.

Marostegui triaged this task as Medium priority. Nov 7 2023, 3:27 PM
Marostegui moved this task from Blocked to In progress on the DBA board.

Do you want me to update the fault-tolerance map so you can use it?

I see db1209 as master in https://noc.wikimedia.org/dbconfig/eqiad.json (in A6 in yellow). It could be some caching issue I think. Let me dig.

Yeah, but it is listed as candidate on the fault-tolerance pages (for me at least).

Change 972444 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1192: New candidate master for s8

https://gerrit.wikimedia.org/r/972444

Change 972444 merged by Marostegui:

[operations/puppet@production] db1192: New candidate master for s8

https://gerrit.wikimedia.org/r/972444

Mentioned in SAL (#wikimedia-operations) [2023-11-07T18:50:33Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1192 T346454', diff saved to https://phabricator.wikimedia.org/P53153 and previous config saved to /var/cache/conftool/dbconfig/20231107-185033-root.json

Change 972450 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1126: No longer candidate master

https://gerrit.wikimedia.org/r/972450

Change 972450 merged by Marostegui:

[operations/puppet@production] db1126: No longer candidate master

https://gerrit.wikimedia.org/r/972450

Fixed
db1192 is the new s8 candidate master