DBA plan to mitigate asw-c2-eqiad reboots
Closed, Resolved | Public

Description

(This has been done between @jcrespo and myself after having a chat with @faidon) - feel free to edit if you see something wrong

Databases present:

  • es1016 - static (ro) external storage, can be depooled at any time, and it *should* depool automatically when unavailable
    • Plan: depool indefinitely
  • db1060 - s2 api "SPOF" because 54 is on the same rack. Technically, API goes to the other main servers automatically, but the performance may not be ideal
    • Plan: pool an s2 api server somewhere else (maybe moving just the chassis somewhere else?)
  • db1059 - s4 api - should be covered by 68, can be depooled at any time, and it *should* depool automatically when unavailable
    • Plan: depool indefinitely
  • db1057 - s1 MASTER SPOF. MediaWiki goes read-only when it is unavailable.
    • Plan: switch over to another server
  • db1056 - s4 rc - should be covered by 63, can be depooled at any time, and it *should* depool automatically when unavailable
    • Plan: depool indefinitely
  • db1055 - s1 rc "SPOF" because 51 is on the same rack. Technically, rc traffic goes to the other main servers automatically, but the performance may not be ideal
    • Plan: pool an s1 rc server somewhere else (requires partitioning - copy from codfw? move server physically?)
  • db1054 - s2 api "SPOF" because 60 is on the same rack. Technically, API goes to the other main servers automatically, but the performance may not be ideal
    • Plan: pool an s2 api server somewhere else
  • db1052 - s1 old master (db1095's master) - it should only affect new labsdb servers, not a priority
    • Plan: do nothing
  • db1051 - s1 rc "SPOF" because 55 is on the same rack. Technically, rc traffic goes to the other main servers automatically, but the performance may not be ideal
    • Plan: pool an s1 rc server somewhere else (requires partitioning - copy from codfw?) - move the chassis somewhere else?
  • db1088 - s6 - should be covered by 85 and 93, but because of its large weight, losing it could hurt rebalancing; it should be depooled or given a lower weight to avoid heavy flapping
    • Plan: depool or lower weight
  • db1087 - s5 - should be covered by 82 and 92, but because of its large weight, losing it could hurt rebalancing; it should be depooled or given a lower weight to avoid heavy flapping
    • Plan: depool or lower weight
  • labstore1004 (only regarding db stuff): it handles labsdb accounting. According to Yuvi, when it fails, new account creation is paused, but it should go back to normal, without account loss, once it is back up
    • Plan: do nothing
  • es1015 - es2 slave. Can be depooled at any time, and it *should* depool automatically when unavailable
    • Plan: depool indefinitely
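
To illustrate why the db1088 and db1087 entries above suggest depooling or lowering the weight first, here is a minimal sketch. It assumes reads are spread roughly in proportion to each host's weight. The weights are made up for illustration and are not the real values from db-eqiad.php, and expanding "85" and "93" to db1085 and db1093 is an assumption based on the shorthand used in this task.

```python
def traffic_share(weights):
    """Return each host's fraction of read traffic, proportional to its weight."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one host must have a non-zero weight")
    return {host: weight / total for host, weight in weights.items()}


def share_to_absorb(weights, failed_host):
    """Fraction of all reads the remaining hosts must pick up if failed_host vanishes."""
    return traffic_share(weights)[failed_host]


# Hypothetical weights, NOT the production configuration.
s6_before = {"db1088": 500, "db1085": 250, "db1093": 250}
s6_lowered = {"db1088": 100, "db1085": 250, "db1093": 250}

for label, weights in (("before", s6_before), ("lowered", s6_lowered)):
    moved = share_to_absorb(weights, "db1088")
    print(f"{label}: {moved:.1%} of s6 reads move if db1088 disappears")
```

With these made-up numbers, an unplanned loss of db1088 shifts half of the reads at once, while lowering its weight beforehand leaves about a sixth to move, which is the flapping the plan tries to avoid.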

Impact if the switch goes down:

  • s1 would run out of rc hosts, which means RC stops working for enwiki
    • Quickest way of solving it: move one of the servers to another rack?
  • s2 would run out of api hosts
    • Quickest way of solving it: move one of the servers to another rack?
  • s1 thread for new labsdb servers will be out of date until the network is back (not a priority)
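
The impact list above can be derived mechanically from rack positions. The sketch below is only an illustration: the inventory is hand-copied from this task, not read from db-eqiad.php. The covering hosts "68" and "63" are assumed to be db1068 and db1063, and their rack is a placeholder because the task does not state it.

```python
from collections import defaultdict

# (host, section, group, rack), hand-copied from the task description.
# "other" is a placeholder rack for covering hosts whose position is not given here.
HOSTS = [
    ("db1051", "s1", "rc", "C2"),
    ("db1055", "s1", "rc", "C2"),
    ("db1054", "s2", "api", "C2"),
    ("db1060", "s2", "api", "C2"),
    ("db1059", "s4", "api", "C2"),
    ("db1068", "s4", "api", "other"),  # assumed expansion of "68"
    ("db1056", "s4", "rc", "C2"),
    ("db1063", "s4", "rc", "other"),   # assumed expansion of "63"
]


def groups_lost_with_rack(hosts, rack):
    """Return the (section, group) pairs whose hosts all sit in the given rack."""
    racks_by_group = defaultdict(set)
    for _host, section, group, host_rack in hosts:
        racks_by_group[(section, group)].add(host_rack)
    return sorted(key for key, racks in racks_by_group.items() if racks == {rack})


print(groups_lost_with_rack(HOSTS, "C2"))
```

For this inventory the check reports s1 rc and s2 api, matching the two groups listed above as running out of hosts.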

Roadmap:

  • s1 - move db1051 to another rack (high priority) -> maybe to B3 - T156004
  • s1 - move db1052 to another rack (high priority) -> maybe to B3 - T156006
  • s1 - reimage db1065/db1066 to 10.0.28 - T156005
  • s1 - switchover master: db1057 -> db1052 - T156008
  • s1 - move db1073 to another rack (they are all on D1) - T156126
  • s2 - move db1054 to another rack -> maybe to C3 - T156225

We will create subtasks soon

Marostegui changed the status of subtask T156006: Move db1052 to row B3 from "Stalled" to "Open". Jan 23 2017, 3:44 PM

Change 333851 had a related patch set uploaded (by Marostegui):
db-codfw,db-eqiad.php: Add rack positions for s1

https://gerrit.wikimedia.org/r/333851

Change 333851 merged by jenkins-bot:
db-codfw,db-eqiad.php: Add rack positions for s1

https://gerrit.wikimedia.org/r/333851

Mentioned in SAL (#wikimedia-operations) [2017-01-24T08:26:19Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: wmf-config/db-eqiad.php Add rack positions - T155999 (duration: 00m 50s)

Mentioned in SAL (#wikimedia-operations) [2017-01-24T08:28:56Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Add rack positions - T155999 (duration: 00m 41s)

Marostegui edited the task description. Jan 24 2017, 12:31 PM
Marostegui edited the task description.
Marostegui edited the task description. Jan 24 2017, 12:51 PM

Change 333914 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Depool db1073

https://gerrit.wikimedia.org/r/333914

Change 333914 merged by jenkins-bot:
db-eqiad.php: Depool db1073

https://gerrit.wikimedia.org/r/333914

Mentioned in SAL (#wikimedia-operations) [2017-01-24T14:50:44Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1073 - T155999 (duration: 00m 39s)

Mentioned in SAL (#wikimedia-operations) [2017-01-24T16:04:49Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1073 - T155999 (duration: 00m 48s)

Change 333952 had a related patch set uploaded (by Jcrespo):
mariadb: Move db1072 back to a normal slave

https://gerrit.wikimedia.org/r/333952

Change 333953 had a related patch set uploaded (by Jcrespo):
MariaDB: Setting db1065 as the new master of sanitarium2

https://gerrit.wikimedia.org/r/333953

Change 333952 merged by Jcrespo:
mariadb: Move db1072 back to a normal slave

https://gerrit.wikimedia.org/r/333952

Change 333953 merged by Jcrespo:
MariaDB: Setting db1065 as the new master of sanitarium2

https://gerrit.wikimedia.org/r/333953

Mentioned in SAL (#wikimedia-operations) [2017-01-24T18:10:19Z] <marostegui> restart mysql db1065 maintenance - https://phabricator.wikimedia.org/T155999)

Change 333976 had a related patch set uploaded (by Jcrespo):
mariadb: repool db1065 as dump/vslow & clean up s1 comments

https://gerrit.wikimedia.org/r/333976

Change 333976 merged by jenkins-bot:
mariadb: repool db1065 as dump/vslow & clean up s1 comments

https://gerrit.wikimedia.org/r/333976

For the record and tracking purposes: after many hours and a lot of hassle, we were able to switch db1095's (new sanitarium) master from db1052 to db1065. It took much longer than expected because db1072 had a different schema on one of the PK-less tables (T156166).

Elitre added a subscriber: Elitre. Jan 25 2017, 6:56 PM

Change 335198 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Repool hosts in C2

https://gerrit.wikimedia.org/r/335198

Change 335198 merged by jenkins-bot:
db-eqiad.php: Repool hosts in C2

https://gerrit.wikimedia.org/r/335198

Mentioned in SAL (#wikimedia-operations) [2017-01-31T09:46:46Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool hosts in C2 - T155999 (duration: 00m 40s)

Marostegui closed this task as "Resolved". Feb 9 2017, 7:22 AM

All the initial actions listed on the original ticket to mitigate this issue have been completed. The only pending item is T156475, which is a subtask.
I will be closing this major ticket but leaving the subtask open, as it still needs investigation.

Thanks everyone involved here for all the help and support!