
Perform a rolling restart of all MySQL slaves (masters too for those services with low traffic)
Closed, ResolvedPublic

Description

Ideally, apply all of these at once:

  1. Upgrading to jessie
  2. Upgrading to the latest mariadb package
  3. Enabling ssl
  4. Enabling ferm
  5. Enabling performance_schema
  6. Setting ROW-based replication on codfw

Some of these may not be feasible due to resources (reinstalling 150 machines in the short term) or have hard blockers: ROW cannot be enabled on eqiad due to labs (maybe nowhere, if we are going to fail over soon), performance_schema has not yet been properly tested on the busiest servers, and we cannot upgrade some API servers. ROW-based replication is not a hard blocker for the restart, although it has to be done very carefully.
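As a rough illustration of the ordering implied by the task title (slaves first, masters only for low-traffic services), here is a minimal Python sketch that builds a restart plan. The `Host` type, the `low_traffic` flag and the `plan_rolling_restart` function are hypothetical names made up for this example, and the depool/restart/repool steps are only labels; this is not the tooling actually used for this task.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Host:
    name: str                  # e.g. "db1022"
    is_master: bool
    low_traffic: bool = False  # assumption: marks a service whose master may be restarted


def plan_rolling_restart(hosts: List[Host]) -> List[Tuple[str, str]]:
    """Return (action, host) steps: slaves one at a time, then eligible masters."""
    steps: List[Tuple[str, str]] = []
    # Slaves are taken out of rotation, restarted and put back, one by one.
    for host in hosts:
        if not host.is_master:
            steps += [("depool", host.name),
                      ("restart", host.name),
                      ("repool", host.name)]
    # Masters are restarted last, and only for low-traffic services.
    for host in hosts:
        if host.is_master and host.low_traffic:
            steps.append(("restart", host.name))
    return steps
```

Masters of busy services are deliberately left out of the plan; under the assumptions above they would need a separate failover rather than an in-place restart.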

Event Timeline

jcrespo claimed this task.
jcrespo raised the priority of this task to High.
jcrespo updated the task description.
jcrespo added projects: SRE, DBA.

Change 270133 had a related patch set uploaded (by Jcrespo):
Installing jessie on db1024

https://gerrit.wikimedia.org/r/270133

Change 270133 merged by Jcrespo:
Installing jessie on db1024

https://gerrit.wikimedia.org/r/270133

Change 270142 had a related patch set uploaded (by Jcrespo):
s2: bye bye coredb; hello mariadb 10 with jessie

https://gerrit.wikimedia.org/r/270142

Change 270142 merged by Jcrespo:
s2: bye bye coredb; hello mariadb 10 with jessie

https://gerrit.wikimedia.org/r/270142

Change 270752 had a related patch set uploaded (by Volans):
Depool of db1022 for maintenance

https://gerrit.wikimedia.org/r/270752

Change 270752 merged by jenkins-bot:
Depool of db1022 for maintenance

https://gerrit.wikimedia.org/r/270752

Change 270975 had a related patch set uploaded (by Volans):
Repool of db1022 after maintenance

https://gerrit.wikimedia.org/r/270975

Change 271249 had a related patch set uploaded (by Jcrespo):
Repool db1022 as regular traffic API

https://gerrit.wikimedia.org/r/271249

Change 271249 merged by Jcrespo:
Repool db1022 as regular traffic API

https://gerrit.wikimedia.org/r/271249

Change 270975 abandoned by Jcrespo:
Repool of db1022 after maintenance

Reason:
Forgot about this, I already merged https://gerrit.wikimedia.org/r/#/c/271249/

https://gerrit.wikimedia.org/r/270975

Change 273196 had a related patch set uploaded (by Jcrespo):
Repool db1021 and db1024, both with low/non critical load

https://gerrit.wikimedia.org/r/273196

Change 273196 merged by Jcrespo:
Repool db1021 and db1024, both with low/non critical load

https://gerrit.wikimedia.org/r/273196

When doing the rolling restart, also check the error log: if it is too big, let's rotate it and compress or delete the old one, depending on its size.

For the current situation of bigger error logs see T127636#2205361
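The error log check described above could be sketched roughly as follows in Python. The function name `rotate_error_log` and both thresholds (`rotate_above`, `delete_above`) are assumptions for illustration, as is the reading of "based on size" as "compress moderately large logs, delete very large ones". The sketch also assumes the server is stopped, as it would be during the restart, since a running server keeps the old file open until its logs are flushed.

```python
import gzip
import os
import shutil
import time
from typing import Optional


def rotate_error_log(path: str, rotate_above: int, delete_above: int) -> Optional[str]:
    """Rotate `path` if it is larger than `rotate_above` bytes.

    Returns the path of the compressed archive, or None if nothing was kept.
    """
    if not os.path.isfile(path) or os.path.getsize(path) <= rotate_above:
        return None  # missing or small enough: leave it alone

    size = os.path.getsize(path)
    rotated = "{}.{}".format(path, time.strftime("%Y%m%d%H%M%S"))
    os.rename(path, rotated)  # the server recreates the log when it starts

    if size > delete_above:
        os.remove(rotated)  # assumed policy: too big to be worth keeping
        return None

    archive = rotated + ".gz"
    with open(rotated, "rb") as src, gzip.open(archive, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream, so large logs are not loaded into memory
    os.remove(rotated)
    return archive
```

The thresholds are passed in explicitly rather than hard-coded, because the task does not say what size counts as "too big".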

Change 284516 had a related patch set uploaded (by Jcrespo):
Set db1031 as the local eqiad master and set it to ROW binlog

https://gerrit.wikimedia.org/r/284516

Change 284516 merged by Jcrespo:
Set db1031 as the local eqiad master and set it to ROW binlog

https://gerrit.wikimedia.org/r/284516

I am going to go ahead and say this is done; the rest of the servers will be done at a natural pace as they are upgraded to Jessie. Ferm has been applied to all eqiad core servers, including the masters, and so has TLS (needs reviewing on its ticket). We may or may not re-enable ROW on codfw.