Replace some masters in eqiad while it is not active
Closed, Resolved, Public

Description

There are some masters in eqiad that will eventually need to go away, as per T134476.
The affected masters are:

s2 - db1018
s4 - db1040
s5 - db1049
s6 - db1050
s7 - db1041

Some of them are easier than others, not because of the technical procedure (which is the same for all) but because of data consistency across the shard, etc.
We are currently checksumming all the shards and fixing as many inconsistencies as possible, but it is a slow process.

We'd also need to decide which server to promote to master in each shard and analyze its history (mostly HW issues) to make sure we promote a reliable one.
There is an initial draft commit showing what the db-eqiad.php file would look like after all the decommissions and after moving servers around: https://gerrit.wikimedia.org/r/#/c/338996/

Proposed switchover summary:

  • s2 - db1054 (looks good and checksummed)
  • s4 - db1068 (crashed once, unknown data state; alternatives: recloning it from the master or from the large servers to be 100% sure?) - DONE 20th April
  • s5 - db1063 (looks good, but needs cloning)
  • s6 - db1061 (crashed once, almost finished checksumming, but overall ok)
  • s7 - db1062 (looks good, unknown data state)

Event Timeline

For s2 the suggested host in https://gerrit.wikimedia.org/r/#/c/338996/ was db1054.
I have researched its history, and the only HW issue it has had (or that was logged) is from 2 years ago: T89801
Apart from that, the pt-table-checksum results for s2 (T161510) show no differences on this host in any of the checks, so overall it looks like a good candidate.
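
As a rough illustration of what "no differences in any of the checks" means: pt-table-checksum records per-chunk results in a checksums table, and a chunk counts as different when the replica's row count or CRC does not match the master's. The sketch below assumes the tool's default column names (`db`, `tbl`, `chunk`, `this_crc`, `this_cnt`, `master_crc`, `master_cnt`) and that the rows have already been fetched from the replica; the function name and the sample rows are hypothetical, not real s2 data.

```python
from typing import Iterable, List, Mapping, Tuple


def differing_chunks(rows: Iterable[Mapping]) -> List[Tuple[str, str, int]]:
    """Return (db, tbl, chunk) for every chunk where the replica disagrees with the master."""
    diffs = []
    for row in rows:
        # A chunk differs if either the row count or the CRC does not match.
        # A CRC missing on only one side (None vs. a value) also counts as a difference.
        cnt_differs = row["this_cnt"] != row["master_cnt"]
        crc_differs = row["this_crc"] != row["master_crc"]
        if cnt_differs or crc_differs:
            diffs.append((row["db"], row["tbl"], row["chunk"]))
    return diffs


# Hypothetical sample rows, only to show the shape of the data.
sample_rows = [
    {"db": "examplewiki", "tbl": "page", "chunk": 1,
     "this_crc": "a1b2c3", "master_crc": "a1b2c3",
     "this_cnt": 1000, "master_cnt": 1000},
    {"db": "examplewiki", "tbl": "page", "chunk": 2,
     "this_crc": "ffff00", "master_crc": "0000ff",
     "this_cnt": 1000, "master_cnt": 1000},
]

print(differing_chunks(sample_rows))
```

An empty list for a host would correspond to the "no differences" result described above.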

For s4 the suggested host in https://gerrit.wikimedia.org/r/#/c/338996/ was db1068. Jaime mentioned there could be underlying issues with this host and with the consistency of the whole shard, so we'd need to check it with pt-table-checksum once that has run on s4.

Inconsistencies aside, the only issue reported was HT being disabled (T156140), but it looks like it is enabled now:

root@db1068:/srv/sqldata/commonswiki# dmidecode -t processor | grep -E '(Core Count|Thread Count)'
	Core Count: 10
	Thread Count: 20
	Core Count: 10
	Thread Count: 20

The host didn't have any logged HW issue.
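
The HT check above is read by eye: each socket reports a Thread Count that is double its Core Count. A minimal sketch of automating that reading follows, assuming that "HT enabled" can be inferred from Thread Count being greater than Core Count on every socket. The function name is hypothetical, and the sample text is the db1068 output quoted in this task.

```python
import re


def hyperthreading_enabled(dmidecode_output: str) -> bool:
    """Infer HT state from `dmidecode -t processor` text (one count pair per socket)."""
    cores = [int(n) for n in re.findall(r"Core Count:\s*(\d+)", dmidecode_output)]
    threads = [int(n) for n in re.findall(r"Thread Count:\s*(\d+)", dmidecode_output)]
    if not cores or len(cores) != len(threads):
        raise ValueError("could not pair Core Count and Thread Count lines")
    # Assumption: with HT on, every socket exposes more threads than cores.
    return all(t > c for c, t in zip(cores, threads))


# Output copied from the db1068 check in this task.
DB1068_OUTPUT = (
    "\tCore Count: 10\n"
    "\tThread Count: 20\n"
    "\tCore Count: 10\n"
    "\tThread Count: 20\n"
)

print(hyperthreading_enabled(DB1068_OUTPUT))
```

The same function could be fed the db1063 and db1061 outputs further down, which show 8 cores and 16 threads per socket.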

For s5 the suggested host in https://gerrit.wikimedia.org/r/#/c/338996/ was db1063.
db1063 currently lives in s2, so it'd need to be recloned (which is fine). It is running jessie 8.7 and MariaDB 10.0.29.

Apart from some reports of HT not being enabled (T156140), which are no longer valid as it is already enabled:

root@db1063:~# dmidecode -t processor | grep -E '(Core Count|Thread Count)'
	Core Count: 8
	Thread Count: 16
	Core Count: 8
	Thread Count: 16

And apart from a degraded RAID ticket, there are no other HW issues reported for this host, so it looks good to be a master.

For s6 the suggested host in https://gerrit.wikimedia.org/r/#/c/338996/ was db1061.
The recent pt-table-checksum run on s6 (T160509) revealed no inconsistencies on that host.
HW-wise:

  • There was a mysterious crash that happened once (back in Sept 2016, so almost 6 months ago): T146018. The root cause was never found, but it has not happened again. Server uptime is now almost 200 days:
root@db1061:~# uptime
 13:30:26 up 198 days,  1:56,  1 user,  load average: 0.35, 0.17, 0.11
  • There was a minor issue with the ILO, which was fixed: T138368
  • HT was reported to be disabled as in many hosts, but it is currently enabled:
root@db1061:~#  dmidecode -t processor | grep -E '(Core Count|Thread Count)'
	Core Count: 8
	Thread Count: 16
	Core Count: 8
	Thread Count: 16

This server is running jessie 8.6 and 10.0.23 (and currently lives in s6 itself, so no recloning would be needed)

For s7 the suggested host in https://gerrit.wikimedia.org/r/#/c/338996/ was db1062.
We haven't run pt-table-checksum on s7 yet, so data-wise we are uncertain about the consistency across the shard.
HW-wise:
There was only a minor thing with the ILO that got fixed: T138368
Apart from that, there are no tickets logged or changesets regarding this server being broken.

The server has been up for 286 days now.
Running jessie 8.5 and 10.0.23

Let's coordinate with @ayounsi before attempting a switchover of any of the masters, to make sure T148506 and T162681 are not in the way of this.

Change 349249 had a related patch set uploaded (by Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Promote db1068 to master

https://gerrit.wikimedia.org/r/349249

Change 349249 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Promote db1068 to master

https://gerrit.wikimedia.org/r/349249

db1063 is ready to be the master in s5.
Its binlog format needs to be changed to STATEMENT.

The following hosts will be affected by the recabling of row D (T162681), which is happening on Wednesday 26th; they will lose network connectivity for a bit:

db1068 (current s4 master)
db1063 (future s5 master)
db1061 (future s6 master)
db1062 (future s7 master)

I have migrated dbstore1001 in advance for s2, s5 and s6 (that means its shards are now on their definitive hosts, except for s7, x1 and the misc servers).

Change 350127 had a related patch set uploaded (by Jcrespo):
[operations/mediawiki-config@master] mariadb: Promote db1054 as the new s2 master on eqiad

https://gerrit.wikimedia.org/r/350127

Change 350130 had a related patch set uploaded (by Jcrespo):
[operations/puppet@production] mariadb: promote db1054 as the new s2 eqiad master

https://gerrit.wikimedia.org/r/350130

Change 350130 merged by Jcrespo:
[operations/puppet@production] mariadb: promote db1054 as the new s2 eqiad master

https://gerrit.wikimedia.org/r/350130

Change 350127 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Promote db1054 as the new s2 master on eqiad

https://gerrit.wikimedia.org/r/350127

Change 350136 had a related patch set uploaded (by Marostegui):
[operations/software@master] s2.hosts: Move db1054 as the new master

https://gerrit.wikimedia.org/r/350136

Change 350136 merged by Jcrespo:
[operations/software@master] s2.hosts: Move db1054 as the new master

https://gerrit.wikimedia.org/r/350136

Change 350143 had a related patch set uploaded (by Jcrespo):
[operations/puppet@production] Promote db1054 as s2 eqiad master

https://gerrit.wikimedia.org/r/350143

Change 350143 merged by Jcrespo:
[operations/puppet@production] Promote db1054 as s2 eqiad master

https://gerrit.wikimedia.org/r/350143

Change 350155 had a related patch set uploaded (by Jcrespo):
[operations/puppet@production] Change db1061 to be the s6 master on eqiad

https://gerrit.wikimedia.org/r/350155

Change 350155 merged by Jcrespo:
[operations/puppet@production] Change db1061 to be the s6 master on eqiad

https://gerrit.wikimedia.org/r/350155

Change 350164 had a related patch set uploaded (by Jcrespo):
[operations/mediawiki-config@master] mariadb: Depool db1022, promote db1061 as the s6 eqiad master

https://gerrit.wikimedia.org/r/350164

Change 350164 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: Depool db1022, promote db1061 as the s6 eqiad master

https://gerrit.wikimedia.org/r/350164

Change 350168 had a related patch set uploaded (by Marostegui):
[operations/software@master] s6.host: db1061 is the new master

https://gerrit.wikimedia.org/r/350168

Change 350168 merged by Jcrespo:
[operations/software@master] s6.host: db1061 is the new master

https://gerrit.wikimedia.org/r/350168

Change 350171 had a related patch set uploaded (by Jcrespo):
[operations/puppet@production] Change db1061 to be the s6 master on eqiad

https://gerrit.wikimedia.org/r/350171

jcrespo renamed this task from "Analyze if we want to replace some masters in eqiad while it is not active" to "Replace some masters in eqiad while it is not active". Apr 25 2017, 1:50 PM

Change 350205 had a related patch set uploaded (by Jcrespo):
[operations/mediawiki-config@master] mariadb: switch s7 eqiad master from db1041 to db1062

https://gerrit.wikimedia.org/r/350205

Change 350209 had a related patch set uploaded (by Jcrespo):
[operations/puppet@production] mariadb: promote db1062 as the new master of s7 eqiad

https://gerrit.wikimedia.org/r/350209

Change 350171 merged by Jcrespo:
[operations/puppet@production] prometheus-mysqld-exporter: Change db1061 to be the s6 master on eqiad

https://gerrit.wikimedia.org/r/350171

Change 350209 merged by Jcrespo:
[operations/puppet@production] mariadb: promote db1062 as the new master of s7 eqiad

https://gerrit.wikimedia.org/r/350209

Change 350205 merged by jenkins-bot:
[operations/mediawiki-config@master] mariadb: switch s7 eqiad master from db1041 to db1062

https://gerrit.wikimedia.org/r/350205

Change 350221 had a related patch set uploaded (by Jcrespo):
[operations/software@master] Set db1062 as the last component of s7

https://gerrit.wikimedia.org/r/350221

Change 350221 merged by Jcrespo:
[operations/software@master] Set db1062 as the last component of s7

https://gerrit.wikimedia.org/r/350221

Change 350227 had a related patch set uploaded (by Jcrespo):
[operations/software@master] Set db1063 as the last server on s7

https://gerrit.wikimedia.org/r/350227

Change 350228 had a related patch set uploaded (by Jcrespo):
[operations/puppet@production] mariadb: promote db1063 as s5 master

https://gerrit.wikimedia.org/r/350228

Change 350230 had a related patch set uploaded (by Jcrespo):
[operations/mediawiki-config@master] mariadb: Promote db1063 as the master of s5 eqiad

https://gerrit.wikimedia.org/r/350230

Still pending: the s5 eqiad master switch and the dbstore1001 master change for s7.

Change 350228 merged by Jcrespo:
[operations/puppet@production] mariadb: promote db1063 as s5 master

https://gerrit.wikimedia.org/r/350228

Change 350227 merged by Jcrespo:
[operations/software@master] Set db1063 as the last server on s5

https://gerrit.wikimedia.org/r/350227

Change 350230 merged by Jcrespo:
[operations/mediawiki-config@master] mariadb: Promote db1063 as the master of s5 eqiad

https://gerrit.wikimedia.org/r/350230

jcrespo updated the task description.