
Migrate s4 from db1095 to db1102
Closed, ResolvedPublic

Description

db1095 doesn't have much available space, whereas db1102 still has around 2T free.

The current size of commons on db1095 is:

root@db1095:/srv/sqldata# du -sh commonswiki/
714G	commonswiki/

Even though db1095 needs to be migrated to multi-instance soon, we should move ahead and place s4 on db1102 and point the labs hosts to replicate from it.

This is a draft of the steps needed (a sketch of the corresponding MariaDB commands follows the list):

  • Stop 's4' thread on db1095 and note down position and binlog
  • mysqldump (or mydumper, or xtrabackup) db1095 for commonswiki
  • Add s4 in db1102 puppet's files
  • Copy the content to db1102
  • Start replication on db1102 (its master will be db1064) and make sure it replicates fine.
  • Stop replication on db1064 and make sure db1095 and db1102 stopped at the same position
  • Move labsdb1009,10 and 11 s4 thread to replicate from db1102. Do not start replication yet
  • Note db1095's replication position for the s4 thread, then SET default_master_connection='s4'; RESET SLAVE ALL;
  • Configure and start replication on labsdb1009,10 and 11 for the new 's4' thread replicating from db1102
  • Start replication on db1102 from db1064
  • Leave it a few days and then remove commonswiki database from db1095
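
For reference, a minimal sketch of the MariaDB multi-source commands behind these steps (hosts and the connection name are from this task; binlog file, position and replication credentials are placeholders or omitted):

-- On db1095: stop only the s4 connection and note its coordinates
STOP SLAVE 's4';
SHOW SLAVE 's4' STATUS\G  -- note Relay_Master_Log_File / Exec_Master_Log_Pos

-- On db1102: create the s4 connection pointing at db1064 and start it
CHANGE MASTER 's4' TO
  MASTER_HOST='db1064.eqiad.wmnet',
  MASTER_LOG_FILE='db1064-bin.0XXXXX',  -- placeholder
  MASTER_LOG_POS=XXXXXXXX;              -- placeholder
START SLAVE 's4';

-- On db1095, once the labs hosts replicate from db1102: remove the old connection
SET default_master_connection='s4';
RESET SLAVE ALL;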

Event Timeline

Restricted Application added subscribers: PokestarFan, Aklapper. · View Herald Transcript · Aug 10 2017, 2:05 PM
Marostegui triaged this task as Medium priority. · Aug 10 2017, 2:05 PM
Marostegui moved this task from Triage to Next on the DBA board.
Marostegui updated the task description.

While this has to happen at some point, I now think that s4 has higher priority, because it is lagging and dragging other shards (mainly s1) with it. So we should probably move either s1 or s4 first.

jcrespo updated the task description. · Aug 11 2017, 8:40 AM

s4 could fit too. It is taking around 700G on db1095, so that fits in db1102 with no issues (2T available).
And that would leave db1095 with around 1T free.

Marostegui renamed this task from "Migrate s5 from db1095 to db1102" to "Migrate s4 from db1095 to db1102". · Aug 11 2017, 9:59 AM
Marostegui updated the task description.

Not sure how xtrabackup would work (or how InnoDB would behave) given that db1095 now has a single ibdata1 for all the shards. So copying it to db1102 and only bringing up s4 might be an issue now, or could become one in the future.
Thinking out loud, because I am not completely sure about it.
That is why I would prefer to do a logical copy.
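
A quick check along these lines could confirm the concern (a sketch of standard commands, assuming shell and mysql access on db1095; the path is the one shown in the description):

# If this was OFF when the shards were imported, their data lives inside the shared ibdata1
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_file_per_table';"
# Size of the shared tablespace itself
ls -lh /srv/sqldata/ibdata1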

check that mydumper copies the triggers, too.

> check that mydumper copies the triggers, too.

Very good point!
Looks like we can do it:

--triggers, -G
       Dump triggers

Although before starting anything it would be good to run redact_output and check_private_data, just to be on the safe side.
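
For what it is worth, the dump invocation could look something like this (a sketch; user, thread count and output directory are illustrative assumptions, not values from this task):

mydumper --host=db1095.eqiad.wmnet --user=root \
         --database=commonswiki --triggers \
         --threads=4 --compress \
         --outputdir=/srv/tmp/s4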

I am going to drop s6 from sanitarium: it is running out of space (probably because of s3), and s6 is the "easiest" to reimport if there is any problem. I want to keep db1069 around for longer, in case there are issues with the temporary substitutes.

> I am going to drop s6 from sanitarium: it is running out of space (probably because of s3), and s6 is the "easiest" to reimport if there is any problem. I want to keep db1069 around for longer, in case there are issues with the temporary substitutes.

Just for the record, there is still some space available on the PV (not that I disagree with dropping s6 anyway!)

root@db1069:~# pvs
  PV         VG   Fmt  Attr PSize PFree
  /dev/sda3  tank lvm2 a--  3.23t 379.72g
root@db1069:~#

Change 372518 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] sanitarium3: Prepare db1102 to run s4 instance

https://gerrit.wikimedia.org/r/372518

Mentioned in SAL (#wikimedia-operations) [2017-08-21T07:13:18Z] <marostegui> Stop s4 replication thread on db1095 - T172996

Change 372518 merged by Marostegui:
[operations/puppet@production] sanitarium3: Prepare db1102 to run s4 instance

https://gerrit.wikimedia.org/r/372518

Marostegui moved this task from Next to In progress on the DBA board.

> Stop 's4' thread on db1095 and note down position and binlog

That is wrong, and you are causing unnecessary lag on labs. Please start replication.

>> Stop 's4' thread on db1095 and note down position and binlog
> That is wrong, and you are causing unnecessary lag on labs. Please start replication.

That is done already.

>> Stop 's4' thread on db1095 and note down position and binlog
> That is wrong, and you are causing unnecessary lag on labs. Please start replication.

By the way, I don't think it is wrong. It may be unnecessary or overkill, but I prefer to be on the safe side when copying such a big amount of data, to avoid any unexpected issues when loading on the other host. It was stopped for 1 hour.

> any unexpected issues when loading on the other host

Such as...? Will you do the same on mediawiki core?

>> any unexpected issues when loading on the other host
> Such as...?

Such as mydumper not correctly saving the binlog position/file in the metadata file. As this case was an "easy" one, I preferred to do it that way.

> Will you do the same on mediawiki core?

No

Did you get a different number than the tool did automatically?

root@db1102:/srv/tmp/s4/s4$ grep -A3 s4 metadata 
        Connection name: s4
        Host: db1064.eqiad.wmnet
        Log: db1064-bin.003435
        Pos: 421548654

> Did you get a different number than the tool did automatically?
>
> root@db1102:/srv/tmp/s4/s4$ grep -A3 s4 metadata
>         Connection name: s4
>         Host: db1064.eqiad.wmnet
>         Log: db1064-bin.003435
>         Pos: 421548654

No, it is the same.

If you do not trust mydumper, we would want to know ASAP, so what better way than using it in a non-critical place and having it break, or running a compare tool? One of our goals is to "test" mydumper, and by not using it to the full extent of its possibilities we are not "testing" it. By not testing it, we will never know if it works, and we will keep treating it as an unreliable tool.

> unnecessary or overkill

Creating unnecessary disturbance to labs users. Sometimes it is needed and there is no other way. This was not it.

> If you do not trust mydumper, we would want to know ASAP, so what better way than using it in a non-critical place and having it break, or running a compare tool? One of our goals is to "test" mydumper, and by not using it to the full extent of its possibilities we are not "testing" it. By not testing it, we will never know if it works, and we will keep treating it as an unreliable tool.

I did check the metadata file when I started it, to make sure it had been written correctly and to check whether it was trustworthy.

>> unnecessary or overkill
> Creating unnecessary disturbance to labs users. Sometimes it is needed and there is no other way. This was not it.

I know, yes. I thought 1 hour of lag on s4 was a price worth paying to avoid the risk of having to reimport this twice.

db1102 is now replicating s4 and catching up. The triggers are in place and so are the filters. However, I will run check_private_data once it has caught up, to be on the safe side.
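
For the record, the verification amounts to something like this (a sketch using standard MariaDB commands; nothing beyond the names already mentioned in this task):

-- On db1102: is the s4 connection healthy, and how far behind is it?
SHOW SLAVE 's4' STATUS\G  -- check Slave_IO_Running, Slave_SQL_Running, Seconds_Behind_Master
-- Did the sanitization triggers make it across?
SHOW TRIGGERS FROM commonswiki;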

Change 372956 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/software@master] s4.hosts: Add db1102

https://gerrit.wikimedia.org/r/372956

Change 372956 merged by jenkins-bot:
[operations/software@master] s4.hosts: Add db1102

https://gerrit.wikimedia.org/r/372956

Change 373016 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad.php: Depool db1064

https://gerrit.wikimedia.org/r/373016

Change 373016 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad.php: Depool db1064

https://gerrit.wikimedia.org/r/373016

Mentioned in SAL (#wikimedia-operations) [2017-08-22T07:32:14Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1064 - T172996 (duration: 00m 52s)

Mentioned in SAL (#wikimedia-operations) [2017-08-22T07:34:15Z] <marostegui> Stop replication on db1064 sanitarium2 and sanitarium3 master to move labsdb1009,10 and 11 s4 from db1095 to db1102 - https://phabricator.wikimedia.org/T172996

Mentioned in SAL (#wikimedia-operations) [2017-08-22T08:24:52Z] <marostegui> Drop s4 from db1095 with NO replication - T172996

Marostegui closed this task as Resolved. · Aug 22 2017, 8:49 AM

This has all been done: commonswiki has been dropped from db1095 and disk space is back to normal values.
Replication remains stopped on labsdb1010 (not in use) and I have made a note for myself to start it again in 24h, once it is clear nothing is broken on the other hosts.
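
When the time comes, restarting it should be as simple as (a sketch, assuming only the s4 connection was left stopped):

-- On labsdb1010:
SET default_master_connection='s4';
START SLAVE;
-- or, if every connection was stopped:
START ALL SLAVES;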

Mentioned in SAL (#wikimedia-operations) [2017-08-22T09:34:44Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1064 - T172996 (duration: 00m 44s)