
Upgrade m3 (phabricator) db servers
Closed, Resolved · Public

Description

db1043 and db1048 are still on precise and mysql 5.5. They have to be upgraded to jessie and MariaDB 10.

  • First schedule: Failover the slave (db1048), depool it and upgrade it: 29 June
  • Second schedule: Failover the master to the slave, depool the master: 21 July

Related Objects

Event Timeline


@jcrespo: it's still good to go as far as I am concerned. @Aklapper: any objections?

Change 296877 merged by Jcrespo:
Disable crons using the phabricator db slave due to maintenance

https://gerrit.wikimedia.org/r/296877

Change 297588 had a related patch set uploaded (by Jcrespo):
Preparing db1048 for jessie install

https://gerrit.wikimedia.org/r/297588

Change 297588 merged by Jcrespo:
Preparing db1048 for jessie install

https://gerrit.wikimedia.org/r/297588

Change 297593 had a related patch set uploaded (by Jcrespo):
Remove /a directory from db1048

https://gerrit.wikimedia.org/r/297593

Change 297593 merged by Jcrespo:
Remove /a directory from db1048

https://gerrit.wikimedia.org/r/297593

Change 297802 had a related patch set uploaded (by Chad):
Phab: properly disable crons for maintenance

https://gerrit.wikimedia.org/r/297802

Are you sure they are still running? I commented them out in puppet and commented them out on the server. See:

# HEADER: This file was autogenerated at 2016-07-06 09:47:10 +0000 by puppet.
# HEADER: While it can still be managed manually, it is definitely not recommended.
# HEADER: Note particularly that the comments starting with 'Puppet Name' should
# HEADER: not be deleted, as doing so could cause duplicate cron jobs.
# Puppet Name: collect_exim_stats_via_gmetric
* * * * * /usr/local/bin/exim-to-gmetric
# Puppet Name: phabstatscron_communitymetrics
#0 0 1 * * /usr/local/bin/community_metrics.sh
# Puppet Name: phab_dump
10 4 * * * rsync -zpt --bwlimit=40000 -4 /srv/dumps/phabricator_public.dump dataset1001.wikimedia.org::other_misc/ >/dev/null 2>&1
# Puppet Name: phabstatscron_projectchanges
#0 0 * * 1 /usr/local/bin/project_changes.sh
demon added a subscriber: faidon (edited) · Jul 7 2016, 4:14 PM

They were still running on phab2001, which was causing cronspam that @faidon alerted me to this morning.

phab2001 connects to m3-slave? That is an even worse problem! Are you using TLS? (The answer is no, because until now it did not work due to 5.5.)

demon added a comment · Jul 7 2016, 4:20 PM

phab2001 connects to m3-slave? That is an even worse problem! Are you using TLS? (The answer is no, because until now it did not work due to 5.5.)

I don't think the scripts should even be running on the non-primary instance. But that's outside the scope of what I'm trying to fix in gerrit 297802. Will follow-up in a separate patch.

@demon I agree (you will get mails twice). Not a huge issue because db1048 is mostly up, only depooled still because I found some data differences with its master. Let's deploy that.

Change 297802 merged by Jcrespo:
Phab: properly disable crons for maintenance

https://gerrit.wikimedia.org/r/297802

Please continue working on phab architecture. There is already a slave on codfw: db2012

I corrected some, but there is still some drift between db1043 and db1048. We cannot progress until that is fixed.

@jcrespo is there anything I can do to help?

I think I have fixed all slave differences between m3-master and m3-slave. Most were false positives due to differences between 5.5 and 10, or tool limitations, but there were some legitimate differences. They were caused by tables using the Aria engine (non-transactional), such as phabricator_search.search_documentfield. These should either be converted to InnoDB, or we accept that they will eventually become corrupt.
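As a sketch of what that cleanup could look like (the exact commands run here are not recorded in this task; this is an assumed, generic approach), the remaining Aria tables can be located via information_schema and converted one by one:

```sql
-- Find remaining Aria tables outside the system schemas
SELECT table_schema, table_name
  FROM information_schema.tables
 WHERE engine = 'Aria'
   AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema');

-- Convert one of them to the transactional InnoDB engine
ALTER TABLE phabricator_search.search_documentfield ENGINE = InnoDB;
```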

There were also issues because there were tables without a primary key or unique key, so it was almost impossible to compare them or return them in a reliable order (e.g. there was a table with a row 100% duplicated on the slave, but not on the master).
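For reference, a rough way to list tables lacking any primary or unique key via information_schema (an assumed query for illustration, not necessarily the one used here):

```sql
-- Base tables with no PRIMARY KEY or UNIQUE constraint at all
SELECT t.table_schema, t.table_name
  FROM information_schema.tables t
  LEFT JOIN information_schema.table_constraints c
    ON c.table_schema = t.table_schema
   AND c.table_name = t.table_name
   AND c.constraint_type IN ('PRIMARY KEY', 'UNIQUE')
 WHERE t.table_type = 'BASE TABLE'
   AND c.constraint_name IS NULL
   AND t.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema');
```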

There are also probably some extra (non-phab) databases that I can archive and delete (please confirm):

information_schema (system)
bugzilla_migration
fab_migration
heartbeat (ops)
mysql (system)
percona (ops)
performance_schema (system)
rt_migration
test

Once this has been finished, we could re-enable the long-running processes on the slave, but I would prefer to fail over m3-master from db1043 to db1048 first. I am blocked on you to proceed with this.

@jcrespo I think the phabricator_search tables are non-critical as they can be re-generated from other data.

I think that at least some of the migration tables are still used. I'm unsure about rt_migration, but bugzilla_migration may still be in use. @chasemp might remember?

The idea was for rt_migration and bugzilla_migration to live "forever" or at least as long as anyone cared that at one time we used either. Those databases are actively used for user history reassignment when it happens, in both cases, and they also have details on imported tasks not surfaced in phab. So if we really want to know the historical details of something, we can look there instead of digging it up or going looking for whatever point-in-time data exists now for those systems. I would keep those unless they are actively causing an issue; and if we do get rid of them, we need to acknowledge it's akin to saying we are not doing any more user history fixup.

The rest of them I have no special context on other than fab_migration which I know can be gotten rid of.

And yeah, it should be possible to rebuild the search index from scratch at any time; we had to do it in our flip-flop from mysql->elastic->mysql.

The idea was for rt_migration and bugzilla_migration to live "forever" or at least as long as anyone cared that at one time we used either.

That is ok (no issue if they are needed), but let's document it somewhere.

Thanks for the comment. I will ask robh/dzahn about rt_migration.

I got fab_migration and rt_migration mixed up. No need to ask anything else; I will archive and drop fab_migration, percona and test.

I will be waiting for @mmodell or anyone on his team to green-light and schedule a failover (which will require a small amount of downtime).

no worries, noted in irc I meant to keep rt_migration too :)

This goes hand in hand with https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/role/manifests/phabricator/main.pp;92b104a89c16a1a2ff82cf3c2d35cffc5c851041$78-94

Where would be a good, visible spot to document that relationship for DBA audits?

Mentioned in SAL [2016-07-13T17:27:46Z] <jynus> drop databases fab_migration, percona and test from m3 T138460

@jcrespo: There is a scheduled downtime every Thursday at 01:00 (AM) GMT. If you want to pick a better time for your convenience then let me know when is best for you and I can plan to be available to help with config changes and/or testing. If it's just the slave going offline then I don't think it will affect phabricator production. Is the plan to fail over from master to slave and then upgrade the master or just take the slave offline for upgrade? If it's just the slave then it can happen at any time.

jcrespo added a comment (edited) · Jul 13 2016, 6:12 PM

This is a master failover: we are going to take m3-master and point it somewhere else. While connections finish (and for phab that will probably need a full service restart), the master will be in read-only mode.

The old (current) master is db1043, the new one will be db1048. Not only is there a switchover; db1048 has also been upgraded to jessie/MariaDB 10, so, while unlikely, minor disruptions could happen due to incompatibilities, differences in permissions, etc. (e.g. we just did this with m1, and while nothing serious happened, the etherpad downtime was probably indirectly caused by the failover).

Once the old master is idle, I will upgrade it irrevocably to MariaDB 10 (it is the last db on precise/5.5).
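Roughly, the database side of the switchover described above boils down to something like the following (a sketch only; the actual procedure also involves the proxy/DNS and config changes tracked in the gerrit changes on this task):

```sql
-- On the current master (db1043): stop accepting writes
SET GLOBAL read_only = 1;

-- On the slave (db1048): confirm replication has fully caught up
SHOW SLAVE STATUS\G  -- Seconds_Behind_Master should be 0

-- Record the binary log coordinates on both hosts before repointing anything
SHOW MASTER STATUS\G
```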

Assuming everybody knows this is happening (disruptive maintenance happens every week), we can do it tomorrow.

@jcrespo: Yes there is a one hour scheduled maintenance which is planned downtime although usually it takes no more than a few minutes. I'm game when you are :)

I do not feel confident with the current status. While I could rush it and do it today (the slave is ready), after checking that this shard is not yet behind a proxy, I think it is better if we set up the m3-proxy first, which requires reimaging (including an upgrade to jessie that is blocked on T125027), a dns update, and a setup of grants.

Sorry I could not check that before, but several outages happened over the last few days, plus I spent the afternoon rebooting servers for a different reason.

jcrespo updated the task description · Jul 14 2016, 11:27 PM
greg added a comment · Jul 15 2016, 12:09 AM

No worries, thanks Jaime.

Mentioned in SAL [2016-07-19T14:10:57Z] <jynus> reboot and reimage dbproxy1003 to jessie T125027 T138460

Change 299764 had a related patch set uploaded (by Jcrespo):
Set m3-master as an alias of dbproxy1003

https://gerrit.wikimedia.org/r/299764

Change 299796 had a related patch set uploaded (by Jcrespo):
Changes on m3 grants to include unpuppetized users & dbproxy1003

https://gerrit.wikimedia.org/r/299796

Change 299796 merged by Jcrespo:
Changes on m3 grants to include unpuppetized users & dbproxy1003

https://gerrit.wikimedia.org/r/299796

Change 299805 had a related patch set uploaded (by Jcrespo):
Set db1048 as the primary master on the m3 proxy (not yet in use)

https://gerrit.wikimedia.org/r/299805

Change 299805 merged by Jcrespo:
Set db1048 as the primary master on the m3 proxy (not yet in use)

https://gerrit.wikimedia.org/r/299805

Change 299856 had a related patch set uploaded (by Jcrespo):
Correct ip for db1043: 10.64.16.32, not 10.64.16.33

https://gerrit.wikimedia.org/r/299856

Change 299856 merged by Jcrespo:
Correct ip for db1043: 10.64.16.32, not 10.64.16.33

https://gerrit.wikimedia.org/r/299856

We are now ready to do this, unless some disaster happens. We just need to set m3-master as read-only and merge this during the maintenance:

https://gerrit.wikimedia.org/r/299764

It is a dns failover, which is far from ideal, but it is needed to start using the proxy. I have also tested that using the proxy does work (grants).

jcrespo updated the task description · Jul 19 2016, 9:08 PM
jcrespo moved this task from In progress to Next on the DBA board · Jul 20 2016, 11:03 AM

Change 299764 merged by Jcrespo:
Set m3-master as an alias of dbproxy1003

https://gerrit.wikimedia.org/r/299764

Binary log at the time of the master change:

MariaDB  db1043.eqiad.wmnet (none) > SHOW MASTER STATUS\G
*************************** 1. row ***************************
            File: db1043-bin.001218
        Position: 90854095
    Binlog_Do_DB: 
Binlog_Ignore_DB: 
1 row in set (0.00 sec)

MariaDB  db1048.eqiad.wmnet (none) > SHOW MASTER STATUS\G
*************************** 1. row ***************************
            File: db1048-bin.001197
        Position: 199491691
    Binlog_Do_DB: 
Binlog_Ignore_DB: 
1 row in set (0.00 sec)
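These recorded coordinates are what allow the old master to be repointed later as a slave of db1048; schematically, it would look something like this (replication user/password and connection options omitted, so this is an illustration rather than the exact command run):

```sql
-- On db1043, once idle: replicate from the new master, starting at
-- the db1048 coordinates captured above
CHANGE MASTER TO
  MASTER_HOST = 'db1048.eqiad.wmnet',
  MASTER_LOG_FILE = 'db1048-bin.001197',
  MASTER_LOG_POS = 199491691;
START SLAVE;
```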

Change 300469 had a related patch set uploaded (by Jcrespo):
Set db1048 as the new phabricator master on config

https://gerrit.wikimedia.org/r/300469

Change 300469 merged by Jcrespo:
Set db1048 as the new phabricator master on config

https://gerrit.wikimedia.org/r/300469

db1048 is now the new m3 master, and it is being used through the proxy dbproxy1003.

Change 300487 had a related patch set uploaded (by Jcrespo):
Reconfigure m3 servers to use modern mysql configuration

https://gerrit.wikimedia.org/r/300487

Change 300487 merged by Jcrespo:
Reconfigure m3 servers to use modern mysql configuration

https://gerrit.wikimedia.org/r/300487

jcrespo moved this task from Next to In progress on the DBA board · Jul 22 2016, 5:53 AM
jcrespo closed this task as Resolved · Jul 22 2016, 5:56 AM

m3 servers are all now on jessie/10, and we are no longer in degraded/reduced redundancy mode.

I've left a copy of the old master (db1043) on dbstore1002:/srv/tmp just in case.

jcrespo reopened this task as Open · Jul 22 2016, 6:13 AM

I forgot we need to reenable slave jobs.

Change 300497 had a related patch set uploaded (by Jcrespo):
Set db1043 as the new slave of m3

https://gerrit.wikimedia.org/r/300497

Change 300497 merged by Jcrespo:
Set db1043 as the new slave of m3

https://gerrit.wikimedia.org/r/300497

Change 300506 had a related patch set uploaded (by Jcrespo):
Reenable jobs running on the phabricator db slave

https://gerrit.wikimedia.org/r/300506

jcrespo added a subscriber: Dzahn · Jul 22 2016, 7:49 AM

@mmodell @demon @Dzahn I've lost track after so many enables and disables of jobs. I think the only pending thing to enable is: https://gerrit.wikimedia.org/r/300506 but please review.

Once that is deployed, we can set this as resolved.

jcrespo moved this task from In progress to Done on the DBA board · Jul 22 2016, 8:01 AM

Change 300506 merged by Dzahn:
Reenable jobs running on the phabricator db slave

https://gerrit.wikimedia.org/r/300506

Dzahn added a comment · Jul 22 2016, 4:46 PM

Once that is deployed, we can set this as resolved.

@jcrespo thank you! It has been deployed; the cron has been created on iridium.

Notice: /Stage[main]/Phabricator::Tools/Cron[/srv/phab/tools/public_task_dump.py]/ensure: created

Dzahn closed this task as Resolved (edited) · Jul 22 2016, 4:48 PM

Setting to resolved per Jaime's comment (and since the last phabricator upgrade, this should no longer mean I automatically claim the task).

(edit: yep, it stays assigned to Jaime, as it should be :))