
Upgrade m3 (phabricator) db servers
Closed, Resolved · Public

Description

db1043 and db1048 are still on precise and mysql 5.5. They have to be upgraded to jessie and MariaDB 10.

  • First schedule: Failover the slave (db1048), depool it and upgrade it: 29 June
  • Second schedule: Failover the master to the slave, depool the master: 21 July

Related Objects

Event Timeline


@jcrespo: it's still good to go as far as I am concerned. @Aklapper: any objections?

Change 296877 merged by Jcrespo:
Disable crons using the phabricator db slave due to maintenance

https://gerrit.wikimedia.org/r/296877

Change 297588 had a related patch set uploaded (by Jcrespo):
Preparing db1048 for jessie install

https://gerrit.wikimedia.org/r/297588

Change 297588 merged by Jcrespo:
Preparing db1048 for jessie install

https://gerrit.wikimedia.org/r/297588

Change 297593 had a related patch set uploaded (by Jcrespo):
Remove /a directory from db1048

https://gerrit.wikimedia.org/r/297593

Change 297593 merged by Jcrespo:
Remove /a directory from db1048

https://gerrit.wikimedia.org/r/297593

Change 297802 had a related patch set uploaded (by Chad):
Phab: properly disable crons for maintenance

https://gerrit.wikimedia.org/r/297802

Are you sure they are still running? I commented them out in puppet and commented them out on the server. See:

# HEADER: This file was autogenerated at 2016-07-06 09:47:10 +0000 by puppet.
# HEADER: While it can still be managed manually, it is definitely not recommended.
# HEADER: Note particularly that the comments starting with 'Puppet Name' should
# HEADER: not be deleted, as doing so could cause duplicate cron jobs.
# Puppet Name: collect_exim_stats_via_gmetric
* * * * * /usr/local/bin/exim-to-gmetric
# Puppet Name: phabstatscron_communitymetrics
#0 0 1 * * /usr/local/bin/community_metrics.sh
# Puppet Name: phab_dump
10 4 * * * rsync -zpt --bwlimit=40000 -4 /srv/dumps/phabricator_public.dump dataset1001.wikimedia.org::other_misc/ >/dev/null 2>&1
# Puppet Name: phabstatscron_projectchanges
#0 0 * * 1 /usr/local/bin/project_changes.sh
demon added a subscriber: faidon (edited) · Jul 7 2016, 4:14 PM

They were still running on phab2001, which was causing cronspam that @faidon alerted me to this morning.

phab2001 connects to m3-slave? That is an even worse problem! Are you using TLS? (The answer is no, because until now it did not work due to 5.5.)

demon added a comment · Jul 7 2016, 4:20 PM

phab2001 connects to m3-slave? That is an even worse problem! Are you using TLS? (The answer is no, because until now it did not work due to 5.5.)

I don't think the scripts should even be running on the non-primary instance. But that's outside the scope of what I'm trying to fix in gerrit 297802. Will follow-up in a separate patch.

@demon I agree (you will get mails twice). Not a huge issue because db1048 is mostly up, only depooled still because I found some data differences with its master. Let's deploy that.

Change 297802 merged by Jcrespo:
Phab: properly disable crons for maintenance

https://gerrit.wikimedia.org/r/297802

Please continue working on phab architecture. There is already a slave on codfw: db2012

I corrected some, but there is still some drift between db1043 and db1048. We cannot progress until that is fixed.

@jcrespo is there anything I can do to help?

I think I have fixed all slave differences between m3-master and m3-slave. Most were false positives due to differences between 5.5 and 10, or tool limitations, but there were some legitimate differences. They were caused by tables using the Aria engine (non-transactional), such as phabricator_search.search_documentfield. These should either be converted to InnoDB, or we accept that they will eventually become corrupt.
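As a sketch of what that cleanup could look like (the exact commands run here are not recorded in this task; this is an assumed, generic approach), the remaining Aria tables can be located via information_schema and converted one by one:

```sql
-- Find remaining Aria tables outside the system schemas
SELECT table_schema, table_name
  FROM information_schema.tables
 WHERE engine = 'Aria'
   AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema');

-- Convert one of them to the transactional InnoDB engine
ALTER TABLE phabricator_search.search_documentfield ENGINE = InnoDB;
```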

There were also issues because there were tables without a primary key or unique key, so it was almost impossible to compare them or return them in a reliable order (e.g. there was a table with a row 100% duplicated on the slave, but not on the master).
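For reference, a rough way to list tables lacking any primary or unique key via information_schema (an assumed query for illustration, not necessarily the one used here):

```sql
-- Base tables with no PRIMARY KEY or UNIQUE constraint at all
SELECT t.table_schema, t.table_name
  FROM information_schema.tables t
  LEFT JOIN information_schema.table_constraints c
    ON c.table_schema = t.table_schema
   AND c.table_name = t.table_name
   AND c.constraint_type IN ('PRIMARY KEY', 'UNIQUE')
 WHERE t.table_type = 'BASE TABLE'
   AND c.constraint_name IS NULL
   AND t.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema');
```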

There are also probably some extra (non-phab) databases that I can archive and delete (please confirm):

information_schema (system)
bugzilla_migration
fab_migration
heartbeat (ops)
mysql (system)
percona (ops)
performance_schema (system)
rt_migration
test

Once this has been finished, we could re-enable the long-running processes on the slave, but I would prefer to fail over m3-master from db1043 to db1048 first. I am blocked on you to proceed with this.

@jcrespo I think the phabricator_search tables are non-critical as they can be re-generated from other data.

I think that at least some of the migration tables are still used. I'm unsure about rt_migration, but bugzilla_migration may still be in use. @chasemp might remember?

The idea was for rt_migration and bugzilla_migration to live "forever" or at least as long as anyone cared that at one time we used either. Those databases are actively used for user history reassignment when it happens, in both cases, and they also have details on imported tasks not surfaced in phab. So if we really want to know the historical details of something, we can look there instead of digging it up or going looking for whatever point-in-time data exists now for those systems. I would keep those unless they are actively causing an issue; and if we do get rid of them, we need to acknowledge it's akin to saying we are not doing any more user history fixup.

The rest of them I have no special context on other than fab_migration which I know can be gotten rid of.

And yeah, it should be possible to rebuild the search index from scratch at any time; we had to do it in our flip-flop from mysql->elastic->mysql.

The idea was for rt_migration and bugzilla_migration to live "forever" or at least as long as anyone cared that at one time we used either.

That is ok (no issue if they are needed), but let's document it somewhere.

Thanks for the comment. I will ask robh/dzahn about rt_migration.

I got fab_migration and rt_migration mixed up. No need to ask anything else; I will archive and drop fab_migration, percona and test.

I will be waiting for @mmodell or anyone on his team to green-light and schedule a failover (which will require a small amount of downtime).

no worries, noted in irc I meant to keep rt_migration too :)

This goes hand in hand with https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/role/manifests/phabricator/main.pp;92b104a89c16a1a2ff82cf3c2d35cffc5c851041$78-94

Where would be a good, visible spot to document that relationship for DBA audits?

Mentioned in SAL [2016-07-13T17:27:46Z] <jynus> drop databases fab_migration, percona and test from m3 T138460

@jcrespo: There is a scheduled downtime every Thursday at 01:00 (AM) GMT. If you want to pick a better time for your convenience then let me know when is best for you and I can plan to be available to help with config changes and/or testing. If it's just the slave going offline then I don't think it will affect phabricator production. Is the plan to fail over from master to slave and then upgrade the master or just take the slave offline for upgrade? If it's just the slave then it can happen at any time.

jcrespo added a comment (edited) · Jul 13 2016, 6:12 PM

This is a master failover: we are going to take m3-master and point it somewhere else. While connections finish (and for phab that will probably need a full service restart), the master will be in read-only mode.

The old (current) master is db1043, the new one will be db1048. Not only is there a switchover; db1048 has also been upgraded to jessie/MariaDB 10, so, while unlikely, minor disruptions could happen due to incompatibilities, differences in permissions, etc. (e.g. we just did this with m1, and while nothing serious happened, the etherpad downtime was probably indirectly caused by the failover).

Once the old master is idle, I will upgrade it irrevocably to MariaDB 10 (it is the last db on precise/5.5).
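Roughly, the database side of the switchover described above boils down to something like the following (a sketch only; the actual procedure also involves the proxy/DNS and config changes tracked in the gerrit changes on this task):

```sql
-- On the current master (db1043): stop accepting writes
SET GLOBAL read_only = 1;

-- On the slave (db1048): confirm replication has fully caught up
SHOW SLAVE STATUS\G  -- Seconds_Behind_Master should be 0

-- Record the binary log coordinates on both hosts before repointing anything
SHOW MASTER STATUS\G
```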

Assuming everybody knows this is happening (disruptive maintenance happens every week), we can do it tomorrow.

@jcrespo: Yes there is a one hour scheduled maintenance which is planned downtime although usually it takes no more than a few minutes. I'm game when you are :)

I do not feel confident with the current status. While I could rush it and do it today (the slave is ready), after checking that this shard is not yet behind a proxy, I think it is better if we set up the m3-proxy first, which requires reimaging (including an upgrade to jessie that is blocked on T125027), a dns update, and a setup of grants.

Sorry I could not check that before, but several outages happened over the last few days, plus I spent the afternoon rebooting servers for a different reason.

jcrespo updated the task description · Jul 14 2016, 11:27 PM
greg added a comment · Jul 15 2016, 12:09 AM

No worries, thanks Jaime.

Mentioned in SAL [2016-07-19T14:10:57Z] <jynus> reboot and reimage dbproxy1003 to jessie T125027 T138460

Change 299764 had a related patch set uploaded (by Jcrespo):
Set m3-master as an alias of dbproxy1003

https://gerrit.wikimedia.org/r/299764

Change 299796 had a related patch set uploaded (by Jcrespo):
Changes on m3 grants to include unpuppetized users & dbproxy1003

https://gerrit.wikimedia.org/r/299796

Change 299796 merged by Jcrespo:
Changes on m3 grants to include unpuppetized users & dbproxy1003

https://gerrit.wikimedia.org/r/299796

Change 299805 had a related patch set uploaded (by Jcrespo):
Set db1048 as the primary master on the m3 proxy (not yet in use)

https://gerrit.wikimedia.org/r/299805

Change 299805 merged by Jcrespo:
Set db1048 as the primary master on the m3 proxy (not yet in use)

https://gerrit.wikimedia.org/r/299805

Change 299856 had a related patch set uploaded (by Jcrespo):
Correct ip for db1043: 10.64.16.32, not 10.64.16.33

https://gerrit.wikimedia.org/r/299856

Change 299856 merged by Jcrespo:
Correct ip for db1043: 10.64.16.32, not 10.64.16.33

https://gerrit.wikimedia.org/r/299856

We are now ready to do this, unless some disaster happens. We just need to set m3-master as read-only and merge this during the maintenance:

https://gerrit.wikimedia.org/r/299764

It is a dns failover, which is far from ideal, but it is needed to start using the proxy. I have also tested that using the proxy does work (grants).

jcrespo updated the task description · Jul 19 2016, 9:08 PM
jcrespo moved this task from In progress to Next on the DBA board · Jul 20 2016, 11:03 AM

Change 299764 merged by Jcrespo:
Set m3-master as an alias of dbproxy1003

https://gerrit.wikimedia.org/r/299764

Binary log at the time of the master change:

MariaDB  db1043.eqiad.wmnet (none) > SHOW MASTER STATUS\G
*************************** 1. row ***************************
            File: db1043-bin.001218
        Position: 90854095
    Binlog_Do_DB: 
Binlog_Ignore_DB: 
1 row in set (0.00 sec)

MariaDB  db1048.eqiad.wmnet (none) > SHOW MASTER STATUS\G
*************************** 1. row ***************************
            File: db1048-bin.001197
        Position: 199491691
    Binlog_Do_DB: 
Binlog_Ignore_DB: 
1 row in set (0.00 sec)
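These recorded coordinates are what allow the old master to be repointed later as a slave of db1048; schematically, it would look something like this (replication user/password and connection options omitted, so this is an illustration rather than the exact command run):

```sql
-- On db1043, once idle: replicate from the new master, starting at
-- the db1048 coordinates captured above
CHANGE MASTER TO
  MASTER_HOST = 'db1048.eqiad.wmnet',
  MASTER_LOG_FILE = 'db1048-bin.001197',
  MASTER_LOG_POS = 199491691;
START SLAVE;
```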

Change 300469 had a related patch set uploaded (by Jcrespo):
Set db1048 as the new phabricator master on config

https://gerrit.wikimedia.org/r/300469

Change 300469 merged by Jcrespo:
Set db1048 as the new phabricator master on config

https://gerrit.wikimedia.org/r/300469

db1048 is now the new m3 master, and it is being used through the proxy dbproxy1003.

Change 300487 had a related patch set uploaded (by Jcrespo):
Reconfigure m3 servers to use modern mysql configuration

https://gerrit.wikimedia.org/r/300487

Change 300487 merged by Jcrespo:
Reconfigure m3 servers to use modern mysql configuration

https://gerrit.wikimedia.org/r/300487

jcrespo moved this task from Next to In progress on the DBA board · Jul 22 2016, 5:53 AM
jcrespo closed this task as Resolved · Jul 22 2016, 5:56 AM

m3 servers are all now on jessie/10, and we are no longer in degraded/reduced redundancy mode.

I've left a copy of the old master (db1043) on dbstore1002:/srv/tmp just in case.

jcrespo reopened this task as Open · Jul 22 2016, 6:13 AM

I forgot we need to reenable slave jobs.

Change 300497 had a related patch set uploaded (by Jcrespo):
Set db1043 as the new slave of m3

https://gerrit.wikimedia.org/r/300497

Change 300497 merged by Jcrespo:
Set db1043 as the new slave of m3

https://gerrit.wikimedia.org/r/300497

Change 300506 had a related patch set uploaded (by Jcrespo):
Reenable jobs running on the phabricator db slave

https://gerrit.wikimedia.org/r/300506

jcrespo added a subscriber: Dzahn · Jul 22 2016, 7:49 AM

@mmodell @demon @Dzahn I've lost track after so many enables and disables of jobs. I think the only pending thing to enable is: https://gerrit.wikimedia.org/r/300506 but please review.

Once that is deployed, we can set this as resolved.

jcrespo moved this task from In progress to Done on the DBA board · Jul 22 2016, 8:01 AM

Change 300506 merged by Dzahn:
Reenable jobs running on the phabricator db slave

https://gerrit.wikimedia.org/r/300506

Dzahn added a comment · Jul 22 2016, 4:46 PM

Once that is deployed, we can set this as resolved.

@jcrespo thank you! It has been deployed; the cron has been created on iridium.

Notice: /Stage[main]/Phabricator::Tools/Cron[/srv/phab/tools/public_task_dump.py]/ensure: created

Dzahn closed this task as Resolved (edited) · Jul 22 2016, 4:48 PM

Setting to resolved per Jaime's comment (and since the last phabricator upgrade, this should no longer mean I automatically claim the task).

(edit: yep, it stays assigned to Jaime, as it should be :))