
Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10
Closed, ResolvedPublic

Description

The beta cluster wiki databases are hosted on deployment-db1 and deployment-db2. They are Ubuntu Precise hosts and as such come with version 5.5.34+maria-1~precise.

One will want to create a couple of Jessie instances and conduct the database migrations to the new hosts.

This would let us do things like T139044 for https://gerrit.wikimedia.org/r/#/c/289985/

Rough plan, mostly under the responsibility/leadership of Release-Engineering-Team:

  • find an owner inside Release-Engineering-Team
  • create new Jessie instances with role::mariadb::beta
  • apply role::labs::lvm::srv for disk space?
  • prepare puppet.git and mediawiki-config.git patches to change the IP addresses
  • figure out potential impacts
  • schedule a maintenance window with @jcrespo
  • stop the database and pair with @jcrespo, who will provide context, monitoring, etc.
  • resume service and verify
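The "resume service and verify" step could be sketched as a small check script. This is a hypothetical helper, not part of the task; in practice you would pipe `mysql -e 'SHOW SLAVE STATUS\G'` into it rather than the sample input shown here.

```shell
# Hypothetical verification helper for the "resume service and verify" step:
# reads SHOW SLAVE STATUS output on stdin and reports replication health.
check_replication() {
  awk -F': ' '
    /Slave_IO_Running/  { io  = $2 }
    /Slave_SQL_Running/ { sql = $2 }
    END {
      if (io == "Yes" && sql == "Yes") print "replication OK"
      else                             print "replication BROKEN"
    }'
}

# Sample input for illustration; on a real replica use:
#   mysql -e 'SHOW SLAVE STATUS\G' | check_replication
printf 'Slave_IO_Running: Yes\nSlave_SQL_Running: Yes\n' | check_replication
# prints "replication OK"
```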

Related Objects

This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here.

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. · Jun 27 2016, 6:55 PM
Reedy added a subscriber: Reedy.Jun 28 2016, 8:29 AM

Would this be better done after upgrading to a newer Ubuntu release, and/or a swap to Debian?

hashar triaged this task as Normal priority.Jun 28 2016, 8:30 AM
hashar added a subscriber: hashar.

> The beta cluster wiki databases are hosted on deployment-db1 and deployment-db2. They are Ubuntu Precise hosts and as such come with version 5.5.34+maria-1~precise.
>
> One will want to create a couple Jessie instances and conduct the database migrations to the new hosts.

Please prepare the instances; I will gladly assist you with the data migration and configuration. I have done this same process on 170 production hosts.

hashar renamed this task from Upgrade mariadb in deployment-prep and use GTID to Upgrade mariadb in deployment-prep from Precise/MariaDB 5.5 to Jessie/MariaDB 5.10.Jun 30 2016, 1:37 PM
hashar updated the task description. (Show Details)

I have repurposed this task to solely track the upgrade of Beta-Cluster-Infrastructure databases to Jessie/MariaDB 5.10.

The GTID work from @aaron is now in subtask T138778.

I have added this task to the Release-Engineering-Team weekly meeting. Maybe we can organize some kind of sprint and enlist a few people who might be interested.

We would definitely need a DBA to be around for support.

After some discussion with @jcrespo, he will be able to handle the MariaDB migration. There are a few prerequisites that should be driven by Release-Engineering-Team; I am going to edit the task description to list them out.

hashar updated the task description. (Show Details)Jun 30 2016, 1:59 PM
greg added a subscriber: greg.Jul 15 2016, 10:33 PM

@jcrespo and @aaron: what do you two think the timeline should be for this upgrade? Do it within the next week? Within this next month? Within this quarter? Next quarter? :) I'm asking because we (RelEng) have a few other priorities right now and we're trying to prioritize correctly. Thanks.

greg added a comment.Jul 19 2016, 6:10 PM

From today's Technology Management meeting:

Cross-DC alternative to MASTER_POS_WAIT() via I1dfc0210: Added GTID support to slave lag methods (Aaron), to be tested on testwiki since Beta does not have MASTER_GTID_WAIT() support (needs MariaDB 10)

I guess that means that this isn't a strict blocker as they are using a config switch for it. But, we shouldn't let this linger for a long time.

So I guess that means "this quarter", probably, yes?
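For reference, the two lag-wait primitives discussed above look like this from a shell. The binlog coordinates and GTID value are made-up illustrations, and `RUN=echo` keeps this a dry-run sketch rather than requiring a live replica:

```shell
# Dry-run sketch (RUN=echo prints the commands instead of executing them).
RUN=echo

# Pre-MariaDB-10 approach: wait on one specific master's binlog file/position.
$RUN mysql -e "SELECT MASTER_POS_WAIT('db1-bin.000042', 1234, 5);"

# MariaDB 10 approach: wait on a GTID, which survives master switches.
$RUN mysql -e "SELECT MASTER_GTID_WAIT('0-1-100', 5);"
```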

Upgrading a database (especially if it is not production) takes no more than 1 hour.

Whatever you decide, I am just your humble servant.

greg assigned this task to dduvall.Aug 1 2016, 5:26 PM

Assigning to Dan per meeting discussion.

We've reached our vcpu and memory quota for deployment-prep so I'm unable to provision a second replacement instance. @Andrew or @chasemp, think we could get an increase?

greg updated the task description. (Show Details)Aug 3 2016, 5:30 PM

I've seen a new instance called deployment-db01... but it should be deployment-db03. Has anything been done on it yet, or can it be easily replaced?

Parent Tasks: T139044: Enable GTID on the beta cluster mariadb after the upgrade

Change 305668 had a related patch set uploaded (by Dduvall):
beta: Create and mount LVM volumes for mariadb

https://gerrit.wikimedia.org/r/305668

Change 305675 had a related patch set uploaded (by Dduvall):
beta: Configure storage cluster for migrated databases

https://gerrit.wikimedia.org/r/305675

dduvall updated the task description. (Show Details)Aug 24 2016, 6:22 PM

I think we're ready to roll with this on the puppet/config front.

@jcrespo, once we coordinate a day/time for the migration, I'll send out an email regarding the maintenance window. What's a good day/time to pair on this and how much time do you think we'll need? I'm available Wed/Thurs this week or any day next week.

jcrespo added a comment (edited). Aug 29 2016, 5:11 PM

@dduvall, assuming everything is prepared (the machines are currently up, there is network connectivity between them, etc.), it takes one hour per server, more or less. We can avoid most of the outage by setting a different active master each time, but I do not know what the expectations of availability are for labs.

Easiest way is 2 hours of read-only mode.

Any time when you are available works for me; I am on European time.
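The "2 hours of read-only mode" approach amounts to bracketing the migration with a global read-only toggle on the active master. A dry-run sketch (`RUN=echo`; drop it to run against a real master, assuming a configured mysql client):

```shell
RUN=echo

# Start of the maintenance window: make the active master read-only.
$RUN mysql -e "SET GLOBAL read_only = 1;"

# ... perform the backup, copy, and restore here ...

# End of the window: re-enable writes once the new master is verified.
$RUN mysql -e "SET GLOBAL read_only = 0;"
```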

@jcrespo, great! I've set up a 2 hour calendar event for Thursday @ 1500 UTC. Let me know if that works for you and I'll send out an email about the read-only maintenance window.

Scratch that @jcrespo. I had a conflict after all. I will try to reschedule for next week.

dduvall updated the task description. (Show Details)Sep 9 2016, 6:18 PM

We're on for 9/15 @ 1500-1700 UTC and I've sent an announcement to wikitech-l and wmfall.

We had to abort the migration today due to time constraints: an unexpected production db crash took half the window, and the dump took a bit longer than expected.

Luckily, @jcrespo was able to show me a faster method of migrating using innobackupex, which I feel comfortable doing on my own. I'll schedule a follow-up window for next week, finding a time when he'll be available in case we still need him, but with the plan that I take the lead.

@dduvall I leave you here the full tutorial for the cloning (which should take no more than 5 minutes): https://www.percona.com/doc/percona-xtrabackup/2.4/howtos/recipes_ibkx_local.html

Note that there is 5.5.34 on db03 and db04, and I highly recommend MariaDB 10 there to get full advantage of the performance_schema features I mentioned. Not sure if it is a conscious decision or puppet hadn't been run yet.
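The Percona recipe linked above boils down to three innobackupex phases. A dry-run sketch (`RUN=echo`), where the paths and the timestamped directory name are assumptions — in practice the directory is whatever innobackupex created for the backup:

```shell
RUN=echo

# 1. Take a raw backup into a timestamped directory under /srv/backup/.
$RUN innobackupex /srv/backup/

# 2. Apply the InnoDB transaction logs so the backup is consistent.
$RUN innobackupex --apply-log /srv/backup/2016-09-20_15-00-00/

# 3. Copy the prepared files back into the (empty) datadir and fix ownership.
$RUN innobackupex --copy-back /srv/backup/2016-09-20_15-00-00/
$RUN chown -R mysql:mysql /srv/sqldata
```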

> @dduvall I leave you here the full tutorial for the cloning (which should take no more than 5 minutes): https://www.percona.com/doc/percona-xtrabackup/2.4/howtos/recipes_ibkx_local.html

Thanks!

> Note that there is 5.5.34 on db03 and db04, and I highly recommend MariaDB 10 there to get full advantage of the performance_schema features I mentioned. Not sure if it is a conscious decision or puppet hadn't been run yet.

Oh fun. That was definitely an oversight. I'll prepare the necessary puppet changes and make sure they're applied before the next window.

Change 310360 had a related patch set uploaded (by Dduvall):
beta: Install MariaDB 10

https://gerrit.wikimedia.org/r/310360

Mentioned in SAL (#wikimedia-releng) [2016-09-15T16:18:03Z] <hashar> prometheus enabled on all beta cluster instance. Does not support Precise hence puppet will fail on the last two Precise instances deployment-db1 and deployment-db2 until they are migrated to Jessie T138778

Mentioned in SAL (#wikimedia-releng) [2016-09-20T14:20:09Z] <marxarelli> disabling beta cluster jenkins jobs in preparation for data migration (T138778)

Mentioned in SAL (#wikimedia-releng) [2016-09-20T16:31:41Z] <marxarelli> cherry picking operations/puppet patches (T138778) to deployment-puppetmaster

Change 305675 merged by jenkins-bot:
beta: Configure storage cluster for migrated databases

https://gerrit.wikimedia.org/r/305675

dduvall added a comment (edited). Sep 20 2016, 11:15 PM

The migration to deployment-db03 and deployment-db04 was completed this morning at around 1815 UTC. Here's a rough transcript of what I did:

  • disabled jenkins jobs
  • put beta cluster master database in read-only mode
> set global read_only = 1;
  • used innobackupex to back up to a compressed tar file (didn't notice the --compress and --parallel options for innobackupex that would have made compression faster and much easier than using --stream)
innobackupex --stream=tar ./ | pv | gzip > /mnt/backup/2016-09-20_15-00-00.tar.gz
  • copied compressed tarball to new database server
  • untar'd and applied innodb transaction logs to backup
cd /srv/backup/2016-09-20_15-00-00
tar --ignore-zeros -zxf 2016-09-20_15-00-00.tar.gz
innobackupex --apply-log ./
  • shutdown mariadb on new server
  • moved data files to mariadb datadir and ensured correct ownership
mv /srv/backup/2016-09-20_15-00-00/* /srv/sqldata/
chown -R mysql:mysql /srv/sqldata
  • started mariadb and ensured clean startup via journalctl -fu mysql and verified basic integrity by selecting some rows
  • shutdown mariadb
  • installed new WMF mariadb 10 packages via puppet after re-enabling puppet
  • purged old mariadb 5.5 server packages
apt-get --purge remove mariadb-server-5.5 mariadb-server-core-5.5
  • ran /opt/wmf-mariadb10/install
  • reloaded systemd with systemctl daemon-reload to read the new unit file
  • started mariadb again
  • ran /opt/wmf-mariadb10/bin/mysql_upgrade
  • restarted mariadb
  • checked basic integrity again
  • stopped mariadb
  • rsync'd datadir to new replica instance
  • started mariadb on master and created replication account
  • repeated basic steps to start with new datadir on replica and started replication
  • verified replication status
  • scap synced the new mediawiki-config but messed that up somehow, so I ran a full scap via the jenkins job
  • re-enabled jenkins jobs
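The package-upgrade portion of the steps above, collected into one dry-run sketch (`RUN=echo` prints rather than executes; the commands match the transcript, and the `mariadb` unit name is an assumption about what the WMF package installs):

```shell
RUN=echo

# Remove the old distro MariaDB 5.5 packages.
$RUN apt-get --purge remove mariadb-server-5.5 mariadb-server-core-5.5

# Install the WMF MariaDB 10 package and register its unit file.
$RUN /opt/wmf-mariadb10/install
$RUN systemctl daemon-reload

# Start the new server and migrate the system tables.
$RUN systemctl start mariadb
$RUN /opt/wmf-mariadb10/bin/mysql_upgrade
```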

The operations/puppet patches still need review/merge (for now, they're cherry-picked on deployment-puppetmaster), and the old instances need to be retired, but we should wait a few days on that just to be safe.

dduvall updated the task description. (Show Details)Sep 20 2016, 11:22 PM
greg added a comment.Sep 22 2016, 9:30 PM

I went ahead and marked this goal as done. We still need the puppet change(s) merged, though, I believe.

There was a refactoring of mariadb's puppet code, and the manual merge got complex due to that, so I went ahead and merged https://gerrit.wikimedia.org/r/312471 instead. That should make https://gerrit.wikimedia.org/r/305668 and https://gerrit.wikimedia.org/r/310360 obsolete, and they can be abandoned.

Please check the changes; the final result should be the same.

Change 305668 abandoned by Dduvall:
beta: Create and mount LVM volumes for mariadb

Reason:
Incorporated into I7383440288bd3394cd0660fec0f402f55009ce19

https://gerrit.wikimedia.org/r/305668

Change 310360 abandoned by Dduvall:
beta: Install MariaDB 10

Reason:
Incorporated into I7383440288bd3394cd0660fec0f402f55009ce19

https://gerrit.wikimedia.org/r/310360

dduvall closed this task as Resolved.Sep 23 2016, 6:26 PM

Mentioned in SAL (#wikimedia-releng) [2016-10-01T09:41:47Z] <hashar> beta: shutdown deployment-db1 and deployment-db2 . Databases have been migrated to other hosts T138778