reimage or decom db servers on precise
Closed, Resolved
Public

Assigned To

Authored By

	Dzahn
	Jan 28 2016, 12:55 AM

Description

These database servers are on precise but should become jessie. The dbproxy* hosts have a separate ticket.

db1001.eqiad.wmnet: True
db1023.eqiad.wmnet: True
db1024.eqiad.wmnet: True
db1029.eqiad.wmnet: True
db1033.eqiad.wmnet: True
db1038.eqiad.wmnet: True
db1040.eqiad.wmnet: True
db1043.eqiad.wmnet: True
db1048.eqiad.wmnet: True
db1052.eqiad.wmnet: True
db1058.eqiad.wmnet: True
db1069.eqiad.wmnet: True

Details

Subject	Repo	Branch	Lines +/-
Prepare db1023 for reimage	operations/puppet	production	+3 -8
Depool db1023 for reimage	operations/mediawiki-config	master	+2 -2
Prepare db1058 for jessie reimage	operations/puppet	production	+3 -8
Depool db1058 for reimage	operations/mediawiki-config	master	+4 -4
Repool db1040 after maintenance	operations/mediawiki-config	master	+4 -4
Repool db1038, increase weight of new hardware slaves db107[4-8]	operations/mediawiki-config	master	+5 -5
Config changes for db1038 (old s3 master) reimaging	operations/puppet	production	+4 -10
Repool db1052 (old s1-master) with low weight	operations/mediawiki-config	master	+1 -1
Upgrade db1052 to new puppet core class + MariaDB10 + jessie	operations/puppet	production	+1 -8
Reimage db1052 as jessie	operations/puppet	production	+2 -0


Related Objects

Status	Assigned	Task
Resolved	Dzahn	T123525 reduce amount of remaining Ubuntu 12.04 (precise) systems in production
Resolved	• jcrespo	T125028 reimage or decom db servers on precise
Resolved	• jcrespo	T133398 Install, configure and provision recently arrived db core machines
Resolved	• Cmjohnson	T135253 Rack and set up 16 db's db1079-1094
Resolved	• jcrespo	T134349 Upgrade db1069
Resolved	• jcrespo	T134555 db1033 (old s7 master) needs backup and reimage
Resolved	• jcrespo	T135973 Upgrade m1 db servers
Resolved	• jcrespo	T106312 m1-master switch from db1001 to db1016
Resolved	• jcrespo	T138460 Upgrade m3 (phabricator) db servers

Event Timeline

Dzahn created this task. Jan 28 2016, 12:55 AM

Dzahn raised the priority of this task to Needs Triage.

Dzahn updated the task description.

Dzahn added projects: SRE, DBA.

Dzahn added subscribers: Krenair, Ricordisamoa, hashar and 4 others.

As per Ops Meeting.

• jcrespo added a subscriber: faidon. Jan 29 2016, 10:05 AM

In T125028#1981051, @jcrespo wrote:

As per Ops Meeting.

Could you elaborate please? I don't know what this means. :)

hashar unsubscribed. Apr 8 2016, 11:46 AM

Yes please, what was the outcome? Should we link a decom ticket instead?

I will gladly explain; I think it was clear to everyone involved in this ticket:

"Upgrade db servers to jessie" to me is as clear or useful as a ticket saying "have a recent kernel installed on all machines".

I mentioned the details to @faidon, who later relayed them to everyone at the ops session. Upgrading the servers to jessie was not part of last quarter's goal. The goal was only approved on the explicit condition that those upgrades were not part of it (because I strongly opposed having that as a quarter goal otherwise). In particular, I explained that these particular upgrades cannot be achieved any time soon for several reasons, mostly because they are blocked by https://phabricator.wikimedia.org/T105135#2098755. This was discussed in the goal discussion sessions. The servers cannot be upgraded: a failover is done and other servers take over their role. By the time that happens, the original servers will probably be decommissioned (so there is never an upgrade). Failover requires an upgrade to MariaDB 10. That was not my decision; it is what I found after taking over Sean's direction (with which, I must say, I agree). Upgrading to MariaDB 10 is very costly and is done as an everyday task.

Even if this is not part of the goal, of course it will be "done" (no precise systems will be left), but it makes no sense to have "TODO" tasks (e.g. "Upgrade servers to $JESSIE+1 when it is stable and everybody agrees to it"). I think we agreed to that informally on IRC. As a regular thing, I close all tickets of the form "you should upgrade DB boxes to MariaDB 11/upgrade to kernel 24 because it has X Y Z features". Of course those will be done (eventually), and I am doing that every day, but tickets are not a good place to track them. A roadmap is (Phabricator doesn't have good tools for that, so it is tracked in several internal documents).

If you need to know the upgrade plan, please ask me (it is in several ops documents). MySQL masters will be failed over to new servers (mostly trusty, the newest one to jessie) during the datacenter switchover. It cannot be done before. Why trusty? Because s1, s4, s5, s6 and s7 have pending hardware upgrades: T131368. Tracking this on a ticket is useless and creates unnecessary noise. A more general, useful and short-term actionable task is tracked on T120122.

The reason I did not write a lengthy explanation is that every time I try to explain why upgrading databases is not easy or immediate, @mark says to me "No need to justify yourself, we already know the issues [with stateful services]", so I assumed that was generally known by all teammates.

In the case of non-core servers, the same applies, except that it has even more issues tangled up with each application's requirements (and there are literally dozens of them). This is again part of regular maintenance, but has to be examined on a case-by-case basis.

So, in summary, this is invalid because a) it is not part of the goal with a specific outcome; b) it is not a useful ticket, it is a TODO/wishlist kind of ticket; c) its work is already tracked by a handful of other tickets, and I cannot merge it into any one in particular. The ticket, with its current title and blocker, "will not be done". Upgrading all db servers to jessie is a 2-man-year task (not counting the 2 years of work that have already been done) with the current tools and procedures (which is always in progress). I report on it at every ops session, and only make public statements when specific milestones are achieved (e.g. first MySQL master (s2) with MariaDB 10/jessie). The next milestone will be "all masters failed over to MariaDB 10/trusty or jessie" after the codfw switchover.

I will wait to mark it as invalid again to avoid 3RR. I want to convince you to avoid such generic tickets, and to use roadmap-like tools/documents instead.

• jcrespo added a comment. Apr 15 2016, 8:37 AM

This comment was removed by • jcrespo.

All masters are now on jessie or trusty; the old precise masters are now to be reimaged. Current trusty masters *will not be* upgraded to jessie (invalid still applies, @Dzahn); only the trusty slaves will be upgraded (or decommissioned).

• jcrespo added a subtask: T133398: Install, configure and provision recently arrived db core machines. Apr 22 2016, 3:39 PM

@jcrespo Thank you! I should clarify: this ticket was only ever about killing precise, not trusty. All it ever did was track when we got rid of all precise hosts. Basically, I just ran a salt command to find all precise hosts, split them into groups, and created one ticket for each; "db" was just one of those groups, created without knowing any background.

Would renaming it to "reimage remaining precise db servers" be better? Or would you prefer it just be closed completely?
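
To illustrate the triage step described above (find all precise hosts with salt, then split them into groups), here is a minimal Python sketch. It assumes the salt output has the `hostname: True` shape shown in the task description. The function names `parse_matching_hosts` and `group_by_prefix` are hypothetical, and grouping by the alphabetic hostname prefix is only an assumption about how the groups were formed.

```python
import re
from collections import defaultdict

# A few lines copied from the task description. The "host: True" shape is
# assumed to be what the salt command printed for each matching minion.
SALT_OUTPUT = """\
db1001.eqiad.wmnet: True
db1023.eqiad.wmnet: True
db1043.eqiad.wmnet: True
"""


def parse_matching_hosts(output):
    """Return the hostnames whose reported value is exactly 'True'."""
    hosts = []
    for line in output.splitlines():
        host, sep, value = line.partition(":")
        if sep and value.strip() == "True":
            hosts.append(host.strip())
    return hosts


def group_by_prefix(hosts):
    """Group hosts by the leading letters of the short hostname.

    For example db1001.eqiad.wmnet goes to "db" and a dbproxy host goes
    to "dbproxy" (hypothetical grouping rule, one group per ticket).
    """
    groups = defaultdict(list)
    for host in hosts:
        short = host.split(".", 1)[0]
        match = re.match(r"[a-z]+", short)
        prefix = match.group(0) if match else short
        groups[prefix].append(host)
    return dict(groups)


if __name__ == "__main__":
    for prefix, members in group_by_prefix(parse_matching_hosts(SALT_OUTPUT)).items():
        print(prefix, members)
```

With a rule like this, dbproxy* hosts land in their own group, which matches the note in the description that they have a separate ticket.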

Dzahn renamed this task from "upgrade db servers to jessie" to "reimage db servers on precise". Apr 22 2016, 9:58 PM

Dzahn renamed this task from "reimage db servers on precise" to "reimage or decom db servers on precise".

Dzahn set Security to None.

Change 285168 had a related patch set uploaded (by Jcrespo):
Reimage db1052 as jessie

https://gerrit.wikimedia.org/r/285168

gerritbot added a project: Patch-For-Review. Apr 25 2016, 11:23 AM

• jcrespo triaged this task as Medium priority. Apr 25 2016, 11:24 AM

• jcrespo moved this task from Triage to In progress on the DBA board.

Change 285168 merged by Jcrespo:
Reimage db1052 as jessie

https://gerrit.wikimedia.org/r/285168

Change 285183 had a related patch set uploaded (by Jcrespo):
Upgrade db1052 to new puppet core class + MariaDB10 + jessie

https://gerrit.wikimedia.org/r/285183

Change 285183 merged by Jcrespo:
Upgrade db1052 to new puppet core class + MariaDB10 + jessie

https://gerrit.wikimedia.org/r/285183

Change 285344 had a related patch set uploaded (by Jcrespo):
Repool db1052 (old s1-master) with low weight

https://gerrit.wikimedia.org/r/285344

Change 285344 merged by Jcrespo:
Repool db1052 (old s1-master) with low weight

https://gerrit.wikimedia.org/r/285344

• jcrespo moved this task from In progress to Pending comment on the DBA board. Apr 27 2016, 10:27 AM

• jcrespo moved this task from Pending comment to In progress on the DBA board. Apr 28 2016, 10:43 AM

Mentioned in SAL [2016-04-28T10:50:27Z] <jynus> stopping and restarting db1038 for backup and upgrade T125028

Change 285928 had a related patch set uploaded (by Jcrespo):
Config changes for db1038 (old s3 master) reimaging

https://gerrit.wikimedia.org/r/285928

Change 285928 merged by Jcrespo:
Config changes for db1038 (old s3 master) reimaging

https://gerrit.wikimedia.org/r/285928

Change 286129 had a related patch set uploaded (by Jcrespo):
Repool db1038, increase weight of new hardware slaves db107[4-8]

https://gerrit.wikimedia.org/r/286129

Change 286129 merged by Jcrespo:
Repool db1038, increase weight of new hardware slaves db107[4-8]

https://gerrit.wikimedia.org/r/286129

Change 286592 had a related patch set uploaded (by Jcrespo):
Repool db1040 after maintenance

https://gerrit.wikimedia.org/r/286592

Change 286592 merged by Jcrespo:
Repool db1040 after maintenance

https://gerrit.wikimedia.org/r/286592

Change 286792 had a related patch set uploaded (by Jcrespo):
Depool db1058 for reimage

https://gerrit.wikimedia.org/r/286792

Change 286792 merged by Jcrespo:
Depool db1058 for reimage

https://gerrit.wikimedia.org/r/286792

Change 286795 had a related patch set uploaded (by Jcrespo):
Prepare db1058 for jessie reimage

https://gerrit.wikimedia.org/r/286795

Change 286795 merged by Jcrespo:
Prepare db1058 for jessie reimage

https://gerrit.wikimedia.org/r/286795

• jcrespo created subtask T134349: Upgrade db1069. May 4 2016, 9:04 AM

Mentioned in SAL [2016-05-04T10:23:18Z] <jynus> restarting db1058 for reimaging to jessie T125028

Change 287066 had a related patch set uploaded (by Jcrespo):
Depool db1023 for reimage

https://gerrit.wikimedia.org/r/287066

Change 287066 merged by jenkins-bot:
Depool db1023 for reimage

https://gerrit.wikimedia.org/r/287066

Change 287092 had a related patch set uploaded (by Jcrespo):
Prepare db1023 for reimage

https://gerrit.wikimedia.org/r/287092

Change 287092 merged by Jcrespo:
Prepare db1023 for reimage

https://gerrit.wikimedia.org/r/287092

• jcrespo created subtask T134555: db1033 (old s7 master) needs backup and reimage. May 6 2016, 7:44 AM

• jcrespo closed subtask T134349: Upgrade db1069 as Resolved. May 7 2016, 3:15 PM

• jcrespo closed subtask T134555: db1033 (old s7 master) needs backup and reimage as Resolved. May 20 2016, 12:13 PM

• jcrespo created subtask T135973: Upgrade m1 db servers. May 23 2016, 8:20 AM

• jcrespo moved this task from In progress to Pending comment on the DBA board. Jun 3 2016, 2:14 PM

• jcrespo closed subtask T133398: Install, configure and provision recently arrived db core machines as Resolved. Jun 16 2016, 8:46 AM

• jcrespo mentioned this in rOPUP56692f19642b: Prepare db1023 for reimage. Jun 17 2016, 6:10 PM

• jcrespo mentioned this in rOPUPbf1046161926: Config changes for db1038 (old s3 master) reimaging.

After m1 failover, the only precise hosts left are:

db1043.eqiad.wmnet: True
db1048.eqiad.wmnet: True

Those are m3 (phabricator) db hosts, and this requires the Phabricator admins' help.

In T125028#2400265, @jcrespo wrote:

After m1 failover, the only precise hosts left are:

db1043.eqiad.wmnet: True
db1048.eqiad.wmnet: True

Those are m3 (phabricator) db hosts, and this requires the Phabricator admins' help.

@mmodell let's see how and when we can plan this (I am pinging you but in reality I want to ping all phab admins, please help me contact them).

@jcrespo: The not-exactly-official list of phabricator admins would be myself, @demon and @Aklapper.

I don't think there should be any issue with moving phabricator. There is a downtime scheduled every Wednesday night/Early Thursday morning (at 1:00AM UTC) or we can schedule a different time slot if that time isn't good for you.

I'm happy to be on hand to assist in any failovers. I don't expect phabricator to need much other than a config patch if the database master host names change.

It is a bit more complex than that: we need to fail over the slave actions to the master (and use only the master). Then (for example, the following week) we need to do the same for the master, and probably restart the service.

Let's schedule the first action for next Wednesday (and meet at that time) if possible. I will take care of everything, but I need you to be around to troubleshoot if something goes wrong (usually that only means reloading the config and restarting the service due to persistent connections).

I have created T138460 specifically for Phabricator. Related to T137928#2389155, too.

• jcrespo removed subscribers: • demon, • mmodell, Aklapper. Jun 23 2016, 7:16 AM

@jcrespo: Thanks, sounds good to me! Is the 01:00 AM UTC time slot OK for you? It's evening in my time zone, but I know that's super late for Europe, and I'm not sure what time zone you are located in. It's not a problem to schedule it for earlier in the day if that's better for you.
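
For reference, here is a small Python sketch of how the proposed slot could be computed and viewed at another UTC offset. It assumes the weekly window is Thursday 01:00 UTC (the "Wednesday night/Early Thursday morning" slot mentioned above). `next_window` and `in_fixed_offset` are hypothetical helpers, and the +2 hour offset used in the example is only an illustration, not a statement about anyone's actual location. A fixed offset also ignores daylight saving changes.

```python
from datetime import datetime, timedelta, timezone

THURSDAY = 3  # datetime.weekday() numbering: Monday is 0


def next_window(now):
    """Return the next Thursday 01:00 UTC strictly after `now`.

    `now` must be a timezone-aware datetime.
    """
    now = now.astimezone(timezone.utc)
    candidate = now.replace(hour=1, minute=0, second=0, microsecond=0)
    # Move forward to the Thursday of the current week (0 to 6 days ahead).
    candidate += timedelta(days=(THURSDAY - candidate.weekday()) % 7)
    # If this week's slot has already passed, use next week's.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


def in_fixed_offset(moment, hours):
    """Express `moment` at a fixed UTC offset given in hours (assumed value)."""
    return moment.astimezone(timezone(timedelta(hours=hours)))


if __name__ == "__main__":
    now = datetime(2016, 6, 23, 7, 16, tzinfo=timezone.utc)
    window = next_window(now)
    print(window.isoformat())
    print(in_fixed_offset(window, 2).isoformat())
```

For example, called with Jun 23 2016, 07:16 UTC, `next_window` returns Jun 30 2016, 01:00 UTC, which is 03:00 at a UTC+2 offset.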