reimage or decom db servers on precise
Closed, Resolved · Public

Description

These database servers are on precise but should become jessie. dbproxy* hosts have a separate ticket.

db1001.eqiad.wmnet: True
db1023.eqiad.wmnet: True
db1024.eqiad.wmnet: True
db1029.eqiad.wmnet: True
db1033.eqiad.wmnet: True
db1038.eqiad.wmnet: True
db1040.eqiad.wmnet: True
db1043.eqiad.wmnet: True
db1048.eqiad.wmnet: True
db1052.eqiad.wmnet: True
db1058.eqiad.wmnet: True
db1069.eqiad.wmnet: True

Event Timeline

Dzahn raised the priority of this task from to Needs Triage.
Dzahn updated the task description. (Show Details)
Dzahn added projects: SRE, DBA.
Dzahn added subscribers: Krenair, Ricordisamoa, hashar and 4 others.
jcrespo claimed this task.
jcrespo subscribed.

As per Ops Meeting.

Could you elaborate please? I don't know what this means. :)

Yes please, what was the outcome? Should we link a decom ticket instead?

I'll gladly explain; I thought it was clear to everyone involved in this ticket:

"Upgrade db servers to jessie" to me is as clear or useful as a ticket saying "have a recent kernel installed on all machines".

I mentioned the details to @faidon, who later relayed them to everyone in the ops session. Upgrading the servers to jessie was not part of last quarter's goal. The goal was only approved on the condition that these servers were explicitly not part of it (because otherwise I strongly opposed having that as a quarterly goal). In particular, I explained the impossibility of achieving these particular upgrades any time soon for several reasons, mostly because they are blocked by https://phabricator.wikimedia.org/T105135#2098755. This was discussed in the goal discussion sessions. The servers cannot be upgraded in place: a failover is done and other servers take over their role. By the time that happens, the original servers will probably be decommissioned (so there is never an "upgrade"). Failover requires an upgrade to MariaDB 10; that was not my decision, but what I found after taking over Sean's direction (and which I must say I agree with). Upgrading to MariaDB 10 is very costly and is handled as everyday work.

If this is not part of the goal, of course this will eventually be "done" (no precise systems will be left), but it makes no sense to have "TODO" tasks; I think we agreed to that informally on IRC (e.g. "Upgrade servers to $JESSIE+1 when it is stable and everybody agrees to it"). As a regular thing, I close all "you should upgrade DB boxes to MariaDB 11 / upgrade to kernel 24 because it has X, Y, Z features" tickets. Of course those upgrades will be done (eventually); I am doing that every day, but tickets are not a good place to track them. A roadmap is (which Phabricator doesn't have good tools for, so it is tracked in several internal documents).

If you need to know the upgrade plan, please ask me (it is in several ops documents): MySQL masters will be failed over to new servers (mostly trusty, the newest one to jessie) during the datacenter switchover. It cannot be done before. Why trusty? Because s1, s4, s5, s6 and s7 have pending hardware upgrades: T131368. Tracking this on a ticket is useless and creates unnecessary noise. A more general, useful and short-term actionable task is tracked on T120122.

The reason I did not write a lengthy explanation is that every time I try to explain why upgrading databases is not easy or immediate, @mark tells me "No need to justify yourself, we already know the issues [with stateful services]", so I assumed that was generally known by all teammates.

In the case of non-core servers, the same thing applies, except there are even more issues tangled up with each application's requirements (and there are literally dozens of them). This is again part of regular maintenance, but it has to be examined on a case-by-case basis.

So, in summary, this is invalid because a) it is not part of the goal with a specific outcome, b) it is not a useful ticket, it is a TODO/wishlist kind of ticket, and c) its work is already tracked by a handful of other tickets, and I cannot merge it into any one of them in particular. The ticket, with its current title and blockers, "will not be done". Upgrading all db servers to jessie is a 2-man-year task (not counting the 2 years of work that have already been done) with the current tools and procedures (whose improvement is always in progress). I report on it in every ops session and only make public statements when specific milestones are achieved (e.g. the first MySQL master (s2) on MariaDB 10/jessie). The next milestone will be "all masters failed over to MariaDB 10 on trusty or jessie" after the codfw switchover.

I will wait before marking it as invalid again, to avoid 3RR. I want to convince you to avoid such generic tickets and to use roadmap-like tools/documents instead.

This comment was removed by jcrespo.

All masters are now on jessie or trusty; the old precise masters are now to be reimaged. The current trusty masters *will not be* upgraded to jessie ("invalid" still applies, @Dzahn); only the trusty slaves will be upgraded or decommissioned.

@jcrespo Thank you! I should clarify: for the purposes of this ticket it was only about killing precise, not trusty. All it ever was was a way to track when we got rid of all precise hosts. I basically just ran a salt command to find all precise hosts, split them into groups, and opened one ticket for each group; "db" was just one of them, without knowing any background.
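
For reference, a command along these lines produces a host list in the same "host: True" format as the task description (a sketch only; the exact grain and targeting used at the time are an assumption):

  # Ping every salt minion whose Ubuntu release codename is "precise";
  # the output looks like "db1001.eqiad.wmnet: True", as in the description above.
  salt -G 'lsb_distrib_codename:precise' test.ping

  # Compound targeting to narrow the match to db hosts only.
  salt -C 'G@lsb_distrib_codename:precise and db*' test.ping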

Would renaming it to "reimage remaining precise db servers" be better? Or would you prefer it just to be closed completely?

Dzahn renamed this task from "upgrade db servers to jessie" to "reimage db servers on precise". Apr 22 2016, 9:58 PM
Dzahn renamed this task from "reimage db servers on precise" to "reimage or decom db servers on precise".
Dzahn set Security to None.

Change 285168 had a related patch set uploaded (by Jcrespo):
Reimage db1052 as jessie

https://gerrit.wikimedia.org/r/285168

jcrespo moved this task from Triage to In progress on the DBA board.

Change 285168 merged by Jcrespo:
Reimage db1052 as jessie

https://gerrit.wikimedia.org/r/285168

Change 285183 had a related patch set uploaded (by Jcrespo):
Upgrade db1052 to new puppet core class MariaDB10 jessie

https://gerrit.wikimedia.org/r/285183

Change 285183 merged by Jcrespo:
Upgrade db1052 to new puppet core class MariaDB10 jessie

https://gerrit.wikimedia.org/r/285183

Change 285344 had a related patch set uploaded (by Jcrespo):
Repool db1052 (old s1-master) with low weight

https://gerrit.wikimedia.org/r/285344

Change 285344 merged by Jcrespo:
Repool db1052 (old s1-master) with low weight

https://gerrit.wikimedia.org/r/285344

Mentioned in SAL [2016-04-28T10:50:27Z] <jynus> stopping and restarting db1038 for backup and upgrade T125028

Change 285928 had a related patch set uploaded (by Jcrespo):
Config changes for db1038 (old s3 master) reimaging

https://gerrit.wikimedia.org/r/285928

Change 285928 merged by Jcrespo:
Config changes for db1038 (old s3 master) reimaging

https://gerrit.wikimedia.org/r/285928

Change 286129 had a related patch set uploaded (by Jcrespo):
Repool db1038, increase weight of new hardware slaves db107[4-8]

https://gerrit.wikimedia.org/r/286129

Change 286129 merged by Jcrespo:
Repool db1038, increase weight of new hardware slaves db107[4-8]

https://gerrit.wikimedia.org/r/286129

Change 286592 had a related patch set uploaded (by Jcrespo):
Repool db1040 after maintenance

https://gerrit.wikimedia.org/r/286592

Change 286592 merged by Jcrespo:
Repool db1040 after maintenance

https://gerrit.wikimedia.org/r/286592

Change 286792 had a related patch set uploaded (by Jcrespo):
Depool db1058 for reimage

https://gerrit.wikimedia.org/r/286792

Change 286792 merged by Jcrespo:
Depool db1058 for reimage

https://gerrit.wikimedia.org/r/286792

Change 286795 had a related patch set uploaded (by Jcrespo):
Prepare db1058 for jessie reimage

https://gerrit.wikimedia.org/r/286795

Change 286795 merged by Jcrespo:
Prepare db1058 for jessie reimage

https://gerrit.wikimedia.org/r/286795

Mentioned in SAL [2016-05-04T10:23:18Z] <jynus> restarting db1058 for reimaging to jessie T125028

Change 287066 had a related patch set uploaded (by Jcrespo):
Depool db1023 for reimage

https://gerrit.wikimedia.org/r/287066

Change 287066 merged by jenkins-bot:
Depool db1023 for reimage

https://gerrit.wikimedia.org/r/287066

Change 287092 had a related patch set uploaded (by Jcrespo):
Prepare db1023 for reimage

https://gerrit.wikimedia.org/r/287092

Change 287092 merged by Jcrespo:
Prepare db1023 for reimage

https://gerrit.wikimedia.org/r/287092

After m1 failover, the only precise hosts left are:

db1043.eqiad.wmnet: True
db1048.eqiad.wmnet: True

These are m3 (phabricator) db hosts, and reimaging them requires the Phabricator admins' help.

@mmodell let's see how and when we can plan this (I am pinging you, but in reality I want to ping all phab admins; please help me contact them).

@jcrespo: The not-exactly-official list of phabricator admins would be myself, @demon and @Aklapper.

I don't think there should be any issue with moving Phabricator. There is a downtime window scheduled every Wednesday night / early Thursday morning (at 01:00 UTC), or we can schedule a different time slot if that time isn't good for you.

I'm happy to be on hand to assist in any failovers. I don't expect phabricator to need much other than a config patch if the database master host names change.

It is a bit more complex than that: we need to fail over the slave's duties to the master (and use only the master). Then (for example, the following week) we need to do the same for the master, and probably restart the service.

Let's schedule the first action for next Wednesday (and meet at that time) if possible. I will take care of everything, but I need you to be around to troubleshoot if something goes wrong (usually that only means reloading the config and restarting the service, due to persistent connections).
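
For the record, on a stock Phabricator install the "reload the config and restart the service" step boils down to something like the following (a sketch only; the hostname is a placeholder, and here the change would actually come from the config patch mentioned above rather than being set by hand):

  # Point Phabricator at the new database master (placeholder hostname).
  ./bin/config set mysql.host db1XXX.eqiad.wmnet

  # Restart the daemons so they drop persistent connections to the old
  # master and pick up the new configuration.
  ./bin/phd restart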

@jcrespo: Thanks, sounds good to me! Is the 01:00 UTC time slot ok for you? It's evening in my time zone, but I know that's super late for Europe and I'm not sure what time zone you are located in. It's not a problem to schedule it for earlier in the day if that's better for you.

@mmodell please let's use the more specific T138460 for coordinating this, so you do not get spammed with other servers' activity.