schedule downtime for contint2001
Closed, Duplicate (Public)

Description

Hi releng,

we need to schedule downtime for contint2001.wikimedia.org for a DRAC firmware upgrade.

This should fix those flapping alerts we keep seeing in Icinga for contint2001.mgmt and others (T283582).

But contint2001 is currently the primary for Jenkins CI, so we need to schedule it somehow.

Then of course we can downtime it in Icinga via cumin, shut it down, and tell dcops to go ahead.


@hashar points out these things already:

<+hashar> mutante: contint2001 is the primary for Jenkins CI. We could switch over but the runbook has some issues related to file permissions :-\

^ This is what is actually needed for this ticket; it should be limited to unblocking the firmware upgrade for now.

<+hashar> we should be making it way faster by simply ditching the build history (but still keep the last build number)

^ This probably should be its own ticket.

Event Timeline

contint2001.wikimedia.org is indeed the primary for CI (Jenkins and Zuul). We could switch over to the other host, but the runbook has some issues (T256396). Ideally those would be fixed first, but unfortunately I don't think that is trivial :-\

I imagine the DRAC upgrade + reboot + fsck would be reasonably fast, so we can schedule a downtime of an hour or more. It would need to be squeezed in between the deployment windows at https://wikitech.wikimedia.org/wiki/Deployments and announced ahead of time.

< mutante> then let's just tell @Papaul what time is ok, basically
< mutante> or a time where all can be around with him in DC
<+hashar> anything in his morning, and we gotta pick a time that fits in https://wikitech.wikimedia.org/wiki/Deployments
<+hashar> but maybe we have a no deploy week coming. @thcipriani would know


< Spookreeeno> mutante: 2nd or 11th Nov are the next no deploy days

Also see T256422 (switch contint prod server back from contint2001 to contint1001).

In my experience this is better done during low CI traffic; start of morning in Dallas will work just fine. We would then send a mail stating that CI is unavailable for X time at Y, and handle the few enquiries on IRC as to whether CI is having trouble. But at least Gerrit will still be up, and we can then retrigger any CI workflow that went missing.

> start of morning in Dallas will work just fine.

Cool, thanks! So, @Papaul maybe you want to suggest a date to us that would work for you?

@Dzahn Next week Monday 1st at 9:30 am CT

It is a holiday here in France (All Saints' Day), but then I am not critical to the DRAC upgrade ;) I will make arrangements; it will be 16:30 CET if I am not mistaken, which sounds good.

@hashar I am wondering if you need me around (for mgmt access / root / +2). I have requested that day off, but it is not confirmed yet.
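The Central Time to CET conversion above is easy to get wrong around the DST changeover, so here is a minimal sketch for double-checking it. It assumes the year is 2021 (the ticket does not state it; 2021 is a year in which 1 November falls on a Monday) and uses the `America/Chicago` and `Europe/Paris` zones as stand-ins for "CT" and "CET". The helper name `dallas_to_paris` is made up for illustration.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9


def dallas_to_paris(year, month, day, hour, minute):
    """Convert a Dallas wall-clock time to the matching Paris wall-clock time."""
    # Attach the Central Time zone; DST rules for that date are applied automatically.
    local = datetime(year, month, day, hour, minute, tzinfo=ZoneInfo("America/Chicago"))
    # Re-express the same instant in the Paris zone.
    return local.astimezone(ZoneInfo("Europe/Paris"))


# ASSUMPTION: year 2021, inferred only from "Monday 1st" in November.
slot = dallas_to_paris(2021, 11, 1, 9, 30)
print(slot.strftime("%Y-%m-%d %H:%M %Z"))
```

Under that assumption this prints 2021-11-01 15:30 CET rather than 16:30, because on that date Europe had already left daylight saving time while the US had not. Worth confirming before sending the announcement mail.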

After re-thinking this and chatting some more on IRC, I now think we should not do this, and I will close my own request as invalid.

It's not worth the time or the risk of touching this old server with an important service on it, when switching over is not trivial and we may be replacing the entire box anyway. All of this is just for a flapping monitoring alert, which we can easily downtime for 3 months.

Let's focus on the more important part and replace this hardware that is quite old. Then let's bring up new hardware in parallel and do the switch and forget about this one.

I suggest doing this once T256422 or T294276 is resolved, or once CI no longer runs on contint* servers, whichever comes first.

Until then, just downtime the flapping Icinga alert from T283582, which is now closed after most affected servers were fixed.

Be bold and reopen if you really think otherwise.

P.S. The actual "contint2001.mgmt" alert in Icinga last fired quite some time ago, so it is not worth it. There are other alerts (IPMI Sensor Status: internal IPMI error from sensors), but that is just more reason not to spend too much time on old hardware.