Bringing mx2001 back into service
Closed, ResolvedPublic

Description

Due to T297127 and T297017, mx2001 was removed from service for the weekend.

The current state of mx2001: puppet is disabled, exim4 is stopped, and the kernel has been downgraded (manually via the grub boot menu, AFAIK).

Additionally, recent changes to MX priority and LDAP routing have been reverted. mx1001 is currently the primary in terms of both MX and exim smarthost priorities.

AIUI we'll be sticking with the present 5.10.0-8-amd64 kernel for the holiday period and postponing follow-up kernel upgrades until the new year. So I think it'd be worthwhile to persist the kernel config, in case mx2001 is rebooted.

Creating this task to track steps to stabilize mx2001 and bring it back into service.

Event Timeline

re: making current kernel version persistent

The kernel running now was selected in grub but wasn't the default selection. Either edit the grub menu or, which I would consider easier and more fail-safe, just apt-get remove the broken, newer kernel?

I'll remove these fleet-wide via https://phabricator.wikimedia.org/T297180
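
As a rough illustration of the "remove the newer kernel" approach discussed above, the following sketch picks which versioned kernel image packages differ from the running kernel. The helper name `kernels_to_remove`, the package-name pattern, and the newer package in the sample list are assumptions made for illustration, not taken from the task; on a real host the inputs would come from something like `uname -r` and the package database.

```python
import re

# Assumed naming pattern for versioned Debian kernel image packages,
# e.g. "linux-image-5.10.0-8-amd64". Meta-packages such as
# "linux-image-amd64" intentionally do not match.
KERNEL_PKG = re.compile(r"^linux-image-(\d+\.\d+\.\d+-\d+-amd64)$")


def kernels_to_remove(installed_packages, running_release):
    """Return versioned kernel image packages that are not the running kernel.

    Hypothetical helper: it only computes a list and never removes anything.
    """
    removable = []
    for pkg in installed_packages:
        match = KERNEL_PKG.match(pkg)
        if match and match.group(1) != running_release:
            removable.append(pkg)
    return removable


# Illustrative input only; the newer package name is a made-up example.
installed = [
    "linux-image-5.10.0-8-amd64",
    "linux-image-5.10.0-9-amd64",
    "linux-image-amd64",
]
print(kernels_to_remove(installed, "5.10.0-8-amd64"))
```

The resulting list would then be passed to `apt-get remove` by hand or by automation. The kernel meta-package may pull a newer kernel back in on a later upgrade, which this sketch does not address.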

I see the running kernel is now the only version installed, looks good. Is there anything else to do before re-enabling puppet on mx2001 & re-enabling exim? MX and smarthost priorities would still have mx1001 as primary and mx2001 as backup, but it would be good to restore redundancy for the MXes soon.

Thanks for the reminder, @herron. I will do that now.

herron claimed this task.

Resolving as everything in the description has now been done. Please reopen if anything else is needed!

Somebody clicked "disable active checks" on Icinga for mx2001. The opposite of "active checks" is "expect passive checks", though. That's not the proper way to disable monitoring, since it means the host will probably sit there forever as "OK" without actually being checked, and we would never notice.

Instead, use downtimes that have a pre-configured end. Those we won't forget. Or, if that is not an option, use "disable notifications" instead. That is at least a bit more obvious than the "fake passive checks".
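
To make the "downtime with a pre-configured end" idea concrete, here is a minimal sketch that builds a fixed host downtime in the classic Nagios/Icinga 1 external command format (`SCHEDULE_HOST_DOWNTIME`). The function name `build_downtime_command`, the author, and the comment text are hypothetical. Where the command gets written (the Icinga command file) is site-specific and not shown, and local tooling may wrap this differently.

```python
import time


def build_downtime_command(host, duration_s, author, comment, now=None):
    """Build a fixed-length SCHEDULE_HOST_DOWNTIME external command line.

    Hypothetical helper: it only returns the string and does not submit it.
    Field order: host; start; end; fixed; trigger_id; duration; author; comment.
    """
    start = int(time.time()) if now is None else int(now)
    end = start + duration_s  # the downtime expires on its own at this time
    return (
        f"[{start}] SCHEDULE_HOST_DOWNTIME;"
        f"{host};{start};{end};1;0;{duration_s};{author};{comment}"
    )


# Example: a two-hour downtime with made-up author and comment values.
print(build_downtime_command("mx2001", 7200, "example-user", "kernel maintenance"))
```

Because the end time is part of the command, monitoring resumes without anyone having to remember to re-enable it, which is the advantage over disabling active checks.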

I'll fix that and re-enable monitoring on mx2001.

Mentioned in SAL (#wikimedia-operations) [2021-12-14T17:21:27Z] <mutante> icinga - re-enabling active monitoring checks on mx2001 (T297128)