This task tracks putting netmon1003 (Bullseye) in service.
Prerequisites:
- Validate in Pontoon that Puppet runs cleanly on Bullseye and that all services are up
- Add prometheus-atlas-exporter for Bullseye in reprepro
- Ensure LibreNMS dependencies are installed for Debian Bullseye https://gerrit.wikimedia.org/r/c/operations/puppet/+/810106/
- Update ACLs on network devices, via Capirca https://wikitech.wikimedia.org/wiki/Homer#Capirca_%28ACL_generation%29
- Add support for multiple backup/passive nodes in Puppet
- Set netmon1003 as a backup/standby netmon host https://gerrit.wikimedia.org/r/c/operations/puppet/+/814848
- Apply the role::netmon to the netmon1003 host
- Confirm the netmon1003 host has DB privileges (confirmed with @Ladsgroup)
- Open port in the netmon1003 firewall configuration https://gerrit.wikimedia.org/r/c/operations/puppet/+/820215
- Connect to the LibreNMS database
- Determine what data (if any) should be carried over from netmon1002 to netmon1003
Failover:
- Failover to netmon1003
- Sync RRDs in /srv/librenms/rrd/
- Sync rancid in /var/lib/rancid/core and /var/lib/rancid/GIT
- Flip DNS for LibreNMS and Smokeping from netmon1002 to netmon1003 https://gerrit.wikimedia.org/r/c/operations/dns/+/819177
- Set netmon1003 as netmon_server and netmon1002 as a netmon_servers_failover in the Puppet repository https://gerrit.wikimedia.org/r/c/operations/puppet/+/819179
- Add the new host as a syslog destination in homer templates/common/system.conf https://gerrit.wikimedia.org/r/c/operations/homer/public/+/819124
- Use netmon1003's IP address for the librenms endpoint https://gerrit.wikimedia.org/r/822124
- Add the netmon1003 host to the alertmanager API rw https://gerrit.wikimedia.org/r/822126
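The two sync steps in the failover list (RRDs and rancid) could be scripted roughly as below. This is only a sketch: it assumes the copy is pulled with rsync over SSH from the old host while logged in on the new one, which the task does not state. The function name `build_rsync_commands` and the bare host name passed to it are illustrative; only the three paths come from the list above. It builds the commands and prints them, it does not run anything.

```python
import shlex

# Paths taken from the failover checklist above.
SYNC_PATHS = (
    "/srv/librenms/rrd/",
    "/var/lib/rancid/core/",
    "/var/lib/rancid/GIT/",
)


def build_rsync_commands(source_host, paths=SYNC_PATHS, dry_run=True):
    """Return one rsync argv list per path, pulling from source_host.

    Assumption: the commands run on the new host and the old host is
    reachable over SSH; neither is stated in the task.
    """
    commands = []
    for path in paths:
        # A trailing slash on the source makes rsync copy the directory
        # contents instead of nesting the directory in the destination.
        src = path.rstrip("/") + "/"
        argv = ["rsync", "--archive", "--hard-links"]
        if dry_run:
            # Only report what would be transferred.
            argv.append("--dry-run")
        argv += [f"{source_host}:{src}", src]
        commands.append(argv)
    return commands


if __name__ == "__main__":
    # Print the dry-run commands for review.
    for argv in build_rsync_commands("netmon1002"):
        print(shlex.join(argv))
```

Defaulting to `--dry-run` keeps the first pass harmless. Given the uid/gid issue listed further down, file ownership on the destination is worth checking after a real copy.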
Post-failover validations:
- Ensure metrics in the various tabs of https://librenms.wikimedia.org/poller are similar before and after the failover
- Ensure no "device took too long to poll" alert is firing
- Ensure graphs look healthy
- Ensure there are no active Icinga alerts for the netmon1003 host
- Ensure there are no red flags in the logs - https://wikitech.wikimedia.org/wiki/LibreNMS#Check_the_logs
- Ensure the new rancid directory works - https://wikitech.wikimedia.org/wiki/RANCID
- Update all docs that mention netmon1002 (and maybe add a link to this task on the LibreNMS wikitech page).
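For the "similar before and after" check on poller metrics, one possible approach is to note a few numbers before the failover and compare them afterwards with a relative tolerance. The sketch below assumes the values are copied by hand from the poller page; the metric names, the sample values, and the 20% threshold are all hypothetical and not taken from LibreNMS or from this task.

```python
def compare_metrics(before, after, rel_tolerance=0.2):
    """Return {name: (old, new)} for metrics that drifted or disappeared.

    A metric counts as drifted when its relative change exceeds
    rel_tolerance. The 0.2 default is an arbitrary illustration.
    """
    drifted = {}
    for name, old in before.items():
        new = after.get(name)
        if new is None:
            # Metric missing after the failover: always report it.
            drifted[name] = (old, None)
            continue
        # Guard against division by zero for metrics that were 0 before.
        baseline = max(abs(old), 1e-9)
        if abs(new - old) / baseline > rel_tolerance:
            drifted[name] = (old, new)
    return drifted


if __name__ == "__main__":
    # Hypothetical values noted before and after the failover.
    before = {"devices_polled": 100, "total_poll_seconds": 60.0}
    after = {"devices_polled": 98, "total_poll_seconds": 95.0}
    print(compare_metrics(before, after))
```

With these made-up inputs only `total_poll_seconds` is reported, since the device count changed by 2% and the poll time by more than 20%.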
The following issues were found after the failover:
- LibreNMS seemingly not collecting data for many ports after migration to netmon1003
- Rancid on netmon1003 unable to login to network devices
- Logrotate is unable to rotate LibreNMS logs on the netmon instances due to insufficient permissions to read and write log files in /var/log/
- Define fleet-wide uid and gid mappings for the netmon instances running LibreNMS and Rancid
- Unable to clone the rancid git configuration in netmon1003
The checked issues have patches that resolve them and prevent them from happening again when doing another netmon instance deployment.