Put netmon1003 in service
Closed, Resolved (Public)

Description

This task tracks putting netmon1003 (Bullseye) in service.

Prerequisites:

Failover:

Post-failover validations:


The following issues were found after the failover:

The checked issues have patches that resolve them and prevent them from happening again when doing another netmon instance deployment.

Event Timeline

Approaches for putting netmon1003 in service.

a. Regular Active/Passive nodes:

  1. Fail primary to codfw.
  2. Set the netmon1003 instance as backup.
  3. Fail back to the netmon1003 instance.
  4. Live troubleshoot any issues.

b. Multiple passive nodes:

  1. Adapt Puppet to have multiple backup/passive nodes.
  2. Set the netmon1003 instance as passive node.
  3. Live troubleshoot any issues.

c. Having netmon1003 down for maintenance:

  1. Stop the netmon1002 and netmon2001 instance services.
  2. Deploy the new netmon1003 instance.
  3. Configure the netmon1003 instance as active node and netmon2001 as passive node.
  4. Live troubleshoot any issues.

After discussing this with the team in the weekly team meeting, we decided to proceed with approach 'b'.
More detailed steps are being added to a Google Docs document for feedback and review.
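
As a rough illustration of approach 'b', the sketch below models one active node with any number of passive nodes, so that a failover only changes which host is marked active. It is not the Puppet implementation: `NetmonRoles` and `assign_roles` are hypothetical names, and treating netmon1002 as the current active node is an assumption, since the task does not state it explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetmonRoles:
    """Hypothetical container: one active host, any number of passive hosts."""
    active: str
    passive: tuple[str, ...]


def assign_roles(hosts: list[str], active: str) -> NetmonRoles:
    """Mark one host as active and every other known host as passive."""
    if active not in hosts:
        raise ValueError(f"{active} is not a known netmon host")
    passive = tuple(h for h in hosts if h != active)
    return NetmonRoles(active=active, passive=passive)


hosts = ["netmon1002", "netmon2001", "netmon1003"]

# Assumed starting point: netmon1002 active, netmon1003 added as a second passive node.
before = assign_roles(hosts, active="netmon1002")

# After the failover, netmon1003 is active and the other two are passive.
after = assign_roles(hosts, active="netmon1003")
```

With this shape, adding netmon1003 as a passive node and later failing over to it are two separate, small changes, which is the property that makes approach 'b' attractive.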

Change 814848 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] netmon: Add suppport for multiple backup/passive nodes in Puppet

https://gerrit.wikimedia.org/r/814848

Change 814848 merged by Andrea Denisse:

[operations/puppet@production] netmon: Add suppport for multiple backup/passive nodes in Puppet

https://gerrit.wikimedia.org/r/814848

Change 802593 merged by Andrea Denisse:

[operations/puppet@production] Add role::netmon to the netmon1003 instance.

https://gerrit.wikimedia.org/r/802593

Change 817887 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] netmon: Add netmon instances to Acme-chief's Hieradata for Smokeping

https://gerrit.wikimedia.org/r/817887

Change 817887 merged by Andrea Denisse:

[operations/puppet@production] netmon: Add netmon instances to Acme-chief's Hieradata for Smokeping

https://gerrit.wikimedia.org/r/817887

Change 817890 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] netmon: Add netmon1003 to scap's hieradata.

https://gerrit.wikimedia.org/r/817890

Change 817890 merged by Andrea Denisse:

[operations/puppet@production] netmon: Add netmon1003 to scap's hieradata.

https://gerrit.wikimedia.org/r/817890

Change 819124 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/homer/public@master] netmon: Add the netmon1003 host as a syslog destination in homer

https://gerrit.wikimedia.org/r/819124

lmata changed the task status from Open to In Progress. (Aug 1 2022, 7:57 PM)
lmata moved this task from Inbox to In progress on the SRE Observability (FY2022/2023-Q1) board.
lmata subscribed.

Change 819177 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/dns@master] netmon: failover to netmon1003

https://gerrit.wikimedia.org/r/819177

Change 819179 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] netmon: failover to netmon1003

https://gerrit.wikimedia.org/r/819179

I think it's great.

For the sake of completeness you should list the verification and cleanup actions to perform after the failover to ensure everything is working as expected. For example:

Change 820215 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] netmon: Open firewall port to connecto to the LibreNMS database.

https://gerrit.wikimedia.org/r/820215

Change 820215 merged by Andrea Denisse:

[operations/puppet@production] netmon: Open firewall port to connecto to the LibreNMS database.

https://gerrit.wikimedia.org/r/820215

Ping for @Ladsgroup, as Manuel is on vacation. The Observability team and netops asked for someone to oversee the maintenance and make sure that grants and database connections in general keep working afterwards, as they are going to switch the active MySQL client for librenms (m1) to another server. Since the database is behind a proxy, probably not many changes are needed, but this is a heads-up.

I will also be on standby to take a custom librenms/m1 backup before the maintenance happens, at their request. That is always a good idea, as it avoids problems and makes recovery immediate if anything goes wrong (e.g. I think an upgrade will happen at the same time). :-)

Change 820554 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] netmon: Add DNS entries to test LibreNMS in Debian Bullseye

https://gerrit.wikimedia.org/r/820554

A few data precautions that shouldn't affect the planned maintenance itself:

Backing up librenms takes around 7 minutes. If you give me some advance notice ahead of the maintenance, I can take a fresh backup just after the application goes down, for fast recovery. In any case, a backup was taken today at 01:46:24.

Normally we also stop replication on a passive db for even faster recovery. My intention is to stop replication on db1117:m1 for some time, until the new setup is confirmed to be working as intended and with all its data.

None of these are normally used or needed, but better safe than sorry :-D.

Mentioned in SAL (#wikimedia-operations) [2022-08-09T13:23:30Z] <jynus> stop replication on db1117:m1 T309074

Change 819177 merged by Andrea Denisse:

[operations/dns@master] netmon: failover to netmon1003

https://gerrit.wikimedia.org/r/819177

Change 821746 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/dns@master] netmon: Smokeping failover to netmon1003

https://gerrit.wikimedia.org/r/821746

Change 821746 merged by Andrea Denisse:

[operations/dns@master] netmon: Smokeping failover to netmon1003

https://gerrit.wikimedia.org/r/821746

Change 819179 merged by Andrea Denisse:

[operations/puppet@production] netmon: failover to netmon1003

https://gerrit.wikimedia.org/r/819179
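
Once the DNS and Puppet failover changes are merged, one simple check is whether the service names now resolve to the new host. The following is a minimal sketch of such a check, not the procedure used for this task: `check_failover` is a hypothetical helper, and the `example.org` names and `192.0.2.x` documentation addresses are placeholders rather than the real service records.

```python
import socket
from typing import Callable


def check_failover(
    expected: dict[str, str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> dict[str, bool]:
    """Map each service name to whether it resolves to the expected address."""
    results = {}
    for name, address in expected.items():
        try:
            results[name] = resolve(name) == address
        except OSError:
            # A resolution failure (e.g. socket.gaierror) counts as a failed check.
            results[name] = False
    return results


# Offline demonstration with a stub resolver standing in for real DNS.
# Placeholder data: the second name still points at an old address.
fake_dns = {
    "librenms.example.org": "192.0.2.10",
    "smokeping.example.org": "192.0.2.99",
}
report = check_failover(
    {
        "librenms.example.org": "192.0.2.10",
        "smokeping.example.org": "192.0.2.10",
    },
    resolve=fake_dns.__getitem__,
)
```

Passing the resolver in as a parameter keeps the check testable without network access; leaving it at the default uses the system resolver, which may still return cached records until their TTL expires.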

Change 819124 merged by jenkins-bot:

[operations/homer/public@master] netmon: Add the netmon1003 host as a syslog destination in homer

https://gerrit.wikimedia.org/r/819124

Change 820554 abandoned by Andrea Denisse:

[operations/puppet@production] netmon: Add DNS entries to test LibreNMS in Debian Bullseye

Reason:

Required changes were merged and it seems to be working properly now.

https://gerrit.wikimedia.org/r/820554

Mentioned in SAL (#wikimedia-operations) [2022-08-10T08:13:20Z] <jynus> restart replication on db1117:m1 T309074

Change 822124 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] netmon: Use netmon1003's IP address for the librenms endpoint

https://gerrit.wikimedia.org/r/822124

Change 822126 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] netmon: Add the netmon1003 host to the alertmanager API rw

https://gerrit.wikimedia.org/r/822126

Change 822124 merged by Filippo Giunchedi:

[operations/puppet@production] netmon: Use netmon1003's IP address for the librenms endpoint

https://gerrit.wikimedia.org/r/822124

Change 822126 merged by Filippo Giunchedi:

[operations/puppet@production] netmon: Add the netmon1003 host to the alertmanager API rw

https://gerrit.wikimedia.org/r/822126

Change 823764 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] netmon: Create LibreNMS logs file.

https://gerrit.wikimedia.org/r/823764

@andrea.denisse hey, I've hit another issue with netmon1003 and rancid.

Cloning the configs using git no longer works:

cmooney@wikilap:~/repos$ git clone ssh://netmon1003.wikimedia.org:/var/lib/rancid/core/ rancid-configs
Cloning into 'rancid-configs'...
fatal: '/var/lib/rancid/core/' does not appear to be a git repository
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

I suspect the issue is the permissions on the rancid directory:

cmooney@netmon1003:~$ cd /var/lib/rancid/core
-bash: cd: /var/lib/rancid/core: Permission denied

If you cd in there as root it is a git repo, but the permissions possibly need to be different to allow us to clone it using our normal user accounts.
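
To make the suspected cause concrete, here is a small sketch of the standard Unix rule that decides whether a user may enter a directory. The mode, owner and group IDs used below are hypothetical, since the actual ownership and mode of /var/lib/rancid/core are not shown in this task, and `can_traverse` is an illustrative helper rather than an existing tool.

```python
import stat


def can_traverse(mode: int, owner_uid: int, owner_gid: int,
                 uid: int, gids: set[int]) -> bool:
    """Return True if the user may enter (cd into) a directory with this mode."""
    if uid == 0:
        return True  # root bypasses directory permission checks on typical Linux systems
    if uid == owner_uid:
        return bool(mode & stat.S_IXUSR)  # owner class: user execute bit
    if owner_gid in gids:
        return bool(mode & stat.S_IXGRP)  # group class: group execute bit
    return bool(mode & stat.S_IXOTH)      # everyone else: other execute bit


# Hypothetical values: directory owned by uid 110 / gid 115 with mode 0750.
RANCID_UID, RANCID_GID = 110, 115

# A normal user outside the owning group cannot enter the directory.
outsider = can_traverse(0o750, RANCID_UID, RANCID_GID, uid=2000, gids={2000})

# The same user can enter it once they are in the owning group...
member = can_traverse(0o750, RANCID_UID, RANCID_GID, uid=2000, gids={2000, RANCID_GID})

# ...or if the directory is made traversable by everyone (mode 0755).
world = can_traverse(0o755, RANCID_UID, RANCID_GID, uid=2000, gids={2000})
```

Under these assumptions, either adding the relevant users to the owning group or relaxing the directory mode would let `git clone` over SSH reach the repository, which matches the symptom that it only works as root.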

I opened the T316569 task to track the issue.
Thanks for the report @cmooney !

andrea.denisse updated the task description.

Change 865708 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] netmon: Remove rsync quickdatacopy failover restrictions

https://gerrit.wikimedia.org/r/865708

Change 865708 merged by Andrea Denisse:

[operations/puppet@production] netmon: Remove rsync quickdatacopy failover restrictions

https://gerrit.wikimedia.org/r/865708

Change 887409 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] netmon: Add netmon1003 to the ganeti rapi nodes list

https://gerrit.wikimedia.org/r/887409

Change 887409 merged by Andrea Denisse:

[operations/puppet@production] netmon: Add netmon1003 to the ganeti rapi nodes list

https://gerrit.wikimedia.org/r/887409

Change 928900 had a related patch set uploaded (by BCornwall; author: BCornwall):

[operations/dns@master] Remove leftover TODO item

https://gerrit.wikimedia.org/r/928900

Change 928900 merged by BCornwall:

[operations/dns@master] Remove leftover TODO item

https://gerrit.wikimedia.org/r/928900