Upgrade Fastnetmon to 1.2.4
Closed, Resolved · Public

Description

FastNetMon 1.2.4 is out and brings a useful feature: native Prometheus support.

If it works as described in https://fastnetmon.com/monitoring-fastnetmon-via-prometheus/, using it might help us monitor FastNetMon's health better and monitor DDoS attacks more efficiently.

We're moving these hosts to Bookworm, which ships 1.2.4 natively:

  • netflow1002.eqiad.wmnet
  • netflow2003.codfw.wmnet
  • netflow3002.esams.wmnet
  • netflow4002.ulsfo.wmnet
  • netflow5002.eqsin.wmnet
  • netflow6001.drmrs.wmnet
  • decommission netflow2002

Event Timeline

Restricted Application added a subscriber: Aklapper.
ayounsi added a parent task: Restricted Task. (Mar 1 2023, 3:13 PM)

I tried to backport FNM 1.2.4, but it does some tricky things with Boost, and so far I couldn't force cmake to accept Bullseye's Boost libs. Given that Bookworm is close (unstable has 1.2.4, testing has 1.2.3, and I'm not sure whether 1.2.4 will still migrate), I'm inclined to move forward with Bookworm netflow VMs. Even if 1.2.4 isn't accepted into Bookworm any more, the backport should be simple on Bookworm.

That's fine for me! Is it as easy as a re-image?

It'll take a bit to sort out the setup. My proposal would be to add an additional netflow* VM in one of the sites, get the stack ready, and then reimage the netflow* VMs in the other sites.

Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm executed with errors:

  • netflow2003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm executed with errors:

  • netflow2003 (FAIL)
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm executed with errors:

  • netflow2003 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Unable to disable Puppet, the host may have been unreachable
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm executed with errors:

  • netflow2003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm executed with errors:

  • netflow2003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm executed with errors:

  • netflow2003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm executed with errors:

  • netflow2003 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • The reimage failed, see the cookbook logs for the details

Mentioned in SAL (#wikimedia-operations) [2023-05-19T08:09:33Z] <moritzm> copy samplicator from bullseye-wikimedia to bookworm-wikimedia T330884

@ayounsi There's now netflow2003 running Bookworm with FNM 1.2.4. If that works fine, we can reimage the other netflow* VMs in-place once Bookworm is stable.

I copied over samplicator from bullseye-wikimedia to bookworm-wikimedia (the only dependency is glibc itself), but there wasn't a source package on apt.wikimedia.org. Do you by chance still have it on your laptop or the build host so that we can import it?

Change 921375 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/homer/public@master] codfw: use new netflow server

https://gerrit.wikimedia.org/r/921375

Change 921375 merged by jenkins-bot:

[operations/homer/public@master] codfw: use new netflow server

https://gerrit.wikimedia.org/r/921375

Change 921384 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Kafka: add netflow2003 to the allowed sources

https://gerrit.wikimedia.org/r/921384

Change 921384 merged by Ayounsi:

[operations/puppet@production] Kafka: add netflow2003 to the allowed sources

https://gerrit.wikimedia.org/r/921384

@ayounsi There's now netflow2003 running Bookworm with FNM 1.2.4. If that works fine, we can reimage the other netflow* VMs in-place once Bookworm is stable.

I pointed the DFW routers to the new instance and both fastnetmon and nfacctd are working fine!

Example Prometheus metrics (after briefly enabling the exporter) are visible in paste P48400.
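
For anyone who wants to inspect such metrics by hand, here is a minimal sketch of parsing the Prometheus text exposition format in Python. The metric names in SAMPLE are hypothetical placeholders and are not taken from FastNetMon or from the paste. The URL passed to fetch_metrics is likewise an assumption to be replaced with whatever host and port the exporter is configured to listen on. The parser assumes samples carry no trailing timestamp.

```python
import urllib.request


def parse_prometheus_text(text):
    """Parse Prometheus text exposition format into {series: value}.

    The series key keeps the label set, e.g. 'name{label="x"}'.
    Assumes no optional timestamp follows the sample value.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field on the line.
        series, _, value = line.rpartition(" ")
        try:
            samples[series.strip()] = float(value)
        except ValueError:
            # Ignore lines that do not end in a numeric value.
            continue
    return samples


def fetch_metrics(url, timeout=5.0):
    """Fetch and parse a metrics endpoint; the URL is supplied by the caller."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_prometheus_text(response.read().decode("utf-8"))


# Hypothetical sample payload, for illustration only.
SAMPLE = """\
# HELP example_total_packets Hypothetical packet counter
# TYPE example_total_packets counter
example_total_packets{direction="incoming"} 1234
example_total_packets{direction="outgoing"} 567
example_up 1
"""

if __name__ == "__main__":
    for series, value in sorted(parse_prometheus_text(SAMPLE).items()):
        print(series, value)
```

In production the Prometheus server does the scraping itself; a script like this is only useful for a quick manual check that the exporter responds.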

I copied over samplicator from bullseye-wikimedia to bookworm-wikimedia (the only dependency is glibc itself), but there wasn't a source package on apt.wikimedia.org. Do you by chance still have it on your laptop or the build host so that we can import it?

Unfortunately, no. If it helps, I think I got it from https://github.com/sleinen/samplicator
Some people have tried to build deb packages as well, see https://github.com/hfeeki/samplicator-debian

Change 921390 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Fastnetmon: enable Prometheus exporter

https://gerrit.wikimedia.org/r/921390

Change 921394 had a related patch set uploaded (by Ayounsi; author: Ayounsi):

[operations/puppet@production] Prometheus: fetch FastNetMon metrics

https://gerrit.wikimedia.org/r/921394

OK, the current package will work fine for now (given that libc6 is the only dependency). I'm filing a low-priority task to eventually also import a source package (we might also simply upload it to Debian at some point).
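
The reasoning above (a binary package can be copied between suites when libc6 is its only dependency) can be checked mechanically. The following is a minimal sketch, assuming the Depends field has already been extracted, for example with `dpkg-deb -f <package>.deb Depends`. The field values used in the example are hypothetical and were not read from the actual samplicator package.

```python
def binary_dependencies(depends_field):
    """Return the package names listed in a Debian 'Depends' field value."""
    names = []
    # Clauses are comma-separated; each may list '|'-separated alternatives.
    for clause in depends_field.split(","):
        for alternative in clause.split("|"):
            alternative = alternative.strip()
            if not alternative:
                continue
            # Drop a version constraint such as "(>= 2.4)" and an
            # architecture qualifier such as ":any".
            name = alternative.split("(")[0].strip().split(":")[0]
            names.append(name)
    return names


def only_depends_on(depends_field, allowed):
    """True if every listed dependency is in the allowed set."""
    return set(binary_dependencies(depends_field)) <= set(allowed)


if __name__ == "__main__":
    # Hypothetical field values, for illustration only.
    print(only_depends_on("libc6 (>= 2.4)", {"libc6"}))
    print(only_depends_on("libc6 (>= 2.4), libfoo1", {"libc6"}))
```

This only inspects declared dependencies; it does not verify that the libc6 version constraint is satisfied on the target suite.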

@ayounsi There's now netflow2003 running Bookworm with FNM 1.2.4. If that works fine, we can reimage the other netflow* VMs in-place once Bookworm is stable.

Bookworm is released, so these are good to go. Is it acceptable to reimage these in place (meaning a loss of netflow data for about an hour each), or do we need the create-VMs-and-failover dance?

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host netflow4002.ulsfo.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host netflow4002.ulsfo.wmnet with OS bookworm completed:

  • netflow4002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306120901_jmm_2496087_netflow4002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host netflow5002.eqsin.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host netflow5002.eqsin.wmnet with OS bookworm completed:

  • netflow5002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306121127_jmm_2625752_netflow5002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host netflow6001.drmrs.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host netflow6001.drmrs.wmnet with OS bookworm completed:

  • netflow6001 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306121250_jmm_2737911_netflow6001.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 929340 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove netflow2002 from Kafka config

https://gerrit.wikimedia.org/r/929340

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host netflow3002.esams.wmnet with OS bookworm

Change 929340 merged by Muehlenhoff:

[operations/puppet@production] Remove netflow2002 from Kafka config

https://gerrit.wikimedia.org/r/929340

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host netflow3002.esams.wmnet with OS bookworm completed:

  • netflow3002 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306130732_jmm_3899502_netflow3002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host netflow1002.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host netflow1002.eqiad.wmnet with OS bookworm completed:

  • netflow1002 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202306130835_jmm_3975116_netflow1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
MoritzMuehlenhoff updated the task description.

Change 921390 merged by Ayounsi:

[operations/puppet@production] Fastnetmon: enable Prometheus exporter

https://gerrit.wikimedia.org/r/921390

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: netflow2002.codfw.wmnet

  • netflow2002.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
MoritzMuehlenhoff claimed this task.
MoritzMuehlenhoff updated the task description.

All netflow hosts are migrated to Bookworm and thus to FNM 1.2.4.

Change 921394 merged by Ayounsi:

[operations/puppet@production] Prometheus: fetch FastNetMon metrics

https://gerrit.wikimedia.org/r/921394

Mentioned in SAL (#wikimedia-operations) [2023-07-10T11:14:32Z] <moritzm> remove unused VM netflow6002 T330884