
LibreNMS seemingly not collecting data for many ports after migration to netmon1003
Closed, Resolved (Public)

Description

Here's the overall "core ports" traffic graph for the past week:

[Image: image.png, "core ports" traffic graph for the past week]

Since Tuesday ~13:00, LibreNMS does not seem to be scraping or aggregating data from many ports.

This seems to be affecting all types of ports: transit, peering, and transport.

At a glance, this looks related to specific ports rather than entire routers. For instance, contrast:
https://librenms.wikimedia.org/graphs/to=1660187400/id=11605/type=port_bits/from=1659582600/
vs
https://librenms.wikimedia.org/graphs/to=1660187400/id=6841/type=port_bits/from=1659582600/
both on cr2-eqiad.

Event Timeline

CDanis renamed this task from "LibreNMS seemingly not scraping many devices after migration to netmon1003" to "LibreNMS seemingly not collecting data for many ports after migration to netmon1003". Aug 11 2022, 3:13 AM
CDanis triaged this task as High priority.

Looks like permission issues:

netmon1003
ayounsi@netmon1003:/srv/librenms/rrd/cr2-eqiad.wikimedia.org$ ls -al  | egrep "(6841|11605)"
-rw-rw-r--   1 deploy-librenms librenms 1849688 Aug 11 03:42 port-id11605.rrd
-rw-r--r--   1 deploy-librenms librenms 1849688 Aug  9 13:26 port-id6841.rrd
netmon1002
ayounsi@netmon1002:/srv/librenms/rrd/cr2-eqiad.wikimedia.org$ ls -al  | egrep "(6841|11605)"
-rw-rw-r--   1 librenms librenms 1849688 Aug  9 13:51 port-id11605.rrd
-rw-r--r--   1 librenms librenms 1849688 Aug  9 13:51 port-id6841.rrd

So the permissions are the same but the file owner changed.
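The check above can be extended to the whole RRD tree. The following is a minimal sketch, not something that was run as part of this task: the function name `files_not_owned_by` is hypothetical, and it assumes the expected owner is `librenms` and the tree lives under `/srv/librenms/rrd`, as in the listings above.

```python
import os
import pwd
from pathlib import Path


def files_not_owned_by(root, expected_user):
    """Return sorted (path, owner) pairs for files under root
    whose owner is not expected_user."""
    mismatches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            uid = path.lstat().st_uid
            try:
                owner = pwd.getpwuid(uid).pw_name
            except KeyError:
                # The uid has no passwd entry on this host; report it numerically.
                owner = str(uid)
            if owner != expected_user:
                mismatches.append((str(path), owner))
    return sorted(mismatches)


if __name__ == "__main__":
    # Path and user taken from the task; adjust for other hosts.
    for path, owner in files_not_owned_by("/srv/librenms/rrd", "librenms"):
        print(f"{owner}\t{path}")
```

An empty result would mean every file under the tree already belongs to the expected user.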

Mentioned in SAL (#wikimedia-operations) [2022-08-11T03:51:03Z] <cwhite> chown librenms /srv/librenms/rrd/* on netmon1003 T314972

Mentioned in SAL (#wikimedia-operations) [2022-08-11T03:55:17Z] <denisse|m> chown -R librenms /srv/librenms/rrd/ on netmon1003 T314972

Change 822204 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] netmon: Set correct owner for the LibreNMS rrd directory.

https://gerrit.wikimedia.org/r/822204

I think that the owner is overridden to 'deploy-librenms' during the rsync::quickdatacopy process.

My reasoning is that in the Puppet repository that folder belongs to the 'www-data' user, and there do not seem to be other relevant references to the deploy-librenms user aside from the scap deploy.

I think that enforcing librenms:librenms 0775 in Puppet should be enough to solve the issue. Sent in patch #822204.

My apologies! I ran the quickdatacopy the other day ahead of the failover and was too hasty. I'll take a look at the patch. Thank you, @andrea.denisse

I looked into why quickdatacopy didn't do the right thing:

  • the rsync server module is configured with chroot = yes
  • this means rsync can't look up uid/gid mapping to users when transferring files
  • thus the destination files were written with the source's numeric uid/gid; it so happens that librenms is uid 496 on netmon1002 and deploy-librenms is uid 496 on netmon1003
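To illustrate the collision described above, here is a small sketch that models each host's passwd database as a dictionary. Only the uid 496 entries come from this task; the `PASSWD` table and the `owner_name` helper are hypothetical stand-ins for what `ls -l` does when it resolves a numeric uid to a name.

```python
# Hypothetical per-host passwd tables, reduced to the one entry
# mentioned in the task (uid 496 on each host).
PASSWD = {
    "netmon1002": {496: "librenms"},
    "netmon1003": {496: "deploy-librenms"},
}


def owner_name(host, uid):
    """Resolve a numeric uid to a user name on the given host,
    falling back to the number when the host has no such entry."""
    return PASSWD[host].get(uid, str(uid))


# A file copied with its numeric uid preserved keeps uid 496,
# but each host shows a different name for that uid.
print(owner_name("netmon1002", 496))
print(owner_name("netmon1003", 496))
```

The file's uid never changes during the copy. Only the name that the destination host associates with that uid differs.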

We can either:

  • drop chroot = yes on the rsync module
  • (IMHO simpler) pass --chown librenms:librenms to the rsync command when syncing data
  • define a fleetwide uid/gid for librenms (which I think we should do anyway, but it will only take effect from the next reimages unless we manually change the uid/gid)
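As a rough sketch of the second option, the snippet below builds an rsync invocation that forces ownership on the receiving side. It assumes rsync 3.1.0 or later (where `--chown=USER:GROUP` is available) and that the receiver runs as root. The source URL, including the module name, is a placeholder and not the real quickdatacopy configuration, and `rsync_pull_command` is a hypothetical helper.

```python
import shlex


def rsync_pull_command(source, dest, user, group):
    """Build an rsync argv that copies source to dest and forces
    the given owner and group on every received file."""
    return [
        "rsync",
        "-a",  # archive mode; implies -o and -g, which --chown relies on
        f"--chown={user}:{group}",
        source,
        dest,
    ]


if __name__ == "__main__":
    cmd = rsync_pull_command(
        "rsync://SOURCE_HOST/MODULE/",  # placeholder, not the real module
        "/srv/librenms/rrd/",
        "librenms",
        "librenms",
    )
    # Print the command for review instead of running it.
    print(shlex.join(cmd))
```

Because the names are resolved on the receiving host, this does not depend on the chrooted rsync server being able to look up uid/gid names.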

Change 822204 abandoned by Andrea Denisse:

[operations/puppet@production] netmon: Set correct owner for the LibreNMS rrd directory.

Reason:

Abandoning in favor of a patch that will tackle the issue correctly.

https://gerrit.wikimedia.org/r/822204

Change 823748 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] quickdatacopy: Added simple username/groupname mapping for the Rsync server

https://gerrit.wikimedia.org/r/823748

Change 823752 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] netmon: Set correct username/groupname mappings for LibreNMS

https://gerrit.wikimedia.org/r/823752

Change 823759 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] netmon: Set correct username/groupname mappings for Rancid

https://gerrit.wikimedia.org/r/823759

Hello team, I submitted the following patches for this issue:

  1. "quickdatacopy: Added simple username/groupname mapping for the Rsync server", to allow us to explicitly set the USER:GROUP mapping from the quickdatacopy module.
  2. "netmon: Set correct username/groupname mappings for LibreNMS", to explicitly set the files inside the /srv/librenms/rrd folder to belong to librenms:librenms.
  3. "netmon: Set correct username/groupname mappings for Rancid", to explicitly set the files inside the /var/lib/rancid folder to belong to rancid:rancid.

This should allow us to deploy more netmon instances in the future and to have the correct USER:GROUP mappings for the files even if the uid/gid differ between the different netmon instances.

I also opened task T315388 for having fleetwide uid and gid mappings for the netmon instances, but I think it can be considered low priority for now.

Change 823748 merged by Andrea Denisse:

[operations/puppet@production] quickdatacopy: Added simple username/groupname mapping for the Rsync server

https://gerrit.wikimedia.org/r/823748

Change 824284 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] netmon: Set correct username/groupname mappings for LibreNMS

https://gerrit.wikimedia.org/r/824284

Change 824286 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] netmon: Set correct username/groupname mappings for Rancid

https://gerrit.wikimedia.org/r/824286

Change 823759 abandoned by Andrea Denisse:

[operations/puppet@production] netmon: Set correct username/groupname mappings for Rancid

Reason:

Abandoning in favor of a more simple change.

https://gerrit.wikimedia.org/r/823759

Change 823752 abandoned by Andrea Denisse:

[operations/puppet@production] netmon: Set correct username/groupname mappings for LibreNMS

Reason:

Abandoning in favor of a more simple change.

https://gerrit.wikimedia.org/r/823752

Change 824286 merged by Andrea Denisse:

[operations/puppet@production] netmon: Set correct username/groupname mappings for Rancid

https://gerrit.wikimedia.org/r/824286

Change 824284 merged by Andrea Denisse:

[operations/puppet@production] netmon: Set correct username/groupname mappings for LibreNMS

https://gerrit.wikimedia.org/r/824284

Looks like this is resolved...?