Permission error affecting nrpe2nodexp-ferm_active
Open, Needs Triage, Public

Description

As discussed on IRC, while investigating T403615 I spotted the following error on es* hosts that are not alerting (some of which are running in production).

Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]: --- Logging error ---
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]: Traceback (most recent call last):
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]:   File "/usr/lib/python3.11/logging/__init__.py", line 1110, in emit
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]:     msg = self.format(record)
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]:           ^^^^^^^^^^^^^^^^^^^
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]:   File "/usr/lib/python3.11/logging/__init__.py", line 953, in format
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]:     return fmt.format(record)
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]:            ^^^^^^^^^^^^^^^^^^
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]:   File "/usr/local/bin/nrpe2nodexp", line 169, in format
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]:     kind, outcome, etype = self.statuscode_to_kind_outcome_type(record.returncode)
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]:   File "/usr/local/bin/nrpe2nodexp", line 140, in statuscode_to_kind_outcome_type
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]:     etype = "change" if self.detect_status_change(returncode) else "info"
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]:                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]:   File "/usr/local/bin/nrpe2nodexp", line 96, in detect_status_change
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]:     if not (p.exists() and p.is_file() and os.access(p, os.R_OK)):
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]:             ^^^^^^^^^^
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]:   File "/usr/lib/python3.11/pathlib.py", line 1236, in exists
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]:     self.stat()
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]:   File "/usr/lib/python3.11/pathlib.py", line 1014, in stat
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]:     return os.stat(self, follow_symlinks=follow_symlinks)
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sep 01 10:30:50 es2026 nrpe2nodexp-ferm_active[2424160]: PermissionError: [Errno 13] Permission denied: '/var/lib/prometheus/node.d/check_ferm_active.prom'

See for example ssh es2026.codfw.wmnet sudo journalctl -u nrpe2nodexp-ferm_active
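
The traceback shows Path.exists() raising PermissionError from inside the log formatter, since pathlib only swallows "not found" style errors and lets EACCES propagate. As a minimal sketch of one way such a check could be hardened (readable_file is a hypothetical helper name, not taken from nrpe2nodexp, and this only hides the traceback, it does not fix the directory permissions):

```python
import os
from pathlib import Path


def readable_file(path: Path) -> bool:
    """Return True only if path is an existing regular file we can read.

    Hypothetical helper for illustration. Any OSError raised while probing
    (for example PermissionError when a parent directory is not searchable)
    is treated as "not readable" instead of propagating to the caller.
    """
    try:
        # is_file() already returns False for missing paths, but it can
        # still raise PermissionError, just like exists() in the traceback.
        return path.is_file() and os.access(path, os.R_OK)
    except OSError:
        return False
```

With a guard like this the caller would fall back to its "no previous state" path rather than failing during log formatting.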

Event Timeline

Permissions on /var/lib/prometheus are too strict on nodes that include prometheus::mysqld_exporter:

root@cumin1002:~# cumin A:all "ls -ld --time-style=+ /var/lib/prometheus"
===== NODE GROUP =====
(266) an-mariadb[1001-1002].eqiad.wmnet,an-test-coord1001.eqiad.wmnet,cloudcontrol[2005-2006,2010]-dev.codfw.wmnet,cloudcontrol[1006-1007,1011].eqiad.wmnet,db[2142-2159,2161-2196,2202-2238,2240-2244].codfw.wmnet,db[1151-1153,1156-1170,1172-1182,1184-1207,1209-1215,1218-1224,1226-1238,1241-1244,1247-1259].eqiad.wmnet,es[2026-2049].codfw.wmnet,es[1026-1048].eqiad.wmnet,matomo1003.eqiad.wmnet,pc[2011-2018].codfw.wmnet,pc[1011-1018].eqiad.wmnet
----- OUTPUT of 'ls -ld --time-st...r/lib/prometheus' -----
dr-xr-x--- 4 prometheus prometheus 4096  /var/lib/prometheus

On other nodes:

===== NODE GROUP =====
(2078) acmechief2002.codfw.wmnet,...
----- OUTPUT of 'ls -ld --time-st...r/lib/prometheus' -----
drwxr-xr-x 4 prometheus prometheus 4096  /var/lib/prometheus

The 0550 mode is set by this Puppet resource on nodes that include prometheus::mysqld_exporter:

file { '/var/lib/prometheus':
    ensure  => directory,
    mode    => '0550',
    require => Package['prometheus-mysqld-exporter'],
    owner   => 'prometheus',
    group   => 'prometheus',
}
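
To illustrate why 0550 breaks the check while 0755 does not, here is a small sketch that decodes both modes. It assumes nrpe2nodexp runs as a user that is neither prometheus nor a member of the prometheus group (the task does not state this); others_can_traverse is a hypothetical helper name.

```python
import stat


def others_can_traverse(mode: int) -> bool:
    """True if users outside the owner and group have search (x) permission."""
    return bool(mode & stat.S_IXOTH)


for mode in (0o550, 0o755):
    # stat.filemode renders the same string that `ls -ld` shows for a directory.
    rendered = stat.filemode(stat.S_IFDIR | mode)
    print(f"{mode:04o} {rendered} others_can_traverse={others_can_traverse(mode)}")
```

Without the search bit for "other", stat() on any file below the directory fails with EACCES for such a user, which matches the PermissionError in the description.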

Still checking how and where the directory /var/lib/prometheus is created on nodes that don’t include the mysqld_exporter.

Change #1184544 had a related patch set uploaded (by Federico Ceratto; author: Federico Ceratto):

[operations/puppet@production] mysqld_exporter.pp: fix /var/log/prometheus perms

https://gerrit.wikimedia.org/r/1184544

The /var/lib/prometheus directory is created with 0755 permissions by the prometheus-node-exporter Debian package, along with the prometheus user that owns it.

The related PR is ready. Quoting from https://gerrit.wikimedia.org/r/c/operations/puppet/+/1184544:

If/when we start handling sensitive data we can later on discuss implement a dedicated directory, exporter, etc without reusing the same directories

Change #1184544 merged by Federico Ceratto:

[operations/puppet@production] mysqld_exporter.pp: make /var/log/prometheus 0775

https://gerrit.wikimedia.org/r/1184544