
toolforge: prometheus issue is filling up email queue
Closed, ResolvedPublic

Description

Today I checked a Shinken email alert: ** PROBLEM alert - tools-mail/exim queue length is WARNING **.

At tools-mail.eqiad.wmflabs, I saw that the exim queue is rather big, full of bounce emails from prometheus@tools.wmflabs.org:

aborrero@tools-mail:~$ sudo exim -bpc
1128

aborrero@tools-mail:~$ sudo exim -bp
[...]
48h  2.3K 1fNuq2-0002j2-3q <> *** frozen ***
          prometheus@tools.wmflabs.org

48h  2.3K 1fNur1-0002jm-7u <> *** frozen ***
          prometheus@tools.wmflabs.org

48h  2.3K 1fNurx-0002kx-TG <> *** frozen ***
          prometheus@tools.wmflabs.org

48h  2.3K 1fNusw-0002lW-JG <> *** frozen ***
          prometheus@tools.wmflabs.org
[....]

Here is one of the emails:

aborrero@tools-mail:~$ sudo exim -Mvb 1fOdvq-00079D-R9
1fOdvq-00079D-R9-D
This message was created automatically by mail delivery software.

A message that you sent could not be delivered to one or more of its
recipients. This is a permanent error. The following address(es) failed:

  prometheus@tools.wmflabs.org
    Unrouteable address

------ This is a copy of the message, including all the headers. ------

Return-path: <prometheus@tools.wmflabs.org>
Received: from tools-bastion-03.tools.eqiad.wmflabs ([10.68.23.58] ident=Debian-exim)
	by mail.tools.wmflabs.org with esmtp (Exim 4.82)
	(envelope-from <prometheus@tools.wmflabs.org>)
	id 1fOdvq-000797-9C
	for prometheus@tools.wmflabs.org; Fri, 01 Jun 2018 06:53:02 +0000
Received: from prometheus by tools-bastion-03.tools.eqiad.wmflabs with local (Exim 4.82)
	(envelope-from <prometheus@tools.wmflabs.org>)
	id 1fOdvq-0007WV-6v
	for prometheus@tools.wmflabs.org; Fri, 01 Jun 2018 06:53:02 +0000
From: root@tools.wmflabs.org (Cron Daemon)
To: prometheus@tools.wmflabs.org
Subject: Cron <prometheus@tools-bastion-03> /usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom
Content-Type: text/plain; charset=ANSI_X3.4-1968
X-Cron-Env: <SHELL=/bin/sh>
X-Cron-Env: <HOME=/var/lib/prometheus>
X-Cron-Env: <PATH=/usr/bin:/bin>
X-Cron-Env: <LOGNAME=prometheus>
Message-Id: <E1fOdvq-0007WV-6v@tools-bastion-03.tools.eqiad.wmflabs>
Date: Fri, 01 Jun 2018 06:53:02 +0000

Traceback (most recent call last):
  File "/usr/local/bin/prometheus-puppet-agent-stats", line 117, in <module>
    sys.exit(main())
  File "/usr/local/bin/prometheus-puppet-agent-stats", line 111, in main
    write_to_textfile(args.outfile, registry)
  File "/usr/lib/python2.7/dist-packages/prometheus_client/exposition.py", line 115, in write_to_textfile
    with open(tmppath, 'wb') as f:
IOError: [Errno 13] Permission denied: u'/var/lib/prometheus/node.d/puppet_agent.prom.28919.139863252813632'
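For context on the traceback: prometheus_client's write_to_textfile writes to a temporary file in the target directory and then renames it over the destination, so both steps need write permission on the directory itself, which is why a UID mismatch on /var/lib/prometheus/node.d surfaces as Errno 13. A minimal sketch of that pattern (write_textfile here is illustrative, not the library's actual code):

```python
import os
import tempfile

def write_textfile(path, body):
    # Same pattern as prometheus_client's write_to_textfile: write to a
    # temp file next to the destination, then rename over it. Both steps
    # need write permission on the *directory*, so a uid mismatch on the
    # directory raises IOError: [Errno 13] Permission denied.
    tmppath = "%s.%d" % (path, os.getpid())
    with open(tmppath, "wb") as f:
        f.write(body)
    os.rename(tmppath, path)

d = tempfile.mkdtemp()
target = os.path.join(d, "puppet_agent.prom")
write_textfile(target, b"puppet_agent_failed 0\n")
with open(target) as f:
    print(f.read().strip())
```

Run against a writable directory this succeeds and leaves no temp file behind; against a directory owned by the "wrong" prometheus uid, the open() of tmppath is the call that fails, matching the traceback above.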

Also, I'm not sure why this is happening:

From: root@tools.wmflabs.org (Cron Daemon)
To: prometheus@tools.wmflabs.org

Event Timeline

aborrero triaged this task as Medium priority.Jun 1 2018, 7:07 AM

Change 436752 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] toollabs: add /etc/aliases file for tools-mail server

https://gerrit.wikimedia.org/r/436752

This is what I did:

  • I tried adding prometheus: root to the /etc/aliases file so these emails could reach us
  • Then I generated a patch to try to persist that. Reviews welcome.
  • I tried to figure out what's going on with the /usr/local/bin/prometheus-puppet-agent-stats script. It seems perms are right in that directory. I executed it by hand several times and read the source code, but couldn't find anything wrong.
  • I emptied the mail queue with exim -bp | exiqgrep -i | xargs exim -Mrm (it was full of pending prometheus emails, plus 2 or 3 weird things, like spam)

Change 436752 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] toollabs: add /etc/aliases file for tools-mail server

https://gerrit.wikimedia.org/r/436752

@zhuyifei1999 pointed out that we could probably use the mailalias puppet module instead.

Trying a fix for the source of the emails: changing the perms on the output dir to match the LDAP prometheus user. It was owned by the local passwd prometheus user for at least some of it. It may turn out that I need to go the other way?

Overall, the key here is that getent passwd prometheus returns a different UID than the one in /etc/passwd.
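To make the mismatch concrete: pwd.getpwnam() resolves through NSS (so it sees LDAP entries), while reading /etc/passwd directly only sees the local account. A sketch of the comparison (local_uid is a hypothetical helper; the uid values mentioned are from this task):

```python
import pwd

def nss_uid(name):
    # pwd.getpwnam() goes through NSS, so it sees LDAP accounts too
    return pwd.getpwnam(name).pw_uid

def local_uid(name):
    # Read /etc/passwd directly to see only the local account
    with open("/etc/passwd") as f:
        for line in f:
            fields = line.split(":")
            if fields[0] == name:
                return int(fields[2])
    return None

# On an affected host, nss_uid("prometheus") returned the LDAP uid
# (14736) while local_uid("prometheus") returned the packaged local uid.
print(nss_uid("root"), local_uid("root"))
```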

Which is related to T170178#3436626 -- there was an existing uid=prometheus LDAP user before we installed the prometheus packages and it never got cleaned up well.

So I was able to stop it for a bit by changing to the local UID. It appears to be using the LDAP UID again? What a mess.

Trying something goofy. The directory is group-writable. So I set it to be owned by the local user with the LDAP group.

That seems to have done it. Now it can write by UID when it uses the passwd-file version of itself, and by group membership if it starts using the LDAP version.
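The workaround relies on the standard Unix permission check: the owner bit applies if the uid matches, otherwise the group bit if any of the process's groups match, otherwise the "other" bit. A sketch (the numeric ids are hypothetical examples standing in for the local and LDAP identities):

```python
import stat

def can_write(st_mode, st_uid, st_gid, uid, gids):
    # Standard Unix access check for the write bit: owner bit if the
    # uid matches, else group bit if any supplementary group matches,
    # else the "other" bit.
    if uid == st_uid:
        return bool(st_mode & stat.S_IWUSR)
    if st_gid in gids:
        return bool(st_mode & stat.S_IWGRP)
    return bool(st_mode & stat.S_IWOTH)

# drwxrwx--- owned by local-uid:ldap-gid (hypothetical ids 115 / 121):
mode = 0o770
print(can_write(mode, 115, 121, 115, [121]))    # local user: owner bit
print(can_write(mode, 115, 121, 14736, [121]))  # LDAP user: group bit
```

Both identities get write access, which is exactly why owning the directory as local-uid:ldap-gid papered over the flapping.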

So, who is the proper user (ldap vs local)? Perhaps we should persist this into puppet.

T158968: 'prometheus' service user vs. actual human account 'prometheus'

The prometheus LDAP user:

aborrero@tools-bastion-03:~$ ldapsearch -H ldap://ldap-labs.eqiad.wikimedia.org:389 -D cn=proxyagent,ou=profile,dc=wikimedia,dc=org -W uid=prometheus
Enter LDAP Password: 
# extended LDIF
#
# LDAPv3
# base <dc=wikimedia,dc=org> (default) with scope subtree
# filter: uid=prometheus
# requesting: ALL
#

# prometheus, people, wikimedia.org
dn: uid=prometheus,ou=people,dc=wikimedia,dc=org
objectClass: inetOrgPerson
objectClass: posixAccount
uidNumber: 14736
gidNumber: 500
homeDirectory: /var/lib/prometheus
loginShell: /bin/false
uid: prometheus
cn: Prometheus daemon
sn: Prometheus daemon
description: Hack to clean up T170178

# search result
search: 2
result: 0 Success

# numResponses: 2
# numEntries: 1

Since every node that does prometheus stuff has its own local user, I don't see the point in having an LDAP one.
On the other hand, if we delete the LDAP entry, every node will have a different numeric uid/gid, and we should ensure nothing breaks by running a massive chown, which may be impractical. Or is this just a matter of the 3 Toolforge bastions?
If we keep the LDAP user, deleting the local user may be simpler, but it still may require some ownership checking.

I just deleted the prometheus LDAP user, since there is an LDAP account for jberkley:

# jberkley, people, wikimedia.org
dn: uid=jberkley,ou=people,dc=wikimedia,dc=org
sn: Jermaineberkley
cn: Jermaineberkley
objectClass: inetOrgPerson
objectClass: person
objectClass: ldapPublicKey
objectClass: posixAccount
objectClass: shadowAccount
gidNumber: 500
loginShell: /bin/bash
sshPublicKey: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDCFKj8MC6s5Ckw1w0/K+zjvX+/
 dRK9lwgvVpZtpu04CzGLRt7M+V7w8Cd4AYdnpsPrvAWTZQKA1HmBQFPSk1+u7UHRLs3T8lm9TpsyW
 bq+RpkPh9We8dqjVUWbu3z6d/zP0J0GerTtmyxEap4cbvb6zw9zQmEIuwvXlJO/MMdTwaOT/0jT73
 /FyoJX8szkkwKUI8U6c8PJlcdfwXaNcnzxwtPhykfdjsND49L4NE2lQqp2zgHBJEG5S8OWn3xyk8q
 bxo0Gl4agC8tFZI+VQscTGLsnVOv0Mtq2CEhAbwEUOwY6voqgyApgPwFvlzL+27XCZ1kUpmaa5TC/
 mwqnqGCH jermaineberkley8@gmail.com
mail: jermaineberkley8@gmail.com
uid: jberkley
homeDirectory: /home/jberkley
uidNumber: 17517

Now running the chown on the bastions.

aborrero closed this task as Resolved.EditedJun 5 2018, 5:25 PM

I did this on the 3 toolforge bastions (-02, -03, -05):

aborrero@tools-bastion-02:~$ sudo adduser --quiet --system --home /var/lib/prometheus --no-create-home --group --gecos "Prometheus daemon" prometheus
aborrero@tools-bastion-02:~$ sudo chown -R prometheus:prometheus /var/lib/prometheus
aborrero@tools-bastion-02:~$ sudo chown -R prometheus:prometheus /var/log/prometheus

(Well, in -03 the user was already there, so no need for adduser.) This is what the prometheus-node-exporter Debian package does in its postinst script.
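To verify the recursive chown actually took everywhere, something like this sketch can walk the tree and list any leftovers (check_owner is a hypothetical helper, not part of any package):

```python
import grp
import os
import pwd

def check_owner(root, user, group):
    # List every path under root NOT owned by user:group, i.e. the
    # paths a recursive chown (like the one above) would still need
    # to fix.
    uid = pwd.getpwnam(user).pw_uid
    gid = grp.getgrnam(group).gr_gid
    bad = []
    for dirpath, dirnames, filenames in os.walk(root):
        paths = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in paths:
            st = os.stat(path)
            if (st.st_uid, st.st_gid) != (uid, gid):
                bad.append(path)
    return bad
```

Something like `check_owner("/var/lib/prometheus", "prometheus", "prometheus")` returning an empty list would confirm the chown is complete on that host.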

Now we have something like:

aborrero@tools-bastion-02:~$ sudo ls -nlR /var/lib/prometheus/
/var/lib/prometheus/:
total 8
drwxrwx--- 2 115 121 4096 Jun  5 17:13 node.d
drwxr-xr-x 2 115 121 4096 Jun  5 17:04 node-exporter

/var/lib/prometheus/node.d:
total 4
-rw-rw-r-- 1 115 121 784 Jun  5 17:13 puppet_agent.prom

/var/lib/prometheus/node-exporter:
total 4
-rw-r--r-- 1 115 121 920 Jun  2  2017 README.textfile

@bd808 mentioned that other nodes may require a prometheus user and don't have a local one (that being the reason behind the LDAP user). Let's see if we find more issues.

Lots of puppet failures; we need a massive chown:

aborrero@tools-exec-1427:~$ sudo puppet agent -t -v
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for tools-exec-1427.tools.eqiad.wmflabs
Notice: /Stage[main]/Base::Environment/Tidy[/var/tmp/core]: Tidying 0 files
Info: Applying configuration version '1528219632'
Notice: /Stage[main]/Prometheus::Node_puppet_agent/Cron[prometheus_puppet_agent_stats]/ensure: created
crontab:  user `prometheus' unknown
Error: Could not find user prometheus
Error: /Stage[main]/Prometheus::Node_exporter/File[/var/lib/prometheus/node.d]/owner: change from 14736 to prometheus failed: Could not find user prometheus

Mentioned in SAL (#wikimedia-cloud) [2018-06-05T17:39:24Z] <arturo> T196137 clush: delete prometheus user and re-create it locally. Then, chown prometheus dirs

After some clush love, all nodes seem to be using the local prometheus user:

aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo getent passwd prometheus'
tools-docker-registry-02.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-clushmaster-01.tools.eqiad.wmflabs: prometheus:x:114:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-elastic-02.tools.eqiad.wmflabs: prometheus:x:104:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-docker-builder-05.tools.eqiad.wmflabs: prometheus:x:114:119:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-elastic-03.tools.eqiad.wmflabs: prometheus:x:104:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-elastic-01.tools.eqiad.wmflabs: prometheus:x:104:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-docker-registry-01.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-bastion-03.tools.eqiad.wmflabs: prometheus:x:114:119:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-cron-01.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-checker-02.tools.eqiad.wmflabs: prometheus:x:115:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-flannel-etcd-03.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-bastion-02.tools.eqiad.wmflabs: prometheus:x:115:121:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-bastion-05.tools.eqiad.wmflabs: prometheus:x:115:121:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1010.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1011.tools.eqiad.wmflabs: prometheus:x:105:121:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-checker-01.tools.eqiad.wmflabs: prometheus:x:115:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1411.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-flannel-etcd-02.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1402.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1009.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1412.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1403.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1012.tools.eqiad.wmflabs: prometheus:x:105:121:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-grid-shadow.tools.eqiad.wmflabs: prometheus:x:112:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1401.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-k8s-etcd-01.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-logs-02.tools.eqiad.wmflabs: prometheus:x:105:119:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1409.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-k8s-etcd-02.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1013.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1014.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1413.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1015.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1414.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1416.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1415.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-mail.tools.eqiad.wmflabs: prometheus:x:112:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1016.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-k8s-etcd-03.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1410.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1018.tools.eqiad.wmflabs: prometheus:x:105:121:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1417.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-k8s-master-01.tools.eqiad.wmflabs: prometheus:x:104:122:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1017.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-package-builder-01.tools.eqiad.wmflabs: prometheus:x:114:119:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1020.tools.eqiad.wmflabs: prometheus:x:105:121:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1019.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1418.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-paws-worker-1001.tools.eqiad.wmflabs: prometheus:x:108:112:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1420.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-paws-master-01.tools.eqiad.wmflabs: prometheus:x:108:112:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1421.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-paws-worker-1003.tools.eqiad.wmflabs: prometheus:x:108:112:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1419.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1021.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1022.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1423.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1023.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1025.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-paws-worker-1002.tools.eqiad.wmflabs: prometheus:x:108:112:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-grid-master.tools.eqiad.wmflabs: prometheus:x:112:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-paws-worker-1005.tools.eqiad.wmflabs: prometheus:x:108:112:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1422.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-paws-worker-1007.tools.eqiad.wmflabs: prometheus:x:108:112:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1424.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-paws-worker-1010.tools.eqiad.wmflabs: prometheus:x:108:112:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-paws-worker-1006.tools.eqiad.wmflabs: prometheus:x:108:112:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1027.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1425.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1026.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1426.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1028.tools.eqiad.wmflabs: prometheus:x:114:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1427.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1428.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1429.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-proxy-01.tools.eqiad.wmflabs: prometheus:x:104:123:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-paws-worker-1019.tools.eqiad.wmflabs: prometheus:x:108:112:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-paws-worker-1016.tools.eqiad.wmflabs: prometheus:x:108:112:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-prometheus-02.tools.eqiad.wmflabs: prometheus:x:105:53103:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-paws-worker-1013.tools.eqiad.wmflabs: prometheus:x:108:112:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1433.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-prometheus-01.tools.eqiad.wmflabs: prometheus:x:105:53063:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-proxy-02.tools.eqiad.wmflabs: prometheus:x:104:122:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1431.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-paws-worker-1017.tools.eqiad.wmflabs: prometheus:x:108:112:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-redis-1001.tools.eqiad.wmflabs: prometheus:x:104:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1432.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1430.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1436.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1437.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1440.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-static-12.tools.eqiad.wmflabs: prometheus:x:108:112:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-redis-1002.tools.eqiad.wmflabs: prometheus:x:104:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1435.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-services-02.tools.eqiad.wmflabs: prometheus:x:112:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1438.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1439.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-services-01.tools.eqiad.wmflabs: prometheus:x:112:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-static-13.tools.eqiad.wmflabs: prometheus:x:108:112:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1442.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-flannel-etcd-01.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1441.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1406.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-gift-trusty-01.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1405.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-generic-1401.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1404.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-generic-1404.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1404.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1406.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-generic-1403.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-generic-1402.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1410.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1402.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1407.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1408.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1405.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1403.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-exec-1434.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1408.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1409.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1420.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1415.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1416.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1422.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1418.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1412.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1426.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1407.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1425.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1001.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1427.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs: prometheus:x:113:116:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-webgrid-lighttpd-1413.tools.eqiad.wmflabs: prometheus:x:114:117:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1006.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1002.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1008.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1003.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1005.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1007.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false
tools-worker-1004.tools.eqiad.wmflabs: prometheus:x:105:120:Prometheus daemon,,,:/var/lib/prometheus:/bin/false

We have this exact same problem in all the other Cloud VPS projects.

Mentioned in SAL (#wikimedia-cloud) [2018-06-06T09:49:36Z] <arturo> T196137 aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo service prometheus-node-exporter restart' <-- procs using the old uid

Mentioned in SAL (#wikimedia-cloud) [2018-06-07T11:01:27Z] <arturo> T196137 force rotate all exim paniclog files to avoid rootspam aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo logrotate /etc/logrotate.d/exim4-paniclog -f -v'

Mentioned in SAL (#wikimedia-cloud) [2018-06-08T07:46:08Z] <arturo> T196137 more rootspam today, restarting again prometheus-node-exporter and force rotating exim4 paniclog in 12 nodes

I still see some cronspam today, from:

toolsbeta-grid-master.toolsbeta.eqiad.wmflabs
toolsbeta-bastion-01.toolsbeta.eqiad.wmflabs
toolsbeta-exec-1401.toolsbeta.eqiad.wmflabs
toolsbeta-webgrid-lighttpd-1401.toolsbeta.eqiad.wmflabs
toolsbeta-webgrid-generic-1401.toolsbeta.eqiad.wmflabs
tools-exec-1401.tools.eqiad.wmflabs
tools-exec-1407.tools.eqiad.wmflabs
tools-exec-1408.tools.eqiad.wmflabs
tools-exec-1430.tools.eqiad.wmflabs
tools-exec-1431.tools.eqiad.wmflabs
tools-exec-1432.tools.eqiad.wmflabs
tools-exec-1435.tools.eqiad.wmflabs
tools-exec-1438.tools.eqiad.wmflabs
tools-exec-1439.tools.eqiad.wmflabs
tools-exec-1441.tools.eqiad.wmflabs
tools-webgrid-lighttpd-1402.tools.eqiad.wmflabs
tools-webgrid-lighttpd-1407.tools.eqiad.wmflabs

Not sure why.

Mentioned in SAL (#wikimedia-cloud) [2018-06-11T10:11:02Z] <arturo> T196137 aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo wc -l /var/log/exim4/paniclog 2>/dev/null | grep -v ^0 && sudo rm -rf /var/log/exim4/paniclog && sudo service prometheus-node-exporter restart || true'

More rootspam again today about the exim4 paniclog, for the same reasons.

@Andrew any idea? I'm running out of them.

The current rootspam is because the crontab that runs this is owned by the LDAP user. I'm setting up to chown it to the local user from clush on the hosts that alerted.

I ran this command: sudo chown prometheus:crontab /var/spool/cron/crontabs/prometheus via clush on the list of servers I know were affected. It is quite likely that newer hosts which had the crontab put in place since the prometheus LDAP user was created have the wrong UID on the crontab file as well. The command may need to be run for a wider set of hosts.
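A sketch of how one could find crontab files with a stale owner across the spool (stale_crontabs is a hypothetical helper; the spool path is Debian's default):

```python
import os
import pwd

def stale_crontabs(spool="/var/spool/cron/crontabs"):
    # A crontab file is "stale" if its owner uid no longer matches the
    # uid of the account with the same name -- e.g. a file that kept
    # the old LDAP uid 14736 after the user was recreated locally.
    stale = []
    for name in os.listdir(spool):
        st = os.stat(os.path.join(spool, name))
        try:
            expected = pwd.getpwnam(name).pw_uid
        except KeyError:
            expected = None  # no such account at all anymore
        if st.st_uid != expected:
            stale.append((name, st.st_uid, expected))
    return stale
```

Running this via clush on every host (not just the ones that alerted) would catch the wider set of affected crontabs mentioned above.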

FYI, I also could not get into toolsbeta-grid-master.toolsbeta.eqiad.wmflabs, which does need the chown treatment.

@Chicocvenancio ran the command on toolsbeta-grid-master. It still may be needed in other projects.

I got more spam today, from these servers:

tools-webgrid-lighttpd-1402.tools.eqiad.wmflabs
tools-exec-1401.tools.eqiad.wmflabs
tools-exec-1439.tools.eqiad.wmflabs
tools-exec-1438.tools.eqiad.wmflabs
tools-webgrid-lighttpd-1407.tools.eqiad.wmflabs
tools-exec-1408.tools.eqiad.wmflabs
tools-exec-1432.tools.eqiad.wmflabs
tools-exec-1431.tools.eqiad.wmflabs
tools-exec-1441.tools.eqiad.wmflabs
tools-exec-1430.tools.eqiad.wmflabs
tools-exec-1407.tools.eqiad.wmflabs

The content of the email is the same:

exim paniclog /var/log/exim4/paniclog on tools-exec-1407.tools.eqiad.wmflabs has non-zero size, mail system might be broken. The last 10 lines are quoted below.

2018-06-13 06:35:02 Failed to get user name for uid 14736
2018-06-13 06:36:01 Failed to get user name for uid 14736
2018-06-13 06:37:02 Failed to get user name for uid 14736
2018-06-13 06:38:01 Failed to get user name for uid 14736
2018-06-13 06:39:02 Failed to get user name for uid 14736
2018-06-13 06:40:01 Failed to get user name for uid 14736
2018-06-13 06:41:02 Failed to get user name for uid 14736
2018-06-13 06:42:02 Failed to get user name for uid 14736
2018-06-13 06:43:02 Failed to get user name for uid 14736
2018-06-13 06:44:01 Failed to get user name for uid 14736

I checked some of them, and the owner of /var/spool/cron/crontabs/prometheus looks fine:

aborrero@tools-webgrid-lighttpd-1402:~$ sudo ls -l /var/spool/cron/crontabs/prometheus
-rw------- 1 prometheus crontab 644 Jul  5  2017 /var/spool/cron/crontabs/prometheus

aborrero@tools-webgrid-lighttpd-1402:~$ sudo ls -ln /var/spool/cron/crontabs/prometheus
-rw------- 1 114 103 644 Jul  5  2017 /var/spool/cron/crontabs/prometheus

I would try rebooting the affected machines to restart every single process and wipe any user/email caching in memory.

So the problem (at this point... the crontab file was probably still a problem) is that exim4 is trying to find that uid for some process. I don't see anything in the queue, though, and restarting exim actually causes it to send more of the spam. I think reboots may just re-send the spam, since no process is running with that uid, unless there is some weird LDAP cache interaction with exim??

Vvjjkkii renamed this task from toolforge: prometheus issue is filling up email queue to 2ubaaaaaaa.Jul 1 2018, 1:06 AM
Vvjjkkii removed aborrero as the assignee of this task.
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed subscribers: gerritbot, Aklapper.
CommunityTechBot renamed this task from 2ubaaaaaaa to toolforge: prometheus issue is filling up email queue.Jul 2 2018, 1:47 PM
CommunityTechBot assigned this task to aborrero.
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added subscribers: gerritbot, Aklapper.

We had more spam today, and I just realized we never really stopped receiving it.

Subject: exim paniclog on tools-exec-1407.tools.eqiad.wmflabs has non-zero size
To: root@tools.wmflabs.org
From: root <root@tools.wmflabs.org>
Date: Wed, 04 Jul 2018 06:53:58 +0000

exim paniclog /var/log/exim4/paniclog on tools-exec-1407.tools.eqiad.wmflabs has non-zero size, mail system might be broken. The last 10 lines are quoted below.

2018-07-04 06:44:02 Failed to get user name for uid 14736
2018-07-04 06:45:01 Failed to get user name for uid 14736
2018-07-04 06:46:02 Failed to get user name for uid 14736
2018-07-04 06:47:01 Failed to get user name for uid 14736
2018-07-04 06:48:01 Failed to get user name for uid 14736
2018-07-04 06:49:02 Failed to get user name for uid 14736
2018-07-04 06:50:01 Failed to get user name for uid 14736
2018-07-04 06:51:02 Failed to get user name for uid 14736
2018-07-04 06:52:01 Failed to get user name for uid 14736
2018-07-04 06:53:02 Failed to get user name for uid 14736

hosts:

tools-exec-1407.tools.eqiad.wmflabs
tools-exec-1432.tools.eqiad.wmflabs
tools-exec-1408.tools.eqiad.wmflabs
tools-exec-1435.tools.eqiad.wmflabs
tools-webgrid-lighttpd-1402.tools.eqiad.wmflabs
tools-exec-1431.tools.eqiad.wmflabs
tools-exec-1439.tools.eqiad.wmflabs
tools-exec-1441.tools.eqiad.wmflabs
tools-exec-1401.tools.eqiad.wmflabs
tools-exec-1438.tools.eqiad.wmflabs
tools-webgrid-lighttpd-1407.tools.eqiad.wmflabs
tools-exec-1430.tools.eqiad.wmflabs

I just ran 'truncate -s0 /var/log/exim4/paniclog' on all the affected hosts on the off chance that this is just leftover logging from before the issue was fixed. Probably it'll just cause the alerts to skip a day :(

We got more alerts today :-( I'm tempted to just reboot the hosts if that makes any sense.

It seems the issue might be caused by cron having uid=14736 cached somewhere (or is nslcd doing that? getent passwd prometheus returns the correct local user, though).
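
One way to test the stale-cache theory directly: compare the full NSS answer against /etc/passwd alone for the same user. This sketch is my addition; a mismatch would point at nscd/nslcd serving a stale entry while the local file is already correct:

```shell
#!/bin/sh
# Compare the uid for a user as reported by the full NSS stack
# (getent, which goes through nscd/nslcd if configured) against
# /etc/passwd read directly. Defaults to the prometheus user.
user="${1:-prometheus}"
nss_uid=$(getent passwd "$user" | cut -d: -f3)
file_uid=$(awk -F: -v u="$user" '$1 == u {print $3}' /etc/passwd)
if [ "$nss_uid" = "$file_uid" ]; then
    echo "consistent: both sources report uid '$nss_uid' for $user"
else
    echo "mismatch: NSS says '$nss_uid', /etc/passwd says '$file_uid' for $user"
fi
```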

I just restarted cron on tools-exec-1432 and there are no more errors when running the prometheus crontab:

Sep  7 02:10:01 tools-exec-1432 CRON[7722]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Sep  7 02:10:01 tools-exec-1432 CRON[7721]: (prometheus) MAIL (mailed 1 byte of output; but got status 0x0001, #012)
Sep  7 02:11:01 tools-exec-1432 CRON[7840]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Sep  7 02:11:01 tools-exec-1432 CRON[7839]: (prometheus) MAIL (mailed 1 byte of output; but got status 0x0001, #012)

Sep  7 02:11:19 tools-exec-1432 kernel: [10237129.532361] init: cron main process (1254) killed by TERM signal
Sep  7 02:11:19 tools-exec-1432 cron[7867]: (CRON) INFO (pidfile fd = 3)
Sep  7 02:11:19 tools-exec-1432 cron[7868]: (CRON) STARTUP (fork ok)
Sep  7 02:11:19 tools-exec-1432 cron[7868]: (CRON) INFO (Skipping @reboot jobs -- not system startup)

Sep  7 02:12:01 tools-exec-1432 CRON[7892]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Sep  7 02:13:01 tools-exec-1432 CRON[7931]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Sep  7 02:14:01 tools-exec-1432 CRON[7951]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)

That's the only host where I've restarted cron. I'll defer to @Bstorm and @aborrero on how/whether this should be done on the other nodes.

The crontab itself has been fine for months now; there used to be more direct errors from it. The errors from exim, however, keep coming in day after day, in bursts that start and stop on their own schedule. Every time we thought we had stopped them, they were back the next day (so far). We've even rebooted most of these hosts since then, IIRC, due to a variety of issues around NFS. The description of this ticket may need updating to reflect the comments and the changing nature of this task, but maybe not. It's a weird one!

Wait. I wonder if all of these relay email through the same host (I think they do).
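
One way to check the relay theory (a sketch of mine, not something run in the original thread): exim's address-test mode shows the routing decision for an address on the host where it is run, so comparing the output on an exec node with the output on the relay (tools-mail) should show where "Unrouteable address" is being decided:

```shell
#!/bin/sh
# exim -bt (address-test mode) prints which router/transport would
# handle the given address on this host. Guarded so the sketch is
# harmless on machines without exim installed.
if command -v exim >/dev/null 2>&1; then
    exim -bt prometheus@tools.wmflabs.org
else
    echo "exim not installed on this host"
fi
```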

@Bstorm explained to me the context I was missing about what was tried before, thanks! I was still a bit stubborn and decided to give restarting cron and truncating the paniclog file one last try to satisfy my curiosity :)

Just executed sudo service cron restart && sudo truncate -s 0 /var/log/exim4/paniclog on all tools-exec-* servers. We should see the results in tomorrow's emails (or lack thereof).
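
A fleet-wide run like that can be sketched as a simple loop from a bastion (hypothetical; a tool like clush or cumin would do the same job). Printed here as a dry run so the commands can be reviewed before piping the output to sh:

```shell
#!/bin/sh
# Dry run: emit one ssh command per node instead of executing it.
# The host list is a short illustrative subset of the tools-exec-14xx
# fleet named in the alerts above; adjust to the actual node list.
for h in tools-exec-1407 tools-exec-1432 tools-webgrid-lighttpd-1402; do
    printf 'ssh %s.tools.eqiad.wmflabs "sudo service cron restart && sudo truncate -s 0 /var/log/exim4/paniclog"\n' "$h"
done
```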

Mentioned in SAL (#wikimedia-operations) [2018-09-08T09:45:08Z] <gtirloni> tools restarted cron and truncated /var/log/exim4/paniclog (T196137)

Mentioned in SAL (#wikimedia-cloud) [2018-09-08T10:35:08Z] <gtirloni> restarted cron and truncated /var/log/exim4/paniclog (T196137)

The first Stashbot message was just me recording what I did on Friday and testing whether the bot would accept my request. The second one was me testing why a request from paladox wasn't logged.

I noticed there were no more emails from tools-exec-* hosts today, so that was good.

No more emails in the past few days. Closing.