
Email flood due to some email issue and a full disk on tools prometheus
Closed, Resolved (Public)

Description

I'm getting spammed pretty badly by Mailer-Daemon@tools.wmflabs.org. The messages seem to be from prometheus@tools.wmflabs.org, so this is probably just tools admins getting the flood. A quick search didn't turn up a duplicate ticket, so here's one. If I can find some time, I'll see if I can find out what's wrong, but my inbox is hurting. Please fix this if you can. Thank you!!

The ultimate problem seems to be OSError: [Errno 28] No space left on device on tools-prometheus-05 (and perhaps the other in the pair), but the email alias/list or something like that clearly has some kind of issue.

An example message is below.

This message was created automatically by mail delivery software.

A message that you sent could not be delivered to one or more of its
recipients. This is a permanent error. The following address(es) failed:

  root@wmcloud.org
    (ultimately generated from prometheus@tools.wmflabs.org)
    all hosts for 'wmcloud.org' have been failing for a long time (and retry time not reached)
----------------------------------------------
message/delivery-status
----------------------------------------------
Reporting-MTA: dns; mail.tools.wmflabs.org

Action: failed
Final-Recipient: rfc822;root@wmcloud.org
Status: 5.0.0
----------------------------------------------
message/rfc822
----------------------------------------------
Return-path: <prometheus@tools.wmflabs.org>
Received: from tools-prometheus-05.tools.eqiad1.wikimedia.cloud ([172.16.0.103])
	by mail.tools.wmflabs.org with esmtps (TLS1.3:ECDHE_RSA_AES_256_GCM_SHA384:256)
	(Exim 4.92)
	(envelope-from <prometheus@tools.wmflabs.org>)
	id 1n4blr-0005C8-S6
	for prometheus@tools.wmflabs.org; Tue, 04 Jan 2022 04:50:04 +0000
Received: from prometheus by tools-prometheus-05.tools.eqiad1.wikimedia.cloud with local (Exim 4.92)
	(envelope-from <prometheus@tools.wmflabs.org>)
	id 1n4blr-00071b-Pi
	for prometheus@tools.wmflabs.org; Tue, 04 Jan 2022 04:50:03 +0000
From: root@tools.wmflabs.org (Cron Daemon)
To: prometheus@tools.wmflabs.org
Subject: Cron <prometheus@tools-prometheus-05> /usr/local/bin/prometheus-labs-targets --port 9051 --prefix tools-flannel-etcd- > /srv/prometheus/tools/targets/etcd_flannel.$$ && mv /srv/prometheus/tools/targets/etcd_flannel.$$ /srv/prometheus/tools/targets/etcd_flannel.yml
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Cron-Env: <SHELL=/bin/sh>
X-Cron-Env: <HOME=/var/lib/prometheus>
X-Cron-Env: <PATH=/usr/bin:/bin>
X-Cron-Env: <LOGNAME=prometheus>
Message-Id: <E1n4blr-00071b-Pi@tools-prometheus-05.tools.eqiad1.wikimedia.cloud>
Date: Tue, 04 Jan 2022 04:50:03 +0000

Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
OSError: [Errno 28] No space left on device
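
For context, the "Exception ignored in ... <stdout>" line is most likely Python failing to flush its buffered stdout at exit, because the cron command redirects stdout to a file on the full /srv. Below is a minimal sketch of the same write-then-rename pattern done inside Python, so a full disk is reported once and the existing targets file is left untouched. The function name and the return-False-on-ENOSPC behaviour are illustrative assumptions, not how prometheus-labs-targets actually works.

```python
import errno
import os
import tempfile


def write_targets_atomically(path: str, content: str) -> bool:
    """Write content to a temp file next to path, then rename it into place.

    Returns False, leaving any existing file untouched, if the disk is full.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".targets-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Atomic on POSIX: readers see either the old file or the new one.
        os.replace(tmp_path, path)
        return True
    except OSError as exc:
        # Clean up the partial temp file if it is still around.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        if exc.errno == errno.ENOSPC:
            return False
        raise
```

A caller could log a single warning when this returns False instead of letting cron mail every failed run.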

Event Timeline

Bstorm renamed this task from Email flood due to a naming issue and a full disk on tools prometheus to Email flood due to some email issue and a full disk on tools prometheus. Jan 4 2022, 4:57 AM
Bstorm edited projects, added Toolforge; removed Cloud-Services.

Mentioned in SAL (#wikimedia-cloud) [2022-01-04T08:12:17Z] <taavi> disable puppet & exim4 on T298501

I stopped exim on the host until someone with more time can look into this.

prometheus should limit its storage to 250G

root@tools-prometheus-05:/srv/prometheus/tools# systemctl status prometheus@tools
● prometheus@tools.service - Prometheus server (instance tools)
   Loaded: loaded (/lib/systemd/system/prometheus@tools.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2022-01-04 08:11:20 UTC; 2min 15s ago
  Process: 3441 ExecStart=/usr/bin/prometheus --storage.tsdb.path /srv/prometheus/tools/metrics --web.listen-address 127.0.0.1:9902 --web.external-url https://tools-prometheus.wmflabs.org/tools --storage.tsdb.retention.time 1000d --storage.tsdb.retention.size 250GB --config.file /srv/prometheus/tools/prometheus.yml [...]
 Main PID: 3441 (code=exited, status=1/FAILURE)

but it's not

root@tools-prometheus-05:/srv/prometheus/tools/metrics# du -sh .
262G	.
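
If someone wants to script this comparison, here is a rough sketch. The function names and the byte-limit check are illustrative only and not part of any tooling on the host. It sums apparent file sizes, so the result can differ somewhat from what du reports in allocated blocks.

```python
import os


def directory_size_bytes(root: str) -> int:
    """Sum the apparent sizes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            # Skip symlinks so nothing is counted twice or followed off-disk.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def over_limit(root: str, limit_bytes: int) -> bool:
    """Return True when the directory holds more than limit_bytes."""
    return directory_size_bytes(root) > limit_bytes
```

For example, `over_limit("/srv/prometheus/tools/metrics", 250 * 1024**3)` would check the metrics directory against 250 GiB, assuming the configured limit is meant in powers of two.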
dcaro changed the task status from Open to In Progress. Jan 4 2022, 8:57 AM
dcaro claimed this task.
dcaro added a project: User-dcaro.
dcaro moved this task from To refine to Today on the User-dcaro board.
dcaro moved this task from Today to Doing on the User-dcaro board.

I manually removed some old data (around 2.5G, from /srv/prometheus/tools/metrics) and started the prometheus service again. That allowed it to run the compaction properly and clean up some space. Maybe we should decrease the retention so it has more space for temporary operations.

I changed the retention to 220G and ran puppet, which cleaned up some more space, hopefully enough:

/dev/mapper/vd-second--local--disk                                 276G  223G   40G  85% /srv
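
As rough arithmetic on the numbers above, this sketch shows how much of the volume is left for compaction and other temporary files under each retention limit. It assumes df's 276G filesystem size and the retention limit are in the same unit, and it ignores filesystem reserved blocks; the helper name is made up for illustration.

```python
def headroom_gib(filesystem_gib: float, retention_limit_gib: float) -> float:
    """Space left on the volume once the retained data reaches its limit."""
    return filesystem_gib - retention_limit_gib


# 276G is the filesystem size reported by df for /srv.
for limit in (250, 220):
    print(f"retention {limit}G -> {headroom_gib(276, limit)}G of headroom")
```

With the old 250G limit that leaves about 26G of slack; with 220G it leaves about 56G.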

Thanks @dcaro! My inbox is restored to proper function. I guess the rest of the mystery is the email bouncing.

It's back. No space left on device. I didn't get any time to look myself today. TF must have grown a bit. Maybe the disk needs to be bigger?

I did not get any emails, and the host does not seem to be out of disk:

/dev/mapper/vd-second--local--disk                                 276G  225G   38G  86% /srv

I also don't see any messages in the logs. Can you forward me one of the emails?

So I think the email bounces are happening because the srv-networktests account (added to tools.admin in T294955) has root@wmcloud.org as its email address and that address doesn't route anywhere. We could probably fix that by replacing the tools.admin and cloudinfra memberships with some security::access::config definitions in Puppet to let it log in to only the hosts it actually needs.

The disk issue has not recurred for a few days. Please open a new task if anything else arises.