
webperf2001 is running out of disk space
Closed, Resolved, Public

Description

22:11 <+icinga-wm> PROBLEM - Disk space on webperf2001 is CRITICAL: DISK CRITICAL - free space: / 1556 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
22:28 <+icinga-wm> PROBLEM - Check systemd state on webperf2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.

times in UTC

17G	/var/log/messages
12G	/var/log/syslog
17G	/var/log/user.log

Event Timeline

jijiki created this task. Apr 20 2019, 10:34 PM
Restricted Application added subscribers: Gilles, Aklapper. Apr 20 2019, 10:34 PM
jijiki triaged this task as High priority. Apr 20 2019, 10:34 PM
jijiki added a subscriber: Krinkle.
jijiki updated the task description. Apr 20 2019, 10:42 PM
Peachey88 updated the task description. Apr 20 2019, 10:46 PM

Mentioned in SAL (#wikimedia-operations) [2019-04-21T05:19:39Z] <marostegui> Clean up some space on webperf2001 - T221508

The host's root filesystem was completely full:

root@webperf2001:/var/log# df -hT
Filesystem     Type      Size  Used Avail Use% Mounted on
udev           devtmpfs  3.9G     0  3.9G   0% /dev
tmpfs          tmpfs     799M   81M  719M  11% /run
/dev/vda1      ext4       49G   49G     0 100% /
tmpfs          tmpfs     4.0G   12K  4.0G   1% /dev/shm
tmpfs          tmpfs     5.0M     0  5.0M   0% /run/lock
tmpfs          tmpfs     4.0G     0  4.0G   0% /sys/fs/cgroup
tmpfs          tmpfs     799M     0  799M   0% /run/user/15343

So messages was just full of:

Apr 20 23:41:41 webperf2001 python[14715]: Exception AttributeError: "'KafkaClient' object has no attribute '_closed'" in <bound method KafkaClient.__del__ of <kafka.client_async.KafkaClient object at 0x7fc38a8ab450>> ignored
Apr 20 23:41:41 webperf2001 python[14715]: 2019-04-20 23:41:41,716 [ERROR] (run:599) Error in main loop, restarting consumer
Apr 20 23:41:41 webperf2001 python[14715]: Traceback (most recent call last):
Apr 20 23:41:41 webperf2001 python[14715]:   File "/srv/deployment/performance/coal-cache/revs/8766469862a342d1d07f96666e2f7bd0862d8281/coal/__init__.py", line 530, in run
Apr 20 23:41:41 webperf2001 python[14715]:     enable_auto_commit=False)
Apr 20 23:41:41 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/consumer/group.py", line 340, in __init__
Apr 20 23:41:41 webperf2001 python[14715]:     self._client = KafkaClient(metrics=self._metrics, **self.config)
Apr 20 23:41:41 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/client_async.py", line 188, in __init__
Apr 20 23:41:41 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/vendor/selectors34.py", line 424, in __init__
Apr 20 23:41:41 webperf2001 python[14715]: IOError: [Errno 24] Too many open files
Apr 20 23:41:41 webperf2001 python[14715]: 2019-04-20 23:41:41,716 [INFO] (run:526) Starting Kafka connection to brokers (kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092,kafka-jumbo1003.eqiad.wmnet:9092,kafka-jumbo1004.eqiad.wmnet:9092,kafka-jumbo1005.eqiad.wmnet:9092,kafka-jumbo1006.eqiad.wmnet:9092).
Apr 20 23:41:41 webperf2001 python[14715]: Exception AttributeError: "'KafkaClient' object has no attribute '_closed'" in <bound method KafkaClient.__del__ of <kafka.client_async.KafkaClient object at 0x7fc38a9eae90>> ignored
Apr 20 23:41:41 webperf2001 python[14715]: 2019-04-20 23:41:41,716 [ERROR] (run:599) Error in main loop, restarting consumer
Apr 20 23:41:41 webperf2001 python[14715]: Traceback (most recent call last):
Apr 20 23:41:41 webperf2001 python[14715]:   File "/srv/deployment/performance/coal-cache/revs/8766469862a342d1d07f96666e2f7bd0862d8281/coal/__init__.py", line 530, in run
Apr 20 23:41:41 webperf2001 python[14715]:     enable_auto_commit=False)
Apr 20 23:41:41 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/consumer/group.py", line 340, in __init__
Apr 20 23:41:41 webperf2001 python[14715]:     self._client = KafkaClient(metrics=self._metrics, **self.config)
Apr 20 23:41:41 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/client_async.py", line 188, in __init__
Apr 20 23:41:41 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/vendor/selectors34.py", line 424, in __init__
Apr 20 23:41:41 webperf2001 python[14715]: IOError: [Errno 24] Too many open files
Apr 20 23:41:41 webperf2001 python[14715]: 2019-04-20 23:41:41,717 [INFO] (run:526) Starting Kafka connection to brokers (kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092,kafka-jumbo1003.eqiad.wmnet:9092,kafka-jumbo1004.eqiad.wmnet:9092,kafka-jumbo1005.eqiad.wmnet:9092,kafka-jumbo1006.eqiad.wmnet:9092).
Apr 20 23:41:41 webperf2001 python[14715]: Exception AttributeError: "'KafkaClient' object has no attribute '_closed'" in <bound method KafkaClient.__del__ of <kafka.client_async.KafkaClient object at 0x7fc38a73fe90>> ignored
Apr 20 23:41:41 webperf2001 python[14715]: 2019-04-20 23:41:41,717 [ERROR] (run:599) Error in main loop, restarting consumer
Apr 20 23:41:41 webperf2001 python[14715]: Traceback (most recent call last):
Apr 20 23:41:41 webperf2001 python[14715]:   File "/srv/deployment/performance/coal-cache/revs/8766469862a342d1d07f96666e2f7bd0862d8281/coal/__init__.py", line 530, in run
Apr 20 23:41:41 webperf2001 python[14715]:     enable_auto_commit=False)
Apr 20 23:41:41 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/consumer/group.py", line 340, in __init__
Apr 20 23:41:41 webperf2001 python[14715]:     self._client = KafkaClient(metrics=self._metrics, **self.config)
Apr 20 23:41:41 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/client_async.py", line 188, in __init__
Apr 20 23:41:41 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/vendor/selectors34.py", line 424, in __init__
Apr 20 23:41:41 webperf2001 python[14715]: IOError: [Errno 24] Too many open files
Apr 20 23:41:41 webperf2001 python[14715]: 2019-04-20 23:41:41,717 [INFO] (run:526) Starting Kafka connection to brokers (kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092,kafka-jumbo1003.eqiad.wmnet:9092,kafka-jumbo1004.eqiad.wmnet:9092,kafka-jumbo1005.eqiad.wmnet:9092,kafka-jumbo1006.eqiad.wmnet:9092).
Apr 20 23:41:41 webperf2001 python[14715]: Exception AttributeError: "'KafkaClient' object has no attribute '_closed'" in <bound method KafkaClient.__del__ of <kafka.client_async.KafkaClient object at 0x7fc38aa6ff10>> ignored
Apr 20 23:41:41 webperf2001 python[14715]: 2019-04-20 23:41:41,717 [ERROR] (run:599) Error in main loop, restarting consumer
Apr 20 23:41:41 webperf2001 python[14715]: Traceback (most recent call last):
Apr 20 23:41:41 webperf2001 python[14715]:   File "/srv/deployment/performance/coal-cache/revs/8766469862a342d1d07f96666e2f7bd0862d8281/coal/__init__.py", line 530, in run
Apr 20 23:41:41 webperf2001 python[14715]:     enable_auto_commit=False)
Apr 20 23:41:41 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/consumer/group.py", line 340, in __init__
Apr 20 23:41:41 webperf2001 python[14715]:     self._client = KafkaClient(metrics=self._metrics, **self.config)
Apr 20 23:41:41 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/client_async.py", line 188, in __init__
Apr 20 23:41:41 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/vendor/selectors34.py", line 424, in __init__
Apr 20 23:41:41 webperf2001 python[14715]: IOError: [Errno 24] Too many open files
Apr 20 23:41:41 webperf2001 python[14715]: 2019-04-20 23:41:41,717 [INFO] (run:526) Starting Kafka connection to brokers (kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092,kafka-jumbo1003.eqiad.wmnet:9092,kafka-jumbo1004.eqiad.wmnet:9092,kafka-jumbo1005.eqiad.wmnet:9092,kafka-jumbo1006.eqiad.wmnet:9092).
Apr 20 23:41:41 webperf2001 python[14715]: Exception AttributeError: "'KafkaClient' object has no attribute '_closed'" in <bound method KafkaClient.__del__ of <kafka.client_async.KafkaClient object at 0x7fc38a6bfe50>> ignored
Apr 20 23:41:41 webperf2001 python[14715]: 2019-04-20 23:41:41,718 [ERROR] (run:599) Error in main loop, restarting consumer
Apr 20 23:41:41 webperf2001 python[14715]: Traceback (most recent call last):
Apr 20 23:41:41 webperf2001 python[14715]:   File "/srv/deployment/performance/coal-cache/revs/8766469862a342d1d07f96666e2f7bd0862d8281/coal/__init__.py", line 530, in run
Apr 20 23:41:41 webperf2001 python[14715]:     enable_auto_commit=False)
Apr 20 23:41:41 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/consumer/group.py", line 340, in __init__
Apr 20 23:41:41 webperf2001 python[14715]:     self._client = KafkaClient(metrics=self._metrics, **self.config)
Apr 20 23:41:41 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/client_async.py", line 188, in __init__
Apr 20 23:41:41 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/vendor/selectors34.py", line 424, in __init__
Apr 20 23:41:41 webperf2001 python[14715]: IOError: [Errno 24] Too many open files

user.log is full of pretty much the same thing:

Apr 21 05:20:54 webperf2001 python[14715]: 2019-04-21 05:20:54,265 [INFO] (run:526) Starting Kafka connection to brokers (kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092,kafka-jumbo1003.eqiad.wmnet:9092,kafka-jumbo1004.eqiad.wmnet:9092,kafka-jumbo1005.eqiad.wmnet:9092,kafka-jumbo1006.eqiad.wmnet:9092).
Apr 21 05:20:54 webperf2001 python[14715]: Exception AttributeError: "'KafkaClient' object has no attribute '_closed'" in <bound method KafkaClient.__del__ of <kafka.client_async.KafkaClient object at 0x7fc38a694b90>> ignored
Apr 21 05:20:54 webperf2001 python[14715]: 2019-04-21 05:20:54,265 [ERROR] (run:599) Error in main loop, restarting consumer
Apr 21 05:20:54 webperf2001 python[14715]: Traceback (most recent call last):
Apr 21 05:20:54 webperf2001 python[14715]:   File "/srv/deployment/performance/coal-cache/revs/8766469862a342d1d07f96666e2f7bd0862d8281/coal/__init__.py", line 530, in run
Apr 21 05:20:54 webperf2001 python[14715]:     enable_auto_commit=False)
Apr 21 05:20:54 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/consumer/group.py", line 340, in __init__
Apr 21 05:20:54 webperf2001 python[14715]:     self._client = KafkaClient(metrics=self._metrics, **self.config)
Apr 21 05:20:54 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/client_async.py", line 188, in __init__
Apr 21 05:20:54 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/vendor/selectors34.py", line 424, in __init__
Apr 21 05:20:54 webperf2001 python[14715]: IOError: [Errno 24] Too many open files
Apr 21 05:20:54 webperf2001 python[14715]: 2019-04-21 05:20:54,266 [INFO] (run:526) Starting Kafka connection to brokers (kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092,kafka-jumbo1003.eqiad.wmnet:9092,kafka-jumbo1004.eqiad.wmnet:9092,kafka-jumbo1005.eqiad.wmnet:9092,kafka-jumbo1006.eqiad.wmnet:9092).
Apr 21 05:20:54 webperf2001 python[14715]: Exception AttributeError: "'KafkaClient' object has no attribute '_closed'" in <bound method KafkaClient.__del__ of <kafka.client_async.KafkaClient object at 0x7fc38b3ab790>> ignored
Apr 21 05:20:54 webperf2001 python[14715]: 2019-04-21 05:20:54,266 [ERROR] (run:599) Error in main loop, restarting consumer
Apr 21 05:20:54 webperf2001 python[14715]: Traceback (most recent call last):
Apr 21 05:20:54 webperf2001 python[14715]:   File "/srv/deployment/performance/coal-cache/revs/8766469862a342d1d07f96666e2f7bd0862d8281/coal/__init__.py", line 530, in run
Apr 21 05:20:54 webperf2001 python[14715]:     enable_auto_commit=False)
Apr 21 05:20:54 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/consumer/group.py", line 340, in __init__
Apr 21 05:20:54 webperf2001 python[14715]:     self._client = KafkaClient(metrics=self._metrics, **self.config)
Apr 21 05:20:54 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/client_async.py", line 188, in __init__
Apr 21 05:20:54 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/vendor/selectors34.py", line 424, in __init__
Apr 21 05:20:54 webperf2001 python[14715]: IOError: [Errno 24] Too many open files
Apr 21 05:20:54 webperf2001 python[14715]: 2019-04-21 05:20:54,266 [INFO] (run:526) Starting Kafka connection to brokers (kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092,kafka-jumbo1003.eqiad.wmnet:9092,kafka-jumbo1004.eqiad.wmnet:9092,kafka-jumbo1005.eqiad.wmnet:9092,kafka-jumbo1006.eqiad.wmnet:9092).
Apr 21 05:20:54 webperf2001 python[14715]: Exception AttributeError: "'KafkaClient' object has no attribute '_closed'" in <bound method KafkaClient.__del__ of <kafka.client_async.KafkaClient object at 0x7fc38a71b050>> ignored
Apr 21 05:20:54 webperf2001 python[14715]: 2019-04-21 05:20:54,266 [ERROR] (run:599) Error in main loop, restarting consumer
Apr 21 05:20:54 webperf2001 python[14715]: Traceback (most recent call last):
Apr 21 05:20:54 webperf2001 python[14715]:   File "/srv/deployment/performance/coal-cache/revs/8766469862a342d1d07f96666e2f7bd0862d8281/coal/__init__.py", line 530, in run
Apr 21 05:20:54 webperf2001 python[14715]:     enable_auto_commit=False)
Apr 21 05:20:54 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/consumer/group.py", line 340, in __init__
Apr 21 05:20:54 webperf2001 python[14715]:     self._client = KafkaClient(metrics=self._metrics, **self.config)
Apr 21 05:20:54 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/client_async.py", line 188, in __init__
Apr 21 05:20:54 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/vendor/selectors34.py", line 424, in __init__
Apr 21 05:20:54 webperf2001 python[14715]: IOError: [Errno 24] Too many open files
Apr 21 05:20:54 webperf2001 python[14715]: 2019-04-21 05:20:54,266 [INFO] (run:526) Starting Kafka connection to brokers (kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092,kafka-jumbo1003.eqiad.wmnet:9092,kafka-jumbo1004.eqiad.wmnet:9092,kafka-jumbo1005.eqiad.wmnet:9092,kafka-jumbo1006.eqiad.wmnet:9092).
Apr 21 05:20:54 webperf2001 python[14715]: Exception AttributeError: "'KafkaClient' object has no attribute '_closed'" in <bound method KafkaClient.__del__ of <kafka.client_async.KafkaClient object at 0x7fc38a6df7d0>> ignored
Apr 21 05:20:54 webperf2001 python[14715]: 2019-04-21 05:20:54,267 [ERROR] (run:599) Error in main loop, restarting consumer
Apr 21 05:20:54 webperf2001 python[14715]: Traceback (most recent call last):
Apr 21 05:20:54 webperf2001 python[14715]:   File "/srv/deployment/performance/coal-cache/revs/8766469862a342d1d07f96666e2f7bd0862d8281/coal/__init__.py", line 530, in run
Apr 21 05:20:54 webperf2001 python[14715]:     enable_auto_commit=False)
Apr 21 05:20:54 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/consumer/group.py", line 340, in __init__
Apr 21 05:20:54 webperf2001 python[14715]:     self._client = KafkaClient(metrics=self._metrics, **self.config)
Apr 21 05:20:54 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/client_async.py", line 188, in __init__
Apr 21 05:20:54 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/vendor/selectors34.py", line 424, in __init__
Apr 21 05:20:54 webperf2001 python[14715]: IOError: [Errno 24] Too many open files
Apr 21 05:20:54 webperf2001 python[14715]: 2019-04-21 05:20:54,267 [INFO] (run:526) Starting Kafka connection to brokers (kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092,kafka-jumbo1003.eqiad.wmnet:9092,kafka-jumbo1004.eqiad.wmnet:9092,kafka-jumbo1005.eqiad.wmnet:9092,kafka-jumbo1006.eqiad.wmnet:9092).
Apr 21 05:20:54 webperf2001 python[14715]: Exception AttributeError: "'KafkaClient' object has no attribute '_closed'" in <bound method KafkaClient.__del__ of <kafka.client_async.KafkaClient object at 0x7fc38a6df810>> ignored
Apr 21 05:20:54 webperf2001 python[14715]: 2019-04-21 05:20:54,267 [ERROR] (run:599) Error in main loop, restarting consumer
Apr 21 05:20:54 webperf2001 python[14715]: Traceback (most recent call last):
Apr 21 05:20:54 webperf2001 python[14715]:   File "/srv/deployment/performance/coal-cache/revs/8766469862a342d1d07f96666e2f7bd0862d8281/coal/__init__.py", line 530, in run
Apr 21 05:20:54 webperf2001 python[14715]:     enable_auto_commit=False)
Apr 21 05:20:54 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/consumer/group.py", line 340, in __init__
Apr 21 05:20:54 webperf2001 python[14715]:     self._client = KafkaClient(metrics=self._metrics, **self.config)
Apr 21 05:20:54 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/client_async.py", line 188, in __init__
Apr 21 05:20:54 webperf2001 python[14715]:   File "/usr/lib/python2.7/dist-packages/kafka/vendor/selectors34.py", line 424, in __init__

I have truncated both files. They are still logging the same error, but at least we have some room on the host now.

The disk will fill up again in a matter of minutes:

root@webperf2001:/var/log# ls -lh messages user.log
-rw-r----- 1 root adm 1.3G Apr 21 05:27 messages
-rw-r----- 1 root adm 989M Apr 21 05:27 user.log

I guess the following scripts should be killed until they are fixed:

nobody   14715  5.0  2.8 4881324 230952 ?      Ssl  Mar08 3153:55 /usr/bin/python /srv/deployment/performance/coal/run_coal.py --brokers kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092,kafka-jumbo1003.eqiad.wmnet:9092,kafka-jumbo1004.eqiad.wmnet:9092,kafka-jumbo1005.eqiad.wmnet:9092,kafka-jumbo1006.eqiad.wmnet:9092 --consumer-group coal_codfw --schema NavigationTiming --schema SaveTiming --schema PaintTiming --graphite-host graphite-in.eqiad.wmnet --graphite-port 2003 --graphite-prefix coal
nobody   21650  2.1  0.2 128668 20056 ?        Ssl  05:25   0:00 /usr/bin/python /srv/deployment/statsv/statsv/statsv.py --brokers kafka2001.codfw.wmnet:9092,kafka2002.codfw.wmnet:9092,kafka2003.codfw.wmnet:9092 --statsd statsd.eqiad.wmnet:8125 --topics statsv
nobody   21651  0.0  0.1  47004 13644 ?        S    05:25   0:00 /usr/bin/python /srv/deployment/statsv/statsv/statsv.py --brokers kafka2001.codfw.wmnet:9092,kafka2002.codfw.wmnet:9092,kafka2003.codfw.wmnet:9092 --statsd statsd.eqiad.wmnet:8125 --topics statsv
nobody   21652  0.0  0.1  47004 13644 ?        S    05:25   0:00 /usr/bin/python /srv/deployment/statsv/statsv/statsv.py --brokers kafka2001.codfw.wmnet:9092,kafka2002.codfw.wmnet:9092,kafka2003.codfw.wmnet:9092 --statsd statsd.eqiad.wmnet:8125 --topics statsv
root     21671  0.0  0.0  12784  1016 pts/0    S+   05:25   0:00 grep -i python
nobody   29240  0.0  0.2  64948 23400 ?        Ss   Apr13   2:03 /usr/bin/python /srv/deployment/performance/navtiming/run_navtiming.py --brokers kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092,kafka-jumbo1003.eqiad.wmnet:9092,kafka-jumbo1004.eqiad.wmnet:9092,kafka-jumbo1005.eqiad.wmnet:9092,kafka-jumbo1006.eqiad.wmnet:9092 --consumer-group navtiming --statsd-host statsd.eqiad.wmnet --statsd-port 8125

But I am not familiar with this service, so I won't kill anything.

There should be only one instance of each on a given webperfx001 host:

  • statsv.py
  • navtiming.py
  • coal.py

It would appear there are multiple in that snapshot. Afaik that's managed by systemd.

The immediate cause of the full disk seems to be that the log files grew rapidly with fatal errors from these services, which are auto-restarted whenever they fail and then produce the same error again, in a very tight loop controlled by systemd.
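As an aside, the log lines above all come from the same PID (14715) and say "Error in main loop, restarting consumer", which suggests that at least part of the retrying happens inside the process rather than through systemd. A minimal sketch of how such a retry loop could back off instead of spinning is below. This is not coal's actual code: `run_with_backoff`, `backoff_delays` and the `start_consumer` callable are hypothetical names used only for illustration.

```python
import time


def backoff_delays(base=1.0, factor=2.0, cap=60.0):
    """Yield an endless sequence of capped exponential delays, in seconds."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def run_with_backoff(start_consumer, max_attempts, sleep=time.sleep):
    """Call start_consumer until it succeeds or max_attempts is reached.

    start_consumer is a hypothetical callable standing in for whatever
    code opens the Kafka connection. Returns the number of failed attempts.
    """
    delays = backoff_delays()
    failures = 0
    for _ in range(max_attempts):
        try:
            start_consumer()
            return failures
        except (IOError, OSError):
            failures += 1
            # Wait before retrying so a persistent error cannot
            # produce thousands of tracebacks per second.
            sleep(next(delays))
    return failures
```

With a delay between attempts, a persistent failure such as Errno 24 would log a handful of tracebacks per minute rather than filling the disk.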

I assume then that is the reason there are multiple instances of the same service in that ps listing, because normally that is not meant to be possible.

So the question is - why are the services unable to start?

I suspect there may also be a cascading set of issues, and it's not entirely clear what the root cause is.

The symptoms we have are:

  • A full disk (at this point, mainly due to error log growth; but might've also been a root cause from something else writing to disk earlier).
  • Multiple instances of the same systemd-controlled service. This is not meant to be possible, but could just be an artifact of ps: the processes may be respawning faster than ps can traverse the process list.
  • Error from coal - Exception AttributeError: "'KafkaClient' object has no attribute '_closed'" .
  • Error from coal - IOError: [Errno 24] Too many open files
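The two coal errors are likely the same failure seen twice. In the traceback, the IOError is raised inside KafkaClient.__init__ while the selector is being created; if that happens before `_closed` is assigned, the finalizer then runs on a half-constructed object and trips over the missing attribute. The sketch below reproduces that pattern with a hypothetical `FakeClient` class. It is not the kafka-python source, only an illustration of the mechanism.

```python
class FakeClient(object):
    """Hypothetical stand-in for a client whose __init__ can fail early."""

    def __init__(self, fail=False):
        if fail:
            # Simulates selector creation failing with EMFILE before
            # the _closed attribute has been assigned.
            raise IOError(24, "Too many open files")
        self._closed = False

    def close(self):
        self._closed = True

    def __del__(self):
        # On a half-constructed object this lookup raises AttributeError.
        if not self._closed:
            self.close()


def finalizer_error(fail):
    """Return the name of the exception __del__ raises, or None."""
    obj = FakeClient.__new__(FakeClient)
    try:
        obj.__init__(fail=fail)
    except IOError:
        pass  # the constructor failed; obj exists but has no _closed
    try:
        obj.__del__()
    except AttributeError as exc:
        return type(exc).__name__
    return None
```

When the interpreter itself runs the finalizer, it cannot propagate the AttributeError, so it reports it as an ignored exception, which matches the "ignored" suffix in the log. If this reading is right, "Too many open files" is the error to chase and the AttributeError is noise.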

There seem to be several major points in time where something significant happened on webperf2001 in the last 7 days.

Comparing webperf1001 and webperf2001, which have the same role and services:

[Graphs: webperf1001, webperf2001]
  • [webperf2001] 2019-04-17 11:00 UTC:
    • Memory utilisation starts to grow from a stable 2GB to 7GB.
    • Network utilisation starts having regular 3MB/s spikes (previously always under 100kB/s).
  • [webperf2001] 2019-04-19 06:00 UTC:
    • Memory utilisation dropped briefly from 7GB to 6GB, then grew back up to 7GB.
  • [webperf2001] 2019-04-19 12:00 UTC:
    • Network utilisation spikes of 3MB/s are now gone.
  • [webperf2001] 2019-04-20 06:00 UTC:
    • Memory utilisation dropped briefly from 7GB to 6GB, stays at 6GB
  • [webperf2001] 2019-04-20 21:00 UTC:
    • Memory utilisation grows up again, from 6GB to 8GB this time and stays there.
    • Network utilisation grows from <100kB/s to a sustained 4 MB/s (not even spikes).
    • CPU utilisation (first time) grows from <5% to over 80%, sustained.

And that's where we are now.

I am afraid I do not know enough about the services on this server to take any action either. @Krinkle Is there something we can do for the time being? What problems are we having while this server is in this state?

Aklapper renamed this task from webperf2001 is running ouf of disk space to webperf2001 is running out of disk space. Apr 21 2019, 9:50 PM

The multiple statsv instances probably aren't due to the constant restarting, because the same is observed on webperf1001 with instances that were started a long time ago:

nobody   17356  3.1  0.2 203132 21208 ?        Ssl  Mar06 2118:02 /usr/bin/python /srv/deployment/statsv/statsv/statsv.py --brokers kafka1001.eqiad.wmnet:9092,kafka1002.eqiad.wmnet:9092,kafka1003.eqiad.wmnet:9092 --statsd statsd.eqiad.w
nobody   17369  1.5  0.1  53336 15760 ?        S    Mar06 1067:57 /usr/bin/python /srv/deployment/statsv/statsv/statsv.py --brokers kafka1001.eqiad.wmnet:9092,kafka1002.eqiad.wmnet:9092,kafka1003.eqiad.wmnet:9092 --statsd statsd.eqiad.w
nobody   17370  1.5  0.1  53336 15744 ?        S    Mar06 1067:54 /usr/bin/python /srv/deployment/statsv/statsv/statsv.py --brokers kafka1001.eqiad.wmnet:9092,kafka1002.eqiad.wmnet:9092,kafka1003.eqiad.wmnet:9092 --statsd statsd.eqiad.w

The answer is that statsv spawns workers as separate processes. By default the number of workers is half the number of logical CPUs, which works out to 2 workers on those machines (in addition to the main process). So seeing 3 statsv processes is normal.
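The arithmetic can be sketched as follows. This is only an illustration of the rule described above, not the statsv source: the function names are hypothetical, and rounding down for odd CPU counts is an assumption.

```python
import multiprocessing


def default_worker_count(logical_cpus=None):
    """Half the logical CPUs (rounding down is an assumption)."""
    if logical_cpus is None:
        logical_cpus = multiprocessing.cpu_count()
    return logical_cpus // 2


def expected_process_count(logical_cpus=None):
    """Workers plus the one main process that spawns them."""
    return 1 + default_worker_count(logical_cpus)
```

For a machine with 4 logical CPUs, which is what "2 workers" implies here, `expected_process_count(4)` gives 3, matching the three statsv lines in the ps output.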

Gilles closed this task as Resolved. Apr 22 2019, 7:42 AM
Gilles claimed this task.

Restarting coal fixed it. I think it was a lingering consequence of last week's Kafka maintenance, which left coal in a bad state. The restarted coal service has now caught up. I've truncated all 3 logs to free up space.
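If this recurs, one way to spot a descriptor leak before it reaches Errno 24 would be for the service to compare its open file descriptors against its limit from time to time. A minimal sketch, assuming Linux (it reads /proc/self/fd) and using hypothetical helper names; nothing like this is known to exist in coal today:

```python
import os
import resource


def open_fd_count():
    """Count this process's open descriptors via /proc (Linux only).

    The count includes the descriptor briefly opened to list the
    directory, so treat it as approximate. Returns None if /proc
    is not available.
    """
    try:
        return len(os.listdir("/proc/self/fd"))
    except OSError:
        return None


def fd_headroom():
    """Return how many more descriptors can be opened, or None if unknown."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    used = open_fd_count()
    if used is None or soft_limit == resource.RLIM_INFINITY:
        return None
    return soft_limit - used
```

Logging or alerting when `fd_headroom()` drops below some threshold would flag the bad state days before the disk fills up.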