
deployment-webperf21 puppet runs crashing with `Error: No space left on device`
Closed, Resolved | Public | BUG REPORT

Description

$ sudo -i puppet agent -tv
Error: No space left on device @ fptr_finalize_flush - /var/lib/puppet/ssl/ssl.lock
Error: No space left on device @ fptr_finalize_flush - /var/lib/puppet/ssl/ssl.lock

Event Timeline

bd808 changed the task status from Open to In Progress. Apr 7 2025, 4:50 PM
bd808 claimed this task.

Lots and lots of log spam in /var/log/{messages,syslog,user.log}

root@deployment-webperf21:/var/log# du -sh *|sort -h
...
33M     syslog.9.gz
826M    journal
2.3G    messages
2.3G    syslog
2.3G    user.log
2.7G    messages.1
2.7G    syslog.1
2.7G    user.log.1
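The listing above can be reproduced with a small helper. This is a minimal sketch, not what was run on the host: `largest_entries` is a hypothetical name, and it reports sizes in KiB rather than the human-readable units shown above so that a plain numeric sort works everywhere.

```shell
# Hypothetical helper: print the N largest entries directly under a directory.
# Usage: largest_entries [DIR] [COUNT]   (defaults: /var/log, 10)
largest_entries() {
    dir="${1:-/var/log}"
    count="${2:-10}"
    # -x: stay on one filesystem, -s: one total per entry, -k: sizes in KiB.
    # Errors for unreadable entries are discarded.
    du -xsk -- "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}
```

For example, `largest_entries /var/log 8` prints the eight biggest files or directories, largest first.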

This log entry may be what kicked this off: Mar 28 00:01:07 deployment-webperf21 navtiming[2564521]: kafka.errors.NoBrokersAvailable: NoBrokersAvailable
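One way to check which service is responsible for most of the lines is to count entries per program tag. This is a rough sketch: `top_loggers` is a hypothetical helper, and it assumes the traditional syslog layout seen in the line above (month, day, time, host, then `program[pid]:` as the fifth field).

```shell
# Hypothetical helper: count log lines per program tag in a syslog-style file.
# Usage: top_loggers FILE [COUNT]   (default COUNT: 5)
top_loggers() {
    file="$1"
    count="${2:-5}"
    awk '{
             tag = $5
             sub(/\[[0-9]+\]:?$/, "", tag)   # strip a trailing "[pid]:" if present
             sub(/:$/, "", tag)              # strip a bare trailing ":" otherwise
             n[tag]++
         }
         END { for (t in n) print n[t], t }' "$file" | sort -rn | head -n "$count"
}
```

For example, `top_loggers /var/log/syslog` prints the five noisiest tags with their line counts. Continuation lines of a multi-line stack trace are counted under whatever their fifth field happens to be, so treat the output as a hint rather than an exact tally.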

Mentioned in SAL (#wikimedia-releng) [2025-04-07T16:56:48Z] <bd808> rm /var/log/user.log.1 on deployment-webperf21 (T391272)

Mentioned in SAL (#wikimedia-releng) [2025-04-07T16:58:42Z] <bd808> puppet agent -tv to catch up with missed puppet runs on deployment-webperf21 (T391272)

The log spam in the messages.1 file was so repetitive that gzip -9 messages.1 turned a 2.3G input into a 30M output!
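To estimate how well a log would compress before touching it, the compressed stream can be piped to `wc` instead of written to disk, which matters when the filesystem is already full. A minimal sketch follows; `gzip_sizes` is a hypothetical helper, not something that was run here.

```shell
# Hypothetical helper: print "ORIGINAL_BYTES COMPRESSED_BYTES" for a file
# without modifying it or writing anything to disk.
gzip_sizes() {
    file="$1"
    # $(( ... )) normalizes any padding some wc implementations add.
    orig=$(( $(wc -c < "$file") ))
    # -c sends the compressed data to stdout, leaving the input file untouched.
    comp=$(( $(gzip -9 -c -- "$file" | wc -c) ))
    echo "$orig $comp"
}
```

For example, `gzip_sizes /var/log/messages.1` prints both sizes in bytes. Highly repetitive input such as identical stack traces compresses extremely well because deflate replaces each repeat with a short back-reference.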

T391273: navtiming: Loss of Kafka connection fills multiple log files with identical stack traces tracks the underlying logging problem.

Mentioned in SAL (#wikimedia-releng) [2025-04-07T17:15:20Z] <bd808> Reboot deployment-webperf21 (T391272)

Mentioned in SAL (#wikimedia-releng) [2025-04-07T17:20:03Z] <bd808> service navtiming stop to halt "Unhandled exception in main loop, restarting consumer" crash loop (T391272)

The disk is no longer filling up now that the broken service is shut down. Follow-up should happen in T391273: navtiming: Loss of Kafka connection fills multiple log files with identical stack traces.