
Alarm on HDFS related script failures
Closed, Duplicate (Public)

Description

Today some HDFS datanode partitions on the analytics brokers started filling up because HDFS free space had dropped to a tiny percentage. Usage had been increasing steadily over the past three months without us noticing, mostly because the refinery-drop-webrequest-partitions script (an1003) was failing due to a file system permission issue.

We should alarm on refinery-drop-webrequest-partitions errors. Do we need to do anything more than set MAILTO in the crontab? (We will need to review all scripts to see whether they write errors only to stdout.)
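
One thing to keep in mind with MAILTO: cron mails whatever a job prints (stdout or stderr) and does not look at the exit status, so a script that fails silently sends nothing and a chatty script sends mail on every run. A possible workaround, sketched below under assumptions, is a small wrapper that stays silent on success and prints the captured output only when the wrapped command exits non-zero. The wrapper name `cron_alarm.py` and the function `run_quiet_unless_failed` are hypothetical, not part of refinery.

```python
import subprocess
import sys


def run_quiet_unless_failed(argv):
    """Run argv and return (exit_code, report).

    report is an empty string on success, so cron has nothing to mail.
    On failure it holds the exit code plus the combined stdout/stderr.
    """
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout, keep ordering
        text=True,
    )
    if proc.returncode == 0:
        return 0, ""
    report = "FAILED (exit {}): {}\n{}".format(
        proc.returncode, " ".join(argv), proc.stdout
    )
    return proc.returncode, report


# Hypothetical entry point: cron_alarm.py <command> [args...]
if __name__ == "__main__" and len(sys.argv) > 1:
    code, report = run_quiet_unless_failed(sys.argv[1:])
    if report:
        # Any output here is what cron would mail to MAILTO.
        sys.stderr.write(report)
    sys.exit(code)
```

With something like this in place, the crontab entry would call the wrapper with the real script as its argument, and MAILTO would then only fire on failures. This only works if the wrapped script returns a non-zero exit code on error, which is part of what the script review would need to confirm.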