
Configure the HDFS Namenodes to use the log4j rolling gzip appender
Closed, Resolved · Public

Description

I was able to make the Namenodes roll over log files with gzip compression via the following log4j configuration:

hdfs.audit.logger=INFO,console
hdfs.audit.log.maxfilesize=512000000
hdfs.audit.log.maxbackupindex=10
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
log4j.appender.RFAAUDIT=org.apache.log4j.rolling.RollingFileAppender
log4j.appender.RFAAUDIT.rollingPolicy=org.apache.log4j.rolling.FixedWindowRollingPolicy
log4j.appender.RFAAUDIT.rollingPolicy.maxIndex=${hdfs.audit.log.maxbackupindex}
log4j.appender.RFAAUDIT.rollingPolicy.FileNamePattern=${hadoop.log.dir}/hdfs-audit.log.%i.gz
log4j.appender.RFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
log4j.appender.RFAAUDIT.triggeringPolicy=org.apache.log4j.rolling.SizeBasedTriggeringPolicy
log4j.appender.RFAAUDIT.triggeringPolicy.MaxFileSize=${hdfs.audit.log.maxfilesize}
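
Two notes on how this works: the .gz suffix in FileNamePattern is what tells the log4j-extras FixedWindowRollingPolicy to gzip each file as it is rolled, and hdfs.audit.logger is only INFO,console as a default here (in stock Hadoop the NameNode JVM options typically override it with -Dhdfs.audit.logger=INFO,RFAAUDIT to attach the appender). Roughly, each time hdfs-audit.log exceeds MaxFileSize:

# Illustrative rollover sequence (names per the config above):
#   hdfs-audit.log.10.gz      -> deleted (window is full)
#   hdfs-audit.log.9.gz       -> hdfs-audit.log.10.gz
#   ...
#   hdfs-audit.log.1.gz       -> hdfs-audit.log.2.gz
#   hdfs-audit.log (gzipped)  -> hdfs-audit.log.1.gz
#   a fresh hdfs-audit.log is opened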

But first I needed to download the log4j-extras jar manually and run:

sudo cp apache-log4j-extras-1.2.17.jar /usr/lib/hadoop-hdfs/lib/
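
A quick way to confirm the class actually ships in that jar (command reconstructed, not from the original session):

unzip -l /usr/lib/hadoop-hdfs/lib/apache-log4j-extras-1.2.17.jar | grep rolling/RollingFileAppender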

Now the jar seems to be available from the liblog4j-extras1.2-java Debian package, but we need to add it to the daemons' classpath to avoid errors like:

elukey@an-test-master1001:~$ cat /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-test-master1001.out
ulimit -a for user hdfs
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 514021
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 32768
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 65536
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
log4j:ERROR Could not instantiate class [org.apache.log4j.rolling.RollingFileAppender].
java.lang.ClassNotFoundException: org.apache.log4j.rolling.RollingFileAppender
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:264)
	at org.apache.log4j.helpers.Loader.loadClass(Loader.java:198)
	at org.apache.log4j.helpers.OptionConverter.instantiateByClassName(OptionConverter.java:327)
	at org.apache.log4j.helpers.OptionConverter.instantiateByKey(OptionConverter.java:124)
	at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:785)
	at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:768)
	at org.apache.log4j.PropertyConfigurator.parseCatsAndRenderers(PropertyConfigurator.java:672)
	at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:516)
	at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:580)
	at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
	at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
	at org.slf4j.impl.Log4jLoggerFactory.<init>(Log4jLoggerFactory.java:66)
	at org.slf4j.impl.StaticLoggerBinder.<init>(StaticLoggerBinder.java:72)
	at org.slf4j.impl.StaticLoggerBinder.<clinit>(StaticLoggerBinder.java:45)
	at org.slf4j.LoggerFactory.bind(LoggerFactory.java:150)
	at org.slf4j.LoggerFactory.performInitialization(LoggerFactory.java:124)
	at org.slf4j.LoggerFactory.getILoggerFactory(LoggerFactory.java:412)
	at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:357)
	at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:383)
	at org.apache.hadoop.conf.ReconfigurableBase.<clinit>(ReconfigurableBase.java:46)
log4j:ERROR Could not instantiate appender named "RFAAUDIT".
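
A generic way to check whether the extras jar is visible on the effective daemon classpath (hadoop classpath prints the CLASSPATH the daemons start with; --glob expands the wildcard entries):

hadoop classpath --glob | tr ':' '\n' | grep log4j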

Event Timeline

@elukey, if we puppetize the hadoop log4j file for this, it will make T265126: Improve logging for HDFS Namenodes easier for @razzi.

Adding export HADOOP_CLASSPATH=/usr/share/java/apache-log4j-extras.jar:${HADOOP_CLASSPATH} to hadoop-env did the trick!
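
In hadoop-env.sh terms this boils down to something like the following sketch (the existence guard is my addition, to keep the classpath clean on hosts without the package):

# Prepend the jar shipped by the liblog4j-extras1.2-java Debian package.
if [ -f /usr/share/java/apache-log4j-extras.jar ]; then
    export HADOOP_CLASSPATH=/usr/share/java/apache-log4j-extras.jar:${HADOOP_CLASSPATH}
fi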

Change 677576 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] hadoop: add the liblog4j-extras1.2-java jar to HADOOP_CLASSPATH

https://gerrit.wikimedia.org/r/677576

Change 677576 merged by Elukey:

[operations/puppet@production] hadoop: add the liblog4j-extras1.2-java jar to HADOOP_CLASSPATH

https://gerrit.wikimedia.org/r/677576

Change 677803 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] hadoop: improve the HDFS Namenode audit log4j config

https://gerrit.wikimedia.org/r/677803

Change 677803 merged by Elukey:

[operations/puppet@production] hadoop: improve the HDFS Namenode audit log4j config

https://gerrit.wikimedia.org/r/677803

The HDFS audit config has been deployed correctly on the prod master nodes; let's wait a couple of days to see how it goes. After that, we could move other logs to the same gzip rolling scheme to save space.
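
For reference, moving the main daemon log to the same scheme would look roughly like the sketch below; RFA, hadoop.log.dir and hadoop.log.file are the stock Hadoop log4j names, while the size and index values are placeholders, not the values from the merged patch:

hadoop.log.maxfilesize=256000000
hadoop.log.maxbackupindex=10
log4j.appender.RFA=org.apache.log4j.rolling.RollingFileAppender
log4j.appender.RFA.rollingPolicy=org.apache.log4j.rolling.FixedWindowRollingPolicy
log4j.appender.RFA.rollingPolicy.maxIndex=${hadoop.log.maxbackupindex}
log4j.appender.RFA.rollingPolicy.FileNamePattern=${hadoop.log.dir}/${hadoop.log.file}.%i.gz
log4j.appender.RFA.File=${hadoop.log.dir}/${hadoop.log.file}
log4j.appender.RFA.triggeringPolicy=org.apache.log4j.rolling.SizeBasedTriggeringPolicy
log4j.appender.RFA.triggeringPolicy.MaxFileSize=${hadoop.log.maxfilesize}
log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n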

Change 677853 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] hadoop: fix the HDFS Namenode audit log config

https://gerrit.wikimedia.org/r/677853

Change 677853 merged by Elukey:

[operations/puppet@production] hadoop: fix the HDFS Namenode audit log config

https://gerrit.wikimedia.org/r/677853

Quick check to see how things are going:

elukey@an-master1001:~$ du -hs /var/log/hadoop-hdfs/hdfs-audit.log*
22M	/var/log/hadoop-hdfs/hdfs-audit.log
28M	/var/log/hadoop-hdfs/hdfs-audit.log.10.gz
28M	/var/log/hadoop-hdfs/hdfs-audit.log.1.gz
27M	/var/log/hadoop-hdfs/hdfs-audit.log.2.gz
27M	/var/log/hadoop-hdfs/hdfs-audit.log.3.gz
28M	/var/log/hadoop-hdfs/hdfs-audit.log.4.gz
31M	/var/log/hadoop-hdfs/hdfs-audit.log.5.gz
29M	/var/log/hadoop-hdfs/hdfs-audit.log.6.gz
30M	/var/log/hadoop-hdfs/hdfs-audit.log.7.gz
29M	/var/log/hadoop-hdfs/hdfs-audit.log.8.gz
29M	/var/log/hadoop-hdfs/hdfs-audit.log.9.gz

First entry in /var/log/hadoop-hdfs/hdfs-audit.log.10.gz (i.e. the earliest record still retained):

2021-04-08 17:02:34,497
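
(recovered with something along these lines, command reconstructed:)

zcat /var/log/hadoop-hdfs/hdfs-audit.log.10.gz | head -n 1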

Now:

elukey@an-master1001:~$ date
Fri Apr  9 06:12:36 UTC 2021

We have a massive reduction in disk usage, but in the normal case (i.e. without a storm of RPCs from some unexpected event) the retained window is still less than a day. I'll bump the max log file size to 5G; that should increase the space used roughly fivefold (currently ~350M, so it will become ~1.6G), but it should give us a decent retention window for recorded events.
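
In log4j terms the bump is just the trigger size; a hypothetical rendering, assuming the decimal-byte convention used above (the merged patch is authoritative for the exact value):

hdfs.audit.log.maxfilesize=5000000000
hdfs.audit.log.maxbackupindex=10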

Change 680383 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] hadoop: improve default log4j config

https://gerrit.wikimedia.org/r/680383

Currently testing the new config on an-test-master1001 and an-test-worker1001, from the first results it looks good.

elukey triaged this task as Medium priority. Apr 19 2021, 7:56 AM

Change 680383 merged by Elukey:

[operations/puppet@production] hadoop: improve default log4j config

https://gerrit.wikimedia.org/r/680383

Verified on an-test-worker1001 that the Yarn NM now uses the gzip appender:

elukey@an-test-worker1001:~$ ls -lht /var/log/hadoop-yarn/*log*
-rw-r--r-- 1 yarn yarn 3.9M Apr 23 09:18 /var/log/hadoop-yarn/yarn-yarn-nodemanager-an-test-worker1001.log
-rw-r--r-- 1 yarn yarn  11M Apr 23 07:04 /var/log/hadoop-yarn/yarn-yarn-nodemanager-an-test-worker1001.log.1.gz  <==== unzipped this is around 257MB
-rw-r--r-- 1 yarn yarn 257M Apr 16 08:21 /var/log/hadoop-yarn/yarn-yarn-nodemanager-an-test-worker1001.log.2
-rw-r--r-- 1 yarn yarn 257M Apr 12 23:52 /var/log/hadoop-yarn/yarn-yarn-nodemanager-an-test-worker1001.log.3
-rw-r--r-- 1 yarn yarn 257M Apr 12 11:37 /var/log/hadoop-yarn/yarn-yarn-nodemanager-an-test-worker1001.log.4
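
(the ~257MB figure for the gzipped file can be double checked by expanding it, e.g.:)

zcat /var/log/hadoop-yarn/yarn-yarn-nodemanager-an-test-worker1001.log.1.gz | wc -c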

Now that the change is rolled out, I am going to roll restart a couple of daemons in prod for further testing; after that we'll need to roll restart everything for the new config to apply fully (the next JVM upgrade will likely be a good occasion).
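
Per host, the restart is just the usual service units; a sketch assuming the standard Hadoop packaging service names in use on these hosts:

# log4j.properties is only re-read at JVM startup, hence the roll restarts.
sudo systemctl restart hadoop-yarn-nodemanager
sudo systemctl restart hadoop-hdfs-datanode
sudo systemctl restart hadoop-hdfs-journalnode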

Mentioned in SAL (#wikimedia-analytics) [2021-04-23T10:30:20Z] <elukey> restart hadoop daemons (NM, DN, JN) on an-worker1080 to further test the new log4j config - T276906

Mentioned in SAL (#wikimedia-analytics) [2021-04-23T14:23:01Z] <elukey> roll restart an-master100[1,2] daemons to pick up new log4j settings - T276906

I'll keep monitoring the an-master100x nodes and an-worker1080, but preliminary results look good. As mentioned above, the complete rollout can happen during the next JVM restart for upgrades.

Mentioned in SAL (#wikimedia-analytics) [2021-04-26T06:32:18Z] <elukey> roll restart of hadoop-yarn-nodemanagers to pick up new log4j settings - T276906

The last step is to roll restart all the HDFS daemons (journal and data nodes), but we can wait for the next openjdk upgrade (which will require a roll restart of all the JVMs).