Configure the HDFS Namenodes to use the log4j rolling gzip appender
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	elukey
	Mar 9 2021, 10:13 AM

Description

I was able to make the Namenodes to rollover log files using gzip via

hdfs.audit.logger=INFO,console
hdfs.audit.log.maxfilesize=512000000
hdfs.audit.log.maxbackupindex=10
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
log4j.appender.RFAAUDIT=org.apache.log4j.rolling.RollingFileAppender
log4j.appender.RFAAUDIT.rollingPolicy=org.apache.log4j.rolling.FixedWindowRollingPolicy
log4j.appender.RFAAUDIT.rollingPolicy.maxIndex=${hdfs.audit.log.maxbackupindex}
log4j.appender.RFAAUDIT.rollingPolicy.FileNamePattern=${hadoop.log.dir}/hdfs-audit.log.%i.gz
log4j.appender.RFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
log4j.appender.RFAAUDIT.triggeringPolicy=org.apache.log4j.rolling.SizeBasedTriggeringPolicy
log4j.appender.RFAAUDIT.triggeringPolicy.MaxFileSize=${hdfs.audit.log.maxfilesize}

But I needed to download and execute:

sudo cp apache-log4j-extras-1.2.17.jar /usr/lib/hadoop-hdfs/lib/

Now the jar seems to be available in liblog4j-extras1.2-java but we need to add it in the classpath of daemons, to avoid something like:

elukey@an-test-master1001:~$ cat /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-test-master1001.out
ulimit -a for user hdfs
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 514021
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 32768
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 65536
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
log4j:ERROR Could not instantiate class [org.apache.log4j.rolling.RollingFileAppender].
java.lang.ClassNotFoundException: org.apache.log4j.rolling.RollingFileAppender
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:264)
	at org.apache.log4j.helpers.Loader.loadClass(Loader.java:198)
	at org.apache.log4j.helpers.OptionConverter.instantiateByClassName(OptionConverter.java:327)
	at org.apache.log4j.helpers.OptionConverter.instantiateByKey(OptionConverter.java:124)
	at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:785)
	at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:768)
	at org.apache.log4j.PropertyConfigurator.parseCatsAndRenderers(PropertyConfigurator.java:672)
	at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:516)
	at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:580)
	at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
	at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
	at org.slf4j.impl.Log4jLoggerFactory.<init>(Log4jLoggerFactory.java:66)
	at org.slf4j.impl.StaticLoggerBinder.<init>(StaticLoggerBinder.java:72)
	at org.slf4j.impl.StaticLoggerBinder.<clinit>(StaticLoggerBinder.java:45)
	at org.slf4j.LoggerFactory.bind(LoggerFactory.java:150)
	at org.slf4j.LoggerFactory.performInitialization(LoggerFactory.java:124)
	at org.slf4j.LoggerFactory.getILoggerFactory(LoggerFactory.java:412)
	at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:357)
	at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:383)
	at org.apache.hadoop.conf.ReconfigurableBase.<clinit>(ReconfigurableBase.java:46)
log4j:ERROR Could not instantiate appender named "RFAAUDIT".

Details

Subject	Repo	Branch	Lines +/-
hadoop: improve default log4j config	operations/puppet	production	+18 -22
hadoop: add the liblog4j-extras1.2-java jar to HADOOP_CLASSPATH	operations/puppet	production	+8 -0
hadoop: fix the HDFS Namenode audit log config	operations/puppet	production	+1 -1
hadoop: improve the HDFS Namenode audit log4j config	operations/puppet	production	+8 -5

Customize query in gerrit

Related Objects

Mentioned In: T279304: Incorrect log4j configuration on hive servers causing excessive logging
T265126: Improve logging for HDFS Namenodes
T278423: Upgrade the Hadoop masters to Debian Buster
Mentioned Here: T265126: Improve logging for HDFS Namenodes

Event Timeline

elukey created this task.Mar 9 2021, 10:13 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 9 2021, 10:13 AM

Ottomata edited projects, added Analytics-Clusters; removed Analytics.Mar 11 2021, 6:07 PM

Ottomata assigned this task to elukey.Mar 15 2021, 3:39 PM

Ottomata moved this task from Backlog to Q3 2020/2021 on the Analytics-Clusters board.

Ottomata added a project: Analytics-Kanban.

elukey mentioned this in T278423: Upgrade the Hadoop masters to Debian Buster.Mar 25 2021, 11:05 AM

@elukey if we puppetize the hadoop log4j file for this it will make T265126: Improve logging for HDFS Namenodes easier for @razzi.

elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.Apr 7 2021, 6:36 AM

elukey mentioned this in T265126: Improve logging for HDFS Namenodes.Apr 7 2021, 7:10 AM

export HADOOP_CLASSPATH=/usr/share/java/apache-log4j-extras.jar:${HADOOP_CLASSPATH} in hadoop-env did the trick!

Change 677576 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] hadoop: add the liblog4j-extras1.2-java jar to HADOOP_CLASSPATH

https://gerrit.wikimedia.org/r/677576

gerritbot added a project: Patch-For-Review.Apr 7 2021, 1:32 PM

Change 677576 merged by Elukey:

[operations/puppet@production] hadoop: add the liblog4j-extras1.2-java jar to HADOOP_CLASSPATH

https://gerrit.wikimedia.org/r/677576

Change 677803 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] hadoop: improve the HDFS Namenode audit log4j config

https://gerrit.wikimedia.org/r/677803

Change 677803 merged by Elukey:

[operations/puppet@production] hadoop: improve the HDFS Namenode audit log4j config

https://gerrit.wikimedia.org/r/677803

Maintenance_bot removed a project: Patch-For-Review.Apr 8 2021, 8:10 AM

The hdfs audit config has been deployed correctly on the prod master nodes, let's wait a couple of days to see how it goes. After that, we could move other logs to the same gzip rolling scheme to save space.

Change 677853 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] hadoop: fix the HDFS Namenode audit log config

https://gerrit.wikimedia.org/r/677853

gerritbot added a project: Patch-For-Review.Apr 8 2021, 11:47 AM

Change 677853 merged by Elukey:

[operations/puppet@production] hadoop: fix the HDFS Namenode audit log config

https://gerrit.wikimedia.org/r/677853

Quick check done to see how things are doing:

elukey@an-master1001:~$ du -hs /var/log/hadoop-hdfs/hdfs-audit.log*
22M	/var/log/hadoop-hdfs/hdfs-audit.log
28M	/var/log/hadoop-hdfs/hdfs-audit.log.10.gz
28M	/var/log/hadoop-hdfs/hdfs-audit.log.1.gz
27M	/var/log/hadoop-hdfs/hdfs-audit.log.2.gz
27M	/var/log/hadoop-hdfs/hdfs-audit.log.3.gz
28M	/var/log/hadoop-hdfs/hdfs-audit.log.4.gz
31M	/var/log/hadoop-hdfs/hdfs-audit.log.5.gz
29M	/var/log/hadoop-hdfs/hdfs-audit.log.6.gz
30M	/var/log/hadoop-hdfs/hdfs-audit.log.7.gz
29M	/var/log/hadoop-hdfs/hdfs-audit.log.8.gz
29M	/var/log/hadoop-hdfs/hdfs-audit.log.9.gz

first element in /var/log/hadoop-hdfs/hdfs-audit.log.10.gz (so earliest recorded):

2021-04-08 17:02:34,497

Now:

elukey@an-master1001:~$ date
Fri Apr  9 06:12:36 UTC 2021

We have a massive reduction in file size, but the buffering in the normal use case (so without a storm of RPCs for an unknown event) is still less than a day. I'll bump the max log file size to 5G, that should increase by 5 fold the space used (currently ~350M, so it will become ~1.6G) but it should allow us to have a decent buffering for events recorded.

elukey mentioned this in T279304: Incorrect log4j configuration on hive servers causing excessive logging.Apr 15 2021, 6:11 AM

Change 680383 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] hadoop: improve default log4j config

https://gerrit.wikimedia.org/r/680383

Currently testing the new config on an-test-master1001 and an-test-worker1001, from the first results it looks good.

elukey triaged this task as Medium priority.Apr 19 2021, 7:56 AM

Ottomata moved this task from Q3 2020/2021 to Q4 2020/2021 on the Analytics-Clusters board.Apr 19 2021, 3:51 PM

Change 680383 merged by Elukey:

[operations/puppet@production] hadoop: improve default log4j config

https://gerrit.wikimedia.org/r/680383

Verified on an-test-worker1001 that the Yarn NM uses the gzip appender now:

elukey@an-test-worker1001:~$ ls -lht /var/log/hadoop-yarn/*log*
-rw-r--r-- 1 yarn yarn 3.9M Apr 23 09:18 /var/log/hadoop-yarn/yarn-yarn-nodemanager-an-test-worker1001.log
-rw-r--r-- 1 yarn yarn  11M Apr 23 07:04 /var/log/hadoop-yarn/yarn-yarn-nodemanager-an-test-worker1001.log.1.gz  <==== unzipped this is around 257MB
-rw-r--r-- 1 yarn yarn 257M Apr 16 08:21 /var/log/hadoop-yarn/yarn-yarn-nodemanager-an-test-worker1001.log.2
-rw-r--r-- 1 yarn yarn 257M Apr 12 23:52 /var/log/hadoop-yarn/yarn-yarn-nodemanager-an-test-worker1001.log.3
-rw-r--r-- 1 yarn yarn 257M Apr 12 11:37 /var/log/hadoop-yarn/yarn-yarn-nodemanager-an-test-worker1001.log.4

I am going to roll restart a couple of daemons in prod now that the change is rolled out to further test, and then we'll need to roll restart everything to apply properly (the next jvm upgrade will likely be a good occasion).

Mentioned in SAL (#wikimedia-analytics) [2021-04-23T10:30:20Z] <elukey> restart hadoop daemons (NM, DN, JN) on an-worker1080 to further test the new log4j config - T276906

Mentioned in SAL (#wikimedia-analytics) [2021-04-23T14:23:01Z] <elukey> roll restart an-master100[1,2] daemons to pick up new lo4j settings - T276906

I'll keep monitoring the an-master100x nodes and an-worker1080, but we should be good from preliminary results. The complete rollout could be completed, as mentioned above, during the next jvm restart for upgrades.

Mentioned in SAL (#wikimedia-analytics) [2021-04-26T06:32:18Z] <elukey> roll restart of hadoop-yarn-nodemanagers to pick up new log4j settings - T276906

The last step is to roll restart all hdfs daemons (journal and data nodes), but we can wait for the next openjdk upgrade (that will require a roll restart of all the jvms).

elukey moved this task from In Progress to Done on the Analytics-Kanban board.Apr 26 2021, 8:41 AM

• fdans closed this task as Resolved.Apr 29 2021, 2:56 PM

Configure the HDFS Namenodes to use the log4j rolling gzip appenderClosed, ResolvedPublicActions

Description

Details

Related Objects

Event Timeline

Configure the HDFS Namenodes to use the log4j rolling gzip appender
Closed, ResolvedPublic
Actions