
Incorrect log4j configuration on hive servers causing excessive logging
Closed, Resolved (Public)

Description

Saw this on alerts.wikimedia.org:

DISK WARNING - free space: / 3084 MB (6% inode=86%): /tmp 3084 MB (6% inode=86%): /var/tmp 3084 MB (6% inode=86%)

instance: an-coord1001

Confirmed with df:

razzi@an-coord1001:~$ df -h
df: /mnt/hdfs: Input/output error
Filesystem                           Size  Used Avail Use% Mounted on
udev                                  32G     0   32G   0% /dev
tmpfs                                6.3G  666M  5.7G  11% /run
/dev/md0                              46G   41G  3.1G  94% /
tmpfs                                 32G     0   32G   0% /dev/shm
tmpfs                                5.0M     0  5.0M   0% /run/lock
tmpfs                                 32G     0   32G   0% /sys/fs/cgroup
/dev/mapper/an--coord1001--vg-srv    102G   46G   56G  46% /srv
/dev/mapper/an--coord1001--vg-mysql   59G   49G   11G  82% /var/lib/mysql
tmpfs                                6.3G     0  6.3G   0% /run/user/13926
tmpfs                                6.3G     0  6.3G   0% /run/user/124
tmpfs                                6.3G     0  6.3G   0% /run/user/2129
tmpfs                                6.3G     0  6.3G   0% /run/user/26051

Looking at the logs, /var/log/hive is the largest by far:

du -sh /var/log/* | sort -h
...
12M     /var/log/auth.log.1
80M     /var/log/account
2.1G    /var/log/oozie
14G     /var/log/hive

The logs there date back to February, which is strange, considering log4j seems to be configured to only have 2 log files of 256m each: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/bigtop/templates/hive/hive-exec-log4j.properties.erb#16
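
For reference, a two-file 256 MB cap in log4j 1.x is normally expressed along these lines (a sketch of the general pattern, not the exact contents of that template):

log4j.appender.FA=org.apache.log4j.RollingFileAppender
log4j.appender.FA.File=${hive.log.dir}/${hive.log.file}
log4j.appender.FA.MaxFileSize=256MB
# the live file plus one backup = two files of up to 256 MB each
log4j.appender.FA.MaxBackupIndex=1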

Removing logs from February for now. We should find a long-term strategy for ensuring old logs get deleted.

We might want to look into the mysql data size as well.

Event Timeline

I had to run sudo find -mtime +15 -exec rm -f {} \; in /var/log/hive today; we probably want a gzip rolling appender like the one we set up for T276906.
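
For reference, in log4j2 a gzipped daily roller only needs the filePattern to end in .gz — a sketch assuming the standard DRFA appender names (the actual T276906 change may differ):

appender.DRFA.type = RollingRandomAccessFile
appender.DRFA.name = DRFA
appender.DRFA.fileName = ${sys:hive.log.dir}/${sys:hive.log.file}
# the .gz suffix makes log4j2 compress each rolled file
appender.DRFA.filePattern = ${sys:hive.log.dir}/${sys:hive.log.file}.%d{yyyy-MM-dd}.gz
appender.DRFA.layout.type = PatternLayout
appender.DRFA.layout.pattern = %d{ISO8601} %5p [%t] %c{2}: %m%n
appender.DRFA.policies.type = Policies
appender.DRFA.policies.time.type = TimeBasedTriggeringPolicy
appender.DRFA.policies.time.interval = 1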

@razzi:

The hive-log4j.properties file controls the logging of the CLI or other locally launched components. The hive-exec-log4j.properties file controls the logging inside the MapReduce tasks.

We do have the hive-log4j.properties in /etc/hive/conf on an-coord1001, and it seems configured to roll files daily (that is consistent with what we see), but it is also configured with a max backup index of 14, which does not seem to be working. Maybe we need to update those settings?
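
A likely explanation (a known log4j 1.x limitation, worth confirming against our actual file): MaxBackupIndex is only honoured by RollingFileAppender, while DailyRollingFileAppender rolls by date and never deletes old files. A sketch of the pattern in question (the exact lines in our file may differ):

log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFA.File=${hive.log.dir}/${hive.log.file}
log4j.appender.DRFA.DatePattern=.yyyy-MM-dd
# silently ignored: DailyRollingFileAppender has no MaxBackupIndex option
log4j.appender.DRFA.MaxBackupIndex=14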

We should also revisit Oozie's log settings: the oozie.log is gzipped but rolled over every hour, as far as I can see (maybe we should make it daily).

Ottomata triaged this task as Medium priority.
Ottomata moved this task from Backlog to Q4 2020/2021 on the Analytics-Clusters board.

Mentioned in SAL (#wikimedia-analytics) [2021-07-17T08:54:34Z] <elukey> run 'sudo find -type f -name '*.log*' -mtime +30 -delete' on an-coord1001:/var/log/hive to free space (root partition almost filled up) - T279304

If possible I'd raise the priority of this one, maybe with a short-term solution like a systemd timer that executes a log cleanup daily (e.g. removing files with -mtime +30), as sketched below.
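
A minimal sketch of such a timer (the unit name and the 30-day threshold are illustrative, not an existing unit):

# /etc/systemd/system/hive-log-cleanup.service
[Unit]
Description=Delete hive logs older than 30 days

[Service]
Type=oneshot
ExecStart=/usr/bin/find /var/log/hive -type f -name '*.log*' -mtime +30 -delete

# /etc/systemd/system/hive-log-cleanup.timer
[Unit]
Description=Run the hive log cleanup daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

It would then be enabled with sudo systemctl enable --now hive-log-cleanup.timer.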

I'm happy to take a look at this one this week if it helps, @razzi.
It looks like it will be back in a warning state in about a week and critical in just over three weeks, at the current rate.

(Screenshot: Grafana disk-space trend graph for an-coord1001)

https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=an-coord1001&var-datasource=thanos&var-cluster=analytics&from=1618911424246&to=1626687424247

I can leave it to you if you want to investigate the hive-log4j.properties issue more thoroughly.

I've checked an-test-coord1001 and it's definitely exhibiting the same behaviour as an-coord1001 and an-coord1002 in that the setting of:

log4j.appender.DRFA.MaxBackupIndex=14

...doesn't seem to be getting applied correctly. If we can find a fix for this it will likely affect all hosts in the same way.

Possible clue at the top of /var/log/hive/hive-server2.out on each of the three coordinator servers.

btullis@an-coord1001:/var/log/hive$ head /var/log/hive/hive-server2.out 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]


So far I haven't seen anything else that looks odd in the configuration or log files, though.

@BTullis one thing that I noticed in the command-line args of the hive server/metastore processes is -Dlog4j.configurationFile=hive-log4j2.properties, which in theory should refer to a file under /etc/hive/conf that doesn't exist, as far as I can see. Maybe our filename is not the correct one, so the config doesn't get picked up?

Should be easy to test: manually copy the hive-log4j.properties file to hive-log4j2.properties on the test node and restart the daemons.
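
(If needed, the flag can be confirmed from the running processes with something like ps -ef | grep -o 'log4j.configurationFile=[^ ]*' — an illustrative command, not output from this investigation.)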

Great catch @elukey. I'll try that test today if I can get agreement to restart the daemons.

As suggested, I have taken a manual copy of the configuration file:

btullis@an-test-coord1001:/etc/hive/conf$ sudo cp hive-log4j.properties hive-log4j2.properties

I restarted the services like this:

btullis@an-test-coord1001:/etc/hive/conf$ sudo systemctl stop hive-server2.service hive-metastore.service 
btullis@an-test-coord1001:/etc/hive/conf$ sudo systemctl start hive-metastore.service hive-server2.service

Interesting test results: the services seem to be up, but nothing is being written to the logs.

We now see this at the top of the hive-metastore.out file: the same warnings as before, but also one error about the [console] appender.

2021-07-28 10:45:42: Starting Hive Metastore Server
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
log4j:ERROR No output stream or file set for the appender named [console].

Then again, maybe it's fine. I think we have most of the log levels set to WARN or ERROR, apart from our custom logging levels, which are all set to INFO. Some excerpts from the file:

hive.root.logger=WARN,DRFA

log4j.category.DataNucleus=ERROR,DRFA
log4j.category.Datastore=ERROR,DRFA
log4j.category.Datastore.Schema=ERROR,DRFA
log4j.category.JPOX.Datastore=ERROR,DRFA
log4j.category.JPOX.Plugin=ERROR,DRFA
log4j.category.JPOX.MetaData=ERROR,DRFA
log4j.category.JPOX.Query=ERROR,DRFA
log4j.category.JPOX.General=ERROR,DRFA
log4j.category.JPOX.Enhancer=ERROR,DRFA


# Silence useless ZK logs
log4j.logger.org.apache.zookeeper.server.NIOServerCnxn=WARN,DRFA
log4j.logger.org.apache.zookeeper.ClientCnxnSocketNIO=WARN,DRFA

#custom logging levels
log4j.logger.org.apache.hadoop.hive.ql.parse.SemanticAnalyzer=INFO
log4j.logger.org.apache.hadoop.hive.ql.Driver=INFO
log4j.logger.org.apache.hadoop.hive.ql.exec.mr.ExecDriver=INFO
log4j.logger.org.apache.hadoop.hive.ql.exec.mr.MapRedTask=INFO
log4j.logger.org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask=INFO
log4j.logger.org.apache.hadoop.hive.ql.exec.Task=INFO
log4j.logger.org.apache.hadoop.hive.ql.session.SessionState=INFO

I'll see if I can trigger one of these INFO level logs.
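
For example, something like this should exercise the Driver and SessionState INFO loggers listed above (an illustrative command; an-test-coord1001 and HiveServer2's default port 10000 are assumptions):

beeline -u 'jdbc:hive2://an-test-coord1001.eqiad.wmnet:10000/default' -e 'SHOW TABLES;'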

I haven't managed to get anything to appear in those logs yet, although I've only tried a couple of test queries from the hive CLI.

I'm wondering whether we should look at replacing our current hive-log4j.properties with a hive-log4j2.properties that is closer to the upstream version:
https://github.com/apache/hive/blob/branch-2.3/data/conf/hive-log4j2.properties

I can see that we renamed the version from the CDH distribution when we switched to Bigtop:
https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/a3ab8b5caf02da75606924298e4e37b2281bbcbd

I'll give the vanilla upstream version a try on the test cluster, but I'm conscious that I don't want to undo any improvements made by @Ottomata in T283126.

Those improvements mostly ended up as a workaround. Maybe switching to a log4j2 config would avoid the need for them in the first place?

Thanks Ben!

I have found the templates that seem to have been supplied with the original Bigtop distro and copied both the hive-log4j2.properties and hive-exec-log4j2.properties template files into place.

btullis@an-test-coord1001:/etc/hive/conf$ sudo cp ../conf.dist/hive-log4j2.properties.template ./hive-log4j2.properties
btullis@an-test-coord1001:/etc/hive/conf$ sudo cp ../conf.dist/hive-exec-log4j2.properties.template ./hive-exec-log4j2.properties

Interestingly, the hive-log4j2.properties file has this to say about discrete files for different CLI sessions (ref: https://phabricator.wikimedia.org/T283126):

# Use %pid in the filePattern to append <process-id>@<host-name> to the filename if you want separate log files for different CLI session
appender.DRFA.filePattern = ${sys:hive.log.dir}/${sys:hive.log.file}.%d{yyyy-MM-dd}
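
(So a per-session pattern would presumably look something like appender.DRFA.filePattern = ${sys:hive.log.dir}/${sys:hive.log.file}.%pid.%d{yyyy-MM-dd} — hypothetical, not something we are deploying.)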

I'll restart the test hive services with these files anyway, to see how it behaves.

That manual restart seems to be working well with these vanilla log4j2 configurations.
I'll prepare a patch to deploy these with suitable tweaks, instead of the existing log4j files.

Change 708770 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Update hive to log4j version2 configuration files

https://gerrit.wikimedia.org/r/708770

BTullis renamed this task from "Disk filling up on `/` on an-coord1001" to "Incorrect log4j configuration on hive servers causing excessive logging". Jul 29 2021, 2:19 PM

Change 708770 merged by Btullis:

[operations/puppet@production] Update hive to log4j version2 configuration files

https://gerrit.wikimedia.org/r/708770

Change 709484 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Set hive default log4j version to 2

https://gerrit.wikimedia.org/r/709484

Change 709484 merged by Btullis:

[operations/puppet@production] Set hive default log4j version to 2

https://gerrit.wikimedia.org/r/709484

This change is now merged and deployed. As expected, the hive-server2 and hive-metastore processes have not been restarted automatically.

btullis@an-coord1001:~$ sudo puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Retrieving locales
Info: Loading facts
Info: Caching catalog for an-coord1001.eqiad.wmnet
Info: Unable to serialize catalog to json, retrying with pson
Info: Applying configuration version '(f3cfa8e509) Btullis - Set hive default log4j version to 2'
Info: Computing checksum on file /etc/hive/conf.analytics-hadoop/hive-log4j.properties
Info: /Stage[main]/Bigtop::Hive/File[/etc/hive/conf.analytics-hadoop/hive-log4j.properties]: Filebucketed /etc/hive/conf.analytics-hadoop/hive-log4j.properties to puppet with sum b1f9b9018b7e38aec0f43a04c1e8341d
Notice: /Stage[main]/Bigtop::Hive/File[/etc/hive/conf.analytics-hadoop/hive-log4j.properties]/ensure: removed
Notice: /Stage[main]/Bigtop::Hive/File[/etc/hive/conf.analytics-hadoop/hive-log4j2.properties]/ensure: defined content as '{md5}c99a54f88f4a68545b3be6a05848a5cb'
Info: Computing checksum on file /etc/hive/conf.analytics-hadoop/hive-exec-log4j.properties
Info: /Stage[main]/Bigtop::Hive/File[/etc/hive/conf.analytics-hadoop/hive-exec-log4j.properties]: Filebucketed /etc/hive/conf.analytics-hadoop/hive-exec-log4j.properties to puppet with sum 2b684615ecdb565d4223bfc0ff48cad4
Notice: /Stage[main]/Bigtop::Hive/File[/etc/hive/conf.analytics-hadoop/hive-exec-log4j.properties]/ensure: removed
Notice: /Stage[main]/Bigtop::Hive/File[/etc/hive/conf.analytics-hadoop/hive-exec-log4j2.properties]/ensure: defined content as '{md5}683aedac2781aa469ae25161dfe3ecb0'
Notice: Applied catalog in 40.98 seconds
btullis@an-coord1001:~$ sudo systemctl status hive-server2
● hive-server2.service - LSB: Hive Server2
   Loaded: loaded (/etc/init.d/hive-server2; generated)
   Active: active (running) since Thu 2021-07-01 18:23:12 UTC; 1 months 1 days ago

I now need to look at scheduling a maintenance window for restarting the hive servers on an-coord1001.

Change 709956 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Failover analytics-hive service to standby server

https://gerrit.wikimedia.org/r/709956

I have prepared a DNS change to effect the failover from an-coord1001 to an-coord1002.

https://gerrit.wikimedia.org/r/709956

The hive-server2 and hive-metastore services are running correctly on an-coord1002, but I will restart them at the beginning of the maintenance window to make sure that they restart successfully, then merge and deploy the DNS change.

The commands I will use to do the restart will be:

sudo systemctl stop hive-server2 && sudo systemctl stop hive-metastore

sudo systemctl start hive-metastore && sudo systemctl start hive-server2

Change 709956 merged by Btullis:

[operations/dns@master] Failover analytics-hive service to standby server

https://gerrit.wikimedia.org/r/709956

btullis@authdns1001:~$ sudo authdns-update 
Updating authdns1001.wikimedia.org (self)...
Pulling the current revision from https://gerrit.wikimedia.org/r/operations/dns.git
Reviewing 2fbdd2d71579caf4e346410557104c69317bfe3e...

 templates/wmnet | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git templates/wmnet templates/wmnet
index aec000ea..8f6cbfc7 100644
--- templates/wmnet
+++ templates/wmnet
@@ -59,7 +59,7 @@ tendril-backend 5M  IN CNAME    db1115.eqiad.wmnet.
 zarcillo-master 5M  IN CNAME    db1115.eqiad.wmnet.
 
 analytics-test-hive   5M  IN CNAME    an-test-coord1001.eqiad.wmnet.
-analytics-hive        5M  IN CNAME    an-coord1001.eqiad.wmnet.
+analytics-hive        5M  IN CNAME    an-coord1002.eqiad.wmnet.
 analytics-test-presto 5M  IN CNAME    an-test-coord1001.eqiad.wmnet.
 s1-analytics-replica  5M  IN CNAME    dbstore1003.eqiad.wmnet.
 s2-analytics-replica  5M  IN CNAME    dbstore1007.eqiad.wmnet.

Merge these changes? (yes/no)?

DNS change successfully applied.

btullis@marlin:~/wmf/dns$ for i in 0 1 2 ; do dig @ns${i}.wikimedia.org -t any +short analytics-hive.eqiad.wmnet ; done
an-coord1002.eqiad.wmnet.
an-coord1002.eqiad.wmnet.
an-coord1002.eqiad.wmnet.

Restart of services on an-coord1001 complete.

btullis@an-coord1001:~$ sudo systemctl stop hive-server2.service hive-metastore.service
btullis@an-coord1001:~$ sudo systemctl start hive-metastore.service hive-server2.service
btullis@an-coord1001:~$ sudo systemctl status hive-server2 hive-metastore
● hive-server2.service - LSB: Hive Server2
   Loaded: loaded (/etc/init.d/hive-server2; generated)
   Active: active (running) since Wed 2021-08-04 09:12:16 UTC; 17s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 14348 ExecStart=/etc/init.d/hive-server2 start (code=exited, status=0/SUCCESS)
 Main PID: 14396 (java)
    Tasks: 64 (limit: 4915)
   Memory: 986.5M
   CGroup: /system.slice/hive-server2.service
           └─14396 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx256m -Xms8g -Xmx8g -XX:+UseG1GC -XX:+UseStringDeduplication -XX:MaxGCPauseMillis=1000 -Djava.net.preferIPv4Stack=false -javaagent:/usr/share/

Aug 04 09:12:13 an-coord1001 systemd[1]: Starting LSB: Hive Server2...
Aug 04 09:12:13 an-coord1001 runuser[14393]: pam_unix(runuser:session): session opened for user hive by (uid=0)
Aug 04 09:12:13 an-coord1001 runuser[14393]: pam_unix(runuser:session): session closed for user hive
Aug 04 09:12:16 an-coord1001 hive-server2[14348]: Started Hive Server2 (hive-server2):.
Aug 04 09:12:16 an-coord1001 systemd[1]: Started LSB: Hive Server2.

● hive-metastore.service - LSB: Hive Metastore
   Loaded: loaded (/etc/init.d/hive-metastore; generated)
   Active: active (running) since Wed 2021-08-04 09:12:16 UTC; 17s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 14347 ExecStart=/etc/init.d/hive-metastore start (code=exited, status=0/SUCCESS)
 Main PID: 14397 (java)
    Tasks: 71 (limit: 4915)
   Memory: 1.1G
   CGroup: /system.slice/hive-metastore.service
           └─14397 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xmx256m -Xms4g -Xmx4g -XX:+UseG1GC -XX:+UseStringDeduplication -XX:MaxGCPauseMillis=1000 -Djava.net.preferIPv4Stack=false -javaagent:/usr/share/

Aug 04 09:12:13 an-coord1001 systemd[1]: Starting LSB: Hive Metastore...
Aug 04 09:12:13 an-coord1001 hive-metastore[14347]: Starting Hive Metastore (hive-metastore):.
Aug 04 09:12:13 an-coord1001 runuser[14392]: pam_unix(runuser:session): session opened for user hive by (uid=0)
Aug 04 09:12:13 an-coord1001 runuser[14392]: pam_unix(runuser:session): session closed for user hive
Aug 04 09:12:16 an-coord1001 systemd[1]: Started LSB: Hive Metastore.

Change 709962 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/dns@master] Failback analytics-hive to the primary server

https://gerrit.wikimedia.org/r/709962

Change 709962 merged by Btullis:

[operations/dns@master] Failback analytics-hive to the primary server

https://gerrit.wikimedia.org/r/709962

DNS failback patch applied.

btullis@marlin:~/wmf/dns$ for i in 0 1 2 ; do dig @ns${i}.wikimedia.org -t any +short analytics-hive.eqiad.wmnet ; done
an-coord1001.eqiad.wmnet.
an-coord1001.eqiad.wmnet.
an-coord1001.eqiad.wmnet.

Tested with hive and beeline from stat1008, and all seems well following the fail-back.

I've sent an email to analytics-announce@lists.wikimedia.org to say that the maintenance is complete, but asking for feedback if any users notice anything unusual about the logging.
This might be the case if users were parsing the INFO level logs, for example.

Quick question - are the hive logs (under /var/log/hive) on an-coord100* nodes rotated/dropped correctly now?

Good question. It looks, from an-test-coord1001, like the pre-existing logs older than 14 days aren't being purged, so I may need to remove those manually.
However, hive-metastore.log.2021-08-03 and hive-server2.log.2021-08-03 both exist and only contain WARN messages, so I believe that rotation is happening correctly and that the configuration file is being used now.

I should check back in 14 days to make sure that appender.DRFA.strategy.max = 14 is being applied; otherwise we're only part of the way there.
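
One caveat worth flagging for that check (based on log4j2's documented behaviour, so worth verifying here): DefaultRolloverStrategy's max only bounds the %i counter in a filePattern, so with a purely date-based pattern it can be ignored and old files accumulate. If that turns out to be the case, a Delete action on rollover is the usual fix — a sketch, assuming the DRFA appender from the upstream file and a 14-day retention:

appender.DRFA.strategy.type = DefaultRolloverStrategy
appender.DRFA.strategy.max = 14
# delete rolled hive logs older than 14 days on each rollover
appender.DRFA.strategy.delete.type = Delete
appender.DRFA.strategy.delete.basePath = ${sys:hive.log.dir}
appender.DRFA.strategy.delete.ifFileName.type = IfFileName
appender.DRFA.strategy.delete.ifFileName.glob = ${sys:hive.log.file}.*
appender.DRFA.strategy.delete.ifLastModified.type = IfLastModified
appender.DRFA.strategy.delete.ifLastModified.age = 14d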