Page MenuHomePhabricator

Failures registered by drop_event on an-launcher1002
Closed, ResolvedPublic


The drop_event unit on an-launcher1002 has been failing a couple of times for:

NoViableAltException(340@[213:1: tableName : (db= identifier DOT tab= identifier -> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME $tab) );])
	at org.antlr.runtime.DFA.noViableAlt(
	at org.antlr.runtime.DFA.predict(
	at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.tableName(
	at org.apache.hadoop.hive.ql.parse.HiveParser.tableName(
	at org.apache.hadoop.hive.ql.parse.HiveParser.showStatement(
	at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(
	at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(
	at org.apache.hadoop.hive.ql.parse.HiveParser.statement(
	at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(
	at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(
	at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(
	at org.apache.hadoop.hive.ql.Driver.compile(
	at org.apache.hadoop.hive.ql.Driver.compileInternal(
	at org.apache.hadoop.hive.ql.Driver.runInternal(
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(
	at org.apache.hadoop.hive.cli.CliDriver.processLine(
	at org.apache.hadoop.hive.cli.CliDriver.processLine(
	at org.apache.hadoop.hive.cli.CliDriver.executeDriver(
	at org.apache.hadoop.hive.cli.CliDriver.main(
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(
	at java.lang.reflect.Method.invoke(
	at org.apache.hadoop.util.RunJar.main(
FAILED: ParseException line 1:17 cannot recognize input near '2021' '-' '05' in table name

Event Timeline

I'll paste what I wrote in analytics-alerts in response to this:

I reran this manually last night with no issue. It looks like somehow a bad table name is making it into the list of tables from which the refinery-drop-older-than script is trying to drop partitions for.

I suspect that for some reason the refinery python hive.tables() function is failing due to some logging issue, but the command returns successfully so the bad output is making into return value of the list of tables. I can't reproduce this so I don't know exactly what is going on. Let's keep watching.

Mentioned in SAL (#wikimedia-analytics) [2021-05-24T15:19:10Z] <ottomata> rm -rf /tmp/analytics/* on an-launcher1002 - T283126

So, bad output contains an error like:

ERROR Unable to move file /tmp/analytics/hive.log.2021-06-01 to /tmp/analytics/hive.log.2021-06-01

Looking at that file, I see logs totally unrelated to drop_event (they look more like some reportupdater hive job logs). I think what is happening is that concurrent Hive CLI sessions are running at the same time on an-launcher1002, and the log4j using DailyRollingFileAppender is trying to work with the same log file from the different Hive CLI sessions.

The hive-log4j.conf file we have even has the comment:

# Use the PidDailyerRollingFileAppend class instead if you want to use separate log files
# for different CLI session.

Let's try PidDailyerRollingFileAppend.

Change 697802 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Hive log4j -= Use PidDailyRollingFileAppender for CLI sessions

Change 697802 merged by Ottomata:

[operations/puppet@production] Hive log4j -= Use PidDailyRollingFileAppender for CLI sessions

Change 698519 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] drop_event job - Use dedicated Hive CLI log file

Change 698519 merged by Ottomata:

[operations/puppet@production] drop_event job - Use dedicated Hive CLI log file