
Reportupdater output can be corrupted by hive logging
Closed, Resolved · Public

Description

In T273474, I found that the result column header was disappearing, and a random log line was concatenated to the end. See this output:

Feb 25, 2021 12:18:48 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 7 ms. row count = 3899
Feb 25, 2021 12:18:48 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but i
s org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Feb 25, 2021 12:18:48 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: Record


report_date     wiki    edit_count_bucket       template_dialog_add_known_param_success template_dialog_add_unknown_param_success       template_dialog_add_known_param_abort
        template_dialog_add_unknown_param_abort
2021-02-23      arwiki  100-999 edits   576     0       0       0
2021-02-23      arwiki  1000+ edits     1280    0       0       0
[....]
2021-02-23      zh_yuewiki      1000+ edits     16      0       0       0
2021-02-23      zhwiki  1000+ edits     16      0       0       0
Reader initialized will read a total of 3898 records.
Feb 25, 2021 12:18:48 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block

This seems to be a collision between log buffering and script output, and should probably be solved by replacing the grep -v parquet.hadoop clause with explicit silencing of the loggers.
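
For reference, the workaround currently baked into the query scripts looks roughly like this (a trivial query is substituted here for illustration):

# Existing workaround pattern in the reportupdater hive scripts: stderr is
# dropped and known parquet log lines are filtered out of stdout. A log
# fragment flushed mid-way through a TSV row still slips through, because
# grep -v can only drop whole matching lines.
hive -e "SELECT 1" 2> /dev/null | grep -v parquet.hadoop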

Event Timeline

A related issue is that SLF4J is complaining, and I believe this output cannot be suppressed without changing the config. Can we fix the root cause?

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
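
The duplicate bindings come from two jars on the classpath; a quick way to list them (a sketch, with the directories taken from the warning above):

# Show the SLF4J binding jars that both end up on Hive's classpath.
# Fixing the root cause would mean removing or excluding one of them.
find /usr/lib/hive/lib /usr/lib/hadoop/lib -name '*slf4j*'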

Playing with logging is uncovering more errors. When run with hive --hiveconf hive.root.logger=ERROR,console, you can see that each of our boilerplate cast(split('2021-02-11', '-')[2] as int) calls, included in every reportupdater query, actually causes a shower of these internal errors:

2021-02-25T14:29:03,877 ERROR [e2497dad-77df-493c-ad49-f3ea5cc83a38 main] optimizer.ConstantPropagateProcFactory: Unable to evaluate org.apache.hadoop.hive.ql.udf.generic.GenericUDFSplit@3f81621c. Return value unrecoginizable.

This error can be eliminated by doing the date parse in bash:

IFS='-' read -r -a date <<< "$1"

hive -e "
[....]
   WHERE MONTH = ${date[1]}
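
For illustration, a slightly fuller sketch of this pattern (the table and column names below are hypothetical; only the date handling reflects the actual change):

#!/bin/bash
# Usage: ./report.sh 2021-02-11
# Split the YYYY-MM-DD argument in bash, so the query no longer needs the
# cast(split(...)[n] as int) calls that trigger the UDF errors.
IFS='-' read -r -a date <<< "$1"

hive -e "
  SELECT COUNT(*)
  FROM hypothetical_db.hypothetical_events
  WHERE year  = ${date[0]}
    AND month = ${date[1]}
    AND day   = ${date[2]}
"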

I believe we're falling back to the default logging configuration set by the distribution; not sure if we want to customize it a bit?

Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-2.3.6.jar!/hive-log4j2.properties Async: true


I have much better results using beeline, I'll go ahead and rewrite my queries accordingly.
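
For reference, a minimal sketch of the kind of beeline invocation I mean (the JDBC URL is a placeholder, not our actual HiveServer2 address):

# beeline runs the query through HiveServer2 over JDBC, so the local CLI's
# parquet/log4j chatter never lands in the script's stdout.
beeline -u "jdbc:hive2://hiveserver.example.org:10000/default" \
    --silent=true -f toggles.hql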

Change 666932 had a related patch set uploaded (by Awight; owner: Awight):
[analytics/reportupdater-queries@master] Rewrite date match to avoid buggy UDF

https://gerrit.wikimedia.org/r/666932

Change 666942 had a related patch set uploaded (by Awight; owner: Awight):
[analytics/reportupdater-queries@master] Switch to beeline to avoid stray logging

https://gerrit.wikimedia.org/r/666942

Change 666948 had a related patch set uploaded (by Awight; owner: Awight):
[operations/puppet@production] parquet logging falls back to default file handler

https://gerrit.wikimedia.org/r/666948

Change 666942 abandoned by Awight:
[analytics/reportupdater-queries@master] Switch to beeline to avoid stray logging

Reason:
Beeline can't handle "-- " comments; it looks like it collapses all newlines in inline scripts.

https://gerrit.wikimedia.org/r/666942

Change 666932 merged by Mforns:
[analytics/reportupdater-queries@master] Rewrite date match to avoid buggy UDF

https://gerrit.wikimedia.org/r/666932

We have several jobs which have now been blocked for over a month because of this bug. The issue seems to have been present from the beginning, judging by the workarounds found in all reportupdater-queries hive scripts, where output is filtered through 2> /dev/null | grep -v parquet.hadoop. Unfortunately, this is not a complete solution: a log fragment flushed mid-way through a data line can't be removed by line-based grep, which is the interleaving problem shown in the description.

I'm not certain about my puppet patch because I haven't been able to test it directly, but something similar should work. The issue is simply that one facility is still logging to stdout.

There's evidence that someone already tried to address this in the file I've modified, but it's not having the intended effect:

# Override of global logging level is set to SEVERE
# to prevent logging not needed parquet info
java.util.logging.ConsoleHandler.level=SEVERE

awight triaged this task as High priority. Mar 30 2021, 11:18 AM

Commented on the patch about https://github.com/apache/hive/blob/branch-2.3/common/src/main/resources/parquet-logging.properties#L60, which probably matches our current hive version more closely.

I also deployed on an-launcher1002 the same patch for T276121, which caused some trouble when executing hive via systemd. Likely unrelated, but worth checking if anything changed.

In theory, I thought I would be able to test the patch by creating a copy of the properties file on a stat machine and running:

HADOOP_CLIENT_OPTS=-Djava.util.logging.config.file=java-logging.properties hive < toggles.hql
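
where java-logging.properties would be a local copy of the file from the patch, roughly along these lines (a sketch, not the actual patch content):

# Minimal java.util.logging config for the test: raise the console
# threshold so parquet's INFO messages stay off the console.
cat > java-logging.properties <<'EOF'
handlers = java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level = SEVERE
EOF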

But I'm suddenly unable to reproduce the problem at all, either from reportupdater or from the command line on stat1004. Is it possible that something got deployed today which fixes it? Or maybe it's because parquet only logs when new tables are loaded into memory, and once the tables are cached no more logging happens?

Either way, @elukey please feel free to experiment with our jobs. They're currently failing every day AFAIK, so nothing will be hurt by running them a bit extra. These are codemirror, templatedata, templatewizard, and visualeditor. I have to go async for today but will be excited to look over any outputs.

> I also deployed on an-launcher1002 the same patch for T276121, which caused some trouble when executing hive via systemd. Likely unrelated, but worth checking if anything changed.

I rebuilt and deployed new hive packages on stat100x and an-launcher1002 for T276121; that was a problem related to garbled output when executing hive via systemd, which is why I was asking you to re-check :D

Mentioned in SAL (#wikimedia-analytics) [2021-03-30T12:30:45Z] <elukey> restart reportupdater-codemirror on an-launcher1002 for T275757

In any case let's fix the parquet logging too; what I'd need is a minimal script.hql able to reproduce this problem (with steps for how to do it, etc.).

I'm strangely unable to produce any parquet logs at the moment, but will update here once I stumble across a good test case.

Even a daily failing query like codemirror/hive/toggles is working from the command line; no parquet logging shows up. In fact, I can't even get the logging to appear if I try. I'm curious to know whether the cron errors have changed. Cron failures can still be confirmed by exploring the output metrics; they have been null since Feb 24.

Checking the command-line tool's default environment,

hive -e 'set env:HADOOP_CLIENT_OPTS'

env:HADOOP_CLIENT_OPTS= \
    -Dproc_hivecli \
    -Djava.util.logging.config.file=/etc/hive/conf.analytics-hadoop/java-logging.properties \
    -Dlog4j.configurationFile=hive-log4j2.properties \
    -Djava.util.logging.config.file=/usr/lib/hive/bin/../conf/parquet-logging.properties

that parquet-logging.properties file doesn't exist. In fact, maybe this is the original bug? It would explain why the config file seems to do the right thing yet has no effect: once the property is overridden with a bad filename, the puppet-deployed file is never read.

Confirmed,

hive -e 'set system:java.util.logging.config.file'

system:java.util.logging.config.file=/usr/lib/hive/bin/../conf/parquet-logging.properties
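
One more check makes the picture consistent (assuming the usual JVM behaviour that, for duplicated -D flags, the last value wins):

# The path the effective -D flag points at does not exist, so neither it nor
# the puppet-deployed java-logging.properties is ever read, and parquet's
# INFO messages end up on the console.
ls -l /usr/lib/hive/conf/parquet-logging.properties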

This looks like the culprit, found in stat1004:/usr/lib/hive/bin/hive

if [ -f "${HIVE_CONF_DIR}/parquet-logging.properties" ]; then
  export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -Djava.util.logging.config.file=${HIVE_CONF_DIR}/parquet-logging.properties "
else
  export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -Djava.util.logging.config.file=$bin/../conf/parquet-logging.properties "
fi

I think the wrapper script is shipped as part of the package, so instead of changing it we could install the logging properties in the second expected location. I've updated the puppet patch to demonstrate what I mean. But I'm still just guessing, since I haven't been able to reproduce the problem this afternoon.
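
Outside of puppet, the manual equivalent would be roughly this (assuming HIVE_CONF_DIR is /etc/hive/conf.analytics-hadoop, as the -D flags above suggest; untested):

# Deploy a real parquet-logging.properties to one of the two locations the
# wrapper checks, so that its -D flag finally points at a file that exists.
sudo install -m 0644 parquet-logging.properties \
    /etc/hive/conf.analytics-hadoop/parquet-logging.properties
# or, for the wrapper's fallback branch:
#   /usr/lib/hive/conf/parquet-logging.properties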

Change 675907 had a related patch set uploaded (by Awight; author: Awight):

[operations/puppet@production] Let hive use the default logging config path

https://gerrit.wikimedia.org/r/675907

Change 675907 merged by Elukey:

[operations/puppet@production] hive: use the default logging config path for parquet logs

https://gerrit.wikimedia.org/r/675907

Applied the patch with some tweaks; it is definitely a cleaner and more modern approach, thanks!

I see some errors for the various timers though:

Apr 01 05:02:13 an-launcher1002 reportupdater-templatedata[8907]: ValueError: Invalid metric name "MediaWiki.TemplateData.template.template-description-change.actionCount.byEditCount.100-999 edits.byWiki.bowiki"

Apr 01 05:01:58 an-launcher1002 reportupdater-templatewizard[8911]: ValueError: Invalid metric name "MediaWiki.TemplateWizard.save.byEditCount.1000+ edits.byWiki.afwiki"

Apr 01 07:28:07 an-launcher1002 reportupdater-codemirror[8627]: 2021-04-01 07:28:07,337 - CRITICAL - Invalid metric name "MediaWiki.CodeMirror.sessions.byEditor.wikitext_2010.byEnabled.true.byEditCount.1000+ edits.byWiki.adywiki"

> Applied the patch with some tweaks; it is definitely a cleaner and more modern approach, thanks!
>
> I see some errors for the various timers though:
>
> Apr 01 05:02:13 an-launcher1002 reportupdater-templatedata[8907]: ValueError: Invalid metric name "MediaWiki.TemplateData.template.template-description-change.actionCount.byEditCount.100-999 edits.byWiki.bowiki"
>
> Apr 01 05:01:58 an-launcher1002 reportupdater-templatewizard[8911]: ValueError: Invalid metric name "MediaWiki.TemplateWizard.save.byEditCount.1000+ edits.byWiki.afwiki"
>
> Apr 01 07:28:07 an-launcher1002 reportupdater-codemirror[8627]: 2021-04-01 07:28:07,337 - CRITICAL - Invalid metric name "MediaWiki.CodeMirror.sessions.byEditor.wikitext_2010.byEnabled.true.byEditCount.1000+ edits.byWiki.adywiki"

Thanks for forwarding these. I think they show that the queries are advancing to the final result parsing, and that they were not interrupted by illegal CSV lines caused by this bug! I'll still hold off on closing this task until we see the full pipeline succeeding for a few days, if you agree.

I'll try to fix the remaining errors in a separate task.

Change 666948 abandoned by Awight:

[operations/puppet@production] Stop logging parquet to the console

Reason:

The predecessor now includes better changes to logging properties.

https://gerrit.wikimedia.org/r/666948

elukey@an-launcher1002:~$ ls -l /tmp/parquet-0.log
-rw-r--r-- 1 analytics analytics 7521 Apr  1 09:09 /tmp/parquet-0.log

This is a very nice confirmation of the new logging style, it seems to be working :)

@awight what should I check to see if the jobs are working? I tried the Graphite metrics but they are not updated, so I'm not sure what to look at to tell whether the changes are working or not :)

> This is a very nice confirmation of the new logging style, it seems to be working :)

Hehe, but as I was saying in Gerrit, maybe we don't need these logs at all?

> @awight what should I check to see if the jobs are working? I tried the Graphite metrics but they are not updated, so I'm not sure what to look at to tell whether the changes are working or not :)

Thanks, it looks like the queries are now failing further downstream, which is progress (T279046). Once they work, graphs like this will show new data: https://grafana.wikimedia.org/goto/2hvd4DlMz

awight claimed this task.
awight removed a project: Patch-For-Review.

@elukey I'm seeing this in reportupdater logs, probably a bad sign:

Jun 1 06:29:17 an-launcher1002 reportupdater-codemirror[20213]: log4j:ERROR No output stream or file set for the appender named [console].

I can't say there's any logging obviously going missing (but this job is mysteriously failing, so maybe?). Even stranger, I don't see the same line in the output from the reportupdater jobs in the language directory.