
Reportupdater output can be corrupted by hive logging
Closed, Resolved · Public

Description

In T273474, I found that the result column header was disappearing, and a random log line was concatenated to the end. See this output:

Feb 25, 2021 12:18:48 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 7 ms. row count = 3899
Feb 25, 2021 12:18:48 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but i
s org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Feb 25, 2021 12:18:48 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: Record


report_date     wiki    edit_count_bucket       template_dialog_add_known_param_success template_dialog_add_unknown_param_success       template_dialog_add_known_param_abort
        template_dialog_add_unknown_param_abort
2021-02-23      arwiki  100-999 edits   576     0       0       0
2021-02-23      arwiki  1000+ edits     1280    0       0       0
[....]
2021-02-23      zh_yuewiki      1000+ edits     16      0       0       0
2021-02-23      zhwiki  1000+ edits     16      0       0       0
Reader initialized will read a total of 3898 records.
Feb 25, 2021 12:18:48 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block

This seems to be a collision between log buffering and script output, and should probably be solved by replacing the grep -v parquet.hadoop clause with explicit silencing of the loggers.
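
For reference, the workaround currently baked into the query scripts looks roughly like this (a trivial query is substituted here for illustration):

# Existing workaround pattern in the reportupdater hive scripts: stderr is
# dropped and known parquet log lines are filtered out of stdout. A log
# fragment flushed mid-way through a TSV row still slips through, because
# grep -v can only drop whole matching lines.
hive -e "SELECT 1" 2> /dev/null | grep -v parquet.hadoop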

Event Timeline

A related issue is that SLF4J is complaining, and I believe this output cannot be suppressed without changing the config. Can we fix the root cause?

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
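
The duplicate bindings come from two jars on the classpath; a quick way to list them (a sketch, with the directories taken from the warning above):

# Show the SLF4J binding jars that both end up on Hive's classpath.
# Fixing the root cause would mean removing or excluding one of them.
find /usr/lib/hive/lib /usr/lib/hadoop/lib -name '*slf4j*'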

Playing with logging is uncovering more errors. When run with hive --hiveconf hive.root.logger=ERROR,console, you can see that each of our boilerplate cast(split('2021-02-11', '-')[2] as int) calls, included in every reportupdater query, actually causes a shower of these internal errors:

2021-02-25T14:29:03,877 ERROR [e2497dad-77df-493c-ad49-f3ea5cc83a38 main] optimizer.ConstantPropagateProcFactory: Unable to evaluate org.apache.hadoop.hive.ql.udf.generic.GenericUDFSplit@3f81621c. Return value unrecoginizable.

This error can be eliminated by doing the date parse in bash:

IFS='-' read -r -a date <<< "$1"

hive -e "
[....]
   WHERE MONTH = ${date[1]}
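
For illustration, a slightly fuller sketch of this pattern (the table and column names below are hypothetical; only the date handling reflects the actual change):

#!/bin/bash
# Usage: ./report.sh 2021-02-11
# Split the YYYY-MM-DD argument in bash, so the query no longer needs the
# cast(split(...)[n] as int) calls that trigger the UDF errors.
IFS='-' read -r -a date <<< "$1"

hive -e "
  SELECT COUNT(*)
  FROM hypothetical_db.hypothetical_events
  WHERE year  = ${date[0]}
    AND month = ${date[1]}
    AND day   = ${date[2]}
"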

I believe we're falling back to the default logging configuration set by the distribution; not sure if we want to customize it a bit?

Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-2.3.6.jar!/hive-log4j2.properties Async: true


I have much better results using beeline, I'll go ahead and rewrite my queries accordingly.
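
For reference, a minimal sketch of the kind of beeline invocation I mean (the JDBC URL is a placeholder, not our actual HiveServer2 address):

# beeline runs the query through HiveServer2 over JDBC, so the local CLI's
# parquet/log4j chatter never lands in the script's stdout.
beeline -u "jdbc:hive2://hiveserver.example.org:10000/default" \
    --silent=true -f toggles.hql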

Change 666932 had a related patch set uploaded (by Awight; owner: Awight):
[analytics/reportupdater-queries@master] Rewrite date match to avoid buggy UDF

https://gerrit.wikimedia.org/r/666932

Change 666942 had a related patch set uploaded (by Awight; owner: Awight):
[analytics/reportupdater-queries@master] Switch to beeline to avoid stray logging

https://gerrit.wikimedia.org/r/666942

Change 666948 had a related patch set uploaded (by Awight; owner: Awight):
[operations/puppet@production] parquet logging falls back to default file handler

https://gerrit.wikimedia.org/r/666948

Change 666942 abandoned by Awight:
[analytics/reportupdater-queries@master] Switch to beeline to avoid stray logging

Reason:
Beeline can't handle "-- " comments; it looks like it collapses all newlines in inline scripts.

https://gerrit.wikimedia.org/r/666942

Change 666932 merged by Mforns:
[analytics/reportupdater-queries@master] Rewrite date match to avoid buggy UDF

https://gerrit.wikimedia.org/r/666932

We have several jobs which have now been blocked for over a month because of this bug. The issue seems to have been present from the beginning, judging by the workarounds found in all reportupdater-queries hive scripts, where output is filtered through 2> /dev/null | grep -v parquet.hadoop. Unfortunately, this is not a complete solution: a log fragment flushed mid-way through a data line can't be removed by line-based grep, which is the interleaving problem shown in the description.

I'm not certain about my puppet patch because I haven't been able to test it directly, but something similar should work. The issue is simply that one facility is still logging to stdout.

There's evidence that someone already tried to address this in the file I've modified, but it's not having the intended effect:

# Override of global logging level is set to SEVERE
# to prevent logging not needed parquet info
java.util.logging.ConsoleHandler.level=SEVERE

awight triaged this task as High priority. Mar 30 2021, 11:18 AM

Commented on the patch about https://github.com/apache/hive/blob/branch-2.3/common/src/main/resources/parquet-logging.properties#L60, which probably matches our current hive version more closely.

I also deployed on an-launcher1002 the same patch for T276121, which caused some trouble when executing hive via systemd. Likely unrelated, but worth checking if anything changed.

In theory, I thought I would be able to test the patch by creating a copy of the properties file on a stat machine and running:

HADOOP_CLIENT_OPTS=-Djava.util.logging.config.file=java-logging.properties hive < toggles.hql
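
where java-logging.properties would be a local copy of the file from the patch, roughly along these lines (a sketch, not the actual patch content):

# Minimal java.util.logging config for the test: raise the console
# threshold so parquet's INFO messages stay off the console.
cat > java-logging.properties <<'EOF'
handlers = java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level = SEVERE
EOF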

But I'm suddenly unable to reproduce the problem at all, either from reportupdater or from the command line on stat1004. Is it possible that something got deployed today which fixes it? Or maybe it's because parquet only logs when new tables are loaded into memory, and once the tables are cached no more logging happens?

Either way, @elukey please feel free to experiment with our jobs. They're currently failing every day AFAIK, so nothing will be hurt by running them a bit extra. These are codemirror, templatedata, templatewizard, and visualeditor. I have to go async for today but will be excited to look over any outputs.

> I also deployed on an-launcher1002 the same patch for T276121, which caused some trouble when executing hive via systemd. Likely unrelated, but worth checking if anything changed.

I rebuilt and deployed new hive packages on stat100x and an-launcher1002 for T276121; that was a problem related to garbled output when executing hive via systemd, which is why I was asking you to re-check :D

Mentioned in SAL (#wikimedia-analytics) [2021-03-30T12:30:45Z] <elukey> restart reportupdater-codemirror on an-launcher1002 for T275757

In any case let's fix the parquet logging too; what I'd need is a minimal script.hql able to reproduce this problem (with steps for how to do it, etc.).

I'm strangely unable to produce any parquet logs at the moment, but will update here once I stumble across a good test case.

Even a daily failing query like codemirror/hive/toggles is working from the command line; no parquet logging shows up. In fact, I can't even get the logging to appear if I try. I'm curious to know whether the cron errors have changed. Cron failures can still be confirmed by exploring the output metrics; they have been null since Feb 24.

Checking the command-line tool's default environment,

hive -e 'set env:HADOOP_CLIENT_OPTS'

env:HADOOP_CLIENT_OPTS= \
    -Dproc_hivecli \
    -Djava.util.logging.config.file=/etc/hive/conf.analytics-hadoop/java-logging.properties \
    -Dlog4j.configurationFile=hive-log4j2.properties \
    -Djava.util.logging.config.file=/usr/lib/hive/bin/../conf/parquet-logging.properties

that parquet-logging.properties file doesn't exist. In fact, maybe this is the original bug? It would explain why the config file seems to do the right thing yet has no effect: once the property is overridden with a bad filename, the puppet-deployed file is never read.

Confirmed,

hive -e 'set system:java.util.logging.config.file'

system:java.util.logging.config.file=/usr/lib/hive/bin/../conf/parquet-logging.properties
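
One more check makes the picture consistent (assuming the usual JVM behaviour that, for duplicated -D flags, the last value wins):

# The path the effective -D flag points at does not exist, so neither it nor
# the puppet-deployed java-logging.properties is ever read, and parquet's
# INFO messages end up on the console.
ls -l /usr/lib/hive/conf/parquet-logging.properties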

This looks like the culprit, found in stat1004:/usr/lib/hive/bin/hive

if [ -f "${HIVE_CONF_DIR}/parquet-logging.properties" ]; then
  export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -Djava.util.logging.config.file=${HIVE_CONF_DIR}/parquet-logging.properties "
else
  export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -Djava.util.logging.config.file=$bin/../conf/parquet-logging.properties "
fi

I think the wrapper script is shipped as part of the package, so instead of changing it we could install the logging properties in the second expected location. I've updated the puppet patch to demonstrate what I mean. But I'm still just guessing, since I haven't been able to reproduce the problem this afternoon.
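
Outside of puppet, the manual equivalent would be roughly this (assuming HIVE_CONF_DIR is /etc/hive/conf.analytics-hadoop, as the -D flags above suggest; untested):

# Deploy a real parquet-logging.properties to one of the two locations the
# wrapper checks, so that its -D flag finally points at a file that exists.
sudo install -m 0644 parquet-logging.properties \
    /etc/hive/conf.analytics-hadoop/parquet-logging.properties
# or, for the wrapper's fallback branch:
#   /usr/lib/hive/conf/parquet-logging.properties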

Change 675907 had a related patch set uploaded (by Awight; author: Awight):

[operations/puppet@production] Let hive use the default logging config path

https://gerrit.wikimedia.org/r/675907

Change 675907 merged by Elukey:

[operations/puppet@production] hive: use the default logging config path for parquet logs

https://gerrit.wikimedia.org/r/675907

Applied the patch with some tweaks; it is definitely a cleaner and more modern approach, thanks!

I see some errors for the various timers though:

Apr 01 05:02:13 an-launcher1002 reportupdater-templatedata[8907]: ValueError: Invalid metric name "MediaWiki.TemplateData.template.template-description-change.actionCount.byEditCount.100-999 edits.byWiki.bowiki"

Apr 01 05:01:58 an-launcher1002 reportupdater-templatewizard[8911]: ValueError: Invalid metric name "MediaWiki.TemplateWizard.save.byEditCount.1000+ edits.byWiki.afwiki"

Apr 01 07:28:07 an-launcher1002 reportupdater-codemirror[8627]: 2021-04-01 07:28:07,337 - CRITICAL - Invalid metric name "MediaWiki.CodeMirror.sessions.byEditor.wikitext_2010.byEnabled.true.byEditCount.1000+ edits.byWiki.adywiki"

> Applied the patch with some tweaks; it is definitely a cleaner and more modern approach, thanks!
>
> I see some errors for the various timers though:
>
> Apr 01 05:02:13 an-launcher1002 reportupdater-templatedata[8907]: ValueError: Invalid metric name "MediaWiki.TemplateData.template.template-description-change.actionCount.byEditCount.100-999 edits.byWiki.bowiki"
>
> Apr 01 05:01:58 an-launcher1002 reportupdater-templatewizard[8911]: ValueError: Invalid metric name "MediaWiki.TemplateWizard.save.byEditCount.1000+ edits.byWiki.afwiki"
>
> Apr 01 07:28:07 an-launcher1002 reportupdater-codemirror[8627]: 2021-04-01 07:28:07,337 - CRITICAL - Invalid metric name "MediaWiki.CodeMirror.sessions.byEditor.wikitext_2010.byEnabled.true.byEditCount.1000+ edits.byWiki.adywiki"

Thanks for forwarding these. I think they show that the queries are advancing to the final result parsing, and that they were not interrupted by illegal CSV lines caused by this bug! I'll still hold off on closing this task until we see the full pipeline succeeding for a few days, if you agree.

I'll try to fix the remaining errors in a separate task.

Change 666948 abandoned by Awight:

[operations/puppet@production] Stop logging parquet to the console

Reason:

The predecessor now includes better changes to logging properties.

https://gerrit.wikimedia.org/r/666948

elukey@an-launcher1002:~$ ls -l /tmp/parquet-0.log
-rw-r--r-- 1 analytics analytics 7521 Apr  1 09:09 /tmp/parquet-0.log

This is a very nice confirmation of the new logging style, it seems to be working :)

@awight what should I check to see if the jobs are working? I tried the Graphite metrics but they are not updated, so I'm not sure what to look at to tell whether the changes are working or not :)

> This is a very nice confirmation of the new logging style, it seems to be working :)

Hehe, but as I was saying in Gerrit, maybe we don't need these logs at all?

> @awight what should I check to see if the jobs are working? I tried the Graphite metrics but they are not updated, so I'm not sure what to look at to tell whether the changes are working or not :)

Thanks, it looks like the queries are now failing further downstream, which is progress (T279046). Once they work, graphs like this will show new data: https://grafana.wikimedia.org/goto/2hvd4DlMz

awight claimed this task.
awight removed a project: Patch-For-Review.

@elukey I'm seeing this in reportupdater logs, probably a bad sign:

Jun 1 06:29:17 an-launcher1002 reportupdater-codemirror[20213]: log4j:ERROR No output stream or file set for the appender named [console].

I can't say there's any logging obviously going missing (but this job is mysteriously failing, so maybe?). Even stranger, I don't see the same line in the output from the reportupdater jobs in the language directory.