
wmfdata-python's Hive query output includes logspam
Open, Medium, Public

Description

When running a Hive query using wmfdata.hive.run, the output dataframe now includes a few lines of logs, which the Hive CLI prints to stdout for some reason:

Feb 19, 2021 5:51:42 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Feb 19, 2021 5:51:42 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 10 records.                                                                                       
Feb 19, 2021 5:51:42 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block

This started after the cluster upgrade to Bigtop (T273711).

The easiest way to fix this would be for wmfdata-python to filter out these lines, but that's not very appealing, since there's a small chance it could remove real data too. This stuff should be going to stderr instead of stdout anyway; hopefully we can figure out how to make that happen upstream.
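For illustration, such a filter might look like the following (a hypothetical helper, not actual wmfdata-python code; the pattern is inferred from the sample lines above). It also shows the risk: any genuine output line matching the pattern would be silently dropped.

import re

# Hypothetical helper, not actual wmfdata-python code: drop lines that look
# like Parquet logspam (timestamp, log level, org.apache. class name),
# based on the sample lines above.
LOG_LINE = re.compile(
    r"^\w{3} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M "
    r"(WARNING|INFO|SEVERE): org\.apache\."
)

def strip_log_lines(raw_output):
    # Best-effort removal; a real data row matching the pattern would be
    # dropped too, which is exactly why this approach isn't appealing.
    return "\n".join(
        line for line in raw_output.splitlines()
        if not LOG_LINE.match(line)
    )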

Event Timeline

Milimetric added a subscriber: Milimetric.

Seems not to be an issue anymore; I closed the related task on our board. Let me know if you see otherwise.

I just tested this and some queries still get log entries included. Here is a sample:

May 6, 2021 5:04:25 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
May 6, 2021 5:00:25 PM INFO: org.apache.parquet.CorruptStatistics: Ignoring statistics because this file was created prior to 1.8.0, see PARQUET-251

I think the underlying problem is that whenever there's a log message, it gets included in the returned data. I suspect this is the fault of the Hive CLI, because it includes the log messages in stdout, but it could also be that these messages go to stderr and wmfdata-python is capturing them by mistake.

So we still need to solve that problem (maybe through more fundamental changes, like moving away from the CLI for Hive queries or deprecating Hive as an SQL engine entirely).
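To pin down which stream is at fault, one could invoke the CLI with stdout and stderr captured separately; a minimal sketch (the query is a placeholder; -S and -e are the Hive CLI's silent and execute flags):

import subprocess

# Minimal sketch: run a query through the Hive CLI with stdout and stderr
# captured separately, to see which stream the Parquet log lines land on.
result = subprocess.run(
    ["hive", "-S", "-e", "SELECT 1"],
    capture_output=True,
    text=True,
)
print("--- stdout ---")
print(result.stdout)
print("--- stderr ---")
print(result.stderr)

If the log lines show up in stderr, the bug is in how wmfdata-python captures output; if they show up in stdout, the CLI itself is at fault.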

Ok, weird, I can't reproduce this... maybe it's some weird access problem? We'll triage and look into it.

@nshahquinn-wmf Hive CLI logs are certainly annoying/weird. Q though: any reason wmfdata.hive is using Hive CLI instead of a Hive Python Client?

Milimetric added a project: Analytics-Kanban.

I think Andrew has some ideas; we'll get to the bottom of this one way or another. Then, once Neil's issue is resolved, I'd like to reframe this task or add a subtask to go over logging on the cluster in general. Background noise like SLF4J warnings clutters the already cluttered logs and makes maintenance harder.

> @nshahquinn-wmf Hive CLI logs are certainly annoying/weird. Q though: any reason wmfdata.hive is using Hive CLI instead of a Hive Python Client?

wmfdata.hive did use the Impala library originally, but we couldn't get it to work with Kerberos authentication, so on your team's advice I switched it to PySpark and then to the CLI. That work was tracked in T245891, although some of the details might not be recorded there.

Maybe we could try one of the libraries again, but I doubt they've improved much since Hive is not the query engine of choice these days.

Ahhh, thanks for the context, I (kind of!) remember now :)

I just tried pyhive with kerberos (for other reasons) and was able to get it to work: https://github.com/dropbox/PyHive/issues/174#issuecomment-421008445

from pyhive import hive

cursor = hive.Connection(
    host="analytics-hive.eqiad.wmnet",
    username="otto",
    auth="KERBEROS",
    kerberos_service_name="hive",
).cursor()
cursor.execute(
    "select count(*) from event.eventgate_analytics_external_test_event "
    "where year=2021 and month=5 and day=23 and hour=0"
)
print(cursor.fetchall())
# [(5040,)]

So they did add some support!

> I just tried pyhive with kerberos (for other reasons) and was able to get it to work
>
> So they did add some support!

Oh, nice! Switching to PyHive is clearly the best solution, then.
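As a rough sketch of what a PyHive-based wmfdata.hive.run could look like (not the actual implementation; the host and Kerberos settings are taken from the example above):

import pandas as pd
from pyhive import hive

# Rough sketch, not the actual wmfdata-python implementation. Going through
# a real client connection returns rows as Python objects, so there is no
# CLI output to parse and no way for log lines to leak into the dataframe.
def run(query):
    connection = hive.Connection(
        host="analytics-hive.eqiad.wmnet",
        auth="KERBEROS",
        kerberos_service_name="hive",
    )
    try:
        return pd.read_sql(query, connection)
    finally:
        connection.close()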