When running a Hive query using wmfdata.hive.run, the output dataframe now includes a few log lines that the Hive CLI prints to stdout for some reason:
Feb 19, 2021 5:51:42 PM WARNING: org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Feb 19, 2021 5:51:42 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 10 records.
Feb 19, 2021 5:51:42 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
This started after the cluster upgrade to Bigtop (T273711).
The easiest way to fix this would be for wmfdata-python to filter out these lines, but that's not very appealing, since there's a small chance it could remove real data too. These messages should be going to stderr instead of stdout anyway; hopefully we can figure out how to make that happen upstream.
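For reference, a minimal sketch of what such filtering might look like. This is a hypothetical helper, not part of wmfdata-python; the timestamp-prefix regex is an assumption based on the log lines shown above, and it illustrates the risk mentioned: any real data row that happened to start with a matching timestamp would also be dropped.

```python
import re

# Hypothetical filter for Hive CLI log lines of the form seen above, e.g.
# "Feb 19, 2021 5:51:42 PM WARNING: org.apache.parquet...". The pattern is
# an assumption about the java.util.logging output format; data rows that
# happen to match it would be removed too, which is why this approach is risky.
LOG_LINE = re.compile(
    r"^[A-Z][a-z]{2} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M "
    r"(SEVERE|WARNING|INFO|CONFIG|FINE|FINER|FINEST): "
)

def strip_cli_logs(raw_output: str) -> str:
    """Return the raw query output with log-like lines removed."""
    kept = [line for line in raw_output.splitlines()
            if not LOG_LINE.match(line)]
    return "\n".join(kept)
```

Applied to the raw stdout before it is parsed into a dataframe, this would drop the three lines quoted above while leaving tab-separated result rows intact.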