
Hive query failure in Jupyter notebook on stat1005
Closed, Resolved (Public)

Description

Since last week I have been running into a failure on stat1005. One of my Jupyter notebooks, run by a cron job, started to fail at a Hive query with the error below. I checked the file "/tmp/parquet-0.log": it belongs to another user, so I don't have permission to write to it. The same failure also occurs in interactive mode.

Please suggest a way to fix or work around this issue. Thanks!

ChildProcessError: The Hive command line client encountered the following error:
Can't load log handler "java.util.logging.FileHandler"
java.io.FileNotFoundException: /tmp/parquet-0.log (Permission denied)
java.io.FileNotFoundException: /tmp/parquet-0.log (Permission denied)

Commands where the failure happened:

query_edits_quality_damaging_model_nonbot_registered_user='''
SELECT WEEKOFYEAR(rev_timestamp) AS week_n, 
ROUND(SUM(case when scores["damaging"].prediction[0]='true' then 1 else 0 end)/count(rev_id) ,4) AS damaging_rate
FROM event_sanitized.mediawiki_revision_score
WHERE year='{YEAR}' AND  (rev_timestamp >= '{START_YYYY_MM_DD}' AND rev_timestamp <= '{END_YYYY_MM_DD}') 
AND `database`='fawiki' 
-- exclude IP editors in baseline as ORES is biased against anonymous editors
AND performer.user_id IS NOT NULL
-- exclude bot editors
AND not performer.user_is_bot
AND performer.user_text not regexp "^.*bot([^a-z].*$|$)"
GROUP BY WEEKOFYEAR(rev_timestamp)
ORDER BY week_n
LIMIT 10000
'''
df_edits_quality_damaging_rate_nonbot_registered_user=hive.run(query_edits_quality_damaging_model_nonbot_registered_user.format(YEAR=yr, START_YYYY_MM_DD=start_date_YYYY_MM_DD , END_YYYY_MM_DD=end_date_YYYY_MM_DD))

Event Timeline

Hm, strange. In /etc/hive/conf/parquet-logging.properties:

# Naming style for the output file:
# (The output file is placed in the system temporary directory.
# %u is used to provide unique identifier for the file.
# For more information refer
# https://docs.oracle.com/javase/7/docs/api/java/util/logging/FileHandler.html)
java.util.logging.FileHandler.pattern=%t/parquet-%u.log

The docs for %u say:

Normally the "%u" unique field is set to 0. However, if the FileHandler tries to open the filename and finds the file is currently in use by another process it will increment the unique number field and try again. This will be repeated until FileHandler finds a file name that is not currently in use. If there is a conflict and no "%u" field has been specified, it will be added at the end of the filename after a dot. (This will be after any automatically added generation number.)

But clearly in this case it didn't happen.
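To make the pattern concrete, here is a small conceptual sketch (Python, purely illustrative; the real logic lives inside java.util.logging.FileHandler) of the file names that the %t/parquet-%u.log pattern expands to. Note that, per the docs quoted above, the unique field only advances when a file is found to be in use by another process; nothing is said about files that already exist but are not writable, which is the situation on stat1005.

# Conceptual sketch only: enumerate the candidate file names that a pattern
# like %t/parquet-%u.log resolves to. This does not reproduce FileHandler's
# locking or retry logic.
import os
import tempfile

def candidate_log_paths(prefix="parquet", limit=10):
    """Yield file names the pattern expands to: %t is the temp dir, %u counts up."""
    tmpdir = tempfile.gettempdir()
    for u in range(limit):
        yield os.path.join(tmpdir, f"{prefix}-{u}.log")

print(list(candidate_log_paths()))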

I have managed to replicate this issue using the hive CLI, although interestingly it skipped /tmp/parquet-0.log and tried to open /tmp/parquet-1.log for me instead.

hive (wmf)> SELECT WEEKOFYEAR(rev_timestamp) AS week_n, 
          > ROUND(SUM(case when scores["damaging"].prediction[0]='true' then 1 else 0 end)/count(rev_id) ,4) AS damaging_rate
          > FROM event_sanitized.mediawiki_revision_score
          > WHERE year='2021' AND  (rev_timestamp >= '2021_12_19' AND rev_timestamp <= '2021_12_21')
          > AND `database`='fawiki'
          > AND performer.user_id IS NOT NULL
          > AND not performer.user_is_bot
          > AND performer.user_text not regexp "^.*bot([^a-z].*$|$)"
          > GROUP BY WEEKOFYEAR(rev_timestamp)
          > ORDER BY week_n
          > LIMIT 10000;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = btullis_20211221113332_afdbc720-789e-4f82-b811-f8826d0ad845
Total jobs = 2
Launching Job 1 out of 2
Can't load log handler "java.util.logging.FileHandler"
java.io.FileNotFoundException: /tmp/parquet-1.log (Permission denied)
java.io.FileNotFoundException: /tmp/parquet-1.log (Permission denied)
	at java.io.FileOutputStream.open0(Native Method)
	at java.io.FileOutputStream.open(FileOutputStream.java:270)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:133)
	at java.util.logging.FileHandler.open(FileHandler.java:229)
	at java.util.logging.FileHandler.rotate(FileHandler.java:697)
	at java.util.logging.FileHandler.openFiles(FileHandler.java:574)
	at java.util.logging.FileHandler.<init>(FileHandler.java:282)

At the time there were 7 parquet logs in /tmp on stat1005. Some had corresponding .lck files and some did not.

btullis@stat1005:~$ ls -l /tmp/parquet-*
-rw-rw-r-- 1 urbanecm              wikidev               1713 Nov  8 15:31 /tmp/parquet-0.log
-rw-r--r-- 1 analytics-privatedata analytics-privatedata    0 Nov  8 17:34 /tmp/parquet-0.log.lck
-rw-r--r-- 1 rhuang-ctr            wikidev                903 Nov  2 17:53 /tmp/parquet-1.log
-rw-rw-r-- 1 urbanecm              wikidev                  0 Nov  2 19:56 /tmp/parquet-1.log.lck
-rw-r--r-- 1 conniecc1             wikidev                  0 Nov  4 08:16 /tmp/parquet-2.log
-rw-r--r-- 1 analytics-privatedata analytics-privatedata    0 Nov  8 17:31 /tmp/parquet-2.log.lck
-rw-r--r-- 1 conniecc1             wikidev                  0 Nov  4 07:51 /tmp/parquet-3.log
-rw-r--r-- 1 jiawang               wikidev                  0 Nov 10 19:22 /tmp/parquet-3.log.lck
-rw-r--r-- 1 conniecc1             wikidev                  0 Nov  4 07:51 /tmp/parquet-4.log
-rw-r--r-- 1 kcv-wikimf            wikidev                  0 Nov 23 15:40 /tmp/parquet-4.log.lck
-rw-rw-r-- 1 urbanecm              wikidev                  0 Dec 17 18:23 /tmp/parquet-5.log
-rw-r--r-- 1 conniecc1             wikidev                  0 Dec  8 19:25 /tmp/parquet-6.log

Right, and it isn't a race condition, since the existing parquet-1.log was created on Nov 2 or before.
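For what it's worth, the same check can be made programmatically. Here's a minimal sketch (Python standard library only; illustrative, not part of the original investigation) that reports which of the existing parquet logs in /tmp the current user could actually open for writing, which is what java.util.logging.FileHandler needs to do:

# Sketch: for each parquet log in /tmp, show its owner and whether the
# current user would be allowed to open it for writing.
import glob
import os
import pwd

for path in sorted(glob.glob("/tmp/parquet-*.log")):
    owner = pwd.getpwuid(os.stat(path).st_uid).pw_name
    writable = os.access(path, os.W_OK)
    print(f"{path}\towner={owner}\twritable_by_me={writable}")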

As a workaround, perhaps we could modify the FileHandler logging format in some way? Not sure how; there aren't a lot of options.

I'd like to see if this is resolved or affected at all by the new versions of log4j that we will be deploying as part of the new Hive build. See T297468 for details.
It's possible that this bug was fixed in log4j between versions 2.6.2 and 2.17.0.

I also found this old ticket, T208550: bothersome output in hive when querying events database, which made reference to similar permission denied errors for parquet logs.
These permission denied errors are still inexplicable, because the mode of /tmp/parquet-logs/ was set to 0777 at the time: https://gerrit.wikimedia.org/r/c/operations/puppet/cdh/+/471928/
That should have given users the right to write over each other's files.

The workaround suggested by @JAllemandou was to switch parquet logs from java.util.logging.FileHandler to java.util.logging.ConsoleHandler and reduce the logging level so that they did not log below the SEVERE level.

I could investigate this method as well, if the updated log4j packages don't fix the issue.
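For readers less familiar with java.util.logging, the idea behind that workaround, expressed with Python's standard logging module purely as an illustration (the real change would be in /etc/hive/conf/parquet-logging.properties, not in Python), looks like this:

# Illustration only: send log records to the console instead of a shared
# file in /tmp, and drop everything below ERROR (roughly java.util.logging's
# SEVERE), so no per-user log file in /tmp ever needs to be opened.
import logging
import sys

parquet_logger = logging.getLogger("parquet")
parquet_logger.propagate = False                # don't also pass records to the root logger
parquet_logger.handlers.clear()                 # no FileHandler
console = logging.StreamHandler(sys.stderr)     # ConsoleHandler equivalent
console.setLevel(logging.ERROR)
parquet_logger.addHandler(console)
parquet_logger.setLevel(logging.ERROR)

parquet_logger.info("suppressed: below ERROR")
parquet_logger.error("emitted to stderr")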

Hello. Sincere apologies for the delay in fixing this. I had hoped by now to have tested the upgraded log4j versions under Hive to see whether they fix this, but I've run into an issue with the upgraded packages, so I haven't been able to check whether it is fixed.
It looks like I'll have to investigate changing the logging handler for parquet next week instead.

I have tried the test case with the upgraded Hive packages, but they do not fix the issue.
I will now change the logging handler for parquet so that it logs to the console at SEVERE and above, which should work around this issue.

hive (default)> SELECT WEEKOFYEAR(rev_timestamp) AS week_n,ROUND(SUM(case when scores["damaging"].prediction[0]='true' then 1 else 0 end)/count(rev_id) ,4) AS damaging_rate FROM event_sanitized.mediawiki_revision_score WHERE year='2021' AND  (rev_timestamp >= '2021_12_19' AND rev_timestamp <= '2021_12_21') AND `database`='fawiki' AND performer.user_id IS NOT NULL AND not performer.user_is_bot AND performer.user_text not regexp "^.*bot([^a-z].*$|$)" GROUP BY WEEKOFYEAR(rev_timestamp) ORDER BY week_n LIMIT 10000;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = btullis_20220113115322_20e144a8-6004-4d7a-b8e8-65557234572e
Total jobs = 2

Launching Job 1 out of 2
Can't load log handler "java.util.logging.FileHandler"
java.io.FileNotFoundException: /tmp/parquet-1.log (Permission denied)
java.io.FileNotFoundException: /tmp/parquet-1.log (Permission denied)

Change 755655 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Stop writing parquet logs to files

https://gerrit.wikimedia.org/r/755655

I have now created a patch to redirect parquet logs to the console and to reduce their verbosity.
It will require testing on the test cluster, for which I propose:

  • Disable puppet temporarily on all hosts with the bigtop::hive class applied
  • Submit and merge the CR
  • Enable puppet on an-test-client1001.eqiad.wmnet and an-test-coord1001.eqiad.wmnet
  • Run puppet on those hosts
  • Restart hive services on the test-coordinator
  • Open a hive CLI on the test client and try to replicate the issue
  • If all is well, re-enable puppet on all hosts with the bigtop::hive class applied
  • Run puppet on an-coord100[1-2]
  • Restart hive services on an-coord1002
  • Fail over the DNS for analytics-hive to an-coord1002
  • Wait for more than 5 minutes
  • Restart the hive services on an-coord1001
  • Wait for more than 5 minutes
  • Fail back the DNS for analytics-hive to an-coord1001

Mentioned in SAL (#wikimedia-analytics) [2022-01-20T11:36:34Z] <btullis> temporarily disabling puppet on servers with hive installed T297734

Change 755655 merged by Btullis:

[operations/puppet@production] Stop writing parquet logs to files

https://gerrit.wikimedia.org/r/755655

No problems detected in the test cluster.
Re-enabling puppet on all other hive-enabled servers.

@jwang - Please could you try your query again and see if the situation is any better now?
Once again, apologies for the delay in fixing this issue.

@BTullis, thank you very much for working on this issue. I tried the query again. It failed with a different error (please see the log below).
Error log:

---------------------------------------------------------------------------
ChildProcessError                         Traceback (most recent call last)
<ipython-input-29-014bde41aa40> in <module>
----> 1 df_edits_quality_damaging_rate_nonbot_registered_user=hive.run(query_edits_quality_damaging_model_nonbot_registered_user.format(YEAR=yr, START_YYYY_MM_DD=start_date_YYYY_MM_DD , END_YYYY_MM_DD=end_date_YYYY_MM_DD))

~/.conda/envs/2021-04-29T18.52.39_jiawang/lib/python3.7/site-packages/wmfdata/hive.py in run(commands, format, engine)
    134     result = None
    135     if engine == "cli":
--> 136         return run_cli(commands, format)
    137 
    138 def load_csv(

~/.conda/envs/2021-04-29T18.52.39_jiawang/lib/python3.7/site-packages/wmfdata/hive.py in run_cli(commands, format, heap_size, use_nice, use_ionice)
    111             raise ChildProcessError(
    112                 "The Hive command line client encountered the following "
--> 113                 "error:\n{}".format(cleaned_stderr)
    114             )
    115     finally:

ChildProcessError: The Hive command line client encountered the following error:
Ended Job = job_1637058075222_385913 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://an-master1001.eqiad.wmnet:8088/proxy/application_1637058075222_385913/
Examining task ID: task_1637058075222_385913_m_000167 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000143 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000473 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000432 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000443 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000500 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000353 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000481 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000385 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000270 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000398 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000366 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000545 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000097 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000242 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000466 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000387 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000002 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000321 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000191 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000507 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000539 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000196 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000439 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000170 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000168 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000110 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000249 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000021 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000038 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000418 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000190 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000092 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000173 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000104 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000203 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000015 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000388 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000420 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000318 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000497 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000512 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000032 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000523 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000356 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000440 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000484 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000305 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000544 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000525 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000067 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000572 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000152 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000216 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000346 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000397 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000233 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000380 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000588 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000378 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000424 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000434 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000466 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000140 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000483 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000562 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000178 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000519 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000031 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000470 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000108 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000115 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000571 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000534 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000274 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000083 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000453 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000436 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000245 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000447 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000008 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000445 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000038 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000330 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000070 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000516 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000467 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000141 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000433 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000318 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000269 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000179 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000422 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000239 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000499 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000599 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000508 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000049 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000512 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000307 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000550 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000033 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000263 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000442 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000067 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000278 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000169 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000265 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000086 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000116 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000397 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000120 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000074 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000044 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000287 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000485 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000063 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000206 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000172 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000383 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000319 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000308 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000583 (and more) from job job_1637058075222_385913
Examining task ID: task_1637058075222_385913_m_000178 (and more) from job job_1637058075222_385913

Task with the most failures(4): 
-----
Task ID:
  task_1637058075222_385913_m_000174

URL:
  http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1637058075222_385913&tipid=task_1637058075222_385913_m_000174
-----
Diagnostic Messages for this Task:
AttemptID:attempt_1637058075222_385913_m_000174_3 Timed out after 600 secs
[2022-01-21 03:25:21.504]Sent signal OUTPUT_THREAD_DUMP (SIGQUIT) to pid 3507 as user jiawang for container container_e33_1637058075222_385913_01_002514, result=success
[2022-01-21 03:25:21.551]Container killed by the ApplicationMaster.
[2022-01-21 03:25:21.607]Container killed on request. Exit code is 143
[2022-01-21 03:25:21.703]Container exited with a non-zero exit code 143. 


FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask

I also tried it in the Hive CLI. It seems to hang at "set mapreduce.job.reduces=<number>" without producing any further output.

hive (event_sanitized)> SELECT WEEKOFYEAR(rev_timestamp) AS week_n, 
                      > ROUND(SUM(case when scores["damaging"].prediction[0]='true' then 1 else 0 end)/count(rev_id) ,4) AS damaging_rate
                      > FROM event_sanitized.mediawiki_revision_score
                      > WHERE year='2021' AND  (substr(rev_timestamp,1,10) BETWEEN '2021-01-01' AND '2021-12-31') 
                      > AND `database`='ptwiki' 
                      > -- exclude IP editors in baseline as ORES is biased against anonymous editors
                      > AND performer.user_id IS NOT NULL
                      > -- exclude bot editors
                      > AND not performer.user_is_bot
                      > AND performer.user_text not regexp "^.*bot([^a-z].*$|$)"
                      > GROUP BY WEEKOFYEAR(rev_timestamp)
                      > ORDER BY week_n
                      > LIMIT 10000;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = jiawang_20220121030611_a9e9aee2-6b88-4dbc-ad43-99b46098fd2b
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 597
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>

I'm afraid I don't know what the cause of this might be.
The fact that you get this message: AttemptID:attempt_1637058075222_385913_m_000174_3 Timed out after 600 secs
...suggests to me that perhaps this is an unusually heavy query. It's looking at a whole year, which involves many partitions on hadoop.

Perhaps @JAllemandou or @mforns might have a suggestion on how you might go about optimizing this query?

@jwang
Agree with @BTullis: this dataset is partitioned hourly, which means that a query over 1 year of data will scan 24*365 = 8760 partitions.
Since the query hangs before it even starts computing, I think this is related to the high number of partitions scanned.
I tried executing the query for 1 month (by adding AND month=1), and it finished successfully.
Could you maybe try executing it month by month?
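A minimal sketch of that month-by-month approach, assuming the same wmfdata hive.run() client used in the notebook and a hypothetical query template with an added month filter (neither the template name nor the MONTH placeholder come from this task):

# Sketch only: run the damaging-rate query one month at a time and
# concatenate the results, instead of scanning ~8760 hourly partitions at once.
import pandas as pd
from wmfdata import hive

# Hypothetical template: the original query with an extra "AND month={MONTH}" clause.
monthly_query_template = '''
SELECT WEEKOFYEAR(rev_timestamp) AS week_n,
       ROUND(SUM(CASE WHEN scores["damaging"].prediction[0]='true' THEN 1 ELSE 0 END)/COUNT(rev_id), 4) AS damaging_rate
FROM event_sanitized.mediawiki_revision_score
WHERE year='{YEAR}' AND month={MONTH}
AND `database`='ptwiki'
AND performer.user_id IS NOT NULL
AND NOT performer.user_is_bot
AND performer.user_text NOT REGEXP "^.*bot([^a-z].*$|$)"
GROUP BY WEEKOFYEAR(rev_timestamp)
ORDER BY week_n
LIMIT 10000
'''

monthly_frames = []
for month in range(1, 13):
    # Each iteration scans roughly 24 * 30 hourly partitions instead of 8760.
    monthly_frames.append(hive.run(monthly_query_template.format(YEAR=2021, MONTH=month)))

df_damaging_rate = pd.concat(monthly_frames, ignore_index=True)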

@mforns @BTullis, thank you both for explaining the details. I consider this issue fixed and will mark it as resolved.