The latest hours that failed:
https://hue.wikimedia.org/oozie/list_oozie_workflow/0007757-191216160148723-oozie-oozi-W/?coordinator_job_id=0000001-191216144226458-oozie-oozi-C&bundle_job_id=0000000-191216144226458-oozie-oozi-B
https://hue.wikimedia.org/oozie/list_oozie_workflow/0007797-191216160148723-oozie-oozi-W/?coordinator_job_id=0000001-191216144226458-oozie-oozi-C&bundle_job_id=0000000-191216144226458-oozie-oozi-B
Taking hour 17 above as an example, here is what we observe:
- The regular job fails with a mapper memory error (https://yarn.wikimedia.org/jobhistory/job/job_1576512674871_21944) under the default memory settings (Map: 2G, Reduce: 4G).
- The job is restarted with more memory (Map: 4G, Reduce: 8G) and then fails with a reducer memory error (https://yarn.wikimedia.org/jobhistory/job/job_1576512674871_22546).
- The job is restarted with even more memory (Map: 16G, Reduce: 16G) and then succeeds.
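The escalation pattern observed for hour 17 can be modelled with a minimal Python sketch. This is not the actual Oozie workflow definition. The names `MemoryTier`, `run_with_escalation`, `simulated_job` and `hadoop_properties` are hypothetical, and the thresholds in `simulated_job` are made up only to reproduce the fail, fail, succeed sequence. The mapping to `mapreduce.map.memory.mb` and `mapreduce.reduce.memory.mb` assumes the job's container memory is set through those standard Hadoop properties, which this thread does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass(frozen=True)
class MemoryTier:
    """Memory settings for one attempt, in gigabytes."""
    map_gb: int
    reduce_gb: int


# The three settings observed for hour 17: default, first retry, second retry.
TIERS = (MemoryTier(2, 4), MemoryTier(4, 8), MemoryTier(16, 16))


def hadoop_properties(tier: MemoryTier) -> Dict[str, str]:
    # Assumption: memory is passed via these standard Hadoop properties (in MB).
    return {
        "mapreduce.map.memory.mb": str(tier.map_gb * 1024),
        "mapreduce.reduce.memory.mb": str(tier.reduce_gb * 1024),
    }


def run_with_escalation(
    run_job: Callable[[MemoryTier], bool],
    tiers: Sequence[MemoryTier] = TIERS,
) -> Optional[MemoryTier]:
    """Try each tier in order; return the first one that succeeds, else None."""
    for tier in tiers:
        if run_job(tier):
            return tier
    return None


def simulated_job(tier: MemoryTier) -> bool:
    # Hypothetical thresholds, chosen only so the simulation fails on the
    # first two tiers and succeeds on the third. The real memory needs of
    # the job are unknown.
    return tier.map_gb > 2 and tier.reduce_gb > 8


if __name__ == "__main__":
    winner = run_with_escalation(simulated_job)
    print(winner)
    if winner is not None:
        print(hadoop_properties(winner))
```

Each failed attempt in this model costs a full job run before the next tier is tried, so an hour that only succeeds at the last tier runs three times.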