
Most raw webrequest partitions for 2014-11-13T20/1H not marked successful
Closed, Resolved (Public)

Description

Three of the four webrequest partitions [1] for 2014-11-13T20/1H (bits,
text, and upload) have not been marked successful.

What happened?

[1]


qchris@stat1002 jobs: 0 time: 14:37:13 // exit code: 0
cwd: ~
~/cluster-scripts/dump_webrequest_status.sh

+------------------+--------+--------+--------+--------+
| Date             |  bits  | mobile |  text  | upload |
+------------------+--------+--------+--------+--------+

[...]

| 2014-11-13T18/1H |    .   |    .   |    .   |    X   |
| 2014-11-13T19/1H |    .   |    .   |    .   |    .   |
| 2014-11-13T20/1H |    X   |    .   |    X   |    X   |
| 2014-11-13T21/1H |    .   |    .   |    .   |    .   |
| 2014-11-13T22/1H |    .   |    .   |    .   |    X   |

[...]

+------------------+--------+--------+--------+--------+

Statuses:

. --> Partition is ok
M --> Partition manually marked ok
X --> Partition is not ok (duplicates, missing, or nulls)
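
To pull the failing partitions out of a status dump like the one above, a minimal Python sketch follows. It assumes the table layout exactly as pasted here (a "|"-delimited header row followed by one row per hour); the function name `failed_partitions` and the `SAMPLE` text are illustrative and not part of `dump_webrequest_status.sh`.

```python
def failed_partitions(table_text):
    """Return {date: [source, ...]} for every row with at least one 'X'."""
    header = None
    failed = {}
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip "+---+" borders, "[...]" elisions, blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first data line: Date, bits, mobile, text, upload
            continue
        date, statuses = cells[0], cells[1:]
        # "." and "M" both count as ok; only "X" marks a bad partition.
        bad = [name for name, status in zip(header[1:], statuses) if status == "X"]
        if bad:
            failed[date] = bad
    return failed


# Rows copied from the dump in this task; the layout is the only assumption.
SAMPLE = """\
+------------------+--------+--------+--------+--------+
| Date             |  bits  | mobile |  text  | upload |
+------------------+--------+--------+--------+--------+
| 2014-11-13T18/1H |    .   |    .   |    .   |    X   |
| 2014-11-13T19/1H |    .   |    .   |    .   |    .   |
| 2014-11-13T20/1H |    X   |    .   |    X   |    X   |
| 2014-11-13T21/1H |    .   |    .   |    .   |    .   |
| 2014-11-13T22/1H |    .   |    .   |    .   |    X   |
+------------------+--------+--------+--------+--------+
"""

if __name__ == "__main__":
    for date, sources in failed_partitions(SAMPLE).items():
        print(date, ",".join(sources))
```

Rows for hours where every source is "." or "M" are left out of the result, so only the hours that need attention are reported.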

Version: unspecified
Severity: normal

Details

Reference
bz73418

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:57 AM
bzimport set Reference to bz73418.
bzimport added a subscriber: Unknown Object (MLST).

The three jobs for 2014-11-13T20/1H were in SUSPENDED state.
Some internal workflows got stuck with an exception about ResourceManager (RM) issues [1].

This nicely matches yesterday's restart of the ResourceManager after
upgrading the JVMs.
Resuming the 3 jobs did not work, so I killed and restarted them.

[1] JA009 JA009: org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1409078537822_77051' doesn't exist in RM.
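
As far as I understand YARN application IDs, they have the form application_<cluster timestamp>_<sequence number>, where the cluster timestamp is the start time of the ResourceManager in epoch milliseconds. That would explain why an RM restarted without state recovery no longer knows an ID issued by its previous incarnation. A small Python sketch to decode the ID from a message like the one above (the helper name `parse_application_id` is illustrative only):

```python
import re
from datetime import datetime, timezone

# Assumed ID layout: application_<RM start time in epoch ms>_<sequence number>
APP_ID_RE = re.compile(r"application_(\d+)_(\d+)")


def parse_application_id(message):
    """Return (rm_start_time_utc, sequence_number), or None if no ID is found."""
    match = APP_ID_RE.search(message)
    if match is None:
        return None
    cluster_ts_ms = int(match.group(1))
    sequence = int(match.group(2))
    # Drop the milliseconds so the conversion stays in integer arithmetic.
    started = datetime.fromtimestamp(cluster_ts_ms // 1000, tz=timezone.utc)
    return started, sequence


if __name__ == "__main__":
    error = (
        "JA009: org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: "
        "Application with id 'application_1409078537822_77051' doesn't exist in RM."
    )
    print(parse_application_id(error))
```

If the decoded start time is older than the most recent RM restart, the application belonged to the previous RM instance.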

Now the jobs succeeded, and the partitions got marked ok.