diagram showing failed connection attempts of some jobs around 2014-04-08
Sporadically, an attempt of a Hadoop task fails with an error message
like
Error: java.io.IOException: Bad connect ack with firstBadLink as 10.64.36.116:50010
See for example
http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_2971/m/FAILED
The failed attempts are correctly restarted by Hadoop and eventually
succeed. But as the cluster is now pretty clean and not under heavy
load from different jobs, I would not expect to see such failures at
all.
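To check how often a given DataNode shows up as the bad link, the error lines can be counted per address. The following is only a rough sketch: the regular expression is derived from the single error message quoted above (the wording may differ in other Hadoop versions), and `count_bad_links` and the `sample` input are hypothetical names used for illustration, not part of any Hadoop tooling.

```python
import re
from collections import Counter

# Pattern taken from the error message quoted in this report.
# Assumption: other occurrences use the same wording and an IPv4:port address.
BAD_LINK_RE = re.compile(
    r"Bad connect ack with firstBadLink as (\d{1,3}(?:\.\d{1,3}){3}):(\d+)"
)


def count_bad_links(log_lines):
    """Count how often each DataNode IP is reported as firstBadLink."""
    counts = Counter()
    for line in log_lines:
        match = BAD_LINK_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical input: in practice these lines would come from the task logs.
sample = [
    "Error: java.io.IOException: Bad connect ack with firstBadLink as 10.64.36.116:50010",
    "some unrelated log line",
]
bad_link_counts = count_bad_links(sample)
print(bad_link_counts.most_common())
```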
I cannot recall having seen the error message for Hive queries; up
to now, I have only seen tasks of camus webrequest importer jobs with
such failed attempts. However, it does not matter whether it's a full
run importing the whole seven days' worth of mobile request traffic
(e.g. the above job_1387838787660_2971), or just an import of the last
hour (e.g.
http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_2965/m/FAILED
).
I briefly scanned the attempts of recently run applications, and there
seems to be a pattern: connecting to analytics10{11,16,17} fails more
often than connecting to other machines [1]. I am not sure whether
this is a misinterpretation, as it may be time/scheduling dependent,
but it looks strange. (See attachment failures.png for dot output of
the failed connection attempts.)
[1]
+---------------------------------------+---------------+---------------+
| Attempt                               | Source        | Destination   |
+---------------------------------------+---------------+---------------+
| attempt_1387838787660_2856_m_000006_0 | analytics1013 | analytics1016 |
| attempt_1387838787660_2859_m_000002_0 | analytics1013 | analytics1017 |
| attempt_1387838787660_2955_m_000003_1 | analytics1019 | analytics1011 |
| attempt_1387838787660_2955_m_000009_1 | analytics1011 | analytics1017 |
| attempt_1387838787660_2956_m_000003_1 | analytics1011 | analytics1016 |
| attempt_1387838787660_2956_m_000005_1 | analytics1013 | analytics1017 |
| attempt_1387838787660_2956_m_000006_1 | analytics1015 | analytics1011 |
| attempt_1387838787660_2956_m_000007_1 | analytics1017 | analytics1011 |
| attempt_1387838787660_2956_m_000008_0 | analytics1011 | analytics1018 |
| attempt_1387838787660_2971_m_000001_0 | analytics1012 | analytics1016 |
| attempt_1387838787660_2971_m_000003_1 | analytics1020 | analytics1011 |
| attempt_1387838787660_2971_m_000005_0 | analytics1018 | analytics1011 |
| attempt_1387838787660_2971_m_000007_1 | analytics1013 | analytics1016 |
| attempt_1387838787660_2971_m_000008_1 | analytics1015 | analytics1017 |
| attempt_1387838787660_2972_m_000003_0 | analytics1015 | analytics1011 |
+---------------------------------------+---------------+---------------+
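The pattern in [1] can be made explicit by tallying the destination hosts. This is a small sketch that hard-codes the (source, destination) pairs from the table above; the variable names are illustrative only, and the sample is small, so the tally shows the skew but does not prove a cause.

```python
from collections import Counter

# (source, destination) pairs copied from table [1], in table order.
failed_connections = [
    ("analytics1013", "analytics1016"),
    ("analytics1013", "analytics1017"),
    ("analytics1019", "analytics1011"),
    ("analytics1011", "analytics1017"),
    ("analytics1011", "analytics1016"),
    ("analytics1013", "analytics1017"),
    ("analytics1015", "analytics1011"),
    ("analytics1017", "analytics1011"),
    ("analytics1011", "analytics1018"),
    ("analytics1012", "analytics1016"),
    ("analytics1020", "analytics1011"),
    ("analytics1018", "analytics1011"),
    ("analytics1013", "analytics1016"),
    ("analytics1015", "analytics1017"),
    ("analytics1015", "analytics1011"),
]

# Count how often each host appears as the destination of a failed connection.
destination_counts = Counter(dst for _, dst in failed_connections)

for host, count in destination_counts.most_common():
    print(f"{host}: {count}")
```

For the listed attempts, 14 of the 15 failed connections target analytics1011, analytics1016 or analytics1017.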
Version: unspecified
Severity: normal
Attached: