
Attempts of Hadoop tasks randomly fail "Bad connect ack with firstBadLink as $SOME_CLUSTER_IP"
Closed, Declined
Public

Description

Diagram showing failed connection attempts of some jobs around 2014-04-08

Sporadically, an attempt of a Hadoop task fails with an error message
like

Error: java.io.IOException: Bad connect ack with firstBadLink as 10.64.36.116:50010

See for example

http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_2971/m/FAILED

The failed attempts are correctly restarted by Hadoop and eventually
succeed. But as the cluster is now fairly clean and not under heavy
load from different jobs, I would not expect to see the above failures
at all.
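
For triage, the datanode named in such a message can be pulled out of task logs mechanically. The following is a minimal Python sketch, assuming the message keeps exactly the shape quoted above; the function name `first_bad_link` is hypothetical and not part of Hadoop.

```python
import re

# Matches the message shape quoted in this report. Other Hadoop versions
# may word the message differently (assumption).
_BAD_LINK = re.compile(
    r"Bad connect ack with firstBadLink as (\d{1,3}(?:\.\d{1,3}){3}):(\d+)"
)


def first_bad_link(line):
    """Return (ip, port) of the first bad link named in a log line, or None."""
    match = _BAD_LINK.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


if __name__ == "__main__":
    sample = (
        "Error: java.io.IOException: "
        "Bad connect ack with firstBadLink as 10.64.36.116:50010"
    )
    print(first_bad_link(sample))
```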

I cannot recall having seen the error message for Hive queries; so
far, I have only seen such failed attempts in tasks of Camus webrequest
importer jobs. However, it does not matter whether it is a full run
importing the whole seven days' worth of mobile request traffic
(e.g. job_1387838787660_2971 above), or just importing the last
hour, e.g.:

http://analytics1010.eqiad.wmnet:19888/jobhistory/attempts/job_1387838787660_2965/m/FAILED

I briefly scanned the attempts of recently run applications, and there
seems to be a pattern: connections to analytics10{11,16,17} fail more
often than connections to other machines [1]. I am not sure whether
this is a misinterpretation, as it may be time/scheduling dependent,
but it looks strange. (See attachment failures.png for dot output of
the failed connection attempts.)

[1]
+---------------------------------------+---------------+---------------+
| Attempt                               | Source        | Destination   |
+---------------------------------------+---------------+---------------+
| attempt_1387838787660_2856_m_000006_0 | analytics1013 | analytics1016 |
| attempt_1387838787660_2859_m_000002_0 | analytics1013 | analytics1017 |
| attempt_1387838787660_2955_m_000003_1 | analytics1019 | analytics1011 |
| attempt_1387838787660_2955_m_000009_1 | analytics1011 | analytics1017 |
| attempt_1387838787660_2956_m_000003_1 | analytics1011 | analytics1016 |
| attempt_1387838787660_2956_m_000005_1 | analytics1013 | analytics1017 |
| attempt_1387838787660_2956_m_000006_1 | analytics1015 | analytics1011 |
| attempt_1387838787660_2956_m_000007_1 | analytics1017 | analytics1011 |
| attempt_1387838787660_2956_m_000008_0 | analytics1011 | analytics1018 |
| attempt_1387838787660_2971_m_000001_0 | analytics1012 | analytics1016 |
| attempt_1387838787660_2971_m_000003_1 | analytics1020 | analytics1011 |
| attempt_1387838787660_2971_m_000005_0 | analytics1018 | analytics1011 |
| attempt_1387838787660_2971_m_000007_1 | analytics1013 | analytics1016 |
| attempt_1387838787660_2971_m_000008_1 | analytics1015 | analytics1017 |
| attempt_1387838787660_2972_m_000003_0 | analytics1015 | analytics1011 |
+---------------------------------------+---------------+---------------+
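
To make the suspected pattern easier to check, the rows of table [1] can be tallied per host. This is a minimal Python sketch: the (source, destination) pairs are copied from the table, while the names `FAILURES` and `tally` are made up for illustration.

```python
from collections import Counter

# (source, destination) pairs copied from table [1], in table order.
FAILURES = [
    ("analytics1013", "analytics1016"),
    ("analytics1013", "analytics1017"),
    ("analytics1019", "analytics1011"),
    ("analytics1011", "analytics1017"),
    ("analytics1011", "analytics1016"),
    ("analytics1013", "analytics1017"),
    ("analytics1015", "analytics1011"),
    ("analytics1017", "analytics1011"),
    ("analytics1011", "analytics1018"),
    ("analytics1012", "analytics1016"),
    ("analytics1020", "analytics1011"),
    ("analytics1018", "analytics1011"),
    ("analytics1013", "analytics1016"),
    ("analytics1015", "analytics1017"),
    ("analytics1015", "analytics1011"),
]


def tally(pairs):
    """Count failed attempts per destination host and per source host."""
    by_destination = Counter(dst for _, dst in pairs)
    by_source = Counter(src for src, _ in pairs)
    return by_destination, by_source


if __name__ == "__main__":
    by_destination, by_source = tally(FAILURES)
    print("Failures by destination:")
    for host, count in by_destination.most_common():
        print(f"  {host}: {count}")
    print("Failures by source:")
    for host, count in by_source.most_common():
        print(f"  {host}: {count}")
```

On these 15 rows, analytics1011 is the destination 6 times, analytics1016 and analytics1017 4 times each, and analytics1018 once. The sample is small, so this alone does not rule out a time/scheduling effect.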


Version: unspecified
Severity: normal

Attached:

failures.png (852×857 px, 75 KB)

Details

Reference
bz63693

Event Timeline

bzimport raised the priority of this task from to Medium. (Nov 22 2014, 3:25 AM)
bzimport set Reference to bz63693.
bzimport added a subscriber: Unknown Object (MLST).

bingle-admin wrote:

Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1535