
Bug: Airflow reports Mjolnir connection to Hive is already open
Closed, Resolved (Public)

Description

Airflow tasks that make folds for the Mjolnir pipeline report that the connection to Hive is already open.

The error was reported for 19 languages, so the same failure is happening across the whole pipeline, not in a single task.

Log/Stacktrace:

*** Reading local file: /var/log/airflow/mjolnir/make_folds-frwiki-dbn-20180215-query_explorer-pruned_mrmr/2020-11-11T00:00:00+00:00/10.log
[2020-11-18 09:48:55,319] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mjolnir.make_folds-frwiki-dbn-20180215-query_explorer-pruned_mrmr 2020-11-11T00:00:00+00:00 [queued]>
[2020-11-18 09:48:55,390] {taskinstance.py:630} INFO - Dependencies all met for <TaskInstance: mjolnir.make_folds-frwiki-dbn-20180215-query_explorer-pruned_mrmr 2020-11-11T00:00:00+00:00 [queued]>
[2020-11-18 09:48:55,390] {taskinstance.py:841} INFO - 
--------------------------------------------------------------------------------
[2020-11-18 09:48:55,390] {taskinstance.py:842} INFO - Starting attempt 10 of 10
[2020-11-18 09:48:55,390] {taskinstance.py:843} INFO - 
--------------------------------------------------------------------------------
[2020-11-18 09:48:55,428] {taskinstance.py:862} INFO - Executing <Task(MjolnirOperator): make_folds-frwiki-dbn-20180215-query_explorer-pruned_mrmr> on 2020-11-11T00:00:00+00:00
[2020-11-18 09:48:55,429] {base_task_runner.py:133} INFO - Running: ['airflow', 'run', 'mjolnir', 'make_folds-frwiki-dbn-20180215-query_explorer-pruned_mrmr', '2020-11-11T00:00:00+00:00', '--job_id', '44812', '--pool', 'default_pool', '--raw', '-sd', 'DAGS_FOLDER/mjolnir.py', '--cfg_path', '/tmp/tmp6dort65l']
[2020-11-18 09:48:57,123] {base_task_runner.py:115} INFO - Job 44812: Subtask make_folds-frwiki-dbn-20180215-query_explorer-pruned_mrmr [2020-11-18 09:48:57,123] {settings.py:252} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=2890
[2020-11-18 09:48:58,816] {base_task_runner.py:115} INFO - Job 44812: Subtask make_folds-frwiki-dbn-20180215-query_explorer-pruned_mrmr [2020-11-18 09:48:58,815] {__init__.py:51} INFO - Using executor LocalExecutor
[2020-11-18 09:48:58,825] {base_task_runner.py:115} INFO - Job 44812: Subtask make_folds-frwiki-dbn-20180215-query_explorer-pruned_mrmr [2020-11-18 09:48:58,825] {dagbag.py:92} INFO - Filling up the DagBag from /srv/deployment/wikimedia/discovery/analytics/airflow/dags/mjolnir.py
[2020-11-18 09:48:58,923] {base_task_runner.py:115} INFO - Job 44812: Subtask make_folds-frwiki-dbn-20180215-query_explorer-pruned_mrmr [2020-11-18 09:48:58,922] {cli.py:545} INFO - Running <TaskInstance: mjolnir.make_folds-frwiki-dbn-20180215-query_explorer-pruned_mrmr 2020-11-11T00:00:00+00:00 [running]> on host an-airflow1001.eqiad.wmnet
[2020-11-18 09:48:59,069] {logging_mixin.py:112} INFO - [2020-11-18 09:48:59,069] {hive_hooks.py:554} INFO - Trying to connect to 10.64.21.104:9083
[2020-11-18 09:48:59,070] {logging_mixin.py:112} INFO - [2020-11-18 09:48:59,070] {hive_hooks.py:556} INFO - Connected to 10.64.21.104:9083
[2020-11-18 09:48:59,090] {taskinstance.py:1058} ERROR - Already open!
Traceback (most recent call last):
  File "/srv/deployment/search/airflow-cache/revs/62a024b56bbfb10348555ce39d65c3c052a16bdd/venv/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 915, in _run_raw_task
    self.render_templates(context=context)
  File "/srv/deployment/search/airflow-cache/revs/62a024b56bbfb10348555ce39d65c3c052a16bdd/venv/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1267, in render_templates
    self.task.render_template_fields(context)
  File "/srv/deployment/search/airflow-cache/revs/62a024b56bbfb10348555ce39d65c3c052a16bdd/venv/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 689, in render_template_fields
    self._do_render_template_fields(self, self.template_fields, context, jinja_env, set())
  File "/srv/deployment/search/airflow-cache/revs/62a024b56bbfb10348555ce39d65c3c052a16bdd/venv/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 696, in _do_render_template_fields
    rendered_content = self.render_template(content, context, jinja_env, seen_oids)
  File "/srv/deployment/search/airflow-cache/revs/62a024b56bbfb10348555ce39d65c3c052a16bdd/venv/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 725, in render_template
    return jinja_env.from_string(content).render(**context)
  File "/srv/deployment/search/airflow-cache/revs/62a024b56bbfb10348555ce39d65c3c052a16bdd/venv/lib/python3.7/site-packages/jinja2/asyncsupport.py", line 76, in render
    return original_render(self, *args, **kwargs)
  File "/srv/deployment/search/airflow-cache/revs/62a024b56bbfb10348555ce39d65c3c052a16bdd/venv/lib/python3.7/site-packages/jinja2/environment.py", line 1008, in render
    return self.environment.handle_exception(exc_info, True)
  File "/srv/deployment/search/airflow-cache/revs/62a024b56bbfb10348555ce39d65c3c052a16bdd/venv/lib/python3.7/site-packages/jinja2/environment.py", line 780, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/srv/deployment/search/airflow-cache/revs/62a024b56bbfb10348555ce39d65c3c052a16bdd/venv/lib/python3.7/site-packages/jinja2/_compat.py", line 37, in reraise
    raise value.with_traceback(tb)
  File "<template>", line 1, in top-level template code
  File "/srv/deployment/wikimedia/discovery/analytics/airflow/dags/mjolnir.py", line 348, in __call__
    with self.hook.metastore as client:
  File "/srv/deployment/search/airflow-cache/revs/62a024b56bbfb10348555ce39d65c3c052a16bdd/venv/lib/python3.7/site-packages/hmsclient/hmsclient.py", line 72, in __enter__
    self.open()
  File "/srv/deployment/search/airflow-cache/revs/62a024b56bbfb10348555ce39d65c3c052a16bdd/venv/lib/python3.7/site-packages/hmsclient/hmsclient.py", line 67, in open
    self._oprot.trans.open()
  File "/srv/deployment/search/airflow-cache/revs/62a024b56bbfb10348555ce39d65c3c052a16bdd/venv/lib/python3.7/site-packages/thrift_sasl/__init__.py", line 73, in open
    message="Already open!")
thrift.transport.TTransport.TTransportException: Already open!
[2020-11-18 09:48:59,092] {taskinstance.py:1087} INFO - All retries failed; marking task as FAILED
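The trace shows why retries can never succeed here: template rendering enters `with self.hook.metastore as client:` on a client whose Thrift transport was already opened by the hook (see the "Connected to 10.64.21.104:9083" line just before the error), and hmsclient's `__enter__` calls `open()` unconditionally. A minimal sketch of that failure mode and a defensive workaround, using toy stand-in classes rather than the real thrift/hmsclient APIs:

```python
# A minimal sketch of the failure mode visible in the trace: hmsclient's
# __enter__ calls open() unconditionally, so entering the context manager
# on a client whose transport is already open raises "Already open!".
# These classes are toy stand-ins for thrift/hmsclient, not the real APIs.

class TTransportError(Exception):
    pass

class FakeTransport:
    def __init__(self):
        self._open = False

    def open(self):
        # Mirrors thrift_sasl: opening an already-open transport is an error.
        if self._open:
            raise TTransportError("Already open!")
        self._open = True

    def close(self):
        self._open = False

    def isOpen(self):
        return self._open

class FakeMetastoreClient:
    """Mirrors hmsclient: __enter__ opens the transport, __exit__ closes it."""

    def __init__(self):
        self.transport = FakeTransport()

    def __enter__(self):
        self.transport.open()
        return self

    def __exit__(self, *exc):
        self.transport.close()

client = FakeMetastoreClient()
client.transport.open()       # the hook already connected at init time...
try:
    with client:              # ...so the context manager's open() blows up
        pass
except TTransportError as e:
    print(e)                  # -> Already open!

# Defensive workaround: make sure the transport is closed before re-entering.
if client.transport.isOpen():
    client.transport.close()
with client:
    pass                      # succeeds: open() sees a closed transport
```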

Event Timeline

Trying to debug this directly didn't get me anywhere, but there was another suspicious error. Previously I could use the Hive metastore client from the Python REPL, but now it was giving connection failures:

thrift.transport.TTransport.TTransportException: Could not start SASL: b'Error in sasl_client_start (-1) SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.
 Minor code may provide more information (Server hive/10.64.21.104@WIKIMEDIA not found in Kerberos database)'

Took longer than I'd like to admit to realize, but that error containing an IP address instead of a hostname is non-standard. When setting up Airflow the metastore was only listening on IPv4, but Airflow was trying to talk to it over IPv6. At the time we worked around that by putting the IP address instead of the hostname into the config, and everything seemed to work.
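The principal in that error is built from whatever host value the client was configured with, and Kerberos service principals are registered by hostname, so an IP literal only works if something reverse-resolves it first. A sketch of that resolution step; this is illustrative rather than the actual Hadoop/SASL code path, and the helper name is made up:

```python
# Sketch: why configuring the metastore by IP only worked while something
# reverse-resolved it before building the Kerberos service principal.
# Illustrative only: metastore_principal is a made-up helper, not a real
# API, and the real resolution happens inside the client libraries.
import socket

def metastore_principal(host: str, realm: str = "WIKIMEDIA") -> str:
    try:
        # Reverse-resolve an IP literal back to its canonical hostname,
        # the step the client libraries previously did on our behalf.
        host = socket.gethostbyaddr(host)[0]
    except OSError:
        pass  # no PTR record (or resolution disabled): value used as-is
    return "hive/{}@{}".format(host, realm)

# With reverse resolution, 10.64.21.104 becomes an-coord1001.eqiad.wmnet
# and the principal exists in the KDC; without it the client asks for
# hive/10.64.21.104@WIKIMEDIA, which is "not found in Kerberos database".
```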

Checked in with Analytics: what changed is that they disabled a step that reverse-resolves provided hosts. Previously the libraries were most likely resolving the IP we provided back into the domain name before building the Kerberos principal. This change is being made in preparation for moving everyone from hive/an-coord1001.eqiad.wmnet@WIKIMEDIA to hive/analytics-hive.eqiad.wmnet@WIKIMEDIA, removing specific hosts from configurations and enabling future work on HA.
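Given that, the fix on our side is presumably to configure the metastore connection by hostname rather than IP literal. A hedged sketch using the Airflow 1.10 CLI; the connection id `metastore_default` is an assumption, not confirmed in this ticket:

```shell
# Repoint the metastore connection at a hostname so the Kerberos
# principal (hive/<host>@WIKIMEDIA) matches a real KDC entry.
# "metastore_default" is assumed; check `airflow connections --list`.
airflow connections --delete --conn_id metastore_default
airflow connections --add --conn_id metastore_default \
    --conn_type hive_metastore \
    --conn_host analytics-hive.eqiad.wmnet \
    --conn_port 9083
```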