Page MenuHomePhabricator

[airflow] Inserting task notes is not working since upgrade to version 2.7.3
Closed, ResolvedPublic

Description

We are experiencing issues when attempting to add notes to tasks in airflow.
The following is from the airflow-webserver log on an-launcher1002.

...skipping...
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]:     c = connection._execute_20(
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]:   File "/usr/lib/airflow/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1710, in _execute_20
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]:     return meth(self, args_10style, kwargs_10style, execution_options)
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]:   File "/usr/lib/airflow/lib/python3.10/site-packages/sqlalchemy/sql/elements.py", line 334, in _execute_on_connection
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]:     return connection._execute_clauseelement(
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]:   File "/usr/lib/airflow/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1577, in _execute_clauseelement
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]:     ret = self._execute_context(
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]:   File "/usr/lib/airflow/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1953, in _execute_context
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]:     self._handle_dbapi_exception(
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]:   File "/usr/lib/airflow/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2134, in _handle_dbapi_exception
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]:     util.raise_(
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]:   File "/usr/lib/airflow/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]:     raise exception
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]:   File "/usr/lib/airflow/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1910, in _execute_context
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]:     self.dialect.do_execute(
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]:   File "/usr/lib/airflow/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]:     cursor.execute(statement, parameters)
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]: sqlalchemy.exc.DataError: (psycopg2.errors.InvalidTextRepresentation) invalid input syntax for type integer: "None"
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]: LINE 1: ...p_index, content, created_at, updated_at) VALUES ('None', 'm...
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]:                                                              ^
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]: [SQL: INSERT INTO task_instance_note (user_id, task_id, dag_id, run_id, map_index, content, created_at, updated_at) VALUES (%(user_id)s, %(task
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]: [parameters: {'user_id': 'None', 'task_id': 'move_data_to_archive', 'dag_id': 'pageview_hourly', 'run_id': 'scheduled__2023-12-01T06:00:00+00:0
Dec 01 13:45:12 an-launcher1002 airflow-webserver@analytics[12713]: (Background on this error at: https://sqlalche.me/e/14/9h9h)

image.png (881×1 px, 146 KB)

Event Timeline

BTullis triaged this task as High priority.Dec 1 2023, 1:49 PM
BTullis updated the task description. (Show Details)

@JAllemandou and I did some investigation of this issue.

Cause

The key part of the log appears to be the this part.

[SQL: INSERT INTO task_instance_note (user_id, task_id, dag_id, run_id, map_index, content, created_at, updated_at) VALUES (%(user_id)s, %(task_id)s, %(dag_id)s, %(run_id)s, %(map_index)s, %(content)s, %(created_at)s, %(updated_at)s)]
[parameters: {'user_id': 'None', 'task_id': 'wait_for_webrequest', 'dag_id': 'aqs_hourly_test', 'run_id': 'scheduled__2023-12-01T12:00:00+00:00', 'map_index': -1, 'content': 'This is a test note for T352534', 'created_at': datetime.datetime(2023, 12, 1, 13, 50, 28, 403341, tzinfo=Timezone('UTC')), 'updated_at': datetime.datetime(2023, 12, 1, 13, 50, 28, 403361, tzinfo=Timezone('UTC'))}]

The error shown is:

sqlalchemy.exc.DataError: (psycopg2.errors.InvalidTextRepresentation) invalid input syntax for type integer: "None"
LINE 1: ...p_index, content, created_at, updated_at) VALUES ('None', 'w...
                                                             ^

In other words, it is complaining about the fact that we are not logged in. Our user_id is not an integer, it is the string None.

We don't even have any user accounts, due to the way that we have deployed Airflow.

btullis@an-test-client1002:/srv/airflow-analytics_test$ sudo -u analytics airflow-analytics_test users list
No data found

Possible Solutions

There are several different ways that we could address this problem.

  • Report it as a bug upstream and point out that None should be a valid value for user_id
  • Patch the code ourselves to make it set/send an integer for user_id when we are not logged in.
  • Create local users on each airflow instance with:
btullis@an-test-client1002:/srv/airflow-analytics_test$ sudo -u analytics airflow-analytics_test users create
Usage: airflow users create [-h] -e EMAIL -f FIRSTNAME -l LASTNAME [-p PASSWORD] -r ROLE [--use-random-password] -u USERNAME [-v]

airflow users create command error: the following arguments are required: -e/--email, -f/--firstname, -l/--lastname, -r/--role, -u/--username, see help above.

There are pros and cons to each of these solutions and some may have short and long-term impacts. We can discuss.

In order to ascertain if we can work around this easily in the short term, I have created a local user on the analytics_test airflow instance.

btullis@an-test-client1002:/srv/airflow-analytics_test$ sudo -u analytics airflow-analytics_test users create --username admin --role Admin --firstname analytics_test --lastname admin --email data-engineering@wikimedia.org
/usr/lib/airflow/lib/python3.10/site-packages/flask_limiter/extension.py:336 UserWarning: Using the in-memory storage for tracking rate limits as no storage was explicitly specified. This is not recommended for production use. See: https://flask-limiter.readthedocs.io#configuring-a-storage-backend for documentation about configuring the storage backend.
[2023-12-01T16:19:18.513+0000] {manager.py:670} WARNING - No user yet created, use flask fab command to do it.
[2023-12-01T16:19:19.701+0000] {_plugin.py:346} INFO - Patching datahub policy
Password:
Repeat for confirmation:
[2023-12-01T16:19:31.307+0000] {manager.py:211} INFO - Added user %s
User "admin" created with role "Admin"

Password: admin

btullis@an-test-client1002:/srv/airflow-analytics_test$ sudo -u analytics airflow-analytics_test users list
id | username | email                          | first_name     | last_name | roles
===+==========+================================+================+===========+======
1  | admin    | data-engineering@wikimedia.org | analytics_test | admin     | Admin

I'll see if I can log in as this user and add a note. At this stage, we are still protected by the need to use SSH tunnelling to gain access to the server, so it doesn't matter that this is a trivial password. We have admin rights even when we are not logged in.

Success! So this is a valid workaround, if we want to use it.

image.png (756×1 px, 117 KB)

Ah, thanks @EBernhardson and apologies for the inconvenience. It's interesting to find out that you also use this feature.
As mentioned above we have a simplistic workaround that we can probably put in place quickly, then varying levels of more sophisticated and longer-term fixes.
In fact, as you are administrators of your own airflow instance you could already create your own local user accounts using the same technique, should you wish.

We do use the feature, although tbh I don't know how useful it is. Sometimes we skip runs that failed because canary events didn't fire, or we have to re-run a task with some extra settings because something weird about that hour is blowing out memory limits, and we make a note of that in the dagrun. But i don't know we've ever looked at those notes afterwards.

I appreciate the investigation, indeed this is sufficient to add notes going forward.

I have done some investigation regarding the other authentication methods that are available to Airflow.
In general, this is very similar to the way that Superset authentication works, in that it is built with the Flask Application Builder (fab) RBAC framework and will support any of the authenticaiton methods available to flask.
More information is available here: https://airflow.apache.org/docs/apache-airflow/2.7.3/security/webserver.html#other-methods

This means that we could choose to use any of the following methods for AUTH_TYPE:

  • AUTH_DB - This is effectively what we are using at the moment
  • AUTH_LDAP - We used to use this for Superset, until it was superseded when both superset and superset-next were integrated with the CAS-SSO mechanism
  • AUTH_REMOTE_USER - This is what we currently use with an Apache virtualhost on Superset to enable the CAS-SSO integration.

Both the AUTH_LDAP and AUTH_REMOTE_USER methods would require a reasonable amount of work and will have the same drawbacks that we currently face with Superset.

However, what we are currently working towards for Superset is to migrate the CAS-SSO authentication away from an Apache virtualhost (and therefore the AUTH_REMOTE_USER method) to a custom AUTH_OIDC method.
This will likely use the flask-oidc extension. This change in authentication for Superset is details here and here (later in the same doc). We are planning to do this work as part of the migration of Superset to Kubernetes (T347710).

My personal feeling is that it would be better for us to wait until we can do a similar migration for Airflow, before enabling a full user authentication environment. Our Airflow access control mechanism is currently based on SSH tunnelling alone, which uses a mature, if somewhat cumbersome, method. My feeling is that we should revisit this to improve both the Airflow user experience and security model, but that adding another layer now in order to fix the dagrun notes feature, would be adding technical debt.

So my recommendation is that we continue to use the AUTH_TYPE: AUTH_DB and either:

  • Create a simple admin:admin user for each airflow instance

or

  • Create a user for each person who uses airflow.

Administrators of each instance are also free to create and manage their own user databases, for the time being.

Create a simple admin:admin user for each airflow instance

I think this makes sense until we have something better. Our instances aren't multi tenant, so there is no need for user authentication at the airflow level. The airflow port is only open on localhost, and only users in the designated posix groups have their ssh keys deployed to their airflow instances, so only those users can access those ports.

Report it as a bug upstream and point out that None should be a valid value for user_id

To me, this feels like a regression. A feature that used to work with no auth (and presumably no auth is a supported feature as well) doesn't work anymore. I think we should log this as a bug.

Report it as a bug upstream and point out that None should be a valid value for user_id

To me, this feels like a regression. A feature that used to work with no auth (and presumably no auth is a supported feature as well) doesn't work anymore. I think we should log this as a bug.

Thanks, that's a good point. However, I'm not sure sure that unauthenticated users is a supported feature. I think it probably just used to work by mistake.
I found this: https://airflow.apache.org/docs/apache-airflow/2.7.3/security/security_model.html#airflow-security-model-user-types

Non-authenticated UI users: Airflow doesn’t support unauthenticated users by default. If allowed, potential vulnerabilities must be assessed and addressed by the Deployment Manager.

Report it as a bug upstream and point out that None should be a valid value for user_id

To me, this feels like a regression. A feature that used to work with no auth (and presumably no auth is a supported feature as well) doesn't work anymore. I think we should log this as a bug.

I have created a bug report upstream: https://github.com/apache/airflow/issues/36206 and explained why we feel that the loss of this feature is a regression.

While we wait to see if it is addressed, I will create those admin user accounts on the remaining instances. It should be easy to undo, once we know what our longer-term fix is likely to be.

  • Created the admin user on the analytics instance:
btullis@an-launcher1002:/srv/airflow-analytics$ sudo -u analytics airflow-analytics users create --username admin --role Admin --firstname analytics --lastname admin --email data-engineering@wikimedia.org
/usr/lib/airflow/lib/python3.10/site-packages/flask_limiter/extension.py:336 UserWarning: Using the in-memory storage for tracking rate limits as no storage was explicitly specified. This is not recommended for production use. See: https://flask-limiter.readthedocs.io#configuring-a-storage-backend for documentation about configuring the storage backend.
[2023-12-13T12:34:49.204+0000] {manager.py:670} WARNING - No user yet created, use flask fab command to do it.
[2023-12-13T12:34:51.382+0000] {_plugin.py:346} INFO - Patching datahub policy
Password:
Repeat for confirmation:
[2023-12-13T12:34:57.462+0000] {manager.py:211} INFO - Added user %s
User "admin" created with role "Admin"
btullis@an-launcher1002:/srv/airflow-analytics$ sudo -u analytics airflow-analytics users list
id | username | email                          | first_name | last_name | roles
===+==========+================================+============+===========+======
1  | admin    | data-engineering@wikimedia.org | analytics  | admin     | Admin
                                                                               
btullis@an-launcher1002:/srv/airflow-analytics$
  • Created the admin user on the research instance
  • Created the admin user on the platform_eng instance
  • Created the admin user on the search instance
  • Created the admin user on the product_analytics instance
  • Created the admin user on the wmde instance

I have sent an email to several related groups of users and I have updated wikitech here: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Airflow#Authentication

I'll monitor the upstream bug and report back with any updates.