
table_maintenance_iceberg_monthly task fails due to permission issue on Ivy cache artifact
Open, Needs Triage, Public

Description

From email with subject:"Airflow alert: <TaskInstance: table_maintenance_iceberg_monthly.wmf_readership__unique_devices_per_project_family_monthly.rewrite_manifests scheduled__2026-02-01T00:00:00+00:00 [failed]>".

yarn logs -applicationId application_1764064841637_1943802

Exception in thread "main" java.io.FileNotFoundException: /tmp/table_maintenance_iceberg_monthly/ivy_spark3/cache/resolved-org.apache.spark-spark-submit-parent-5e7f72ea-cb6c-46d5-9461-beabde1dadd3-1.0.xml (Permission denied)

It appears that an artifact's permissions are amiss.

Once fixed, the DAG task should be cleared and re-executed.
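
To confirm the ownership mismatch on a worker, something like the following could be run there (a quick diagnostic sketch, assuming shell access to the node; the path is the one from the traceback):

import pwd
from pathlib import Path

# List owner and mode of everything under the Ivy cache dir from the
# traceback, to spot entries owned by a different user.
cache_root = Path("/tmp/table_maintenance_iceberg_monthly")
for path in [cache_root, *sorted(cache_root.rglob("*"))]:
    try:
        st = path.stat()
    except PermissionError:
        print(f"{'<denied>':<12} ????? {path}")
        continue
    owner = pwd.getpwuid(st.st_uid).pw_name
    print(f"{owner:<12} {oct(st.st_mode & 0o777):>5} {path}")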

Event Timeline

This has happened before: when we run an airflow devenv with a personal user, it creates, say, /tmp/table_maintenance_iceberg_monthly, and then the analytics service user cannot access it.

Data-Platform-SRE, could we run one of those fancy cumin jobs over all Yarn workers that does something along the lines of:

rm -rf /tmp/table_maintenance_iceberg_*

@BTullis took care of this:

btullis@cumin1003:~$ sudo cumin A:hadoop-worker 'rm -rf /tmp/table_maintenance_iceberg_*'
95 hosts will be targeted:
an-worker[1142-1236].eqiad.wmnet
OK to proceed on 95 hosts? Enter the number of affected hosts to confirm or "q" to quit: 95
===== NO OUTPUT =====                                                                                                                                                                                              
PASS |███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (95/95) [00:01<00:00, 66.38hosts/s]
FAIL |                                                                                                                                                                            |   0% (0/95) [00:01<?, ?hosts/s]
100.0% (95/95) success ratio (>= 100.0% threshold) for command: 'rm -rf /tmp/tabl...enance_iceberg_*'.
100.0% (95/95) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

All tasks are green now.

Now, the more general issue is why this keeps happening, given that we are supposed to automatically switch to a different artifact caching strategy when running in a development environment:

# avoid ivy errors by making cache location unique per dag id in prod, but unique per user in dev
"spark.driver.extraJavaOptions": "-Divy.cache.dir=/tmp/{{dag.dag_id}}/ivy_spark3/cache -Divy.home=/tmp/{{dag.dag_id}}/ivy_spark3/home"  # noqa
if is_wmf_airflow_instance()
else f"-Divy.cache.dir=/tmp/{current_user()}/ivy_spark3/cache -Divy.home=/tmp/{current_user()}/ivy_spark3/home",

I got the same error running an airflow devenv while developing a Spark 3.3.2 DAG.

Ivy Default Cache set to: /tmp/runuser/ivy_spark3/cache
The jars for the packages stored in: /tmp/runuser/ivy_spark3/home/jars
org.apache.iceberg#iceberg-spark-runtime-3.3_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-6700c2db-77a4-421a-864c-fcab30a038e7;1.0
	confs: [default]
	found org.apache.iceberg#iceberg-spark-runtime-3.3_2.12;1.6.1 in mirrored
Exception in thread "main" java.io.FileNotFoundException: /tmp/runuser/ivy_spark3/cache/resolved-org.apache.spark-spark-submit-parent-6700c2db-77a4-421a-864c-fcab30a038e7-1.0.xml (Permission denied)
	at java.io.FileOutputStream.open0(Native Method)

Ah, now that I look at it more closely, the code is indeed switching correctly: your target folder is /tmp/runuser/ivy_spark3/cache/resolved-org.apache.spark-spark-submit-parent-6700c2db-77a4-421a-864c-fcab30a038e7-1.0.xml, which follows the pattern discussed in T418804#11708214 of f"-Divy.cache.dir=/tmp/{current_user()}/ivy_spark3/cache -Divy.home=/tmp/{current_user()}/ivy_spark3/home".

The problem then seems to be that an airflow devenv always runs as the runuser user, but when accessing HDFS it uses our personal user (say, xcollazo). Thus, if I pulled this artifact earlier as xcollazo, the cached copy will clash when, say, tchin uses the same devenv.
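
If that is the case, one possible direction (purely a sketch, not a proposal we have agreed on; it assumes the devenv can expose the real developer's name, e.g. via SUDO_USER from sudo or a hypothetical REAL_USER variable exported by the devenv wrapper) would be to key the dev cache on the real user rather than on runuser:

import getpass
import os

def effective_dev_user() -> str:
    # getpass.getuser() returns "runuser" for every devenv, so it cannot
    # disambiguate developers. SUDO_USER is set by sudo; REAL_USER is a
    # hypothetical variable the devenv wrapper would have to export.
    return os.environ.get("SUDO_USER") or os.environ.get("REAL_USER") or getpass.getuser()

def dev_ivy_cache_base() -> str:
    # e.g. /tmp/xcollazo/ivy_spark3 instead of the shared /tmp/runuser/ivy_spark3
    return f"/tmp/{effective_dev_user()}/ivy_spark3"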

I ran into the same issue again today:

Exception in thread "main" java.io.FileNotFoundException: /tmp/table_maintenance_iceberg_monthly/ivy_spark3/cache/resolved-org.apache.spark-spark-submit-parent-73ae20fa-2b58-4c79-9568-c95b98695cd1-1.0.xml (Permission denied)

I'll ask Ben to do the cleanup.