
Set data permission on new snapshot generation (discovery.wikibase_rdf)
Closed, Resolved · Public · 1 Estimated Story Points

Description

As of now, permissions are not automatically set for the discovery.wikibase_rdf table when new snapshots are generated. For instance, I just ran into the following error when trying to query this table:

PrestoUserError: PrestoUserError(type=USER_ERROR, name=PERMISSION_DENIED, message="Permission denied: user=andrewtavis-wmde, access=EXECUTE, inode="/wmf/data/discovery/wikidata/rdf/date=20230717"

I was able to query with a WHERE date=20230710 clause, though. I can now query the table fully, but only after being granted permissions explicitly.

Possible Solutions

@JAllemandou suggested the following on Slack:

  • Add --conf spark.hadoop.fs.permissions.umask-mode=022 to the spark job generating the data (seems simpler)
  • Explicitly add an airflow step to update perms after data generation
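For context on why 022 is the value being proposed, the resulting modes follow the usual POSIX umask arithmetic that HDFS also applies: new directories start from 777 and new files from 666, then the umask bits are cleared. A minimal sketch (plain Python, no Spark or HDFS involved):

```python
# Sketch of the umask arithmetic behind the two observed permission sets.

def apply_umask(base: int, umask: int) -> int:
    """Mode that remains after clearing the umask bits from `base`."""
    return base & ~umask

# umask 022 (proposed fix): world-readable output.
assert oct(apply_umask(0o777, 0o022)) == "0o755"  # drwxr-xr-x
assert oct(apply_umask(0o666, 0o022)) == "0o644"  # -rw-r--r--

# umask 027 (current cluster default): group-only output.
assert oct(apply_umask(0o777, 0o027)) == "0o750"  # drwxr-x---
assert oct(apply_umask(0o666, 0o027)) == "0o640"  # -rw-r-----
```

This matches the directory listings below: the world-readable snapshots show drwxr-xr-x (umask 022) and the broken ones show drwxr-x--- (umask 027).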

Details

Title: Make wikibase ttl imports world readable
Reference: repos/data-engineering/airflow-dags!478
Author: ebernhardson
Source Branch: work/ebernhardson/ttl-umask
Dest Branch: main

Event Timeline

Gehel set the point value for this task to 1. (Aug 7 2023, 3:24 PM)

I looked into these; the attached patch should fix it, but it leaves an open question (@JAllemandou):

The core-site.xml, along with the puppet code that writes it out, has had a default umask of 027 since at least 2021, which prevents world readability. So why do the historical dumps have the following permissions:

drwxr-xr-x   /wmf/data/discovery/wikidata/rdf/date=20230710
drwxr-xr-x   /wmf/data/discovery/wikidata/rdf/date=20230716
drwxr-xr-x   /wmf/data/discovery/wikidata/rdf/date=20230717
drwxr-x---   /wmf/data/discovery/wikidata/rdf/date=20230723
drwxr-x---   /wmf/data/discovery/wikidata/rdf/date=20230724
drwxr-x---   /wmf/data/discovery/wikidata/rdf/date=20230730
drwxr-x---   /wmf/data/discovery/wikidata/rdf/date=20230731
drwxr-x---   /wmf/data/discovery/wikidata/rdf/date=20230806
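For reference, the cluster-wide default mentioned above would presumably be set via Hadoop's fs.permissions.umask-mode property (that property name is real; the exact surrounding file contents here are an assumption):

```xml
<!-- core-site.xml fragment (sketch): cluster-wide default umask. -->
<property>
  <name>fs.permissions.umask-mode</name>
  <value>027</value>
</property>
```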

Similarly, we have other jobs that still run today and emit world-readable dumps without explicitly setting the umask. What is causing the difference?

drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230716
drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230723
drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230730
drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230806

The Airflow instance has been updated. I manually changed the permissions of the existing files to 644 and directories to 755 in /wmf/data/discovery/wikidata/rdf, so the existing datasets all match the datasets that will be created in the future.
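The one-off normalization described above can be sketched with local chmod; on the cluster the same idea would be expressed with `hdfs dfs -chmod` against /wmf/data/discovery/wikidata/rdf (the paths below are stand-ins, not the actual commands that were run):

```shell
#!/bin/sh
set -e
# Sketch: normalize a tree so directories are 755 and files are 644,
# demonstrated against a local scratch directory.
tmp=$(mktemp -d)
mkdir -p "$tmp/date=20230723"
touch "$tmp/date=20230723/part-00000"
chmod 750 "$tmp/date=20230723"            # simulate the umask-027 output
chmod 640 "$tmp/date=20230723/part-00000"

find "$tmp" -type d -exec chmod 755 {} +  # directories: drwxr-xr-x
find "$tmp" -type f -exec chmod 644 {} +  # files: -rw-r--r--

stat -c '%a' "$tmp/date=20230723"             # 755 (GNU stat)
stat -c '%a' "$tmp/date=20230723/part-00000"  # 644 (GNU stat)
rm -rf "$tmp"
```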

Additionally, there were three directories from imports in Feb 2021 that don't look to have been automatically cleaned up. I verified they were not registered as current Hive partitions of discovery.wikibase_rdf and deleted them.

Leaving this in the To Be Deployed state to verify that the next dump produced has the file permissions we expect before closing.

> I looked into these; the attached patch should fix it, but it leaves an open question (@JAllemandou):
>
> The core-site.xml, along with the puppet code that writes it out, has had a default umask of 027 since at least 2021, which prevents world readability. So why do the historical dumps have the following permissions:
>
> drwxr-xr-x   /wmf/data/discovery/wikidata/rdf/date=20230710
> drwxr-xr-x   /wmf/data/discovery/wikidata/rdf/date=20230716
> drwxr-xr-x   /wmf/data/discovery/wikidata/rdf/date=20230717
> drwxr-x---   /wmf/data/discovery/wikidata/rdf/date=20230723
> drwxr-x---   /wmf/data/discovery/wikidata/rdf/date=20230724
> drwxr-x---   /wmf/data/discovery/wikidata/rdf/date=20230730
> drwxr-x---   /wmf/data/discovery/wikidata/rdf/date=20230731
> drwxr-x---   /wmf/data/discovery/wikidata/rdf/date=20230806

The world-readable changes were made manually by me to unblock @AndrewTavis_WMDE. I logged my change in the analytics IRC channel but didn't ping the search IRC channel - I should have, please excuse me on this :)

> Similarly, we have other jobs that still run today and emit world-readable dumps without explicitly setting the umask. What is causing the difference?
>
> drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230716
> drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230723
> drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230730
> drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230806

My guess about those is that they are still generated by a Hive job. Hive and Spark behave differently with regard to permissions when generating files: Spark uses the configured umask, while Hive reproduces the parent-dir pattern. I'd be interested to know whether my guess is correct :)

> Similarly, we have other jobs that still run today and emit world-readable dumps without explicitly setting the umask. What is causing the difference?
>
> drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230716
> drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230723
> drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230730
> drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230806
>
> My guess about those is that they are still generated by a Hive job. Hive and Spark behave differently with regard to permissions when generating files: Spark uses the configured umask, while Hive reproduces the parent-dir pattern. I'd be interested to know whether my guess is correct :)

These are both generated by Spark. The rdf is imported by a Scala application while the cirrus dump is imported by PySpark, but they should both be using the same underlying implementation. Both applications use df.write.insertInto(table_name) to instruct Spark to do the actual output. I'm a bit surprised they end up generating different sets of permissions.

I suppose it's not super important why the cirrus dump is world readable; it's fine to be readable. It just hints to me that there is something about hdfs/spark/permissions happening here that I don't understand.

> These are both generated by Spark. The rdf is imported by a Scala application while the cirrus dump is imported by PySpark, but they should both be using the same underlying implementation. Both applications use df.write.insertInto(table_name) to instruct Spark to do the actual output. I'm a bit surprised they end up generating different sets of permissions.
>
> I suppose it's not super important why the cirrus dump is world readable; it's fine to be readable. It just hints to me that there is something about hdfs/spark/permissions happening here that I don't understand.

Mwarf, wrong guess :) Interesting nonetheless. Let me know if you'd like to pair on this.

New dataset for 20230821 has updated permissions as expected.