
Set data permission on new snapshot generation (discovery.wikibase_rdf)
Closed, Resolved · Public · 1 Estimated Story Points

Description

As of now, permissions are not automatically set for the discovery.wikibase_rdf table when new snapshots are generated. For instance, I just ran into the following error when trying to query this table:

PrestoUserError: PrestoUserError(type=USER_ERROR, name=PERMISSION_DENIED, message="Permission denied: user=andrewtavis-wmde, access=EXECUTE, inode="/wmf/data/discovery/wikidata/rdf/date=20230717"

I was able to query with a WHERE date=20230710 clause, though. I can now query the table fully, but only after being granted permissions explicitly.

Possible Solutions

@JAllemandou suggested the following on Slack:

  • Add --conf spark.hadoop.fs.permissions.umask-mode=022 to the spark job generating the data (seems simpler)
  • Explicitly add an airflow step to update perms after data generation
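For context on why 022 is the value being proposed, the resulting modes follow the usual POSIX umask arithmetic that HDFS also applies: new directories start from 777 and new files from 666, then the umask bits are cleared. A minimal sketch (plain Python, no Spark or HDFS involved):

```python
# Sketch of the umask arithmetic behind the two observed permission sets.

def apply_umask(base: int, umask: int) -> int:
    """Mode that remains after clearing the umask bits from `base`."""
    return base & ~umask

# umask 022 (proposed fix): world-readable output.
assert oct(apply_umask(0o777, 0o022)) == "0o755"  # drwxr-xr-x
assert oct(apply_umask(0o666, 0o022)) == "0o644"  # -rw-r--r--

# umask 027 (current cluster default): group-only output.
assert oct(apply_umask(0o777, 0o027)) == "0o750"  # drwxr-x---
assert oct(apply_umask(0o666, 0o027)) == "0o640"  # -rw-r-----
```

This matches the directory listings below: the world-readable snapshots show drwxr-xr-x (umask 022) and the broken ones show drwxr-x--- (umask 027).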

Details

Title: Make wikibase ttl imports world readable
Reference: repos/data-engineering/airflow-dags!478
Author: ebernhardson
Source Branch: work/ebernhardson/ttl-umask
Dest Branch: main

Event Timeline

Gehel set the point value for this task to 1. (Aug 7 2023, 3:24 PM)

I looked into these; the attached patch should fix it, but it leaves an open question (@JAllemandou):

The core-site.xml, along with the puppet code that writes it out, has had a default umask of 027 since at least 2021, which prevents world readability. So why do the historical dumps have the following permissions:

drwxr-xr-x   /wmf/data/discovery/wikidata/rdf/date=20230710
drwxr-xr-x   /wmf/data/discovery/wikidata/rdf/date=20230716
drwxr-xr-x   /wmf/data/discovery/wikidata/rdf/date=20230717
drwxr-x---   /wmf/data/discovery/wikidata/rdf/date=20230723
drwxr-x---   /wmf/data/discovery/wikidata/rdf/date=20230724
drwxr-x---   /wmf/data/discovery/wikidata/rdf/date=20230730
drwxr-x---   /wmf/data/discovery/wikidata/rdf/date=20230731
drwxr-x---   /wmf/data/discovery/wikidata/rdf/date=20230806
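For reference, the cluster-wide default mentioned above would presumably be set via Hadoop's fs.permissions.umask-mode property (that property name is real; the exact surrounding file contents here are an assumption):

```xml
<!-- core-site.xml fragment (sketch): cluster-wide default umask. -->
<property>
  <name>fs.permissions.umask-mode</name>
  <value>027</value>
</property>
```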

Similarly, we have other jobs that still run today and emit world-readable dumps without explicitly setting the umask. What is causing the difference?

drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230716
drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230723
drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230730
drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230806

The Airflow instance has been updated. I manually changed the permissions of the existing files to 644 and directories to 755 in /wmf/data/discovery/wikidata/rdf, so the existing datasets all match the datasets that will be created in the future.
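The one-off normalization described above can be sketched with local chmod; on the cluster the same idea would be expressed with `hdfs dfs -chmod` against /wmf/data/discovery/wikidata/rdf (the paths below are stand-ins, not the actual commands that were run):

```shell
#!/bin/sh
set -e
# Sketch: normalize a tree so directories are 755 and files are 644,
# demonstrated against a local scratch directory.
tmp=$(mktemp -d)
mkdir -p "$tmp/date=20230723"
touch "$tmp/date=20230723/part-00000"
chmod 750 "$tmp/date=20230723"            # simulate the umask-027 output
chmod 640 "$tmp/date=20230723/part-00000"

find "$tmp" -type d -exec chmod 755 {} +  # directories: drwxr-xr-x
find "$tmp" -type f -exec chmod 644 {} +  # files: -rw-r--r--

stat -c '%a' "$tmp/date=20230723"             # 755 (GNU stat)
stat -c '%a' "$tmp/date=20230723/part-00000"  # 644 (GNU stat)
rm -rf "$tmp"
```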

Additionally, there were three directories from imports in Feb 2021 that don't look to have been automatically cleaned up. I verified they were not registered as current Hive partitions of discovery.wikibase_rdf and deleted them.

Leaving this in the To Be Deployed state to verify that the next dump produced has the file permissions we expect before closing.

> I looked into these; the attached patch should fix it, but it leaves an open question (@JAllemandou):
>
> The core-site.xml, along with the puppet code that writes it out, has had a default umask of 027 since at least 2021, which prevents world readability. So why do the historical dumps have the following permissions:
>
> drwxr-xr-x   /wmf/data/discovery/wikidata/rdf/date=20230710
> drwxr-xr-x   /wmf/data/discovery/wikidata/rdf/date=20230716
> drwxr-xr-x   /wmf/data/discovery/wikidata/rdf/date=20230717
> drwxr-x---   /wmf/data/discovery/wikidata/rdf/date=20230723
> drwxr-x---   /wmf/data/discovery/wikidata/rdf/date=20230724
> drwxr-x---   /wmf/data/discovery/wikidata/rdf/date=20230730
> drwxr-x---   /wmf/data/discovery/wikidata/rdf/date=20230731
> drwxr-x---   /wmf/data/discovery/wikidata/rdf/date=20230806

The world-readable changes were made manually by me to unblock @AndrewTavis_WMDE. I logged my change in the analytics IRC channel but didn't ping the search IRC channel - I should have, please excuse me on this :)

> Similarly, we have other jobs that still run today and emit world-readable dumps without explicitly setting the umask. What is causing the difference?
>
> drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230716
> drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230723
> drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230730
> drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230806

My guess about those is that they are still generated by a Hive job. Hive and Spark behave differently with regard to permissions when generating files: Spark uses the configured umask, while Hive reproduces the parent-dir pattern. I'd be interested to know whether my guess is correct :)

> Similarly, we have other jobs that still run today and emit world-readable dumps without explicitly setting the umask. What is causing the difference?
>
> drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230716
> drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230723
> drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230730
> drwxrwxr-x   /wmf/data/discovery/cirrus/index/cirrus_replica=codfw/cirrus_group=chi/wiki=enwiki/snapshot=20230806
>
> My guess about those is that they are still generated by a Hive job. Hive and Spark behave differently with regard to permissions when generating files: Spark uses the configured umask, while Hive reproduces the parent-dir pattern. I'd be interested to know whether my guess is correct :)

These are both generated by Spark. The rdf is imported by a Scala application while the cirrus dump is imported by PySpark, but they should both be using the same underlying implementation. Both applications use df.write.insertInto(table_name) to instruct Spark to do the actual output. I'm a bit surprised they end up generating different sets of permissions.

I suppose it's not super important why the cirrus dump is world readable; it's fine to be readable. It just hints to me that there is something about hdfs/spark/permissions happening here that I don't understand.

> These are both generated by Spark. The rdf is imported by a Scala application while the cirrus dump is imported by PySpark, but they should both be using the same underlying implementation. Both applications use df.write.insertInto(table_name) to instruct Spark to do the actual output. I'm a bit surprised they end up generating different sets of permissions.
>
> I suppose it's not super important why the cirrus dump is world readable; it's fine to be readable. It just hints to me that there is something about hdfs/spark/permissions happening here that I don't understand.

Mwarf, wrong guess :) Interesting nonetheless. Let me know if you'd like to pair on this.

New dataset for 20230821 has updated permissions as expected.