Fix permissions in hdfs://analytics-hadoop/wmf/data/discovery
Closed, Resolved (Public)

Description

It looks like at some point I created files under this subpath as ebernhardson rather than as analytics-search, as intended. I noticed this today while deploying a script that deletes old data in Hadoop: it cannot do the cleanup because of permission errors.

Requested fix:

hdfs dfs -chown -R analytics-search:analytics-search-users hdfs://analytics-hadoop/wmf/data/discovery

Also curious if there is some way we can prevent this in the future, perhaps having the directories owned by a group that only analytics-search is in?
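
Before running the recursive chown, it may help to see which entries are actually affected. A minimal sketch, assuming the usual eight-column `hdfs dfs -ls -R` output (permissions, replication, owner, group, size, date, time, path); the function name `wrong_owner` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read `hdfs dfs -ls -R` output on stdin and print
# "owner path" for every entry whose owner is not the expected one.
# Assumes the 8-column listing format; shorter lines are skipped.
# Paths containing spaces would be truncated at the first space.
wrong_owner() {
    awk -v expected="$1" 'NF >= 8 && $3 != expected { print $3, $8 }'
}

# Usage on the cluster (not executed here):
#   hdfs dfs -ls -R hdfs://analytics-hadoop/wmf/data/discovery | wrong_owner analytics-search
```

An empty result after the chown would suggest that nothing under the path is still owned by another user.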

Details

Other Assignee
nfraison

Event Timeline

The command to change ownership is running.

To prevent this in the future, we can remove write permission for the analytics-search-users group:

From:
drwxrwxr-x   - analytics-search analytics-search-users               0 2023-03-10 01:06 hdfs://analytics-hadoop/wmf/data/discovery

To:
drwxr-xr-x   - analytics-search analytics-search-users               0 2023-03-10 01:06 hdfs://analytics-hadoop/wmf/data/discovery

I would just need confirmation from someone who knows the history better on whether users in the analytics-search-users group are expected to write to that directory.
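
The From/To change above amounts to dropping the group write bit on the top-level directory. A sketch, assuming it is run by a user allowed to change the directory's mode (the owner or an HDFS superuser); the `run` wrapper and the `DRY_RUN` variable are illustrative and only print each command unless `DRY_RUN=0` is set:

```shell
#!/bin/sh
TARGET="hdfs://analytics-hadoop/wmf/data/discovery"

# Illustrative wrapper: print the command by default, execute only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Remove group write on the top-level directory only (775 -> 755).
# No -R: existing subdirectories keep their current modes.
run hdfs dfs -chmod g-w "$TARGET"

# Confirm the result; -d lists the directory itself rather than its contents.
run hdfs dfs -ls -d "$TARGET"
```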

Ownership of everything under hdfs://analytics-hadoop/wmf/data/discovery has been updated to analytics-search:analytics-search-users.

As for the purpose of this directory: this is where all data written by the search platform's data pipelines (Airflow) goes. The team was previously the discovery department, hence the naming. The directory should only be written to for production use, and all relevant users can become analytics-search through kerberos-run-command if they need to make manual changes there.

I suspect data ends up here with the wrong credentials due to test runs that don't manage to properly substitute out all the appropriate paths. There are a number of issues that happen with data pipelines that can only be debugged by running real input data through it. I suspect sometimes the invocations don't get fully modified and the writes end up going to the real output locations by mistake.

I'm not sure that changing the top-level directory to disallow group writes would be sufficient. I suspect newly created directories and subdirectories, for newly deployed use cases, would still get the default umask (appears to be 066), which allows group writes. Perhaps we also need to change something in our Airflow configs and/or Hive table writes to ensure the umask gets set to 046?
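
For reference, in the absence of default ACLs a new HDFS directory gets mode `0777 & ~umask` and a new file `0666 & ~umask`, with the umask taken from the client's `fs.permissions.umask-mode` setting. A small sketch of that arithmetic; the function name `mode_for` is hypothetical, and the sample umask values are chosen to match the From/To listing earlier in the thread rather than read from the cluster configuration:

```shell
#!/bin/sh
# Hypothetical helper: print the octal mode a new entry gets for a given base
# mode (777 for directories, 666 for files) and umask, i.e. base & ~umask.
mode_for() {
    printf '%03o\n' $(( 0$1 & ~0$2 ))
}

mode_for 777 002   # 775 (drwxrwxr-x): group write allowed
mode_for 777 022   # 755 (drwxr-xr-x): group write removed
mode_for 666 022   # 644 (-rw-r--r--)
```

If this reading is right, a job would need an effective umask with the group write bit set (for example 022) for newly created subdirectories to come out without group write.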

Ignore the above; the patch was attached to the wrong ticket.