In {T270629} we changed the default HDFS umask so that newly written files are not world readable. This caused the hdfs-rsync jobs declared in dumps::web::fetches::stats to fail. Those jobs are currently run as the dumpsgen user, which does not exist on the HDFS namenode. Since the files are not world readable, the user must exist on the namenode and belong to a POSIX group there that has permission to read the files.
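To illustrate why a umask change alone can break a reader like dumpsgen, here is a minimal sketch of how a umask turns into file permission bits. It assumes the usual rule that new files start from mode 666 (directories from 777) with the umask bits cleared. The two umask values, 022 and 027, are illustrative assumptions; the task does not state the actual old and new values.

```python
import stat


def mode_after_umask(umask: int, is_dir: bool = False) -> int:
    """Return the permission bits a new file or directory gets under `umask`."""
    base = 0o777 if is_dir else 0o666
    return base & ~umask


def world_readable(mode: int) -> bool:
    """True if the 'other' class has the read bit set."""
    return bool(mode & stat.S_IROTH)


# 0o022 and 0o027 are example values only, not taken from the task.
for umask in (0o022, 0o027):
    mode = mode_after_umask(umask)
    print(f"umask {umask:03o} -> file mode {mode:03o}, "
          f"world readable: {world_readable(mode)}")
```

With a umask that clears the "other" bits, a user who is neither the owner nor in the file's group has no read access at all.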
Description
Status | Subtype | Assigned | Task
---|---|---|---
Restricted Task | | |
Resolved | | fdans | T271362 dumps::web::fetches::stats job should use a user to pull from HDFS that exists in Hadoop cluster
Event Timeline
I'd like that not to be root; what are our choices? Looping in @Bstorm who will have thoughts on this I am sure, since those are WMCS boxes doing the fetches.
I'll try to follow up on this today. We tightened up the file permission settings on HDFS, and some use cases like this one popped up. There are two alternatives (keeping dumpsgen):
1. add dumpsgen to analytics-privatedata-users, so that it is able to read the files.
2. modify permissions/umask for the specific datasets that are public, so that other users (including dumpsgen) can read them.
I really prefer 2) since it is also better in the context of containing security breaches on WMCS nodes exposed to the internet, but it needs a little bit of work :) Will update the task once done!
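The difference between the two options can be sketched with a simplified POSIX-style read check: the owner bits apply if the user is the owner, otherwise the group bits if the user is in the file's group, otherwise the "other" bits. This is only an illustration; the `FileMeta` type, the `analytics` owner name, and the assumption that the affected files are group-owned by analytics-privatedata-users are hypothetical and not confirmed by the task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    owner: str
    group: str
    mode: int  # permission bits, e.g. 0o640


def can_read(user: str, groups: set, f: FileMeta) -> bool:
    """Simplified POSIX-style read check: exactly one class of bits applies."""
    if user == f.owner:
        return bool(f.mode & 0o400)
    if f.group in groups:
        return bool(f.mode & 0o040)
    return bool(f.mode & 0o004)


# Hypothetical file written under the tightened umask.
private = FileMeta(owner="analytics", group="analytics-privatedata-users", mode=0o640)

# Current situation: dumpsgen is in no relevant group, so the read fails.
print(can_read("dumpsgen", set(), private))

# Option 1: add dumpsgen to the group that can read the files.
print(can_read("dumpsgen", {"analytics-privatedata-users"}, private))

# Option 2: add the 'other' read bit back on public datasets only.
public = FileMeta(private.owner, private.group, private.mode | 0o004)
print(can_read("dumpsgen", set(), public))
```

Option 1 gives dumpsgen access to everything the group can read, while option 2 widens access only for the datasets that are public anyway, which is consistent with the containment argument above.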
We are in the process of fixing permissions for the files and restarting the rsync jobs that pull data from HDFS to the labstore nodes (serving dumps.wikimedia.org). The job is expected to take roughly the whole weekend, so sorry for the inconvenience to every consumer. We'll update the task as soon as we have more news to share.
There is still a problem with permissions on the labstore nodes, so users get an HTTP 403 when trying to download files, as outlined in T271616. We are trying to fix this issue ASAP.
@Ottomata the remaining step would be to address the issue that you outlined in the task description, namely adding dumpsgen to the namenode, but in theory it is not really mandatory with the fix that Joseph did during the past days (adding back other perms). Let me know if we can move on or if you prefer to add the user :)
I think adding back readable perms to public datasets is a better solution. I think this is done!