Check home/HDFS leftovers of paramd
Closed, Resolved (Public)

Description

Access for Paramita Das (paramd) has been removed. We need to check whether any data was left in their home directories on the stat* hosts or in HDFS, since they were a member of the "analytics-privatedata-users" group.

The Kerberos principal has already been removed. The point of contact for deciding whether to keep any data is @diego.

Event Timeline

Hi @MoritzMuehlenhoff,
Yes, please. I'll need a copy of all the data, both on the stat machines and in HDFS.

Thanks!

Data Engineering processes this kind of data in batches. Do you need it soon? If so, it's probably possible to handle this task outside the regular cadence.

I have access to most of the data, so I can wait a couple of weeks to get the full dump.

Hi @diego - I can look into getting this data for you.

Let's start with the stat boxes; the majority of it is at stat1005:/home/paramd
There's a total of 12 GB here, which includes 3 GB across two conda environments.
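
A per-entry breakdown like this can be produced with standard tools. The following is a minimal sketch, assuming GNU find, du and sort as found on a typical Debian host; `home_usage` is a hypothetical helper name, not an existing tool, and this is not necessarily how the figures above were obtained.

```shell
#!/bin/sh
# Hypothetical helper: print the size in KiB of every top-level entry
# of a directory, hidden entries included, largest first.
home_usage() {
    # find (unlike a bare * glob) also returns dotfiles and dot-directories.
    find "$1" -mindepth 1 -maxdepth 1 -exec du -sk {} + | sort -rn
}

# Example (needs read access to the directory, e.g. via sudo):
#   home_usage /home/paramd
```

Listing hidden entries explicitly makes it easier to see how much space items such as conda environments take before deciding what to copy.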

[Attached screenshot: image.png]

Would you like me to move this in bulk to a new directory within your home, such as: /home/dsaez/paramd-archive or would you like to be more selective? We could exclude the conda environments and other hidden directories, for example?
Or I could archive this as a tarball to your user home in HDFS, if you prefer, or something else.

The only other potentially interesting data on the stat boxes is on stat1008. There are four small Jupyter notebooks there, so you might just want to check whether they have any value.

The HDFS content is as follows:

======= HDFS ========
Found 24 items
drwxr-xr-x   - paramd paramd          0 2022-07-03 00:00 /user/paramd/.Trash
drwxr-xr-x   - paramd paramd          0 2023-06-19 09:34 /user/paramd/.sparkStaging
drwxr-xr-x   - paramd paramd          0 2023-01-03 07:53 /user/paramd/article_quality_history
drwxr-x---   - paramd paramd          0 2023-03-08 18:57 /user/paramd/biography
drwxr-x---   - paramd paramd          0 2023-03-08 18:54 /user/paramd/bios
drwxr-x---   - paramd paramd          0 2023-04-11 17:47 /user/paramd/bios_en.json
drwxr-x---   - paramd paramd          0 2023-04-11 18:06 /user/paramd/bios_en_new.json
drwxr-x---   - paramd paramd          0 2023-04-25 12:58 /user/paramd/bios_english_all.json
drwxr-x---   - paramd paramd          0 2023-04-11 17:50 /user/paramd/bios_hi.json
drwxr-x---   - paramd paramd          0 2023-04-13 01:48 /user/paramd/bios_hi_1.json
drwxr-x---   - paramd paramd          0 2023-04-13 02:00 /user/paramd/bios_hi_2.json
drwxr-x---   - paramd paramd          0 2023-04-25 12:30 /user/paramd/bios_hi_all.json
drwxr-x---   - paramd paramd          0 2023-04-11 17:57 /user/paramd/bios_hi_new.json
drwxr-x---   - paramd paramd          0 2023-04-25 12:35 /user/paramd/bios_hindi_all.json
drwxr-x---   - paramd paramd          0 2023-04-25 12:32 /user/paramd/bios_hindi_all.parquet
drwxr-x---   - paramd paramd          0 2023-03-08 19:32 /user/paramd/bios_pages_fa.parquet
drwxr-xr-x   - paramd paramd          0 2022-04-21 10:01 /user/paramd/current_quality.json
drwxr-xr-x   - paramd paramd          0 2022-05-30 12:26 /user/paramd/issac_model
drwxr-x---   - paramd paramd          0 2023-02-09 05:50 /user/paramd/new_data
drwxr-x---   - paramd paramd          0 2022-08-31 13:41 /user/paramd/paramita_article_quality
drwxr-xr-x   - paramd paramd          0 2022-04-21 10:16 /user/paramd/quality
drwxr-xr-x   - paramd paramd          0 2022-06-01 19:21 /user/paramd/scores_all_partitioned_by_wiki_and_year.parquet
drwxr-x---   - paramd paramd          0 2023-02-21 09:52 /user/paramd/sigir_results
drwxr-x---   - paramd paramd          0 2023-02-08 06:44 /user/paramd/user
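
The listing mixes world-readable directories (drwxr-xr-x) with ones restricted to owner and group (drwxr-x---). To review the restricted ones before moving anything, the output of `hdfs dfs -ls` can be filtered. This is a minimal sketch; `restricted_paths` is a hypothetical helper name, and it assumes paths contain no spaces.

```shell
#!/bin/sh
# Hypothetical helper: read `hdfs dfs -ls` output on stdin and print the
# paths whose permission string grants nothing to "other" (ends in ---).
restricted_paths() {
    # The "Found N items" header has no permission field, so it never matches.
    awk '$1 ~ /---$/ { print $NF }'
}

# Example (assumes a valid Kerberos ticket and read access):
#   hdfs dfs -ls /user/paramd | restricted_paths
```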

What is your preference for this data? Would you like it to be moved to your own home directory (e.g. /user/dsaez/paramd-archive) and have the ownership changed to your username?

Hive is a little bit more tricky, so I'll respond to this in a separate comment with the various databases owned by paramd - We can work out whether it's better to change their ownership to you, or something else.

Hi @BTullis

> Would you like me to move this in bulk to a new directory within your home, such as: /home/dsaez/paramd-archive

This sounds good and enough!

Thanks

I've executed the following commands:

btullis@stat1005:/home/paramd$ sudo mkdir /home/dsaez/paramd-archive
btullis@stat1005:/home/paramd$ sudo mv /home/paramd/* /home/dsaez/paramd-archive
btullis@stat1005:/home/paramd$ sudo chown -R dsaez /home/dsaez/paramd-archive

So you now have access to all of this content @diego - Feel free to review/delete/archive as you wish.
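
One caveat: with the shell's default globbing, a bare `*` does not match dotfiles, so the `mv` shown above leaves hidden entries in /home/paramd. If those are wanted as well, a variant along these lines would pick them up. This is a sketch only, assuming a find that supports `-mindepth` and `-maxdepth` (GNU or BSD); `move_all_entries` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: move every top-level entry of SRC, dotfiles included,
# into DEST (created if missing).
move_all_entries() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # find returns hidden entries too, whereas a bare * glob skips them.
    find "$src" -mindepth 1 -maxdepth 1 -exec mv {} "$dest"/ \;
}

# Example, run as root so both home directories are accessible:
#   move_all_entries /home/paramd /home/dsaez/paramd-archive
#   chown -R dsaez /home/dsaez/paramd-archive
```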

What did you decide about the files in HDFS? We have a standard archival process, which will relocate them to /wmf/data/archive/user/paramd. Would you like us to do this, or are you happy for them to be removed? Thanks.

Hi! The standard archival process works well for me. Thanks!

BTullis claimed this task.
BTullis moved this task from Ready for Work to Done on the Data-Platform-SRE board.

I have now archived these files:

sudo -u hdfs kerberos-run-command hdfs hdfs dfs -mv /user/paramd /wmf/data/archive/user/
sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R hdfs:analytics-privatedata-users /wmf/data/archive/user/paramd
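
For repeated use, the two commands above could be wrapped with a dry-run switch so they can be reviewed before anything is moved. This is a minimal sketch; `archive_hdfs_home` and the `DRY_RUN` variable are hypothetical names rather than existing tooling, and the actual standard archival process may be implemented differently.

```shell
#!/bin/sh
# Hypothetical wrapper around the two archival commands shown above.
# If DRY_RUN is set to any non-empty value, the commands are printed
# instead of executed.
archive_hdfs_home() {
    user=$1
    archive_root=/wmf/data/archive/user
    run=${DRY_RUN:+echo}
    # Move the user's HDFS home under the archive root, then hand ownership
    # to hdfs with the analytics-privatedata-users group.
    $run sudo -u hdfs kerberos-run-command hdfs hdfs dfs -mv "/user/$user" "$archive_root/" &&
    $run sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R hdfs:analytics-privatedata-users "$archive_root/$user"
}

# Example: print the commands for review without running them.
#   DRY_RUN=1 archive_hdfs_home paramd
```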