Check home/HDFS leftovers of mnz
Closed, ResolvedPublic

Description

Access for Muniza Aslam (mnz) has been removed. Because they were a member of the "analytics-privatedata-users" group, we need to check whether any data was left in their home directories on the stat* hosts and in HDFS.

The Kerberos principal has already been removed. The point of contact for the data on stat*/HDFS is @Miriam.

Event Timeline

@Miriam - Sorry to trouble you, but there appears to be quite a bit of data to review here, which belonged to Muniza.
Could you please let me know what I should do with it?
If I can help you to review it, or if you would like it to be archived, please let me know. Thanks.

btullis@barracuda:~$ check-user-leftovers mnz

====== stat1008 ======
total 48
drwxrwxr-x  2 32084 wikidev  4096 Jan 14  2025 data
drwxr-xr-x 17 32084 wikidev 36864 May 22  2024 experiments
drwxr-xr-x  6 32084 wikidev  4096 Feb  5 08:34 workspace

====== stat1010 ======
total 52
drwxr-xr-x  3 32084 wikidev  4096 Jun 27 16:52 airflow
drwxr-xr-x  7 32084 wikidev  4096 Feb  7  2023 event_analytics
drwxr-xr-x 11 32084 wikidev  4096 Jan 20  2023 iceberg_experiments
drwxr-xr-x  2 32084 wikidev  4096 Jul 18  2024 intel
drwxr-xr-x  3 32084 wikidev  4096 Feb 19 17:04 nltk_data
drwxr-xr-x 13 32084 wikidev  4096 Jul  4  2023 nores
drwxr-xr-x 15 32084 wikidev  4096 May 28  2024 one_offs
drwxr-xr-x 21 32084 wikidev  4096 Jun 18 15:51 scratch
drwxr-xr-x 20 32084 wikidev 12288 May 28  2024 section_alignment
-rw-r--r--  1 32084 wikidev  1737 Aug 27  2024 subsampled_revisions.csv
drwxrwxr-x 55 32084 wikidev  4096 May  7 20:40 workspace

====== stat1011 ======
total 1557460
-rw-r--r-- 1 mnz wikidev    1198141 Mar  8 00:54 anchor.bloom
-rw------- 1 mnz wikidev 1593567021 Mar  7 18:30 conda-2025-03-07T18.01.20_mnz.tgz
-rw-r--r-- 1 mnz wikidev      63661 Mar 21 02:36 content.ipynb

======= HDFS ========
Found 71 items
drwxr-xr-x   - mnz mnz          0 2025-05-05 00:00 /user/mnz/.Trash
drwx------   - mnz mnz          0 2024-10-06 15:49 /user/mnz/.flink
drwxr-x---   - mnz mnz          0 2024-10-09 18:09 /user/mnz/.skein
drwxr-xr-x   - mnz mnz          0 2025-06-24 21:12 /user/mnz/.sparkStaging
drwx------   - mnz mnz          0 2023-08-07 16:43 /user/mnz/.staging
drwxr-xr-x   - mnz mnz          0 2021-12-02 14:43 /user/mnz/OUTPUT_DIR
drwxr-xr-x   - mnz mnz          0 2023-09-04 16:55 /user/mnz/archives
drwxr-xr-x   - mnz mnz          0 2024-06-04 03:07 /user/mnz/article_embeddings
drwxr-xr-x   - mnz mnz          0 2024-05-07 23:51 /user/mnz/article_topics
drwxr-x---   - mnz mnz          0 2025-03-08 02:35 /user/mnz/daily_diff.parquet
drwxr-x---   - mnz mnz          0 2022-02-18 16:06 /user/mnz/dataframe_checkpoint
drwxr-xr-x   - mnz mnz          0 2024-01-29 19:52 /user/mnz/embeddings_wikitext
drwxr-x---   - mnz mnz          0 2024-10-26 18:06 /user/mnz/enterprise-html
drwxr-x---   - mnz mnz          0 2024-10-28 12:21 /user/mnz/enterprise_html
drwxr-x---   - mnz mnz          0 2023-01-20 14:34 /user/mnz/enwiki_bad.parquet
-rw-r-----   3 mnz mnz        888 2024-10-06 15:21 /user/mnz/events.json
drwxr-x---   - mnz mnz          0 2022-08-30 12:24 /user/mnz/experimental
-rw-r-----   3 mnz mnz         20 2024-04-25 21:53 /user/mnz/experimental.pkl
-rw-r-----   3 mnz mnz          0 2024-04-25 21:45 /user/mnz/experimental.txt
drwxr-x---   - mnz mnz          0 2023-08-07 13:31 /user/mnz/experiments
drwxr-x---   - mnz mnz          0 2022-03-23 01:35 /user/mnz/exps
-rw-r-----   3 mnz mnz          0 2025-05-09 17:26 /user/mnz/glarchive
-rw-r-----   3 mnz mnz  127205237 2025-02-18 13:34 /user/mnz/hash_cost.tsv
-rw-r-----   3 mnz mnz  136122702 2025-02-18 11:36 /user/mnz/hash_cost_test.json
drwxr-xr-x   - mnz mnz          0 2025-03-31 16:08 /user/mnz/iceberg
drwxr-xr-x   - mnz mnz          0 2022-07-21 12:05 /user/mnz/image_recommendation
drwxr-x---   - mnz mnz          0 2022-07-21 14:15 /user/mnz/image_recommendation_unaggregated
drwxr-xr-x   - mnz mnz          0 2022-08-10 10:28 /user/mnz/imagerecs
drwxr-x---   - mnz mnz          0 2022-10-10 22:50 /user/mnz/integrity
drwxr-x---   - mnz mnz          0 2023-12-06 16:28 /user/mnz/list_building
drwxr-x---   - mnz mnz          0 2023-11-27 23:40 /user/mnz/list_building2
drwxr-x---   - mnz mnz          0 2024-07-06 18:02 /user/mnz/mwaddlink
drwxr-x---   - mnz mnz          0 2024-07-03 15:40 /user/mnz/mwaddlink-original
drwxr-x---   - mnz mnz          0 2024-07-03 15:48 /user/mnz/mwaddlink-prov
drwxr-x---   - mnz mnz          0 2024-05-28 05:32 /user/mnz/one_offs_archive
drwxr-xr-x   - mnz mnz          0 2022-06-24 14:54 /user/mnz/ores
drwxr-x---   - mnz mnz          0 2022-06-22 14:17 /user/mnz/ores_data
drwxr-x---   - mnz mnz          0 2025-05-28 15:13 /user/mnz/output
drwxr-x---   - mnz mnz          0 2024-09-19 18:35 /user/mnz/ref_model
-rw-r-----   3 mnz mnz  141557760 2025-05-09 18:10 /user/mnz/research_archive
-rw-r-----   3 mnz mnz 1427426337 2025-05-09 18:12 /user/mnz/research_archive_br
-rw-r-----   3 mnz mnz     167741 2024-07-10 16:09 /user/mnz/revert_risk_model.pkl
drwxr-xr-x   - mnz mnz          0 2024-08-27 19:06 /user/mnz/revert_risk_predictions
drwxr-x---   - mnz mnz          0 2022-09-01 17:36 /user/mnz/reverts_predict
drwxr-xr-x   - mnz mnz          0 2022-08-25 11:46 /user/mnz/revision_text
drwxr-xr-x   - mnz mnz          0 2023-01-18 11:50 /user/mnz/risk_observatory
drwxr-x---   - mnz mnz          0 2024-05-03 13:04 /user/mnz/rrla-training
drwxrwxrwx   - mnz mnz          0 2024-01-16 07:02 /user/mnz/rsds
drwxr-xr-x   - mnz mnz          0 2022-11-23 14:02 /user/mnz/samples
-rw-r-----   3 mnz mnz 1427426337 2025-05-09 17:09 /user/mnz/scratch
drwxr-xr-x   - mnz mnz          0 2021-12-16 10:52 /user/mnz/secmap
drwxr-xr-x   - mnz mnz          0 2021-12-23 14:47 /user/mnz/secmap_dir
drwxr-xr-x   - mnz mnz          0 2022-02-01 11:39 /user/mnz/secmap_embeddings
drwxr-xr-x   - mnz mnz          0 2022-04-27 11:56 /user/mnz/secmap_experimental
drwxr-xr-x   - mnz mnz          0 2022-02-02 10:21 /user/mnz/secmap_features
drwxr-x---   - mnz mnz          0 2022-02-14 10:37 /user/mnz/secmap_ground_truth
drwxr-xr-x   - mnz mnz          0 2022-02-01 23:50 /user/mnz/secmap_out
drwxr-xr-x   - mnz mnz          0 2021-12-21 14:47 /user/mnz/secmap_output
drwxr-xr-x   - mnz mnz          0 2022-06-27 15:29 /user/mnz/secmap_results
drwxr-xr-x   - mnz mnz          0 2022-03-09 17:11 /user/mnz/secmap_sections
drwxr-x---   - mnz mnz          0 2022-02-14 10:55 /user/mnz/secmap_top_sections
drwxr-x---   - mnz mnz          0 2022-03-24 06:09 /user/mnz/secmap_training
drwxr-xr-x   - mnz mnz          0 2022-04-19 20:43 /user/mnz/secmap_training_data
drwxr-xr-x   - mnz mnz          0 2022-05-09 12:37 /user/mnz/section_images
drwxr-x---   - mnz mnz          0 2023-01-20 19:18 /user/mnz/st
drwxr-x---   - mnz mnz          0 2022-05-05 10:48 /user/mnz/streaming_experiments
drwxr-x---   - mnz mnz          0 2024-10-06 15:23 /user/mnz/tinyevents
drwxr-x---   - mnz mnz          0 2025-03-25 11:09 /user/mnz/traffic_patterns
drwxr-x---   - mnz mnz          0 2025-03-20 18:58 /user/mnz/trwiki_2025_02_features.parquet
drwxr-xr-x   - mnz mnz          0 2025-06-10 00:16 /user/mnz/warehouse
drwxr-x---   - mnz mnz          0 2023-03-16 20:41 /user/mnz/wikidata_embeddings
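
A listing of this length is easier to review with a quick summary. The following is a minimal sketch, not part of `check-user-leftovers`: the helper name `summarize_ls` is hypothetical, and it assumes the 8-column `hdfs dfs -ls` layout shown above. Since `-ls` reports a size of 0 for directories, the byte total covers top-level files only.

```shell
# Summarize `hdfs dfs -ls` output read from stdin.
# Assumed column layout (as in the listing above):
#   perms  replication  owner  group  size  date  time  path
# The "Found N items" header has fewer than 8 fields and is skipped.
summarize_ls() {
  awk '
    NF >= 8 && $1 ~ /^d/ { dirs++ }
    NF >= 8 && $1 ~ /^-/ { files++; bytes += $5 }
    END { printf "dirs=%d files=%d bytes=%.0f\n", dirs, files, bytes }
  '
}

# Possible usage on a host with HDFS access (not run here):
#   hdfs dfs -ls /user/mnz | summarize_ls
```

For recursive totals, something like `hdfs dfs -du -s -h /user/mnz` would likely be the better tool, because it accounts for the contents of the directories.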

Thanks @BTullis for the ping! I will check with @fkaelin and report back here.

I would like to keep this folder until the end of Q2 in case there is a need to dig deeper for recent projects Muniza was working on.

Noting here that there are 7.7M HDFS files associated with user mnz. See latest HDFS usage dashboard at https://superset.wikimedia.org/superset/dashboard/409/?native_filters_key=bIfs5MQRkyBgn-V4IHFj_xsMhf3haMa0acFKr7ny_R-tWM1ZaxKHinKcr3X7yFr5.
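
The figure above comes from the Superset dashboard. One possible way to cross-check it from the command line is `hdfs dfs -count`, which prints the directory count, file count, content size and path name for a given path. The sketch below only reformats that output; the helper name `report_count` is hypothetical and the command was not run as part of this task.

```shell
# Reformat `hdfs dfs -count` output read from stdin.
# Default column order: DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
report_count() {
  awk '{ printf "%s: %s dirs, %s files, %s bytes\n", $4, $1, $2, $3 }'
}

# Possible usage on a host with HDFS access (not run here):
#   hdfs dfs -count /user/mnz | report_count
```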

This user still has ~2M HDFS files. Can we delete?

Let's wait until January unless there is a reason to do it earlier.

@Miriam / @fkaelin : we are now in January. Can we delete those files?

Request from @fkaelin: can we chown those files to fkaelin so that he can investigate and clean up?

I'm running this command now:

btullis@an-coord1003:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R fkaelin:fkaelin /user/mnz/

It's taking a while because there are ~2M files, but it should complete before long.

Thank you, and apologies: I neglected to specify my shell user name, which is fab. Could you rerun, please?

Sorry, I should have checked that. I've made that change now @fkaelin.

btullis@an-coord1003:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R fab:fab /user/mnz/
btullis@an-coord1003:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -ls /user/mnz/|head
Found 71 items
drwxr-xr-x   - fab fab          0 2025-05-05 00:00 /user/mnz/.Trash
drwx------   - fab fab          0 2024-10-06 15:49 /user/mnz/.flink
drwxr-x---   - fab fab          0 2024-10-09 18:09 /user/mnz/.skein
drwxr-xr-x   - fab fab          0 2025-06-24 21:12 /user/mnz/.sparkStaging
drwx------   - fab fab          0 2023-08-07 16:43 /user/mnz/.staging
drwxr-xr-x   - fab fab          0 2021-12-02 14:43 /user/mnz/OUTPUT_DIR
drwxr-xr-x   - fab fab          0 2023-09-04 16:55 /user/mnz/archives
drwxr-xr-x   - fab fab          0 2024-06-04 03:07 /user/mnz/article_embeddings
drwxr-xr-x   - fab fab          0 2024-05-07 23:51 /user/mnz/article_topics
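
The `head` check above only confirms the first few top-level entries. A fuller check could filter a listing for anything not owned by the expected user. This is a sketch under assumptions: `not_owned_by` is a hypothetical helper, it relies on the owner being the third column of `hdfs dfs -ls` output, and it prints only the last whitespace-separated field, so paths containing spaces would be truncated.

```shell
# Print paths from `hdfs dfs -ls` output (stdin) whose owner differs
# from the expected user passed as the first argument.
not_owned_by() {
  expected="$1"
  awk -v u="$expected" 'NF >= 8 && $3 != u { print $NF }'
}

# Possible usage on a host with HDFS access (not run here; a recursive
# listing of ~2M files would take a while):
#   hdfs dfs -ls -R /user/mnz | not_owned_by fab
```

Empty output would indicate that every listed entry has the expected owner.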

To the best of my knowledge, all the relevant data and code from Muniza's work has been backed up.

Thanks. The needed data is salvaged - the directories can be removed.

Gehel claimed this task.

Cleaning up according to https://wikitech.wikimedia.org/wiki/Data_Platform_Engineering/Ops_week#Have_any_users_left_the_Foundation

Dropped Hive databases:

hive (default)> DROP DATABASE mnz CASCADE;
OK
Time taken: 10.589 seconds
hive (default)> DROP DATABASE mnz_test CASCADE;
OK
Time taken: 1.1 seconds

No Hive warehouse database directory found:

gehel@an-launcher1003:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -ls /user/hive/warehouse/ | grep -i mnz

Removed the HDFS home directory:

gehel@an-launcher1003:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r /user/mnz
26/02/05 10:38:01 INFO fs.TrashPolicyDefault: Moved: 'hdfs://analytics-hadoop/user/mnz' to trash at: hdfs://analytics-hadoop/user/hdfs/.Trash/Current/user/mnz1770287878873
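
The log line shows that `-rm -r` moved the directory into the `hdfs` user's trash rather than deleting it outright, so the space is presumably not released until that trash checkpoint is removed. As a sketch, the trash location can be pulled out of the log line for a follow-up check. The helper name `trash_dest` is hypothetical, and it assumes the `to trash at: <path>` wording shown above.

```shell
# Extract the trash destination from a TrashPolicyDefault "Moved:" log
# line read from stdin. Lines without the expected wording print nothing.
trash_dest() {
  sed -n 's/.*to trash at: \(.*\)$/\1/p'
}

# Possible follow-up on a host with HDFS access (not run here):
#   hdfs dfs -du -s -h "<path printed by trash_dest>"
#   hdfs dfs -expunge   # removes trash checkpoints older than the retention threshold
```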

Removed home directories from the regular filesystem on all nodes:

gehel@cumin1003:~$ sudo cumin 'C:profile::analytics::cluster::client or C:profile::hadoop::master or C:profile::hadoop::master::standby' 'rm -rf /home/mnz'
13 hosts will be targeted:
an-coord[1003-1004].eqiad.wmnet,an-launcher1003.eqiad.wmnet,an-master[1003-1004].eqiad.wmnet,an-test-client1002.eqiad.wmnet,an-test-coord1001.eqiad.wmnet,an-test-master[1001-1002].eqiad.wmnet,stat[1008-1011].eqiad.wmnet
OK to proceed on 13 hosts? Enter the number of affected hosts to confirm or "q" to quit: 13
===== NO OUTPUT =====                                                           
PASS |████████████████████████████████| 100% (13/13) [11:02<00:00, 50.95s/hosts]
FAIL |                                         |   0% (0/13) [11:02<?, ?hosts/s]
100.0% (13/13) success ratio (>= 100.0% threshold) for command: 'rm -rf /home/mnz'.
100.0% (13/13) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.