Check home/HDFS leftovers of dedcode
Closed, Resolved, Public

Description

Access for Djellel Difallah was removed. We need to check whether any data was left in the home dirs on stat*/HDFS, since they were part of the "analytics-privatedata-users" group. The Kerberos principal has already been removed. The points of contact for any questions about data to be retained are @MGerlach and @Isaac.

Event Timeline

====== stat1004 ======
total 0

====== stat1005 ======
total 266892
-rw-r--r--  1 22235 wikidev   4588519 Feb 12  2020 core_stable.tar.gz
drwxrwxr-x  7 22235 wikidev      4096 Feb 25  2020 data
-rw-r--r--  1 22235 wikidev     44229 Jan  8  2020 en.txt
-rw-rw-r--  1 22235 wikidev   3817346 Feb 26  2020 enwiki.dis
-rw-r--r--  1 22235 wikidev     19152 Oct 21  2019 ExamplesPySpark_b.ipynb
-rw-r--r--  1 22235 wikidev     15027 Jan  9  2020 f2
-rw-r--r--  1 22235 wikidev    151716 Jan  7  2020 hs_err_pid2535.log
-rw-r--r--  1 22235 wikidev         0 Oct 21  2019 __init__.py
-rw-r--r--  1 22235 wikidev     27192 Feb 12  2020 Link Page Arabic.ipynb
-rw-r--r--  1 22235 wikidev     20152 Feb 12  2020 Link Page Ar-Copy1.ipynb
-rw-r--r--  1 22235 wikidev     36227 Feb 25  2020 Link Page AR-Copy2.ipynb
-rw-r--r--  1 22235 wikidev    257252 Feb 21  2020 Link Page CS-Copy1.ipynb
-rw-r--r--  1 22235 wikidev     77777 Feb 12  2020 Link Page CS.ipynb
-rw-r--r--  1 22235 wikidev     40224 Jan  9  2020 Link Page En.ipynb
-rw-r--r--  1 22235 wikidev      9323 Feb 12  2020 Link Page KO-Copy1.ipynb
-rw-r--r--  1 22235 wikidev     29440 Feb 17  2020 Link Page KO-copy2.ipynb
-rw-r--r--  1 22235 wikidev     14964 Feb 12  2020 Link Page KO.ipynb
-rw-r--r--  1 22235 wikidev     26395 Feb 12  2020 Link Page Vi-Copy1.ipynb
-rw-r--r--  1 22235 wikidev     30704 Feb 21  2020 Link Page VI-Copy2.ipynb
-rw-r--r--  1 22235 wikidev      6161 Feb 12  2020 Link Page Vi.ipynb
-rw-r--r--  1 22235 wikidev 131796811 May 22  2019 lm-ar-opus-large-backward-v0.1.pt
-rw-r--r--  1 22235 wikidev 131796801 May 22  2019 lm-ar-opus-large-forward-v0.1.pt
-rw-r--r--  1 22235 wikidev     20827 Oct 21  2019 MostPopularUserAgents.ipynb
drwxr-xr-x 12 22235 wikidev      4096 Feb 12  2020 nltk_data
drwxrwxr-x  2 22235 wikidev      4096 Feb 25  2020 notebooks
drwxr-xr-x  2 22235 wikidev      4096 Feb 10  2020 output.singletons
-rw-r--r--  1 22235 wikidev     17408 Feb 12  2020 Parser.ipynb
-rw-rw-r--  1 22235 wikidev      2539 Apr 19  2020 pig_1587337383578.log
drwxr-xr-x  5 22235 wikidev      4096 Feb  7  2020 polyglot_data
drwxr-xr-x  2 22235 wikidev      4096 Oct 21  2019 __pycache__
drwxr-xr-x 10 22235 wikidev      4096 Feb 12  2020 pywikibot
-rw-------  1 22235 wikidev      1375 Feb 12  2020 pywikibot.lwp
drwxr-xr-x  2 22235 wikidev      4096 Nov  7  2019 repos
-rw-r--r--  1 22235 wikidev      3464 Feb 25  2020 reqs.txt
-rw-r--r--  1 22235 wikidev     63559 Oct 21  2019 rPCA.ipynb
-rw-r--r--  1 22235 wikidev      2781 Oct 21  2019 rpca.py
-rw-r--r--  1 22235 wikidev     12162 Dec 10  2019 SpaCy test.ipynb
-rw-r--r--  1 22235 wikidev      2448 Jan  8  2020 test.py
-rw-r--r--  1 22235 wikidev        35 Feb 17  2020 throttle.ctrl
-rw-r--r--  1 22235 wikidev      9040 Jan 18  2020 Untitled1.ipynb
-rw-r--r--  1 22235 wikidev     38970 Jan 31  2020 Untitled2.ipynb
-rw-r--r--  1 22235 wikidev      4636 Feb  7  2020 Untitled3.ipynb
-rw-r--r--  1 22235 wikidev      7954 Jan  8  2020 Untitled.ipynb
-rw-------  1 22235 wikidev       251 Feb 12  2020 user-config.py
-rw-------  1 22235 wikidev       494 Feb 12  2020 user-password.py
drwxrwxr-x  6 22235 wikidev      4096 Feb 25  2020 venv
-rw-r--r--  1 22235 wikidev    197284 Feb  9  2020 viwiki.dis

====== stat1006 ======
total 3184876
drwxr-xr-x 21 22235 wikidev       4096 Aug 25  2020 backup
drwxrwxr-x  3 22235 wikidev       4096 Jun 19  2020 bipart.csv
-rw-r--r--  1 22235 wikidev   15258830 Dec 11  2019 csanchors.pickle
-rw-rw-r--  1 22235 wikidev     722507 Jun 23  2020 diff2
drwxrwxr-x 13 22235 wikidev       4096 May 29  2020 diff-match-patch
-rw-rw-r--  1 22235 wikidev      72153 May 29  2020 diff_match_patch-current.jar
-rw-rw-r--  1 22235 wikidev       5579 Jun 23  2020 diff.tmp
drwxrwxr-x  3 22235 wikidev       4096 Jun  7  2020 enwiki_dataset.parquet
-rw-r--r--  1 22235 wikidev   19016342 Dec 11  2019 koanchors.pickle
-rw-rw-r--  1 22235 wikidev    1685398 May 26  2020 lucene-analyzers-common-8.5.2.jar
-rw-rw-r--  1 22235 wikidev    3475136 May 26  2020 lucene-core-8.5.2.jar
-rw-rw-r--  1 22235 wikidev 3198873213 Jun 20  2020 m2vbi.csv
drwxrwxr-x  3 22235 wikidev       4096 Aug  8  2020 nltk_data
drwxrwxr-x  4 22235 wikidev       4096 Aug 15  2020 notebooks
drwxrwxr-x  3 22235 wikidev       4096 Jun  8  2020 repo
drwxrwxr-x  3 22235 wikidev       4096 Aug 26  2020 sock
-rw-rw-r--  1 22235 wikidev       7890 Jun 23  2020 sock_no_indefinite.csv
-rw-rw-r--  1 22235 wikidev    7236492 Jun 23  2020 sock_parse_comment.csv
-rw-rw-r--  1 22235 wikidev    3973911 Jun 22  2020 socks.csv
-rw-rw-r--  1 22235 wikidev    7618403 Jun 23  2020 socks_full.csv
-rw-rw-r--  1 22235 wikidev    3000865 Jun 23  2020 socks_template.csv
-rw-rw-r--  1 22235 wikidev      20487 Jun 23  2020 tmp_apostroph.csv
drwxrwxr-x  7 22235 wikidev       4096 Jun  8  2020 venv
-rw-r--r--  1 22235 wikidev     277968 Jun 17  2020 whitelist.csv

====== stat1007 ======
total 24421704
-rw-r--r-- 1 22235 wikidev    11503531 Feb 13  2020 000000_0
drwxrwxr-x 3 22235 wikidev        4096 Dec 12  2019 anom
-rw-rw-r-- 1 22235 wikidev  5601051999 May 25  2020 bios_full.csv.bz2
-rw-rw-r-- 1 22235 wikidev  6734633300 May 25  2020 bios_full.tgz
-rw-rw-r-- 1 22235 wikidev         504 May 25  2020 bio.sql
-rw-rw-r-- 1 22235 wikidev 10737013804 May 25  2020 bios_wikidata.csv
-rw-rw-r-- 1 22235 wikidev  1357446538 May 25  2020 bios_wikidata.tgz
-rw-rw-r-- 1 22235 wikidev      304698 May 29  2015 brickhouse-0.7.1.jar
drwxrwxr-x 3 22235 wikidev        4096 Dec  5  2019 data
-rw-rw-r-- 1 22235 wikidev   191964391 May 28  2020 full9.csv
-rw-rw-r-- 1 22235 wikidev    80675214 May 28  2020 full9.csv.gz
-rw-r--r-- 1 22235 wikidev     6268756 May 25  2020 ids.csv
drwxrwxr-x 5 22235 wikidev        4096 Dec  8  2019 linkrec
-rw-rw-r-- 1 22235 wikidev      865298 Jun 10  2020 master.csv
-rw-rw-r-- 1 22235 wikidev     8314882 Jun 10  2020 master_original.csv
-rw-rw-r-- 1 22235 wikidev         226 Jun  9  2020 master.sql
drwxrwxr-x 4 22235 wikidev        4096 Apr 19  2020 nltk_data
-rw-rw-r-- 1 22235 wikidev    27463250 Apr 20  2020 nltk_data.zip
-rw-rw-r-- 1 22235 wikidev        7917 Dec 17  2019 out
-rw-rw-r-- 1 22235 wikidev        3910 Feb 10  2020 pig_1581330083849.log
-rw-r--r-- 1 22235 wikidev    11503531 Feb 14  2020 redirect
drwxrwxr-x 2 22235 wikidev        4096 Jan 20  2020 resultsMapping-CoOcurrenceCountPandas
drwxrwxr-x 2 22235 wikidev        4096 Jan 20  2020 scp
drwxrwxr-x 2 22235 wikidev        4096 Jan 20  2020 SectionsCharacterization
drwxrwxr-x 8 22235 wikidev        4096 Aug 10  2020 sockpuppet
-rw-rw-r-- 1 22235 wikidev         137 May 28  2020 socks
-rw-rw-r-- 1 22235 wikidev     1371534 May 28  2020 socks.csv
-rw-rw-r-- 1 22235 wikidev         236 May 25  2020 wikid
-rw-r--r-- 1 22235 wikidev   237310243 Jan 20  2020 wikidataSixLanguages.csv.g
drwxrwxr-x 2 22235 wikidev        4096 Jan 20  2020 wikidataSixLanguages.csv.gz

====== stat1008 ======
total 4
drwxrwxr-x 6 22235 wikidev 4096 Oct 13 08:47 venv

======= HDFS ========
Found 36 items
drwx------   - dedcode dedcode          0 2020-09-16 00:00 /user/dedcode/.Trash
drwxr-xr-x   - dedcode dedcode          0 2020-08-16 21:59 /user/dedcode/.sparkStaging
drwx------   - dedcode dedcode          0 2020-08-16 20:57 /user/dedcode/.staging
drwxr-xr-x   - dedcode dedcode          0 2020-06-19 14:04 /user/dedcode/bipart.csv
-rw-r--r--   3 dedcode dedcode     304698 2020-06-07 09:08 /user/dedcode/brickhouse-0.7.1.jar
-rw-r--r--   3 dedcode dedcode       2545 2020-04-20 07:33 /user/dedcode/comment_properties_mapper2.py
-rw-r--r--   3 dedcode dedcode   37347264 2020-04-20 05:23 /user/dedcode/denv.zip
drwxrwxrwx   - dedcode dedcode          0 2020-06-08 22:27 /user/dedcode/embeddings
drwxr-xr-x   - dedcode dedcode          0 2020-05-29 19:04 /user/dedcode/graph
drwxr-xr-x   - dedcode dedcode          0 2020-02-11 13:41 /user/dedcode/linkrec
drwxr-xr-x   - dedcode dedcode          0 2020-02-10 15:15 /user/dedcode/ltrees
drwxr-xr-x   - dedcode dedcode          0 2020-06-20 14:15 /user/dedcode/m2vbipart.csv
-rw-r--r--   3 dedcode dedcode   27463250 2020-04-20 05:40 /user/dedcode/nltk_data.zip
drwxr-xr-x   - dedcode dedcode          0 2020-02-25 17:03 /user/dedcode/notebooks
drwxr-xr-x   - dedcode dedcode          0 2020-02-12 03:46 /user/dedcode/output.pairs
drwxr-xr-x   - dedcode dedcode          0 2020-02-12 03:46 /user/dedcode/output.singletons
drwxr-xr-x   - dedcode dedcode          0 2020-02-12 03:46 /user/dedcode/output.triples
drwxr-xr-x   - dedcode dedcode          0 2020-05-29 19:12 /user/dedcode/simplewiki.parquet
drwxr-xr-x   - dedcode dedcode          0 2020-05-30 10:31 /user/dedcode/sock_joal
drwxr-xr-x   - dedcode dedcode          0 2020-06-23 13:54 /user/dedcode/sock_parse_comment.csv
drwxr-xr-x   - dedcode dedcode          0 2020-06-23 00:33 /user/dedcode/sock_template.csv
drwxr-xr-x   - dedcode dedcode          0 2020-06-07 22:17 /user/dedcode/sockdata
drwxr-xr-x   - dedcode dedcode          0 2020-06-22 10:15 /user/dedcode/socks.csv
-rw-r--r--   3 dedcode dedcode  207437778 2020-04-20 05:53 /user/dedcode/test_spark_venv.zip
-rw-r--r--   3 dedcode dedcode       3854 2020-04-20 07:17 /user/dedcode/textproperties.py
drwxr-xr-x   - dedcode dedcode          0 2020-04-19 23:30 /user/dedcode/token_out
drwxr-xr-x   - dedcode dedcode          0 2020-02-12 04:03 /user/dedcode/vi.pab_table
drwxr-xr-x   - dedcode dedcode          0 2020-02-12 04:04 /user/dedcode/vi.pabc_table
drwxr-xr-x   - dedcode dedcode          0 2020-02-12 03:51 /user/dedcode/vi.pairs
drwxr-xr-x   - dedcode dedcode          0 2020-02-12 03:51 /user/dedcode/vi.singletons
drwxr-xr-x   - dedcode dedcode          0 2020-02-12 03:51 /user/dedcode/vi.triples
drwxr-xr-x   - dedcode dedcode          0 2020-04-20 05:27 /user/dedcode/virtualenv
drwxr-xr-x   - dedcode dedcode          0 2020-04-20 07:33 /user/dedcode/wikidiff_output_feat_split
drwxr-xr-x   - dedcode dedcode          0 2019-11-22 03:44 /user/dedcode/wikidiff_output_feat_split5
drwxr-xr-x   - dedcode dedcode          0 2019-11-22 00:21 /user/dedcode/wikidiff_output_new_split5
drwxr-xr-x   - dedcode dedcode          0 2020-05-28 22:28 /user/dedcode/wikidiff_output_split

====== Hive =========
drwxrwxrwt   - dedcode        hdfs                      0 2020-05-25 12:39 /user/hive/warehouse/dedcode.db/bio_pageids
drwxr-xr-x   - dedcode        hadoop                    0 2020-05-31 14:05 /user/hive/warehouse/dedcode.db/df1
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-07 16:51 /user/hive/warehouse/dedcode.db/enwiki_history_agg2
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-07 21:11 /user/hive/warehouse/dedcode.db/enwiki_history_agg_compact
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-13 23:25 /user/hive/warehouse/dedcode.db/enwiki_history_agg_new
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-13 23:07 /user/hive/warehouse/dedcode.db/enwiki_history_agg_new_test
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-15 06:06 /user/hive/warehouse/dedcode.db/enwiki_history_agg_part
drwxr-xr-x   - dedcode        hadoop                    0 2020-08-14 11:49 /user/hive/warehouse/dedcode.db/enwiki_history_agg_sample
drwxr-xr-x   - dedcode        hadoop                    0 2020-08-14 07:41 /user/hive/warehouse/dedcode.db/enwiki_history_agg_sample3
drwxr-xr-x   - dedcode        hadoop                    0 2020-08-14 07:27 /user/hive/warehouse/dedcode.db/enwiki_history_agg_sample4
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-15 03:13 /user/hive/warehouse/dedcode.db/enwiki_history_agg_temp
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-15 01:25 /user/hive/warehouse/dedcode.db/enwiki_history_agg_tmp
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-15 02:27 /user/hive/warehouse/dedcode.db/enwiki_history_agg_tmp2
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-16 10:10 /user/hive/warehouse/dedcode.db/enwiki_history_diff
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-15 03:45 /user/hive/warehouse/dedcode.db/enwiki_history_diff_part
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-12 20:06 /user/hive/warehouse/dedcode.db/enwiki_history_diff_part_talk
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-12 22:18 /user/hive/warehouse/dedcode.db/enwiki_history_part
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-12 10:56 /user/hive/warehouse/dedcode.db/enwiki_history_part_talk
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-06 09:50 /user/hive/warehouse/dedcode.db/enwiki_history_part_year
drwxr-xr-x   - dedcode        hadoop                    0 2020-08-16 07:43 /user/hive/warehouse/dedcode.db/enwiki_ig
drwxr-xr-x   - dedcode        hadoop                    0 2020-08-16 09:47 /user/hive/warehouse/dedcode.db/enwiki_ig_bis
drwxr-xr-x   - dedcode        hadoop                    0 2020-08-16 00:50 /user/hive/warehouse/dedcode.db/enwiki_ig_prep
drwxr-xr-x   - dedcode        hadoop                    0 2020-08-16 20:03 /user/hive/warehouse/dedcode.db/enwiki_interaction_graph
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-07 21:33 /user/hive/warehouse/dedcode.db/enwiki_sock_dataset
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-08 07:39 /user/hive/warehouse/dedcode.db/enwiki_sock_dataset_full
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-16 21:30 /user/hive/warehouse/dedcode.db/enwiki_user_feat
drwxrwxrwt   - dedcode        hadoop                    0 2020-05-25 10:25 /user/hive/warehouse/dedcode.db/ids_dataset
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-03 23:22 /user/hive/warehouse/dedcode.db/simple_history_part_year
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-17 09:45 /user/hive/warehouse/dedcode.db/sock_dataset
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-17 12:41 /user/hive/warehouse/dedcode.db/sock_dataset_whitelist
drwxrwxrwt   - dedcode        hdfs                      0 2020-06-07 11:52 /user/hive/warehouse/dedcode.db/sock_id
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-15 16:01 /user/hive/warehouse/dedcode.db/sock_ids
drwxr-xr-x   - dedcode        hadoop                    0 2020-08-13 22:03 /user/hive/warehouse/dedcode.db/sock_label
drwxr-xr-x   - dedcode        hadoop                    0 2020-06-07 13:31 /user/hive/warehouse/dedcode.db/tmp_diffs
drwxrwxrwt   - dedcode        hdfs                      0 2020-04-20 14:37 /user/hive/warehouse/dedcode.db/users_data
drwxrwxrwt   - dedcode        hdfs                      0 2020-05-20 23:33 /user/hive/warehouse/dedcode.db/wdhuman
drwxrwxrwt   - dedcode        hdfs                      0 2020-05-21 02:44 /user/hive/warehouse/dedcode.db/wdhumanids
drwxrwxrwt   - dedcode        hadoop                    0 2020-06-17 09:33 /user/hive/warehouse/dedcode.db/whitelist
drwxrwxrwt   - dedcode        hdfs                      0 2020-05-25 11:58 /user/hive/warehouse/dedcode.db/wiki_data
drwxrwxrwt   - dedcode        hdfs                      0 2020-05-25 11:42 /user/hive/warehouse/dedcode.db/wiki_ids

@Isaac @MGerlach could you please check whether we have to keep anything or whether we can drop it all? :)

elukey triaged this task as Medium priority. Mar 10 2021, 2:55 PM

@elukey thanks for the ping. I just talked with Djellel.

  • hdfs/hive: all data can be dropped
  • stat100X[5,6,7,8]: /user/dedcode/: is it possible to keep this for some time? We are mainly interested in keeping potentially relevant code (*.py, *.ipynb, *.java). Do you have any suggestions on how to back up those files without going through every folder manually?
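One possible way to gather the code files without visiting every folder by hand is sketched below. It is only an illustration, not something that was run for this task: the function name `collect_code_files`, the skip list (taken from directory names visible in the listings above, such as `venv` and `nltk_data`), and the idea of copying into a separate backup directory are all assumptions. The destination should live outside the source tree.

```python
import shutil
from pathlib import Path

# Extensions named in the comment above.
CODE_SUFFIXES = {".py", ".ipynb", ".java"}
# Assumption: directories that hold dependencies or caches, not user code.
SKIP_DIRS = {"venv", "__pycache__", "nltk_data", "polyglot_data"}


def collect_code_files(src_root, dest_root):
    """Copy code files from src_root into dest_root, keeping relative paths.

    Returns the list of relative paths that were copied.
    """
    src_root = Path(src_root)
    dest_root = Path(dest_root)
    copied = []
    for path in sorted(src_root.rglob("*")):
        rel = path.relative_to(src_root)
        # Skip anything that sits below an excluded directory.
        if any(part in SKIP_DIRS for part in rel.parts[:-1]):
            continue
        if path.is_file() and path.suffix in CODE_SUFFIXES:
            target = dest_root / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)  # copy2 also keeps timestamps
            copied.append(rel)
    return copied
```

Called as `collect_code_files("/home/dedcode", "/some/backup/dir")` (the second path is hypothetical), it would leave the large CSV, pickle and model files behind and keep only notebooks and scripts.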

@MGerlach I can move the /home/dedcode dirs under your username. What we care about is that an active user maintains/owns them, so we can ping them in case there are issues etc. Would that be ok? Then you'll be in charge of dropping data when needed :)

@MGerlach I created /home/mgerlach/dedcode_home on the stat boxes and changed the file ownership to your username, lemme know if you can read the files etc.
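For illustration only, a recursive ownership change like the one described could be sketched as follows. The thread does not say which commands were actually used, so the helper `chown_tree` is hypothetical. It takes a numeric uid, leaves the group untouched, and does not follow symlinks; handing files to a different user normally requires root.

```python
import os


def chown_tree(root, uid):
    """Set the owner of root and everything below it to uid.

    The group is left unchanged (gid -1) and symlinks themselves are
    re-owned rather than their targets. Returns the number of entries touched.
    """
    os.chown(root, uid, -1, follow_symlinks=False)
    changed = 1
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            os.chown(os.path.join(dirpath, name), uid, -1, follow_symlinks=False)
            changed += 1
    return changed
```

The uid for a username can be looked up with `pwd.getpwnam(name).pw_uid` from the standard library.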

I am going to proceed to drop hdfs and hive data :)

Mentioned in SAL (#wikimedia-analytics) [2021-03-11T08:15:56Z] <elukey> hdfs dfs -rmr /user/dedcode on an-launcher1002 (data in trash for a month) - T276748

Mentioned in SAL (#wikimedia-analytics) [2021-03-11T08:25:46Z] <elukey> drop database dedcode cascade in hive - T276748

elukey claimed this task.

All cleaned up! Please re-open if needed :)

Perfect. I was able to read the files on stat1005. This looks good. Thanks again.