
Check home/HDFS leftovers of goransm (timeboxed 0,5 days)
Closed, Resolved · Public

Description

Access for Goran Milovanovic (goransm) has been removed. We need to check whether any data was left in their home directories on stat*/HDFS, since they were a member of the "analytics-privatedata-users" group.

The Kerberos principal has already been removed. The point of contact for deciding whether to keep any data is @AndrewTavis_WMDE or @Manuel.

Event Timeline

Gehel triaged this task as Low priority. Feb 23 2024, 9:19 AM

Thank you for making this, @MoritzMuehlenhoff! Linking the related cleanup task for Gerrit, T357697: Archive WMDE analytics Gerrit repositories. Specifically in relation to this task, in T356618 we found that hdfs:///tmp/wmde/analytics contained data that wasn't deleted by the data processes. This data will be removed by your tmp deletion schedule, but it might be nice to clear it, or at least to know when to check that it's empty.

A suggestion from your end for the user tables was that they be archived and restricted to admins only. I do not believe that any of the temporary data in the user tables is of long-term importance to WMDE. The data of value are the end results that are stored in published/datasets/wmde-analytics-engineering. I'm doing some explorations for bringing back some of the pipelines in T358254, so for now we should keep the published data so that it can be used as a reference for replication. Archiving the rest makes sense for now, but getting a size estimate might also make sense so that we can decide if we need to discuss deletion for space concerns.
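For the size estimate, here is a minimal Python sketch that sums the file sizes from `hdfs dfs -ls` style output. The function name `summarize_hdfs_listing` is hypothetical, and the sample lines are copied from the HDFS listing posted in this task. It only counts the logical size of plain files; directories show up with size 0 in this kind of listing, so their contents are not included.

```python
def summarize_hdfs_listing(listing: str) -> dict:
    """Sum the sizes of plain files in `hdfs dfs -ls` style output.

    Hypothetical helper for illustration. Directory entries are counted
    but contribute no bytes, because the listing reports them as size 0.
    """
    files = 0
    directories = 0
    file_bytes = 0
    for line in listing.splitlines():
        # Expected fields: permissions, replication, owner, group,
        # size, date, time, path
        fields = line.split(None, 7)
        if len(fields) != 8:
            continue  # skips headers such as "Found 55 items"
        permissions, size = fields[0], fields[4]
        if not size.isdigit():
            continue
        if permissions.startswith("d"):
            directories += 1
        else:
            files += 1
            file_bytes += int(size)
    return {"files": files, "directories": directories, "file_bytes": file_bytes}


# Three lines taken from the HDFS listing in this task.
SAMPLE_LISTING = """Found 55 items
-rw-r--r--   3 goransm goransm    2764777 2018-12-22 21:03 /user/goransm/Book_ItemIDs.csv
-rw-r--r--   3 goransm goransm 5067639259 2020-01-19 15:33 /user/goransm/ORESPredictions
drwxr-xr-x   - goransm goransm          0 2019-12-14 00:00 /user/goransm/wdcmsqoop
"""

if __name__ == "__main__":
    print(summarize_hdfs_listing(SAMPLE_LISTING))
```

To include directory contents as well, something like `hdfs dfs -du -s -h /user/goransm` run on the cluster would likely be the more direct route. The numbers above are also before replication, so the raw disk footprint would be larger.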

I have run our script to list user content in our various machines, the result is below.
@AndrewTavis_WMDE, I'll let you review it and let us know once you have copied anything you wish to keep, so that we can delete the rest.

====== stat1004 ======
total 28
drwxrwxr-x 5 16664 wikidev                     4096 Jul 25  2020 Analytics
drwxrwxr-x 2 16664 wikidev                     4096 Jun 23  2020 Experiments
drwxr-xr-x 5 16664 wikidev                     4096 Jul 23  2017 _miscWMDE
drwxr-xr-x 3 16664 wikidev                     4096 Jun  5  2017 _miscWMDE_1004
drwxrwxr-x 3 16664 wikidev                     4096 Sep 30  2018 R
drwxrwxr-x 3 16664 wikidev                     4096 Nov 27  2019 Research
drwxrwxr-x 2 16664 analytics-privatedata-users 4096 Jul 25  2020 wdUsagePerPage

====== stat1005 ======
total 964
drwxrwxr-x 9 16664 wikidev   4096 Oct 25  2021 Analytics
-rw-r--r-- 1 16664 wikidev  35757 Sep  1  2021 BotEdits_perProject.ipynb
-rw-rw-r-- 1 16664 wikidev   2194 Sep 21  2020 crontabstat1005.txt
-rw-r--r-- 1 16664 wikidev  16557 May 15  2021 DataModelTerms_20210228_Updates.ipynb
-rw-r--r-- 1 16664 wikidev   5506 Nov  2  2021 dewiki_NewEds_2021.ipynb
-rw-r--r-- 1 16664 wikidev  15344 Sep 28  2021 QCF_M2_Test.ipynb
-rw-r--r-- 1 16664 wikidev  37833 Jun 29  2021 QuratorCuriousFacts_Separators.ipynb
-rw-r--r-- 1 16664 wikidev  42314 Dec 25  2020 Qurator_M1.ipynb
drwxrwxr-x 3 16664 wikidev   4096 Feb 25  2020 R
-rw-r--r-- 1 16664 wikidev     38 Feb  6  2021 snapshot_query.hql
-rw-r--r-- 1 16664 wikidev     72 May  1  2021 Untitled1.ipynb
-rw-r--r-- 1 16664 wikidev      0 Jan  3  2021 untitled1.txt
-rw-r--r-- 1 16664 wikidev     72 May 15  2021 Untitled2.ipynb
-rw-r--r-- 1 16664 wikidev    913 May 15  2021 Untitled3.ipynb
-rw-r--r-- 1 16664 wikidev  21950 Jun 30  2021 Untitled4.ipynb
-rw-r--r-- 1 16664 wikidev     72 Aug 12  2021 Untitled5.ipynb
-rw-r--r-- 1 16664 wikidev  20060 Apr  8  2021 Untitled.ipynb
-rw-r--r-- 1 16664 wikidev      0 Dec  9  2020 untitled.txt
drwxr-xr-x 7 16664 wikidev   4096 May 25  2020 venv
-rw-r--r-- 1 16664 wikidev  11630 Dec 26  2020 wd_cluster_fetch_items_M2.ipynb
-rw-r--r-- 1 16664 wikidev  20110 May 15  2021 wd_cluster_fetch_items_M3.ipynb
-rw-r--r-- 1 16664 wikidev  34888 Mar 11  2021 WDCM_ETL_OTHER_TEST.ipynb
-rw-r--r-- 1 16664 wikidev  10783 Feb  8  2021 WDCM_Statements_Test.ipynb
-rw-r--r-- 1 16664 wikidev   5222 Aug  2  2020 WD_HumanEditsPerClass_RevisionTags.ipynb
-rw-r--r-- 1 16664 wikidev   6267 Feb  5  2021 WD_Inequality_Intake.ipynb
-rw-r--r-- 1 16664 wikidev  17699 Nov 18  2020 WD_Languages_Datamodel_CollectInit.ipynb
-rw-r--r-- 1 16664 wikidev   4678 Nov 15  2020 WD_Languages_Datamodel_EXP.ipynb
-rw-r--r-- 1 16664 wikidev   2238 Nov 19  2020 WD_MonthlyEditors.ipynb
-rw-r--r-- 1 16664 wikidev  21117 Aug 12  2021 WD_Sitelinks_WDAHP_202108.ipynb
-rw-r--r-- 1 16664 wikidev    195 Aug  8  2021 wd_statements_HiveQL_Query.hql
-rw-r--r-- 1 16664 wikidev  12506 Apr 11  2021 WD_Translations.ipynb
-rw-r--r-- 1 16664 wikidev  14204 Jun 30  2020 WHEIP_exps.ipynb
drwxrwxr-x 3 16664 wikidev   4096 Feb  2  2022 wikidata_analytics_examples
-rw-rw-r-- 1 16664 wikidev 537383 Dec  9  2020 WikidataRevisions_November2020.csv

====== stat1006 ======
total 48
drwxrwxr-x 4 16664 wikidev 4096 Sep  6  2017 misc_projects
drwxr-xr-x 2 16664 wikidev 4096 Jul  8  2017 myTemp
drwxr-xr-x 3 16664 wikidev 4096 May 22  2017 NewEds
-rw------- 1 16664 wikidev    0 May 23  2017 nohup.out
drwxrwxr-x 3 16664 wikidev 4096 May 20  2017 R
drwxrwxr-x 2 16664 wikidev 4096 Jul 26  2017 RPckg
drwxrwxr-x 5 16664 wikidev 4096 Sep 13  2017 RScripts
drwxrwxr-x 2 16664 wikidev 4096 May 25  2017 sqlIn
drwxrwxr-x 2 16664 wikidev 4096 May 25  2017 sqlOut
drwxrwxr-x 2 16664 wikidev 4096 May 20  2017 WDCM_Credentials
drwxrwxr-x 3 16664 wikidev 4096 May 17  2017 WDCM_DataIN
drwxrwxr-x 3 16664 wikidev 4096 May 17  2017 WDCM_DataOUT
drwxrwxr-x 2 16664 wikidev 4096 May 20  2017 WDCM_sql

====== stat1007 ======
total 28
drwxrwxr-x 8 16664 wikidev 4096 Aug 23  2020 Analytics
-rw-rw-r-- 1 16664 wikidev 2497 Sep 21  2020 crontabstat1007.txt
drwxrwxr-x 5 16664 wikidev 4096 Jan 27  2020 Experiments
drwxrwxr-x 3 16664 wikidev 4096 May  3  2019 Python3
drwxrwxr-x 3 16664 wikidev 4096 Dec 18  2018 R
drwxrwxr-x 5 16664 wikidev 4096 Aug 23  2020 RScripts
drwxr-xr-x 7 16664 wikidev 4096 Jul 17  2020 venv

====== stat1008 ======
total 16
drwxrwxr-x 8 16664 analytics-privatedata-users 4096 Oct 11  2021 Analytics
drwxrwxr-x 3 16664 wikidev                     4096 Jun 23  2020 R
drwxr-xr-x 3 16664 wikidev                     4096 Oct 11  2021 renv
drwxr-xr-x 7 16664 wikidev                     4096 Jun 24  2020 venv

====== stat1009 ======
total 0

====== stat1010 ======
total 0

======= HDFS ========
Found 55 items
drwx------   - goransm goransm          0 2021-12-03 00:00 /user/goransm/.Trash
drwxr-xr-x   - goransm goransm          0 2017-07-14 09:34 /user/goransm/.metadata
drwxr-xr-x   - goransm goransm          0 2021-11-02 17:39 /user/goransm/.sparkStaging
drwx------   - goransm goransm          0 2022-02-02 16:52 /user/goransm/.staging
drwxr-xr-x   - goransm goransm          0 2017-07-14 10:38 /user/goransm/.temp
-rw-r--r--   3 goransm goransm   68694522 2018-12-22 21:02 /user/goransm/Architectural-Structure_ItemIDs.csv
-rw-r--r--   3 goransm goransm    6422939 2018-12-22 21:02 /user/goransm/Astronomical-Object_ItemIDs.csv
-rw-r--r--   3 goransm goransm    2764777 2018-12-22 21:03 /user/goransm/Book_ItemIDs.csv
-rw-r--r--   3 goransm goransm   35465788 2018-12-22 21:03 /user/goransm/Chemical-Entities_ItemIDs.csv
-rw-r--r--   3 goransm goransm   14902316 2018-12-22 21:03 /user/goransm/Event_ItemIDs.csv
-rw-r--r--   3 goransm goransm   22704864 2018-12-22 21:03 /user/goransm/Gene_ItemIDs.csv
-rw-r--r--   3 goransm goransm  315855392 2018-12-22 21:03 /user/goransm/Geographical-Object_ItemIDs.csv
-rw-r--r--   3 goransm goransm  141475342 2018-12-22 21:04 /user/goransm/Human_ItemIDs.csv
-rw-r--r--   3 goransm goransm 5067639259 2020-01-19 15:33 /user/goransm/ORESPredictions
-rw-r--r--   3 goransm goransm   22320361 2018-12-22 21:04 /user/goransm/Organization_ItemIDs.csv
-rw-r--r--   3 goransm goransm   75044259 2018-12-22 21:04 /user/goransm/Taxon_ItemIDs.csv
-rw-r--r--   3 goransm goransm   21116753 2018-12-22 21:04 /user/goransm/Thoroughfare_ItemIDs.csv
drwxr-xr-x   - goransm goransm          0 2019-12-14 20:16 /user/goransm/WDCM_Biases_ETL_Test
drwxr-xr-x   - goransm goransm          0 2019-12-14 16:13 /user/goransm/WDCM_CollectedGeoItems
drwxr-xr-x   - goransm goransm          0 2019-12-14 16:11 /user/goransm/WDCM_CollectedItems
-rw-r--r--   3 goransm goransm  212723385 2018-12-22 21:04 /user/goransm/Wikimedia_Internal_ItemIDs.csv
-rw-r--r--   3 goransm goransm   51330011 2018-12-22 21:04 /user/goransm/Work-Of-Art_ItemIDs.csv
drwxr-x---   - goransm goransm          0 2021-11-02 17:37 /user/goransm/dewiki_revisions
-rw-r--r--   3 goransm goransm     480857 2018-05-10 14:59 /user/goransm/dfTrain1.csv
-rw-r--r--   3 goransm goransm     480878 2018-05-10 14:59 /user/goransm/dfTrain2.csv
-rw-r--r--   3 goransm goransm     480210 2018-05-10 14:59 /user/goransm/dfTrain3.csv
-rw-r--r--   3 goransm goransm     480875 2018-05-10 14:59 /user/goransm/dfTrain4.csv
-rw-r--r--   3 goransm goransm     480867 2018-05-10 14:59 /user/goransm/dfTrain5.csv
-rw-r--r--   3 goransm goransm  592406591 2018-05-07 23:46 /user/goransm/flights.csv
-rw-r--r--   3 goransm goransm         24 2017-07-18 13:44 /user/goransm/mysql-analytics-research-client-pw.txt
-rw-r-----   3 goransm goransm     697715 2021-09-28 10:54 /user/goransm/refClassSubclasses.csv
-rw-r-----   3 goransm goransm      21367 2021-06-29 15:07 /user/goransm/separators.csv
-rw-r-----   3 goransm goransm      86883 2021-06-29 15:07 /user/goransm/singleValueConstraintProperties.csv
-rw-r-----   3 goransm goransm     312740 2021-09-28 10:53 /user/goransm/subclasses.csv
-rw-r--r--   3 goransm goransm  283424350 2018-05-13 00:34 /user/goransm/tfMatrixDF.csv
-rw-r--r--   3 goransm goransm    2291969 2018-05-12 11:38 /user/goransm/tfMatrix_Human.csv
drwxr-xr-x   - goransm goransm          0 2019-12-04 00:18 /user/goransm/wdORESQuality.csv
drwxr-xr-x   - goransm goransm          0 2020-01-19 15:54 /user/goransm/wdORESQuality_Reuse.csv
drwxr-xr-x   - goransm goransm          0 2019-12-15 10:11 /user/goransm/wdORESQuality_Reuse_Commons.csv
drwxr-xr-x   - goransm goransm          0 2019-12-15 12:22 /user/goransm/wdORESQuality_Reuse_nonCommons.csv
drwxr-xr-x   - goransm goransm          0 2019-04-30 01:21 /user/goransm/wd_dump_geocoded
drwxr-xr-x   - goransm goransm          0 2019-04-30 01:28 /user/goransm/wd_dump_human_author
drwxr-xr-x   - goransm goransm          0 2019-04-30 01:27 /user/goransm/wd_dump_human_creator
drwxr-xr-x   - goransm goransm          0 2019-04-30 01:24 /user/goransm/wd_dump_human_gender
drwxr-xr-x   - goransm goransm          0 2019-04-30 01:25 /user/goransm/wd_dump_human_occupation
drwxr-xr-x   - goransm goransm          0 2019-10-22 19:52 /user/goransm/wd_dump_item_language
drwxr-xr-x   - goransm goransm          0 2019-04-30 01:23 /user/goransm/wd_dump_labels_English
drwxr-xr-x   - goransm goransm          0 2019-10-22 20:04 /user/goransm/wd_entity_reuse
drwxr-xr-x   - goransm goransm          0 2019-04-19 23:30 /user/goransm/wd_extId_data_qual_.csv
drwxr-xr-x   - goransm goransm          0 2019-10-23 18:26 /user/goransm/wd_extId_data_ref_.csv
drwxr-xr-x   - goransm goransm          0 2019-04-19 23:09 /user/goransm/wd_extId_data_ref_snak_.csv
drwxr-xr-x   - goransm goransm          0 2019-10-23 18:20 /user/goransm/wd_extId_data_stat_.csv
drwxr-xr-x   - goransm goransm          0 2019-12-14 00:00 /user/goransm/wdcmsqoop
drwxr-x---   - goransm goransm          0 2021-04-11 16:51 /user/goransm/wdtranslationsb
drwxr-xr-x   - goransm goransm          0 2019-11-01 14:24 /user/goransm/wikidataRevisions_EXP.csv

====== Hive =========
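The script used to produce the listing above isn't shown in this task. As a rough sketch only, a listing of this shape could be gathered with commands like the ones built below. The host names (stat1004 to stat1010, taken from the section headers above), the `/home/<user>` path, the use of `ssh`, and the function names are all assumptions. The Hive check is left out because nothing in this task shows how it is done. The code only builds and prints the commands; it does not run them.

```python
import shlex

# Hosts as they appear in the listing headers above (assumed short names).
STAT_HOSTS = [f"stat{n}" for n in range(1004, 1011)]


def build_listing_commands(user: str, hosts: list) -> list:
    """Return (label, argv) pairs that would list a user's leftovers.

    Hypothetical helper: one `ls -l` per stat host over ssh, plus one
    `hdfs dfs -ls` for the user's HDFS home directory.
    """
    commands = []
    for host in hosts:
        remote = f"ls -l {shlex.quote('/home/' + user)}"  # assumed home path
        commands.append((host, ["ssh", host, remote]))
    commands.append(("HDFS", ["hdfs", "dfs", "-ls", f"/user/{user}"]))
    return commands


def render(commands: list) -> str:
    """Format the commands with section headers similar to the listing."""
    lines = []
    for label, argv in commands:
        lines.append(f"====== {label} ======")
        lines.append(shlex.join(argv))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(build_listing_commands("goransm", STAT_HOSTS)))
```

Keeping the command construction separate from execution makes it easy to review what would be run before touching any host.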

Thank you, @JAllemandou! I'll start looking into this next week, with final decisions likely coming in early April once we've had time to sift through it all :)

@AndrewTavis_WMDE Gentle ping on this. Do you need to keep any files? Thanks!

Hey @brouberol 👋 Just getting back from two weeks off today :) I'll check into this and get back to you all! Thanks for the ping!

Manuel renamed this task from Check home/HDFS leftovers of goransm to Check home/HDFS leftovers of goransm (timeboxed 0,5 days). May 15 2024, 8:27 AM

Status: Done ✅

Going through the files sent by @JAllemandou above. This message will be saved as I go so that I don't lose my progress 😊 If I do find something worth documenting, then I'll also include it below so that this task can serve as a reference for later if need be.

stat1004

Summary: None of the files are worth keeping. See descriptions and reasoning below:

total 28

Analytics
└─ NewEditors 
    └─ adHoc (nothing of interest)
    └─ Compaigns
        └─ 2019 and 2020 email campaigns with R based analysis (nothing of interest)
└─ WDCM
    └─ WDCM_Output 
        └─ Lots of directories of CSVs (nothing of interest)
    └─ WDCM_Scripts
        └─ R based scripts that would be archived on Gerrit if they were ever in production (nothing of interest)
└─ Wikidata
    └─ misc
        └─ Some ad hoc work (nothing of interest)
    └─ WD_languagesLandscape
        └─ R based scripts that would be archived on Gerrit if they were ever in production (nothing of interest)
    └─ WD_ORES_ItemQuality (nothing of interest given Lift Wing migration)
    └─ WD_UsageCoverage
        └─ R and Python scripts that are doubtless versions of the WDCM UsageCoverage dashboard that's archived on Gerrit (nothing of interest)
Experiments
    └─ Empty
_miscWMDE
    └─ summerBannerCampaign2017_DataOUT
        └─ TSV files (nothing of interest)
    └─ TWLBanner_2017
        └─ TSV files and simple HQL queries from `wmf.webrequest` for banner campaign hits (nothing of interest, easy to learn as needed)

Example query:

SELECT count(*)
FROM wmf.webrequest
WHERE uri_host = 'de.wikipedia.org'
  AND uri_query LIKE "$/wiki/Wikipedia:Umfragen/Technische_Wünsche_2017$"
  AND http_method = 'GET'
  AND is_pageview = TRUE
  AND YEAR = 2017
  AND MONTH = 6
  AND DAY = 1
  AND HOUR = 20;

    └─ TWLBanner_2017_DataOUT
        └─ TSV files (nothing of interest)
_miscWMDE_1004
    └─ TWLBanner_2017
        └─ One HQL and one TSV file that are similar to the above (nothing of interest)
R
    └─ x86_64-pc-linux-gnu-library (nothing of interest)
Research
    └─ DydimusZengenene
        └─ Note: work to support a researcher (nothing of interest)
        └─ _analytics
        └─ _data
        └─ DydimusZengenene.Rproj
        └─ ParseTargetPage.R
wdUsagePerPage
    └─ Related to the percentage usage dashboard, so would be archived on Gerrit if they were ever in production (nothing of interest)

stat1005

Summary: There are some Python files related to prior work for @Manuel, but nothing that I'd need for my work.

total 964

Analytics
    └─ adhoc
        └─ renv
            └─ Env file (not of interest)
        └─ renv.lock
            └─ Env file (not of interest)
        └─ WD_GlobalSouth_202109
            └─ _analytics
                └─ CSV files (not of interest)
            └─ _data
                └─ Empty
            └─ WD_GS_activeEditorsEdits.hql

Query:

SELECT * 
FROM wmf.geoeditors_edits_monthly 
WHERE wiki_db="wikidatawiki" AND month = "2021-08";

            └─ WD_GS_activeEditors.hql

Query:

SELECT * 
FROM wmf.geoeditors_monthly 
WHERE wiki_db="wikidatawiki" AND month = "2021-08";

        └─ WD_UserRetention2021
            └─ _analytics
                └─ CSVs (not of interest)
            └─ _data
                └─ WD_retention.hql

Query:

USE wmf; 
          SELECT 
            event_user_id, event_user_registration_timestamp, 
            substring(event_timestamp, 1, 4) AS year, 
            substring(event_timestamp, 6, 2) AS month, 
            COUNT(*) AS revisions FROM mediawiki_history 
          WHERE (
            event_entity = 'revision' AND 
            event_type = 'create' AND 
            wiki_db = 'wikidatawiki' AND 
            event_user_is_anonymous = FALSE AND 
            NOT ARRAY_CONTAINS(event_user_is_bot_by, 'name') AND 
            NOT ARRAY_CONTAINS(event_user_is_bot_by, 'group') AND 
            NOT ARRAY_CONTAINS(event_user_is_bot_by_historical, 'name') AND 
            NOT ARRAY_CONTAINS(event_user_is_bot_by_historical, 'group') AND 
            NOT ARRAY_CONTAINS(event_user_groups, 'bot') AND 
            NOT ARRAY_CONTAINS(event_user_groups_historical, 'bot') AND 
            event_user_id != 0 AND 
            page_is_redirect = FALSE AND 
            revision_is_deleted_by_page_deletion = FALSE AND 
            (page_namespace = 0 OR page_namespace = 120 OR page_namespace = 146) AND 
            snapshot = '2021-09'
          ) 
          GROUP BY 
            event_user_id, 
            event_user_registration_timestamp, 
            substring(event_timestamp, 1, 4), 
            substring(event_timestamp, 6, 2);


                └─ WD_retentionTalk.hql

Query:

USE wmf; 
          SELECT 
            event_user_id, COUNT(*) AS talkrevisions FROM mediawiki_history 
          WHERE (
            event_entity = 'revision' AND 
            event_type = 'create' AND 
            wiki_db = 'wikidatawiki' AND 
            event_user_is_anonymous = FALSE AND 
            NOT ARRAY_CONTAINS(event_user_is_bot_by, 'name') AND 
            NOT ARRAY_CONTAINS(event_user_is_bot_by, 'group') AND 
            NOT ARRAY_CONTAINS(event_user_is_bot_by_historical, 'name') AND 
            NOT ARRAY_CONTAINS(event_user_is_bot_by_historical, 'group') AND 
            NOT ARRAY_CONTAINS(event_user_groups, 'bot') AND 
            NOT ARRAY_CONTAINS(event_user_groups_historical, 'bot') AND 
            event_user_id != 0 AND 
            page_is_redirect = FALSE AND 
            revision_is_deleted_by_page_deletion = FALSE AND 
            (page_namespace = 1 OR 
            page_namespace = 3 OR 
            page_namespace = 5 OR 
            page_namespace = 7 OR 
            page_namespace = 9 OR 
            page_namespace = 11 OR 
            page_namespace = 13 OR 
            page_namespace = 15 OR 
            page_namespace = 121 OR 
            page_namespace = 123 OR 
            page_namespace = 147 OR 
            page_namespace = 641 OR 
            page_namespace = 829 OR 
            page_namespace = 1199 OR 
            page_namespace = 2301 OR 
            page_namespace = 2303) AND 
            snapshot = '2021-09'
          ) 
          GROUP BY event_user_id;

                └─ WD_retentionTalk.tsv
                    └─ Not of interest
                └─ WD_retention.tsv
                    └─ Not of interest
    └─ exps
        └─ Empty
    └─ NewEditors (all related to old campaigns, not of interest)
        └─ 2021_LeadNurturing
        └─ campaigns  
        └─ CampaignsReview2020
        └─ CampaignsReview2021
        └─ comprehensive_report
    └─ TechWishes
        └─ Work related to the Tech Wish 2021 Survey including hive queries and CSVs/TSVs (not of interest)
    └─ test
        └─ R file and envs - `_test_WMDEData.R`, internal file, not of interest
        └─ Also random WDCM related queries (not of interest)
        └─ Also code that's for WMDEData, but it's all in a `test` repo, so not needed
            └─ WMDEData description: "WMDE specific functions to navigate and orchestrate the WMF infrastructure/system calls from within R environments"
    └─ WDCM (anything that was on prod would be archived on Gerrit, not needed)
        └─ WDCM_Output
        └─ WDCM_Scripts
    └─ Wikidata
        └─ WD_Communications
            └─ Technical wishes related work that queries from wmf.webrequest for views based on tech wish related uri_path values
        └─ WD_HumanEdits
            └─ Would be within Gerrit archive (not of interest)
        └─ WD_Inequality
            └─ Would be within Gerrit archive (not of interest)
        └─ WD_misc (none of this looks like stuff we should keep, and it's in "misc")
            └─ renv
            └─ WD_BotEdits_Manuel_202108
                └─ Work on per project bot edits from 2021 (not of interest)
            └─ WD_GlobalSouth_202109
                └─ Empty
            └─ WD_Sitelinks_Manuel_202108
                └─ _analytics
                    └─ Lots of CSVs and tar.gz files (nothing of interest)
                └─ _data
                    └─ Lots of CSVs (nothing of interest)
                └─ WD_External_Ids_Manuel_202108_ETL.py
                └─ WD_Sitelinks_Manuel_202108_ETL.py
                └─ WD_Sitelinks_Manuel_202108.R
            └─ WD_UserRetention
                └─ CSVs, TSVs, R and HQL files (also found other places)
            └─ renv.lock
                └─ Env file (not of interest)
            └─ WD_edits_monthly
                └─ HQL query and CSV/TSV files (nothing of interest)
            └─ WD_RevisionTags_Manuel_20210831
                └─ R files and CSVs (not of interest)
            └─ WD_SPARQL_Endpoint_Analytics (I really don't get what this is - maybe related to what was worked on for the graph split, but I don't think it's needed)
                └─ _analytics
                └─ _event
                └─ _log
                └─ WD_SPARQL_Endpoint_Analytics.Rproj
                └─ WD_WDQS_Analytics_CollectData.R
                └─ _data
                └─ _img
                └─ wdqs_CollectData.hql
                    └─ FROM event.wdqs_external_sparql_query
                └─ WD_SPARQL_Endpoint_Analytics_Test.R
                    └─ The `.directory` files can't be accessed
                └─ WD_WDQS_Analytics_XGBoostModelSelection_PRODUCTION.R
            └─ WD_WDQS_Automated_Queries
                └─ Work related to a query FROM event.wdqs_external_sparql_query
        └─ WD_ORES_ItemQuality
            └─ Given Lift Wing migration, not of interest
        └─ WD_Tranlations
            └─ Lots of TXT files (nothing of interest)
        └─ WD_DataQualityPrototype_2020
            └─ Work for Wikidata/Wikibase Week 2020 (nothing of interest)
        └─ WD_IdentifierLandscape
            └─ Likely similar to the external identifiers code (not needed)
            └─ The `_analysis/.directory` can't be accessed
        └─ WD_languagesLandscape
            └─ Would be in the Gerrit archive (nothing of interest)
        └─ WD_MonthlyEditors
            └─ _data
                └─ CSVs (nothing of interest)
        └─ WD_Strategy_2021
            └─ Work related to setting the strategy goals from before? (Likely not needed)
BotEdits_perProject.ipynb
    └─ Copied over to my stat1005 to check - this isn't only Wikidata related, and I don't think it's useful
crontabstat1005.txt
    └─ Not of interest
DataModelTerms_20210228_Updates.ipynb
    └─ Could be of interest, but can be recreated given new task
dewiki_NewEds_2021.ipynb
    └─ Not of interest
QCF_M2_Test.ipynb
    └─ Qurator Curious Facts analysis (not of interest)
QuratorCuriousFacts_Separators.ipynb
    └─ Qurator Curious Facts analysis (not of interest)
Qurator_M1.ipynb
    └─ Qurator Curious Facts or Current Events analysis (not of interest)
R
    └─ x86_64-pc-linux-gnu-library (not of interest)
snapshot_query.hql
    └─ Inconsequential and not of interest
Untitled1.ipynb
    └─ Ad hoc and not of interest
untitled1.txt
    └─ Ad hoc and not of interest
Untitled2.ipynb
    └─ Ad hoc and not of interest
Untitled3.ipynb
    └─ Ad hoc and not of interest
Untitled4.ipynb
    └─ Ad hoc and not of interest
Untitled5.ipynb
    └─ Ad hoc and not of interest
Untitled.ipynb
    └─ Ad hoc and not of interest
untitled.txt
    └─ Ad hoc and not of interest
venv
    └─ Env file (not of interest)
wd_cluster_fetch_items_M2.ipynb
    └─ Copied over to my stat1005 to check - clustering of Wikidata items by property, no need to keep
wd_cluster_fetch_items_M3.ipynb
    └─ Copied over to my stat1005 to check - clustering of Wikidata items by property, no need to keep
WDCM_ETL_OTHER_TEST.ipynb
    └─ WDCM analysis (not of interest)
WDCM_Statements_Test.ipynb
    └─ WDCM analysis (not of interest)
WD_HumanEditsPerClass_RevisionTags.ipynb
    └─ Not of interest
WD_Inequality_Intake.ipynb
    └─ Not of interest
WD_Languages_Datamodel_CollectInit.ipynb
    └─ Likely languages landscape related, not of interest
WD_Languages_Datamodel_EXP.ipynb
    └─ Likely languages landscape related, not of interest
WD_MonthlyEditors.ipynb
    └─ Not of interest
WD_Sitelinks_WDAHP_202108.ipynb
    └─ Copied over to my stat1005 to check - based on wikidata_entity, so this part's already done for the new implementation
wd_statements_HiveQL_Query.hql
    └─ WDCM related (not of interest)
WD_Translations.ipynb
    └─ Copied over to my stat1005 to check - it's just a query for `revision_text` from `mediawiki_wikitext_history` for translations (`P5972`)
WHEIP_exps.ipynb
    └─ Copied over to my stat1005 to check - item edits by humans vs. bots based on `wmf.mediawiki_history`, so nothing we need to keep
wikidata_analytics_examples
    └─ _data
        └─ A TSV and a TXT file (not of interest)
WikidataRevisions_November2020.csv
    └─ Not of interest

stat1006

Summary: The only thing that I would see of any value is sbc2017_PROD_causalImpactData.R, but I'd argue that causality has come a very long way since 2017 and Python would have different methods. Going from scratch based on examples we can find or syncing with WMF on their practices makes more sense.

total 48

misc_projects
    └─ springCampaign2017
        └─ R file and CSVs/TSVs (nothing of interest)
    └─ summerBannerCampaign2017 (nothing of interest)
        └─ sbc2017_DataOUT
            └─ sbc2017CausalImpact
                └─ CSVs (ISOwiki_revData) and SQL files (ISOwiki_revCount)

Example SQL file:

USE dewiki; 
SELECT 
COUNT(*), 
LEFT(CAST(rev_timestamp AS CHAR(14)), 8) AS revtime 
FROM revision INNER JOIN (
SELECT page_id FROM page 
WHERE (
(page_namespace = 0) AND (page_is_redirect = 0))
) AS pselected 
ON rev_page = pselected.page_id WHERE (
(rev_user != 0) AND (rev_timestamp BETWEEN 20170419000000 AND 20170718595959)
) GROUP BY revtime;

            └─ sbc2017_ServerSideAccountCreation_5487345.tsv
myTemp (nothing of interest)
    └─ beelineTestHiveQL_07082017.hql
    └─ testSCPtostat1003.txt
NewEds
    └─ currentUpdate
        └─ Lots of `updateNewEds` CSVs (nothing of interest)
nohup.out
    └─ Output file from a command that ignores the HUP (hangup) signal (Permission denied, not of interest)
R
    └─ x86_64-pc-linux-gnu-library (not of interest)
RPckg
    └─ Empty
RScripts
    └─ NewEds
        └─ mySQLcreds.csv
        └─ newEds_enwiki
            └─ R files, CSVs and TSVs (nothing of interest)
        └─ newedsUpdate.R
        └─ newUsersUpdate.sql

Query:

USE dewiki;
SELECT rev_user, rev_user_text, rev_page, rev_timestamp
FROM revision
INNER JOIN (
  SELECT page_id
  FROM page
  WHERE ((page_namespace = 0) AND (page_is_redirect = 0))
) AS pselected
ON rev_page = pselected.page_id
WHERE ((rev_user != 0) AND (rev_timestamp BETWEEN 20171009072348 AND 20171009104452));

        └─ updateStartsAt.csv
    └─ sbc2017
        └─ sbc2017_PROD_causalImpactData.R
        └─ I'd argue that we should work from scratch when it comes to causality as methods have improved lots since 2017, and we'd be doing it in Python anyway
    └─ WDCM_R (nothing of interest)
        └─ nohup.out
        └─ WDCM_Search_Clients.R (would be on Gerrit if it's valuable)
sqlIn
    └─ SQL files that make tables and count infoboxes (nothing of interest)
sqlOut (nothing of interest)
    └─ enwikitest.tsv
    └─ OUT_testSQLQuery1.out
    └─ testLogOUTPUT.tsv
WDCM_Credentials
    └─ Empty
WDCM_DataIN
    └─ CSVs for Genes, Humans etc (nothing of interest)
WDCM_DataOUT
    └─ Similar CSVs that are broken down by wiki (nothing of interest)
WDCM_sql
    └─ stderr.txt
    └─ stdout.txt
    └─ wdcm_searchClients.sql (such a long query with QIDs over and over ......)

stat1007

Summary: There is some Python based work, but nothing that we'd need to keep, as this stuff can be learned from the docs faster than it can be learned from this code. The tainted references query is saved here, but it's just a wmf.mediawiki_history query that, with adequate support, can be remade if needed.

total 28

Analytics
    └─ NewEditors (generally nothing of interest)
        └─ 2019_EmailCampaign
            └─ CSVs and TSVs (nothing of interest)
        └─ 2021_OccasionalEditors
            └─ CSVs and R files (nothing of interest)
        └─ adHoc
            └─ New editors request from 2019 (nothing of interest)
        └─ CampaignsReview2020
        └─ wmde_BannerCampaignsDashboard
        └─ 2021_LeadNurturing
        └─ 2021_VolunteerSupport
        └─ Campaigns
        └─ _republicaConference
    └─ Share
        └─ Empty
    └─ TechnicalWishes
        └─ Empty
    └─ WDCM
        └─ Anything of value here would be on Gerrit and archived (nothing of interest)
    └─ Wikidata
        └─ WD_misc
            └─ WD_Mobile-Pageviews-Increase-2018
                └─ Investigating an increase in page views (nothing of interest, but it is Python)
            └─ WD_SPARQL_Endpoint_Analytics
                └─ Seemed interesting by the title, but it's just generating some test data and empty directories
        └─ WD_MonthlyEditors
            └─ Python file on monthly editors (not needed)
        └─ WD_ORES_ItemQuality
            └─ Nothing of interest given Lift Wing migration
        └─ WD_PageviewsPerType
            └─ Work based on `wmf.pageview_hourly` (best to remake)
        └─ WD_processDump_Spark
            └─ All human authors (nothing of interest)
        └─ WD_protectedItems
            └─ Analysis based on WDCM (nothing of interest)
        └─ WD_TaintedReferences

Example query:

SELECT page_title, event_timestamp, event_user_id, \
                                revision_id, revision_parent_id, event_comment, \
                                revision_is_deleted_by_page_deletion, revision_is_identity_reverted, \
                                revision_first_identity_reverting_revision_id, revision_is_identity_revert \
                            FROM wmf.mediawiki_history \
                            WHERE event_entity='revision' \
                            AND event_type='create' \
                            AND wiki_db='wikidatawiki' \
                            AND snapshot='2019-12' \
                            AND event_comment RLIKE 'wbsetclaim-update' \
                            AND revision_is_identity_reverted = False \
                            AND event_timestamp RLIKE '^2019-12'

    └─ Wiktionary
        └─ Wiktionary_CognateDashboard (on Gerrit, nothing of interest)
crontabstat1007.txt
    └─ Not of interest
Experiments
    └─ Experimenting with Spark SQL and Kerberos (nothing of interest)
Python3
    └─ WDCM (would be archived on Gerrit if valuable, so not of interest)
R
    └─ x86_64-pc-linux-gnu-library (nothing of interest)
RScripts
    └─ NewEditors
        └─ Would be on Gerrit if in production (nothing of interest)
    └─ TechnicalWishes
        └─ Empty
    └─ WDCM_R
        └─ Would be on Gerrit if in production (nothing of interest)
venv
    └─ Env file (not of interest)

stat1008

Summary: Mostly ad hoc work and copies of WDCM/Cognate code that's archived on Gerrit. Nothing I'd keep.

total 16

Analytics
    └─ NewEditors
        └─ Would be archived on Gerrit if in production (nothing of interest)
    └─ Qurator
        └─ Would be archived on Gerrit if in production (nothing of interest)
    └─ WDCM
        └─ Would be archived on Gerrit if in production (nothing of interest)
    └─ WD_Editors_Lydia_20200624.ipynb
        └─ Ad hoc, not of interest
    └─ Wikidata
        └─ misc (ad hoc and it's all TSVs/CSVs, nothing of interest)
            └─ WD_Editors_Lydia_20200624
            └─ WD_Editors_Manuel_202107
        └─ renv
            └─ Env file (not of interest)
        └─ WD_languagesLandscape
            └─ Would be on Gerrit if in production (nothing of interest)
        └─ WD_ORES_ItemQuality
            └─ Not of interest given Lift Wing migration
        └─ WD_UsageCoverage
            └─ Would be on Gerrit if in production (nothing of interest)
    └─ Wiktionary
        └─ Wiktionary_CognateDashboard (would be on Gerrit if in production, not of interest)
R
    └─ x86_64-pc-linux-gnu-library (Not of interest)
renv
    └─ Env file (not of interest)
venv
    └─ Env file (not of interest)

stat1009

total 0

stat1010

total 0

HDFS

Summary: The HDFS files are a lot of CSVs, data dumps and files that are related to WDCM. I don't think that any of them are needed.

Found 55 items

/user/goransm/.Trash
    └─ Not of interest
/user/goransm/.metadata
    └─ Not of interest
/user/goransm/.sparkStaging
    └─ Not of interest
/user/goransm/.staging
    └─ Not of interest
/user/goransm/.temp
    └─ Not of interest
/user/goransm/Architectural-Structure_ItemIDs.csv
    └─ Not of interest
/user/goransm/Astronomical-Object_ItemIDs.csv
    └─ Not of interest
/user/goransm/Book_ItemIDs.csv
    └─ Not of interest
/user/goransm/Chemical-Entities_ItemIDs.csv
    └─ Not of interest
/user/goransm/Event_ItemIDs.csv
    └─ Not of interest
/user/goransm/Gene_ItemIDs.csv
    └─ Not of interest
/user/goransm/Geographical-Object_ItemIDs.csv
    └─ Not of interest
/user/goransm/Human_ItemIDs.csv
    └─ Not of interest
/user/goransm/ORESPredictions
    └─ Given Lift Wing migration, not of interest
/user/goransm/Organization_ItemIDs.csv
    └─ Not of interest
/user/goransm/Taxon_ItemIDs.csv
    └─ Not of interest
/user/goransm/Thoroughfare_ItemIDs.csv
    └─ Not of interest
/user/goransm/WDCM_Biases_ETL_Test
    └─ Would be on Gerrit if in production, not of interest
/user/goransm/WDCM_CollectedGeoItems
    └─ Would be on Gerrit if in production, not of interest
/user/goransm/WDCM_CollectedItems
    └─ Would be on Gerrit if in production, not of interest
/user/goransm/Wikimedia_Internal_ItemIDs.csv
    └─ Not of interest
/user/goransm/Work-Of-Art_ItemIDs.csv
    └─ Not of interest
/user/goransm/dewiki_revisions
    └─ Not of interest
/user/goransm/dfTrain1.csv
    └─ Not sure what this would be, but not of interest
/user/goransm/dfTrain2.csv
    └─ Not sure what this would be, but not of interest
/user/goransm/dfTrain3.csv
    └─ Not sure what this would be, but not of interest
/user/goransm/dfTrain4.csv
    └─ Not sure what this would be, but not of interest
/user/goransm/dfTrain5.csv
    └─ Not sure what this would be, but not of interest
/user/goransm/flights.csv
    └─ Not sure what this would be, but not of interest
/user/goransm/mysql-analytics-research-client-pw.txt
    └─ Not sure what this would be, but not of interest
/user/goransm/refClassSubclasses.csv
    └─ Not of interest
/user/goransm/separators.csv
    └─ Not of interest
/user/goransm/singleValueConstraintProperties.csv
    └─ Not of interest
/user/goransm/subclasses.csv
    └─ Not of interest
/user/goransm/tfMatrixDF.csv
    └─ Likely TFIDF term frequencies, not of interest
/user/goransm/tfMatrix_Human.csv
    └─ Likely TFIDF term frequencies, not of interest
/user/goransm/wdORESQuality.csv
    └─ Given Lift Wing migration, not of interest
/user/goransm/wdORESQuality_Reuse.csv
    └─ Given Lift Wing migration, not of interest
/user/goransm/wdORESQuality_Reuse_Commons.csv
    └─ Given Lift Wing migration, not of interest
/user/goransm/wdORESQuality_Reuse_nonCommons.csv
    └─ Given Lift Wing migration, not of interest
/user/goransm/wd_dump_geocoded
    └─ Not of interest
/user/goransm/wd_dump_human_author
    └─ Not of interest
/user/goransm/wd_dump_human_creator
    └─ Not of interest
/user/goransm/wd_dump_human_gender
    └─ Not of interest
/user/goransm/wd_dump_human_occupation
    └─ Not of interest
/user/goransm/wd_dump_item_language
    └─ Not of interest
/user/goransm/wd_dump_labels_English
    └─ Not of interest
/user/goransm/wd_entity_reuse
    └─ Not of interest
/user/goransm/wd_extId_data_qual_.csv
    └─ Not of interest
/user/goransm/wd_extId_data_ref_.csv
    └─ Not of interest
/user/goransm/wd_extId_data_ref_snak_.csv
    └─ Not of interest
/user/goransm/wd_extId_data_stat_.csv
    └─ Not of interest
/user/goransm/wdcmsqoop
    └─ Would be on Gerrit if in production, not of interest
/user/goransm/wdtranslationsb
    └─ Not really sure what this would be, but not of interest
/user/goransm/wikidataRevisions_EXP.csv
    └─ Not of interest

Hive

Nothing found

Ok then!

The check of the files above is complete, as shown by the status. General summaries of each stat machine and of HDFS are provided under the subsections above. stat1005 has some files that @Manuel may find interesting, given that they're for prior tasks of his. Queries that were interesting, or that came from files whose names sounded interesting even though the query turned out not to be, are printed above for documentation.

Overall, anything above would be easier to rebuild from scratch, using the docs and checking with WMDE engineers or WMF Data Engineering/Analytics, than to dig through and re-implement for new task scopes. I personally would not keep anything, and will delete the files I copied over to my directory on stat1005 once this is closed :)

Thanks again @JAllemandou for the file lists, and thanks @brouberol for the ping!

Hi Andrew, thank you for looking into the files. Based on what you found, I agree to delete everything once we have made our copies. While you are at it: could you please make a copy of all stat1005 files for me (except data files and files from 2020 or older)? Some of the newer notebooks could still be useful to have. Thx!

Hi @Manuel - sending along a summary of what I'll be getting for you:

====== stat1004 ======
Jul 25  2020 Analytics
Jun 23  2020 Experiments
Jul 25  2020 wdUsagePerPage

====== stat1005 ======
All non data files

====== stat1007 ======
Aug 23  2020 Analytics
Jan 27  2020 Experiments
Aug 23  2020 RScripts

====== stat1008 ======
Oct 11  2021 Analytics
Jun 23  2020 R

======= HDFS ========
2021-11-02 17:37 /user/goransm/dewiki_revisions
2021-04-11 16:51 /user/goransm/wdtranslationsb
No other files, as everything after 2020 is a data file or ORES related (this is coming in the stat server files anyway)

TSVs, CSVs and other data file types will not be included in the transfer. Out of convenience, I'm going to transfer the files into your directory on the given server.

Hi Andrew, only the files from stat1005 will do (except data files and except files 2020 or older). Thank you for confirming!

Hi @Manuel, checking further as it's still not clear what you'd like. The double except is confusing. I'll only transfer files from stat1005, and could you answer the following questions:

  1. Do you want data files (.csv, .tsv, etc) before 2020? (assumption no)
  2. Do you want data files after 2020? (as of now unclear)
  3. Do you want non data files (.py, .R, etc) before 2020? (as of now unclear)
  4. Do you want non data files after 2020? (assumption yes)

I'm also realizing that I don't have admin rights and thus can't move files to your directory. I'll copy these files over to my directory, download them and send you a link to a zipped directory on Google Drive once we have the above figured out.

I can move them for you @AndrewTavis_WMDE - I think that's probably better than transferring anything via Google Drive, if you don't mind.

Thank you, @BTullis! Ya I wasn't happy with the solution either. Appreciate your willingness to help!

Hi, I am only interested in #4: non data files after 2020 from stat1005. You could also just copy all stat1005 files if this is easier. Cheers!
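For reference, the selection in #4 (non data files after 2020) could be sketched roughly as below. This is a minimal sketch, not the command that was run: the function name `list_recent_non_data` is hypothetical, and the cutoff date and the list of data-file extensions are assumptions about what "after 2020" and "data files" mean here. It relies on GNU find's `-newermt` test.

```shell
# Hypothetical helper: list regular files under a directory that were
# modified after 2021-01-01 00:00 and are not data files.
# Assumptions: "after 2020" means 2021 or later, and "data files" means
# .csv, .tsv and .Rds only. Extend the list as needed.
list_recent_non_data() {
    find "$1" -type f -newermt '2021-01-01' \
        ! -name '*.csv' ! -name '*.tsv' ! -name '*.Rds' | sort
}

# Example (not executed here):
#   list_recent_non_data /home/goransm
```

The output is one path per line, so it can be reviewed before anything is copied.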

Thank you, @BTullis!

I have created an archive with the following command:

btullis@stat1005:/home/goransm$ tar czvf ~/goransm.stat1005.T358311.tgz --exclude '*/.*' --exclude '*.tsv' --exclude '*.csv' --exclude '*.dmatrix' --exclude '*.Rds' --exclude '*.log' --exclude '*LOG.txt' .

Without those exclusions, the user's home directory was 118 GB. After the exclusions it was around 1 GB. After compression this became 355 MB.
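A size comparison like the one above can be estimated before building the archive. The following is a rough sketch assuming GNU du; the function names `total_kib` and `filtered_kib` are hypothetical, and the hidden-file rule (`*/.*`) from the tar command is deliberately left out, so the numbers will not match tar's exactly.

```shell
# Total on-disk size of a directory in KiB, everything included.
total_kib() {
    du -sk "$1" | cut -f1
}

# Size in KiB after skipping files that match the data-file patterns
# from the tar command. Hidden files are NOT excluded here (assumption:
# a rough estimate is enough).
filtered_kib() {
    du -sk --exclude='*.tsv' --exclude='*.csv' --exclude='*.dmatrix' \
        --exclude='*.Rds' --exclude='*.log' --exclude='*LOG.txt' "$1" | cut -f1
}

# Example (not executed here):
#   total_kib /home/goransm
#   filtered_kib /home/goransm
```

These figures are before compression, so the final archive will normally be smaller again.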

@Manuel I have put this file into your home directory on stat1011.

btullis@stat1011:/home/manuel-wmde$ tree -ugph
.
└── [-rw-rw-r-- manuel-wmde wikidev  355M]  goransm.stat1005.T358311.tgz

0 directories, 1 file

stat1005 is due for decommissioning imminently under T353785: Decom EOL stats servers stat100[4-7] so stat1011 is a better place for it.

I hope that's OK. Please do let me know if you feel that I missed anything with those exclusions.
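One way to double-check the exclusions is to list the archive and look for members that still match the excluded data-file patterns. This is only a sketch: the function name `find_leftover_data_files` is hypothetical, and it checks the extension patterns only, not the hidden-file rule.

```shell
# Hypothetical helper: print archive members whose names still match
# the data-file patterns that were meant to be excluded.
# Prints nothing when the archive is clean.
find_leftover_data_files() {
    tar tzf "$1" | grep -E '\.(tsv|csv|dmatrix|Rds|log)$|LOG\.txt$' || true
}

# Example (not executed here):
#   find_leftover_data_files ~/goransm.stat1005.T358311.tgz
```

Running plain `tar tzf` on the archive gives the full member list for a manual review.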

I have removed the HDFS files (although they're still recoverable for 30 days).

btullis@an-launcher1002:~$ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r /user/goransm
24/06/04 10:59:40 INFO fs.TrashPolicyDefault: Moved: 'hdfs://analytics-hadoop/user/goransm' to trash at: hdfs://analytics-hadoop/user/hdfs/.Trash/Current/user/goransm
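Since the removal went to trash, the files could in principle be inspected or restored during the retention window. The sketch below only builds the trash path seen in the log line above; the function name `hdfs_trash_path` is hypothetical, and the commented `hdfs dfs` commands are illustrative and were not run as part of this task. As far as I know, trash contents move from `Current` into a timestamped checkpoint directory after a while, so the exact path may differ later.

```shell
# Hypothetical helper: build the trash location for a path that was
# removed by a given HDFS user (here the removal ran as user "hdfs").
# $1 = user that ran the rm, $2 = original absolute path
hdfs_trash_path() {
    printf '/user/%s/.Trash/Current%s\n' "$1" "$2"
}

# Illustrative only; needs an HDFS client and suitable credentials:
#   hdfs dfs -ls "$(hdfs_trash_path hdfs /user/goransm)"
#   hdfs dfs -mv "$(hdfs_trash_path hdfs /user/goransm)" /user/goransm
```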

@Manuel - Have you verified that the archive contains what you wish to retain? Are you happy for me to proceed to delete the user's home directories on the stats servers now?
Thanks.

Are you happy for me to proceed to delete the user's home directories on the stats servers now?

Yes I am, please proceed from my side, and thank you again for confirming!

Thanks again @Manuel. I executed the following command to remove the home directories.

btullis@cumin1002:~$ sudo cumin 'C:profile::analytics::cluster::client or C:profile::hadoop::master or C:profile::hadoop::master::standby' 'rm -rf /home/goransm'

Thanks so much for the support here, @BTullis! I'll update the epic with this being done. So close to being finished with all this :)