Checks of infrastructure related to the prior WMDE analytics setup. For these checks I'm looking at both the edited versions and the originals from Gerrit. In general, the following consists of codebase searches for the given topics.
- [ ] Check [[ https://github.com/wikimedia/analytics-wmde-WD-WikidataAnalytics | WD Analytics code ]] ([[ https://gerrit.wikimedia.org/r/admin/repos/q/filter:analytics%252Fwmde%252FWD | Gerrit ]]) for relations to other data services and a username
- [x] Graphite
- None (`libgraphite2` - for fonts)
- [x] Cloud VPS (references and two main endpoint URLs)
- References in HTML `<a>` tags of `ui.R` and `app_ui.R` files (`wmdeanalytics`)
- R and Docker based deployment files
- URL targets to https://wikidata-analytics.wmcloud.org (`wikidata-analytics`)
- No references to `wmde-dashboards`, `wiktionary-cognate`
- References to `neweditors` in R files for TW ad hoc analysis (on Gerrit, but not a part of WikidataAnalytics)
- [x] Published data export path
- R files for reading and writing data and HTML `<a>` tags
- [x] Cronjobs and commands
- R cron jobs, data lake ETL Python jobs, moving data to HDFS, cleaning up HDFS, settings configurations, copying update strings to published data folder, Kerberos credential initialization, copying files to published data folder, logging runs, Spark 2 commands, Docker commands, dependency installation
- References to `/local` are for setting a local dir and running queries after setting Kerberos via sudo into the data lake, as well as for setting login credentials, returning errors and installing dependencies
- No direct references to `cron` ("asynconous" only)
- Some commands are sudo commands
- [x] PHP code
- No references to `analytics-wmde-scripts`
- No references to `statistics::wmde` (puppet)
- References to `php` in code are for the Wikidata action API, Commons API and Reference Hunt PHP stats report
- [x] User name
- IRC and Wikimedia nick references
- Code comments
- Personal DB used for some data sources (`wdcm_clients_wb_entity_usage`)
- Paths to config files and other setup files
- Paths for saved data
- [ ] Archive Gerrit repo
- [ ] Archive/delete GitHub repo (check if not automatic)
- [ ] Check [[ https://github.com/wikimedia/analytics-wmde-WDCM | WDCM code ]] ([[ https://gerrit.wikimedia.org/r/admin/repos/q/filter:analytics%252Fwmde%252FWDCM | Gerrit ]]) for relations to other data services and a username (deprecated on GitHub because of deprecation on Gerrit)
- [x] Graphite
- None (`libgraphite2` - for fonts)
- [x] Cloud VPS (references and two main endpoint URLs)
- References in HTML `<a>` tags (`wmdeanalytics`)
- No references to `wmde-dashboards`, `wiktionary-cognate`, `neweditors`, `wikidata-analytics`
- [x] Published data export path
- R files for reading and writing data
- [x] Cronjobs and commands
- R cron jobs, data lake ETL Python jobs, moving data to HDFS, copying update strings to published data folder
- No direct references to `cron`
- References to `/local` are for setting a local dir and running queries after setting Kerberos
- Some commands are sudo commands
- [x] PHP code
- No references to `analytics-wmde-scripts`
- No references to `statistics::wmde` (puppet)
- References to `php` in code are for the Wikidata action API and Reference Hunt PHP stats report
- [x] User name
- IRC and Wikimedia nick references
- Code comments
- Personal DB used for some data sources (`wdcm_clients_wb_entity_usage`)
- Paths to config files and other setup files
- Paths for saved data
- [x] Archive Gerrit repo
- [x] Archive/delete GitHub repo (check if not automatic)
- [ ] Check [[ https://github.com/wikimedia/analytics-wmde-WiktionaryCognateDashboard | Wiktionary Cognates code ]] ([[ https://gerrit.wikimedia.org/r/admin/repos/q/filter:analytics%252Fwmde%252FWiktionary | Gerrit ]]) for relations to other data services and a username
- [x] Graphite
- None (`libgraphite2` - for fonts)
- [x] Cloud VPS (references and two main endpoint URLs)
- References to `wiktionary-cognate` images via `.pper` files and path files
- No references to `wmde-dashboards`, `wmdeanalytics`, `wikidata-analytics`, `neweditors`
- [x] Published data export path
- R files for reading and writing data
- [x] Cronjobs and commands
- Installing dependencies
- No direct references to `cron`
- References to `/local` are for installing dependencies
- No sudo commands
- [x] PHP code
- No references to `analytics-wmde-scripts`
- No references to `statistics::wmde` (puppet)
- [x] User name
- IRC nick references
- Original `.yml` file references to local directories
- `.Rproj.user` paths to the local directories that hold the Wiktionary Cognate dockerfiles, engines and dev start files (`_Img/Wiktionary-Cognate`)
- [ ] Archive Gerrit repo
- [ ] Archive/delete GitHub repo (check if not automatic)
- [ ] Check [[ https://gerrit.wikimedia.org/r/admin/repos/q/filter:analytics%252Fwmde%252FNewEditors | NewEditors ]] Gerrit code
- Checked because there were cron jobs still running for this
- It looks like the only file in here is a config file for Gerrit?
- [ ] Archive Gerrit repo?
- [x] Check of [[ https://github.com/wikimedia/analytics-wmde-scripts | PHP code ]] that could be related and is still in use
- Assumption at the time of writing: no connections
- [x] Check all Cloud VPS instances
- No references to `wmde-dashboards`
- No references to `wmdeanalytics`
- No references to `wikidata-analytics`
- No references to `wiktionary`
- No references to `cognate`
- No references to `neweditors`
- No references to `.r` or `.R` in the code
- [x] Check published data export path
- No references to `wmde-analytics-engineering`
- [x] Check references to projects or sensitive sources
- No references to `wdcm`, `curator`, `curious`, `wd_analytics`, `tmp`, `hdfs`/`HDFS`, the given username
- [x] Check GitHub for routes to [[ https://wikidata-analytics.wmcloud.org | dashboard endpoint ]] or the [[ http://wmdeanalytics.wmflabs.org/ | other dashboard endpoint ]]
- Endpoint: [[ https://github.com/wikimedia/wikidata-query-builder/blob/f4ee611f52bc80525f4be7ff34938f5888ee27d7/src/components/Footer.vue#L53 | Wikidata Query Builder footer code ]], [[ https://github.com/wmde/wikidata-mismatch-finder/blob/7ccb74874cdbb9160c1177a765dc53ec83a44f1a/resources/js/Pages/Layout.vue#L74 | Mismatch Finder Layout code ]], Wikidata contributor documentation issue, URLs in code for the projects themselves, comments in code for the projects themselves, similarly named Python scripts from WMF employees, related business forks of the projects
- Other endpoint: WMF repos for the projects themselves, WMDE employee fork, related business fork of the projects
- [x] Check GitHub for routes to [[ https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/ | published data export paths ]]
- WMF repos for the projects themselves, [[ https://github.com/search?q=repo%3Awmde%2Fgrafana-dashboards+wmde-analytics-engineering&type=code | deprecated Grafana dashboard codebase ]] and forks of it
- [x] Check GitHub for routes to VPS related code
- `wmde-dashboards`: WMDE/WMF repos and forks
- `wmdeanalytics`: WMF repos and forks
- See above for `wikidata-analytics`
- `wiktionary-cognate`: WMF repos, some repos that do Wiktionary cognates themselves ([[ https://github.com/crowtherln/cognates | one in Python ]])
- [x] Checking [[ https://graphite.wikimedia.org/ | Graphite ]] processes for references to WD, WDCM and Wiktionary Cognates processes
- No results when searching for `wdcm`; `wmde` references are for `test.wikidata`, Scribunto or a 2019 email banner campaign; no combination of `wd` or `wikidata` with `analytics` is returned with either a dash or an underscore; `cognate` results are all MediaWiki, but they concern Cognate itself, not analysis of it
- [x] Checking [[ https://wikitech.wikimedia.org/wiki/Prometheus | Prometheus ]]
- Given when Prometheus was adopted (2021), it is reasonably certain that no jobs are related
- [x] Cronjobs
- [[ https://docs.google.com/spreadsheets/d/1w2f_ndQa6Lo2BBfPJ88sJLSg2RJeTQKFNOPd0zjiB4I/edit#gid=1625715087 | Sheet with PHP jobs overview ]]
- [[ https://docs.google.com/spreadsheets/d/1w2f_ndQa6Lo2BBfPJ88sJLSg2RJeTQKFNOPd0zjiB4I/edit?usp=sharing | Sheet with R jobs overview ]]
- Was checked by multiple members of the team
- It does not look like the R jobs are required for the PHP workflows at all
- [x] Misc cron jobs
- Via WMF: there's only one job still running on the given user account - `2021_WMDEMitmachenBereichCampaign_PRODUCTION.R`
- Box is checked once this job has been migrated from the current account or stopped
- [ ] Cloud VPS general check
- [x] Instances (machines below)
- https://openstack-browser.toolforge.org/project/wmde-dashboards
- [[ https://openstack-browser.toolforge.org/server/wikidata-analytics.wmde-dashboards.eqiad1.wikimedia.cloud | wikidata-analytics.wmde-dashboards.eqiad1.wikimedia.cloud ]]
- [[ https://openstack-browser.toolforge.org/server/wiktionary-cognate.wmde-dashboards.eqiad1.wikimedia.cloud | wiktionary-cognate.wmde-dashboards.eqiad1.wikimedia.cloud ]]
- [[ https://openstack-browser.toolforge.org/server/wmde-neweditors.wmde-dashboards.eqiad1.wikimedia.cloud | wmde-neweditors.wmde-dashboards.eqiad1.wikimedia.cloud ]]
- https://openstack-browser.toolforge.org/project/wmdeanalytics
- [[ https://openstack-browser.toolforge.org/server/wikidata-analytics-1.wmdeanalytics.eqiad1.wikimedia.cloud | wikidata-analytics-1.wmdeanalytics.eqiad1.wikimedia.cloud ]]
- [[ https://openstack-browser.toolforge.org/server/wiktionary-cognate-1.wmdeanalytics.eqiad1.wikimedia.cloud | wiktionary-cognate-1.wmdeanalytics.eqiad1.wikimedia.cloud ]]
- https://openstack-browser.toolforge.org/project/wikidata-dev (no related machines)
- [x] Are the above the only ones we need to consider?
- Asked in WMDE SWE Data; no indication of anything else
- [ ] Look into the machines themselves as a final check
- [ ] HDFS
- Do we need to investigate this? There are references to moving data there and cleaning a directory.
- `hdfs:///tmp/wmde/analytics`
- Similarly user tables
- Are these restricted without private data access?
See [Legacy Infrastructure Investigation](https://docs.google.com/document/d/1r0dLNPO_-JbXwstT9afxCvgVVmEjRR6B_WA-RmcLjDM/edit)
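
The codebase searches behind the checks above could be scripted roughly as follows. This is only a minimal sketch and not the procedure that was actually used: it assumes a local checkout of the repo being checked, the function name `find_references` and the `SEARCH_TERMS` list are made up for illustration, and the terms are simply copied from the checklist. Matching is a case-insensitive substring search, so hits still need to be read in context (for example `libgraphite2` matching a search for Graphite).

```python
from pathlib import Path

# Terms copied from the checklist above; this list is illustrative, extend as needed.
SEARCH_TERMS = [
    "wmde-dashboards",
    "wmdeanalytics",
    "wikidata-analytics",
    "wiktionary-cognate",
    "neweditors",
    "wmde-analytics-engineering",
    "analytics-wmde-scripts",
    "statistics::wmde",
]


def find_references(repo_root, terms=SEARCH_TERMS):
    """Return {term: [(relative_path, line_number, stripped_line), ...]}.

    Hypothetical helper: walks every file under repo_root (skipping .git)
    and records each line that contains a term, ignoring case.
    """
    root = Path(repo_root)
    hits = {term: [] for term in terms}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        try:
            # errors="ignore" lets binary files pass through without raising
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file, skip it
        relative = str(path.relative_to(root))
        for number, line in enumerate(text.splitlines(), start=1):
            lowered = line.lower()
            for term in terms:
                if term.lower() in lowered:
                    hits[term].append((relative, number, line.strip()))
    return hits
```

Calling `find_references("path/to/checkout")` returns an empty list for every term that has no references, which corresponds to the "No references to ..." entries in the checklist.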