For these checks I'm looking at both the edited versions of the code and the originals from Gerrit. In general, the checks below are codebase searches for the given topics.
Main goal: determine whether other processes, such as the PHP code, are linked to the old R-based analytics processes. We should also look from the other direction to see whether any functioning processes have scripts that serve as inputs for the old R-based processes, which would mean their connections need to be checked as well.
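For reference, a search like this can be sketched roughly as below. This is a minimal sketch and not the exact tooling used for the checks: the function name `search_tree` and the local checkout path in the usage example are assumptions, and the term list is just a sample of the topics in this checklist.

```python
from pathlib import Path

# Sample of the search terms from this checklist; extend as needed.
SEARCH_TERMS = [
    "wmde-dashboards",
    "wmdeanalytics",
    "wikidata-analytics",
    "wiktionary-cognate",
    "neweditors",
    "analytics-wmde-scripts",
    "statistics::wmde",
]


def search_tree(root, terms=SEARCH_TERMS):
    """Return {term: [(path, line_number, line_text), ...]} for
    case-insensitive substring matches in all files under root."""
    root = Path(root)
    hits = {term: [] for term in terms}
    for path in sorted(root.rglob("*")):
        # Skip directories and anything inside the .git metadata folder.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file, move on
        for number, line in enumerate(text.splitlines(), start=1):
            lowered = line.lower()
            for term in terms:
                if term.lower() in lowered:
                    hits[term].append((str(path), number, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical local checkout path, used only for illustration.
    for term, matches in search_tree("./WikidataAnalytics").items():
        print(f"{term}: {len(matches)} match(es)")
        for path, number, line in matches:
            print(f"  {path}:{number}: {line}")
```

A term with an empty list corresponds to a "No references to ..." entry below; anything else still has to be read in context to decide whether it matters.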
- Check WD Analytics code (Gerrit) for relations to other data services and a username
- Graphite
- None (libgraphite2 - for fonts)
- Cloud VPS (references and two main endpoint URLs)
- References in HTML <a> tags of ui.R and app_ui.R files (wmdeanalytics)
- R and Docker based deployment files
- URL targets to https://wikidata-analytics.wmcloud.org (wikidata-analytics)
- No references to wmde-dashboards, wiktionary-cognate
- References to neweditors in R files for TW ad hoc analysis (on Gerrit, but not a part of WikidataAnalytics)
- Published data export path
- R files for reading and writing data and HTML <a> tags
- Cronjobs and commands
- R cron jobs, data lake ETL Python jobs, moving data to HDFS, cleaning up HDFS, settings configurations, copying update strings to published data folder, Kerberos credential initialization, copying files to published data folder, logging runs, Spark 2 commands, Docker commands, dependency installation
- References to /local are for setting a local dir and running queries after setting Kerberos via sudo into the data lake, setting login credentials, returning errors, installing dependencies
- No direct references to cron ("asynchronous" only)
- Some commands are sudo commands
- PHP code
- No references to analytics-wmde-scripts
- No references to statistics::wmde (puppet)
- References to PHP in the code are for the Wikidata action API, Commons API and Reference Hunt PHP stats report
- User name
- IRC and Wikimedia nick references
- Code comments
- Personal DB used for some data sources (wdcm_clients_wb_entity_usage)
- Paths to config files and other setup files
- Paths for saved data
- Archive Gerrit repo
- Archive/delete GitHub repo (check if not automatic)
- Covered by T357697
- Graphite
- Check WDCM code (Gerrit) for relations to other data services and a username (deleted on GitHub because of archive on Gerrit)
- Graphite
- None (libgraphite2 - for fonts)
- Cloud VPS (references and two main endpoint URLs)
- References in HTML <a> tags (wmdeanalytics)
- No references to wmde-dashboards, wiktionary-cognate, neweditors, wikidata-analytics
- Published data export path
- R files for reading and writing data
- Cronjobs and commands
- R cron jobs, data lake ETL Python jobs, moving data to HDFS, copying update strings to published data folder
- No direct references to cron
- References to /local are for setting a local dir and running queries after setting Kerberos
- Some commands are sudo commands
- PHP code
- No references to analytics-wmde-scripts
- No references to statistics::wmde (puppet)
- References to PHP in the code are for the Wikidata action API and Reference Hunt PHP stats report
- User name
- IRC and Wikimedia nick references
- Code comments
- Personal DB used for some data sources (wdcm_clients_wb_entity_usage)
- Paths to config files and other setup files
- Paths for saved data
- Archive Gerrit repo
- Archive/delete GitHub repo (check if not automatic)
- Graphite
- Check Wiktionary Cognates code (Gerrit) for relations to other data services and a username
- Graphite
- None (libgraphite2 - for fonts)
- Cloud VPS (references and two main endpoint URLs)
- References to wiktionary-cognate images via .pper files and path files
- No references to wmde-dashboards, wmdeanalytics, wikidata-analytics, neweditors
- Published data export path
- R files for reading and writing data
- Cronjobs and commands
- Installing dependencies
- No direct references to cron
- References to /local are for installing dependencies
- No sudo commands
- PHP code
- No references to analytics-wmde-scripts
- No references to statistics::wmde (puppet)
- User name
- IRC nick references
- Original .yml file references to local directories
- .Rproj.user paths to the local directories that hold the Wiktionary Cognate dockerfiles, engines and dev start files (_Img/Wiktionary-Cognate)
- Archive Gerrit repo
- Archive/delete GitHub repo (check if not automatic)
- Covered by T357697
- Graphite
- Check NewEditors Gerrit code
- As there were cron jobs still running for this
- It looks like the repo contains only one file, a config file for Gerrit?
- Archive Gerrit repo?
- Covered by T357697
- Check PHP code that could be related and is still in use
- Assumption at the time of writing is that there are no connections
- Check all Cloud VPS instances
- No references to wmde-dashboards
- No references to wmdeanalytics
- No references to wikidata-analytics
- No references to wiktionary
- No references to cognate
- No references to neweditors
- No references to .r or .R in the code
- Check published data export path
- No references to wmde-analytics-engineering
- Check references to projects or sensitive sources
- No references to wdcm, curator, curious, wd_analytics, tmp, hdfs/HDFS, the given username
- Check GitHub for routes to dashboard endpoint or the other dashboard endpoint
- Endpoint: Wikidata Query Builder footer code, Mismatch Finder Layout code, Wikidata contributor documentation issue, URLs in code for the projects themselves, comments in code for the projects themselves, similarly named Python scripts from WMF employees, related business forks of the projects
- Other endpoint: WMF repos for the projects themselves, WMDE employee fork, related business fork of the projects
- Check GitHub for routes to published data export paths
- WMF repos for the projects themselves, deprecated Grafana dashboard codebase and forks of it
- Check GitHub for routes to VPS related code
- wmde-dashboards: WMDE/WMF repos and forks
- wmdeanalytics: WMF repos and forks
- See above for wikidata-analytics
- wiktionary-cognate: WMF repos, some repos that do Wiktionary cognates themselves (one in Python)
- Checking Graphite processes for references to WD, WDCM and Wiktionary Cognates processes
- No results when searching for wdcm; wmde references are for test.wikidata, scribunto or a 2019 email banner campaign; no combination of wd or wikidata with analytics, joined by either a dash or an underscore, is returned; cognate results are all MediaWiki, but they refer to Cognate itself, not analysis of it
- Checking Prometheus
- Given when Prometheus was adopted (2021), we can be reasonably certain that no jobs are related
- Cronjobs
- Sheet with PHP jobs overview
- Sheet with R jobs overview
- Was checked by multiple members of the team
- It does not look like the R jobs are required for the PHP workflows at all
- Misc cron jobs
- Via WMF: there's only one job still running on the given user account - 2021_WMDEMitmachenBereichCampaign_PRODUCTION.R
- Box is checked once this job has been migrated from the current account or stopped
- Cloud VPS general check
- Instances (machines below)
- Are the above the only ones we need to consider?
- Asked in WMDE SWE Data and no indication of anything else
- Look into the machines themselves as a final check
- Reach out to WMF Cloud Services on #wikimedia-cloud
- https://openstack-browser.toolforge.org/project/wmdeanalytics
- I have access to wiktionary-cognate
- I don't have access to wikidata-analytics
- https://openstack-browser.toolforge.org/project/wmde-dashboards
- I do not have access, but my EM has asked for it for me
- I now have access
- Note that these projects will be shut down at the end of June via the Cloud VPS purge (task will be updated after)
- HDFS
- Do we need to investigate this? There are references to moving data there and cleaning a directory.
- hdfs:///tmp/wmde/analytics
- Was used as an intermediary step in the R-based processes
- Was it being cleaned properly? Can we check if there's anything in there?
- There are files related to this process in HDFS, so cleaning was not being done properly
- They will be deleted automatically, or we can request their deletion
- There are still files in here post-deprecation, so we should check again once the Cloud VPS work above is finished
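For the follow-up check, listing what is still under the intermediary path could look roughly like the sketch below. It assumes the `hdfs` CLI is on the PATH and that a valid Kerberos ticket is already in place; the helper names `parse_ls_line` and `list_leftovers` are made up for illustration and are not part of the original processes.

```python
import subprocess

# Intermediary path used by the R-based processes (from the notes above).
HDFS_PATH = "hdfs:///tmp/wmde/analytics"


def parse_ls_line(line):
    """Parse one line of `hdfs dfs -ls` output into a dict.

    Returns None for lines that are not entries (e.g. "Found 3 items").
    """
    parts = line.split(None, 7)
    if len(parts) != 8 or parts[0][0] not in "-d" or not parts[4].isdigit():
        return None
    perms, _replication, owner, _group, size, date, time, path = parts
    return {
        "is_dir": perms.startswith("d"),
        "owner": owner,
        "size": int(size),
        "modified": f"{date} {time}",
        "path": path,
    }


def list_leftovers(path=HDFS_PATH):
    """Recursively list files (not directories) still present under path.

    Assumption: `hdfs` is installed and Kerberos credentials are initialized.
    """
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", "-R", path],
        capture_output=True,
        text=True,
        check=True,
    )
    entries = (parse_ls_line(line) for line in result.stdout.splitlines())
    return [entry for entry in entries if entry and not entry["is_dir"]]


if __name__ == "__main__":
    for entry in list_leftovers():
        print(entry["modified"], entry["owner"], entry["size"], entry["path"])
```

An empty result after the Cloud VPS work is finished would suggest the directory has been cleaned; any remaining entries show who owns them and when they were last modified.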
- User tables
- Covered in https://phabricator.wikimedia.org/T358311
- Are these restricted without private data access?
- Tables can be archived and restricted to admins only
- Box is checked once this is finalized