
[EPIC] Check of legacy wmde analytics infrastructure
Open, Low, Public

Description

For these checks I'm comparing the edited versions of the code with the originals from Gerrit. What follows mostly entails codebase searches for the given topics.

Main goal of this: are there other processes, like the PHP scripts, that are linked to the old R-based analytics processes? We should also look from the other direction to see whether any of the functioning processes have scripts that serve as inputs to the old R-based processes, which would indicate that their connections should be checked as well.

  • Check WD Analytics code (Gerrit) for relations to other data services and a username
    • Graphite
      • None (libgraphite2 - for fonts)
    • Cloud VPS (references and two main endpoint URLs)
      • References in HTML <a> tags of ui.R and app_ui.R files (wmdeanalytics)
      • R and Docker based deployment files
      • URL targets to https://wikidata-analytics.wmcloud.org (wikidata-analytics)
      • No references to wmde-dashboards, wiktionary-cognate
      • References to neweditors in R files for TW ad hoc analysis (on Gerrit, but not a part of WikidataAnalytics)
    • Published data export path
      • R files for reading and writing data and HTML <a> tags
    • Cronjobs and commands
      • R cron jobs, data lake ETL Python jobs, moving data to HDFS, cleaning up HDFS, settings configurations, copying update strings to published data folder, kerberos credential initialization, copying files to published data folder, logging runs, Spark 2 commands, Docker commands, dependency installation
• References to /local are for setting a local dir, running queries after setting Kerberos via sudoing into the data lake, setting login credentials, returning errors, and installing dependencies
• No direct references to cron ("asynchronous" only)
      • Some commands are sudo commands
    • PHP codes
      • No references to analytics-wmde-scripts
      • No references to statistics::wmde (puppet)
      • References to php in code are for the Wikidata action API, Commons API and Reference Hunt PHP stats report
    • User name
• IRC and Wikimedia nick references
      • Code comments
      • Personal DB used for some data sources (wdcm_clients_wb_entity_usage)
      • Paths to config files and other setup files
      • Paths for saved data
    • Archive Gerrit repo
      • Archive/delete GitHub repo (check if not automatic)
      • Covered by T357697
• Check WDCM code (Gerrit) for relations to other data services and a username (deleted on GitHub because it is archived on Gerrit)
    • Graphite
      • None (libgraphite2 - for fonts)
    • Cloud VPS (references and two main endpoint URLs)
      • References in HTML <a> tags (wmdeanalytics)
      • No references to wmde-dashboards, wiktionary-cognate, neweditors, wikidata-analytics
    • Published data export path
      • R files for reading and writing data
    • Cronjobs and commands
      • R cron jobs, data lake ETL Python jobs, moving data to HDFS, copying update strings to published data folder
      • No direct references to cron
      • References to /local are for setting a local dir and running queries after setting kerberos
      • Some commands are sudo commands
    • PHP codes
      • No references to analytics-wmde-scripts
      • No references to statistics::wmde (puppet)
      • References to php in code are for the Wikidata action API and Reference Hunt PHP stats report
    • User name
• IRC and Wikimedia nick references
      • Code comments
      • Personal DB used for some data sources (wdcm_clients_wb_entity_usage)
      • Paths to config files and other setup files
      • Paths for saved data
    • Archive Gerrit repo
    • Archive/delete GitHub repo (check if not automatic)
  • Check Wiktionary Cognates code (Gerrit) for relations to other data services and a username
    • Graphite
      • None (libgraphite2 - for fonts)
    • Cloud VPS (references and two main endpoint URLs)
      • References to wiktionary-cognate images via .pper files and path files
      • No references to wmde-dashboards, wmdeanalytics, wikidata-analytics, neweditors
    • Published data export path
      • R files for reading and writing data
    • Cronjobs and commands
      • Installing dependencies
      • No direct references to cron
      • References to /local are for installing dependencies
      • No sudo commands
    • PHP codes
      • No references to analytics-wmde-scripts
      • No references to statistics::wmde (puppet)
    • User name
• IRC nick references
      • Original .yml file references to local directories
• .Rproj.user paths to local directories containing the Wiktionary Cognate Dockerfiles, engines and dev start files (_Img/Wiktionary-Cognate)
    • Archive Gerrit repo
      • Archive/delete GitHub repo (check if not automatic)
      • Covered by T357697
  • Check NewEditors Gerrit codes
• There were cron jobs still running for this
• It looks like there's only one file in here, which is a config file for Gerrit?
    • Archive Gerrit repo?
  • Check of PHP codes that could be related and are still in use
    • Assumption from time of writing is no connections
    • Check all Cloud VPS instances
      • No references to wmde-dashboards
      • No references to wmdeanalytics
      • No references to wikidata-analytics
      • No references to wiktionary
      • No references to cognate
      • No references to neweditors
      • No reference to .r or .R in the codes
    • Check published data export path
      • No references to wmde-analytics-engineering
    • Check references to projects or sensitive sources
      • No reference to wdcm, curator, curious, wd_analytics, tmp, hdfs/HDFS, the given username
  • Check GitHub for routes to dashboard endpoint or the other dashboard endpoint
• Endpoint: Wikidata Query Builder footer code, Mismatch Finder layout code, Wikidata contributor documentation issue, URLs in code for the projects themselves, comments in code for the projects themselves, similarly named Python scripts from WMF employees, related business forks of the projects
• Other endpoint: WMF repos for the projects themselves, a WMDE employee fork, related business forks of the projects
  • Check GitHub for routes to published data export paths
  • Check GitHub for routes to VPS related code
    • wmde-dashboards: WMDE/WMF repos and forks
    • wmdeanalytics: WMF repos and forks
    • See above for wikidata-analytics
    • wiktionary-cognate: WMF repos, some repos that do Wiktionary cognates themselves (one in Python)
  • Checking Graphite processes for references to WD, WDCM and Wiktionary Cognates processes
• No results when searching for wdcm
• wmde references are for test.wikidata, Scribunto or a 2019 email banner campaign
• No combination of wd or wikidata with analytics is returned with either a dash or underscore
• cognate results are all MediaWiki, but they refer to actual Cognate functionality, not analysis of it
  • Checking Prometheus
• Given the timing of when Prometheus was adopted (2021), we can be reasonably certain that no jobs are related
  • Cronjobs
  • Misc cron jobs
    • Via WMF: there's only one job still running on the given user account - 2021_WMDEMitmachenBereichCampaign_PRODUCTION.R
• Box is checked once this job has been migrated from the current account or stopped
  • Cloud VPS general check
  • HDFS
    • Do we need to investigate this? There are references to moving data there and cleaning a directory.
    • hdfs:///tmp/wmde/analytics
      • Was used as an intermediary step in the R-based processes
      • Was it being cleaned properly? Can we check if there's anything in there?
        • There are files related to this process in HDFS, so cleaning was not being done properly
• They will be deleted automatically, or we can request that they be deleted
• There are still files in here post-deprecation, so we should check again once the Cloud VPS work above is finished
    • User tables
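The open question above, whether hdfs:///tmp/wmde/analytics was being cleaned properly, can be checked by listing the directory and flagging anything last modified before the processes were deprecated. A minimal sketch, assuming the standard `hdfs dfs -ls` output format; the cutoff date below is a placeholder, not the actual deprecation date:

```python
"""Parse `hdfs dfs -ls -R hdfs:///tmp/wmde/analytics` output and flag
entries last modified before a cutoff, i.e. leftovers the cleanup step
should have removed. CUTOFF is a placeholder date, not the real one.
"""
from datetime import datetime

CUTOFF = datetime(2023, 1, 1)  # placeholder deprecation date


def stale_entries(ls_output: str, cutoff: datetime = CUTOFF) -> list[str]:
    """Return paths whose modification time precedes the cutoff.

    Expects the usual ls line shape:
    permissions replication owner group size YYYY-MM-DD HH:MM path
    """
    stale = []
    for line in ls_output.splitlines():
        parts = line.split()
        if len(parts) < 8:
            continue  # skip headers like "Found N items"
        date_str, time_str, path = parts[5], parts[6], parts[7]
        try:
            modified = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M")
        except ValueError:
            continue  # not an ls entry line
        if modified < cutoff:
            stale.append(path)
    return stale
```

Anything this returns after the Cloud VPS work is finished would confirm that the cleanup job was not running as intended.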

See Legacy Infrastructure Investigation

Event Timeline


@hashar, pinging you on this as you were helpful in T354534: Archive Wikidata Concepts Monitor repositories. There are a few more related Gerrit repos that should be deprecated at this point, specifically:

Would you be able to help with this? Totally fine if I should make a specific task or check with someone else!

Also, the first two have GitHub repos. As they're Gerrit mirrors, will they be removed automatically, or do we have to delete them explicitly?

Manuel renamed this task from "Check of legacy wmde analytics infrastructure" to "[EPIC] Check of legacy wmde analytics infrastructure". Mar 20 2024, 9:43 AM
Manuel lowered the priority of this task from High to Low.
Manuel moved this task from Prioritized backlog to WIP epics on the Wikidata Analytics (Kanban) board.

Note that in checking the tmp directory just now, there are still files/directories in there, meaning that parts of the process are likely still running (maybe parts that don't need private data access). We'll check this again in a month once the VPS machines are shut down.