
[EPIC] Check of legacy wmde analytics infrastructure
Closed, Resolved, Public

Description

For these checks I'm looking at both the edited versions of the code and the originals from Gerrit. Each check below is essentially a codebase search for the given topics.

Main goal of this: are there other processes, like the PHP code, that are linked to the old R-based analytics processes? We should also look from the other direction to see whether any of the functioning processes have scripts that serve as inputs for the old R-based processes, which would indicate that those connections should be checked as well.
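
As a rough illustration of what such a codebase search can look like, here is a minimal Python sketch. It is not the tooling that was actually used for these checks; the function name `search_codebase` and the `TERMS` list are assumptions made for the example, with the terms borrowed from the checklists in this task.

```python
import os
import re
from collections import defaultdict

# Hypothetical term list, borrowed from the checklists in this task.
TERMS = [
    "graphite",
    "wmde-dashboards",
    "wmdeanalytics",
    "wikidata-analytics",
    "wiktionary-cognate",
    "neweditors",
    "analytics-wmde-scripts",
    "statistics::wmde",
]


def search_codebase(root, terms=TERMS):
    """Return {term: [(path, line_number, line_text), ...]} for each
    case-insensitive hit in any file under root (skipping .git)."""
    patterns = {t: re.compile(re.escape(t), re.IGNORECASE) for t in terms}
    hits = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune version-control metadata so it is not searched.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="replace") as fh:
                    for lineno, line in enumerate(fh, start=1):
                        for term, pattern in patterns.items():
                            if pattern.search(line):
                                hits[term].append((path, lineno, line.strip()))
            except OSError:
                # Unreadable files are skipped rather than aborting the scan.
                continue
    return dict(hits)
```

Hits still need a manual look: a plain substring search for graphite also matches libgraphite2, which is why those results are noted as font-related in the lists below.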

Check WD Analytics code (Gerrit) for relations to other data services and a username

  • Graphite
    • None (libgraphite2 - for fonts)
  • Cloud VPS (references and two main endpoint URLs)
    • References in HTML <a> tags of ui.R and app_ui.R files (wmdeanalytics)
    • R and Docker based deployment files
    • URL targets to https://wikidata-analytics.wmcloud.org (wikidata-analytics)
    • No references to wmde-dashboards, wiktionary-cognate
    • References to neweditors in R files for TW ad hoc analysis (on Gerrit, but not a part of WikidataAnalytics)
  • Published data export path
    • R files for reading and writing data and HTML <a> tags
  • Cronjobs and commands
    • R cron jobs, data lake ETL Python jobs, moving data to HDFS, cleaning up HDFS, settings configurations, copying update strings to published data folder, kerberos credential initialization, copying files to published data folder, logging runs, Spark 2 commands, Docker commands, dependency installation
    • References to /local are for setting a local dir and running queries after setting kerberos via sudoing into the data lake, setting login credentials, returning errors, installing dependencies
    • No direct references to cron ("asynconous" only)
    • Some commands are sudo commands
  • PHP codes
    • No references to analytics-wmde-scripts
    • No references to statistics::wmde (puppet)
    • References to php in code are for the Wikidata action API, Commons API and Reference Hunt PHP stats report
  • User name
    • IRC and Wikimedia nick references
    • Code comments
    • Personal DB used for some data sources (wdcm_clients_wb_entity_usage)
    • Paths to config files and other setup files
    • Paths for saved data
  • Archive Gerrit repo

Check WDCM code (Gerrit) for relations to other data services and a username (deleted on GitHub because of archive on Gerrit)

  • Graphite
    • None (libgraphite2 - for fonts)
  • Cloud VPS (references and two main endpoint URLs)
    • References in HTML <a> tags (wmdeanalytics)
    • No references to wmde-dashboards, wiktionary-cognate, neweditors, wikidata-analytics
  • Published data export path
    • R files for reading and writing data
  • Cronjobs and commands
    • R cron jobs, data lake ETL Python jobs, moving data to HDFS, copying update strings to published data folder
    • No direct references to cron
    • References to /local are for setting a local dir and running queries after setting kerberos
    • Some commands are sudo commands
  • PHP codes
    • No references to analytics-wmde-scripts
    • No references to statistics::wmde (puppet)
    • References to php in code are for the Wikidata action API and Reference Hunt PHP stats report
  • User name
    • IRC and Wikimedia nick references
    • Code comments
    • Personal DB used for some data sources (wdcm_clients_wb_entity_usage)
    • Paths to config files and other setup files
    • Paths for saved data
  • Archive Gerrit repo
  • Archive/delete GitHub repo (check if not automatic)

Check Wiktionary Cognates code (Gerrit) for relations to other data services and a username

  • Graphite
    • None (libgraphite2 - for fonts)
  • Cloud VPS (references and two main endpoint URLs)
    • References to wiktionary-cognate images via .pper files and path files
    • No references to wmde-dashboards, wmdeanalytics, wikidata-analytics, neweditors
  • Published data export path
    • R files for reading and writing data
  • Cronjobs and commands
    • Installing dependencies
    • No direct references to cron
    • References to /local are for installing dependencies
    • No sudo commands
  • PHP codes
    • No references to analytics-wmde-scripts
    • No references to statistics::wmde (puppet)
  • User name
    • IRC nick references
    • Original .yml file references to local directories
    • .Rproj.user paths to local directories where Wiktionary Cognate dockerfiles, engines and dev start files are _Img/Wiktionary-Cognate
  • Archive Gerrit repo
    • Archive/delete GitHub repo (check if not automatic)
    • Covered by T357697

Check NewEditors Gerrit codes

    • As there were cron jobs that were still running for this
    • It looks like there's only one file in here that's a config file for Gerrit?
    • Archive Gerrit repo?
  • Check of PHP codes that could be related and are still in use
    • Assumption from time of writing is no connections
    • Check all Cloud VPS instances
      • No references to wmde-dashboards
      • No references to wmdeanalytics
      • No references to wikidata-analytics
      • No references to wiktionary
      • No references to cognate
      • No references to neweditors
      • No reference to .r or .R in the codes
    • Check published data export path
      • No references to wmde-analytics-engineering
    • Check references to projects or sensitive sources
      • No reference to wdcm, curator, curious, wd_analytics, tmp, hdfs/HDFS, the given username

Check GitHub for routes to dashboard endpoint or the other dashboard endpoint

  • Endpoint: Wikidata Query Builder footer code, Mismatch Finder Layout code, Wikidata contributor documentation issue, URLs in code for the projects themselves, comments in code for the projects themselves, similarly named Python scripts from WMF employees, related business forks of the projects
  • Other endpoint: WMF repos for the projects themselves, WMDE employee fork, related business fork of the projects

Check GitHub for routes to published data export paths

  • Check GitHub for routes to VPS related code
    • wmde-dashboards: WMDE/WMF repos and forks
    • wmdeanalytics: WMF repos and forks
    • See above for wikidata-analytics
    • wiktionary-cognate: WMF repos, some repos that do Wiktionary cognates themselves (one in Python)

Check Graphite processes for references to WD, WDCM and Wiktionary Cognates processes

    • No results when searching for wdcm; wmde references are for test.wikidata, scribunto or a 2019 email banner campaign; no combination of wd or wikidata with analytics is returned with either a dash or underscore; cognate results are all MediaWiki, but they relate to Cognate itself, not analysis of it
  • Checking Prometheus
    • Given that Prometheus was only adopted in 2021, we can be reasonably certain that no jobs are related
  • Cronjobs
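
The Graphite search for wd or wikidata combined with analytics via a dash or underscore can be sketched as a regular expression filter. This is only an illustration under assumptions: the metric names in the usage note are made up, matching both word orders is my reading of "combination", and `matching_metrics` is a hypothetical helper, not part of Graphite.

```python
import re

# wd or wikidata joined to "analytics" by a dash or underscore, in either
# order. The lookarounds stop "wd" from matching inside longer words.
PATTERN = re.compile(
    r"(?<![a-z0-9])(?:wd|wikidata)[-_]analytics"
    r"|analytics[-_](?:wd|wikidata)(?![a-z0-9])",
    re.IGNORECASE,
)


def matching_metrics(names):
    """Return the metric names that contain a wd/wikidata analytics combination."""
    return [name for name in names if PATTERN.search(name)]
```

For example, given the made-up names `wikidata-analytics.requests`, `wd_analytics.load`, `crowd_analytics.x` and `test.wikidata.edits`, only the first two are returned.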

Misc cron jobs

  • Via WMF: there's only one job still running on the given user account - 2021_WMDEMitmachenBereichCampaign_PRODUCTION.R
  • Box is checked once this job has been migrated from current account or stopped
    • Account access was restricted

Cloud VPS general check

HDFS

  • Do we need to investigate this? There are references to moving data there and cleaning a directory.
  • hdfs:///tmp/wmde/analytics
    • 5/7/2024: @AndrewTavis_WMDE: I just checked this and the only stuff that's in there is from me :)
    • Was used as an intermediary step in the R-based processes
    • Was it being cleaned properly? Can we check if there's anything in there?
      • There are files related to this process in HDFS, so cleaning was not being done properly
      • They will be deleted automatically, or we can request to have them be deleted
      • There are still files in here post deprecation, so we should check again once the Cloud VPS work above is finished
  • User tables
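
To check who owns what is left in hdfs:///tmp/wmde/analytics, the output of `hdfs dfs -ls` can be grouped by owner. The sketch below only parses listing text that has already been captured; the helper name `paths_by_owner` and the sample listing in the usage note are assumptions for illustration, not real output from the cluster.

```python
from collections import defaultdict


def paths_by_owner(listing):
    """Group paths from `hdfs dfs -ls` output by owning user.

    Each entry line has eight whitespace-separated fields:
    permissions, replication, owner, group, size, date, time, path.
    Other lines, such as the "Found N items" header, are skipped.
    """
    owners = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        owners[fields[2]].append(fields[7])
    return dict(owners)
```

Feeding it a listing in which every entry belongs to one (made-up) user such as `alice` returns a dictionary with that single key, which corresponds to the "only stuff that's in there is from me" observation above.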

Gerrit groups related to prior work (see overview)

See Legacy Infrastructure Investigation

Event Timeline


@hashar, pinging you on this as you were helpful in T354534: Archive Wikidata Concepts Monitor repositories. There are a few more related Gerrit repos that should be deprecated at this point, specifically:

Would you be able to help with this? Totally fine if I should make a specific task or check with someone else!

Also, for the first two there are GitHub repos for them. As they're Gerrit mirrors, will they be removed automatically, or do we have to do it explicitly?

Manuel renamed this task from Check of legacy wmde analytics infrastructure to [EPIC] Check of legacy wmde analytics infrastructure. Mar 20 2024, 9:43 AM
Manuel lowered the priority of this task from High to Low.
Manuel moved this task from To-Do to WIP Epics on the Wikidata Analytics (Kanban) board.

Note that in checking the tmp directory just now, there are still files/directories in there, meaning that parts of the process are likely still running (maybe parts that don't need private data access). We'll check this again in a month once the VPS machines are shut down.

One thing that I just noticed and that might be worth cleaning up as well: There are a lot of analytics-related user groups on Gerrit. Probably many of those are not needed anymore either: https://gerrit.wikimedia.org/r/admin/groups/q/filter:wmde

Thank you for this, @Michael! Really appreciate you following along with this and helping out :) :)

So I looked through the members of all of these groups, and the following is an overview:

1. Non-WMDE Analytics Members

Groups where there are other members aside from prior WMDE Analytics employees.

Current WMDE Employees

Technical Wishes Related

These groups have current WMDE employees in them.

Non-WMDE Staff

2. Suggestions for Deletion

@awight, looping you in as there are TW related groups here 👋 Maybe we can chat in the office on this in the coming days, but could you let us know if any of the analytics-wmde-TW* groups are needed?

The TW groups seem automatically created, I don't know. But +1 please delete, we don't need them at the moment.

Thanks for the response, @awight! Looking into this again, what's needed here is:

I'll ping WMF on this in Slack :)

We need the wmdeanalytics Cloud VPS machine to be deleted

A wmdeanalytics Cloud VPS project admin member should be able to do this. You probably need someone from Cloud VPS to delete the Cloud VPS project, I'm not sure.

The following Gerrit groups should be deleted:

I think the process is here: https://www.mediawiki.org/wiki/Gerrit/Inactive_projects

Thanks for the above points, @Ottomata!

@rook, pinging you here as you're the WMF employee on the wmdeanalytics machine. Would you be able to delete the machine? I've marked it for deletion in the Cloud VPS 2024 Purge Wikitech page, but would like to close this task out :)

And mediawiki.org/wiki/Gerrit/Inactive_projects seems to be more focused on projects, but given the directions there we could remove all members and prepend INACTIVE to the group name?

Oh, right! Groups, not repos. Got it.

Amazing work on this extensive catalog! Looks like the last check box is now ticked, and I'm just here to state the obvious :-)

Thanks @awight! 😊 Just moved it to resolved epics on our end. Will do the final close in a few weeks when we go through the backlog 🎉

cleaning up the board