
[Analytics] [Request] Remove all unneeded WMDE data from published/datasets/ directories
Open, Needs Triage, Public

Description

Wikidata Analytics Request

This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create tasks for the team. Thank you!

Purpose

Please provide as much context as possible as well as what the produced insights or services will be used for.

In T388634 we removed the data that the new WMDE Airflow processes had generated in published/datasets/, as the decision was made to stop this process due to how long it was taking to get the data release rights. That data came from modern WMDE Analytics processes, but there is still a lot of data from deprecated WMDE Analytics processes that should be cleaned out.

Specific Results

Please detail the specific results that the task should deliver.

Investigate all WMDE-related directories in published/datasets/ and remove any data that is no longer needed.

Desired Outputs

Please list the desired outputs of this task.

  • Report of which information could be removed
    • stat1009 and stat1011 have nothing related to WMDE
    • one-off/wikidata/ appears to contain Wikidata map-related datasets and other files that can be kept
    • Not all files in these directories have equivalents on the stat servers, as some are orphaned files from deleted stat servers (1004 to 1007)
    • wmde-analytics-engineering/Wikidata/wbs_propertypairs
      • Is split between stat1010 and other servers
      • The other servers are no longer in use, so all files except for the last two are orphaned
    • The following from published/datasets/wmde-analytics-engineering/ are all on stat1008: qurator, wdcm, Wikidata, Wiktionary, WMDE_Banners, WMDE_NewEds_Comprehensive
  • Checking with WMDE Product and other stakeholders to see if this data can be deleted
  • Looking for where data has been used and cleaning up any wiki pages
  • Deleting the data and directories

Deadline

Please make the time sensitivity of this request clear with a date that it should be completed by. If there is no specific date, then the task will be triaged based on its priority.

This task is non-critical tech debt and has no deadline.


Information below this point is filled out by the task assignee.

Assignee Planning

Sub Tasks

A full breakdown of the steps to complete this task.

  • Check published/datasets/ for WMDE-related files
  • Find files on servers
    • These files are not on HDFS; they are in the /srv/published/ directories on the stat servers (see details above)
    • Some files are also orphaned from old stat servers (1004 to 1007)
    • These files need to be deleted on the analytics servers by a WMF SRE, and then they won't repopulate
  • Report findings to Analytics stakeholders at WMDE
  • Delete data and wiki page references for all data that can be removed
  • Remove all HDFS tables that were created by WMDE processes outside of personal schemas, and delete any WMDE data from contractors
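
The "Find files on servers" step above amounts to a set difference: anything listed under published/datasets/ that no active stat server still holds is a candidate orphan. A minimal sketch of that comparison, assuming the file listings have already been collected as relative paths. The function name `find_orphans` and the example file names are hypothetical and only illustrate the idea:

```python
from typing import Dict, Iterable, Set


def find_orphans(
    published: Iterable[str],
    host_listings: Dict[str, Iterable[str]],
) -> Set[str]:
    """Return published paths that no listed host still provides."""
    on_hosts: Set[str] = set()
    for paths in host_listings.values():
        on_hosts.update(paths)
    return set(published) - on_hosts


if __name__ == "__main__":
    # Hypothetical listings; real ones would come from the published site
    # and from /srv/published/ on each active stat server.
    published = [
        "wmde-analytics-engineering/Wikidata/example_a.tsv",
        "wmde-analytics-engineering/Wikidata/example_b.tsv",
    ]
    host_listings = {
        "stat1010": ["wmde-analytics-engineering/Wikidata/example_b.tsv"],
    }
    for path in sorted(find_orphans(published, host_listings)):
        print(path)
```

Paths reported this way are only candidates: whether they can actually be deleted still depends on the stakeholder check described above.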

Estimation

Estimate: 2 days of work spread over a longer period; we'll also need a deprecation notice
Actual:

Notes

Things that came up during the completion of this task, questions to be answered and follow up tasks.

  • Note

Event Timeline

At the request of @AndrewTavis_WMDE, I have deleted the /srv/analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/TechnicalWishes and /srv/analytics.wikimedia.org/published/datasets/wmdecampaigns directories.

They should not appear in https://analytics.wikimedia.org anymore.

Thanks for the help, @brouberol!

Note that the reason for deletion was that both directories were empty and could not be found in HDFS or on any of the active stat servers (1008 to 1011), so they appeared to be orphaned from a stat server between 1004 and 1007.

Worth noting that when I run published-sync on a stat host (say, stat1010), it runs this command:

/usr/bin/flock -n /var/lock/published-sync -c /usr/bin/rsync -rptL -v --delete /srv/published/ analytics-web.discovery.wmnet::published-destination/stat1010//

Perhaps there are old published-destination/stat100X directories (where X is 1, 2, …, 7) at analytics-web.discovery.wmnet that continue to sync to /srv/analytics.wikimedia.org/published/datasets?
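
One way to test this hypothesis would be to list the per-host directories at the rsync destination and flag any that do not belong to an active stat host. A rough sketch, assuming the directory names are available as a list of strings; `stale_host_dirs` is a hypothetical helper, and the active set is taken from the "1008 to 1011" range mentioned above:

```python
import re
from typing import Iterable, List, Set

# Assumption: the active stat hosts are 1008 to 1011, as noted in this task.
ACTIVE_HOSTS: Set[str] = {"stat1008", "stat1009", "stat1010", "stat1011"}

# Only directory names that look like a stat host are considered.
HOST_DIR = re.compile(r"^stat\d+$")


def stale_host_dirs(
    dir_names: Iterable[str],
    active: Set[str] = ACTIVE_HOSTS,
) -> List[str]:
    """Return stat-host directory names that are not in the active set."""
    return sorted(
        name for name in dir_names
        if HOST_DIR.match(name) and name not in active
    )


if __name__ == "__main__":
    # Hypothetical directory names, for illustration only.
    for name in stale_host_dirs(["stat1004", "stat1010", "stat1007"]):
        print(name)
```

With the real destination root (its location on the web host is not stated in this task), the input could come from something like `os.listdir(root)`.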