
Remove code and tables related to deprecated mediawiki_wikitext_history and mediawiki_wikitext_current
Open, In Progress, Needs Triage, Public

Description

Since making the new MW Content Tables production quality, we have marked the old snapshot-based tables as deprecated:

table                             deprecation date
wmf.mediawiki_wikitext_history    2025-01-29
wmf.mediawiki_wikitext_current    2025-04-30

A cursory search shows that the history table is only being referenced by our own DPE code.

The current table is referenced by our own code plus one deprecated project.
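A minimal sketch of how such a cursory search could be repeated over a local checkout. The function name `find_refs` and the idea of searching a local directory are assumptions for illustration; the task does not say which tool or code search was actually used.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every line in a directory tree that mentions
# either deprecated table. Only standard grep options are used.
find_refs() {
  local dir="$1"
  # -r: recurse, -n: show line numbers, -E: extended regex
  grep -rnE 'mediawiki_wikitext_(history|current)' "$dir"
}

# Search the directory given as the first argument (default: current dir).
# grep exits non-zero when nothing matches, so report that explicitly.
find_refs "${1:-.}" || echo "no references found"
```

Running it once per repository checkout gives a quick list of files that still need migrating; it will not see references in repositories that are not checked out locally.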

In this task we should:

  • Figure out the ownership of generate_anchor_dictionary_spark.py code, and get a commitment to migrate code.
    • Code is indeed being migrated via T398950. See reference to new tables here.
  • Remove code that generates the content of both tables
    • remove code that imports from dumps servers to HDFS at modules/profile/manifests/analytics/refinery/job/import_mediawiki_dumps.pp
    • remove data purge code at modules/profile/manifests/analytics/refinery/job/data_purge.pp
    • remove airflow jobs
  • Remove tables
  • Remove any remaining data from the raw imports at /mnt/hdfs/wmf/data/raw/mediawiki/dumps

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title                                        Reference                                 Author    Source Branch         Dest Branch
analytics: remove mediawiki_wikitext_* DAGs  repos/data-engineering/airflow-dags!1578  xcollazo  remove-wikitext-jobs  main

Event Timeline

Change #1167224 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[operations/puppet@production] analytics: Absent rsync scripts that import Dumps 1 XML into HDFS

https://gerrit.wikimedia.org/r/1167224

xcollazo changed the task status from Open to In Progress. (Jul 8 2025, 3:54 PM)
xcollazo claimed this task.

Change #1167224 merged by Btullis:

[operations/puppet@production] analytics: Absent rsync scripts that import Dumps 1 XML into HDFS

https://gerrit.wikimedia.org/r/1167224

Change #1172113 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[operations/puppet@production] analytics: Remove rsync scripts that import Dumps 1 XML into HDFS

https://gerrit.wikimedia.org/r/1172113

xcollazo updated the task description.

Manually deleted the db representation for the following DAGs from the Airflow UI:

mediawiki_wikitext_current
mediawiki_wikitext_history
wikidata_wikitext_history

Change #1172113 merged by Btullis:

[operations/puppet@production] analytics: Remove rsync scripts that import Dumps 1 XML into HDFS

https://gerrit.wikimedia.org/r/1172113

Removed remaining Dumps1 XML artifacts from HDFS:

$ hostname -f
an-launcher1002.eqiad.wmnet
$ whoami 
analytics

$ kerberos-run-command analytics hdfs dfs -ls /wmf/data/raw/mediawiki/dumps
Found 3 items
drwxr-x---   - analytics analytics-privatedata-users          0 2025-07-01 05:00 /wmf/data/raw/mediawiki/dumps/pages_meta_current
drwxr-x---   - analytics analytics-privatedata-users          0 2025-07-01 03:00 /wmf/data/raw/mediawiki/dumps/pages_meta_history
drwxr-x---   - analytics analytics-privatedata-users          0 2025-07-20 05:00 /wmf/data/raw/mediawiki/dumps/siteinfo_namespaces

$ kerberos-run-command analytics hdfs dfs -du -h /wmf/data/raw/mediawiki/dumps/pages_meta*
406.0 G  /wmf/data/raw/mediawiki/dumps/pages_meta_current/20250401
408.4 G  /wmf/data/raw/mediawiki/dumps/pages_meta_current/20250501
411.5 G  /wmf/data/raw/mediawiki/dumps/pages_meta_current/20250601
406.2 G  /wmf/data/raw/mediawiki/dumps/pages_meta_current/20250701
6.3 T  /wmf/data/raw/mediawiki/dumps/pages_meta_history/20250401
6.4 T  /wmf/data/raw/mediawiki/dumps/pages_meta_history/20250501
6.4 T  /wmf/data/raw/mediawiki/dumps/pages_meta_history/20250601
4.0 T  /wmf/data/raw/mediawiki/dumps/pages_meta_history/20250701

$ kerberos-run-command analytics hdfs dfs -rm -r /wmf/data/raw/mediawiki/dumps/pages_meta*
25/07/25 16:33:57 INFO fs.TrashPolicyDefault: Moved: 'hdfs://analytics-hadoop/wmf/data/raw/mediawiki/dumps/pages_meta_current' to trash at: hdfs://analytics-hadoop/user/analytics/.Trash/Current/wmf/data/raw/mediawiki/dumps/pages_meta_current
25/07/25 16:33:57 INFO fs.TrashPolicyDefault: Moved: 'hdfs://analytics-hadoop/wmf/data/raw/mediawiki/dumps/pages_meta_history' to trash at: hdfs://analytics-hadoop/user/analytics/.Trash/Current/wmf/data/raw/mediawiki/dumps/pages_meta_history
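To get a rough idea of how much space the removal covers, the `-du -h` listing above can be summed. The following is a sketch only: the helper name `sum_du` is hypothetical, and it assumes every line has the form `<number> <unit>  <path>` with the unit being `G` or `T`, as in the listing. Because the inputs are already rounded, the total is approximate. Note also that, per the log lines above, the directories were moved to the HDFS trash, so the space is only reclaimed once the trash is emptied.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum human-readable sizes from `hdfs dfs -du -h` output.
# Assumes "<number> <unit>  <path>" per line, unit G or T (1 T = 1024 G).
sum_du() {
  awk '
    $2 == "G" { total += $1 }
    $2 == "T" { total += $1 * 1024 }
    END { printf "%.1f T\n", total / 1024 }
  '
}

# Sizes copied from the listing above.
sum_du <<'EOF'
406.0 G  /wmf/data/raw/mediawiki/dumps/pages_meta_current/20250401
408.4 G  /wmf/data/raw/mediawiki/dumps/pages_meta_current/20250501
411.5 G  /wmf/data/raw/mediawiki/dumps/pages_meta_current/20250601
406.2 G  /wmf/data/raw/mediawiki/dumps/pages_meta_current/20250701
6.3 T  /wmf/data/raw/mediawiki/dumps/pages_meta_history/20250401
6.4 T  /wmf/data/raw/mediawiki/dumps/pages_meta_history/20250501
6.4 T  /wmf/data/raw/mediawiki/dumps/pages_meta_history/20250601
4.0 T  /wmf/data/raw/mediawiki/dumps/pages_meta_history/20250701
EOF
```

With the sizes listed above this comes to about 24.7 T.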

Mentioned in SAL (#wikimedia-analytics) [2025-07-25T16:36:40Z] <xcollazo> removed remaining raw Dumps1 XML files from HDFS. See T396031#11035363 for details.

The only remaining task here is to DROP the tables themselves. Will do that soon.
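As a sketch of that last step, the statements would presumably look like the ones printed below. `DROP TABLE IF EXISTS` is standard Hive and Spark SQL syntax, but everything else here is an assumption: the helper name `drop_statements` is hypothetical, the script only prints the statements (a dry run), and which SQL client would execute them is not stated in this task. If the tables are external, dropping them removes only the metadata, so the underlying table data may need a separate cleanup.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the DROP statements for the two deprecated tables.
# Nothing is executed against the metastore here.
tables=(
  wmf.mediawiki_wikitext_history
  wmf.mediawiki_wikitext_current
)

# Hypothetical helper: emit one DROP statement per table.
drop_statements() {
  local t
  for t in "${tables[@]}"; do
    printf 'DROP TABLE IF EXISTS %s;\n' "$t"
  done
}

drop_statements
```

Printing first and executing separately leaves room to review the exact table names before anything irreversible happens.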