
Declare wmf_content.mediawiki_content_history_v1 a production table
Closed, Resolved · Public

Description

In this task we should:

  • Rename the table from wmf_dumps.wikitext_raw_rc2 to wmf_content.mediawiki_content_history_v1. If we cannot rename, then re-create, re-backfill, and run data quality checks.
  • Check all DAG dependencies to make sure they are sending alerts to the team.
  • Check all DQ checks to make sure they are sending alerts to the team.
    • Although DQ checks are being made, alerting on them has been moved to Phase III.
  • Add config to have it on DataHub
  • Announce its availability on the usual channels.
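The rename in the first bullet may be possible in place: Spark's Iceberg catalog supports `ALTER TABLE ... RENAME TO`, including across databases. A minimal sketch, assuming the target database does not yet exist and the catalog permits cross-database renames (neither is confirmed in this task):

```sql
-- Sketch only: assumes the Iceberg catalog allows cross-database renames.
CREATE DATABASE IF NOT EXISTS wmf_content;
ALTER TABLE wmf_dumps.wikitext_raw_rc2
  RENAME TO wmf_content.mediawiki_content_history_v1;
```

If the catalog rejects the rename, the fallback described above applies: create the table under the new name, re-backfill, and re-run the data quality checks.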

More:

Looks like production is happily catching up: https://airflow-analytics.wikimedia.org/home?tags=mediawiki_content.

I think we can close this task, but I'm writing down a few cleanup tasks that we can perhaps do elsewhere:

Additionally:

  • Delete wikitext_raw tables.

Event Timeline

xcollazo renamed this task from Declare wmf_dumps.wikitext_raw a production table to Declare wmf_content.mediawiki_content_history_v1 a production table. Jan 21 2025, 4:07 PM
xcollazo updated the task description.
xcollazo changed the task status from Open to In Progress. Jan 28 2025, 6:14 PM
xcollazo claimed this task.
xcollazo triaged this task as High priority.
xcollazo moved this task from Sprint Backlog to In Process on the Dumps 2.0 (Kanban Board) board.

Change #1114790 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[operations/deployment-charts@master] Scale down mw-content-history-reconcile-enrich for nominal events intake

https://gerrit.wikimedia.org/r/1114790

Change #1114790 merged by jenkins-bot:

[operations/deployment-charts@master] Scale down mw-content-history-reconcile-enrich for nominal events intake

https://gerrit.wikimedia.org/r/1114790

Mentioned in SAL (#wikimedia-operations) [2025-01-29T18:44:57Z] <xcollazo@deploy2002> Started deploy [airflow-dags/analytics@5b0aeae]: Deploying latest DAGs to the analytics Airflow instance. T358375.

Mentioned in SAL (#wikimedia-analytics) [2025-01-29T18:45:19Z] <xcollazo> Deploying latest DAGs to the analytics Airflow instance. T358375.

Mentioned in SAL (#wikimedia-operations) [2025-01-29T18:45:33Z] <xcollazo@deploy2002> Finished deploy [airflow-dags/analytics@5b0aeae]: Deploying latest DAGs to the analytics Airflow instance. T358375. (duration: 00m 35s)

Ran the following to get rid of old data under wmf_dumps:

sudo -u analytics bash

kerberos-run-command analytics spark3-sql

spark-sql (default)> use wmf_dumps;
Response code
Time taken: 6.126 seconds

spark-sql (default)> show tables;
database	tableName	isTemporary
wmf_dumps	wikitext_inconsistent_rows_rc1	false
wmf_dumps	wikitext_raw_rc2	false
Time taken: 0.439 seconds, Fetched 2 row(s)

spark-sql (default)> DROP TABLE wikitext_inconsistent_rows_rc1;
Response code
Time taken: 2.052 seconds
spark-sql (default)> DROP TABLE wikitext_raw_rc2;
Response code
Time taken: 0.176 seconds
spark-sql (default)> DROP DATABASE wmf_dumps;
Response code
Time taken: 0.124 seconds

spark-sql (default)> quit;

analytics@an-launcher1002:/home/xcollazo$ hdfs dfs -ls /wmf/data/wmf_dumps
Found 2 items
drwxr-xr-x   - analytics analytics-privatedata-users          0 2024-11-25 20:35 /wmf/data/wmf_dumps/wikitext_inconsistent_rows_rc1
drwxrwxr-x   - analytics analytics-privatedata-users          0 2023-10-24 17:10 /wmf/data/wmf_dumps/wikitext_raw_rc2
analytics@an-launcher1002:/home/xcollazo$ hdfs dfs -rm -r /wmf/data/wmf_dumps
rm: Failed to move hdfs://analytics-hadoop/wmf/data/wmf_dumps to trash hdfs://analytics-hadoop/user/analytics/.Trash/Current/wmf/data/wmf_dumps: Permission denied: user=analytics, access=WRITE, inode="/wmf/data":hdfs:hadoop:drwxrwxr-x
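The failure above is about the trash move, not the delete itself: `hdfs dfs -rm -r` first tries to move the path into the invoking user's `.Trash`, which requires WRITE access on the parent directory `/wmf/data` (owned by `hdfs:hadoop` per the error message). A quick way to confirm this before reaching for the superuser, assuming the same session:

```shell
# Show the permissions of the parent directory itself (-d), not its contents.
# Per the error above, it is hdfs:hadoop drwxrwxr-x, so the analytics user
# (not in the hadoop group) has no WRITE bit here.
hdfs dfs -ls -d /wmf/data
```

Note that `-skipTrash` would not help: removing the directory entry still requires WRITE on the parent, which is why the retry below switches to the `hdfs` superuser.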

sudo -u hdfs bash

hdfs@an-launcher1002:/home/xcollazo$ hdfs dfs -ls /wmf/data/wmf_dumps
Found 2 items
drwxr-xr-x   - analytics analytics-privatedata-users          0 2024-11-25 20:35 /wmf/data/wmf_dumps/wikitext_inconsistent_rows_rc1
drwxrwxr-x   - analytics analytics-privatedata-users          0 2023-10-24 17:10 /wmf/data/wmf_dumps/wikitext_raw_rc2
hdfs@an-launcher1002:/home/xcollazo$ hdfs dfs -rm -r /wmf/data/wmf_dumps
25/01/29 19:06:18 INFO fs.TrashPolicyDefault: Moved: 'hdfs://analytics-hadoop/wmf/data/wmf_dumps' to trash at: hdfs://analytics-hadoop/user/hdfs/.Trash/Current/wmf/data/wmf_dumps

Mentioned in SAL (#wikimedia-analytics) [2025-01-29T19:08:56Z] <xcollazo> Ran the following to get rid of old data under wmf_dumps: T358375#10506036

To deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1114790, I followed the instructions from https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments and ran the following:

ssh deployment.eqiad.wmnet
cd /srv/deployment-charts/helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich
helmfile -e dse-k8s-eqiad -i apply --context 5
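For future re-runs of this kind of deploy, helmfile can also show the pending changes before applying anything. A sketch of the same session with an explicit preview step (assuming the same chart directory and environment):

```shell
ssh deployment.eqiad.wmnet
cd /srv/deployment-charts/helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich
# Preview the diff against the live release before touching anything.
helmfile -e dse-k8s-eqiad diff --context 5
# Then apply; -i prompts interactively before making changes.
helmfile -e dse-k8s-eqiad -i apply --context 5
```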
xcollazo updated the task description.