
Consider renaming columns and/or table to abide by the data modeling guidelines
Closed, Resolved · Public

Description

Since this table will go to prod after the data modeling guidelines were established, we should strive to abide by them:

I will share some thoughts on naming, for whatever they're worth. I recognize most of these names were used for historical continuity but since table migrations are so rare, I think it's worth taking the opportunity to clarify and standardize things for the next decade of users.

  • wmf_dumps isn't the most useful name, since lots of different datasets get dumped and most of them will be in other databases. What about wmf_content?
  • revision_timestamp, error_timestamp: according to the data modeling guidelines, we should use revision_dt and error_dt instead.
  • wiki_db: according to the data modeling guidelines, we should use wiki_id instead.
  • revision_is_minor_edit: it would be less redundant to use revision_is_minor
  • user_is_visible, revision_comment_is_visible, revision_content_is_visible: "visible" is actually quite a good term for this, so personally I kind of want to keep it, but it's not used elsewhere. The official name for the functionality is revision deletion, although some parts of the interface do use "visible". mediawiki_history provides this as an array named revision_deleted_parts. Maybe it's worth emulating that? 🤷🏽‍♂️
  • page_redirect_title: I think this is a tiny bit confusing (what is the title of a redirect?). Maybe page_redirect_target instead, as "target" seems to be the common term (e.g. on en:w:Wikipedia:Redirect and mw:Help:Redirects).
  • revision_size and content_size: according to the data modeling guidelines, these should be suffixed by the unit (revision_size_bytes and content_size_bytes).
  • row_last_update, row_visibility_last_update: according to the data modeling guidelines, these should be suffixed by _dt (although personally I find that a bit redundant)
  • content_body: the two words seem redundant to me. What about just content?
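For a compact view, the proposals above can be summarized as an old-to-new mapping. This is a sketch for reference only; where the shipped table diverged from the proposal, a comment notes the name it actually kept (per the DDL later in this task):

```python
# Sketch of the column renames proposed above (old -> new).
# Comments note where the final table kept the original name.
PROPOSED_RENAMES = {
    "revision_timestamp": "revision_dt",
    "error_timestamp": "error_dt",                   # errors column was later dropped entirely
    "wiki_db": "wiki_id",
    "revision_is_minor_edit": "revision_is_minor",   # final table kept revision_is_minor_edit
    "page_redirect_title": "page_redirect_target",
    "revision_size": "revision_size_bytes",          # final table kept revision_size
    "content_size": "content_size_bytes",            # final table kept content_size
    "content_body": "content",                       # final table kept content_body
}
```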

Regarding the table name, as per T366542#10150586, we have decided to rename it to:
wmf_content.mediawiki_content_history_v1

Regarding the reconcile event streams, as per T366542#10299725 and Slack, we have decided to rename them to:

mediawiki_content_history_reconcile
and
mediawiki_content_history_reconcile_enriched

Details

Related Changes in GitLab:
Title (Reference, Author, Source Branch -> Dest Branch):
  • Rename all 'Dumps 2.0' DAGs to 'MediaWiki Content' DAGs (repos/data-engineering/airflow-dags!1022, xcollazo, the-big-rename-pt-2 -> main)
  • Rename columns to their final names. (repos/data-engineering/mediawiki-content-pipelines!57, xcollazo, the-big-rename-pt-3 -> main)
  • Remove jobs for backfilling from Dumps 1.0. (repos/data-engineering/mediawiki-content-pipelines!56, xcollazo, the-big-rename-pt-2 -> main)
  • Table and column rename to follow precedent and modeling guidelines. (repos/data-engineering/mediawiki-content-pipelines!55, xcollazo, the-big-rename -> main)

Event Timeline

We should probably add something to data modeling guidelines about making choices between consistency with existing fields, and best practices for new fields.

In MediaWiki state change events, in some cases we decided to stick closely with the MediaWiki db field names, in others we decided to go with newer guidelines.

Generally, I'd say if renaming an existing field will cause significantly more confusion and misunderstanding than keeping it the same, we should keep it the same and document the Modeling Guidelines violation.

While you are at this, consider looking at the event entity fragment schemas we made a year or so ago.

https://schema.wikimedia.org/#!/primary/jsonschema/fragment/mediawiki/state/entity

It would be nice if these could be close, since we did a lot of work bikeshedding names there.

We'd probably like to create a revision change stream one day too, to model things like visibility changes to past revisions. Since dumps is about revisions, it would be nice if whatever we come up with here aligns to what we might do in the future.

decided to stick closely with the MediaWiki db field names

E.g. we went with rev_id and rev_dt instead of revision_id and revision_dt. You can see it's a mix: rev_id is a foreign key to the revision table, so revision_id would be better. We decided that our _dt convention was important, so we did not keep _timestamp. But, for field name consistency, we kept the rev_ part, and got rev_dt. ¯\_(ツ)_/¯

xcollazo renamed this task from Rename columns and/or table to abide by the data modeling guidelines to Consider renaming columns and/or table to abide by the data modeling guidelines.Jun 28 2024, 3:53 PM

Slack thread about this table name.

This table contains mediawiki revision content for multiple content models, not just wikitext.

Is mediawiki_revision_content a good name? Or something like that?

In a meeting yesterday with the Dumps 2.0 team, we agreed that the table name will be:

wmf_dumps.mediawiki_content_history

While a future table with just the current revision for each wiki page will be named:
wmf_dumps.mediawiki_content_current

Thank you!

wmf_dumps.mediawiki_content_current

FWIW, we'd probably bike shed this decision a bit more, buuuuuut wmf_dumps.mediawiki_content_history absolutely! Let's go.

In a meeting yesterday with the Dumps 2.0 team, we agreed that the table name will be:

wmf_dumps.mediawiki_content_history

Just kidding! Name will be:

wmf_dumps.mediawiki_content_history_v1

Regarding the reconcile event streams, we have decided to rename them to:

mediawiki_content_history_reconcile
and
mediawiki_content_history_reconcile_enriched

From MR 55:

In this MR we do the first step towards renaming wmf_dumps.wikitext_raw_rc2 to its final home at wmf_content.mediawiki_content_history_v1. We will rename the file itself separately to keep proper context.

  • Rename columns to better abide by the data modeling guidelines, but also follow the precedent set with the work done for the event entity fragment schemas.
  • Removes the errors column. This never worked as designed, and it would confuse end users while providing no operational benefit, so we are removing it. Table wmf_content.inconsistent_rows_of_mediawiki_content_history_v1 is better suited to track data quality issues.
  • Removes the bloom filter on revision_id. This also did not work as intended. See https://phabricator.wikimedia.org/T375402 if interested in that saga.

@Ottomata for your review.

MR 56:

This MR removes the jobs backfill_create_intermediate_table.py and backfill_merge_into.py, as they are no longer needed, and it will be easier to rename columns elsewhere without worrying about keeping these correct for no benefit.

xcollazo changed the task status from Open to In Progress.Jan 9 2025, 9:09 PM
xcollazo claimed this task.
xcollazo triaged this task as High priority.
xcollazo moved this task from Sprint Backlog to In Process on the Dumps 2.0 (Kanban Board) board.

Ran the following as per latest DDL from MR 57:

hostname -f
an-launcher1002.eqiad.wmnet

sudo -u analytics bash
kerberos-run-command analytics spark3-sql

CREATE TABLE wmf_content.mediawiki_content_history_v1 (
    page_id                     BIGINT    COMMENT 'The (database) page ID of the page.',
    page_namespace_id           INT       COMMENT 'The id of the namespace this page belongs to.',
    page_title                  STRING    COMMENT 'The normalized title of the page. If page_namespace_id = 0, then this is the non namespaced title. If page_namespace_id != 0, then the title is prepended with the localized namespace. Examples for "enwiki": "Main_Page" and "Talk:Main_Page".',
    page_redirect_target        STRING    COMMENT 'title of the redirected-to page, if any. Same rules as page_title.',
    user_id                     BIGINT    COMMENT 'id of the user that made the revision; null if anonymous, zero if old system user, and -1 when deleted or malformed XML was imported',
    user_text                   STRING    COMMENT 'text of the user that made the revision (either username or IP)',
    user_is_visible             BOOLEAN   COMMENT 'Whether the user that made the revision is visible. If this is false, then the user should be redacted when shown publicly. See RevisionRecord->DELETED_USER.',
    revision_id                 BIGINT    COMMENT 'The (database) revision ID.',
    revision_parent_id          BIGINT    COMMENT 'The (database) revision ID of the parent revision.',
    revision_dt                 TIMESTAMP COMMENT 'The (database) time this revision was created. This is rev_timestamp in the MediaWiki database.',
    revision_is_minor_edit      BOOLEAN   COMMENT 'True if the editor marked this revision as a minor edit.',
    revision_comment            STRING    COMMENT 'The comment left by the user when this revision was made.',
    revision_comment_is_visible BOOLEAN   COMMENT 'Whether the comment of the revision is visible. If this is false, then the comment should be redacted when shown publicly. See RevisionRecord->DELETED_COMMENT.',
    revision_sha1               STRING    COMMENT 'Nested SHA1 hash of hashes of all content slots. See https://www.mediawiki.org/wiki/Manual:Revision_table#rev_sha1',
    revision_size               BIGINT    COMMENT 'the sum of the content_size of all content slots',
    revision_content_slots      MAP<
                                    STRING,
                                    STRUCT<content_body:   STRING,
                                           content_format: STRING,
                                           content_model:  STRING,
                                           content_sha1:   STRING,
                                           content_size:   BIGINT
                                    >
                                >         COMMENT 'a MAP containing all the content slots associated to this revision. Typically just the "main" slot, but also "mediainfo" for commonswiki.',
    revision_content_is_visible BOOLEAN   COMMENT 'Whether revision_content_slots is visible. If this is false, then any content should be redacted when shown publicly. See RevisionRecord->DELETED_TEXT.',
    wiki_id                     STRING    COMMENT 'The wiki ID, which is usually the same as the MediaWiki database name. E.g. enwiki, metawiki, etc.',
    row_content_update_dt       TIMESTAMP COMMENT 'Control column. Marks the timestamp of the last content event or backfill that updated this row',
    row_visibility_update_dt    TIMESTAMP COMMENT 'Control column. Marks the timestamp of the last visibility event or backfill that updated this row',
    row_move_update_dt          TIMESTAMP COMMENT 'Control column. Marks the timestamp of the last move event or backfill that updated this row'
)
USING ICEBERG
PARTITIONED BY (wiki_id)                             -- wiki_id partitioning is familiar to users
TBLPROPERTIES (
    'format-version' = '2',                          -- allow merge-on-read
    'write.format.default' = 'parquet',              -- parquet is currently the only format with min/max stats
    'write.target-file-size-bytes' = '134217728',    -- cap files at 128MB so executors with 1 core, 16GB RAM can read the entire table
    'write.metadata.previous-versions-max' = '10',
    'write.metadata.delete-after-commit.enabled' = 'true'
)
COMMENT 'Contains all of the revisions for all pages for all wikis. Updated on a daily basis.'
LOCATION '/wmf/data/wmf_content/mediawiki_content_history_v1';

ALTER TABLE wmf_content.mediawiki_content_history_v1 WRITE ORDERED BY wiki_id, page_id, revision_dt;

Now running the following to backfill all wikis except the usual suspects:

hostname -f
an-launcher1002.eqiad.wmnet

screen -S mw-content-history-backfill

sudo -u analytics bash

kerberos-run-command analytics spark3-sql --driver-cores 8 --driver-memory 32G --master yarn --conf spark.dynamicAllocation.maxExecutors=128 --conf spark.executor.memoryOverhead=3G --conf spark.sql.shuffle.partitions=2048 --executor-memory 16G --executor-cores 2 --name xcollazo-mw-content-history-backfill


INSERT INTO wmf_content.mediawiki_content_history_v1
SELECT
  page_id,
  page_namespace AS page_namespace_id,
  page_title,
  page_redirect_title AS page_redirect_target,
  user_id,
  user_text,
  user_is_visible,
  revision_id,
  revision_parent_id,
  revision_timestamp AS revision_dt,
  revision_is_minor_edit,
  revision_comment,
  revision_comment_is_visible,
  revision_sha1,
  revision_size,
  revision_content_slots,
  revision_content_is_visible,
  wiki_db AS wiki_id,
  row_last_update AS row_content_update_dt,
  row_visibility_last_update AS row_visibility_update_dt,
  row_move_last_update AS row_move_update_dt
FROM
  wmf_dumps.wikitext_raw_rc2
WHERE
  wiki_db NOT IN ('enwiki', 'commonswiki', 'wikidatawiki')
ORDER BY wiki_db, page_id, revision_timestamp
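This same INSERT is rerun below for wikidatawiki, enwiki, and commonswiki, with only the WHERE clause changing. A hypothetical sketch of templating that clause (the helper name is mine, not from the pipeline code):

```python
# Hypothetical sketch: the backfill INSERTs in this task are identical
# except for their WHERE clause, so the clause can be templated per wiki group.
def backfill_where(wikis, exclude=False):
    """Build the WHERE clause for a per-wiki-group backfill INSERT."""
    op = "NOT IN" if exclude else "IN"
    quoted = ", ".join(f"'{w}'" for w in sorted(wikis))
    return f"wiki_db {op} ({quoted})"

# First run: everything except the three biggest wikis.
print(backfill_where(["enwiki", "commonswiki", "wikidatawiki"], exclude=True))
# -> wiki_db NOT IN ('commonswiki', 'enwiki', 'wikidatawiki')
```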

Yarn app: https://yarn.wikimedia.org/proxy/application_1734703658237_837267/

Job failed. Reattempting with:

kerberos-run-command analytics spark3-sql --driver-cores 8 --driver-memory 32G --master yarn --conf spark.dynamicAllocation.maxExecutors=160 --conf spark.executor.memoryOverhead=3G --conf spark.sql.shuffle.partitions=10240 --executor-memory 16G --executor-cores 1 --name xcollazo-mw-content-history-backfill

Yarn app: https://yarn.wikimedia.org/proxy/application_1734703658237_839578/


Done:

Response code
Time taken: 18165.134 seconds

That's 5 hours.

Now running wikidatawiki:

INSERT INTO wmf_content.mediawiki_content_history_v1
SELECT
  page_id,
  page_namespace AS page_namespace_id,
  page_title,
  page_redirect_title AS page_redirect_target,
  user_id,
  user_text,
  user_is_visible,
  revision_id,
  revision_parent_id,
  revision_timestamp AS revision_dt,
  revision_is_minor_edit,
  revision_comment,
  revision_comment_is_visible,
  revision_sha1,
  revision_size,
  revision_content_slots,
  revision_content_is_visible,
  wiki_db AS wiki_id,
  row_last_update AS row_content_update_dt,
  row_visibility_last_update AS row_visibility_update_dt,
  row_move_last_update AS row_move_update_dt
FROM
  wmf_dumps.wikitext_raw_rc2
WHERE
  wiki_db IN ('wikidatawiki')
ORDER BY wiki_db, page_id, revision_timestamp

Same yarn app.


Done:

Response code
Time taken: 12883.735 seconds

That's 3.6 hours.

Now running enwiki:

INSERT INTO wmf_content.mediawiki_content_history_v1
SELECT
  page_id,
  page_namespace AS page_namespace_id,
  page_title,
  page_redirect_title AS page_redirect_target,
  user_id,
  user_text,
  user_is_visible,
  revision_id,
  revision_parent_id,
  revision_timestamp AS revision_dt,
  revision_is_minor_edit,
  revision_comment,
  revision_comment_is_visible,
  revision_sha1,
  revision_size,
  revision_content_slots,
  revision_content_is_visible,
  wiki_db AS wiki_id,
  row_last_update AS row_content_update_dt,
  row_visibility_last_update AS row_visibility_update_dt,
  row_move_last_update AS row_move_update_dt
FROM
  wmf_dumps.wikitext_raw_rc2
WHERE
  wiki_db IN ('enwiki')
ORDER BY wiki_db, page_id, revision_timestamp

Same yarn app.


Done:

Response code
Time taken: 12420.584 seconds

That's 3.5 hours.

OK, for backfilling only commonswiki remains, but I will wait until T382953 is done, as I don't want to reconcile all of that again unnecessarily.

Now running commonswiki:

INSERT INTO wmf_content.mediawiki_content_history_v1
SELECT
  page_id,
  page_namespace AS page_namespace_id,
  page_title,
  page_redirect_title AS page_redirect_target,
  user_id,
  user_text,
  user_is_visible,
  revision_id,
  revision_parent_id,
  revision_timestamp AS revision_dt,
  revision_is_minor_edit,
  revision_comment,
  revision_comment_is_visible,
  revision_sha1,
  revision_size,
  revision_content_slots,
  revision_content_is_visible,
  wiki_db AS wiki_id,
  row_last_update AS row_content_update_dt,
  row_visibility_last_update AS row_visibility_update_dt,
  row_move_last_update AS row_move_update_dt
FROM
  wmf_dumps.wikitext_raw_rc2
WHERE
  wiki_db IN ('commonswiki')
ORDER BY wiki_db, page_id, revision_timestamp

Same Yarn app: https://yarn.wikimedia.org/proxy/application_1734703658237_839578


Done:

Response code
Time taken: 1316.923 seconds

That's 22 mins.
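For reference, a quick sanity computation totaling the backfill wall-clock times reported above:

```python
# Per-group backfill durations reported above, in seconds.
durations_s = {
    "all wikis except the big three": 18165.134,
    "wikidatawiki": 12883.735,
    "enwiki": 12420.584,
    "commonswiki": 1316.923,
}
total_h = sum(durations_s.values()) / 3600
print(round(total_h, 1))  # -> 12.4 (total hours of wall-clock SQL time)
```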

xcollazo updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1022

Rename all 'Dumps 2.0' DAGs to 'MediaWiki Content' DAGs

This is now merged; I had forgotten to tag it with the phab number.

Mentioned in SAL (#wikimedia-operations) [2025-01-17T17:11:51Z] <xcollazo@deploy2002> Started deploy [airflow-dags/analytics@b0cd4df]: Deploy latest DAGs for 'analytics' Airflow instance. T366542.

Mentioned in SAL (#wikimedia-operations) [2025-01-17T17:12:24Z] <xcollazo@deploy2002> Finished deploy [airflow-dags/analytics@b0cd4df]: Deploy latest DAGs for 'analytics' Airflow instance. T366542. (duration: 00m 32s)

Mentioned in SAL (#wikimedia-analytics) [2025-01-17T17:13:38Z] <xcollazo> Deployed latest DAGs for 'analytics' Airflow instance. T366542.

There are schema changes to be applied for inconsistent_rows_of_mediawiki_content_history_v1. While we could run multiple ALTERs, we are still running Iceberg 1.2.1 in production, which has some bugs, and the data is small. Thus it seems easier to copy to a temp table and then recreate.
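The copy-and-recreate approach reduces to a fixed sequence of statements. A hypothetical sketch of the order of operations (the helper name and the commented placeholders are mine; the literal commands run are shown below):

```python
# Illustrative only: the temp-table swap used for this schema change,
# expressed as an ordered plan of SQL statements.
def swap_plan(table, temp):
    return [
        f"CREATE TABLE {temp} (/* new schema */)",
        f"INSERT INTO {temp} SELECT /* renamed columns */ FROM {table}",
        f"DROP TABLE {table}",                         # plus hdfs dfs -rm -r of its location
        f"CREATE TABLE {table} (/* new schema */)",
        f"INSERT INTO {table} SELECT * FROM {temp}",
        f"DROP TABLE {temp}",                          # plus hdfs dfs -rm -r of its location
    ]

plan = swap_plan(
    "wmf_content.inconsistent_rows_of_mediawiki_content_history_v1",
    "wmf_content.inconsistent_rows_of_mediawiki_content_history_v2",
)
```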

Get a session:

ssh an-launcher1002.eqiad.wmnet
sudo -u analytics bash

For SQL, I used:

kerberos-run-command analytics spark3-sql --driver-cores 8 --driver-memory 32G --master yarn --conf spark.dynamicAllocation.maxExecutors=128 --conf spark.executor.memoryOverhead=3G --conf spark.sql.shuffle.partitions=2048 --executor-memory 16G --executor-cores 2

Create temp v2 table:

CREATE TABLE wmf_content.inconsistent_rows_of_mediawiki_content_history_v2 (
    wiki_id                        STRING              COMMENT 'The wiki ID, which is usually the same as the MediaWiki database name. E.g. enwiki, metawiki, etc.',
    page_id                        BIGINT              COMMENT 'The (database) page ID of the page.',
    revision_id                    BIGINT              COMMENT 'The (database) revision ID.',
    revision_dt                    TIMESTAMP           COMMENT 'The (database) time this revision was created. This is rev_timestamp in the MediaWiki database.',
    reasons                        ARRAY<STRING>       COMMENT 'The set of reasons detected that make us think we need to reconcile this revision.',
    computation_dt                 TIMESTAMP           COMMENT 'The logical time at which this inconsistency was calculated. Useful to see trends over time, and also to be able to delete data efficiently.',
    computation_class              STRING              COMMENT 'One of "last-24h" or "all-of-wiki-time". This segregates between runs that cover one day of inconsistencies as of computation_dt, versus runs that retroactively check all revisions as of computation_dt.',
    reconcile_emit_dt              TIMESTAMP           COMMENT 'The time at which this inconsistency was emitted for eventual reconcile. If NULL, it has not been submitted yet.'
)
USING ICEBERG
PARTITIONED BY (wiki_id, computation_class)
TBLPROPERTIES (
    'format-version' = '2',                          -- allow merge-on-read if needed
    'write.format.default' = 'parquet',              -- parquet is currently the only format with min/max stats
    'write.target-file-size-bytes' = '134217728',    -- cap files at 128MB
    'commit.retry.num-retries' = '10'                -- bump retries from default of 4 due to many concurrent INSERTs
)
COMMENT 'We make checks between wmf_content.mediawiki_content_history_v1 and the Analytics replicas to detect inconsistent rows. If we do detect any, we add them here, to be reconciled, alerted, and analyzed.'
LOCATION '/wmf/data/wmf_content/inconsistent_rows_of_mediawiki_content_history_v2';


ALTER TABLE wmf_content.inconsistent_rows_of_mediawiki_content_history_v2 WRITE ORDERED BY wiki_id, computation_class, computation_dt, revision_dt;

Copy all data to it:

INSERT INTO wmf_content.inconsistent_rows_of_mediawiki_content_history_v2
SELECT
  wiki_db AS wiki_id,
  page_id,
  revision_id,
  revision_timestamp AS revision_dt,
  reasons,
  computation_dt,
  computation_class,
  reconcile_emit_dt
FROM wmf_content.inconsistent_rows_of_mediawiki_content_history_v1
ORDER BY wiki_db, computation_class, computation_dt, revision_timestamp;

Verify copy looks ok:

spark-sql (default)> select count(1) from wmf_content.inconsistent_rows_of_mediawiki_content_history_v2;
count(1)
826721216
Time taken: 12.314 seconds, Fetched 1 row(s)
spark-sql (default)> select count(1) from wmf_content.inconsistent_rows_of_mediawiki_content_history_v1;
count(1)
826721216
Time taken: 269.214 seconds, Fetched 1 row(s)

Now let's nuke and recreate inconsistent_rows_of_mediawiki_content_history_v1:

DROP TABLE wmf_content.inconsistent_rows_of_mediawiki_content_history_v1;

hdfs dfs -rmr /wmf/data/wmf_content/inconsistent_rows_of_mediawiki_content_history_v1
rmr: DEPRECATED: Please use '-rm -r' instead.
25/01/17 18:10:22 INFO fs.TrashPolicyDefault: Moved: 'hdfs://analytics-hadoop/wmf/data/wmf_content/inconsistent_rows_of_mediawiki_content_history_v1' to trash at: hdfs://analytics-hadoop/user/analytics/.Trash/Current/wmf/data/wmf_content/inconsistent_rows_of_mediawiki_content_history_v1

Now recreate and copy data back:

CREATE TABLE wmf_content.inconsistent_rows_of_mediawiki_content_history_v1 (
    wiki_id                        STRING              COMMENT 'The wiki ID, which is usually the same as the MediaWiki database name. E.g. enwiki, metawiki, etc.',
    page_id                        BIGINT              COMMENT 'The (database) page ID of the page.',
    revision_id                    BIGINT              COMMENT 'The (database) revision ID.',
    revision_dt                    TIMESTAMP           COMMENT 'The (database) time this revision was created. This is rev_timestamp in the MediaWiki database.',
    reasons                        ARRAY<STRING>       COMMENT 'The set of reasons detected that make us think we need to reconcile this revision.',
    computation_dt                 TIMESTAMP           COMMENT 'The logical time at which this inconsistency was calculated. Useful to see trends over time, and also to be able to delete data efficiently.',
    computation_class              STRING              COMMENT 'One of "last-24h" or "all-of-wiki-time". This segregates between runs that cover one day of inconsistencies as of computation_dt, versus runs that retroactively check all revisions as of computation_dt.',
    reconcile_emit_dt              TIMESTAMP           COMMENT 'The time at which this inconsistency was emitted for eventual reconcile. If NULL, it has not been submitted yet.'
)
USING ICEBERG
PARTITIONED BY (wiki_id, computation_class)
TBLPROPERTIES (
    'format-version' = '2',                          -- allow merge-on-read if needed
    'write.format.default' = 'parquet',              -- parquet is currently the only format with min/max stats
    'write.target-file-size-bytes' = '134217728',    -- cap files at 128MB
    'commit.retry.num-retries' = '10'                -- bump retries from default of 4 due to many concurrent INSERTs
)
COMMENT 'We make checks between wmf_content.mediawiki_content_history_v1 and the Analytics replicas to detect inconsistent rows. If we do detect any, we add them here, to be reconciled, alerted, and analyzed.'
LOCATION '/wmf/data/wmf_content/inconsistent_rows_of_mediawiki_content_history_v1';


ALTER TABLE wmf_content.inconsistent_rows_of_mediawiki_content_history_v1 WRITE ORDERED BY wiki_id, computation_class, computation_dt, revision_dt;

INSERT INTO wmf_content.inconsistent_rows_of_mediawiki_content_history_v1
SELECT
  *
FROM wmf_content.inconsistent_rows_of_mediawiki_content_history_v2
ORDER BY wiki_id, computation_class, computation_dt, revision_dt;

spark-sql (default)> select count(1) from wmf_content.inconsistent_rows_of_mediawiki_content_history_v1;
count(1)
826721216
Time taken: 15.62 seconds, Fetched 1 row(s)

Finally, let's nuke the temp table:

DROP TABLE wmf_content.inconsistent_rows_of_mediawiki_content_history_v2;

analytics@an-launcher1002:/home/xcollazo$ hdfs dfs -rmr /wmf/data/wmf_content/inconsistent_rows_of_mediawiki_content_history_v2
rmr: DEPRECATED: Please use '-rm -r' instead.
25/01/17 18:18:17 INFO fs.TrashPolicyDefault: Moved: 'hdfs://analytics-hadoop/wmf/data/wmf_content/inconsistent_rows_of_mediawiki_content_history_v2' to trash at: hdfs://analytics-hadoop/user/analytics/.Trash/Current/wmf/data/wmf_content/inconsistent_rows_of_mediawiki_content_history_v2

Looks like production is happily catching up: https://airflow-analytics.wikimedia.org/home?tags=mediawiki_content.

I think we can close this task, but I am writing down a few cleanup tasks that perhaps we can do elsewhere:

Moved these tasks to T358375.