
Consult with Product and Research team on schema and data retention expectations for wmf_dumps.wikitext_raw
Closed, ResolvedPublic3 Estimated Story Points

Description

For Dumps 2.0, we are generating an intermediate table, wmf_dumps.wikitext_raw, that has intrinsic value other than as a stepping stone.

This table will be, effectively, a more up to date version of the existing wmf.mediawiki_wikitext_history table. We intend to have this intermediate table updated every hour, while the existing wmf.mediawiki_wikitext_history table is only updated once per month. This intermediate table thus has the potential to accelerate existing data pipelines, as in the issue discussed in T357859.

So far, though, the schema, data quality, and availability for this table have only been discussed within the Data Platform Team.

In this task we should discuss the following with other internal teams:

  • Double check schema is sufficient to replace mediawiki_wikitext_history. Schema is sufficient.
  • What kind of data retention they are looking for. Is the ability to time travel over 90 days enough? 90 days is good.
  • Although the target is for data to be ingested to this table on an hourly basis, for a myriad of reasons, this table can drift to be several hours away from MediaWiki production. What kind of data availability and data visibility do they need? Is T354761 and T357684 enough? It looks like the proposed DQ checks are good.

Related Objects

Event Timeline

@xcollazo have time to pick this up this week?

Let's do it!

Doing some investigations on the main issues defined in description so that we have a base to discuss on:

Double check schema is sufficient to replace mediawiki_wikitext_history

Here is wmf.mediawiki_wikitext_history's schema:

presto:wmf> describe wmf.mediawiki_wikitext_history;

         Column          |      Type      |     Extra     |                                                               Comment                                                           >
-------------------------+----------------+---------------+--------------------------------------------------------------------------------------------------------------------------------->
 page_id                 | bigint         |               | id of the page                                                                                                                  >
 page_namespace          | integer        |               | namespace of the page                                                                                                           >
 page_title              | varchar        |               | title of the page                                                                                                               >
 page_redirect_title     | varchar        |               | title of the redirected-to page                                                                                                 >
 page_restrictions       | array(varchar) |               | restrictions of the page                                                                                                        >
 user_id                 | bigint         |               | id of the user that made the revision; null if anonymous, zero if old system user, and -1 when deleted or malformed XML was impo>
 user_text               | varchar        |               | text of the user that made the revision (either username or IP)                                                                 >
 revision_id             | bigint         |               | id of the revision                                                                                                              >
 revision_parent_id      | bigint         |               | id of the parent revision, null when this is the first revision in the chain                                                    >
 revision_timestamp      | varchar        |               | timestamp of the revision (ISO8601 format)                                                                                      >
 revision_minor_edit     | boolean        |               | whether this revision is a minor edit or not                                                                                    >
 revision_comment        | varchar        |               | Comment made with revision                                                                                                      >
 revision_text_bytes     | bigint         |               | bytes number of the revision text                                                                                               >
 revision_text_sha1      | varchar        |               | sha1 hash of the revision text                                                                                                  >
 revision_text           | varchar        |               | text of the revision                                                                                                            >
 revision_content_model  | varchar        |               | content model of the revision                                                                                                   >
 revision_content_format | varchar        |               | content format of the revision                                                                                                  >
 user_is_visible         | boolean        |               | true if this revision has not had its user deleted via rev_deleted                                                              >
 comment_is_visible      | boolean        |               | true if this revision has not had its comment deleted via rev_deleted                                                           >
 content_is_visible      | boolean        |               | true if this revision has not had its text content deleted via rev_deleted                                                      >
 snapshot                | varchar        | partition key | Versioning information to keep multiple datasets (YYYY-MM for regular imports)                                                  >
 wiki_db                 | varchar        | partition key | The wiki_db project                                                                                                             >
(22 rows)

And wmf_dumps.wikitext_raw_rc2's:

describe wmf_dumps.wikitext_raw_rc2;

           Column            |                                                                            Type                                                                            | >
-----------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+->
 page_id                     | bigint                                                                                                                                                     | >
 page_namespace              | integer                                                                                                                                                    | >
 page_title                  | varchar                                                                                                                                                    | >
 page_redirect_title         | varchar                                                                                                                                                    | >
 user_id                     | bigint                                                                                                                                                     | >
 user_text                   | varchar                                                                                                                                                    | >
 user_is_visible             | boolean                                                                                                                                                    | >
 revision_id                 | bigint                                                                                                                                                     | >
 revision_parent_id          | bigint                                                                                                                                                     | >
 revision_timestamp          | timestamp                                                                                                                                                  | >
 revision_is_minor_edit      | boolean                                                                                                                                                    | >
 revision_comment            | varchar                                                                                                                                                    | >
 revision_comment_is_visible | boolean                                                                                                                                                    | >
 revision_sha1               | varchar                                                                                                                                                    | >
 revision_size               | bigint                                                                                                                                                     | >
 revision_content_slots      | map(varchar, row("content_body" varchar, "content_format" varchar, "content_model" varchar, "content_sha1" varchar, "content_size" bigint))                | >
 revision_content_is_visible | boolean                                                                                                                                                    | >
 wiki_db                     | varchar                                                                                                                                                    | >
 errors                      | array(row("error_upstream_id" varchar, "error_upstream_timestamp" timestamp, "error_producer" varchar, "error_message" varchar, "error_severity" varchar)) | >
 row_last_update             | timestamp                                                                                                                                                  | >
 row_visibility_last_update  | timestamp                                                                                                                                                  | >
(21 rows)

The only column that mediawiki_wikitext_history has that wikitext_raw_rc2 does not is page_restrictions. But this column seems to not have any content:

select count(1) as count, wiki_db
from (
    select wiki_db
    from wmf.mediawiki_wikitext_history
    where snapshot = '2024-03' and cardinality(page_restrictions) > 0
)
group by wiki_db;


count	wiki_db
Time taken: 3079.22 seconds

Additionally, mediawiki_wikitext_history only includes the main revision. That is, it does not support Multi-Content Revisions (https://www.mediawiki.org/wiki/Multi-Content_Revisions), while wikitext_raw_rc2 does. This fact makes getting the main content a bit involved:

In mediawiki_wikitext_history you'd do:

SELECT revision_text
FROM wmf.mediawiki_wikitext_history
WHERE snapshot = '2024-03'
  AND wiki_db = 'enwiki'
LIMIT 1
;

While in wikitext_raw_rc2, if you're only interested in the main slot, you have to do:

SELECT revision_content_slots['main'].content_body AS revision_text
FROM wmf_dumps.wikitext_raw_rc2
WHERE wiki_db = 'enwiki'
LIMIT 1
;

Which is more verbose, granted, but it exposes MCR, which opens up new use cases.

What kind of data retention they are looking for. Is the ability to time travel over 90 days enough?

wmf.mediawiki_wikitext_history is partitioned, in the Hive sense, by snapshot and wiki_db. The snapshots are typically kept for 60 days. Thus, a user of this table that queries it on the day a particular snapshot is made available can expect to be able to reproduce their query over the next ~60 days.

wmf_dumps.wikitext_raw_rc2 is different in that there is no longer a snapshot partition column. This mechanism no longer makes sense as this table is updated every hour. However, a user may want to be able to reproduce a particular query result, and Iceberg does provide a time travel mechanism for this. This mechanism, on both Presto and Spark, looks roughly like so:

SELECT * FROM wmf_dumps.wikitext_raw_rc2 TIMESTAMP AS OF '2024-01-01 00:00:00';

The above query will return results as if the table had been queried at that specific TIMESTAMP.

This guarantees exact result sets on separate runs as long as we keep the underlying (Iceberg) snapshots as of the TIMESTAMP specified. This is why we suggest we may want to keep Iceberg snapshots for, say, 90 days.
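To make the proposed retention policy concrete, here is a toy illustration in plain Python (hypothetical snapshot commit times, not real table state) of what a snapshot-expiry job would keep versus drop under a 90-day window:

```python
from datetime import datetime, timedelta

def snapshots_to_expire(snapshot_times, now, retention_days=90):
    """Given Iceberg-style snapshot commit times, return those older
    than the retention window (the ones an expiry job would drop).
    Queries can still time travel to any retained snapshot."""
    cutoff = now - timedelta(days=retention_days)
    return [t for t in snapshot_times if t < cutoff]

now = datetime(2024, 6, 1)
snaps = [datetime(2024, 1, 15), datetime(2024, 4, 1), datetime(2024, 5, 30)]
expired = snapshots_to_expire(snaps, now)  # only the January snapshot
assert expired == [datetime(2024, 1, 15)]
```

In other words, a query pinned to any timestamp within the last ~90 days stays reproducible; older pins stop working once their snapshots are expired.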

However:

Our current production version of Presto (0.283) unfortunately does not support this mechanism. It is available as of Presto 0.286.

Our current production version of Spark (3.1.2) unfortunately does not support this mechanism either. It is available as of Spark 3.2+. (Note that we do have Spark 3.3 and 3.4 available for use, but to "enable" Spark 3.3+, a user needs to create a custom conda environment (see T344910#9331963). This is not ideal for a non-expert user.)

Considering that wmf_dumps.wikitext_raw_rc2 is partitioned by revision_timestamp, you can efficiently pseudo time travel by using a ceiling filter on that column (i.e. revision_timestamp <= '2024-01-01 00:00:00'). However, this is not semantically the same: data can and will arrive late, so results can change between runs. It is thus not equivalent to the snapshot partitioning of wmf.mediawiki_wikitext_history.
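A toy illustration in plain Python (hypothetical rows, just to show the failure mode) of why the ceiling filter is not reproducible: a late-arriving revision whose revision_timestamp falls before the ceiling changes the result of the same query on a later run.

```python
from datetime import datetime

# Toy rows standing in for wmf_dumps.wikitext_raw_rc2 records.
table = [
    {"revision_id": 1, "revision_timestamp": datetime(2023, 12, 30)},
    {"revision_id": 2, "revision_timestamp": datetime(2023, 12, 31)},
]

def pseudo_time_travel(rows, ceiling):
    """Filter by revision_timestamp: the 'pseudo time travel' approach."""
    return sorted(r["revision_id"] for r in rows
                  if r["revision_timestamp"] <= ceiling)

ceiling = datetime(2024, 1, 1)
first_run = pseudo_time_travel(table, ceiling)   # [1, 2]

# A late-arriving revision with a timestamp *before* the ceiling
# lands in the table after our first query ran:
table.append({"revision_id": 3, "revision_timestamp": datetime(2023, 12, 29)})

second_run = pseudo_time_travel(table, ceiling)  # [1, 2, 3]

# Same filter, different results. Iceberg time travel, by contrast,
# pins the query to a physical table snapshot.
assert first_run != second_run
```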

My conclusion here is that, if folks would be interested in this repeatability, we'd have to bump both Presto and Spark, and bumps are typically expensive in terms of dev time. If this is not immediately interesting, then we could just get the benefit eventually when we bump for other reasons.

Although the target is for data to be ingested to this table on an hourly basis, for a myriad of reasons, this table can drift to be several hours away from MediaWiki production. What kind of data availability and data visibility do they need? Is T354761 and T357684 enough?

T354761 proposed and implemented the following data drift checks:

  • Calculate the last N revisions from a MariaDB replica (say, enwiki) that have had their visibility suppressed. Check on data lake table whether these suppressions are reflected, and print a summary (example: 99.999% match).
  • Calculate the last N revisions from a MariaDB replica. Check on data lake table whether these revisions' sha1 and length match. Print a summary.
  • Calculate the revision count of the last N page_ids that have been recently revised. Check on data lake table whether the revision count matches. Print a summary.
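The summary step shared by these checks can be sketched in plain Python (function and field names are hypothetical, not the actual check implementation): compare the last N revisions sampled from MariaDB against the data lake copy and report the match percentage.

```python
def drift_summary(source_rows, lake_rows):
    """Compare revisions keyed on revision_id and return the fraction
    that match. Both inputs map revision_id -> (sha1, size); the same
    shape works for visibility flags or per-page revision counts."""
    if not source_rows:
        return 1.0
    matches = sum(
        1 for rev_id, expected in source_rows.items()
        if lake_rows.get(rev_id) == expected
    )
    return matches / len(source_rows)

# Hypothetical sample: 3 of the 4 most recent revisions match.
source = {1: ("abc", 10), 2: ("def", 20), 3: ("ghi", 30), 4: ("jkl", 40)}
lake   = {1: ("abc", 10), 2: ("def", 20), 3: ("XXX", 30), 4: ("jkl", 40)}
print(f"{drift_summary(source, lake):.3%} match")  # 75.000% match
```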

T357684 will use Data Engineering's Data Quality Instrumentation stack to make these metrics available to all users via Superset (example for webrequest here), and to alert the Data Product team if the above checks stray from a certain threshold (TBD, but something around 99%). The idea for these metrics is to give users of wmf_dumps.wikitext_raw confidence that the revision history is complete and accurate.

So the question is: Will these be reasonable checks for folks to feel confident to use this table in production use cases?

  • Availability: Research mostly treats the current history dumps as a pseudo-incremental dataset, i.e. pipelines that depend on the history wait for a new snapshot to be released and then use only the "new" data from that snapshot (i.e. the revisions created in the month since the last snapshot was generated). This means that wmf_dumps.wikitext_raw allows us to significantly reduce latency: roughly from 1 month (waiting for the snapshot interval to trigger) + 12 days (dump processing) down to a few hours.
  • Schema: As the schemas are almost identical, my main question is about extending the existing dataset in ways that depend on the snapshot mechanism. For example research has a number of use cases that involve comparing the revision text with the parent revision text. This involves a computationally expensive self join and some pitfalls, so there is a wikidiff job that creates (yet another) version of the wikitext history that includes a column with the unified diff between the current and parent revision.
    • Could we add the diff to the proposed wmf_dumps.wikitext_raw? As the parent revision could be at any point in the past, this would likely involve the equivalent to the wmf.mediawiki_wikitext_current available when new revisions are ingested into the dataset.
    • More generally, what is the replacement for the wmf.mediawiki_wikitext_current?
  • Data quality: the discussion around correctness of the events data T120242 also applies in this context. For research in particular, many use cases don't have high requirements (e.g. for training datasets for ML, or for metrics datasets that involve models that can also be "incorrect"), and we could/would migrate existing jobs to the new dumps table once it is available/supported in prod.

More on "Availability" / time travel. This question is not easy to answer, as it also relates to the current snapshot approach, which forces a pipeline to reason about the past in a rather limiting way. Aka "do you want the data as it looked today, or 1 month ago, or 2 months ago?", and finding out if/how the past data is different is not trivial and rarely practical. Generally pipelines either

  1. offload dealing with the snapshot semantics to the consumers by producing snapshotted datasets themselves
  2. implement a pseudo-incremental dataset by disregarding the "new past" and any changes it might contain.

For this reason I find it hard to define requirements for time travel; it is basically a new capability (for example, the replacement for "mediawiki_wikitext_current" could be a transformation of a time travel query). Starting with 90 days should be sufficient, as it is strictly an improvement over what one can do now.

That said, in my understanding one reason why the snapshots are the way they are is that there are delete actions which are legally required - I assume this is also reflected in the new wikitext_raw table? E.g. if an old revision is suppressed, it will disappear from the relevant file "from the past"? How does this mechanism work with time travel?

  • Availability: Research mostly treats the current history dumps as a pseudo-incremental dataset, i.e. pipelines that depend on the history wait for a new snapshot to be released and then use only the "new" data from that snapshot (i.e. the revisions created in the month since the last snapshot was generated). This means that wmf_dumps.wikitext_raw allows us to significantly reduce latency: roughly from 1 month (waiting for the snapshot interval to trigger) + 12 days (dump processing) down to a few hours.

Additionally, if you are indeed interested in changes only, Iceberg provides a CDC-like interface to consume only the changes between (Iceberg) snapshots: https://iceberg.apache.org/docs/nightly/spark-procedures/#change-data-capture. So if you folks have built pipelines to make these calculations, you could deprecate them and adopt this zero-code option instead.

  • Schema: As the schemas are almost identical, my main question is about extending the existing dataset in ways that depend on the snapshot mechanism. For example research has a number of use cases that involve comparing the revision text with the parent revision text. This involves a computationally expensive self join and some pitfalls, so there is a wikidiff job that creates (yet another) version of the wikitext history that includes a column with the unified diff between the current and parent revision.
    • Could we add the diff to the proposed wmf_dumps.wikitext_raw? As the parent revision could be at any point in the past, this would likely involve the equivalent to the wmf.mediawiki_wikitext_current available when new revisions are ingested into the dataset.

wmf_dumps.wikitext_raw is available hourly because it piggybacks on the page content change enrichment Flink job generated by Event Platform. This keeps the computation cheap on our side: we only pay the cost of a SQL MERGE into the Iceberg table.

I don't think wmf_dumps.wikitext_raw is the place for diffs because we want to keep this core table cheap, but considering that you could consume wmf_dumps.wikitext_raw in a CDC fashion as mentioned above, you could compute the diff only for the new revisions instead of the whole revision history, potentially making your wikidiff pipeline much faster.
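To sketch what that could look like (plain Python with difflib; the names and the changes-feed shape are hypothetical, and a real pipeline would read the Iceberg changelog with Spark): compute a unified diff only for revisions arriving in a CDC batch, looking the parent text up in a wikitext_current-style table.

```python
import difflib

def diff_new_revisions(changes, parent_text_by_rev):
    """For each new revision in a CDC-style changes batch, compute a
    unified diff against its parent revision's text. `changes` is a
    list of dicts with revision_id / revision_parent_id / text; parent
    text comes from a current-text lookup."""
    diffs = {}
    for rev in changes:
        parent = parent_text_by_rev.get(rev["revision_parent_id"], "")
        diff = "\n".join(difflib.unified_diff(
            parent.splitlines(), rev["text"].splitlines(), lineterm=""))
        diffs[rev["revision_id"]] = diff
    return diffs

# Hypothetical change batch: one edit to one page.
changes = [{"revision_id": 11, "revision_parent_id": 10,
            "text": "Hello world\nSecond line"}]
parents = {10: "Hello world"}
diffs = diff_new_revisions(changes, parents)
assert "+Second line" in diffs[11]
```

The key point is that the per-batch input is only the new revisions, not the full revision history, so the expensive self-join disappears.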

  • More generally, what is the replacement for the wmf.mediawiki_wikitext_current?

Although not planned yet, a new version of wmf.mediawiki_wikitext_current is possible as a downstream table from wmf_dumps.wikitext_raw as well. We can, again, just consume wmf_dumps.wikitext_raw in a CDC fashion and SQL MERGE the changes into the new table. This would be a cheap alternative to the current wmf.mediawiki_wikitext_current, which is produced by a separate job that recalculates everything every month.

  • Data quality: the discussion around correctness of the events data T120242 also applies in this context. For research in particular, many use cases don't have high requirements (e.g. for training datasets for ML, or for metrics datasets that involve models that can also be "incorrect"), and we could/would migrate existing jobs to the new dumps table once it is available/supported in prod.

Agreed that T120242 applies, and although I forgot to mention it here, we do have a plan to address it. We won't solve T120242 with this work, but we will solve the correctness issue for wmf_dumps.wikitext_raw via T358373.

...
That said, in my understanding one reason why the snapshots are the way they are is that there are delete actions which are legally required - I assume this is also reflected in the new wikitext_raw table? E.g. if an old revision is suppressed, it will disappear from the relevant file "from the past"? How does this mechanism work with time travel?

We mark visibility/suppressions with the respective user_is_visible, revision_comment_is_visible and revision_content_is_visible flags. We do not remove the content, and neither does MediaWiki remove it from the wiki databases, because all of it can be 'unsuppressed' as well.

Whenever we dump from this table to the public, we indeed honor the flags. A WMF/NDA user of this table would also need to remember to honor them... Hmm. Perhaps it would be best if this were taken care of with a VIEW?
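The masking such a VIEW would do can be sketched in plain Python (field names follow the rc2 schema; this is a sketch of the semantics, not an actual view definition): null out each suppressed field according to its visibility flag.

```python
def apply_visibility(row):
    """Return a copy of a wikitext_raw-style row with suppressed
    fields nulled out, the way a suppression-honoring VIEW would."""
    out = dict(row)
    if not row["user_is_visible"]:
        out["user_id"] = None
        out["user_text"] = None
    if not row["revision_comment_is_visible"]:
        out["revision_comment"] = None
    if not row["revision_content_is_visible"]:
        out["revision_content_slots"] = None
    return out

# Hypothetical row with a suppressed user but visible comment/content.
row = {"user_id": 42, "user_text": "Example", "user_is_visible": False,
       "revision_comment": "fix typo", "revision_comment_is_visible": True,
       "revision_content_slots": {"main": {"content_body": "..."}},
       "revision_content_is_visible": True}
masked = apply_visibility(row)
assert masked["user_text"] is None
assert masked["revision_comment"] == "fix typo"
```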

Having said that, limiting the availability of (Iceberg) snapshots to ~90 days also helps, considering that most suppressions are never unsuppressed.

From @fkaelin via Slack:

Some practical questions:
Is there an example of how to trigger an Airflow job based on the arrival of new data?

Ah yes. Great question. The idea is to sense the successful completion of the dumps_merge_events_to_wikitext_raw DAG from the analytics Airflow instance, like so:

from airflow.sensors.external_task import ExternalTaskSensor

# Wait for the source DAG's run to succeed before proceeding.
sensor = ExternalTaskSensor(
    task_id="wait_for_source_data",
    external_dag_id="dumps_merge_events_to_wikitext_raw",
)

Unfortunately, this doesn't work across Airflow instances. The Data-Engineering team is looking into expanding that mechanism so that we can use it across Airflow instances. I don't have a ticket to share right now but will let you know when we have it.

I noticed the dataset includes wikidata - could this be used for T364045 in a "roll forward" approach?

All wikis are included in wmf_dumps.wikitext_raw. I'd be happy to help folks start building PoCs on it, but I must state: it is not production quality yet (hence the 'release candidate' suffix). We are tracking getting it to prod on T358877.

How come /wmf/data/wmf_dumps/wikitext_raw_rc2 is ~150TB while the wikitext history is ~40TB? Is the Multi-Content Revisions support making up for that large difference?

wikitext_raw_rc2 actually takes about ~30TB for a full set, with the size advantage being that we now use Parquet underneath, so compression per column is much more efficient (and yes, we made sure to test whether the Parquet files were readable, even on the worst offenders that had caused issues before). The current size of ~150TB is because we are missing a job to remove old snapshots, and this table has been backfilled for 5 months now since I last manually cleaned it up. Thanks for noticing; I should do another manual cleanup, and we will solve this for good with automation in T358365.

I don't have much basis for feedback, since I've actually never used mediawiki_wikitext_history directly! The reason I've been trying to get it delivered faster is so that we can use @fkaelin's derived datasets more quickly (T365387) 😁

Since I'm here and have read through the thread, I will share some thoughts on naming, for whatever they're worth. I recognize most of these names were used for historical continuity but since table migrations are so rare, I think it's worth taking the opportunity to clarify and standardize things for the next decade of users.

  • wmf_dumps isn't the most useful name, since lots of different datasets get dumped and most of them will be in other databases. What about wmf_content?
  • revision_timestamp, error_timestamp: according to the data modeling guidelines, we should use revision_dt and error_dt instead.
  • wiki_db: according to the data modeling guidelines, we should use wiki_id instead.
  • revision_is_minor_edit: it would be less redundant to use revision_is_minor
  • user_is_visible, revision_comment_is_visible, revision_content_is_visible: "visible" is actually quite a good term for this, so personally I kind of want to keep it, but it's not used elsewhere. The official name for the functionality is revision deletion, although some parts of the interface do use "visible". mediawiki_history provides this as an array named revision_deleted_parts. Maybe it's worth emulating that? 🤷🏽‍♂️
  • page_redirect_title: I think this is a tiny bit confusing (what is it, the title of a redirect?). Maybe page_redirect_target instead, as "target" seems to be the common term (e.g. on en:w:Wikipedia:Redirect and mw:Help:Redirects).
  • revision_size and content_size: according to the data modeling guidelines, these should be suffixed by the unit (revision_size_bytes and content_size_bytes).
  • row_last_update, row_visibility_last_update: according to the data modeling guidelines, these should be suffixed by _dt (although personally I find that a bit redundant)
  • content_body: the two words seem redundant to me. What about just content?

I recognize most of these names were used for historical continuity but since table migrations are so rare, I think it's worth taking the opportunity to clarify and standardize things for the next decade of users.

Thanks for the feedback @nshahquinn-wmf. Agreed that since this table will go to prod after the data modeling guidelines were established, we should strive to abide by it. I will move your comment to a separate task to tackle it as part of our work.

I am marking this task as done, considering I have been able to answer the main questions from the description.

Considering @fkaelin's comments on a replacement for wmf.mediawiki_wikitext_current, I am also opening a task to explore a replacement table using the aforementioned https://iceberg.apache.org/docs/nightly/spark-procedures/#change-data-capture mechanism. It will be a great example use case for explaining how to consume this new table in a CDC fashion.

I recognize most of these names were used for historical continuity but since table migrations are so rare, I think it's worth taking the opportunity to clarify and standardize things for the next decade of users.

Thanks for the feedback @nshahquinn-wmf. Agreed that since this table will go to prod after the data modeling guidelines were established, we should strive to abide by it. I will move your comment to a separate task to tackle it as part of our work.

Opened T366542.

Considering @fkaelin's comments on a replacement for wmf.mediawiki_wikitext_current, I am also opening a task to explore a replacement table using the aforementioned https://iceberg.apache.org/docs/nightly/spark-procedures/#change-data-capture mechanism. It will be a great example use case for explaining how to consume this new table in a CDC fashion.

Opened T366544.

One of the Structured Content team's medium-term aspirations is to gather a dataset of uploads with their content, so we can look for patterns in what gets deleted. The fact that wmf_dumps.wikitext_raw does not contain archived pages hampers this because we can't get the wikitext (or mediainfo data) for anything that has been deleted.

Is there any plan to include content slots for archived pages in wikitext_raw? Or to make them available in some other way?

One of the Structured Content team's medium-term aspirations is to gather a dataset of uploads with their content, so we can look for patterns in what gets deleted. The fact that wmf_dumps.wikitext_raw does not contain archived pages hampers this because we can't get the wikitext (or mediainfo data) for anything that has been deleted.

Is there any plan to include content slots for archived pages in wikitext_raw? Or to make them available in some other way?

I think this will be difficult to accomplish on wikitext_raw with the way revisions get deleted in MediaWiki: they get DELETEd from the revision table and INSERTed into the archive table.

I had heard though that getting rid of the archive table was picking up steam (See T20493). If that were to be true, we could piggyback on that effort?

This has begun I think, but is gonna be a long process. I summon @Ladsgroup !

The deletion storage refactor will happen after I'm done with the image table refactor (T28741), but that's going to take a while, and even if I start the work, it'll be quite some time before it would impact you.