We will soon be dropping the rev_sha1 column from production. It seems to be used by the data engineering infrastructure. The same data is available as content_sha1 in the content table. The way rev_sha1 is computed (if I'm reading RevisionSlots::computeSha1() correctly) is this: if the revision has only one slot, take the hash of that slot. If it has more than one, concatenate each slot's hash with the result for the previous slots, compute the base36 SHA-1 of the concatenated value, and continue until only one string is left.
Description
| Status | Assigned | Task |
|---|---|---|
| Resolved | Zabe | T389026 Rethink rev_sha1 field |
| Resolved | xcollazo | T405503 Prepare data engineering infrastructure for drop of rev_sha1 |
| Resolved | xcollazo | T405641 Adapt MW Content pipelines to the removal of upstream revision.rev_sha1 |
| Resolved | JAllemandou | T406000 Adapt mediawiki_history to the removal of mediawiki revision.rev_sha1 |
| Duplicate | None | T406644 Stop sqooping revision.rev_sha1 |
Event Timeline
This will have an impact on the mediawiki_history identity-revert field and related fields. We need to spend time on this, @Ahoelzl.
FWIW, the revision.content_slots.content_sha1 field is available in mediawiki.page_change.v1. If we had done T258511: Data Lake incremental Data Updates and Dumps 2 reconciliation focused on reconciling page_change (instead of just mediawiki content), we could use it for incremental mediawiki_history.
(Sorry Amir, this is my squeaky wheel soapbox about a decision we made a year ago :) )
Anyway, it seems we do sqoop the content table which has content_sha1 in it. I suppose the mediawiki_history algorithm just needs to join revision with the content table and use content_sha1 instead of rev_sha1.
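A rough sketch of what that join could look like, using an in-memory SQLite database as a stand-in. This assumes the usual MediaWiki layout where revision reaches content through the slots and slot_roles tables; the table subsets, the sample rows, and the helper name `slot_hashes_by_revision` are made up for illustration, and the real pipeline would run the equivalent over the sqooped tables.

```python
import sqlite3

# Toy stand-in for the sqooped tables; only the columns needed for the join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revision   (rev_id INTEGER PRIMARY KEY);
CREATE TABLE slot_roles (role_id INTEGER PRIMARY KEY, role_name TEXT);
CREATE TABLE slots      (slot_revision_id INTEGER, slot_role_id INTEGER,
                         slot_content_id INTEGER);
CREATE TABLE content    (content_id INTEGER PRIMARY KEY, content_sha1 TEXT);

-- Placeholder values, not real hashes.
INSERT INTO revision   VALUES (1), (2);
INSERT INTO slot_roles VALUES (1, 'main'), (2, 'mediainfo');
INSERT INTO content    VALUES (10, 'hash_a'), (11, 'hash_b'), (12, 'hash_c');
INSERT INTO slots      VALUES (1, 1, 10), (2, 1, 11), (2, 2, 12);
""")

QUERY = """
SELECT r.rev_id, sr.role_name, c.content_sha1
FROM revision r
JOIN slots s       ON s.slot_revision_id = r.rev_id
JOIN slot_roles sr ON sr.role_id = s.slot_role_id
JOIN content c     ON c.content_id = s.slot_content_id
ORDER BY r.rev_id, sr.role_name
"""


def slot_hashes_by_revision(connection):
    """Return {rev_id: {role_name: content_sha1}} for every revision."""
    result = {}
    for rev_id, role_name, content_sha1 in connection.execute(QUERY):
        result.setdefault(rev_id, {})[role_name] = content_sha1
    return result


hashes = slot_hashes_by_revision(conn)
```

For single-slot revisions the one content_sha1 can stand in for rev_sha1 directly; multi-slot revisions would still need the per-slot hashes combined.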
Could be done in 30 days, but we might have to drop something else, TBD!
This will affect wmf_content.mediawiki_content_history_v1 as well.
And it will also affect T384382: Production-level file export (aka dump) of MW Content in XML.
Since RevisionSlots::computeSha1() is PHP, we will have to reimplement that algorithm on our side if we are to continue offering that field in the table and in File Export.
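For reference, a minimal Python sketch of that reimplementation, following the description in the task. Assumptions not confirmed in this thread: slots are folded in order of role name, base36 hashes are left-padded with zeros to 31 characters, and a revision with no slots gets the hash of the empty string. The function names are hypothetical; please check against the PHP before relying on it.

```python
import hashlib

BASE36_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def base36_sha1(data: bytes) -> str:
    """SHA-1 of data, written in base 36 and zero-padded to 31 characters."""
    n = int(hashlib.sha1(data).hexdigest(), 16)
    digits = []
    while n:
        n, remainder = divmod(n, 36)
        digits.append(BASE36_DIGITS[remainder])
    return "".join(reversed(digits)).rjust(31, "0")


def rev_sha1_from_slots(slot_hashes: dict) -> str:
    """Fold per-slot content_sha1 values ({role_name: hash}) into one hash.

    One slot: return its hash unchanged. Several slots: concatenate the
    running result with the next slot's hash and re-hash, repeatedly.
    """
    if not slot_hashes:
        # Assumption: no slots is treated like empty content.
        return base36_sha1(b"")
    accumulated = None
    # Assumption: slots are processed in role-name order.
    for role_name in sorted(slot_hashes):
        slot_hash = slot_hashes[role_name]
        if accumulated is None:
            accumulated = slot_hash
        else:
            accumulated = base36_sha1((accumulated + slot_hash).encode("ascii"))
    return accumulated
```

Since nearly all revisions have a single slot, the fold only matters for multi-slot content; in the common case the result is just the main slot's content_sha1.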
Please note that archive.ar_sha1 will be dropped as well, in case you also use that column. The archive table holds the deleted revisions.
Thanks. Yeah. We need to clean up some stuff and will do that after Oct 25. Will you encounter issues if the columns still exist for a while? (It's going to be rolled out gradually and will take time regardless.)
Will you encounter issues if the columns still exist for a while?
We should be good. We have already adapted the two important pipelines. We still need to do T406644: Stop sqooping revision.rev_sha1, but we hope to do that next week.