This field exists in every row of the revision table and is the largest field of that table. Overall it takes ~20% of the revision table, and given how large this table is (206GB in enwiki, 377GB in wikidata), this adds up to quite a large amount of storage (I estimate this would be on average 40GB per replica = 8TB in total).
I'm not against storing checksums and I see the value, but I'm not sure the status quo takes all the trade-offs into account. It's important to think about the purpose of this field: is it proof of integrity or proof of authenticity? If it's the former, this is overkill. If it's the latter, then without a Merkle tree and something stronger than SHA-1 (which is known to be broken for collision resistance) we can't provide that (i.e. the status quo can't be used as proof of authenticity). My point is that we store revisions, not TLS certificates.
Another problem with rev_sha1 is that it stores the hex value of the checksum as a string (varbinary(31)), which is quite wasteful.
Proposal 0: Drop rev_sha1 and compute it on the fly from content_sha1. <- We are going with this.
Proposal 1: Store the checksum as an unsigned int. That would be enough to protect against random corruption and even to support revert detection. To get the value, it'd be enough to take the first 8 digits of the hex value, convert them to decimal, and take the remainder modulo the max value of an unsigned int in MySQL.
- Pros: It saves a lot of space: 28 bytes per row, which adds up to ~35GB saved on average on each replica.
- Cons: Slightly higher chance of collision: 1 in 4B revisions might collide (theoretically; in practice it's higher).
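A minimal sketch of the derivation in Proposal 1, assuming the checksum is computed over the raw content bytes. The function name `checksum32` and the constant `UINT32_MAX` are hypothetical and only used for illustration. Note that 8 hex digits already fit in 32 bits, so the modulo step is effectively a safety net rather than a real reduction.

```python
import hashlib

# Largest value of MySQL's INT UNSIGNED column type (2^32 - 1).
UINT32_MAX = 2**32 - 1


def checksum32(content: bytes) -> int:
    """Truncate a SHA-1 digest to a value that fits an unsigned 32-bit int.

    Hypothetical helper: takes the first 8 hex digits of the SHA-1 digest
    and reduces them into the range [0, UINT32_MAX].
    """
    hex_digest = hashlib.sha1(content).hexdigest()
    # 8 hex digits encode exactly 32 bits, so the modulo is a no-op here;
    # it is kept only to make the range guarantee explicit.
    return int(hex_digest[:8], 16) % (UINT32_MAX + 1)


if __name__ == "__main__":
    print(checksum32(b"Some revision text"))
```

The result takes 4 bytes per row instead of a length-prefixed 31-character string, which is where the 28 bytes per row figure comes from.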
Proposal 2: Store the checksum as an unsigned bigint instead.
- Pros: Pretty low chance of collision.
- Cons: Saves only 30GB per replica
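To compare the collision risk of Proposals 1 and 2, here is a rough sketch using the standard birthday approximation: among n uniformly distributed values in a space of size 2^bits, the expected number of colliding pairs is about n(n-1)/(2 * 2^bits). The row count of one billion is a made-up round number for illustration, not a measured figure, and the helper names are hypothetical.

```python
import hashlib


def checksum64(content: bytes) -> int:
    """Hypothetical helper: first 16 hex digits of SHA-1 as an unsigned 64-bit int."""
    return int(hashlib.sha1(content).hexdigest()[:16], 16)


def expected_colliding_pairs(n_rows: int, bits: int) -> float:
    """Birthday approximation for the expected number of colliding pairs."""
    return n_rows * (n_rows - 1) / (2 * 2**bits)


if __name__ == "__main__":
    n = 10**9  # assumed row count, for illustration only
    for bits in (32, 64):
        print(bits, expected_colliding_pairs(n, bits))
```

Under these assumptions a 32-bit checksum is fine for detecting corruption of a single known row, but many unrelated rows will share a value across the whole table, while a 64-bit checksum makes even table-wide collisions unlikely.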
Proposal 3: Normalize the value into another table (there are some repetitions due to reverts and such).
- Pros: It removes a lot from the revision table, which is especially good for the InnoDB buffer pool.
- Cons: It doesn't remove the data, and conceptually it's better to keep the checksum near the object itself. Also, since the pointer in revision has to be a bigint, it doesn't save much.
See also T158986: Migrate SHA-1 hashes to SHA-256 (tracking) and subtasks.