Page MenuHomePhabricator

purgeRedundantText: Potential data loss
Open, HighPublic

Description

The purgeOldText.php script can be used to purge text no longer active on the text table. However, as the text_table description says, this table not only contains revision texts, but can potentially contain blobs from extensions.

For example, AbuseFilter extension stores data there. See storeVarDump method. Log details of past edits are stored there.

Looking at the code of purgeRedundantText it only looks for text referenced from the revision and archive tables. It’s not taking into account extensions, and this could potentially cause data loss.

Note that the purgeRedundantText method is in the Maintenance class, because it’s used by several maintenance scripts, which is scary.

grep for purgeRedundantText on MediaWiki 1.31:

./maintenance/nukeNS.php:93: $this->purgeRedundantText( true );
./maintenance/deleteOrphanedRevisions.php:80: $this->purgeRedundantText( true );
./maintenance/deleteArchivedRevisions.php:59: $this->purgeRedundantText( true );
./maintenance/deleteOldRevisions.php:97: $this->purgeRedundantText( true );
./maintenance/nukePage.php:88: $this->purgeRedundantText( true );
./maintenance/Maintenance.php:1241: public function purgeRedundantText( $delete = true ) {
./maintenance/purgeOldText.php:40: $this->purgeRedundantText( $this->hasOption( 'purge' ) );

After investigating it, I found that, effectively, AbuseFilter log history data was wiped from my database after I ran nukePage.php maintenance script. I find this unacceptable.

Event Timeline

Ideally the (largely undocumented) ExternalStore should be used for storing extension data. (Or I guess SqlBlobStore these days?) AbuseFilter supports it but falls back to the text table when unconfigured. Not sure what would be the nice way to handle this; a hook so extensions can provide query conditions for recognizing text table rows as being in use?

In fact, I have ExternalStore on my wiki configured, but AbuseFilter stores the reference to the external store to the text table. This method wipes the reference to it on the text table, but doesn't remove the external store data (T213480)

I like Tgr's idea. However, I can also gladly say that this is a good moment to fix the problem: T213006 outlines a script to make various fixes to the afl_var_dump column, and consequently to the text table. If we want to change something in old entries, speak now...

Yeah, apparently AF just clones what Revision did, which is use the text table as a reference that points to external storage. Which makes sense for references (which are expected to be in the text table, and only moved out of there to reduce table size), not so much for AF (which should just store the external store URL in the same place where it stores the text id in the non-ExternalStore case).

But it seems like even generic helpers do that at this point (see e.g. the comment in SqlBlobStore::storeBlob()) so it wouldn't be fair to blame AF here.

@Tgr Well, storing to the text table makes it possible to use its flags just like revision does, e.g. 'external'... Of course we could also store the var dump directly inside afl_var_dump, but I can't say if it's worth it.

Probably not, it's a huge dump of data and rarely needed. Just what external storage was meant for.

The direction MCR is moving towards is that you either store tt://<text id> or an external storage URL. But apparently the abstraction layer is not quite there yet.

Probably not, it's a huge dump of data and rarely needed. Just what external storage was meant for.

True.

The direction MCR is moving towards is that you either store tt://<text id> or an external storage URL. But apparently the abstraction layer is not quite there yet.

Could be done. I guess I should CR-1 the patch for now, unless we have some more background... And wait for T34478, too.

Blocking other high priority tasks.

If we want to change the storage format retroactively, or just add a new flag to entries in the text table, this is blocking T213006.

If we want to change the storage format retroactively, or just add a new flag to entries in the text table, this is blocking T213006.

But well, reading it again... Would it be enough to just add a hook as proposed above? Something like 'PurgeRedundantText', and extensions may provide a list of IDs to spare. For, AF, it would be something along the lines of:

$res = $dbr->selectFieldValues( 'abuse_filter_log', 'afl_var_dump', '', __METHOD__ );
$ret = [];
foreach ( $res as $val ) {
   if ( strpos( $val, 'stored-text' ) !== false ) {
      $ret[] = (int)str_replace( $val, 'stored-text:', '' );
   }
}
return $ret;

If this is it, we don't even need to retroactively change abuse_filter_log rows, nor to add special flags. We'd just have to update the hook handler as part of the migration.

@Tgr would it be enough?

But well, reading it again... Would it be enough to just add a hook as proposed above? Something like 'PurgeRedundantText', and extensions may provide a list of IDs to spare. For, AF, it would be something along the lines of:

This works if you have a way to quickly decide on the fly whether a given text_id is owned by AF or not. The code you posted above would prove a list of such IDs if I understand correctly but... wouldn't this list be potentially huge? Millions of entries? That wouldn't work... we'd have to put them into a separate table, first. Maybe a temporary table would be an option.

Another alternative would be to put this into a separate table. It would be pretty easy to make a simplified version of SqlBlobStore that takes a table name as a constructor argument, and extended BlobStoreFactory in a way that allows it to provide blob stores that are bound to different tables. This would be a bit like T106363: Migrate Flow content to new separate logical External Store in production. Actually, that would also work: An ESBlobStore pointing to a separate ES setup.

But well, reading it again... Would it be enough to just add a hook as proposed above? Something like 'PurgeRedundantText', and extensions may provide a list of IDs to spare. For, AF, it would be something along the lines of:

This works if you have a way to quickly decide on the fly whether a given text_id is owned by AF or not. The code you posted above would prove a list of such IDs if I understand correctly but... wouldn't this list be potentially huge? Millions of entries? That wouldn't work... we'd have to put them into a separate table, first. Maybe a temporary table would be an option.

Yes, it wouldn't perform really well. OTOH, the maintenance script has:

SELECT rev_text_id FROM revision

and

SELECT ar_text_id FROM archive

which are equally, if not more expensive.

Of note, this is intended as a quickly-available solution. I'm sure there are better ways to do that.

which are equally, if not more expensive.

Some quick stats from enwiki: the revision table has 778.547.790, archive has 53.893.106, and abuse_filter_log has 24.312.505. So yes, definitely less expensive.

Still slow, of course, but I find it acceptable in this context.

Yes, it wouldn't perform really well. OTOH, the maintenance script has:

SELECT rev_text_id FROM revision

and

SELECT ar_text_id FROM archive

which are equally, if not more expensive.

Oh wow, I didn't realize these queries were not batched. The result of these queries are even turned into a list and submitted as part of another query!

This means that purgeRedundantText() is completely broken for any wiki of non-trivial size, and the maintenance scripts that use this don't work. This makes me think that nobody actually uses this code. Maybe we can just kill it :)

Anyway, you are right: your solution will work for the cases for which the rest of the code works. On wikis up to a thousand pages or so.

Oh wow, I didn't realize these queries were not batched. The result of these queries are even turned into a list and submitted as part of another query!

Hah! I didn't even see the list part. One of the hook parameters may be the batch size, so that extensions can use them.

This means that purgeRedundantText() is completely broken for any wiki of non-trivial size, and the maintenance scripts that use this don't work. This makes me think that nobody actually uses this code. Maybe we can just kill it :)

Or this solution, of course.

Anyway, you are right: your solution will work for the cases for which the rest of the code works. On wikis up to a thousand pages or so.

I think we can add batching + hook for now. Maybe there are use cases outside of WMF wikis?

At any rate, it's pretty clear now that this isn't blocking T213006 anymore.

Yes, it wouldn't perform really well. OTOH, the maintenance script has:

SELECT rev_text_id FROM revision

and

SELECT ar_text_id FROM archive

which are equally, if not more expensive.

This has been reported before, just in case someone is interested: T22651: Maintenance.php function purgeRedundantText() not able to deal with big data set

But well, reading it again... Would it be enough to just add a hook as proposed above? Something like 'PurgeRedundantText', and extensions may provide a list of IDs to spare.

Theoretically, it is bad form to require a hook to protect rows from being deleted. If something goes wrong with it, you only notice when your content is gone.