Page MenuHomePhabricator

purgeRedundantText: Potential data loss
Open, HighPublic

Description

The purgeOldText.php script can be used to purge text no longer active on the text table. However, as the text_table description says, this table not only contains revision texts, but can potentially contain blobs from extensions.

For example, AbuseFilter extension stores data there. See storeVarDump method. Log details of past edits are stored there.

Looking at the code of purgeRedundantText it only looks for text referenced from the revision and archive tables. It’s not taking into account extensions, and this could potentially cause data loss.

Note that the purgeRedundantText method is in the Maintenance class, because it’s used by several maintenance scripts, which is scary.

grep for purgeRedundantText on MediaWiki 1.31:

./maintenance/nukeNS.php:93: $this->purgeRedundantText( true );
./maintenance/deleteOrphanedRevisions.php:80: $this->purgeRedundantText( true );
./maintenance/deleteArchivedRevisions.php:59: $this->purgeRedundantText( true );
./maintenance/deleteOldRevisions.php:97: $this->purgeRedundantText( true );
./maintenance/nukePage.php:88: $this->purgeRedundantText( true );
./maintenance/Maintenance.php:1241: public function purgeRedundantText( $delete = true ) {
./maintenance/purgeOldText.php:40: $this->purgeRedundantText( $this->hasOption( 'purge' ) );

After investigating it, I found that, effectively, AbuseFilter log history data was wiped from my database after I ran nukePage.php maintenance script. I find this unacceptable.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 10 2019, 8:25 PM
Tgr added a subscriber: Tgr.Jan 11 2019, 2:47 AM
Tgr added a comment.Jan 11 2019, 2:54 AM

Ideally the (largely undocumented) ExternalStore should be used for storing extension data. (Or I guess SqlBlobStore these days?) AbuseFilter supports it but falls back to the text table when unconfigured. Not sure what would be the nice way to handle this; a hook so extensions can provide query conditions for recognizing text table rows as being in use?

In fact, I have ExternalStore on my wiki configured, but AbuseFilter stores the reference to the external store to the text table. This method wipes the reference to it on the text table, but doesn't remove the external store data (T213480)

I like Tgr's idea. However, I can also gladly say that this is a good moment to fix the problem: T213006 outlines a script to make various fixes to the afl_var_dump column, and consequently to the text table. If we want to change something in old entries, speak now...

Daimona moved this task from Backlog to Next on the User-Daimona board.
Tgr added a comment.Jan 11 2019, 6:35 PM

Yeah, apparently AF just clones what Revision did, which is use the text table as a reference that points to external storage. Which makes sense for references (which are expected to be in the text table, and only moved out of there to reduce table size), not so much for AF (which should just store the external store URL in the same place where it stores the text id in the non-ExternalStore case).

But it seems like even generic helpers do that at this point (see e.g. the comment in SqlBlobStore::storeBlob()) so it wouldn't be fair to blame AF here.

@Tgr Well, storing to the text table makes it possible to use its flags just like revision does, e.g. 'external'... Of course we could also store the var dump directly inside afl_var_dump, but I can't say if it's worth it.

Tgr added a comment.Jan 11 2019, 6:44 PM

Probably not, it's a huge dump of data and rarely needed. Just what external storage was meant for.

The direction MCR is moving towards is that you either store tt://<text id> or an external storage URL. But apparently the abstraction layer is not quite there yet.

Probably not, it's a huge dump of data and rarely needed. Just what external storage was meant for.

True.

The direction MCR is moving towards is that you either store tt://<text id> or an external storage URL. But apparently the abstraction layer is not quite there yet.

Could be done. I guess I should CR-1 the patch for now, unless we have some more background... And wait for T34478, too.

Daimona moved this task from Next to Waiting on the User-Daimona board.Feb 11 2019, 5:52 PM
Daimona triaged this task as High priority.Sat, Mar 30, 11:48 AM

Blocking other high priority tasks.