Page MenuHomePhabricator

Improve AbuseFilter database schema to reduce duplication
Open, Needs TriagePublic

Description

In the abuse_filter table, we store information about all the filters, including the last editor and timestamp of the most recent update. But the same information is also available in abuse_filter_history where the data about each version of each filter is stored. So is there any problem with replacing af_user, af_user_text and af_timestamp (perhaps also af_group, af_deleted, af_actions, af_public_comments) with a foreign key to abuse_filter_history.afh_id and JOIN them?

See also:

Event Timeline

Denormalization is generally done for two reasons: efficiency and convenience. Having the af_user in abuse_filter means one less join, and near instantaneous access to its value.

That said, the information af_user, af_user_text and af_timestamp is not routinely retrieved (it is only retrieved when you go to ViewEdit page to edit a filter, which is <1% of all the times code is run. So as long as we have a multicolumn index on abuse_filter_history(afh_filter, afh_timestamp), removing those three columns from abuse_filter and changing the code everywhere to use a join should be fine.

In contrast, af_public_comments and af_actions are needed every time a filter is triggered, and denormalizing them was a good idea, IMHO. AbuseFilter is already slow, and adding another join, no matter the index, is only going to make it slower.