Page MenuHomePhabricator

Normalize cu_changes table
Open, Needs TriagePublic

Description

In every row of the cu_changes table the user agent is saved in the cuc_agent column. User agents are usually quite common, thus this quite a data duplication with the same user agent being saved up to million of times in the db.

Perhaps it makes sence to normalize the user agent by creating a new table where we map ids to user agents and then only reference those ids in the cu_changes table.

We might also want to do something similar for cuc_actiontext.

Event Timeline

For some of the values for cuc_actiontext there is also T41013

Huji added a subscriber: Huji.

This is a good idea, and would likely make T147894 obsolete. It would also make T234980 much easier.

I wonder if it gets worked on though, given T295073: <Org-Wide Impact> Google Chrome User-Agent Deprecation Impact

Generally sounds good but:

  • Let's not do it while the actor/comment migration is happening on CU tables. One thing at a time.
  • Old UA strings should be removed

I agree with both comments by @Ladsgroup
For the second one, we should modify purgeOldData.php to also run a query that looks for any rows in the new cu_user_agents table where its identifier (cuua_id) doesn't occur in the cu_changes table in its now-normalized cu_agent_id column. Those rows should then be deleted. This further justifies why cu_changes.cu_agent_id should be indexed, which is desirable for T147894 anyway.