
Purge rows from cu_useragent
Closed, Resolved · Public · 1 Estimated Story Point

Description

The cu_useragent table added in T359312: Create cu_useragent table has private data (user agent strings) which, per DBA and to ensure privacy, must be purged. The exact strategy to do this has not been decided and needs consideration because when implementing Client Hints, we did not purge rows from cu_useragent_client_hints because of the difficulty in reliably deleting rows.

The following strategies could be used:

  1. Only deleting a row when its primary key is below the 99% threshold (as suggested in T359312#9614750)
  2. Using a "pending deletion" column strategy:
    1. If a cu_useragent row is unused, mark it as "pending deletion" by updating a column on that row in the cu_useragent table - a dedicated column would need to be added for this.
    2. Wait for this change to be applied to all replicas
    3. Check if the row pending deletion is being used again
      1. If the row is being used, then mark the row as no longer pending deletion
      2. If the row isn't being used, then delete it.
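The steps above can be sketched as follows, using an in-memory SQLite database as a stand-in for the production MySQL setup. The `cuua_pending_deletion` column, the simplified table layout, and the column names are assumptions for illustration only, not the real CheckUser schema, and the replication wait between the two passes is reduced to a comment:

```python
import sqlite3

# Simplified stand-in schema; names are hypothetical, not the real one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cu_useragent (
        cuua_id INTEGER PRIMARY KEY,
        cuua_useragent TEXT,
        cuua_pending_deletion INTEGER NOT NULL DEFAULT 0  -- hypothetical new column
    );
    CREATE TABLE cu_changes (cuc_id INTEGER PRIMARY KEY, cuc_agent_id INTEGER);
""")
conn.execute("INSERT INTO cu_useragent (cuua_id, cuua_useragent) VALUES (1, 'UA-1'), (2, 'UA-2')")
conn.execute("INSERT INTO cu_changes (cuc_id, cuc_agent_id) VALUES (10, 2)")

def mark_unused_pending(conn):
    # Step 1: mark every unreferenced row as pending deletion.
    conn.execute("""
        UPDATE cu_useragent SET cuua_pending_deletion = 1
        WHERE cuua_id NOT IN (SELECT cuc_agent_id FROM cu_changes)
    """)

def resolve_pending(conn):
    # Steps 3a/3b: re-check each pending row after replication has caught up.
    rows = conn.execute(
        "SELECT cuua_id FROM cu_useragent WHERE cuua_pending_deletion = 1"
    ).fetchall()
    for (row_id,) in rows:
        used = conn.execute(
            "SELECT 1 FROM cu_changes WHERE cuc_agent_id = ?", (row_id,)
        ).fetchone()
        if used:
            # Row was re-used in the meantime: clear the pending flag.
            conn.execute(
                "UPDATE cu_useragent SET cuua_pending_deletion = 0 WHERE cuua_id = ?",
                (row_id,),
            )
        else:
            # Still unused: delete for real.
            conn.execute("DELETE FROM cu_useragent WHERE cuua_id = ?", (row_id,))

mark_unused_pending(conn)
# (In production: wait for the UPDATE to reach all replicas before this pass.)
resolve_pending(conn)
remaining = [r[0] for r in conn.execute("SELECT cuua_id FROM cu_useragent")]
print(remaining)  # row 1 was unreferenced and deleted; the referenced row 2 survives
```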


Event Timeline

Dreamy_Jazz set the point value for this task to 1. (Aug 11 2025, 2:13 PM)

Having investigated the first option, I think it is unlikely to be reliable enough:

  • If a replica read returns an ID that is currently being purged, the caller may use that ID in a new row even though the row has already been purged
    • Reading the ID from the primary DB doesn't solve this (without locks), because the purging transaction may run its check first and then commit after the reference was fetched by a different primary DB connection
  • Only touching primary keys that are in the smallest 1% of primary keys should significantly reduce the chance of this happening, but it doesn't really solve the underlying problem
  • Additionally, doing the delete and then checking for uses after replication also won't be 100% reliable, because the replica being checked may not be the one that is furthest behind in terms of replication lag

I think the only safe approach here is to create a new column in cu_useragent which stores the date the row was deleted:

  • This could be an integer field: for example, it could store the integer 20251217, which is parsed as a date but stored as an integer to save space, given that we don't need much precision in the deletion timestamp
  • When a row has been marked deleted for over a day (or a smaller supported interval if the field is a timestamp field), a new check can be made to ensure that it isn't referenced anywhere, and then the code can either delete it for real or undo the deletion
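A small sketch of the integer-date encoding suggested above; the helper names are hypothetical, and the one-day cutoff matches the interval described:

```python
from datetime import date, timedelta

# Hypothetical helpers: a YYYYMMDD integer is compact, sorts chronologically,
# and is precise enough for a "deleted over a day ago" check.
def to_int_date(d: date) -> int:
    return d.year * 10000 + d.month * 100 + d.day

def deleted_over_a_day_ago(deleted_int: int, today: date) -> bool:
    # Decode the integer back into a date and compare against a one-day cutoff.
    d = date(deleted_int // 10000, (deleted_int // 100) % 100, deleted_int % 100)
    return today - d > timedelta(days=1)

print(to_int_date(date(2025, 12, 17)))                        # 20251217
print(deleted_over_a_day_ago(20251217, date(2025, 12, 19)))   # True
print(deleted_over_a_day_ago(20251217, date(2025, 12, 18)))   # False
```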

@Ladsgroup does the above sound okay? I know you suggested a 99% deletion threshold, but I am concerned that this wouldn't be reliable enough. Are there examples of where this is used elsewhere in MediaWiki while ensuring data is never lost?

Adding an extra column will add a lot of complexity that'll be hard to maintain. We do a lot of purging and cleanups, and we never do something like that. Do you have numbers on why 99% is not reliable enough? You can also run some queries to see what value would be the most reliable (maybe 98%). We already use that 99% threshold in PruneUnusedLinkTargetRows with no issues. Wikibase also has a job to clean up a target when it becomes orphaned, which has been working reliably for around half a decade: https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/master/lib/includes/Store/Sql/Terms/CleanTermsIfUnusedJob.php. On top of that, even if we lose pointers to, let's say, one edit a month, would that really be bad? I don't think so. Most actions won't (and shouldn't) be checked by a CU. We call that an error budget: https://sre.google/sre-book/embracing-risk/
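For reference, a minimal sketch of how the threshold-based pruning could look, assuming (as one reading of the 99% rule) that only IDs below 99% of the current maximum are delete candidates, so the newest rows, which are the ones most likely to be mid-insert elsewhere, stay out of scope. All names here are illustrative, not taken from PruneUnusedLinkTargetRows:

```python
def prune_unused(ids, referenced_ids, max_id, threshold=0.99):
    """Return delete-candidate IDs: unreferenced rows whose ID is below
    threshold * max_id, skipping the newest rows to shrink the race window."""
    cutoff = max_id * threshold
    return sorted(i for i in ids if i < cutoff and i not in referenced_ids)

ids = [1, 2, 3, 98, 99, 100]
referenced = {2, 98}
print(prune_unused(ids, referenced, max_id=100))  # [1, 3] - 99 and 100 are spared
```

Note that 99 is unreferenced but still kept, because it falls in the top 1% of IDs.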


My concern was primarily about losing pointers, but I guess you are right about this being within a reasonable error budget. Thanks for the comments.

I'll look at PruneUnusedLinkTargetRows and use that. Thanks

Change #1223208 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/CheckUser@master] [WIP] Prune rows from cu_useragent

https://gerrit.wikimedia.org/r/1223208

Change #1223208 merged by jenkins-bot:

[mediawiki/extensions/CheckUser@master] Prune rows from cu_useragent in purgeOldData.php

https://gerrit.wikimedia.org/r/1223208