Page MenuHomePhabricator

Create maintenance script to remove autopatrol actions from logging table
Closed, ResolvedPublic

Description

As part of T184485: Stop logging autopatrol actions, we need a maintenance script to remove all autopatrol actions from the logging table – including all patrol actions prior to April 2016 (before that, manual and automatic patrolling weren’t distinguished).

Since that “April 2016” depends on when code was deployed to an installation, that shouldn’t be hard-coded :) instead, there should be flags to control the time range when to remove autopatrol actions, and another flag to control whether all patrols or only autopatrols are removed.

Event Timeline

including all patrol actions prior to April 2016

I don't think this is the right approach for removal of autopatrols out of the logging table. Autopatrols before April 2016 are distinguishable, even if not by log_action, and by doing this we would lose a lot of actual data in contrast to removing only autopattrol actions.

Yeah, autopatrols should be distinguishable from manual patrols at pretty much all times:

I checked some log entries from 2012 and found that they include information whether they are autopatrols or manual patrols. Very old log entries (I checked some from mid-2009) have the revision id handy… with that we can check which user created the revision (in the revision or archive table)… if the log user equals the revision/ archive user, we have an autopatrol.

Ladsgroup moved this task from Tasks to In Progress on the Wikidata-Ministry-Of-Magic board.
Ladsgroup subscribed.

Yeah, autopatrols should be distinguishable from manual patrols at pretty much all times:

I checked some log entries from 2012 and found that they include information whether they are autopatrols or manual patrols. Very old log entries (I checked some from mid-2009) have the revision id handy… with that we can check which user created the revision (in the revision or archive table)… if the log user equals the revision/ archive user, we have an autopatrol.

Yes, We looked into it and I'm pretty sure it's possible but that query would be too expensive (if not almost impossible) when running the maintenance script to delete 300M rows that happened for wikidata before April 2016. As I said before, there is no plan to remove patrol actions prior to April 2016 for any wiki other than Wikidata.

Change 420214 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/core@master] Introduce deletePatrolLogs maintenance script

https://gerrit.wikimedia.org/r/420214

For entries prior to April 2016, the script should check the log_params field to determine autopatrol status. It is absurd to kill all the manual patrol entries before then, because it would give the incorrect impression that the log has never existed prior to April 2016. Only autopatrols should be killed off. Otherwise, the manual patrols prior to April 2016 should be moved to a temporary "backup" table that will then be remerged back to the original logging table.

Hm… even if the script needs to check the status for all 300m rows in PHP that should be bearable… it might take a while, but I can't see a huge problem with that.

Hm… even if the script needs to check the status for all 300m rows in PHP that should be bearable… it might take a while, but I can't see a huge problem with that.

Agreed. I think if this is considered infeasible, then we should fallback to not deleting the older ones, and instead only purge everything after April 2016. That should still give us a big win, right? If we do want to delete older ones, we should do it properly and not delete actual records of distinct user contributions.

We have been talking about this and how we can do it the best way possible, since it's in the blob, a join is not possible. We can use the maintenance script to make batches on 1000 rows and check them, delete needed ones and move to the next batch. One other idea (that @Lucas_Werkmeister_WMDE brought up) was to do an analysis of rows that can be deleted using a script that we can run on stats machine (so we don't use too much from production database hosts) and save it on a file and move the file to terbium and delete rows using that file. I'm leaning towards the latter option.

@Ladsgroup That seems fine, but it might also make sense for this to be more re-usable and less manual/error-prone work. For example, we could have a maintenance script that goes through logging/patrol/patrol<2017 (as you mentioned) but instead of deleting things, it could SET log_action='autopatrol' for the ones it finds.

Then once that's done, we can purge autopatrols without special parameters.

Benefits:

  • The purge script will only need to focus on autopatrol.
  • We'll be able to re-use it for (eventually, low-priority) correcting older logs on other wikis to provide filtering and separation of patrol vs autopatrol.
  • We'll be able to re-use it when we want to do purging on another wiki in the future. (Without manual queries and deletion.)
  • It'll be usable by third-parties.
  • We could even add the patrol-marker script to the installer by default to run once on upgrades, so that small/third-party wikis will have all old autopatrol records properly marked as such.

That is a very good idea. I filed T190447: Create maintenance script to properly mark old autopatrol actions in logging table and I'm planning to tackle that the next.

@Ladsgroup I'm not sure how this relates to revision. Patrol logs have always varied between manual actions and automatic self-actions. The only difference is that pre-2016, it was just a variable in the interface message based on the serialised params blob. After that, in 2016 we introduce patrol/autopatrol as separate type-action pair for autopatrols from the regular patrol/patrol. This script would be to batch through the old patrol/patrol logs, and where the serialised blobs contains auto and is true, update log_action to be autopatrol. Basically, to correct the old logs.

We can then run that script separately from the purge script ahead of time, potentially also run the converter on other wikis where we don't want to do the purge. Plus, we can add the conversion script to the installer as an upgrade script for to help third-parties correct their logs as well.

All of this only makes sense, however, if we plan to keep support for autopatrol logs in some way. Because separating the scripts, allows us to run the conversion cleanup on all wikis, and for third-parties, whilst only disabling/purging on some wikis.

But, we have already approved the parent task (T184485) that intends to remove autopatrol logging more permanently. Not configurable. And that makes sense, given they don't add value, especially once the recentchanges issue is fixed to keep the distinction there.

As such, I agree we don't need the separation. We only need one script, and we should run it everywhere (eventually). And also part of the installer/upgrader, like we do for other cleanup/conversation scripts.

Change 423601 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/core@master] Add further test cases to deleteAutoPatrolLogsTest

https://gerrit.wikimedia.org/r/423601

Change 420214 merged by jenkins-bot:
[mediawiki/core@master] Introduce deleteAutoPatrolLogs maintenance script

https://gerrit.wikimedia.org/r/420214

Change 423601 merged by jenkins-bot:
[mediawiki/core@master] Add further test cases to deleteAutoPatrolLogsTest

https://gerrit.wikimedia.org/r/423601

hoo removed a project: Patch-For-Review.