
Create maintenance script for bad actor data cleanup
Open, Needs Triage, Public

Description

It is possible for an actor table row to contain no useful user information. This can cause errors and log noise. Code that uses actor information can check for this, but it is also helpful to clean up the bad data. Create a maintenance script that detects bad actor table rows and replaces them with references to "Unknown user".

As an example, the following is from live English Wikipedia as of July 10, 2019:

SELECT * FROM actor WHERE actor_id=36635239;

actor_id    actor_user    actor_name
36635239    NULL          (empty string)

This actor row has NULL as its actor_user value and an empty string as its actor_name value. It doesn't point at a user row, and it doesn't contain anything like an IP address either. As one example of how this can cause issues, when RevisionStore::newFromRow() calls User::newFromAnyId(), the only valid id it can supply is actor_id. A $user object is created, but no user id or name is set (or can possibly be found).

Because there is a unique index on actor_name, there will only ever be one such row per wiki.
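The effect of the unique index can be illustrated with a small sketch. It uses Python's sqlite3 module and an in-memory database as a stand-in for the wiki's real database, with a simplified actor table; the helper insert_second_empty_name() is hypothetical and exists only for this illustration.

```python
import sqlite3

# In-memory SQLite database with a simplified stand-in for the actor table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actor ("
    " actor_id INTEGER PRIMARY KEY,"
    " actor_user INTEGER,"
    " actor_name TEXT NOT NULL)"
)
# Unique index on actor_name (the index name here is illustrative).
conn.execute("CREATE UNIQUE INDEX actor_name ON actor (actor_name)")

# The first row with an empty name is accepted.
conn.execute("INSERT INTO actor (actor_user, actor_name) VALUES (NULL, '')")


def insert_second_empty_name(conn):
    """Try to add another empty-name row; return True only if it is accepted."""
    try:
        conn.execute("INSERT INTO actor (actor_user, actor_name) VALUES (NULL, '')")
    except sqlite3.IntegrityError:
        # The unique index rejects a duplicate empty string.
        return False
    return True
```

In this sketch the second insert is rejected, so at most one row with an empty actor_name can exist in the table.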

The maintenance script should:

  1. if a wiki already has an "Unknown user" row, replace all references to the bad actor id with references to the "Unknown user" actor_id, then delete the bad row. (For enwiki, the "Unknown user" actor_id is 188390315.)
  2. if a wiki doesn't already have an "Unknown user" row, update the bad actor row to instead have an actor_name of "Unknown user"
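The two cases above can be sketched as follows. This is only an illustration of the logic in Python with sqlite3, not a MediaWiki maintenance script. The function names, the demo schema, and the short ACTOR_REFERENCES list are assumptions made for the example; a real script would need to cover every table that stores an actor id. The sketch also assumes that case 2 means setting actor_name, since actor_user holds a numeric user id.

```python
import sqlite3

UNKNOWN_USER = "Unknown user"

# Assumption: an incomplete, illustrative list of (table, column) pairs
# that hold actor ids. A real script would need the full list.
ACTOR_REFERENCES = [
    ("revision_actor_temp", "revactor_actor"),
    ("logging", "log_actor"),
]


def find_bad_actor_id(conn):
    """Return the actor_id of the row with no user and an empty name, or None."""
    row = conn.execute(
        "SELECT actor_id FROM actor WHERE actor_user IS NULL AND actor_name = ''"
    ).fetchone()
    return row[0] if row else None


def clean_up_bad_actor(conn, references=ACTOR_REFERENCES):
    """Apply the two cases from the task description; return a short status."""
    bad_id = find_bad_actor_id(conn)
    if bad_id is None:
        return "nothing to do"
    row = conn.execute(
        "SELECT actor_id FROM actor WHERE actor_name = ?", (UNKNOWN_USER,)
    ).fetchone()
    with conn:  # one transaction: commit on success, roll back on error
        if row is None:
            # Case 2: no "Unknown user" row yet, so rename the bad row in place.
            conn.execute(
                "UPDATE actor SET actor_name = ? WHERE actor_id = ?",
                (UNKNOWN_USER, bad_id),
            )
            return "renamed"
        # Case 1: repoint every reference, then delete the bad row.
        for table, column in references:
            conn.execute(
                f"UPDATE {table} SET {column} = ? WHERE {column} = ?",
                (row[0], bad_id),
            )
        conn.execute("DELETE FROM actor WHERE actor_id = ?", (bad_id,))
        return "merged"


def build_demo_db(with_unknown_user):
    """Create a tiny in-memory database containing one bad actor row (id 2)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE actor (
            actor_id INTEGER PRIMARY KEY,
            actor_user INTEGER,
            actor_name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE revision_actor_temp (revactor_rev INTEGER, revactor_actor INTEGER);
        CREATE TABLE logging (log_id INTEGER, log_actor INTEGER);
        INSERT INTO actor VALUES (1, 10, 'Alice'), (2, NULL, '');
        INSERT INTO revision_actor_temp VALUES (100, 1), (101, 2);
        INSERT INTO logging VALUES (200, 2);
        """
    )
    if with_unknown_user:
        conn.execute("INSERT INTO actor VALUES (3, NULL, 'Unknown user')")
        conn.commit()
    return conn
```

Calling clean_up_bad_actor(build_demo_db(with_unknown_user=True)) exercises case 1, and passing with_unknown_user=False exercises case 2. Doing the repointing and the delete in one transaction means a failure partway through does not leave references pointing at a deleted row; on a large wiki a real script would more likely need to work in batches.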

More information and discussion in T224368.

Event Timeline

Assuming this task is about MediaWiki-Maintenance-system, hence adding the project tag so others can find this task under that project.

Unassigning myself as I am not currently working on this.