Page MenuHomePhabricator

Refactor "user" & "user_text" fields into "actor" reference table
Open, Needs TriagePublic

Description

Subtask split out from T161671 'Compacting the revision table', to refactor the (id-or-0, name-or-address) tuples in revision, logging, image, and other tables through a common reference to an actor table (actor as in one who acts).

This makes those references a fixed size: 64-bit bigints instead of a 32-bit int followed by a variable size string. Moving the name-or-address out of the key also means we avoid duplications of the name-or-address string between multiple rows when a single user makes many edits, and make relevant indexes for by-user lookups smaller.

End state

Database new tables:

--
-- Revision authorship is represented by an actor row.
-- This avoids duplicating common usernames/IP addresses
-- in the revision table, and makes rename processing quicker.
--
CREATE TABLE /*_*/actor (
  -- Unique ID to identify each entry
  actor_id bigint unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,

  -- Key to user.user_id of the user who made this edit.
  -- Stores NULL for anonymous edits and for some mass imports.
  actor_user int unsigned,

  -- Text username or IP address of the editor.
  actor_text varchar(255) binary NOT NULL default ''
) /*$wgDBTableOptions*/;

CREATE UNIQUE INDEX /*i*/actor_user ON /*_*/actor (actor_user);
CREATE UNIQUE INDEX /*i*/actor_text ON /*_*/actor (actor_text);

Database changes tables:

  • revision adds rev_actor, and drops rev_user & rev_user_text
  • archive adds ar_actor, and drops ar_user & ar_user_text
  • image adds img_actor, and drops img_user & img_user_text
  • oldimage adds oi_actor, and drops oi_user & oi_user_text
  • filearchive adds fa_actor, and drops fa_user & fa_user_text
  • recentchanges adds rc_actor, and drops rc_user & rc_user_text
  • logging adds log_actor, and drops log_user & log_user_text

Migration state

To avoid blocking on changes to the revision table (so we can coordinate a large schema update), revision and image (FIXME: check if we need to do this on others) will gain temporary sibling tables to hold the reference:

CREATE TABLE /*_*/revision_actor_temp (
  revactor_rev bigint unsigned NOT NULL,     -- → revision.rev_id
  revactor_actor bigint unsigned NOT NULL, -- → actor.actor_id
  PRIMARY KEY (revactor_rev, revactor_actor)
) /*$wgDBTableOptions*/;

-- will this one work sanely? we have no img_id key srsly?
CREATE TABLE /*_*/image_actor_temp (
  imgactor_img varchar(255) binary NOT NULL,     -- → image.img_name
  imgactor_actor bigint unsigned NOT NULL, -- → actor.actor_id
  PRIMARY KEY (imgactor_img, imgactor_actor)
) /*$wgDBTableOptions*/;

MediaWiki changes and migration

Add feature flag for migration states:

  • Read and write old columns only
  • Write both old and new columns. Read from new preferentially, falling back to old.
  • Write only new columns. Read from new preferentially, falling back to old.
  • Read and write the new columns only.

ActorMigration (?) class to manage state & offer methods -- coordinate the basic layout of this with our similar changes at T166732 & friends!

  • use outer joins on reads to map in the extra temporary tables when used, expose the necessary query options.
  • helper function for query setup when searching for a specific user or address's references
  • helper function for storing to the right columns and whether two inserts are needed for temp tables.

Further work:

  • find and fix uses of the direct fields for lookups to use the abstraction
  • probably want a maintenance script to do forced migration that can be run manually in production scenarios or run automatically from the default updaters for small installations

Potential pitfalls

  • There may be more direct usages than expected and they may be a pain to fix
  • There might be "surprise" usages of user_text fields or indexes for querying multiple lookups
  • apiqueryrevisions ;)
  • Are there any problems with still duplicating a username between actor.actor_name and user.user_name?
  • Migration may be harder than expected.
  • Still need to plan the 'final megamigration' from revision's friend tables to merged columns

Related Objects

StatusAssignedTask
OpenNone
OpenNone
OpenNone
OpenNone
OpenNone
OpenAnomie
OpenAnomie
ResolvedAnomie
ResolvedNone
ResolvedMarostegui
ResolvedBstorm
DeclinedNone
ResolvedAnomie
ResolvedAnomie
OpenNone
OpenAnomie
OpenAnomie
ResolvedJdforrester-WMF
OpenCiencia_Al_Poder
OpenNone
OpenNone
ResolvedMarostegui

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 419520 merged by jenkins-bot:
[mediawiki/extensions/Renameuser@master] Update for the actor table change

https://gerrit.wikimedia.org/r/419520

Change 421120 merged by jenkins-bot:
[mediawiki/extensions/WikimediaMaintenance@master] Update for the actor table change

https://gerrit.wikimedia.org/r/421120

Change 420843 merged by jenkins-bot:
[mediawiki/extensions/UploadWizard@master] Update for the actor table change

https://gerrit.wikimedia.org/r/420843

Change 421115 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] Update for the actor table change

https://gerrit.wikimedia.org/r/421115

Change 420842 merged by jenkins-bot:
[mediawiki/extensions/UniversalLanguageSelector@master] Update for the actor table change

https://gerrit.wikimedia.org/r/420842

Change 418981 merged by jenkins-bot:
[mediawiki/extensions/OAuth@master] Update for the actor table change

https://gerrit.wikimedia.org/r/418981

Change 421129 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Update for the actor table change

https://gerrit.wikimedia.org/r/421129

Change 418149 merged by jenkins-bot:
[mediawiki/extensions/CheckUser@master] Update for the actor table change

https://gerrit.wikimedia.org/r/418149

Any relation as mentioned here T193254#4165780?

Any relation as mentioned here T193254#4165780?

Seems unlikely.

Change 420840 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Update for the actor table change

https://gerrit.wikimedia.org/r/420840

daniel added a subscriber: daniel.

Removing MCR tag, since this is orthogonal.

Change 418974 merged by jenkins-bot:
[mediawiki/extensions/FlaggedRevs@master] Update for the actor table change

https://gerrit.wikimedia.org/r/418974

Change 418146 merged by jenkins-bot:
[mediawiki/extensions/CentralAuth@master] Update for the actor table change

https://gerrit.wikimedia.org/r/418146

Change 420418 merged by jenkins-bot:
[mediawiki/extensions/SecurePoll@master] Update for the actor table change

https://gerrit.wikimedia.org/r/420418

Change 418148 merged by jenkins-bot:
[mediawiki/extensions/CentralNotice@master] Update for the actor table change

https://gerrit.wikimedia.org/r/418148

Will this task (and/or the parent task T161671) allow us to do cross-searches between page and user? For example: "Give me the edits on Wikipedia:Village_pump_(technical) by Nirmos".

Anomie added a comment.Jun 4 2018, 4:26 PM

Will this task (and/or the parent task T161671) allow us to do cross-searches between page and user? For example: "Give me the edits on Wikipedia:Village_pump_(technical) by Nirmos".

No. The most it might do is make the SQL queries for "Give me the edits on Wikipedia:Village_pump_(technical) by Nirmos" and "Give me the edits on Wikipedia:Village_pump_(technical) by 192.0.2.34" match, instead of one probably using rev_user and the other using rev_user_text.

Also, you can do that already, at least via the API: https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Wikipedia:Village%20pump%20(technical)&rvuser=Nirmos

Change 437372 had a related patch set uploaded (by Paladox; owner: Anomie):
[mediawiki/extensions/FlaggedRevs@REL1_31] Update for the actor table change

https://gerrit.wikimedia.org/r/437372

Change 420854 merged by jenkins-bot:
[mediawiki/extensions/UserMerge@master] Update for the actor table change

https://gerrit.wikimedia.org/r/420854

Change 437372 abandoned by Paladox:
Update for the actor table change

https://gerrit.wikimedia.org/r/437372

Change 472509 had a related patch set uploaded (by Paladox; owner: Anomie):
[mediawiki/extensions/CentralAuth@REL1_31] Update for the actor table change

https://gerrit.wikimedia.org/r/472509

Change 472509 merged by jenkins-bot:
[mediawiki/extensions/CentralAuth@REL1_31] Update for the actor table change

https://gerrit.wikimedia.org/r/472509

Change 481814 had a related patch set uploaded (by Thiemo Kreuz (WMDE); owner: MR70):
[mediawiki/extensions/WikimediaMaintenance@master] Fix: use selectRow instead of selectField in UnsuppressCrossWiki.

https://gerrit.wikimedia.org/r/481814

Change 482110 had a related patch set uploaded (by Paladox; owner: Anomie):
[mediawiki/extensions/UserMerge@REL1_31] Update for the actor table change

https://gerrit.wikimedia.org/r/482110

Change 481814 merged by jenkins-bot:
[mediawiki/extensions/WikimediaMaintenance@master] Fix: use selectRow instead of selectField in UnsuppressCrossWiki.

https://gerrit.wikimedia.org/r/481814

Change 482110 merged by jenkins-bot:
[mediawiki/extensions/UserMerge@REL1_31] Update for the actor table change

https://gerrit.wikimedia.org/r/482110

@Pchelolo awww maaaan, we called this performer in event schemas. Now we'll have two separate names for the same thing. :(

daniel added a comment.EditedApr 8 2019, 4:22 PM

@Pchelolo awww maaaan, we called this performer in event schemas. Now we'll have two separate names for the same thing. :(

The name comes from an RFC that was discussed two years ago, see https://www.mediawiki.org/wiki/User:Brion_VIBBER/Compacting_the_revision_table_round_2. The table with that name was merged into core in February 2018, https://gerrit.wikimedia.org/r/c/mediawiki/core/+/380669.

The RFC process is intended to avoid such divergence... I wonder what we could do to make that work better.

Ottomata added a comment.EditedApr 8 2019, 8:38 PM

Aye, hm not sure. The event schema fields were bikeshed with myself, Marko, Petr, Dan Andreescu, Aaron Halfakar and Joseph Allemandou as part of T134502: Propose evolution of Mediawiki EventBus schemas to match needed data for Analytics need and in https://gerrit.wikimedia.org/r/#/c/mediawiki/event-schemas/+/301284/ in July of 2016. IIRC, we decided to use the performer name since that was how Mediawiki code referred to it. How did yall settle on actor?

I don't follow all RFCs, and I'm not sure if Marko or Dan were on the TechComm at the time.

daniel added a comment.Apr 9 2019, 9:14 AM

Not sure how "actor" happened, to be honest. May have been a brainstorming during an IRC meeting.
I now realize that MW core uses "performer" in LogEntry and RecentChange. Bummer.

Yar yeah really is too bad. Neither are easy to change now. I guess we'll just ride into the sunset with the different names, like we do with article vs page?

daniel added a comment.EditedApr 9 2019, 2:26 PM

Yar yeah really is too bad. Neither are easy to change now. I guess we'll just ride into the sunset with the different names, like we do with article vs page?

*sigh*

Probably my fault, too. "Actor" sounds like something I'd come up with ;)

mahmoud added a subscriber: mahmoud.Jun 8 2019, 7:50 PM

Hey all, just refactored a couple tools to work with the new schema, wasn't too bad. Sad about the performer thing, though I have to admit I'm more used to "actor" from other codebases.

One other thing I noticed: The bot flag seems like something intrinsic to the actor, not the action, but rc_bot was left on the recentchanges table. Also doesn't appear deprecated. Any plans there?

Jar added a subscriber: Jar.Jun 9 2019, 1:31 AM

The bot flag seems like something intrinsic to the actor, not the action, but rc_bot was left on the recentchanges table. Also doesn't appear deprecated. Any plans there?

That's not always true:

  • When you revert an edit, if you have the markbotedits permission, you can revert "as a bot", hiding from recent changes both your rollback action and the edit(s) affected by your rollback. See https://www.mediawiki.org/wiki/Manual:Administrators#Rollback
  • A user can be added to the bot group, but that doesn't mean past edits should be hidden from recent changes. And the opposite, removing a user from the bot group.

Hence, some edits may be marked as a bot while the actor is not in the bot group, or vice versa.

One other thing I noticed: The bot flag seems like something intrinsic to the actor, not the action, but rc_bot was left on the recentchanges table. Also doesn't appear deprecated. Any plans there?

That's T19237.

One other thing I noticed: The bot flag seems like something intrinsic to the actor, not the action, but rc_bot was left on the recentchanges table. Also doesn't appear deprecated. Any plans there?

The terminology in use is a bit confusing. The term "bot flag" might refer to any of

  1. There's the rc_bot column in recentchanges, which is used for the "bot edits" filter in Special:RecentChanges and Special:Watchlist.
  2. There's the "bot" user right, which enables setting rc_bot in the recentchanges entry when making an action.
    • Via index.php, this makes edits be flagged by default. Via the Action API, edits are not flagged by default, the client has to specify &bot=1 or the like in their query.
  3. There's the "bot" group, which grants the "bot" user right as well as several others (e.g. "apihighlimits").

When you say "the bot flag seems like something intrinsic to the actor" you're probably thinking about #2 or #3, while rc_bot is #1.

Ah, makes sense, thanks @Anomie, @Nirmos, and @Ciencia_Al_Poder! Not sure if you want to add that either to the rc_bot part of the recentchanges table docs, but that's good info! Thanks again!