Page MenuHomePhabricator

Refactor "user" & "user_text" fields into "actor" reference table
Open, Needs TriagePublic

Description

Subtask split out from T161671 'Compacting the revision table', to refactor the (id-or-0, name-or-address) tuples in revision, logging, image, and other tables through a common reference to an actor table (actor as in one who acts).

This makes those references a fixed size: 64-bit bigints instead of a 32-bit int followed by a variable size string. Moving the name-or-address out of the key also means we avoid duplications of the name-or-address string between multiple rows when a single user makes many edits, and make relevant indexes for by-user lookups smaller.

End state

Database new tables:

--
-- Revision authorship is represented by an actor row.
-- This avoids duplicating common usernames/IP addresses
-- in the revision table, and makes rename processing quicker.
--
CREATE TABLE /*_*/actor (
  -- Unique ID to identify each entry
  actor_id bigint unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,

  -- Key to user.user_id of the user who made this edit.
  -- Stores NULL for anonymous edits and for some mass imports.
  actor_user int unsigned,

  -- Text username or IP address of the editor.
  actor_name varchar(255) binary NOT NULL default ''
) /*$wgDBTableOptions*/;

CREATE UNIQUE INDEX /*i*/actor_user ON /*_*/actor (actor_user);
CREATE UNIQUE INDEX /*i*/actor_name ON /*_*/actor (actor_name);

Database changes tables:

  • revision adds rev_actor, and drops rev_user & rev_user_text
  • archive adds ar_actor, and drops ar_user & ar_user_text
  • image adds img_actor, and drops img_user & img_user_text
  • oldimage adds oi_actor, and drops oi_user & oi_user_text
  • filearchive adds fa_actor, and drops fa_user & fa_user_text
  • recentchanges adds rc_actor, and drops rc_user & rc_user_text
  • logging adds log_actor, and drops log_user & log_user_text

Migration state

To avoid blocking on changes to the revision table (so we can coordinate a large schema update), revision and image (FIXME: check if we need to do this on others) will gain temporary sibling tables to hold the reference:

CREATE TABLE /*_*/revision_actor_temp (
  revactor_rev bigint unsigned NOT NULL,     -- → revision.rev_id
  revactor_actor bigint unsigned NOT NULL, -- → actor.actor_id
  PRIMARY KEY (revactor_rev, revactor_actor)
) /*$wgDBTableOptions*/;

-- will this one work sanely? we have no img_id key srsly?
CREATE TABLE /*_*/image_actor_temp (
  imgactor_img varchar(255) binary NOT NULL,     -- → image.img_name
  imgactor_actor bigint unsigned NOT NULL, -- → actor.actor_id
  PRIMARY KEY (imgactor_img, imgactor_actor)
) /*$wgDBTableOptions*/;

MediaWiki changes and migration

Add feature flag for migration states:

  • Read and write old columns only
  • Write both old and new columns. Read from new preferentially, falling back to old.
  • Write only new columns. Read from new preferentially, falling back to old.
  • Read and write the new columns only.

ActorMigration (?) class to manage state & offer methods -- coordinate the basic layout of this with our similar changes at T166732 & friends!

  • use outer joins on reads to map in the extra temporary tables when used, expose the necessary query options.
  • helper function for query setup when searching for a specific user or address's references
  • helper function for storing to the right columns and whether two inserts are needed for temp tables.

Further work:

  • find and fix uses of the direct fields for lookups to use the abstraction
  • probably want a maintenance script to do forced migration that can be run manually in production scenarios or run automatically from the default updaters for small installations

Potential pitfalls

  • There may be more direct usages than expected and they may be a pain to fix
  • There might be "surprise" usages of user_text fields or indexes for querying multiple lookups
  • apiqueryrevisions ;)
  • Are there any problems with still duplicating a username between actor.actor_name and user.user_name?
  • Migration may be harder than expected.
  • Still need to plan the 'final megamigration' from revision's friend tables to merged columns

Details

SubjectRepoBranchLines +/-
mediawiki/extensions/Translatemaster+226 -74
mediawiki/extensions/UserMergeREL1_31+103 -9
mediawiki/extensions/WikimediaMaintenancemaster+2 -2
mediawiki/extensions/UserMergemaster+103 -9
mediawiki/extensions/CentralAuthREL1_31+24 -9
mediawiki/extensions/CentralAuthmaster+24 -9
mediawiki/coremaster+5 K -1 K
mediawiki/extensions/Renameusermaster+111 -35
mediawiki/extensions/FlaggedRevsREL1_31+246 -96
mediawiki/extensions/FlaggedRevsmaster+246 -96
mediawiki/extensions/CentralNoticemaster+31 -20
mediawiki/extensions/SecurePollmaster+238 -141
mediawiki/extensions/UniversalLanguageSelectormaster+29 -10
mediawiki/extensions/CheckUsermaster+31 -39
mediawiki/extensions/WikimediaEventsmaster+21 -14
mediawiki/extensions/Wikibasemaster+8 -5
mediawiki/extensions/OAuthmaster+6 -5
mediawiki/extensions/UploadWizardmaster+14 -3
mediawiki/extensions/WikimediaMaintenancemaster+42 -18
mediawiki/extensions/PageTriagemaster+10 -6
mediawiki/extensions/Nukemaster+19 -5
mediawiki/extensions/CleanChangesmaster+12 -10
mediawiki/extensions/LiquidThreadsmaster+50 -10
mediawiki/extensions/Thanksmaster+4 -3
mediawiki/extensions/VisualEditormaster+24 -15
mediawiki/extensions/MobileFrontendmaster+15 -17
mediawiki/extensions/AbuseFiltermaster+22 -10
mediawiki/extensions/ORESmaster+45 -11
mediawiki/extensions/ContentTranslationmaster+9 -3
mediawiki/extensions/Echomaster+5 -3
mediawiki/extensions/Flowmaster+96 -80
mediawiki/coremaster+366 -14
mediawiki/coremaster+25 -0
Show related patches Customize query in gerrit

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 472509 had a related patch set uploaded (by Paladox; owner: Anomie):
[mediawiki/extensions/CentralAuth@REL1_31] Update for the actor table change

https://gerrit.wikimedia.org/r/472509

Change 472509 merged by jenkins-bot:
[mediawiki/extensions/CentralAuth@REL1_31] Update for the actor table change

https://gerrit.wikimedia.org/r/472509

Change 481814 had a related patch set uploaded (by Thiemo Kreuz (WMDE); owner: MR70):
[mediawiki/extensions/WikimediaMaintenance@master] Fix: use selectRow instead of selectField in UnsuppressCrossWiki.

https://gerrit.wikimedia.org/r/481814

Change 482110 had a related patch set uploaded (by Paladox; owner: Anomie):
[mediawiki/extensions/UserMerge@REL1_31] Update for the actor table change

https://gerrit.wikimedia.org/r/482110

Change 481814 merged by jenkins-bot:
[mediawiki/extensions/WikimediaMaintenance@master] Fix: use selectRow instead of selectField in UnsuppressCrossWiki.

https://gerrit.wikimedia.org/r/481814

Change 482110 merged by jenkins-bot:
[mediawiki/extensions/UserMerge@REL1_31] Update for the actor table change

https://gerrit.wikimedia.org/r/482110

@Pchelolo awww maaaan, we called this performer in event schemas. Now we'll have two separate names for the same thing. :(

@Pchelolo awww maaaan, we called this performer in event schemas. Now we'll have two separate names for the same thing. :(

The name comes from an RFC that was discussed two years ago, see https://www.mediawiki.org/wiki/User:Brion_VIBBER/Compacting_the_revision_table_round_2. The table with that name was merged into core in February 2018, https://gerrit.wikimedia.org/r/c/mediawiki/core/+/380669.

The RFC process is intended to avoid such divergence... I wonder what we could do to make that work better.

Aye, hm not sure. The event schema fields were bikeshed with myself, Marko, Petr, Dan Andreescu, Aaron Halfakar and Joseph Allemandou as part of T134502: Propose evolution of Mediawiki EventBus schemas to match needed data for Analytics need and in https://gerrit.wikimedia.org/r/#/c/mediawiki/event-schemas/+/301284/ in July of 2016. IIRC, we decided to use the performer name since that was how Mediawiki code referred to it. How did yall settle on actor?

I don't follow all RFCs, and I'm not sure if Marko or Dan were on the TechComm at the time.

Not sure how "actor" happened, to be honest. May have been a brainstorming during an IRC meeting.
I now realize that MW core uses "performer" in LogEntry and RecentChange. Bummer.

Yar yeah really is too bad. Neither are easy to change now. I guess we'll just ride into the sunset with the different names, like we do with article vs page?

Yar yeah really is too bad. Neither are easy to change now. I guess we'll just ride into the sunset with the different names, like we do with article vs page?

*sigh*

Probably my fault, too. "Actor" sounds like something I'd come up with ;)

Hey all, just refactored a couple tools to work with the new schema, wasn't too bad. Sad about the performer thing, though I have to admit I'm more used to "actor" from other codebases.

One other thing I noticed: The bot flag seems like something intrinsic to the actor, not the action, but rc_bot was left on the recentchanges table. Also doesn't appear deprecated. Any plans there?

The bot flag seems like something intrinsic to the actor, not the action, but rc_bot was left on the recentchanges table. Also doesn't appear deprecated. Any plans there?

That's not always true:

  • When you revert an edit, if you have the markbotedits permission, you can revert "as a bot", hiding from recent changes both your rollback action and the edit(s) affected by your rollback. See https://www.mediawiki.org/wiki/Manual:Administrators#Rollback
  • A user can be added to the bot group, but that doesn't mean past edits should be hidden from recent changes. And the opposite, removing a user from the bot group.

Hence, some edits may be marked as a bot while the actor is not in the bot group, or vice versa.

One other thing I noticed: The bot flag seems like something intrinsic to the actor, not the action, but rc_bot was left on the recentchanges table. Also doesn't appear deprecated. Any plans there?

That's T19237.

One other thing I noticed: The bot flag seems like something intrinsic to the actor, not the action, but rc_bot was left on the recentchanges table. Also doesn't appear deprecated. Any plans there?

The terminology in use is a bit confusing. The term "bot flag" might refer to any of

  1. There's the rc_bot column in recentchanges, which is used for the "bot edits" filter in Special:RecentChanges and Special:Watchlist.
  2. There's the "bot" user right, which enables setting rc_bot in the recentchanges entry when making an action.
    • Via index.php, this makes edits be flagged by default. Via the Action API, edits are not flagged by default, the client has to specify &bot=1 or the like in their query.
  3. There's the "bot" group, which grants the "bot" user right as well as several others (e.g. "apihighlimits").

When you say "the bot flag seems like something intrinsic to the actor" you're probably thinking about #2 or #3, while rc_bot is #1.

Ah, makes sense, thanks @Anomie, @Nirmos, and @Ciencia_Al_Poder! Not sure if you want to add that either to the rc_bot part of the recentchanges table docs, but that's good info! Thanks again!

Should user_newtalk also use actors?

Current
--
-- Stores notifications of user talk page changes, for the display
-- of the "you have new messages" box
--
CREATE TABLE /*_*/user_newtalk (
  -- Key to user.user_id
  user_id int unsigned NOT NULL default 0,
  -- If the user is an anonymous user their IP address is stored here
  -- since the user_id of 0 is ambiguous
  user_ip varbinary(40) NOT NULL default '',
  -- The highest timestamp of revisions of the talk page viewed
  -- by this user
  user_last_timestamp varbinary(14) NULL default NULL
) /*$wgDBTableOptions*/;
-- Indexes renamed for SQLite in 1.14
CREATE INDEX /*i*/un_user_id ON /*_*/user_newtalk (user_id);
CREATE INDEX /*i*/un_user_ip ON /*_*/user_newtalk (user_ip);
Proposed
--
-- Stores notifications of user talk page changes, for the display
-- of the "you have new messages" box
--
CREATE TABLE /*_*/user_newtalk (
  -- Key to actor.actor_id
  actor_id int unsigned NOT NULL default 0,
  -- The highest timestamp of revisions of the talk page viewed
  -- by this user
  user_last_timestamp varbinary(14) NULL default NULL
) /*$wgDBTableOptions*/;
CREATE INDEX /*i*/un_actor_id ON /*_*/user_newtalk (actor_id);

Changes:

  • user_newtalk adds actor_id, and drops user_id & user_ip

Should user_newtalk also use actors?

Maybe. That should probably be proposed as a separate task.

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips how to manage individual work in Phabricator (noisy notifications, lists of task, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the records, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)

I'm a bit confused now, how is it supposed to get the first uploader of the file now?

I'm a bit confused now, how is it supposed to get the first uploader of the file now?

By joining on img_actor = actor_id, then you can use the actor_name and actor_user fields.

We worked on this at the hackathon a bit and I think we figured it out on our usecase (the first uploader is sometimes in the image table and sometimes in the oldimage table, so you have to use both. From what I understand, the image table always contains the uploader of the current version, and then the oldimage table only contains uploaders of previous versions).

We worked on this at the hackathon a bit and I think we figured it out on our usecase (the first uploader is sometimes in the image table and sometimes in the oldimage table, so you have to use both.

You are right of course.

That schema is terrible, but hasn't (yet) caused enough problems to warrant changing it.

We worked on this at the hackathon a bit and I think we figured it out on our usecase (the first uploader is sometimes in the image table and sometimes in the oldimage table, so you have to use both.

You are right of course.

That schema is terrible, but hasn't (yet) caused enough problems to warrant changing it.

See also: T28741: Migrate file tables to a modern layout (image/oldimage; file/file_revision; add primary keys)

@daniel do you know if there is more to do for this specific task, or can it be closed in favor of more narrow (sub)tasks?