Add a primary key to user_newtalk
Open, High, Public

Event Timeline

What should we call the primary key? un_id? unt_id? user_newtalk_id? I was thinking, maybe down the road we would want to use the un_ prefix for something more important.

CREATE TABLE `user_newtalk` (
  `user_id` int(5) NOT NULL DEFAULT '0',
  `user_ip` varbinary(40) NOT NULL DEFAULT '',
  `user_last_timestamp` varbinary(14) DEFAULT NULL,
  KEY `user_id` (`user_id`),
  KEY `user_ip` (`user_ip`)
) ENGINE=InnoDB DEFAULT CHARSET=binary ROW_FORMAT=DYNAMIC

Can we just use user_id as a PK?

Unfortunately not, as we have more than one row with the default of 0 (we use this table for anonymous users too).

If user_id is 0, we'll be storing user_ip.

If user_id != 0, user_ip seems to be '' (i.e. the default).
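
To make the collision concrete, here is a minimal sketch using SQLite via Python's sqlite3 as a stand-in for the production MySQL table: two anonymous users both land on the default user_id of 0, so a PRIMARY KEY on user_id alone would have rejected the second row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_newtalk (
        user_id INTEGER NOT NULL DEFAULT 0,
        user_ip TEXT NOT NULL DEFAULT '',
        user_last_timestamp TEXT
    )
""")
# Two anonymous users: both share the default user_id of 0,
# so user_id alone cannot uniquely identify a row.
conn.execute("INSERT INTO user_newtalk (user_id, user_ip) VALUES (0, '192.0.2.1')")
conn.execute("INSERT INTO user_newtalk (user_id, user_ip) VALUES (0, '192.0.2.2')")
rows = conn.execute(
    "SELECT COUNT(*) FROM user_newtalk WHERE user_id = 0"
).fetchone()[0]
print(rows)  # → 2; a PK on user_id alone would have rejected the second insert
```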

Composite of user_id and user_ip?

Probably yes, this table is a bit strange, but it will probably be refactored with an "actor_id" PK in the future (identifiers of both registered and unregistered users).

ALTER ... ADD PRIMARY KEY (user_id, user_ip), DROP KEY user_id;

There have been a lot of abuses of 0 and '' where NULL should have been used (see, for example: https://gerrit.wikimedia.org/r/337390 ). That is probably work for another day.

Want me to go ahead and add this PK in the meantime then?

Change 376068 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@master] [WIP] Add PRIMARY KEY to user_newtalk

https://gerrit.wikimedia.org/r/376068

Reedy triaged this task as Medium priority. Sep 5 2017, 6:33 PM

During all the consistency checks we have been doing across all the shards (T183735), this table has been one of the most "corrupted": as there is no PK, or even a UNIQUE key, so there are lots of duplicated rows.
@Reedy I would like to start working on this table soonish and start adding the PK at https://gerrit.wikimedia.org/r/#/c/376068/. Do you think we could have this merged sometime soon? (No need to rush, just trying to bring this back to the table.)

Thanks!

Need to add pg/oracle/mssql patches.. Then should be able to get this merged

mssql

-- Stores notifications of user talk page changes, for the display
-- of the "you have new messages" box
-- Changed user_id column to user_id to avoid clashing with user_id function
CREATE TABLE /*_*/user_newtalk (
   user_id INT         NOT NULL REFERENCES /*_*/mwuser(user_id) ON DELETE CASCADE,
   user_ip NVARCHAR(40) NOT NULL DEFAULT '',
   user_last_timestamp varchar(14) DEFAULT NULL
);
CREATE INDEX /*i*/un_user_id ON /*_*/user_newtalk (user_id);
CREATE INDEX /*i*/un_user_ip ON /*_*/user_newtalk (user_ip);

oracle

CREATE TABLE &mw_prefix.user_newtalk (
  user_id  NUMBER DEFAULT 0 NOT NULL,
  user_ip  VARCHAR2(40)        NULL,
  user_last_timestamp         TIMESTAMP(6) WITH TIME ZONE
);
ALTER TABLE &mw_prefix.user_newtalk ADD CONSTRAINT &mw_prefix.user_newtalk_fk1 FOREIGN KEY (user_id) REFERENCES &mw_prefix.mwuser(user_id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED;
CREATE INDEX &mw_prefix.user_newtalk_i01 ON &mw_prefix.user_newtalk (user_id);
CREATE INDEX &mw_prefix.user_newtalk_i02 ON &mw_prefix.user_newtalk (user_ip);

pgsql

CREATE TABLE user_newtalk (
  user_id              INTEGER      NOT NULL  REFERENCES mwuser(user_id) ON DELETE CASCADE DEFERRABLE INITIALLY DEFERRED,
  user_ip              TEXT             NULL,
  user_last_timestamp  TIMESTAMPTZ
);
CREATE INDEX user_newtalk_id_idx ON user_newtalk (user_id);
CREATE INDEX user_newtalk_ip_idx ON user_newtalk (user_ip);

so there are lots of duplicated rows. [..]

Duplicate or not

Yes, although note that those duplicates are most likely not the result of replication issues. Some of them could be, but I would expect most (or even all) to have been put there explicitly by MediaWiki, and they are not actually duplicates: they differ in the value of user_last_timestamp. The insertion uses INSERT ... IGNORE to avoid "real" duplicates by (user_id/user_ip, user_last_timestamp) - but that still remains unsafe for replication, of course, given the lack of a unique index.
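
The replication hazard can be illustrated with SQLite's INSERT OR IGNORE (the analogue of MySQL's INSERT IGNORE), via Python's sqlite3: without a unique index there is nothing for IGNORE to act on, so replaying the same statement twice silently duplicates the row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE no_unique (user_id INTEGER, ts TEXT)")
conn.execute("CREATE TABLE with_unique (user_id INTEGER, ts TEXT, UNIQUE (user_id, ts))")

for _ in range(2):
    # The same logical row inserted twice, e.g. replayed during replication.
    conn.execute("INSERT OR IGNORE INTO no_unique VALUES (7, '20170905000000')")
    conn.execute("INSERT OR IGNORE INTO with_unique VALUES (7, '20170905000000')")

dup = conn.execute("SELECT COUNT(*) FROM no_unique").fetchone()[0]
dedup = conn.execute("SELECT COUNT(*) FROM with_unique").fetchone()[0]
print(dup, dedup)  # → 2 1: only the unique index makes IGNORE actually skip the repeat
```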

This column was introduced in 2008 for T14701, which intended to help properly pluralize the word "messages" in the "You have unread messages" notification, by storing the timestamp of the oldest unread revision and figuring out at run-time how many revisions there are since then. (This uses timestamps rather than revision IDs, to account for race conditions, merges/undeletes/imports, etc.)

Single row

The field was designed and documented by @aaron in 3c60497de (r33520) as "The highest timestamp of revisions of the talk page viewed by this user". Unfortunately, when the patch to use the field was finally completed four years later (b082e920), it didn't get it right. Instead of using an upsert that updates the timestamp to the lower of the current and added revision, it always just adds a new row and uses MIN(user_last_timestamp) at run-time.
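
As a hedged sketch of what such an upsert could look like (SQLite ≥ 3.24 ON CONFLICT syntax via Python's sqlite3, registered users only; this is not the actual MediaWiki code): keep one row per user, holding the lowest timestamp seen so far.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_newtalk (
        user_id INTEGER PRIMARY KEY,
        user_last_timestamp TEXT
    )
""")

def record_newtalk(user_id, ts):
    # Single row per user: on conflict, keep the OLDEST unread timestamp
    # instead of appending a new row per notification.
    conn.execute(
        """INSERT INTO user_newtalk (user_id, user_last_timestamp)
           VALUES (?, ?)
           ON CONFLICT (user_id) DO UPDATE
           SET user_last_timestamp = MIN(user_last_timestamp, excluded.user_last_timestamp)""",
        (user_id, ts),
    )

record_newtalk(42, "20120301000000")
record_newtalk(42, "20080101000000")  # an older revision arriving late
record_newtalk(42, "20150601000000")
ts = conn.execute(
    "SELECT user_last_timestamp FROM user_newtalk WHERE user_id = 42"
).fetchone()[0]
print(ts)  # → 20080101000000: the oldest unread timestamp, held in a single row
```

The fixed-width MediaWiki timestamp format means lexicographic string comparison orders timestamps correctly.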

However, given the current nature of this, until that is fixed, we shouldn't add a unique index or primary key because it would break inserts for older revisions.

Multiple rows

Aside from the bug in the insertion logic, other logic has since also come to depend on there being multiple rows. In 2013 (dfc3e3df90a3 / T43759), logic to partially "mark as read" up to the currently viewed revision was implemented by deleting all the user_newtalk rows up to the viewed revision's timestamp. So that logic needs to be updated as well (to update the main row instead).

Earlier, @Reedy ran some queries, and (based on the above) it shouldn't be a surprise that we've got various popular bots that (obviously) never view their talk page and as such have accumulated ~20,000 rows in user_newtalk over the years, many of which actually have timestamp=null, which... is odd.

Nulls

The nulls are really odd because, the way the feature is written, nulls make no sense. Marking as read is implemented as deleting all rows, and detecting an unread message is implemented as selecting MIN(timestamp). With both MIN() and MAX(), nulls don't participate. It kind of makes sense that the field was made nullable so as not to break existing data, but I don't see why we'd ever actively insert a null.
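
A quick demonstration of that MIN()/MAX() behaviour (SQLite via Python's sqlite3, illustrative only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (ts TEXT)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [("20180101000000",), (None,), ("20170101000000",)])
# SQL aggregates skip NULLs entirely: the NULL row is invisible to both.
lo, hi = conn.execute("SELECT MIN(ts), MAX(ts) FROM t").fetchone()
print(lo, hi)  # → 20170101000000 20180101000000
```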

Of the handful of code paths that perform an insert, most take Revision as a required parameter. But there's one code path, relating to the 2013 partial mark-as-read feature, where it (unintentionally?) allows the timestamp to be null.

User::setNewtalk takes $val, $curRev = null. The main reason $curRev is optional is because this method is used both to insert and to delete rows. If $val is true, it means to insert a row, based on $curRev. If $val is false, it means to delete any rows. The two cases don't share much code, but for some reason, they go through this common method.

The problem is that, while unintended, there is a case where User::setNewtalk can be called with val=true and curRev=null. I haven't confirmed, but I guess it's due to User::clearNotification() at User.php#L3910-L3920:

User::clearNotification
// If there is a new, unseen, revision, use its timestamp
$nextid = $oldid
	? $title->getNextRevisionID( $oldid, Title::GAID_FOR_UPDATE )
	: null;
if ( $nextid ) {
	$this->setNewtalk( true, Revision::newFromId( $nextid ) );
}

Here Revision::newFromId can return null. I'm not sure how that could be the case. Maybe slave lag, but afaik our fallback from slave to master for those in RevisionStore would account for that. Unless it's not in the same request, but then ChronologyProtector? Unless it's from the other user, but then... how much lag would that require?

Running some more queries:

  • enwiki.user_newtalk: contains 9M rows, 7M have timestamp=null, 2M non-null. This distribution toward null may be due to old rows from before 2009 (although unlikely that many?).
  • wikidatawiki.user_newtalk (created after 2009): contains 24K rows, of which 17K null and 7K non-null.

This last data point suggests it isn't because of pre-2009 data, but rather a current issue. And running the count() queries for null a few times showed the number is definitely still increasing, even today.

@Krinkle @Reedy could you give this some priority so we can get a PK in place for user_newtalk? Right now we are dealing with recoveries on T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared"), and not having a PK here slows down the recoveries on that table a bit, so that's another reason to have a PK for it :-)

Thank you guys!

Marostegui raised the priority of this task from Medium to High. Oct 17 2018, 9:35 AM

Guess we need to decide how to architect it (as it's probably going to require some changes to MW)

Unless I am missing something, it is as easy as: can one column or a combination of columns be the PK? Add it. It cannot? => Add an autoinc. Why would that not work? We just need to identify individual rows for maintenance, nothing else.

No, we can't use any of the current columns (without changes to how MW uses them first); see Krinkle's reply from May.

We can just add an auto inc PK, as a simple way around it. Feels rather hacky though.

But there is a lot of data bloat in the table that simply doesn't need to be there too, which adds to the maintenance burden

I don't disagree, but the lack of a PK is causing maintenance and consistency issues right now; the extra data issue (which is annoying) can be fixed later, with more time. I would say the second is a different problem with lower priority, which of course should be kept in mind at the same time, but we have to draw a line somewhere to avoid small tasks becoming too large. But this is only my take.

Feels rather hacky though.

I just quoted from the "Official Jynus (TM) Guide to adding primary keys to production, volume 2"; I don't think it is hacky. Any future alter to that table will require a unique identifier first.

So user_id,user_ip as PK would require changes to MW?

Yeah, because user_id,user_ip doesn't end up being a unique identifier, unfortunately (many users have many rows)... And the code has been modified/made to work based on the fact that there are multiple rows...

An auto-inc will do; doing another schema change later to remove it or add extra columns etc. is "easy" (even removing the autoinc) once a PK is in place. Without one, some changes are technically impossible (they require an export and import to ensure consistency). INSERT IGNORE doesn't work, as some of you said, without a UNIQUE or PK restriction to skip duplicates.
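
To illustrate why a surrogate key is enough for maintenance, here is a sketch of a classic de-duplication pass (SQLite via Python's sqlite3; the un_id column name is hypothetical, standing in for the proposed auto_increment PK): once every physical row is addressable, dropping exact duplicates is a single statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_newtalk (
        un_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- hypothetical surrogate key
        user_id INTEGER NOT NULL DEFAULT 0,
        user_ip TEXT NOT NULL DEFAULT '',
        user_last_timestamp TEXT
    )
""")
rows = [(1, "", "20180101000000"),
        (1, "", "20180101000000"),   # exact duplicate, e.g. from a replication glitch
        (0, "192.0.2.1", None)]
conn.executemany(
    "INSERT INTO user_newtalk (user_id, user_ip, user_last_timestamp) VALUES (?, ?, ?)",
    rows)

# Keep the lowest un_id of every identical (user_id, user_ip, timestamp) group.
conn.execute("""
    DELETE FROM user_newtalk
    WHERE un_id NOT IN (
        SELECT MIN(un_id) FROM user_newtalk
        GROUP BY user_id, user_ip, user_last_timestamp
    )
""")
count = conn.execute("SELECT COUNT(*) FROM user_newtalk").fetchone()[0]
print(count)  # → 2: the exact duplicate is gone, distinct rows survive
```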

Ok, how urgently do you need/want this doing?

Change 467966 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@master] Add auto_increment PK to user_newtalk

https://gerrit.wikimedia.org/r/467966

T238168: Convert "user_newtalk" table to use actors would create a unique key for user_newtalk, which could be the primary key for this (it's never null).

@DannyS712 It would have duplicates if T146585#4233276 is still true.

I take that back - there would indeed still be duplicates

The only place that I could find user_last_timestamp being set is in User::updateNewtalk
To avoid duplicating rows (not technically "duplicate" rows, but rows for the same user/ip) while remaining compatible with the current data in the table, I suggest that updateNewtalk be modified to:

  1. Find the current oldest user_last_timestamp, in the same way that getNewMessageLinks queries for the minimum
  2. If there is a row for that user:
    1. Delete all rows for the relevant user that do not match that oldest
    2. Update the oldest with the new timestamp
  3. If not:
    1. Insert a row as currently done
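
The steps above could be sketched roughly as follows (SQLite via Python's sqlite3; the function name and the simplified registered-user-only schema are illustrative, not the actual User::updateNewtalk code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_newtalk (user_id INTEGER, user_last_timestamp TEXT)")

def update_newtalk(user_id, ts):
    # 1. Find the current oldest user_last_timestamp, the same way
    #    getNewMessageLinks queries for the minimum (MIN skips NULLs).
    oldest = conn.execute(
        "SELECT MIN(user_last_timestamp) FROM user_newtalk WHERE user_id = ?",
        (user_id,)).fetchone()[0]
    has_row = conn.execute(
        "SELECT 1 FROM user_newtalk WHERE user_id = ? LIMIT 1",
        (user_id,)).fetchone() is not None
    if has_row:
        # 2.1 Delete all rows for the user that do not match that oldest
        #     (IS NOT is the NULL-safe inequality in SQLite).
        conn.execute(
            "DELETE FROM user_newtalk "
            "WHERE user_id = ? AND user_last_timestamp IS NOT ?",
            (user_id, oldest))
        # 2.2 Update the remaining row with the new timestamp.
        conn.execute(
            "UPDATE user_newtalk SET user_last_timestamp = ? WHERE user_id = ?",
            (ts, user_id))
    else:
        # 3. No row yet: insert one, as currently done.
        conn.execute(
            "INSERT INTO user_newtalk (user_id, user_last_timestamp) VALUES (?, ?)",
            (user_id, ts))

update_newtalk(7, "20200101000000")
update_newtalk(7, "20200202000000")
count = conn.execute(
    "SELECT COUNT(*) FROM user_newtalk WHERE user_id = 7").fetchone()[0]
print(count)  # → 1: one row per user instead of one per notification
```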

@Krinkle would this solve the multiple row issue?

Change 376068 abandoned by Reedy:
Add PRIMARY KEY to user_newtalk

Reason:
Dumping this one per https://phabricator.wikimedia.org/T146585#4233276

https://gerrit.wikimedia.org/r/376068
