Page MenuHomePhabricator

Ensure Linker::userLink is not called with empty user name
Closed, ResolvedPublic2 Estimate Story Points

Description

When fixing T222529, Linker::userLink was changed to warn when called with an empty user name (change Id65bdf9666). Since then, the log has recorded numerous such warnings from several code paths (see graphana). CodeSearch turns up over 80 usages that potentially have this problem.

The logs and callers need to be surveyed and any problematic calls fixed. From the logs, it seems to following two are the most problematic:

NOTE: the Logstash link will only function for 30 days, until June 25! New occurrences are not logged any more due to T224050 (change I93826e486951e9).

Event Timeline

daniel created this task.May 26 2019, 1:38 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 26 2019, 1:38 PM
WDoranWMF set the point value for this task to 2.May 28 2019, 7:21 PM
BPirkle claimed this task.May 30 2019, 4:22 AM
daniel updated the task description. (Show Details)Jun 4 2019, 3:22 PM

The first item in the description turns out to be actual bad data rather than user suppression. @daniel , I dug into this further after we discussed it in our 1:1. My initial analysis failed to consider actor migration. It turns out there is less bad data than I originally thought. Specifically, there is exactly one bad row of this particular type. Details follow.

The page I evaluated, taken from the logs, is https://en.wikipedia.org/w/index.php?title=Gaspard-Gustave_de_Coriolis&oldid=3775415
You can see the text "(no username available)(now username available)" in the red box near the top of the page.

Here's a somewhat simplified version of the query used to load the revision row containing (among other things) the user info for that page:

SELECT rev_id,rev_page,rev_timestamp, actor_rev_user.actor_user AS rev_user,actor_rev_user.actor_name AS rev_user_text, temp_rev_user.revactor_actor AS rev_actor,user_name FROM revision FORCE INDEX (page_timestamp) JOIN revision_comment_temp temp_rev_comment ON ((temp_rev_comment.revcomment_rev = rev_id)) JOIN comment comment_rev_comment ON ((comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id)) JOIN revision_actor_temp temp_rev_user ON ((temp_rev_user.revactor_rev = rev_id)) JOIN actor actor_rev_user ON ((actor_rev_user.actor_id = temp_rev_user.revactor_actor)) LEFT JOIN user ON ((actor_rev_user.actor_user != 0) AND (user_id = actor_rev_user.actor_user)) WHERE rev_page = '13126' AND rev_id=3775415 ORDER BY rev_timestamp DESC LIMIT 51

rev_idrev_pagerev_timestamprev_userrev_user_textrev_actoruser_name
37754151312620040413150416NULL36635239NULL

As you can see, both rev_user and rev_user_text are missing.
To confirm why this is happening in a more readable way:

SELECT * FROM actor WHERE actor_id=36635239;

actor_idactor_useractor_name
36635239NULL

Here is our bad data. We have an actor row that doesn't point at a user. When RevisionStore::newFromRow calls User::newFromAnyId(), the only valid id it can supply is actor_id. A $user object is created, but no user id or name is set (or can possibly be found).

Okay, so how many unhappy rows like this do we have in the actor table? Let's be bold and select all of them:

SELECT * FROM actor WHERE ISNULL(actor_user) AND actor_name = '';

actor_idactor_useractor_name
36635239NULL

Oh, look. There's just the one. It was fortunate that someone hit a page involving that user while logging was turned on.

Now, what do we do about it? We discussed making RevisionStore::newRevisionFromRow() assign the "Unknown user" in this case. But I'm not sure how to do that.

We could have RevisionStore::newRevisionFromRow() require $row to have a user id and user name. But isn't the whole point of User::newFromAnyId() that $row only needs a subset of the information? I'm not up to speed on actor migration, but I'm guessing that $row having only an actor id and not a user id or user name is a desirable goal. So checking for things we'd like to remove seems like a step backward.

We could call $user->load() in RevisionStore::newRevisionFromRow() and check that we have the user id/name, but that would hurt performance for cases where that info wasn't needed. And it seems like an awkward place to make that check.

User::load() seems like a central place to detect this. But it already falls back to User::loadDefaults(), even if the actor row is missing entirely. So it looks like we've intentionally decided that load() should just do the best it can with what it has.

Linker::revUserTools() could check these values before constructing the link. But that puts responsibility for the consistency of a user object onto its callers, which feels messy. Still, this may be the best option. After thinking way too long about this, I'm going to go with this unless someone says otherwise.

daniel added a subscriber: Anomie.Jun 5 2019, 8:26 PM

Linker::revUserTools() could check these values before constructing the link. But that puts responsibility for the consistency of a user object onto its callers, which feels messy. Still, this may be the best option. After thinking way too long about this, I'm going to go with this unless someone says otherwise.

It's probably the best we can do right now, but I agree that it's less than ideal. I feel like getUserFromanyId should fail in this case. I'm not sure this row in the database should even exist.

@Anomie, do you have an opinion on this?

Anomie added a comment.Jun 6 2019, 4:31 PM

Okay, so how many unhappy rows like this do we have in the actor table?

Just one, because there's a unique index on actor_name.

I'm not up to speed on actor migration, but I'm guessing that $row having only an actor id and not a user id or user name is a desirable goal. So checking for things we'd like to remove seems like a step backward.

It would probably be nice to not have to always join with actor. Although getting to the point that RevisionStore::getQueryInfo() could stop doing that by default might take a while for the deprecation process.

We could call $user->load() in RevisionStore::newRevisionFromRow() and check that we have the user id/name, but that would hurt performance for cases where that info wasn't needed. And it seems like an awkward place to make that check.

+1 to the "but", this isn't a good solution.

User::load() seems like a central place to detect this. But it already falls back to User::loadDefaults(), even if the actor row is missing entirely. So it looks like we've intentionally decided that load() should just do the best it can with what it has.

Basically, it trusts that the name it was given is the name it should use. That makes it work reasonably for IPs and such that don't have a user table row.

Linker::revUserTools() could check these values before constructing the link. But that puts responsibility for the consistency of a user object onto its callers, which feels messy. Still, this may be the best option. After thinking way too long about this, I'm going to go with this unless someone says otherwise.

Meh. Ok short-term solution, anyway.

I feel like getUserFromanyId should fail in this case.

It could fail when directly passed empty-string for $userName, sure, but that wouldn't really solve the problem since if passed null the empty string would just be loaded from the actor row later. So probably not really a good solution.

I'm not sure this row in the database should even exist.

That's the best option: create and run a maintenance script to clean the database up. In the enwiki case it would have to replace all references to actor_id 36635239 with actor_id 188390315, then delete the former row. If a wiki doesn't already have "Unknown user" as an actor, it could just update the actor row directly. We might also have MediaWiki refuse to reinsert an empty-string row.

Ideally we'd have had migrateActors.php turn empty-string into "Unknown user" during the migration, but it's too late for that for Wikimedia now.

I'll post the short-term fix to Linker::revUserTools() here. I also created T225469 for @Anomie 's maintenance script suggestion.

Change 516440 had a related patch set uploaded (by BPirkle; owner: BPirkle):
[mediawiki/core@master] Check for valid user info before calling Linker::userLink() in Linker::revUserTools()

https://gerrit.wikimedia.org/r/516440

daniel updated the task description. (Show Details)Jul 1 2019, 12:53 PM

Change 516440 merged by jenkins-bot:
[mediawiki/core@master] Check for valid user info before calling Linker::userLink() in Linker::revUserTools()

https://gerrit.wikimedia.org/r/516440

Change 520494 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] LogFormatter::formatParameterValue: handle bad user names

https://gerrit.wikimedia.org/r/520494

Change 520494 merged by jenkins-bot:
[mediawiki/core@master] LogFormatter::formatParameterValue: handle bad user names

https://gerrit.wikimedia.org/r/520494

daniel closed this task as Resolved.Jul 5 2019, 1:35 PM
daniel updated the task description. (Show Details)