
Multiple user_ids per username in account creation events from ServerSideAccountCreation log
Closed, Resolved, Public

Description

There are ca. 14K account creation events generated in April (from both the mobile and the desktop site) in which the same user_name is associated with multiple user_ids in ServerSideAccountCreation:

SELECT event_userName, COUNT(event_userId) AS dupe, MIN(timestamp), MAX(timestamp)
FROM ServerSideAccountCreation_5487345
WHERE event_isSelfMade = 1 AND LEFT(timestamp, 6) = '201404'
GROUP BY 1
HAVING dupe > 1
ORDER BY dupe DESC;

(all events for April seem to have been generated between April 4 and April 24)

The problem goes back to the very beginning of the log (June 2013) and continues until now, for a total of 200K entries with multiple user_ids over the course of a year.

I haven't done any further investigation but this affects anyone counting usernames as opposed to distinct user_ids.


Version: unspecified
Severity: normal
Whiteboard: u=noone c=none p=0 s=none
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=39996

Details

Reference
bz66101

Event Timeline

bzimport raised the priority of this task to Medium (Nov 22 2014, 3:10 AM).
bzimport set Reference to bz66101.

Given that the webhosts are different, I do not think this is a bug: in that schema, userNames are not unique per ServerSideAccountCreation event but rather per event and webHost.

Please see:

select id, uuid,webHost from ServerSideAccountCreation_5487345 where event_userName='<removed>';

+---------+----------------------------------+------------------+
| id      | uuid                             | webHost          |
+---------+----------------------------------+------------------+
| 1969244 | 4b20276622cc59788f029aeb2099f9db | bg.wikipedia.org |
| 1970649 | cba01f0c458e590382b19ae4827b707d | www.wikidata.org |
+---------+----------------------------------+------------------+
2 rows in set (1.97 sec)

select id, uuid,webHost from ServerSideAccountCreation_5487345 where event_userName='<removed>';

+---------+----------------------------------+------------------+
| id      | uuid                             | webHost          |
+---------+----------------------------------+------------------+
| 1968557 | 14fb6a363ac25f92bdd9efa4ed9ab40c | mr.wikipedia.org |
| 2320001 | f6b428c53ca450458d33074775c6a035 | ar.wikipedia.org |
+---------+----------------------------------+------------------+
2 rows in set (2.16 sec)
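
To illustrate the point about webhosts, here is a minimal sketch that uses an in-memory SQLite table as a stand-in for ServerSideAccountCreation_5487345. The real table lives in MySQL; the table name `ssac`, the rows, and the username 'Example' are made up for illustration. Grouping by username alone reports a duplicate, while grouping by username and webHost does not.

```python
import sqlite3

# In-memory stand-in for the EventLogging table; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ssac ("
    "id INTEGER, event_userName TEXT, event_userId INTEGER, webHost TEXT)"
)
conn.executemany(
    "INSERT INTO ssac VALUES (?, ?, ?, ?)",
    [
        (1, "Example", 101, "bg.wikipedia.org"),
        (2, "Example", 5507, "www.wikidata.org"),
        (3, "Other", 42, "ar.wikipedia.org"),
    ],
)

# Grouping by username only: the same name appears with two distinct user_ids.
dupes_by_name = conn.execute(
    "SELECT event_userName, COUNT(DISTINCT event_userId) "
    "FROM ssac GROUP BY event_userName "
    "HAVING COUNT(DISTINCT event_userId) > 1"
).fetchall()

# Grouping by username and webHost: each (name, host) pair has one user_id.
dupes_by_name_and_host = conn.execute(
    "SELECT event_userName, webHost, COUNT(DISTINCT event_userId) "
    "FROM ssac GROUP BY event_userName, webHost "
    "HAVING COUNT(DISTINCT event_userId) > 1"
).fetchall()

print(dupes_by_name)
print(dupes_by_name_and_host)
```

With these invented rows, the first query returns one row for 'Example' and the second returns nothing.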

I reached out to Chris Steipp and James Forrester to hear from them about SUL implications. Event IDs should be unique by design so that's as expected. Here I am referring to new user_names associated to multiple user_ids (which suggests that users still have the ability to register on multiple sites with the same username). If I get a confirmation that this is indeed related to the way SUL is implemented, I'll move the ticket to the corresponding owner.

Here I am referring to new user_names associated to multiple user_ids

Right, we see that in the records (the same user name for two user_ids), but that would not be a bug in the EL data; rather, it would be one in the account creation process itself.

Now, there is a bug regarding encoding and storage in that table:
non-ASCII characters are stored as '???', which produces false matches. Anyone grabbing usernames from there has likely run into encoding issues.

Many records are of this type:
???????? ?????? ?????????

I have filed a bug for the encoding issue:

https://bugzilla.wikimedia.org/show_bug.cgi?id=66123
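
A minimal sketch of why this produces false matches. It assumes the replacement happens when a name is coerced to a character set that cannot represent it; the actual storage path in EventLogging is not shown in this thread, and the two usernames below are made up. Distinct non-ASCII names of the same length collapse to the same string of question marks.

```python
# Two different, made-up usernames written in Cyrillic script.
name_a = "Пример"
name_b = "Привет"

# Encoding to ASCII with errors="replace" substitutes "?" for every
# character ASCII cannot represent, similar to what the table shows.
stored_a = name_a.encode("ascii", errors="replace").decode("ascii")
stored_b = name_b.encode("ascii", errors="replace").decode("ascii")

print(stored_a, stored_b)
# The original names differ, but their stored forms compare equal.
print(name_a == name_b, stored_a == stored_b)
```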

If you're pulling those user_ids from the local wikis, then it's totally expected for them to have different user_ids across wikis. The text of the username is how we relate accounts.

Each wiki will assign a new user_id (user.user_id in the local wiki db) to a username when the user is created or autocreated there-- sequential for that wiki. When a user is created the first time, they should get a CentralAuth global id (globaluser.gu_id in the centralauth database). There shouldn't be a way for the same username to have multiple gu_id's in the centralauth db. If that's happening, we have a very, very bad problem. But gu_name is declared as a unique key in the database, so someone would have had to manually update our centralauth db for that to happen.
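
A minimal sketch of the relationship described above, using SQLite in place of the real databases. The table names `enwiki_user` and `dewiki_user` are hypothetical stand-ins for the `user` table in each wiki's own database, the schemas are heavily simplified, and the usernames are invented. It shows per-wiki sequential ids and the unique key on gu_name rejecting a second global account with the same name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-ins: one local user table per wiki, plus the central table.
conn.executescript(
    """
    CREATE TABLE enwiki_user (
        user_id INTEGER PRIMARY KEY AUTOINCREMENT, user_name TEXT UNIQUE);
    CREATE TABLE dewiki_user (
        user_id INTEGER PRIMARY KEY AUTOINCREMENT, user_name TEXT UNIQUE);
    CREATE TABLE globaluser (
        gu_id INTEGER PRIMARY KEY AUTOINCREMENT, gu_name TEXT UNIQUE);
    """
)

# Each wiki hands out its own sequential user_id, so ids differ across wikis.
conn.execute("INSERT INTO enwiki_user (user_name) VALUES ('Someone else')")
conn.execute("INSERT INTO enwiki_user (user_name) VALUES ('Example')")
conn.execute("INSERT INTO dewiki_user (user_name) VALUES ('Example')")
en_id = conn.execute(
    "SELECT user_id FROM enwiki_user WHERE user_name = 'Example'"
).fetchone()[0]
de_id = conn.execute(
    "SELECT user_id FROM dewiki_user WHERE user_name = 'Example'"
).fetchone()[0]

# The global account is keyed by name; a second row with that name is rejected.
conn.execute("INSERT INTO globaluser (gu_name) VALUES ('Example')")
try:
    conn.execute("INSERT INTO globaluser (gu_name) VALUES ('Example')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

print(en_id, de_id, duplicate_allowed)
```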

I'm aware that local wikis have different user_ids, but should I still expect new users to be able to register on multiple wikis with the same user_name and without their account being automatically unified (and a record being created in the centralauth DB)? Shouldn't all new accounts be unified by default?

For example:

SELECT * FROM ServerSideAccountCreation_5487345 WHERE event_username = 'Jlmcnamara';

has 3 separate account creation events in 2014 (on enwikisourcewiki, commonswiki, specieswiki) but no record in globaluser.

Under what conditions does this happen?
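
A minimal sketch of the kind of check described above: finding usernames that have account creation events but no row in globaluser. In reality the log table and the centralauth table live in different databases, so this single-database join is illustrative only. The table name `ssac`, the `wiki` column, and all rows are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ssac (event_userName TEXT, wiki TEXT);
    CREATE TABLE globaluser (gu_id INTEGER PRIMARY KEY, gu_name TEXT UNIQUE);
    INSERT INTO ssac VALUES
        ('Example', 'commonswiki'),
        ('Example', 'specieswiki'),
        ('Unified', 'commonswiki');
    INSERT INTO globaluser (gu_name) VALUES ('Unified');
    """
)

# LEFT JOIN keeps every creation event; rows with no matching global
# account have NULL in g.gu_id and are the ones we want to list.
missing_global = conn.execute(
    """
    SELECT s.event_userName, COUNT(*) AS creation_events
    FROM ssac AS s
    LEFT JOIN globaluser AS g ON g.gu_name = s.event_userName
    WHERE g.gu_id IS NULL
    GROUP BY s.event_userName
    """
).fetchall()

print(missing_global)
```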

Sorry, I didn't understand what you were saying. If no global account is being created, that's definitely a SUL bug. I don't think we intentionally allow that anywhere. I'll look into it.

Could we move this bug to the corresponding category? It does not seem to be a bug in EL but rather in the account creation process itself.

u=nuria@wikimedia.org c=Wikimetrics p=0 s=2014-05-29

u=nuria@wikimedia.org c=EventLogging p=0 s=2014-05-29

(In reply to nuria from comment #8)

u=nuria@wikimedia.org c=EventLogging p=0 s=2014-05-29

-> I've put this into the Whiteboard field so it can get picked up

(In reply to Dario Taraborelli from comment #5)

Shouldn't all new accounts be
unified by default?

Should, but are not: see bug 39996 and friends. There are over 15k broken accounts on Meta-Wiki only, see bug 61876.

In https://bugzilla.wikimedia.org/show_bug.cgi?id=39996#c73 Aaron mentioned,

Look at the addUser() method, which has the line:

if ( !$central->exists() && !$central->listUnattached() ) {
...
}

Which I think is actually directly related to _this_ bug. This check means that for someone registering a new account, if there isn't a global account, but there are unattached accounts with this same name, the account isn't created in centralauth-- it's kept as a local only account.

I believe this is the behavior causing the issue that Analytics saw, right?

I think there's a question as to what the correct behavior should be in this case.
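
A minimal sketch of the quoted condition, to make the cases explicit. Only the `if` test itself is modeled; the body of the block is elided in the original snippet and is not reproduced here. The function name `passes_add_user_check` and its parameters are hypothetical and are not part of CentralAuth.

```python
from typing import Sequence


def passes_add_user_check(central_exists: bool, unattached: Sequence[str]) -> bool:
    """Model of: !$central->exists() && !$central->listUnattached()."""
    # True only when there is no global account AND no unattached
    # local accounts with the same name on other wikis.
    return not central_exists and not unattached


# No global account and no unattached local accounts: the check passes.
fresh_name = passes_add_user_check(False, [])

# No global account, but the name already exists locally elsewhere:
# the check fails, which per the comment above leaves a local-only account.
name_taken_locally = passes_add_user_check(False, ["commonswiki"])

# A global account already exists: the check fails as well.
name_taken_globally = passes_add_user_check(True, [])

print(fresh_name, name_taken_locally, name_taken_globally)
```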

You probably shouldn't be able to register an account name if someone already has a global account with that name. You should need to pick a new name.

(In reply to Aaron Halfaker from comment #12)

You probably shouldn't be able to register an account name if someone
already has a global account with that name. You should need to pick a new
name.

Chris's question was about a different scenario, when there isn't a global account, but there are local accounts on other wikis with the same name.

However, I think it's still better to not allow creation in this scenario. This should reduce the number of account renames or people getting stuck with Example~xywiki usernames when we do Single User Finalization.

(In reply to Matthew Flaschen from comment #13)

However, I think it's still better to not allow creation in this scenario.
This should reduce the number of account renames or people getting stuck
with Example~xywiki usernames when we do Single User Finalization.

If we prevent the local account from being created at this point, then any users who can't currently globalize their account (someone else with more edits has their name on another wiki, and hasn't globalized the name for some reason), then the user is prevented from doing any cross-wiki work. I'm not seeing a good way to resolve that pre-finalization.

Once we do finalize, merging one more ~wiki account isn't much extra work on our end.

(In reply to Chris Steipp from comment #14)

If we prevent the local account from being created at this point, then any
users who can't currently globalize their account (someone else with more
edits has their name on another wiki, and hasn't globalized the name for
some reason), then the user is prevented from doing any cross-wiki work.

You're right. I didn't think of that. They could still create an account, of course, but they'd have to have different usernames across the cluster, which is definitely not ideal.

Once we do finalize, merging one more ~wiki account isn't much extra work on
our end.

True.

@csteipp can you give us an update on this?

@kevinator SUL Finalization is a ways off. We'll be starting it this spring, but the estimates for when we can call it "done" range up to several months.

Until we finalize, my opinion hasn't changed-- we can't stop creating local-only accounts without either causing more work for ourselves or hurting long-time editors.

Thanks Chris. No rush, we were just wondering what the timeline looked like.

@Tnegrin: This issue was assigned to you a while ago. Could you please share a status update? Are you still working (or still planning to work) on this issue? Is there anything that others could help with? If you do not plan to work on this issue anymore, please remove yourself as assignee (via Add Action...Assign / Claim in the dropdown menu) so others can work on it. Thanks a lot!

Nuria removed a subscriber: Tnegrin.
Nuria added a subscriber: Tgr.

Not sure who owns this now, but likely not Toby. Removing him, adding @Tgr just in case he knows, and removing the analytics tag.

@Nuria it's not clear what needs to be owned here. Do you want to clean up past logs somehow? Or do you think this is still happening?

The initial problem is a bunch of accounts generated with the same username, found via logging data in EventLogging; that seems to be a problem. Whether it is still happening I could not say, but I imagine someone owns account creation and can verify whether that is still the case. If so, it should probably be fixed (if possible; it might be spam).

The query does not give me any dupes. Presumably the issue is not happening anymore and old EventLogging entries have already been cleaned up. I don't think there is anything to do here.