Decide the numerical order for temporary accounts (Scramble/Serial)
Closed, Resolved · Public

Authored By

	• Niharika
	Aug 30 2023, 2:09 PM

Description

Motivation

Temporary accounts were switched to being generated in a "scrambled" order. This task is to determine the preferred numbering order (scrambled or serial) for temporary accounts.
This came up in the discussion on T332805: Decide the prefix character for temporary usernames.

Relevant comments from that discussion:

In T332805#9075416, @Tgr wrote:

The numbers aren't incrementing, they are pseudo-random (at least that's how the test setup is currently configured). They don't reset, but with pseudo-random numbers there is no apparent difference anyway.

In T332805#9078131, @Tchanders wrote:

In T332805#9075501, @RHo wrote:

@Niharika or @Tchanders could you please confirm? This whole time myself and previous AHT designer had been operating under the understanding that it was an incrementing number. If not the case, can it be made so for the benefits mentioned?

It was set to scramble a couple of weeks ago in this patch from the Growth Team: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/938915

In T332805#9079420, @Urbanecm_WMF wrote:

In T332805#9079288, @RHo wrote:

@Tgr @Urbanecm_WMF - per above, can we reset to not scramble but serial?

We not only use scramble as of now; we also use multiple shards of the serial provider (that is, multiple sources of incrementing integers, one of which is selected at random each time). Each number source generates every Nth number: if we have three of them, the first one generates 1, 4, 7, ..., the second 2, 5, 8, ..., and the third 3, 6, 9, .... This means that if we switched back to serial, the temporary account names probably wouldn't form a perfectly incrementing sequence. The following could be a perfectly valid sequence of temporary account names:

*Unregistered 1

*Unregistered 4

*Unregistered 2

*Unregistered 5

*Unregistered 3

Scrambling takes this a level up and makes the account names seemingly random. Unfortunately, merely switching back to serial wouldn't give us a perfectly incrementing series of account names, as illustrated above. I'm not sure how scrambling contributes to the interpretation: for big numbers, users probably won't notice minor ordering hiccups unless the numbers are off by an order of magnitude.

I'm not really sure about the technical reason for switching to scrambling. As for switching to multiple shards of the serial provider, my assumption is that using only one shard would put a lot of load on a single counter shared across all wikis. The counter can't really be made local to each wiki, as we need to ensure the generated usernames are unique across all wikis (temp accounts can switch between projects while retaining the same temp account, just as regular users do, so we need to "reserve" their name on all projects).
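
As a rough illustration of the sharded serial provider described above, here is a minimal Python sketch. The class name `ShardedSerial` and its `acquire` method are hypothetical and are not MediaWiki's implementation; the sketch only assumes what the comment states, namely that shard i of N issues every Nth number and that a shard is chosen at random for each new account.

```python
import random


class ShardedSerial:
    """Hypothetical sharded counter: shard i of N yields i+1, i+1+N, i+1+2N, ..."""

    def __init__(self, num_shards, rng=None):
        self.num_shards = num_shards
        # How many numbers each shard has issued so far.
        self.counts = [0] * num_shards
        self.rng = rng or random.Random()

    def acquire(self):
        # Pick one number source at random, as described in the comment.
        shard = self.rng.randrange(self.num_shards)
        value = self.counts[shard] * self.num_shards + shard + 1
        self.counts[shard] += 1
        return value


if __name__ == "__main__":
    provider = ShardedSerial(3)
    # Values are unique, but not necessarily in increasing order.
    print([provider.acquire() for _ in range(5)])
```

Each shard's own sequence is strictly increasing and no two shards can issue the same number, so uniqueness holds without a single shared counter; only the global ordering is lost.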

@Tgr and @tstarling (who originally suggested switching to scrambling and increasing the shard count on the patch), please correct me if I'm mistaken in any part of the comment above.

Related Objects

Status	Assigned	Task
In Progress	• Niharika	T324492 Temporary accounts - MVP
Open	• Niharika	T345760 [Epic] Temporary username format
Resolved	• Niharika	T345255 Decide the numerical order for temporary accounts (Scramble/Serial)

Event Timeline

• Niharika triaged this task as Medium priority. Aug 30 2023, 2:09 PM

• Niharika created this task.

Restricted Application added a subscriber: Aklapper. Aug 30 2023, 2:09 PM

• Niharika mentioned this in T332805: Decide the prefix character for temporary usernames. Aug 30 2023, 2:19 PM

Titore subscribed. Aug 30 2023, 2:24 PM

@tstarling, as the one who made this change, would you be able to elaborate on why you preferred the scramble method over serial? Is there a benefit to using it?

I believe it was @Prtksxna who expressed the concern that consecutive numbers, with most of the digits being the same between users created at a similar time, would be too hard for reviewers to distinguish in a changes list. I suggested a pseudo-random sequence as a way of making the numbers be easier to visually distinguish.

When this is deployed to production, names like "Unregistered 1" and "Unregistered 2" will be gone in an eye blink. Think about what it will look like with 7 or 8 digit numbers.

I'm not really sure about the technical reason for switching to scrambling.

ScrambleMapping is just for humans, it's not meant to make anything better for computers. But beta should have the same configuration as we intend to use in production, so that potential production issues can be detected.
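
To show how a scrambled sequence can look random to humans while still never producing duplicate names, here is a minimal sketch. It uses an affine permutation modulo a power of ten, which is an assumption chosen for simplicity and is not the algorithm used by MediaWiki's ScrambleMapping; the function name `scramble` and its default parameters are hypothetical.

```python
from math import gcd


def scramble(serial, multiplier=38196601, offset=12345, modulus=10**8):
    """Map a serial number to a scrambled one without collisions.

    Assumption: an affine map (a*x + b) mod m, which is a bijection on
    0..m-1 whenever gcd(a, m) == 1. This is illustrative only.
    """
    if gcd(multiplier, modulus) != 1:
        raise ValueError("multiplier must be coprime to modulus")
    if not 0 <= serial < modulus:
        raise ValueError("serial out of range")
    return (serial * multiplier + offset) % modulus


if __name__ == "__main__":
    # Consecutive serials map to values that share few leading digits.
    for n in range(1000, 1005):
        print(n, "->", scramble(n))
```

Because the mapping is a bijection, it changes only how the numbers look; the underlying counter and its uniqueness guarantee stay the same.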

I prefer "serial" (even though it won't be perfectly in order; even though some numbers will be skipped). Because of the format (which we expect to reach approximately User:~2023-12345-678 in a typical year), the serial numbers in the middle of the year will look something like this:

User:~2023-61728-37
User:~2023-61728-38
User:~2023-61728-39
User:~2023-61728-40
User:~2023-61728-41

These are easy to tell apart, especially since you will not see all of them at one wiki. In the final weeks of the year, you will see something like this:

User:~2023-12345-676
User:~2023-12345-677
User:~2023-12345-678
User:~2023-12345-679
User:~2023-12345-680

which is a little bit more challenging but still IMO feasible.

I think I would feel very different about this if we weren't breaking up the numbers into chunks.
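
The chunked format used in the examples above can be sketched as follows. The helper name `format_temp_name` is hypothetical, and the rule of splitting the serial into groups of five digits from the left is an assumption inferred from the sample names (`~2023-61728-37`, `~2023-12345-678`), not a confirmed specification.

```python
def format_temp_name(year, serial, chunk=5):
    """Format a serial number as ~YEAR-DDDDD-DDD (hypothetical helper).

    Assumption: digits are grouped from the left in chunks of `chunk`.
    """
    if serial < 0:
        raise ValueError("serial must be non-negative")
    digits = str(serial)
    chunks = [digits[i:i + chunk] for i in range(0, len(digits), chunk)]
    return "~" + "-".join([str(year)] + chunks)


if __name__ == "__main__":
    # Mid-year and end-of-year serials from the comment above.
    for serial in (6172837, 6172838, 12345678, 12345679):
        print("User:" + format_temp_name(2023, serial))
```

With this grouping, consecutive serials differ only in the last chunk, which is the property the comment relies on when arguing that serial names remain distinguishable.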

In T345255#9132511, @tstarling wrote:

I believe it was @Prtksxna who expressed the concern that consecutive numbers, with most of the digits being the same between users created at a similar time, would be too hard for reviewers to distinguish in a changes list. I suggested a pseudo-random sequence as a way of making the numbers be easier to visually distinguish.

When this is deployed to production, names like "Unregistered 1" and "Unregistered 2" will be gone in an eye blink. Think about what it will look like with 7 or 8 digit numbers.

Thanks for this context. @RHo do you share Prateek's concerns?

• Niharika added a parent task: T345760: [Epic] Temporary username format. Sep 6 2023, 5:17 PM

In T345255#9146855, @Niharika wrote:

Thanks for this context. @RHo do you share Prateek's concerns?

No, my preference is the same as @Whatamidoing-WMF's: serial, for being able to tell the rough time period of temp account creation, which seems extremely useful to know at a glance. My hesitation is with the format: separating the number into groups that are not thousands separators may lead temp editors to mistake it for an IP address identifier. But on the task's question of serial vs. random, I would go for serial (even if imperfect).

Adding @KColeman-WMF in case she has additional considerations from the patroller viewpoint.

In T345255#9155923, @RHo wrote:

No, my preference is the same as @Whatamidoing-WMF's: serial, for being able to tell the rough time period of temp account creation, which seems extremely useful to know at a glance.

If we want the username to reflect the time of creation, why not just include a fuller timestamp (e.g. ~2023-09-11~123 instead of ~2023-12345-678)?

In T345255#9156978, @Tgr wrote:

If we want the username to reflect the time of creation, why not just include a fuller timestamp (e.g. ~2023-09-11~123 instead of ~2023-12345-678)?

My presumption was that the unique identifying number is still required, so including the fuller timestamp leads to a much longer and less distinguishable name. Unless, @Tgr, you're suggesting that this would reduce the length because the count would reset each day (e.g., 2023-09-11~123 and 2023-09-12~123)?

Currently the number does not reset, but I don't think there is any reason it couldn't. It would require a DB schema change but otherwise seems straightforward.

Granted it would be somewhat confusing because we would have to use the same timezone (UTC presumably) on all wikis, so the date in the username could be off by one compared to the actual registration date. Although that happens with the year-only scheme as well, just much more infrequently.

In T345255#9157205, @Tgr wrote:

Granted it would be somewhat confusing because we would have to use the same timezone (UTC presumably) on all wikis, so the date in the username could be off by one compared to the actual registration date. Although that happens with the year-only scheme as well, just much more infrequently.

Maybe I'm missing something obvious, but why is that? The user-specific timezone is a no-go for obvious reasons, but if the database looked like this:

uas_shard	uas_date	uas_value
0	2023-09-11	223
0	2023-09-12	12

Then we can get the ID from either the first row (223) or the second row (12), depending on which date it is in the local timezone of the wiki the user edited on. Granted, this wouldn't help for multilingual wikis (where users from many timezones collaborate) or when a user switches between projects (as the username is global).

It seems it would be easier for us to use UTC everywhere (to ensure temp accounts are numbered equally regardless of which project they formally come from), but logically and semantically, wiki timezone seems like a more reasonable choice to me.
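
A minimal sketch of the per-date counter discussed above, assuming rows keyed by shard and date as in the table. The class `DailyCounter`, its in-memory dictionary, and the fixed UTC offset per wiki are all hypothetical simplifications; a real implementation would use the database table and proper timezone handling. The name format follows the `~2023-09-11~123` example suggested earlier in the thread.

```python
from datetime import datetime, timedelta, timezone


class DailyCounter:
    """Hypothetical in-memory stand-in for the (uas_shard, uas_date, uas_value) table."""

    def __init__(self):
        self.rows = {}  # (shard, ISO date string) -> last value issued

    def acquire(self, now_utc, utc_offset_hours=0, shard=0):
        # Convert the timezone-aware UTC time to the wiki's local date.
        # Assumption: each wiki has a fixed UTC offset (no DST handling).
        local = now_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
        key = (shard, local.date().isoformat())
        self.rows[key] = self.rows.get(key, 0) + 1
        return f"~{key[1]}~{self.rows[key]}"


if __name__ == "__main__":
    counter = DailyCounter()
    now = datetime(2023, 9, 11, 23, 30, tzinfo=timezone.utc)
    print(counter.acquire(now))                      # wiki on UTC
    print(counter.acquire(now, utc_offset_hours=2))  # wiki two hours ahead
```

Because the date is part of both the row key and the generated name, two wikis in different timezones can draw from different rows at the same instant without producing the same username.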

Yeah, you can match the timezone of the user's "home project", just not the timezone of other wikis they visit. That would have some weird side effects, such as out-of-order dates on loginwiki, but on the whole probably does not cause much problem.

Ltrlg subscribed. Sep 14 2023, 8:33 AM

Tchanders mentioned this in T345855: Update temporary username format. Oct 23 2023, 11:10 AM

Tchanders mentioned this in T349501: Update temporary user names to start their counter again each year. Oct 23 2023, 12:17 PM

Tchanders mentioned this in T349503: Update the serial mapping config for generating temporary user names on beta. Oct 23 2023, 12:37 PM

We have decided to go with serial instead of scramble.
