Decide the numerical order for temporary accounts (Scramble/Serial)
Closed, Resolved, Public

Description

Motivation

Temporary accounts were switched to being generated in a "scrambled" order. This task is to determine the preferred order for temporary accounts.
This came up in the discussion on T332805: Decide the prefix character for temporary usernames.

Relevant comments from that discussion:

The numbers aren't incrementing, they are pseudo-random (at least that's how the test setup is currently configured). They don't reset, but with pseudo-random numbers there is no apparent difference anyway.

@Niharika or @Tchanders could you please confirm? This whole time, the previous AHT designer and I had been operating under the understanding that it was an incrementing number. If that is not the case, can it be made so, for the benefits mentioned?

It was set to scramble a couple of weeks ago in this patch from the Growth Team: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/938915

@Tgr @Urbanecm_WMF - per the above, can we switch back from scramble to serial?

As of now, we not only use scramble; we also use multiple shards of the serial provider (that is, multiple sources of incrementing integers, with a source selected at random each time). The way this works is that each number source generates every Nth number (if we have three of them, the first one generates 1, 4, 7, ..., the second 2, 5, 8, ..., and the third 3, 6, 9, ...). This means that if we switched back to serial, the temporary account names probably wouldn't form a perfectly incrementing sequence. The following could be a perfectly valid sequence of temporary account names:

  • *Unregistered 1
  • *Unregistered 4
  • *Unregistered 2
  • *Unregistered 5
  • *Unregistered 3

Scrambling takes this a level up and makes the account names seemingly random. Unfortunately, merely switching back to serial wouldn't give us a perfectly incrementing series of account names, as illustrated above. I'm not sure how scrambling contributes to the interpretation: for big numbers, users probably won't notice minor ordering hiccups unless they're wrong by an order of magnitude.
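
To make the sharding behaviour described above concrete, here is a minimal Python sketch. It is not MediaWiki's serial provider; the `ShardedSerial` class and its in-memory counters are hypothetical stand-ins, assuming only what the comment states: N sources, each producing every Nth number, with a source picked at random per request.

```python
import random


class ShardedSerial:
    """Hypothetical stand-in for a sharded serial provider (not MediaWiki code)."""

    def __init__(self, num_shards, seed=None):
        self.num_shards = num_shards
        # counts[k] is how many numbers shard k has handed out so far.
        self.counts = [0] * num_shards
        self.rng = random.Random(seed)

    def acquire(self):
        # Pick a shard at random for each new temporary account.
        shard = self.rng.randrange(self.num_shards)
        # Shard k (0-based) yields k+1, k+1+N, k+1+2N, ...
        value = self.counts[shard] * self.num_shards + shard + 1
        self.counts[shard] += 1
        return value


provider = ShardedSerial(num_shards=3, seed=0)
# The names are unique, but not necessarily in increasing order.
print([f"*Unregistered {provider.acquire()}" for _ in range(5)])
```

Each shard's own numbers always increase, so the overall sequence is only locally out of order, which is the "minor ordering hiccups" case mentioned above.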

I'm not really sure about the technical reason for switching to scrambling. As for switching to multiple shards of the serial provider, my assumption is that using only one shard would put a lot of burden on a single counter shared across all wikis. The counter can't really be made local to each wiki, as we need to ensure the generated usernames are unique across all wikis (temp accounts can switch between projects, retaining the same temp account, just as regular users do, so we need to "reserve" their name on all projects).

@Tgr and @tstarling (who originally suggested switching to scrambling and increasing the shard count on the patch), please correct me if I'm mistaken in any part of the comment above.

Event Timeline

Niharika created this task.

@tstarling as the one who made this change would you be able to elaborate why you preferred the scramble method over the serial? Is there a benefit to using that?

I believe it was @Prtksxna who expressed the concern that consecutive numbers, with most of the digits being the same between users created at a similar time, would be too hard for reviewers to distinguish in a changes list. I suggested a pseudo-random sequence as a way of making the numbers be easier to visually distinguish.

When this is deployed to production, names like "Unregistered 1" and "Unregistered 2" will be gone in an eye blink. Think about what it will look like with 7 or 8 digit numbers.

I'm not really sure about the technical reason for switching to scrambling.

ScrambleMapping is just for humans, it's not meant to make anything better for computers. But beta should have the same configuration as we intend to use in production, so that potential production issues can be detected.
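
As an illustration of what "scrambling for humans" can mean, here is a small Python sketch of a one-to-one mapping that makes consecutive serial numbers look unrelated while keeping them unique. This is not how ScrambleMapping is implemented; the multiply-modulo-a-prime technique and the constants `P` and `A` are assumptions chosen purely for illustration.

```python
P = 10007  # a prime; the mapping permutes 1..P-1 (hypothetical range)
A = 7919   # multiplier in 1..P-1 (hypothetical choice)
A_INV = pow(A, -1, P)  # modular inverse of A (Python 3.8+)


def scramble(n):
    """Map a serial number in 1..P-1 to a different-looking number in 1..P-1."""
    if not 1 <= n < P:
        raise ValueError("n out of range")
    return (n * A) % P


def unscramble(m):
    """Recover the original serial number from a scrambled one."""
    return (m * A_INV) % P


# Consecutive inputs map to visually distinct outputs.
print([scramble(n) for n in range(1, 6)])
```

Because the mapping is a bijection, the underlying counter still guarantees uniqueness; only the displayed order changes.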

I prefer "serial" (even though it won't be perfectly in order; even though some numbers will be skipped). Because of the format (which we expect to reach approximately User:~2023-12345-678 in a typical year), the serial numbers in the middle of the year will look something like this:

User:~2023-61728-37
User:~2023-61728-38
User:~2023-61728-39
User:~2023-61728-40
User:~2023-61728-41

These are easy to tell apart, especially since you will not see all of them at one wiki. In the final weeks of the year, you will see something like this:

User:~2023-12345-676
User:~2023-12345-677
User:~2023-12345-678
User:~2023-12345-679
User:~2023-12345-680

which is a little bit more challenging but still IMO feasible.

I think I would feel very different about this if we weren't breaking up the numbers into chunks.
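
For reference, the chunking in the examples above can be reproduced with a short Python sketch. The `temp_username` helper is hypothetical, and the chunk size of five digits is inferred from the examples in this comment rather than taken from any configuration.

```python
def temp_username(year, serial, chunk=5):
    """Format a serial number as ~YEAR-DIGITS, split into chunks from the left.

    The chunk size of 5 is an assumption inferred from the examples above.
    """
    digits = str(serial)
    parts = [digits[i:i + chunk] for i in range(0, len(digits), chunk)]
    return f"~{year}-" + "-".join(parts)


print(temp_username(2023, 6172837))   # ~2023-61728-37
print(temp_username(2023, 12345678))  # ~2023-12345-678
```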

Thanks for this context. @RHo do you share Prateek's concerns?

No, my preference is the same as @Whatamidoing-WMF's: serial, because being able to tell the rough time period of temp account creation at a glance seems extremely useful. My hesitation is with the format: the number is split up by separators that are not thousands separators, which may lead temp editors to mistake it for an IP address identifier. But on the task's question of serial vs. random, I would go for serial (even if imperfect).

Adding @KColeman-WMF in case she has additional considerations from the patroller viewpoint.

No, my preference is the same as @Whatamidoing-WMF's: serial, because being able to tell the rough time period of temp account creation at a glance seems extremely useful.

If we want the username to reflect the time of creation, why not just include a fuller timestamp (e.g. ~2023-09-11~123 instead of ~2023-12345-678)?

My presumption was that the unique identifying number is still required, so including the fuller timestamp leads to a much longer and less distinguishable name. Unless, @Tgr, you're suggesting that this would reduce the length because the count would reset each day (e.g., 2023-09-11~123 and 2023-09-12~123)?

Currently the number does not reset, but I don't think there is any reason it couldn't. It would require a DB schema change but otherwise seems straightforward.

Granted it would be somewhat confusing because we would have to use the same timezone (UTC presumably) on all wikis, so the date in the username could be off by one compared to the actual registration date. Although that happens with the year-only scheme as well, just much more infrequently.

Maybe I'm missing something obvious, but why is that? The user-specific timezone is a no-go for obvious reasons, but if the database looked like this:

uas_shard   uas_date     uas_value
0           2023-09-11   223
0           2023-09-12   12

Then we can either get the ID from the first row (223) or the second row (12), based on which date it is according to the local timezone of whichever wiki the user edited on. Granted, this wouldn't help for multilingual wikis (where users from many timezones collaborate) or when a user is switching between projects (as the username is global).

It seems it would be easier for us to use UTC everywhere (to ensure temp accounts are numbered equally regardless of which project they formally come from), but logically and semantically, wiki timezone seems like a more reasonable choice to me.
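
The per-date counter idea above can be sketched in Python as follows. This is only an illustration: the `DailyCounter` class is a hypothetical in-memory stand-in for the table shown, the `~DATE~N` name format follows the example given earlier in this discussion, and fixed UTC offsets stand in for real wiki timezones.

```python
from datetime import datetime, timedelta, timezone


class DailyCounter:
    """Hypothetical in-memory stand-in for a (uas_shard, uas_date) -> uas_value table."""

    def __init__(self):
        self.rows = {}

    def acquire(self, now_utc, wiki_tz, shard=0):
        # Pick the row by the date in the wiki's timezone, not the user's.
        local_date = now_utc.astimezone(wiki_tz).date().isoformat()
        key = (shard, local_date)
        # The counter restarts at 1 whenever a new date row is first used.
        self.rows[key] = self.rows.get(key, 0) + 1
        return f"~{local_date}~{self.rows[key]}"


counter = DailyCounter()
now = datetime(2023, 9, 11, 23, 30, tzinfo=timezone.utc)
print(counter.acquire(now, timezone.utc))                  # ~2023-09-11~1
print(counter.acquire(now, timezone(timedelta(hours=2))))  # ~2023-09-12~1
```

Because rows are keyed by the date string rather than by wiki, two wikis that land on the same date share one counter, so names stay globally unique even though the same instant can produce different dates on different wikis.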

Yeah, you can match the timezone of the user's "home project", just not the timezone of other wikis they visit. That would have some weird side effects, such as out-of-order dates on loginwiki, but on the whole probably does not cause much problem.

Niharika claimed this task.

We have decided to go with serial instead of scramble.