Password length check should count unicode characters
Open, Needs Triage, Public

Description

...not bytes.

Raised at mw:Topic:Upvztrigdixy2mga. One good point made there is that counting Unicode code points is unfair towards ideogram/ideograph-based languages, where a single character carries much more entropy (and an 8-character sentence is presumably harder to remember).

Event Timeline

Using mb_strlen instead of strlen would probably get us 60% of the way to something reasonable.
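To make the difference concrete, here is a minimal sketch in Python (not the PHP the comment refers to, and not MediaWiki code). It assumes the password arrives as UTF-8: the byte count is what a byte-oriented check such as strlen would see, and the code point count is what mb_strlen would report for UTF-8 input. The helper names are made up for illustration.

```python
def byte_length(password: str) -> int:
    """Length in bytes of the UTF-8 encoding (what a byte-based check sees)."""
    return len(password.encode("utf-8"))


def code_point_length(password: str) -> int:
    """Length in Unicode code points (Python's len() on a str)."""
    return len(password)


# Sample passwords written with escapes so the code points are unambiguous:
# ASCII only, Latin with two precomposed umlauts, and four CJK ideographs.
samples = ["password", "p\u00e4ssw\u00f6rd", "\u5bc6\u7801\u5bc6\u7801"]

for sample in samples:
    print(sample, byte_length(sample), code_point_length(sample))
```

The four-ideograph sample is 12 bytes but only 4 code points, so a byte-based minimum of 8 accepts it while a code-point-based minimum of 8 does not.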

An alternative view: maybe using non-ASCII characters, which are less likely to appear in a cracking dictionary, increases your entropy enough to counteract the bad counting of characters (no idea if that's true or not).

The internet also claims grapheme_strlen is a thing (https://secure.php.net/manual/en/function.grapheme-strlen.php), but it didn't seem to work when I tested locally...
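For a feel of what a grapheme-style count changes, here is a rough Python sketch. It is only an approximation and not what grapheme_strlen does: it counts code points that are not combining marks, so it handles a base letter plus accent but gets emoji joined by zero-width joiners, Hangul jamo sequences, and other cases from the full Unicode segmentation rules wrong. The function name is hypothetical.

```python
import unicodedata


def approx_grapheme_length(text: str) -> int:
    """Count code points whose canonical combining class is 0.

    Combining marks (class != 0) attach to the preceding base character,
    so skipping them approximates a user-perceived character count.
    This is NOT full grapheme cluster segmentation.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
print(len(decomposed), approx_grapheme_length(decomposed))
```

The decomposed "é" is two code points but is counted once here, which is the kind of discrepancy a grapheme-based check would remove.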

The SecLists top million file only contains two passwords with non-ASCII characters, so against dictionary attacks they seem pretty strong.

The other issue is that users understand characters better than bytes. But I guess as long as any discrepancies with the stated rule are on the permissive side, that's not really a problem.

In the context of Unicode discussions, “characters” is an ill-advised term because some code points, such as U+FFFE, are explicitly defined as “non-characters”.

Also, what about passwords which are not valid UTF-8 strings? Will something containing \300 or \301 be rejected on this ground?

Also, what about passwords which are not valid UTF-8 strings?

They are great ways of getting locked out of the site as soon as some low-level detail of input parsing changes.

In the context of Unicode discussions, “characters” is an ill-advised term because some code points, such as U+FFFE, are explicitly defined as “non-characters”.

Also, what about passwords which are not valid UTF-8 strings? Will something containing \300 or \301 be rejected on this ground?

I haven't tested, but I imagine the normal input normalization would apply (convert to NFC; anything that is not valid UTF-8 gets changed to the replacement character).
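The behaviour guessed at above can be sketched in Python as follows. This is an illustration of the two steps named in the comment (replace invalid UTF-8 with U+FFFD, then convert to NFC), not the actual MediaWiki input handling, which is untested here as well; the function name is hypothetical.

```python
import unicodedata


def normalize_password_input(raw: bytes) -> str:
    """Decode bytes as UTF-8, then normalize to NFC.

    Byte sequences that are not valid UTF-8 become U+FFFD
    (REPLACEMENT CHARACTER) instead of raising an error.
    """
    text = raw.decode("utf-8", errors="replace")
    return unicodedata.normalize("NFC", text)


# \300 and \301 in octal are the bytes 0xC0 and 0xC1, which never occur
# in valid UTF-8, so they are replaced rather than rejected.
print(repr(normalize_password_input(b"abc\xc0\xc1")))

# A decomposed "e" + combining acute accent collapses to one code point.
print(len(normalize_password_input("e\u0301".encode("utf-8"))))
```

Under this scheme a password containing \300 or \301 is accepted, but what gets stored and counted is the replacement character, not the original byte.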

The NIST guidelines say "For purposes of the above length requirements, each Unicode code point SHALL be counted as a single character."
Absent strong reasons to the contrary, we should probably go with that.
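A minimal Python sketch of a check that follows that rule is below. The function name and the default minimum of 8 are assumptions for illustration, not values taken from this task or from any existing configuration.

```python
def meets_minimum_length(password: str, minimum: int = 8) -> bool:
    """Return True if the password has at least `minimum` code points.

    Each Unicode code point counts as one character, regardless of how
    many bytes it takes in UTF-8. The default of 8 is a placeholder.
    """
    return len(password) >= minimum


three_ideographs = "\u5bc6\u7801\u5bc6"  # 3 code points, 9 bytes in UTF-8
eight_ideographs = "\u5bc6\u7801" * 4    # 8 code points, 24 bytes in UTF-8

print(meets_minimum_length(three_ideographs))
print(meets_minimum_length(eight_ideographs))
```

The three-ideograph password would pass a byte-based check with the same minimum (9 bytes) but fails here, so switching from bytes to code points makes the rule stricter for non-ASCII passwords.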