
Need to perform Unicode normalization
Closed, ResolvedPublic

Description

We need to ensure that UTF-8 input is:

  • Valid UTF-8 (strip broken chars)
  • Valid for XML output (strip illegal control characters)
  • In sensible normalization (form C)
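
The three requirements above can be sketched as a single input-cleaning step. This is a minimal illustration in Python, not the PHP code discussed in this task; the function names `strip_xml_illegal` and `clean_input` are hypothetical, and the allowed character ranges are taken from the XML 1.0 `Char` production.

```python
import unicodedata


def strip_xml_illegal(text: str) -> str:
    """Drop code points that XML 1.0 does not allow in a document."""
    def allowed(ch: str) -> bool:
        cp = ord(ch)
        return (
            cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF
        )
    return "".join(ch for ch in text if allowed(ch))


def clean_input(raw: bytes) -> str:
    """Hypothetical helper: validate, strip, then normalize to form C."""
    # 1. Valid UTF-8: silently drop byte sequences that do not decode.
    text = raw.decode("utf-8", errors="ignore")
    # 2. Valid for XML output: remove illegal control characters.
    text = strip_xml_illegal(text)
    # 3. Sensible normalization: compose to NFC.
    return unicodedata.normalize("NFC", text)
```

Dropping bad bytes matches the "strip broken chars" wording above; replacing them with U+FFFD (`errors="replace"`) is an alternative if silent loss is a concern.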

In some cases we may also need to normalize on output, because old
data may be corrupt. Alternatively, we can run a one-time pass over
the database to clean it up.


Version: 1.4.x
Severity: normal

Details

Reference
bz240

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 6:45 PM
bzimport set Reference to bz240.

I'm working on a 'pure PHP' normalizer, though it's likely to be relatively slow. If it turns out to be too slow, we may want a DSO extension for
high-performance sites (see comments in bug 215 on some external resources).

I've checked in some more or less functional normalization routines, in includes/normal.

Probably will want to have WebRequest call UtfNormal::toNFC() on input, or at least some input.
And/or put it in title/username normalization.
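
To illustrate why titles and usernames are a natural place for this: the same visible text can arrive as different code point sequences, and the two only compare equal after normalization. This is a small Python sketch under that assumption; `title_key` is a hypothetical helper, not part of UtfNormal.

```python
import unicodedata

composed = "Caf\u00e9"      # "é" as a single precomposed code point
decomposed = "Cafe\u0301"   # "e" followed by a combining acute accent


def title_key(title: str) -> str:
    """Hypothetical lookup key: the title normalized to form C."""
    return unicodedata.normalize("NFC", title)


# The raw strings differ, so an unnormalized lookup would treat them
# as two separate titles; the normalized keys are identical.
raw_equal = composed == decomposed
key_equal = title_key(composed) == title_key(decomposed)
```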

Additionally we'll want to check for broken UTF-8; these routines can probably be extended to do that too.

Now checking all(?) input for broken UTF-8 and normalizing it to form C. This could use optimization, but that's a
separate issue. :)

Checked into CVS for 1.4.