Page-title filtering incorrectly classifies UTF-8 sequences decoding to U+xxFFFE and U+xxFFFF as invalid
Open, Needs Triage, Public, BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

What happens?:

This page title trips the invalid-UTF-8 handler:

  • "Bad title"
  • "The requested page title contains an invalid UTF-8 sequence."
  • "Return to Main Page."

Any page title containing a percent-encoded UTF-8 sequence that decodes to a Unicode codepoint whose low 16 bits are FFFE or FFFF (i.e., the penultimate or final codepoint of any Unicode plane) is treated as containing invalid UTF-8.
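
For reference, the affected sequences can be enumerated mechanically. The following is a minimal Python sketch, not MediaWiki code; the function names are made up for illustration. It lists the percent-encoded UTF-8 form of the last two codepoints of each of the seventeen planes.

```python
from urllib.parse import quote


def end_of_plane_noncharacters():
    """Yield the 34 codepoints U+xxFFFE and U+xxFFFF for planes 0 through 16."""
    for plane in range(17):
        base = plane << 16
        yield base | 0xFFFE
        yield base | 0xFFFF


def percent_encoded(codepoint):
    """Return the percent-encoded UTF-8 form of a single codepoint."""
    return quote(chr(codepoint).encode("utf-8"), safe="")


for cp in end_of_plane_noncharacters():
    print(f"U+{cp:04X}  {percent_encoded(cp)}")
```

The first line printed is for U+FFFE (%EF%BF%BE) and the last is for U+10FFFF (%F4%8F%BF%BF).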

What should have happened instead?:

The page-does-not-exist screen for that page title should have been displayed, rather than the invalid-UTF-8 screen. The last two codepoints of each of the seventeen planes are designated as noncharacters (more detail is available in chapter 23 ["Special Areas and Format Characters"] of the Unicode 14.0 standard). They nevertheless correspond to valid Unicode scalar values and are specifically allowed in open interchange, although interchange of noncharacters is discouraged. The UTF-8 sequences decoding to these codepoints are therefore valid, well-formed UTF-8, as specifically stated in the Unicode Private-Use Characters, Noncharacters & Sentinels FAQ and in UTR #17 ("Unicode Character Encoding Model"). The official specification for UTF-8 itself says the same, albeit less directly: it gives the range of valid codepoints encodable by UTF-8 as U+0000 - U+10FFFF, inclusive, with the exception of the high/low-surrogate range U+D800 - U+DFFF, inclusive. These sequences should therefore not trip the page-title invalid-UTF-8 handler.
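
As an independent cross-check of the well-formedness claim, a strict UTF-8 decoder from another implementation can be used. The sketch below uses Python's built-in decoder as a stand-in (it is not the code MediaWiki runs, and the helper name is hypothetical). It accepts the end-of-plane noncharacters but rejects an encoded surrogate and a value beyond U+10FFFF.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if `data` decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(is_well_formed_utf8(bytes.fromhex("EFBFBE")))    # U+FFFE, noncharacter   -> True
print(is_well_formed_utf8(bytes.fromhex("F48FBFBF")))  # U+10FFFF, noncharacter -> True
print(is_well_formed_utf8(bytes.fromhex("EDA080")))    # U+D800, surrogate      -> False
print(is_well_formed_utf8(bytes.fromhex("F4908080")))  # beyond U+10FFFF        -> False
```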

(There is one valid UTF-8 sequence that the invalid-UTF-8 handler is known to catch unavoidably when it is included in a page title: %EF%BF%BD, decoding to the replacement character, U+FFFD. Invalid UTF-8 sequences in page titles are replaced with the replacement character, and the software cannot distinguish a replacement character substituted for invalid UTF-8 from a bona fide occurrence of the replacement character in a page title, so it catches all replacement characters regardless of their origin. That issue should not apply to the end-of-plane noncharacters, because the UTF-8 sequences corresponding to them are valid UTF-8 that the software does not use as a placeholder for invalid UTF-8.)
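
The U+FFFD ambiguity can be illustrated with a small Python model of the behavior described here. This is an assumption about the general mechanism (decode with replacement, then look for U+FFFD), not MediaWiki's actual implementation, and `decode_lossy` is a hypothetical name.

```python
REPLACEMENT = "\ufffd"


def decode_lossy(raw: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for each invalid sequence."""
    return raw.decode("utf-8", errors="replace")


genuine = decode_lossy(bytes.fromhex("EFBFBD"))       # a real U+FFFD in the title
repaired = decode_lossy(bytes.fromhex("FF"))          # an invalid byte, replaced by U+FFFD
noncharacter = decode_lossy(bytes.fromhex("EFBFBE"))  # U+FFFE, valid UTF-8

# After decoding, the genuine and the substituted U+FFFD are indistinguishable.
print(genuine == repaired)          # True
# Decoding an end-of-plane noncharacter introduces no replacement character.
print(REPLACEMENT in noncharacter)  # False
```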

In contrast, page titles containing noncharacters from the BMP's Arabic Presentation Forms-A block (which contains a contiguous run of 32 noncharacters, U+FDD0 - U+FDEF, inclusive) are handled correctly. This can be seen by going to, e.g., https://en.wikipedia.org/wiki/%EF%B7%90 (a redirect to https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Non-characters), whose title contains a UTF-8 sequence decoding to U+FDD0, the first of the noncharacters in this 32-codepoint range.

While it may well not be desirable to allow noncharacters to occur in page titles, they are, nevertheless, valid UTF-8, and, as such, if they are to be excluded from page titles, this should be handled by adding those codepoints to the page-title character blacklist, rather than by having the invalid-UTF-8 handler incorrectly classify page titles containing purely valid, well-formed UTF-8 sequences as containing invalid UTF-8.
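
The separation proposed above could look roughly like the following Python sketch. It is only an illustration of keeping the two checks distinct; the function names and result strings are hypothetical and do not correspond to any MediaWiki API. Whether noncharacters should be forbidden at all is a policy decision that this sketch merely assumes for the sake of the example.

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 Unicode noncharacters: U+FDD0..U+FDEF and U+xxFFFE/U+xxFFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify_title(raw: bytes) -> str:
    """Report ill-formed UTF-8 and forbidden characters as separate outcomes."""
    try:
        title = raw.decode("utf-8")  # strict decoding: well-formedness only
    except UnicodeDecodeError:
        return "invalid-utf8"
    # Assumed policy step: a character blacklist applied after decoding.
    if any(is_noncharacter(ord(ch)) for ch in title):
        return "forbidden-character"
    return "ok"


print(classify_title(bytes.fromhex("FF")))        # invalid-utf8
print(classify_title(bytes.fromhex("F09FBFBE")))  # forbidden-character (U+1FFFE)
print(classify_title(b"Main_Page"))               # ok
```

With this structure, a title containing an end-of-plane noncharacter would produce a blacklist-style error rather than a misleading invalid-UTF-8 message.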

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc:

English Wikipedia, MediaWiki 1.38.0-wmf.3, build 762ab25, accessed using Google Chrome 94.0.4606.71 64-bit for Linux on Ubuntu 20.04.3 LTS 64-bit x86.

Screenshot of what happens when a page title contains a UTF-8 sequence decoding to an end-of-plane noncharacter:

Screenshot from 2021-10-13 13-19-13.png (768×1 px, 144 KB)

Screenshots showing proper behavior demonstrated by a page title containing a UTF-8 sequence decoding to a noncharacter in the BMP block of 32 (first image is what happens when going to https://en.wikipedia.org/wiki/%EF%B7%90 itself, second image is the "&redirect=no" page for that title):

Screenshot from 2021-10-13 13-19-17.png (768×1 px, 386 KB)

Screenshot from 2021-10-13 13-19-25.png (768×1 px, 186 KB)