
Pageviews missing for titles with emojis since April 23, 2019
Closed, Resolved · Public

Event Timeline

Restricted Application added a subscriber: Aklapper.
Ottomata triaged this task as High priority.
Ottomata raised the priority of this task from High to Needs Triage.
Ottomata moved this task from Incoming to Data Quality on the Analytics board.
Ottomata added a project: Analytics-Kanban.
Ottomata subscribed.

Nuria: Assigning this to you for now since you worked with Adam on this in the past (right?). Feel free to re-assign or whatever!

Just wanted to write everything I figured out in the past 3 days. I would love your feedback!

If you don't want to read several paragraphs about Unicode and regex, skip to the second-to-last heading "Recap/TL;DR so far". Or read the whole thing!

First, a little bit of history.

Unicode is a computing industry standard designed to consistently and uniquely encode characters used in written languages throughout the world. The Unicode standard uses hexadecimal to express a character. For example, the value 0x0041 represents the character A and the value 0x03C0 represents the character π.

Early Java versions represented Unicode characters using the 16-bit char data type, with characters in the hexadecimal range from 0x0000 to 0xFFFF, called UTF-16 encoding. This design made sense at the time, because all Unicode characters had values no greater than 65,535 (0xFFFF) and could be represented in 16 bits.

Later, however, Unicode increased the maximum value to 1,114,111 (0x10FFFF). Because 16-bit values were too small to represent all of the Unicode characters in this new version, 32-bit values were adopted for UTF-32 encoding.

The definition of a character in the Java programming language could not be changed from 16 bits to 32 bits without causing millions of Java applications to no longer run properly. To correct the definition, a scheme was developed to handle characters that could not be encoded in 16 bits.

This scheme assigns 1,024 values to 16-bit high surrogates (in the range 0xD800 to 0xDBFF) and another 1,024 values to 16-bit low surrogates (in the range 0xDC00 to 0xDFFF). It uses a high surrogate followed by a low surrogate (a surrogate pair) to represent the 1,048,576 (1,024 × 1,024) values between 65,536 (0x10000) and 1,114,111 (0x10FFFF).

Now, on to code.

In Java, this surrogate pair is defined as a pair of char values. For instance, the emoji 🍆 has the Unicode value 0x1F346. This value cannot be represented in 16 bits, and thus is represented as the surrogate pair 0xD83C,0xDF46. In this case, 0xD83C is the high surrogate and 0xDF46 is the low surrogate. This results in the following Java behavior:

> "\uD83C\uDF46" == "๐Ÿ†"
true

> "๐Ÿ†".length()
2

> "๐Ÿ†".charAt(0) // Printing \uD83C, which is an invalid value without an accompanying low surrogate
๏ฟฝ
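
For completeness, the mapping from 0x1F346 to that surrogate pair is just the arithmetic described in the history section. A small Java sketch (not part of the refinery code):

int codePoint = 0x1F346;                        // a value above 0xFFFF
int offset = codePoint - 0x10000;               // 0x0F346
char high = (char) (0xD800 + (offset >> 10));   // 0xD83C, the high surrogate
char low  = (char) (0xDC00 + (offset & 0x3FF)); // 0xDF46, the low surrogate
// Character.toChars(codePoint) performs the same conversion in the standard library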

Perl uses UTF-8 encoding, as opposed to Java's UTF-16 encoding. In a similar way to UTF-16, UTF-8 combines multiple 8-bit values to represent larger values such as emoji. This results in the following Perl behavior:

> "\xF0\x9F\x8D\x86" == "๐Ÿ†" // \x in Perl is similar to Java's \u, but only for values from 0x00 to 0xFF
1 (true)

> "๐Ÿ†".length()
4

> substr("๐Ÿ†", 3, 1) // Similar to Java's str.charAt(0) method; Printing \xF0, which is an invalid value without accompanying characters
๏ฟฝ
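
For what it's worth, those same four bytes can be observed from Java by re-encoding the string as UTF-8 (a quick sketch, using java.nio.charset.StandardCharsets):

byte[] utf8 = "🍆".getBytes(StandardCharsets.UTF_8);
for (byte b : utf8) {
    System.out.printf("%02X ", b & 0xFF);  // prints: F0 9F 8D 86
}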

I'll be coming back to Perl and UTF-8 encoding in a bit.

Up to this point, Java's behavior is precisely the same as JavaScript's behavior, both implementing UTF-16 encoding. However, there's a slight difference between Java's and JavaScript's regex library.

Recall that both Java and JavaScript represent "🍆" as "\uD83C\uDF46". Now look at how differently each language treats the same regex pattern:

JavaScript:

> let re = /^[\uD83C|\uDF46]+$/
> re.test("🍆")
true

Java:

> Pattern re = Pattern.compile("^[\uD83C|\uDF46]+$")
> re.matches("🍆")
false

What the pattern does in both cases is check if every character is either \uD83C or \uDF46. However, JavaScript essentially treats "🍆" as two characters of content ("\uD83C" and "\uDF46") whereas Java treats the string as one character of content ("\uD83C\uDF46").
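
You can also see the two views directly on the string: String.length() counts UTF-16 code units, while codePointCount() counts whole code points, which is how Java's regex engine walks the input. A quick sketch:

> "🍆".length()
2

> "🍆".codePointCount(0, 2)
1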

You can see this fact in the following Java code. Note the removal of the "|" (or) character from the pattern:

> Pattern re = Pattern.compile("^[\uD83C\uDF46]+$")
> re.matches("🍆")
true

Note that no results would change if "🍆" were replaced with "\uD83C\uDF46" because they are identical strings.

So, what does this all have to do with this task?

MediaWiki has a regex character class (i.e. a list of characters in a format suitable for a regular expression) called $wgLegalTitleChars that defines which characters are allowed in titles. For instance, if $wgLegalTitleChars were configured to be 'A-Z!', then all titles on that MediaWiki site would only be able to use uppercase letters and exclamation points.

Currently, $wgLegalTitleChars is left at its default value for all Wikimedia projects: ' %!\"$&'()*,\-./0-9:;=?@A-Z\\^_`a-z~\x80-\xFF'. The issue is that this value is intended to be applied to UTF-8 encoded strings in Perl rather than to UTF-16 encoded strings in Java. See how the '\x80-\xFF' portion of the pattern matches the "\xF0\x9F\x8D\x86" UTF-8 Perl string (which is equivalent to "🍆", as shown previously).

In March 2019, @awight added the following code to PageviewDefinition.java in the Analytics refinery-core followed by code to test that regex pattern against incoming page titles:

/**
* Simple regex to reject illegal title characters.
*
* Must be kept in sync with $wgLegalTitleChars.
*/
private final Pattern titleValidPattern = Pattern.compile(
    "^[ %!\"$&'()*,\\-./0-9:;=?@A-Z\\\\^_`a-z~\\x80-\\xFF\\u0080-\\uffff]+$"
);

Unlike before, with this addition the refinery now filtered out invalid page titles that people attempted to visit, such as "Star_Wars_[film]" (square brackets are invalid) and "C#" (number signs are invalid).

Note that this regex character class differed from the existing $wgLegalTitleChars at the time:
' %!\"$&'()*,\\-./0-9:;=?@A-Z\\\\^_'a-z~\\x80-\\xFF\\u0080-\\uffff' (titleValidPattern character class) versus
' %!\"$&'()*,\\-./0-9:;=?@A-Z\\\\^_'a-z~\\x80-\\xFF' ($wgLegalTitleChars (with Java delimiting)).

The addition of '\\u0080-\\uffff' by Adam successfully matched individual Unicode characters between the values 0x0080 and 0xFFFF, allowing for the tracking of pages with titles including symbols such as Chinese characters, mathematical operators, and dingbats.

However, certain page titles that contained emoji, such as "🍆", were left unmatched by the pattern and thus classified as unknown pageviews. This was seen with the Java regex code in the previous section; no Unicode range matching will work for emoji unless the range accounts for surrogate pairs (i.e. whole code points).

Recap/TL;DR so far

Unicode is a way to represent characters as hex values. Java chars store 16-bit values, but Unicode has evolved to include values that need more than 16 bits. To adapt to this, certain Unicode characters, such as emoji, are represented by a pair of 16-bit chars. For example, "\uD83C\uDF46" == "🍆".

However, during regex matching, Java treats the string as one character of content, whereas its counterparts, like JavaScript, classify the two characters individually ("\uD83C" and "\uDF46"). So, if the regex includes a simple Unicode range from 0x0000 to 0xFFFF, then emoji will not be matched in Java but will be in JavaScript.

MediaWiki has a regex character class called $wgLegalTitleChars that defines which characters are allowed in titles. A slight variation of this regex is implemented in PageviewDefinition.java in the Analytics refinery-core:

/**
* Simple regex to reject illegal title characters.
*
* Must be kept in sync with $wgLegalTitleChars.
*/
private final Pattern titleValidPattern = Pattern.compile(
    "^[ %!\"$&'()*,\\-./0-9:;=?@A-Z\\\\^_`a-z~\\x80-\\xFF\\u0080-\\uffff]+$"
);

As a result, page titles that contained emoji were left unmatched by the pattern. In Java, Unicode matching will not work for emoji unless the pattern accounts for char (surrogate) pairs.

So, how do we fix this?

Now that somebody's noticed after 10 months, it's time to fix this issue. What's the best way? I don't know. But, I'll pose to you 3 different ways, and I want your feedback on what route you think I should go.

  1. The deviating way

Now, why did I previously bring up JavaScript? Because core MediaWiki's JavaScript code manipulates the $wgLegalTitleChars character class in a similar way to what is currently implemented in the pageview definition. However, it correctly classifies emoji because it treats each 16-bit character in the surrogate pair as its own step and matches all such characters.

We can implement a very similar solution in Java, with the pattern "^[ %!\"$&'()*,\\-.\\/0-9:;=?@A-Z\\\\^_`a-z~\\x{80}-\\x{10FFFF}+]+$". Note how both '\\x80-\\xFF' and '\\u0080-\\uffff' are absent from the expression. This is because Java's \x{N} notation can express any Unicode code point 0xN, up to 0x10FFFF.

This is a rather straightforward solution, performing the same function that $wgLegalTitleChars is intended to perform. However, it has the downside of not being a direct implementation of $wgLegalTitleChars.
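
For illustration, here is roughly how the new range behaves compared to the current one, reduced to just the high ranges rather than the full character class (a sketch):

> Pattern aboveAscii = Pattern.compile("^[\\x{80}-\\x{10FFFF}]+$")
> aboveAscii.matcher("🍆").matches()
true

> Pattern bmpOnly = Pattern.compile("^[\\u0080-\\uffff]+$")
> bmpOnly.matcher("🍆").matches()
false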

  1. The "if we're going to deviate from it, why not abandon it" way

This is my personal favorite. Its logic is as follows: rather than matching many things as legal, match a few things as illegal. Essentially, make a $wgIllegalTitleChars pattern.

The $wgLegalTitleChars page mentions how "the list of illegal characters is as follows: #<>[]|{}, non-printable characters 0 through 31 (0x1F), and 'delete' character 127 (0x7F)".

So I propose an implementation of the pattern "^[^#<>\\[\\]\\|\\x00-\\x1F\\x7F]+$".
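
A quick sketch of how that proposed pattern would behave against the earlier example titles:

> Pattern titleValid = Pattern.compile("^[^#<>\\[\\]\\|\\x00-\\x1F\\x7F]+$")
> titleValid.matcher("🍆").matches()
true

> titleValid.matcher("Star_Wars_[film]").matches()
false

> titleValid.matcher("C#").matches()
false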

The pros are pretty straightforward: it is simple, clean, and fast.

However, it has a downside similar to the first way: it is not a direct implementation of $wgLegalTitleChars, and it is abstracted even further away from it. Changes to $wgLegalTitleChars may not be simple to carry over.

  3. The Perl way

Essentially, this way, just like Perl, runs the regex against the title using a UTF-8 encoding, and is thus directly compatible with the exact $wgLegalTitleChars character set.

This is by far the most complex way. The title would load into a string-like data structure, get turned into a byte array, and finally get passed into the regex as 8-bit character segments.

This is the way for super long-term stability, but it is slightly slower. I already have this implemented, so I know it's viable, and it's really not as bad as it seems.

Final Note

I have all of these above ways coded up and ready to submit for code review. Let me know what you think and what questions you have. Thanks!

Thanks so much to @awight for helping me figure out all of this!

(History Sources: Oracle, Ibrahem Shabban)

Great explanation, thank you, every time someone says "unicode" I usually have to read for an hour to remember all the tricks. So, my thoughts on the three proposals:

  1. It's close enough to mediawiki's definition, if you went this way you could add a comment that shows the current $wgLegalTitleChars to make sure we stay close to it, and just link to your comment here. I rank this choice 2nd.
  2. This is scary to me because of all the crazy characters out there. Like the five unicode characters that have zero width: https://www.ptiglobal.com/2018/04/26/the-beauty-of-unicode-zero-width-characters/ (this is the worst thing ever if you're ever debugging something with your puny human eyeballs). I rank this 3rd but since $wgLegalTitleChars does allow some of these, maybe it's ok... I donno, still scared.
  3. I like this approach. Made me wonder if there are Java regex libraries compatible with Perl, and I found http://jregex.sourceforge.net/. It boasts unicode compatibility with Perl 5.6 so may be worth a look.

Finally, testing the eventual approach should take a bunch of titles and run them through the old and new version, making sure the change is exactly what we expect. That testing may take some time, so if you have a few approaches, I suggest testing them all at the same time.
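
(For illustration, such an old-vs-new comparison could be sketched roughly as follows; the class name, input handling, and the "new" pattern are placeholders, not the actual test code:)

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class PatternComparison {

    static final Pattern OLD = Pattern.compile(
        "^[ %!\"$&'()*,\\-./0-9:;=?@A-Z\\\\^_`a-z~\\x80-\\xFF\\u0080-\\uffff]+$");
    static final Pattern NEW = Pattern.compile(
        "^[^#<>\\[\\]\\|\\x00-\\x1F\\x7F]+$");  // whichever candidate is being evaluated

    public static void main(String[] args) throws Exception {
        // args[0]: a file with one title per line
        try (Stream<String> titles = Files.lines(Paths.get(args[0]))) {
            titles.filter(t -> OLD.matcher(t).find() != NEW.matcher(t).find())
                  .forEach(System.out::println);  // print every title the two patterns disagree on
        }
    }
}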

I disagree with @Milimetric, I think #3 would be a performance drag for code that might be executed 6000+ times a second, plus as a philosophical rule I think we should program in Java with Java constructs rather than using constructs from a different language. Let's hear what @awight's opinion is.

Thank you @lexnasser for the great write-up! I'm pinning this as my new favorite overview of the crazy things that can happen with string encodings 🙂

My 6 cents: options (1) and (3) seem roughly the same to me. Now that we understand the problem, it's not too scary to port the UTF-8 high-bit range to something similar for UTF-16. The range you gave seems correct, Character.MAX_CODE_POINT is currently 0x10ffff, just as you have it. I wouldn't think of it as "deviating" from the MediaWiki legal characters, it's a slight adjustment to accommodate the differences between the string encoding and regex syntax across the two programming languages. My only concern would be that MAX_CODE_POINT might increase again in a few years, but that could be mitigated by asserting the value of the constant in a test. (2) is really interesting, but I would only go with that option if you're willing to make the case for flipping $wgLegalTitleChars to an "illegal" sense, and coming up with migrations / compatibility for third-parties who have configured the variable. It's a nice idea, and the Unicode complexity completely disappears, which is a win! (3) feels like too much trouble to accomplish something simple, but it would certainly do the job.

We already let through a huge range of un-audited Unicode stuff, and I agree with @Milimetric that it has potential to gum up the works. However, the PageviewDefinition change should follow the "robustness principle": we should accept and pass-through any valid title. Invisible characters might be *necessary* in some languages, for example Zero Width Joiner for combining emoji. But the downstream pipeline needs to correctly escape values to avoid breaking the CSV / TSV format. I believe this is not currently the case, if a newline, tab, or their high-Unicode equivalents slip through then the dumps will break again. Just to mention it, we aren't responsible for blocking characters like Unicode compositions which potentially result in "<", '"', or whatever might break SQL consumers, etc--our only concern should be that the dump file format is correct.
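
(For illustration, the guard test suggested above could look roughly like this, assuming JUnit 4; the class and method names are made up:)

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class MaxCodePointTest {
    @Test
    public void titlePatternUpperBoundIsStillTheUnicodeMaximum() {
        // If a future Unicode/JDK update raises MAX_CODE_POINT past 0x10FFFF,
        // this fails and reminds us to widen the \x{10FFFF} upper bound in the pattern.
        assertEquals(0x10FFFF, Character.MAX_CODE_POINT);
    }
}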

Good point Nuria, performance matters a lot. But JRegex might be fast, we should profile it. I think my point about keeping it a whitelist instead of a blacklist is valid. So the fastest option that still does that gets my vote.

Thanks everyone for your input! I'm a bit busy right now, but I'll be sure to address each of your points later today.

Just for context, option 3 would look something like this:

// This is the data structure to allow UTF-8 interpretation of Java Strings

public static class StringUtf8CharSequence implements CharSequence {

	final String str;
	final byte[] data;  // the title re-encoded as UTF-8 bytes

	public StringUtf8CharSequence(String str) {
		this.str = str;
		this.data = str.getBytes(StandardCharsets.UTF_8);  // requires java.nio.charset.StandardCharsets
	}

	@Override
	public int length() {
		// Length in UTF-8 bytes, not in UTF-16 chars
		return data.length;
	}

	@Override
	public char charAt(int index) {
		// Expose each UTF-8 byte as a char in the range 0x00-0xFF, so the
		// \x80-\xFF portion of $wgLegalTitleChars applies to it directly
		return (char) (0xFF & data[index]);
	}

	@Override
	public CharSequence subSequence(int start, int end) {
		// start and end are byte offsets into the UTF-8 data, so decode the
		// requested byte range back into a String before wrapping it again
		return new StringUtf8CharSequence(new String(data, start, end - start, StandardCharsets.UTF_8));
	}

}

With implementation as follows:

StringUtf8CharSequence titleUtf8 = new StringUtf8CharSequence(title);

boolean isMatch = titleValidPattern.matcher(titleUtf8).find();

I'd expect the speed to be pretty similar to option (1), because Java regex matching takes a CharSequence by default; the one thing I'm unsure of is the speed of str.getBytes(StandardCharsets.UTF_8).

I'll look more into the speed / time complexity later as well.

I've tested this on many examples, and it seems to work as expected. Will test even more later on.

Again, thanks for all your input!

Thanks again for all your feedback!

I implemented all 3 methods in Java and ran each of them, individually, against all the VALID pageview titles collected during one hour on a day earlier this month across all Wikimedia projects (more than 22,000,000 values), on stat1007.
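
(A harness for this kind of timing could look roughly like the following; this is a sketch rather than the actual test code, and the input path and pattern are placeholders:)

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;

public class TitleMatchBenchmark {
    public static void main(String[] args) throws Exception {
        Pattern pattern = Pattern.compile("^[^#<>\\[\\]\\|\\x00-\\x1F\\x7F]+$");  // method under test
        List<String> titles = Files.readAllLines(Paths.get(args[0]));             // one title per line

        long start = System.nanoTime();
        long matched = titles.stream().filter(t -> pattern.matcher(t).find()).count();
        double seconds = (System.nanoTime() - start) / 1e9;

        System.out.printf("%d of %d titles matched in %.3fs%n", matched, titles.size(), seconds);
    }
}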

To my slight surprise, the following were the results across 6 trials of each:

| Method | Average Time | Time 1 | Time 2 | Time 3 | Time 4 | Time 5 | Time 6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1: Matching \\x{80}-\\x{10FFFF} | 64.595s | 71.150s | 61.545s | 65.891s | 61.273s | 66.025s | 61.684s |
| 2: Banning illegal characters | 36.504s | 39.188s | 34.812s | 32.193s | 37.983s | 37.594s | 37.254s |
| 3: Using UTF-8 (Perl-like) interpretation | 70.484s | 66.002s | 73.354s | 69.754s | 70.005s | 71.307s | 72.481s |

As you can see, method 2 seems to be the fastest by quite a bit.

However, the fact that 3 is optimal in terms of compatibility with possible future modifications of $wgLegalTitleChars makes it now my slight preference, and it wouldn't make the current title validation process any slower.

I also think that method 1 should be removed from consideration, since it is about as slow as method 3 without method 3's compatibility benefits.

Your views might change according to this above info, but I'll address what you wrote anyway.


@Milimetric

  2. This is scary to me because of all the crazy characters out there. Like the five unicode characters that have zero width.

I'm not sure I agree with this critique simply because $wgLegalTitleChars itself allows it. I might be ignoring/misunderstanding something, though.

  3. I like this approach. Made me wonder if there are Java regex libraries compatible with Perl, and I found http://jregex.sourceforge.net/. It boasts unicode compatibility with Perl 5.6 so may be worth a look.

I coded up an implementation of JRegex, and it didn't work as expected, not being able to properly match !Xũ. I haven't extensively looked into this, but I believe it's because ũ = 0x0169 > 0xFF, meaning that I don't think JRegex actually interprets strings as UTF-8; the only Perl-like behavior listed seems to be compatibility with "Perl5.6's conditional statements".

I also agree with everything you mentioned in your last paragraph about testing.

I think my point about keeping it a whitelist instead of a blacklist is valid. So the fastest option that still does that gets my vote.

I'm again not sure what the precise issue with a blacklist is, so it would be great if you could expand on this. If we want to keep the fastest option that employs a whitelist, that would be method 1 by a slight speed margin, but I would prefer method 3 over it for the reasons listed above.


@Nuria

I think #3 would be a performance drag for code that might be executed 6000+ times a second, plus as a philosophical rule I think we should program in Java with Java constructs rather than using constructs from a different language.

According to the above tests, I think that the only way to improve performance to a significant extent would be method 2.

I also want to clarify that method 3 is not so much imitating Perl constructs but rather just interpreting with UTF-8 encoding, but I get your point.


@awight

(2) is really interesting, but I would only go with that option if you're willing to make the case for flipping $wgLegalTitleChars to an "illegal" sense, and coming up with migrations / compatibility for third-parties who have configured the variable. It's a nice idea, and the Unicode complexity completely disappears, which is a win!

What do you have in mind when you say "migrations / compatibility for third-parties who have configured the variable"? I agree that the Unicode complexity going away is very nice!

I also agree with everything you mentioned in your last paragraph about the entire pipeline.

What do you have in mind when you say "migrations / compatibility for third-parties who have configured the variable"? I agree that the Unicode complexity going away is very nice!

My tangent was mostly about how (2) might be the best solution overall and cleans up a nontrivial piece of technical debt (noting that several engineers have been involved in fallout from this wacky regex, multiple times now), but the challenges will be around social negotiation and communication. Maybe that sounds fun, though?

In the case that we decide to flip from $wgLegalTitleChars to $wgIllegalTitleChars, then some small changes would have to be made to core MediaWiki. First would be a short proposal describing the changes, motivation, and potential impacts. Code review would be straightforward once consensus existed to make the change (I'm in favor :-). But the trickiest detail is that we would have to consider what happens if some so-called third-party MediaWiki administrator has overridden $wgLegalTitleChars. We would need to support both the Legal and Illegal configurations for one or two major versions in order to deprecate and remove the old behavior, without breaking these sites.

I guess the question would be how much time you're willing to spend on the solution, and if you can justify expanding your scope of work to tackle the upstream debt. My estimate is that it's a small project, but still requires a low level of commitment over a few months to see it through.

@lexnasser let's do a bit more testing. From my tries, the \x{n} notation does not work in a string context (it fails at runtime). Adding my code below, let me know if I missed anything or misunderstood.

import java.util.regex.*;

class Main {
  public static void main(String[] args) {
    System.out.println("Hello world!");

    String re =
      "^[ %!\"$&'()*,\\-./0-9:;=?@A-Z\\\\^_`a-z~\\x80-\\xFF\\u0080-\\uffff]+$";

    // we expect false, false is printed
    System.out.println(Pattern.matches(re, "🍆"));

    String emoji = "^[ %!\"$&'()*,\\-./0-9:;=?@A-Z\\\\^_'a-z~\\\\x{80}-\\\\x{10FFFF}+]+$";

    // we expect true, but this fails at runtime
    // System.out.println(Pattern.matches(emoji, "🍆"));

    final Pattern titleValidPattern = Pattern.compile(
      "^[ %!\"$&'()*,\\-./0-9:;=?@A-Z\\\\^_`a-z~\\x80-\\xFF\\u0080-\\uffff]+$"
    );

    Matcher m = titleValidPattern.matcher("🍆");

    if (m.find()) {
      // we expect "match successful" to print, but that is not what we see
      System.out.println("match successful");
    }
  }
}

@awight: this is indeed a small project without any aim to change MediaWiki code

@Nuria

This is the code I'm using: Pattern.compile("^[ %!\"$&'()*,\\-.\\/0-9:;=?@A-Z\\\\^_`a-z~\\x{80}-\\x{10FFFF}\\+]+$");

It has a few differences from what you wrote:

  1. "/" is preceded by a delimiting "\\".
  2. "'" is rather "`" (a backtick). Had issues with inserting a backtick into the code-style on Phabricator previously.
  3. Both "x{#}" are delimited with "\\" rather than "\\\\".
  4. The first "+" is preceded by a delimiting "\\".

The file I used to test that method is attached here.

@lexnasser ok, changing those fixed it. Just tested that it works on Scala 2.11 and Java 1.8 (Scala uses Java's regexes, but just to be extra sure, as we use this code from both Java and Scala). I prefer method 1 as I mentioned, just wanted to make sure to thoroughly test the \x{n} notation that I had not seen before.

Thanks for the tip! Luckily, I don't think that'll have any effect on the pageview processing, since we're not looking up pages in the database, and both sides of that mapping will still be valid titles according to the analytics pageview definition.

On @Milimetric 's suggestion, I tested all 3 methods against each other to verify their consistency, and found they all behaved the same over the whole Unicode range.
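
(For reference, that consistency check can be sketched roughly as follows; method1/2/3 stand for the three compiled java.util.regex Patterns, and StringUtf8CharSequence is the wrapper class shown earlier:)

static void compareMethods(Pattern method1, Pattern method2, Pattern method3) {
    for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
        if (Character.getType(cp) == Character.SURROGATE) {
            continue;  // lone surrogates are not valid titles; skip them
        }
        String s = new String(Character.toChars(cp));
        boolean m1 = method1.matcher(s).find();
        boolean m2 = method2.matcher(s).find();
        boolean m3 = method3.matcher(new StringUtf8CharSequence(s)).find();
        if (m1 != m2 || m2 != m3) {
            System.out.printf("Methods disagree at U+%04X%n", cp);
        }
    }
}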

@Nuria and @awight , what are your thoughts on method 2 (without any changes to Mediawiki Core, etc.)?

Well, if they're identical, then the speed of method 2 seems too nice to ignore. The only downside is that as new characters are added to Unicode (looks like a few thousand a year), some will not be whitelisted in MediaWiki's implementation and we'll be out of sync until we add them to our blacklist. But the way the current whitelist is written, that would happen anyway (with the character range). So I think method 2 is a strong choice.

@lexnasser I think as a general paradigm I agree with a do-not-allow list being way easier to maintain than an "allowed" list. Let's test method 2 further using 1 day of data to make sure it does not exclude any pageview now marked as such. In the absence of any issues, +1 to implementation.

@lexnasser let's also test the "bad" pageview titles that should be excluded and that motivated @awight's changes in the first place.

Change 589383 had a related patch set uploaded (by Lex Nasser; owner: Lex Nasser):
[analytics/refinery/source@master] Allow pageview titles that include Unicode character values above 0xFFFF like emoji

https://gerrit.wikimedia.org/r/589383

Just submitted a patch with the fix and some new tests: https://gerrit.wikimedia.org/r/589383


@Nuria

Confirmed against a day's worth of titles that no titles previously-accepted would be rejected.

Is there any way you suggest testing "bad" pageview titles? I was thinking just querying the titles that match this pattern but not the previous pattern and just eyeing that they're valid.

@Nuria

Thanks for the suggestions!

T144100 is an issue later in the Analytics pipeline, so I don't think a unit test should be added for it. Let me know if you disagree.

T152628 already has some unit tests.

Change 589383 merged by jenkins-bot:
[analytics/refinery/source@master] Allow pageview titles that include Unicode character values above 0xFFFF like emoji

https://gerrit.wikimedia.org/r/589383

Change 591493 had a related patch set uploaded (by Milimetric; owner: Milimetric):
[analytics/refinery@master] Update unicode handling in Pageview Definition

https://gerrit.wikimedia.org/r/591493

Change 591493 merged by Milimetric:
[analytics/refinery@master] Update unicode handling in Pageview Definition

https://gerrit.wikimedia.org/r/591493

Changes are deployed. @lexnasser let's do another round of tests to make sure all is good on the webrequest table and to see if there are titles that now appear that we were filtering before.

Thank you, all! I just wanted to further compliment T245468#6046245 -- that was a fascinating read!