
Add a link engineering: deal with word boundaries (\b) being broken in JS
Closed, Invalid · Public

Description

Regexes in JS have a very broken implementation of \w and \b, which considers characters like é to be non-word characters. This means that in the string foo café bar, there is no word boundary between the é and the following space, because both are non-word characters; as a result, \bfoo\b and \bbar\b match this string, but \bcafé\b doesn't. Since our current code uses '\b' + phrase + '\b' to find phrases, any phrase that begins or ends with a "special" character can't be found (or, worse, is matched in the wrong place).

We should probably deal with this by not using word boundaries at all, and instead using the other properties listed in T267329 to avoid false positives.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Oh wow that's sad. They could be expressed with Unicode property escapes (something like ([\p{L}\p{N}\p{M}][^\p{L}\p{N}\p{M}]|[^\p{L}\p{N}\p{M}][\p{L}\p{N}\p{M}]) I think?) but apparently IE doesn't support those.

How do we handle link boundaries that are different in wikitext and HTML, anyway? Links like [[foo]]bar are pretty common. Does mwaddlink always link the whole word? (In T128060#4316940 it was mentioned that some wikis seem to intentionally avoid that in some cases, but no one was really sure about the details.)

> Oh wow that's sad. They could be expressed with Unicode property escapes (something like ([\p{L}\p{N}\p{M}][^\p{L}\p{N}\p{M}]|[^\p{L}\p{N}\p{M}][\p{L}\p{N}\p{M}]) I think?) but apparently IE doesn't support those.

Another alternative is to use unicodeJS.characterclass.patterns.word, which is a fully Unicode-compliant regex for word characters. From that we could try to concoct a regex that uses lookahead/lookbehind to implement word boundaries somehow (which we would also have to do if we use \p{L} and friends).

> How do we handle link boundaries that are different in wikitext and HTML, anyway? Links like [[foo]]bar are pretty common. Does mwaddlink always link the whole word? (In T128060#4316940 it was mentioned that some wikis seem to intentionally avoid that in some cases, but no one was really sure about the details.)

I don't know how mwaddlink decides what's a word; that would be good to know, because using the same definition of what a word is in both mwaddlink and the VE frontend is the only way that an algorithm based on words/phrases is at all viable. Otherwise, we'd have to lean on the context information (text before/after) instead.

>> How do we handle link boundaries that are different in wikitext and HTML, anyway? Links like [[foo]]bar are pretty common. Does mwaddlink always link the whole word? (In T128060#4316940 it was mentioned that some wikis seem to intentionally avoid that in some cases, but no one was really sure about the details.)

> I don't know how mwaddlink decides what's a word; that would be good to know, because using the same definition of what a word is in both mwaddlink and the VE frontend is the only way that an algorithm based on words/phrases is at all viable. Otherwise, we'd have to lean on the context information (text before/after) instead.

I'm not sure mwaddlink has the concept of a word as such. It knows about wikilinks as determined by the parser in mwparserfromhell. In training mode, when building the datasets used for looking up link candidates, it uses dictionaries of anchor texts collected by parsing articles and iterating over the wikilinks found in the text of those articles. In querying mode, when we send it an article to analyze for links to suggest, it slides a window across the text looking for potential matches against the values in the dataset dictionaries. (See also these notes on meta.)

Does that help answer your question? @MGerlach could probably shed more light on this but he's out until 18 January.


Stepping back a bit: as mentioned in the task description, I think we should probably not use word boundaries and instead rely on other metadata from T267329 (i.e. instance occurrence) to make the correct link.

> Stepping back a bit: as mentioned in the task description, I think we should probably not use word boundaries and instead rely on other metadata from T267329 (i.e. instance occurrence) to make the correct link.

We could use instance occurrence, if both mwaddlink and the frontend agree that that counter includes sub-word strings. For example, in A scary [[car]] careens, there would be three occurrences of "car", and the linked one would be the second one. Using context_before/context_after could potentially be more resilient, as long as mwparserfromhell and the Parsoid HTML agree on what counts as text and where block-level boundaries (e.g. paragraph boundaries) are.

> Stepping back a bit: as mentioned in the task description, I think we should probably not use word boundaries and instead rely on other metadata from T267329 (i.e. instance occurrence) to make the correct link.

We discussed this on Slack and ended up on this. Needs T271604: Add a link: instance_occurrence should be based on plaintext, not wikitext.

>> How do we handle link boundaries that are different in wikitext and HTML, anyway? Links like [[foo]]bar are pretty common. Does mwaddlink always link the whole word? (In T128060#4316940 it was mentioned that some wikis seem to intentionally avoid that in some cases, but no one was really sure about the details.)

>> I don't know how mwaddlink decides what's a word; that would be good to know, because using the same definition of what a word is in both mwaddlink and the VE frontend is the only way that an algorithm based on words/phrases is at all viable. Otherwise, we'd have to lean on the context information (text before/after) instead.

> I'm not sure mwaddlink has the concept of a word as such. It knows about wikilinks as determined by the parser in mwparserfromhell. In training mode, when building the datasets used for looking up link candidates, it uses dictionaries of anchor texts collected by parsing articles and iterating over the wikilinks found in the text of those articles. In querying mode, when we send it an article to analyze for links to suggest, it slides a window across the text looking for potential matches against the values in the dataset dictionaries. (See also these notes on meta.)

> Does that help answer your question? @MGerlach could probably shed more light on this but he's out until 18 January.

We generate word-tokens from wikitext in the following way:

  • extract parts of plain text using mwparserfromhell
  • split text into sentences using a sentence-tokenizer (nltk.tokenize.sent_tokenize)
  • split each sentence into word-tokens using a word-tokenizer, which splits at whitespace and at punctuation such as commas (nltk.tokenize.word_tokenize)

We generate candidates for the anchor text of a link by concatenating n consecutive word-tokens (for n = 1, ..., 10) and checking whether that string exists as a key in the anchors dictionary. The anchors dictionary keeps track of all already-existing links in a wiki in the form {anchor-text: [list of article titles linked from that anchor text]}. If the candidate anchor exists, we then have to decide which (if any) of the links we should link to.

This means that the link recommendation currently will not link sub-words, i.e. anything below the level of a token.

Since we are doing this already, I wonder if it would provide an easy means for T269655: Add a link: sentence highlighting research spike? (Easier, anyway. The wikitext/HTML transformation would still be annoying.)

This is no longer relevant per T271124#6733319

Right, sorry. I was confused because the candidate-check logic in mwaddlink uses word boundaries, but the match-count calculation doesn't, so it shouldn't be an issue.