
Add a link: instance_occurrence should be based on plaintext, not wikitext
Closed, Resolved; Public

Description

instance_occurrence counts how many times the link text occurs before the point where it should be linked. It is currently computed on wikitext, which does not work for Parsoid HTML. It should be computed on plaintext instead.

In other words, when the service suggests changing "Scary careening <!-- scared --> car" to "Scary careening <!-- scared --> [[car]]", instance_occurrence is currently 4 (assuming 1-based indexing), but it should be 3, because we want to disregard the comment. The same goes for template parameters, etc.
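
To make the arithmetic in the example concrete, here is a minimal Python sketch. It is not the mwaddlink implementation: the function name, its parameters, and the regex-based comment stripping are assumptions for illustration only, and it handles comments but not templates.

```python
import re

# Hypothetical helper, not taken from mwaddlink: strips HTML comments only.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def instance_occurrence(wikitext, anchor, link_start, plaintext=True):
    """Return the 1-based index of the anchor occurrence at link_start.

    With plaintext=False, every occurrence in the raw wikitext before
    link_start is counted. With plaintext=True, comments are removed
    from the preceding text first, so matches inside them are ignored.
    """
    prefix = wikitext[:link_start]
    if plaintext:
        prefix = COMMENT_RE.sub("", prefix)
    return prefix.count(anchor) + 1


wikitext = "Scary careening <!-- scared --> car"
link_start = wikitext.rindex("car")  # position of the word to be linked

print(instance_occurrence(wikitext, "car", link_start, plaintext=False))  # 4
print(instance_occurrence(wikitext, "car", link_start, plaintext=True))   # 3
```

The wikitext-based count sees "car" inside "Scary", "careening", and "scared" before the target; the plaintext-based count drops the one inside the comment.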

Event Timeline

Change 655547 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[research/mwaddlink@main] Make instance_occurence plaintext-based

https://gerrit.wikimedia.org/r/655547

Patch is up; I have no idea how to test it though.
It takes the somewhat lazy approach of only considering top-level text nodes (i.e. it ignores anything inside a comment or a template, but also the contents of things like list items or italic text), which matches how mwaddlink selects link anchors in the first place. That means there is no need for recursion, and the behavior should be easy to match in Parsoid HTML: only look at text DOM nodes which are not inside any [typeof] tag. The exception is DisplaySpace and other whitespace differences, but that's an unavoidable problem.
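
For illustration, the Parsoid-side rule described above (skip text inside any element carrying a typeof attribute) could be sketched as follows. This is an assumption-laden sketch, not code from the patch: the class and function names are hypothetical, it uses Python's standard html.parser rather than whatever the consumer actually uses, it assumes well-formed HTML, and it does nothing about DisplaySpace or other whitespace differences.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class PlainTextCollector(HTMLParser):
    """Collects text nodes that have no ancestor with a typeof attribute."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # one bool per open element: does it carry typeof?
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(any(name == "typeof" for name, _ in attrs))

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Comments never reach handle_data, so they are skipped automatically.
        if not any(self.stack):
            self.chunks.append(data)


def plaintext(html):
    collector = PlainTextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.chunks)


html = ('<p>Scary careening <!-- scared --> '
        '<span typeof="mw:Transclusion">car</span> car</p>')
print(plaintext(html).count("car"))  # 3
```

Here the "car" inside the comment and the one inside the transclusion span are both ignored, leaving the matches in "Scary", "careening", and the final "car".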

Change 655547 merged by jenkins-bot:
[research/mwaddlink@main] Make instance_occurence plaintext-based

https://gerrit.wikimedia.org/r/655547

kostajh added a subscriber: kostajh.

Marking as resolved; we can re-open if there is more to do later.