Add Link engineering: Output format for link recommendations
Closed, Resolved (Public)

Description

As discussed in T261411, we want the link recommendation output to be a JSON-serialized list of items, each with the following properties:

  • phrase_to_link (text)
  • context_before (text -- 5 characters of text that occur before the phrase to link)
  • context_after (text -- 5 characters of text that occur after the phrase to link)
  • link_target (text)
  • instance_occurrence (integer) -- number showing how many times the phrase to link appears in the wikitext before we arrive at the one to link
  • probability (float)
  • insertion_order (integer) -- order in which to insert the link on the page (e.g. recommendation "foo" [0] comes before recommendation "bar baz" [1], which comes before recommendation "bar" [2], etc.)
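
As a rough illustration of the item shape listed above, the following Python sketch builds one item, serializes a list of them to JSON, and uses instance_occurrence to locate the phrase in the wikitext. The class name LinkRecommendation, the helper find_phrase_offset, and all sample values are assumptions made for illustration and are not taken from mwaddlink; probability is modeled here as a number between 0 and 1, which is also an assumption.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class LinkRecommendation:
    # Hypothetical container; the field names follow the list above.
    phrase_to_link: str
    context_before: str
    context_after: str
    link_target: str
    instance_occurrence: int
    probability: float  # assumption: a number between 0 and 1
    insertion_order: int


def find_phrase_offset(wikitext: str, phrase: str, instance_occurrence: int) -> int:
    """Return the offset of the occurrence to link, or -1 if it does not exist.

    instance_occurrence counts the occurrences that come before the one to
    link, so 0 means the first occurrence in the wikitext.
    """
    offset = -1
    for _ in range(instance_occurrence + 1):
        offset = wikitext.find(phrase, offset + 1)
        if offset == -1:
            return -1
    return offset


# Made-up sample data: the second "bar" is the one to link.
wikitext = "A bar stands near another bar in town."
item = LinkRecommendation(
    phrase_to_link="bar",
    context_before="ther ",
    context_after=" in t",
    link_target="Bar (establishment)",
    instance_occurrence=1,
    probability=0.87,
    insertion_order=0,
)

serialized = json.dumps([asdict(item)])
offset = find_phrase_offset(wikitext, item.phrase_to_link, item.instance_occurrence)
```

In this sketch the five characters on either side of the located offset can be compared with context_before and context_after as a sanity check that the right occurrence was found.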

Event Timeline

Do we still want context_before / context_after now that we are not planning to switch the service to Parsoid? Due to filtering in mwaddlink, phrase_to_link won't include wikimarkup, but the before/after segments might, so I'm not sure how useful they are for matching against Parsoid DOM.

Change 650467 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[research/mwaddlink@main] Update API output format

https://gerrit.wikimedia.org/r/650467

What do you think about using link_text instead of phrase_to_link and score instead of probability? Those feel like more standard names to me (although the current ones are also easy to understand).

I find the meaning of instance_occurrence and insertion_order hard to remember (especially wrt whether they are 0-based or 1-based). I don't have any great alternative, but maybe something like match_index and link_index (0-based)?

> What do you think about using link_text instead of phrase_to_link and score instead of probability? Those feel like more standard names to me (although the current ones are also easy to understand).

Sounds good to me.

> I find the meaning of instance_occurrence and insertion_order hard to remember (especially wrt whether they are 0-based or 1-based). I don't have any great alternative, but maybe something like match_index and link_index (0-based)?

That seems fine to me also.
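
If the renames proposed in this exchange are adopted, a consumer written against the old field names would need a small translation step. The sketch below is only an illustration: the mapping uses the names suggested in the comments above, while RENAMES, rename_fields, and the sample item are hypothetical and not code from mwaddlink or GrowthExperiments.

```python
# Old field name -> proposed field name, as suggested in the comments above.
RENAMES = {
    "phrase_to_link": "link_text",
    "probability": "score",
    "instance_occurrence": "match_index",
    "insertion_order": "link_index",
}


def rename_fields(item: dict) -> dict:
    """Return a copy of item using the proposed names; other keys are kept as-is."""
    return {RENAMES.get(key, key): value for key, value in item.items()}


# Made-up sample item in the original format.
old_item = {
    "phrase_to_link": "bar",
    "context_before": "ther ",
    "context_after": " in t",
    "link_target": "Bar (establishment)",
    "instance_occurrence": 1,
    "probability": 0.87,
    "insertion_order": 0,
}
new_item = rename_fields(old_item)
```

Because instance_occurrence already counts the occurrences before the target, its value carries over unchanged to a 0-based match_index.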

Change 650467 merged by jenkins-bot:
[research/mwaddlink@main] Update API output format

https://gerrit.wikimedia.org/r/650467

kostajh moved this task from Code Review to QA on the Growth-Team (Current Sprint) board.

Change 662092 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[research/mwaddlink@main] Small changes to output format

https://gerrit.wikimedia.org/r/662092

Change 662094 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Follow mwaddlink output changes

https://gerrit.wikimedia.org/r/662094

Tgr moved this task from QA to Code Review on the Growth-Team (Current Sprint) board.

Reopening for the last round of format changes.

Change 662092 merged by jenkins-bot:
[research/mwaddlink@main] Small changes to output format

https://gerrit.wikimedia.org/r/662092

Change 662094 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Follow mwaddlink output changes

https://gerrit.wikimedia.org/r/662094

Not sure if this is worth QA testing but I'll leave that for someone else to decide.