Add a link in bnwiki: algorithm improvements: articles are not being suggested at their first appearance
Open, Medium, Public

Assigned To

None

Authored By

	Ankan_WMF
	May 26 2021, 12:33 PM

Description

In bnwiki, some link suggestions are attached to a later occurrence of a word instead of its first appearance in the article.

For example:
(i) In the ফ্রেড বারাট article, the word "ইনিংস" is suggested at its 4th appearance
(ii) In the সৌদিয়া article, the word "মধ্যপ্রাচ্য" is suggested at its 2nd appearance
(iii) In the স্ট্যাচু অব লিবার্টি article, the word "তামা" is suggested at its 2nd appearance
(iv) In the সুপারহিরো article, the word "কমিক বই" is suggested at its 3rd appearance
(v) In the অ্যান্থনি বেবিংটন article, the word "যুক্তরাজ্য" is suggested at its 2nd appearance

Screenshots of the first two examples (I marked the first appearances in red):

An observation: in each of these cases, the word carries some additional letters (a suffix) at its first appearance. In example (iii), the first appearance was "তামার", and the article "তামা" was not suggested there.

Related Objects

		Status	Subtype	Assigned	Task
		Open		KStoller-WMF	T276517 [EPIC] Growth: "add a link" structured task 3.0
		Open		None	T283715 Add a link in bnwiki: algorithm improvements: articles are not being suggested at their first appearance

Event Timeline

Ankan_WMF created this task. May 26 2021, 12:33 PM

Restricted Application added a project: Growth-Team. May 26 2021, 12:33 PM

Restricted Application added a subscriber: Aklapper.

Ankan_WMF updated the task description. May 26 2021, 12:36 PM

cc @MGerlach. This is the data for the first article. You can see that for ইনিংস the mwaddlink tool has told us to link the fourth occurrence (match_index: 3):

[
  {
    "link_text": "প্রথম বিশ্বযুদ্ধের",
    "link_target": "প্রথম বিশ্বযুদ্ধ",
    "match_index": 0,
    "wikitext_offset": 4878,
    "score": 0.7330302000045776,
    "context_before": "\n",
    "context_after": " কারণে ফ্র",
    "link_index": 0
  },
  {
    "link_text": "ইনিংস",
    "link_target": "ইনিংস",
    "match_index": 3,
    "wikitext_offset": 5915,
    "score": 0.6464713215827942,
    "context_before": "১৯১৯ সালে ",
    "context_after": " প্রতি ১৬ ",
    "link_index": 1
  },
  {
    "link_text": "বোলিং গড়ের",
    "link_target": "বোলিং গড়",
    "match_index": 0,
    "wikitext_offset": 7276,
    "score": 0.541901707649231,
    "context_before": " তার সেরা ",
    "context_after": " অধিকারী হ",
    "link_index": 2
  },
  {
    "link_text": "ওরচেস্টারশায়ারের",
    "link_target": "ওরচেস্টারশায়ার কাউন্টি ক্রিকেট ক্লাব",
    "match_index": 0,
    "wikitext_offset": 9122,
    "score": 0.7601364850997925,
    "context_before": "্ট ব্রিজে ",
    "context_after": " বিপক্ষে ৮",
    "link_index": 3
  },
  {
    "link_text": "মিডলসেক্সের",
    "link_target": "মিডলসেক্স কাউন্টি ক্রিকেট ক্লাব",
    "match_index": 0,
    "wikitext_offset": 13987,
    "score": 0.5254502892494202,
    "context_before": "নিটে ৯৪ ও ",
    "context_after": " বিপক্ষে ৭",
    "link_index": 4
  }
]
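
To spot such cases in a response, one could filter for suggestions whose match_index is greater than 0. This is only a sketch: the list below copies a subset of the fields from the data above, and the helper name later_occurrences is hypothetical, not part of mwaddlink.

```python
# Subset of the fields from the mwaddlink response above (first three entries).
suggestions = [
    {"link_text": "প্রথম বিশ্বযুদ্ধের", "link_target": "প্রথম বিশ্বযুদ্ধ", "match_index": 0},
    {"link_text": "ইনিংস", "link_target": "ইনিংস", "match_index": 3},
    {"link_text": "বোলিং গড়ের", "link_target": "বোলিং গড়", "match_index": 0},
]


def later_occurrences(items):
    """Return the suggestions that are not attached to the first occurrence."""
    return [s for s in items if s["match_index"] > 0]


for s in later_occurrences(suggestions):
    # match_index is zero-based, so 3 means the fourth occurrence.
    print(s["link_text"], "-> occurrence number", s["match_index"] + 1)
```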

kostajh assigned this task to MGerlach. May 26 2021, 1:22 PM

kostajh triaged this task as Medium priority.

kostajh moved this task from Backlog to May 24 – May 28 on the Add-Link board.

kostajh moved this task from Inbox to Upcoming Work on the Growth-Team board. May 26 2021, 7:41 PM

kostajh moved this task from May 24 – May 28 to Post-release backlog on the Add-Link board. May 27 2021, 12:19 PM

MGerlach mentioned this in T272731: In-depth analysis of link-recommendation model. May 27 2021, 3:12 PM

I can reproduce that behaviour for the first example (article: ফ্রেড_বারাট, anchor: ইনিংস). From what I can see, it comes from the difference between the strings "ইনিংস" and "ইনিংসে".

"ইনিংস" is in our anchor dictionary (other articles used it as an anchor). We then look for that anchor in the article using exact string matching. By this rule, mwaddlink finds the correct match. There are 3 candidate occurrences before that first exact match, as mentioned above; however, they are the slightly different string "ইনিংসে". Since this is a different character sequence, we do not count it as a match.

@Ankan_WMF would you consider ইনিংস and ইনিংসে the same word? (Sorry, I am not familiar with the script.) Do you have a suggestion for how we could determine that both forms could be linked? Is this similar to English, where we might have a known anchor "bridge" (with the article Bridge) but the text only contains the word "bridges"? In that case our algorithm would skip "bridges" because it is not an exact match for "bridge", and would miss that opportunity to link. English has techniques for dealing with this (stemming), but we do not apply them because they are extremely language-specific. Perhaps there are similar approaches we could use for bnwiki?
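
To make the difference concrete, here is a minimal sketch of both behaviours: whole-word exact matching, and a suffix-tolerant variant. It is not the mwaddlink implementation. The tokenizer, the function names, and the suffix list are assumptions for illustration; the three suffixes are taken only from the examples in this task (ইনিংসে, তামার, বাংলাদেশের). The sketch also assumes that match_index counts substring occurrences of the anchor before the linked position, which would explain why the value is 3 here.

```python
import re

# Assumed tokenizer: runs of characters that are not whitespace or punctuation.
TOKEN = re.compile(r"[^\s।,.;:!?()]+")

# Hypothetical suffix list: U+09C7 (ে), U+09B0 (র), and the two combined (ের).
SUFFIXES = ("\u09c7", "\u09b0", "\u09c7\u09b0")


def first_exact_match(text, anchor):
    """Return (offset, match_index) of the first token equal to anchor, or None.

    match_index counts substring occurrences of anchor before that offset,
    so inflected forms that contain the anchor are counted but not linked.
    """
    for m in TOKEN.finditer(text):
        if m.group() == anchor:
            return m.start(), text.count(anchor, 0, m.start())
    return None


def first_inflected_match(text, anchor, suffixes=SUFFIXES):
    """Return (offset, token) of the first token equal to anchor or anchor + suffix."""
    allowed = {anchor} | {anchor + s for s in suffixes}
    for m in TOKEN.finditer(text):
        if m.group() in allowed:
            return m.start(), m.group()
    return None


anchor = "ইনিংস"
inflected = anchor + "\u09c7"  # ইনিংসে
text = " ".join([inflected, inflected, inflected, anchor])

print(first_exact_match(text, anchor))      # skips the three inflected forms
print(first_inflected_match(text, anchor))  # stops at the first inflected form
```

A fixed suffix list like this would also accept unrelated words that merely start with the anchor and end in one of the suffixes, so it would need review by someone who knows the language before being used.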

Curious if you have any thoughts or ideas as I am not sure how to approach this issue at the moment.

Yes, it's similar to the English example you provided. Bridge > Bridge+s; here it is ইনিংস > ইনিংস+ে = ইনিংসে. ইনিংস and ইনিংসে are slightly different: the first is simply "innings", whereas the second (with an additional "ে") means "in innings". In Bengali text we use 'kar' (dependent vowel signs), which are often attached to the letters of the root word.
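
At the string level, the composition above amounts to one extra code point appended to the root. The short check below illustrates that; the variable names are only for illustration.

```python
import unicodedata

root = "ইনিংস"
inflected = root + "\u09c7"  # append the vowel sign ে to get ইনিংসে

print(unicodedata.name("\u09c7"))   # BENGALI VOWEL SIGN E
print(inflected == root)            # False: an exact comparison fails
print(inflected.startswith(root))   # True: the root is a prefix
print(len(inflected) - len(root))   # 1: exactly one additional code point
```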

From what I have observed, the algorithm finds the root word in most cases. I have seen other examples where the article is "বাংলাদেশ" and the word in the text is "বাংলাদেশের". The algorithm correctly detected the underlying root word and suggested the "বাংলাদেশ" article. In the এয়ার কানাডা article, the second suggestion is the "প্রশান্ত মহাসাগর" article. Although the words that appear in the text are "প্রশান্ত মহাসাগরের", the algorithm detected the correct article. However, in the examples I provided above, it couldn't find the root words.

If the algorithm uses exact string matching, then it makes sense that it couldn't suggest some words at their first appearance. But I am confused, because most of the time the algorithm suggests the right article even when the word has additional letters attached to the original string.
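
One possible explanation, consistent with the anchor dictionary described earlier and with the response data (where link_text "প্রথম বিশ্বযুদ্ধের" maps to link_target "প্রথম বিশ্বযুদ্ধ"), is that some inflected forms are themselves dictionary entries because other articles already use them as link text. This is not confirmed in this thread. The dictionary contents and the lookup function below are hypothetical.

```python
country = "বাংলাদেশ"
innings = "ইনিংস"

# Hypothetical anchor dictionary: link text seen in existing articles -> target title.
anchor_dictionary = {
    country: country,
    country + "\u09c7\u09b0": country,  # বাংলাদেশের, assumed to occur as link text elsewhere
    innings: innings,                   # only the bare form is assumed to be present
}


def lookup(token):
    """Exact dictionary lookup; returns the target title or None."""
    return anchor_dictionary.get(token)


print(lookup(country + "\u09c7\u09b0"))  # found: the inflected form is itself an entry
print(lookup(innings + "\u09c7"))        # None: ইনিংসে is not an entry
```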

kostajh moved this task from Post-release backlog to Backlog on the Add-Link board. Jun 7 2021, 7:26 AM

Here's a similar example from huwiki: őssejt is linked, but the declined form, őssejtekből, already appears in the previous sentence.

(There's another stemming-related error in the same edit: from the term hemopoetikus őssejt (hematopoietic stem cell), only the second word is linked, even though there is a more relevant article to link, hemopoetikus őssejtek. It is not found because the title is in the plural. Maybe we should have a more generic task about better stemming.)

MMiller_WMF edited parent tasks, added: T276517: [EPIC] Growth: "add a link" structured task 3.0; removed: T252822: [EPIC] Growth: "add a link" structured task 1.0. Nov 9 2021, 7:42 PM

MShilova_WMF moved this task from Upcoming Work to Triaged on the Growth-Team board. Nov 7 2022, 6:25 PM

Removing myself as assignee as the task has not been prioritized and thus no immediate work is planned.
