Add a link in bnwiki: algorithm improvements: articles are not being suggested at their first appearance
Open, Medium, Public

Description

In bnwiki, some articles are suggested at a later occurrence of the word rather than at its first appearance.

For example:
(i) In the ফ্রেড বারাট article, the word "ইনিংস" is suggested at its 4th appearance
(ii) In the সৌদিয়া article, the word "মধ্যপ্রাচ্য" is suggested at its 2nd appearance
(iii) In the স্ট্যাচু অব লিবার্টি article, the word "তামা" is suggested at its 2nd appearance
(iv) In the সুপারহিরো article, the word "কমিক বই" is suggested at its 3rd appearance
(v) In the অ্যান্থনি বেবিংটন article, the word "যুক্তরাজ্য" is suggested at its 2nd appearance

Screenshots of the first two examples (I marked the first appearances in red):

appearance_1.png (794×1 px, 262 KB)
appearance_2.png (574×1 px, 148 KB)

An observation: in each case, the word carries some additional letters in its first appearance. In example (iii), the first appearance was "তামার", and the article "তামা" was not suggested there.

Event Timeline

cc @MGerlach. This is the data for the first article; you can see that for ইনিংস the mwaddlink tool has told us to link the fourth occurrence (match_index: 3):

[
  {
    "link_text": "প্রথম বিশ্বযুদ্ধের",
    "link_target": "প্রথম বিশ্বযুদ্ধ",
    "match_index": 0,
    "wikitext_offset": 4878,
    "score": 0.7330302000045776,
    "context_before": "\n",
    "context_after": " কারণে ফ্র",
    "link_index": 0
  },
  {
    "link_text": "ইনিংস",
    "link_target": "ইনিংস",
    "match_index": 3,
    "wikitext_offset": 5915,
    "score": 0.6464713215827942,
    "context_before": "১৯১৯ সালে ",
    "context_after": " প্রতি ১৬ ",
    "link_index": 1
  },
  {
    "link_text": "বোলিং গড়ের",
    "link_target": "বোলিং গড়",
    "match_index": 0,
    "wikitext_offset": 7276,
    "score": 0.541901707649231,
    "context_before": " তার সেরা ",
    "context_after": " অধিকারী হ",
    "link_index": 2
  },
  {
    "link_text": "ওরচেস্টারশায়ারের",
    "link_target": "ওরচেস্টারশায়ার কাউন্টি ক্রিকেট ক্লাব",
    "match_index": 0,
    "wikitext_offset": 9122,
    "score": 0.7601364850997925,
    "context_before": "্ট ব্রিজে ",
    "context_after": " বিপক্ষে ৮",
    "link_index": 3
  },
  {
    "link_text": "মিডলসেক্সের",
    "link_target": "মিডলসেক্স কাউন্টি ক্রিকেট ক্লাব",
    "match_index": 0,
    "wikitext_offset": 13987,
    "score": 0.5254502892494202,
    "context_before": "নিটে ৯৪ ও ",
    "context_after": " বিপক্ষে ৭",
    "link_index": 4
  }
]
kostajh triaged this task as Medium priority.
kostajh moved this task from Backlog to May 24 – May 28 on the Add-Link board.

I can reproduce that behaviour for the first example (article: ফ্রেড_বারাট, anchor: ইনিংস). From what I can see, this comes from the difference between the strings "ইনিংস" and "ইনিংসে".

"ইনিংস" is in our anchor dictionary (other articles used it as an anchor). We then look for it in the article using exact string matching. According to this rule, mwaddlink finds the correct match. There are 3 possible anchors before that first exact occurrence, as mentioned above; however, these are the slightly different string "ইনিংসে". Since this is a different character sequence, we do not count it as a match.

@Ankan_WMF would you consider ইনিংস and ইনিংসে the same word? (Sorry, I am not familiar with the script.) Do you have a suggestion for how we could determine that both can be used? Is this similar to English, where, for example, we have a known anchor "bridge" (with the article Bridge) but the text only contains the word "bridges"? In that case our algorithm would skip "bridges" because it is not an exact match for "bridge", and would miss the opportunity to link. English has ways to deal with this (stemming), though we do not apply them because they are extremely language-specific. Perhaps there are similar approaches we could use for bnwiki?

Curious if you have any thoughts or ideas as I am not sure how to approach this issue at the moment.
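
To make the difference between exact matching and a suffix-stripping fallback concrete, here is a minimal Python sketch. It is not mwaddlink's code: the whitespace tokenisation, the function names `strip_suffix` and `first_match`, the three-entry suffix list and the sample text are all assumptions made up for illustration, and real Bengali inflection is far richer than this.

```python
# Hypothetical suffix list, longest first. Illustration only.
SUFFIXES = ("ের", "ে", "র")


def strip_suffix(token):
    """Return the token without one known suffix, or the token unchanged."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token


def first_match(text, anchor, use_stemming=False):
    """Return the index of the first whitespace-separated token matching anchor.

    With use_stemming=False only identical strings match (exact matching).
    With use_stemming=True a token also matches if stripping one suffix
    from it yields the anchor.
    """
    for index, token in enumerate(text.split()):
        if token == anchor:
            return index
        if use_stemming and strip_suffix(token) == anchor:
            return index
    return None


# Synthetic text (not taken from the article): three inflected forms, then the bare form.
text = "ইনিংসে ইনিংসে ইনিংসে ইনিংস"
anchor = "ইনিংস"

exact = first_match(text, anchor)                        # skips the inflected forms
stemmed = first_match(text, anchor, use_stemming=True)   # accepts the first one
print(exact, stemmed)
```

With exact matching the first hit is the fourth token, which mirrors the reported behaviour; with the fallback it is the first token. A suffix list this naive would also over-strip words that merely end in one of these characters, which is part of why such rules are language-specific.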

Yes, it's similar to the example you provided in English. Bridge > Bridge+s; here it is ইনিংস > ইনিংস+ে = ইনিংসে. ইনিংস and ইনিংসে are slightly different: the first is simply "innings", whereas the second (with the additional "ে") means "in innings". In Bengali text we use 'kar' (vowel signs), which are often attached to the letters of the root word.
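
For readers unfamiliar with the script, the decomposition ইনিংস + ে can be checked at the code-point level. The snippet below is only an illustration using Python's standard `unicodedata` module; the variable names are arbitrary.

```python
import unicodedata

root = "ইনিংস"
inflected = "ইনিংসে"

# The inflected form is the root followed by extra code points.
suffix = inflected[len(root):]
names = [unicodedata.name(ch) for ch in suffix]

print(inflected.startswith(root))  # the root is a strict prefix
print(names)                       # Unicode names of the added code points
```

Because the root is a plain prefix of the inflected form, the two are different strings for an exact comparison even though they share the same root word.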

From what I have observed, the algorithm can find the root word in most cases. I have seen other examples: where the article is "বাংলাদেশ" and the word in the text is "বাংলাদেশের", the algorithm correctly detected the underlying root word and suggested the "বাংলাদেশ" article. In the এয়ার কানাডা article, the second suggestion is the "প্রশান্ত মহাসাগর" article. Although the words that appear in the text are "প্রশান্ত মহাসাগরের", the algorithm detected the correct target. However, in the examples I provided above, it couldn't find the root words.

If the algorithm uses exact string matching, then it makes sense that it couldn't suggest some words at their first appearance. But I am confused, as most of the time the algorithm suggests correctly even when the original string has additional letters.
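
One possible reading, offered as an assumption rather than a confirmed explanation: the JSON earlier in this task pairs the link_text "প্রথম বিশ্বযুদ্ধের" with the link_target "প্রথম বিশ্বযুদ্ধ", and the anchor dictionary is described as built from anchors used in other articles. If inflected forms that editors have already used as link text are themselves dictionary keys, exact matching would succeed for them without any root-word detection. The dictionary contents and the `suggest` helper below are hypothetical.

```python
# Hypothetical anchor dictionary: surface form seen as link text -> target article.
# The first pair is taken from the JSON in this task; the rest is illustrative.
anchor_dict = {
    "প্রথম বিশ্বযুদ্ধের": "প্রথম বিশ্বযুদ্ধ",  # inflected form already used as link text
    "ইনিংস": "ইনিংস",                          # only the bare form is known
}


def suggest(phrase):
    """Exact lookup of a surface form; returns the target title or None."""
    return anchor_dict.get(phrase)


print(suggest("প্রথম বিশ্বযুদ্ধের"))  # found: the inflected form is itself a key
print(suggest("ইনিংসে"))            # None: this inflected form is not a key
```

Under this assumption, whether an inflected word gets a suggestion depends on whether that exact form has been used as an anchor elsewhere, not on any stemming.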

Here's a similar example from huwiki: őssejt is linked, but the declined form, őssejtekből, already appears in the previous sentence.

(There's another stemming-related error in the same edit: from the term hemopoetikus őssejt (hematopoietic stem cell) only the second word is linked, even though there's a more relevant article to link, hemopoetikus őssejtek, but it's not found because the title is in plural. Maybe we should have a more generic task about better stemming.)

Removing myself as assignee as the task has not been prioritized and thus no immediate work is planned.