
Introduce case sensitivity to machine learning model for Add a Link
Open, Needs Triage, Public

Description

The machine learning model being used in Add-Link-Structured-Task often makes erroneous suggestions that appear to reflect a lack of knowledge of case sensitivity.

For example, @Rich_Farmbrough reports that Listen to the Music (an album) has been suggested over the phrase "listen to the music". In another example, Secondary school was suggested over the words in "Had his primary education at Nisuco Staff Children School Bacita and secondary education at Government Secondary School Bacita, Kwara State but obtained..." (something a newcomer subsequently did).


This task captures the work around training the machine learning model to make better suggestions in light of the signals case sensitivity provides.

Benefits:

  • This could reduce the error rate of the model, particularly around instances of partial name links, as in the second example above. (To frame it another way, a human seeing a phrase with the capitalization xxxxxxx xx Xxxxxxxx Yyyyyyyyy Yyyyyy Xxxxxx can probably figure out that suggesting a link over the Y-words is a bad idea because all four capitalized words probably form a single multi-word term. A properly trained machine learning model could hopefully do the same.)

Risks/concerns/challenges:

  • Sometimes articles have bad capitalization, and case sensitivity could lead the model to misjudge these instances. In particular, articles that are underdeveloped and have poor grammar are disproportionately likely to appear in "Add a Link" and are also disproportionately likely to have bad capitalization.
  • Article titles almost always begin with capitalization, which introduces a challenge when compiling the data to train the model with. One possible solution is to use the Wikidata item label linked to the article rather than the article title itself, since Wikidata items aren't supposed to capitalize item labels unless they're proper nouns (e.g. the English Wikipedia article "House" has the Wikidata item label "house"). However, some Wikidata item labels may be erroneously capitalized, introducing errors into the data.

Event Timeline


@Kerry_Raymond has independently raised this issue and provided some further examples in this thread.

I'm sharing an analysis on case-insensitivity on enwiki and simplewiki.

I use the recommendations that were rejected/accepted since June 1, 2025.

This analysis does not consider recommendations that users have skipped or have not yet seen.

  • The root cause of case-insensitive recommendations is the anchors (the link text in recommendations). We always lower-case anchors in model training and inference. This has an advantage and a disadvantage:
    • The advantage is that an anchor at the beginning of a sentence, which has an upper-case first letter, is treated the same as the same anchor mid-sentence, e.g., "Concert of the band" and "The band's concert."
    • The disadvantage is that casing can reveal the context, and lower-casing discards it, e.g., "the independent people" versus "The Independent" newspaper.
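To make the trade-off concrete, here is a minimal sketch of an anchor lookup built both ways. The function names and the toy (anchor, target) pairs are assumptions for illustration only and are not taken from the Add a Link codebase.

```python
from collections import defaultdict

def build_anchor_index(pairs, case_sensitive):
    """Map each anchor text to the set of target titles it has linked to."""
    index = defaultdict(set)
    for anchor, target in pairs:
        key = anchor if case_sensitive else anchor.lower()
        index[key].add(target)
    return index

def lookup(index, text, case_sensitive):
    """Return the candidate targets for a piece of text (empty set if none)."""
    key = text if case_sensitive else text.lower()
    return index.get(key, set())

# Toy (anchor, target) pairs, invented for this example.
pairs = [
    ("The Independent", "The Independent"),
    ("Concert", "Concert"),
    ("concert", "Concert"),
]

lowercased = build_anchor_index(pairs, case_sensitive=False)
cased = build_anchor_index(pairs, case_sensitive=True)

# Lower-cased index: the ordinary phrase still hits the newspaper.
print(lookup(lowercased, "the independent", False))  # {'The Independent'}
# Case-sensitive index: the ordinary phrase no longer matches.
print(lookup(cased, "the independent", True))        # set()
# The case-sensitive index needs separate entries for "Concert" and "concert".
print(len(lowercased), len(cased))                   # 2 3
```

The last line also hints at why a case-sensitive anchor dictionary would be larger than the lower-cased one.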

Before exploring possible next steps, I want to share some statistics:

  • Among all accepted/rejected actions, the acceptance rate is 87%.
  • Among all accepted/rejected actions where the link text matches the title only when case-insensitive, the acceptance rate is 83%. Case-insensitivity lowers the acceptance rate. However, we don't have data about the other way around.
  • Among all accepted/rejected actions, 3% of the recommendations have a case-insensitive variant. Therefore, the scope is limited.
  • In enwiki, 5% of the recommendations are not exact matches between the link and the target page, regardless of case sensitivity.
  • Among all accepted/rejected actions where the link text matches the title only when case-insensitive, 33% of the recommendations have a case-insensitive variant.
  • The Independent article is another good example. Among the recommendations for "The Independent" article, the acceptance rate of the link texts:
    • "the independent": 38%.
    • "The Independent": 50%.
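For readers who want to reproduce this kind of breakdown, the grouping can be sketched roughly as follows. The record layout, the category names, and the toy records are assumptions made for this example; they are not the real data or the real analysis code. The sketch also ignores MediaWiki's first-letter title normalization.

```python
def match_type(anchor, title):
    """Classify how the link text relates to the target title."""
    if anchor == title:
        return "exact"
    if anchor.lower() == title.lower():
        return "case_insensitive_only"
    return "other"

def acceptance_rates(records):
    """Compute accepted / (accepted + rejected) for each match type."""
    totals, accepted = {}, {}
    for anchor, title, was_accepted in records:
        kind = match_type(anchor, title)
        totals[kind] = totals.get(kind, 0) + 1
        accepted[kind] = accepted.get(kind, 0) + int(was_accepted)
    return {kind: accepted[kind] / totals[kind] for kind in totals}

# Toy records (anchor, target title, accepted?), invented for illustration.
records = [
    ("The Independent", "The Independent", True),
    ("The Independent", "The Independent", False),
    ("the independent", "The Independent", True),
    ("the independent", "The Independent", False),
    ("the independent", "The Independent", False),
    ("the independent", "The Independent", False),
    ("UK", "United Kingdom", True),
]

rates = acceptance_rates(records)
print(rates)
```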

Possible next steps:

  • We can make the anchors case-sensitive and train/evaluate models. This experiment should give a general idea of how the models are improved and in which cases we decrease the quality of the recommendations. We should also reflect the changes in the inference. We can expect the anchors (8M) and bloom filters to increase in size.
  • This issue also relates to a context problem. One of our features is the similarity between the pages. Using the context in a page could be too broad for some cases. We can lower the scope to sections or paragraphs.
    • We should use the same scope on both sides of the comparison.
      • Example: if we use the first paragraph of the target article,
      • then we compare it with the paragraph where the link text occurs.
    • Using lower-level embeddings will increase the number of embeddings. We should review the average number of sections and paragraphs of articles.
  • We may apply this change to a pre-defined set of wikis.
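The scope idea can be illustrated with a toy similarity. This sketch uses bag-of-words vectors and cosine similarity as a stand-in for the real embeddings, and the texts are invented, so it only shows the mechanics of comparing a paragraph versus a whole page against the target's first paragraph.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector; a stand-in for a real embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Invented texts: the target's first paragraph and a two-paragraph source page.
target_first_paragraph = "british online newspaper"
paragraph_with_link = "column for a british newspaper"
other_paragraph = "born on a farm in kenya"
whole_page = paragraph_with_link + " " + other_paragraph

paragraph_score = cosine(bow(paragraph_with_link), bow(target_first_paragraph))
page_score = cosine(bow(whole_page), bow(target_first_paragraph))

# Unrelated paragraphs dilute the page-level score in this toy case.
print(round(paragraph_score, 3), round(page_score, 3))
```

In this toy case the paragraph-level score is higher because the unrelated paragraph adds terms that do not appear in the target. Whether that holds for real embeddings is exactly what the evaluation would need to show.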

My short-term suggestion is to make anchors case-sensitive and train/evaluate models, so that we can analyse the cases where performance increases or decreases.
My long-term suggestion is to add the similarity between lower-level embeddings (e.g. paragraphs) as an additional feature.

Thanks, @OKarakaya-WMF!

My short term suggestion is to make anchors case-sensitive and train/evaluate models. So that, we can analyse case where the performance increase/decrease.

What level of effort would that be?
If it's limited work, then perhaps we can give this a try for the enwiki model and look at the precision and recall metrics and I can help gather community feedback?

Or if this sounds like significant effort, perhaps we should wait and only consider this type of change once we shift to the V2 model?

FYI @Trizek-WMF (since you asked about this today).

Thank you Kirsten. So far, I only heard about this at English Wikipedia.

Hello @KStoller-WMF and @Trizek-WMF,

I think it should not be complicated to get offline scores for enwiki.
I'll get back to this and share the results soon after finishing some other tasks.
Please feel free to let me know if we should increase the priority.

@Chipmunkdavis has pointed to an additional instance of a partial name link in this edit.

Thank you both, @Sdkb and @Chipmunkdavis, for reporting this issue.

I think making anchors case-sensitive and training/evaluating models will not fix this case.

Addalink first generates candidate links by creating n-grams (pairs, triples, ... of words).

So both:
Agricultural Bank of France
Bank of France

are candidate links. As there is no page for Agricultural Bank of France, Bank of France received some similarity with other pages and was recommended.

Excited by this proposal, he settled in Molins de Rei and resigned from the Agricultural Bank of France by letter.
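A rough sketch of this candidate step, for illustration only: the function name, the `max_n` limit, and the tiny title set are assumptions, not the actual Addalink implementation.

```python
def ngram_candidates(text, max_n=4):
    """Generate every run of 1..max_n consecutive words as a candidate link text."""
    words = text.replace(".", "").replace(",", "").split()
    candidates = []
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            candidates.append(" ".join(words[start:start + n]))
    return candidates

# Hypothetical set of existing page titles.
titles = {"Bank of France", "Molins de Rei"}

sentence = ("Excited by this proposal, he settled in Molins de Rei and "
            "resigned from the Agricultural Bank of France by letter.")

candidates = ngram_candidates(sentence)
matches = [c for c in candidates if c in titles]

# Both the full name and the partial name are generated as candidates...
print("Agricultural Bank of France" in candidates)  # True
# ...but only the partial name has a page, so it is the one that can be recommended.
print(matches)  # ['Molins de Rei', 'Bank of France']
```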

I think the best solution would be to generate candidates from named entities rather than n-grams (or to use NER as an additional filter on the n-grams).
That way, Agricultural Bank of France will be an entity and we will make predictions only for Agricultural Bank of France.
I think it's easier to test NER as an additional filter.

I'll add this to the items to test/evaluate.
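As a very rough idea of what the filtering variant could look like, the sketch below uses a capitalization heuristic as a stand-in for a real NER model: it finds runs of capitalized words (allowing a few lowercase connectors) and rejects any candidate that is only part of such a run. The helper names and the connector list are assumptions for this example, and a real NER component would behave differently.

```python
import re

# Lowercase words allowed inside a multi-word name (illustrative, incomplete).
CONNECTORS = {"of", "de", "the", "and"}

def capitalized_runs(words):
    """Return (start, end) word-index pairs for runs of capitalized words."""
    runs = []
    i = 0
    while i < len(words):
        if not words[i][0].isupper():
            i += 1
            continue
        last_cap = i
        j = i + 1
        while j < len(words) and (words[j][0].isupper() or words[j] in CONNECTORS):
            if words[j][0].isupper():
                last_cap = j
            j += 1
        runs.append((i, last_cap + 1))  # a run always ends on a capitalized word
        i = last_cap + 1
    return runs

def is_partial_name(span, runs):
    """True if the span lies inside a capitalized run without covering all of it."""
    start, end = span
    return any(
        run_start <= start and end <= run_end and (start, end) != (run_start, run_end)
        for run_start, run_end in runs
    )

def find_span(words, phrase):
    """Locate a phrase in the word list and return its (start, end) indices."""
    target = phrase.split()
    for start in range(len(words) - len(target) + 1):
        if words[start:start + len(target)] == target:
            return (start, start + len(target))
    return None

sentence = ("Excited by this proposal, he settled in Molins de Rei and "
            "resigned from the Agricultural Bank of France by letter.")
words = re.findall(r"\w+", sentence)
runs = capitalized_runs(words)

for phrase in ("Bank of France", "Agricultural Bank of France", "Molins de Rei"):
    verdict = "skip" if is_partial_name(find_span(words, phrase), runs) else "keep"
    print(phrase, "->", verdict)
```

The heuristic has obvious gaps (sentence-initial words, "and" joining two separate names, badly capitalized articles), which is part of why a trained NER model is the more likely candidate for the actual test.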

Thanks for that clarification, @OKarakaya-WMF! The partial name link issue (which the Agricultural Bank of France edit is an example of) is really the main capitalization issue we've been encountering — it's also what happened in the examples I cited above reported by Rich Farmbrough and Kerry Raymond.

Glad to hear that there is a potential solution. If there is a separate Phabricator task created for it, please add me.

@OKarakaya-WMF thanks so much for the analysis you provided above re: opportunity sizing for case sensitivity! I'm curious if we have anything comparable for named entity recognition. I have two questions:
(1) Re: "In enwiki, 5% of the recommendations are not exact matches between the link and the target page, regardless of case sensitivity." I'm wondering how much of that 5% can be attributed to the lack of named entity recognition.
(2) Do we have data to suggest that this might be an even more pervasive problem in non-English wikis, as (I believe) was the case for country and continent names?

(2) Do we have data to suggest that this might be an even more pervasive problem in non-English wikis, as (I believe) was the case for country and continent names?

We simply receive a lot more feedback from enwiki, but my guess is that this is an issue for all languages that include capitalization, while languages written in scripts without case distinction won't be impacted.

@IZapico-WMF - do you remember if this was an issue for eswiki in the past?