The machine learning model used in Add-Link-Structured-Task often makes erroneous suggestions that appear to ignore the capitalization of the text being linked.
For example, @Rich_Farmbrough reports that "Listen to the Music" (an album) has been suggested over the phrase "listen to the music". In another example, "Secondary school" was suggested over the words "Secondary School" in "Had his primary education at Nisuco Staff Children School Bacita and secondary education at Government Secondary School Bacita, Kwara State but obtained..." (a link that a newcomer subsequently added).
This task covers the work of training the machine learning model to make better suggestions by using the signals that capitalization provides.
Benefits:
- This could reduce the error rate of the model, particularly around instances of partial name links, as in the second example above. (To frame it another way, a human seeing a phrase with the capitalization xxxxxxx xx Xxxxxxxx Yyyyyyyyy Yyyyyy Xxxxxx can probably figure out that suggesting a link over the Y-words is a bad idea because all four capitalized words probably form a single multi-word term. A properly trained machine learning model could hopefully do the same.)
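The human intuition described above can be sketched as a simple heuristic feature. This is only an illustration, not the model's actual implementation: the function names `capitalized_runs` and `splits_capitalized_run` are hypothetical, tokens are assumed to be whitespace-separated words with punctuation already removed, and sentence-initial capitals are not treated specially.

```python
def capitalized_runs(tokens):
    """Return (start, end) index pairs for maximal runs of capitalized tokens."""
    runs, start = [], None
    for i, tok in enumerate(tokens):
        if tok[:1].isupper():
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(tokens)))
    return runs


def splits_capitalized_run(tokens, span_start, span_end):
    """True if the candidate link span [span_start, span_end) covers only
    part of a run of two or more capitalized tokens (hypothetical feature)."""
    for run_start, run_end in capitalized_runs(tokens):
        overlaps = span_start < run_end and span_end > run_start
        covers_whole = span_start <= run_start and span_end >= run_end
        if overlaps and not covers_whole and run_end - run_start > 1:
            return True
    return False


tokens = "secondary education at Government Secondary School Bacita".split()
# Candidate span over "Secondary School" (tokens 4 and 5) sits inside the
# longer capitalized run "Government Secondary School Bacita".
partial_link = splits_capitalized_run(tokens, 4, 6)
whole_name = splits_capitalized_run(tokens, 3, 7)
```

A signal like this could be one input to the model rather than a hard rule, since the capitalization in the source text may itself be wrong.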
Risks/concerns/challenges:
- Sometimes articles have incorrect capitalization, and a case-sensitive model could misjudge these instances. In particular, articles that are underdeveloped and have poor grammar are disproportionately likely to appear in "Add a Link" and are also disproportionately likely to have incorrect capitalization.
- Article titles almost always begin with a capital letter, which introduces a challenge when compiling the training data for the model. One possible solution is to use the label of the Wikidata item linked to the article rather than the article title itself, since Wikidata item labels aren't supposed to be capitalized unless they're proper nouns (e.g. the English Wikipedia article "House" has the Wikidata item label "house"). However, some Wikidata item labels may be erroneously capitalized, introducing errors into the data.
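As a rough sketch of how the Wikidata-label idea might turn into a case-aware signal, the snippet below compares a candidate anchor text against a label. It is an assumption-laden illustration: `case_agreement` and its category names are hypothetical, the labels are hard-coded rather than fetched from Wikidata, and the labels shown for the album and for "secondary school" are assumed, not verified.

```python
def case_agreement(label, anchor_text):
    """Classify how the anchor text's capitalization relates to the label.

    Hypothetical categories:
      "exact"            - identical, including case
      "initial_cap_only" - lowercase label, anchor capitalizes only its first
                           letter (plausibly a sentence-initial word)
      "case_mismatch"    - same letters, but case differs in some other way
      "different_text"   - not the same words at all
    """
    if anchor_text == label:
        return "exact"
    if anchor_text.lower() != label.lower():
        return "different_text"
    if label == label.lower() and anchor_text == label[:1].upper() + label[1:]:
        return "initial_cap_only"
    return "case_mismatch"


# Labels below are assumed for illustration only.
examples = [
    ("house", "house"),
    ("house", "House"),
    ("Listen to the Music", "listen to the music"),
    ("secondary school", "Secondary School"),
]
results = [case_agreement(label, text) for label, text in examples]
```

Because labels can be erroneously capitalized, as noted above, a category such as "case_mismatch" is probably better used as a feature the model weighs than as a filter that discards candidates outright.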
