LinguaLibreBot : fail to add file to all applicable forms on Wikidata
Open, Medium, Public, BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

  • If several forms of a lexeme have the same spelling, the bot only adds the recording to the first one.

What happens?:

What should have happened instead?:
The bot should add the recording to all forms with the same spelling.

Event Timeline

What if other forms have "same spelling" but different pronunciation? (homographs but not homophones)

Is that a purely hypothetical question, or does it really happen in practice? (My guess is that it is rare and would result in orders of magnitude fewer errors than correct additions.)

If that is common, then the ExternalTools that feeds the word list needs to be remade, because when I tried to load all the forms for L46694 I was only able to record six files for the eight possible forms. However, at least in Swedish, I don't think it occurs at all.

It's not hypothetical; it happens in English and German sometimes. Some examples I can think of for English are "read" (present versus past tense) and "corps", "chassis", "chamois" (singular versus plural). For German, I recently started a list of homographs I've come across (I still have quite a few to add), which includes some on the same lexeme. For Japanese, it depends on how it will be modelled, but most kanji have multiple pronunciations, so there's definitely potential there for a lexeme to have multiple pronunciations for the same spelling.

So, I don't think it's safe to assume that two forms on the same lexeme with the same spelling will be pronounced the same in all situations. If we want to do this, I think it should take more information into account and only copy them for situations we can be relatively confident about.

For English:
I did a query for lexemes with multiple forms with the same representation. Of the 70,000 lexemes, there were about 6,200 forms which appeared multiple times. After excluding verb forms ending in -ed, there were only about 400 left, which included quite a lot of mistakes. I think we can say that when the past tense and past participle of a verb are spelt the same, they are also pronounced the same (I haven't come across any yet where that is not the case), and only automatically copy pronunciations between those. That would cover the majority of the cases while minimising the risk of adding bad data, and it would not leave too many for people to do manually.
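The English rule proposed above could be sketched roughly as follows. This is only an illustration, not LinguaLibreBot's actual code: the function name `safe_copy_targets`, the dict layout of a form, the feature labels and the form IDs are all assumptions made for the example (on Wikidata, grammatical features are item IDs rather than plain strings).

```python
# Hypothetical feature labels; real Wikidata forms use item IDs for these.
PAST_TENSE = "past tense"
PAST_PARTICIPLE = "past participle"
COPYABLE = {PAST_TENSE, PAST_PARTICIPLE}


def safe_copy_targets(forms, recorded_form_id):
    """Return the IDs of forms that may reuse the recording of recorded_form_id.

    forms: list of dicts with the keys "id", "representation" and
    "features" (a set of labels). A recording is only copied between the
    past tense and the past participle when both are spelt the same.
    """
    source = next(f for f in forms if f["id"] == recorded_form_id)
    if not (source["features"] & COPYABLE):
        return []  # the recorded form is neither past tense nor past participle
    return [
        f["id"]
        for f in forms
        if f["id"] != recorded_form_id
        and f["representation"] == source["representation"]
        and f["features"] & COPYABLE
    ]


# Made-up forms for the verb "read"; the IDs are placeholders.
read_forms = [
    {"id": "F1", "representation": "read", "features": {"present tense"}},
    {"id": "F2", "representation": "reads", "features": {"present tense"}},
    {"id": "F3", "representation": "read", "features": {PAST_TENSE}},
    {"id": "F4", "representation": "read", "features": {PAST_PARTICIPLE}},
]

print(safe_copy_targets(read_forms, "F3"))  # ['F4']
print(safe_copy_targets(read_forms, "F1"))  # []
```

With this rule, a recording of the past tense "read" is copied to the past participle but never to the present tense form, which is spelt the same and pronounced differently.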

For German:
When the accusative or dative singular of a noun is the same as the nominative singular, they are pronounced the same, and the same applies to plural forms. When the singular and plural forms are spelt the same, however, the pronunciation can be different. For adjectives, two attributive forms with the same spelling are pronounced the same, but when the attributive and predicative forms are spelt the same, the pronunciation can be different. I would have to look more into verbs, but if, for example, the first person plural and third person plural for the same tense/mood are the same, they would be pronounced the same.
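As a rough illustration, the German noun and adjective rules described above might be encoded like this. The function name, the category strings and the `number` / `use` keys are hypothetical, and verbs are deliberately left undecided because the comment says they still need more investigation.

```python
def same_pronunciation_likely(category, a, b):
    """Guess whether two identically spelt forms of one lexeme sound the same.

    a and b are dicts of grammatical features for the two forms.
    Returns False whenever the rules above do not clearly cover the case.
    """
    if category == "noun":
        # Same number (both singular or both plural): only the case differs.
        return a["number"] == b["number"]
    if category == "adjective":
        # Two attributive forms are safe; attributive versus predicative is not.
        return a["use"] == "attributive" and b["use"] == "attributive"
    # Verbs and all other categories: undecided, so nothing is copied.
    return False


nominative_sg = {"number": "singular", "case": "nominative"}
dative_sg = {"number": "singular", "case": "dative"}
nominative_pl = {"number": "plural", "case": "nominative"}

print(same_pronunciation_likely("noun", nominative_sg, dative_sg))      # True
print(same_pronunciation_likely("noun", nominative_sg, nominative_pl))  # False
```

Defaulting to False keeps the bot conservative: a form that falls outside the known rules is left for a person to record or confirm manually.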

Thanks for adding that perspective, Nikki. I guess it is too hard to configure the variations for different languages. Instead, we could start with a simple control on import: "We found several forms spelled exactly the same on some lexemes. Please uncheck the ones that are not pronounced the same as the first one." Showing the grammatical features would probably help too, since it makes it easier to tell whether the forms are pronounced differently, but I think we could try it even without that.

If nothing is unchecked, the bot can safely add it to all lexemes. If something is unchecked, the software could possibly have the reader record several variants, but it would of course need to show some more information so that the reader knows which form to change the pronunciation for and how.
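The checkbox flow suggested here could look roughly like the sketch below. It assumes the import step already knows which forms share the spelling of the recorded one; `plan_import` and the form IDs are made up for the example and are not part of Lingua Libre.

```python
def plan_import(recorded_form_id, same_spelling_form_ids, unchecked_ids):
    """Split identically spelt forms into two lists.

    attach_to: forms that receive the existing recording.
    record_separately: forms the speaker unchecked, which need their own recording.
    The recorded form itself always keeps its recording.
    """
    unchecked = set(unchecked_ids) - {recorded_form_id}
    attach_to = [recorded_form_id]
    record_separately = []
    for form_id in same_spelling_form_ids:
        if form_id == recorded_form_id:
            continue
        if form_id in unchecked:
            record_separately.append(form_id)
        else:
            attach_to.append(form_id)
    return attach_to, record_separately


# Placeholder form IDs: three forms share one spelling, the speaker unchecks one.
attach_to, record_separately = plan_import("F1", ["F1", "F2", "F3"], ["F3"])
print(attach_to)          # ['F1', 'F2']
print(record_separately)  # ['F3']
```

If nothing is unchecked, `record_separately` is empty and the recording goes to every form in the list, which matches the default behaviour proposed above.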

WikiLucas00 renamed this task from "Lingua Libre Bot does not add the file to all applicable forms" to "Lingua Libre Bot does not add the file to all applicable forms on Wikidata". Jul 18 2021, 5:49 PM
Yug triaged this task as Medium priority. Jul 6 2022, 11:21 AM
Yug renamed this task from "Lingua Libre Bot does not add the file to all applicable forms on Wikidata" to "LinguaLibreBot : fail to add file to all applicable forms on Wikidata". Jul 6 2022, 1:10 PM