
Improve LinguaLibreBot on Wikidata Lexeme
Open, Needs Triage | Public | Feature

Description

  • Audio pronunciations have to be added on every form
  • Most pronunciation-related values are stored as qualifiers of the pronunciation property (P7243) (see details)
  • P407 has to be used as a qualifier of P443 (P407 should NOT be used for Lexemes)
  • Create a lexeme when the trio language + form + POS does not exist on Wikidata. See https://ordia.toolforge.org/language/ : French only has 10,000 lexemes on Wikidata, versus about 50k forms on LL.
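One possible reading of the P443/P407 bullets, sketched in Python: the audio file goes in a P443 statement, and the P407 language qualifier is attached only when the target is an item (Q-id), never when it is a lexeme or form. The function name `build_audio_statement`, the form ID `L2330-F1` and the file name are illustrative assumptions, not taken from the bot's code; the dict layout follows the standard Wikibase JSON statement format. The P7243 qualifiers from the second bullet are not covered here.

```python
AUDIO_PROPERTY = "P443"     # pronunciation audio
LANGUAGE_PROPERTY = "P407"  # language of work or name


def build_audio_statement(entity_id, commons_filename, language_qid=None):
    """Return a Wikibase statement dict attaching an audio file to an entity.

    Hypothetical helper: the P407 qualifier is added only for items (Q-ids),
    never for lexemes or forms (L-ids).
    """
    statement = {
        "type": "statement",
        "rank": "normal",
        "mainsnak": {
            "snaktype": "value",
            "property": AUDIO_PROPERTY,
            # commonsMedia values are serialized as plain strings
            "datavalue": {"value": commons_filename, "type": "string"},
        },
    }
    if entity_id.startswith("Q") and language_qid is not None:
        statement["qualifiers"] = {
            LANGUAGE_PROPERTY: [
                {
                    "snaktype": "value",
                    "property": LANGUAGE_PROPERTY,
                    "datavalue": {
                        "value": {
                            "entity-type": "item",
                            "numeric-id": int(language_qid[1:]),
                            "id": language_qid,
                        },
                        "type": "wikibase-entityid",
                    },
                }
            ]
        }
    return statement


# Illustrative values only: the form ID and file name are assumptions.
form_statement = build_audio_statement("L2330-F1", "Example-tour.wav", "Q150")
item_statement = build_audio_statement("Q51885771", "Example.wav", "Q150")
```

With this sketch, `form_statement` carries no qualifiers, while `item_statement` carries a single P407 qualifier pointing at Q150 (French).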

See https://lingualibre.org/wiki/LinguaLibre:Chat_room/Archives/2019#Feature_request:_ask_to_reuse_existing_identical_audio_if_available_.28part_2.29 and https://lingualibre.org/wiki/LinguaLibre:Chat_room/Archives/2019#Feature_request:_add_language_qualifier_to_lexeme_form_pronunciation_audio_statement for more details.

EDIT: https://lingualibre.org/wiki/LinguaLibre:Chat_room/Archives/2019#Adding_sounds_to_the_pronunciation_claim_in_Wikidata -> request to update the bot code to take into account the new pronunciation property.

EDIT2: https://lingualibre.fr/wiki/LinguaLibre:Chat_room#Wikidata -> request to add several pronunciations on one Wikidata item so that there are pronunciations with different accents

Event Timeline

P407 isn't needed on lexemes. In fact, I asked them not to add it.

Indeed, for the second point, it is indicated on P443 (Constraints section) that lexemes have to be excluded (it makes sense). So Lingua Libre Bot should not add P407. That said, it is not clear why the exclamation mark appears in Strom.

Currently, I don't think the constraint system allows limiting a constraint's scope to items only.

The exception there doesn't go beyond the entity https://www.wikidata.org/wiki/Q51885771

Pamputt changed the subtype of this task from "Task" to "Feature Request".Oct 6 2020, 7:12 PM

Wouldn't it be better to make this edit through the API under the user's own account than through a bot? The tool already asks for OAuth. This has a few advantages: I get the credit/blame more directly, and I can inspect the edit immediately (if I have to wait for the bot, I might forget it).

I took over the bot's code. I see there have been discussions here, and the main task "body" might not have been updated to reflect them. Could someone please summarize what I need to do to get this done?

It is difficult to answer precisely, so I would say the best option is to ask on Wikidata talk:Lexicographical data what the "Wikidata lexicographers" expect from LinguaLibreBot. I think what is listed in the description is still valid, but it's worth asking the current community. I think @VIGNERON may have an opinion about that since he is quite involved in lexicographical data.

The simplest solution that would still be hugely valuable would be to feed LinguaLibre with forms of lexemes (preferably through a query), record the forms in the regular interface, and then, once the files are uploaded, have them added as pronunciation audio (P443) on the respective forms.
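A minimal sketch of what such a query could look like, assuming the Wikidata Query Service and its predefined prefixes (`dct:`, `ontolex:`, `wd:`, `wdt:`). The helper name `forms_without_audio_query` is hypothetical, and nothing here reflects how the Record Wizard actually fetched its lists. It only builds the SPARQL text for forms in a given language that have no P443 statement yet; sending it to the endpoint is left out.

```python
def forms_without_audio_query(language_qid, limit=100):
    """Build a SPARQL query listing forms in a language that lack P443 audio.

    Hypothetical helper; relies on prefixes predefined by the Wikidata
    Query Service, so no PREFIX lines are emitted.
    """
    # Reject anything that is not a plain Q-id to avoid malformed queries.
    if not (language_qid.startswith("Q") and language_qid[1:].isdigit()):
        raise ValueError(f"not a valid item ID: {language_qid!r}")
    return f"""
SELECT ?lexeme ?form ?representation WHERE {{
  ?lexeme dct:language wd:{language_qid} ;
          ontolex:lexicalForm ?form .
  ?form ontolex:representation ?representation .
  FILTER NOT EXISTS {{ ?form wdt:P443 ?audio . }}
}}
LIMIT {int(limit)}
""".strip()


# Q150 is French; the limit is arbitrary.
query = forms_without_audio_query("Q150", limit=10)
print(query)
```

Because each result row carries the form ID alongside its representation, the bot would know exactly which form a recording belongs to, which is the part identified as tricky in this thread.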

There used to be a way to add Wikidata queries, get a list of lexemes or forms, record them and automatically upload to Commons and Wikidata Lexemes, but this functionality seems to be gone. If we could get that working, the change would be huge.

This would be nice indeed. Actually, I do not remember how it worked before, because the ability to get a list of lexemes or forms should be implemented in the Record Wizard (Details step). I've opened a new ticket (T274667) to ask for this feature.

It can certainly be done as a subtask as it would be valuable even without the following addition. But I guess this task is dependent on such a query to be able to know the exact forms to add the files to, so it needs to be connected as a blocker.

That's it. Getting a list of items and recording them is not difficult. Having the bot upload them to the correct form is the tricky part.

@Yug creating lexemes would be very nice but quite difficult. "language + form" is not enough; at the very least, the lexical category is mandatory to create a Lexeme.
Other data are needed to determine whether the lexeme exists and is the same (for instance for cases like "fils" - threads L10371 - and "fils" - son L15917 - or "tour" L2330 and "tour" L2331).
How could we solve these problems?
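To make the homograph problem concrete, here is a small sketch of a matcher that refuses to pick a form when the lookup is not unique. Everything in it is an assumption for illustration: the `Form` record, the `match_form` function, the form suffixes (`-F2`, `-F1`) and the premise that both "fils" lexemes carry the lexical category noun (Q1084). It does not solve the problem; it only shows that language + spelling + lexical category can still leave two candidates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Form:
    form_id: str
    language: str          # language item, e.g. "Q150" for French
    representation: str    # spelling of the form
    lexical_category: str  # e.g. "Q1084" for noun


def match_form(candidates, language, text, lexical_category=None):
    """Return (status, hits) where status is 'match', 'missing' or 'ambiguous'."""
    hits = [
        form
        for form in candidates
        if form.language == language
        and form.representation == text
        and (lexical_category is None or form.lexical_category == lexical_category)
    ]
    if len(hits) == 1:
        return "match", hits
    if not hits:
        return "missing", hits
    return "ambiguous", hits


# Illustrative data: lexeme IDs come from the comment above, form suffixes are guesses.
known_forms = [
    Form("L10371-F2", "Q150", "fils", "Q1084"),  # "fils" as in threads
    Form("L15917-F1", "Q150", "fils", "Q1084"),  # "fils" as in son
]

status, hits = match_form(known_forms, "Q150", "fils", "Q1084")
print(status, [form.form_id for form in hits])
```

In this toy data the status is `ambiguous`, so a bot following this approach would skip the edit and leave the choice to a human instead of guessing.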

See also this diff where the bot confused fils and fils...

@VIGNERON: Thank you! This is exactly the kind of thing we don't know but need to know.
I do have a dataset of 3000 Chinese lexemes with hans, hant, pinyin (toned), POS, and French translations.

Also, shouldn't this task be split? It is really vague and is becoming confusing.

Sadly I know the problem, not the solution...

And yes, we should have subtasks for each specific and independent issue.

After the recovery from the OVH fire... do we have any news on this? Thanks!

I have recovered the bot and it's now running on Toolforge.
The description of this task still remains a bit unclear to me. I'd appreciate it if someone could split it into smaller-scoped feature requests.

Thanks @Poslovitch! I have one request, and it is in the subtask. There used to be an option to get a list of lexemes and forms and, after recording them, have the recordings uploaded automatically to the corresponding form/lexeme. Now this option is gone, so we can add a list of words, but the recordings won't be added to Wikidata lexemes.

Yes, thanks a lot @Poslovitch !

For "Create lexeme when trio language + form + pos does not exist on wikidata", I would abandon that because 1. it would generate almost empty lexemes, and 2. we don't have the POS (part of speech = lexical category) on LinguaLibre, do we? @Pamputt, if there is no objection, I will strike that.
A better solution would be to generate a todo list somewhere of recordings that have no corresponding form in a lexeme (a list that humans could then use to carefully create lexemes).
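As a rough sketch of that todo list, assuming recordings are available as simple records and the set of existing (language, spelling) pairs has already been fetched from Wikidata: the function name `build_todo_list`, the record keys and the file names below are all hypothetical, and the wikitext bullet format is just one possible output.

```python
def build_todo_list(recordings, known_forms):
    """Return wikitext bullets for recordings with no matching lexeme form.

    recordings: iterable of dicts with 'language', 'text' and 'file' keys.
    known_forms: set of (language, text) pairs that already exist as forms.
    """
    lines = []
    for rec in recordings:
        if (rec["language"], rec["text"]) not in known_forms:
            lines.append(f"* [[:File:{rec['file']}]] ({rec['language']}): {rec['text']}")
    return "\n".join(lines)


# Illustrative data only; file names are made up.
recordings = [
    {"language": "Q150", "text": "fils", "file": "Example-fils.wav"},
    {"language": "Q150", "text": "chat", "file": "Example-chat.wav"},
]
known_forms = {("Q150", "fils")}

todo = build_todo_list(recordings, known_forms)
print(todo)
```

Here only the "chat" recording ends up on the list, since "fils" already has a form; the output could be written to a wiki page for humans to work through.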