
Complete support of Lexemes in QuickStatements
Open, Needs Triage, Public

Description

QuickStatements has partially supported Lexemes since October 2018 (see announcement). Now that we're moving forward with organizations that would like to import data, it would be very helpful to have the full feature ready for big imports.

Event Timeline

@Magnus is there anything that is missing from the Wikidata side (API, etc.) to complete the work? Any support you may need from us or volunteers in order to get this done? :)

A new wish for a full support of Lexemes on QS was mentioned here :)

The patch that resulted in the current support is surprisingly small: https://phabricator.wikimedia.org/R2010:d4bbd816e688d910d28617acf22a8ecc2a725dc5

This makes me think that it is a relatively small task to support Lexeme creation?
I have some JS and Guile experience, but my Python knowledge is very limited, so I guess I'm not the best suited for the task. It would be really nice if somebody got it done.

Also, the WD API seems to support glosses fine: it is what powers MachtSinn, which saves glosses using this code: https://github.com/Nudin/makesense/blob/master/app.py#L160

It uses LexData, written in Python by the same author, found here: https://github.com/Nudin/LexData

LexData supports lexeme creation as well as glosses, forms and grammatical forms. That means it supports all important features of lexemes! Hooray! :)

I recommend adding LexData to QS as it is well written and works in MachtSinn without problems. Anyone up for the task?

The desire to create lexemes via QuickStatements was raised by a FactGrid user in the Wikibase Community User Group mailing list.

It seems there are still relatively few tools for this, most lexicographical data scripts being oriented around improving existing Wikidata content.

With QuickStatements installed on Wikibase.cloud and JS scripts currently forbidden, it'd also be nice to have it as an option that doesn't require, say, WikibaseIntegrator. (Cradle support is another idea, but not within this task.)

Can we clarify what is still missing? Is it only the creation of Lexemes? The dev team had asked for that to be held off a bit initially to take things a bit slowly when Lexemes were introduced. I think by now this is no longer an issue and shouldn't block anything.

To make things clearer - and if I'm not mistaken - what is still missing:

  • create lexeme
  • create form
  • create sense
  • on lexeme level: edit lemma, lexical category, and language
  • on form level: edit representation and grammatical feature
  • on sense level: edit gloss and its language

IIRC full Lexeme support was postponed so we can start Lexemes on Wikidata "clean", and not as a mass import from Wiktionary or some copyright-dubious source. If we agree that Lexemes have reached critical mass, I can add Lexeme support to QuickStatements, unless you want to wait for that rewrite (who was doing this? Brasil?)

IIRC full Lexeme support was postponed so we can start Lexemes on Wikidata "clean", and not as a mass import from Wiktionary or some copyright-dubious source.

Not sure if it was the main or only reason but probably...

If we agree that Lexemes have reached critical mass, I can add Lexeme support to QuickStatements, unless you want to wait for that rewrite (who was doing this? Brasil?)

That would be great! There are now more Lexemes than there are entries on any Wiktionary (depending on how you count, lexeme forms are more-or-less the same as Wikt entries...) so I guess we can say we "reached critical mass".

Yes @ACorrea-WMB (and others) worked on QS3. But IIRC, QS3 works with the Rest API that does not work fully on Lexemes, so maybe QS2 could be the solution to create and improve Lexemes more efficiently (and indeed a lot of them need it).

This comment was removed by GreenReaper.

IIRC, QS3 works with the Rest API that does not work fully on Lexemes

FWIW, the specific issue here is covered in T329096 and as I understand it relates to the desire of the Wikibase team to first rewrite WikibaseLexeme to adhere to the precepts of Hexagonal Architecture, such work needing to be prioritised against other tasks.

IIRC full Lexeme support was postponed so we can start Lexemes on Wikidata "clean", and not as a mass import from Wiktionary or some copyright-dubious source. If we agree that Lexemes have reached critical mass, I can add Lexeme support to QuickStatements, unless you want to wait for that rewrite (who was doing this? Brasil?)

Let's do it! :)
(And yeah, Wikimedia Brasil. We'll work on adding Lexeme support to the REST API so it can later be integrated in QuickStatements 3. But until then it doesn't hurt to enable it in QuickStatements 2.)

Yay!! As for us at Wikimedia Brasil, as soon as Lexemes are available in the Wikibase REST API we'll quickly work to support it in QuickStatements 3. We already have most of the syntax prepared.

I have added the required Lexeme fixes to both the PHP and the (back-end) Rust version. Specifics are here (not yet in the official help page). Can someone more familiar with Lexemes give it a whirl please, before I announce it to the general population?

First, thanks a lot @magnusmanske ! (you may just have destroyed my future free time but I'm very glad about it ;) ).

I just did a quick try with the following code:

CREATE_LEXEME	Q12107	Q147276	br:"Montroulez"
LAST	P12846	"m/montroulez/"
LAST	ADD_FORM	br:"Montroulez"	Q110786
LAST	ADD_SENSE	fr:"commune française"
LAST	P5137	Q202368

See what I've got:

image.png (1,882×605 px, 182 KB)

It works mostly well, see L1560547. Apparently the last 2 lines didn't work (from what I understand it tried to add the sense to the form? or maybe I misunderstood the syntax?).

When redoing the two last lines with the actual Lexeme id, it worked fine:

L1560547	ADD_SENSE	fr:"commune française"
LAST	P5137	Q202368

Also, less importantly, as visible in the screenshot, it seems that the interface doesn't know these new commands yet: I see "UNKNOWN COMMAND".

@VIGNERON The LAST after ADD_FORM referred to the FORM not to the Lexeme. This is indeed confusing, so I changed it to use LAST to refer to the last Lexeme created. Please try again (with a new Lexeme), it should work now. I also changed the docs accordingly.

I don't think it's true yet that an entire lexeme can be created in one go if form and sense IDs have to be calculated when preparing a QuickStatements batch. "Item for this sense" statements, as the property name suggests, go on senses; they do not go on lexemes as the command examples given show. There are lots of other statements that can go on senses (images, semantic genders, external IDs) and forms (pronunciation, morphological context, external IDs) as well, and what I fear will happen with the current setup is that someone decides to write up a batch thinking, naturally, that LAST refers to the last entity (whether lexeme, form, or sense) created, or that they will simply omit all statements on forms/senses due to an unwillingness to calculate form/sense IDs.
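To make the three entity levels concrete, here is a minimal Python sketch that tells from an ID whether it names a lexeme, a form, or a sense. It assumes the usual ID shapes (L123 for a lexeme, L123-F1 for a form, L123-S1 for a sense). The function name `classify_entity_id` is hypothetical and not part of QuickStatements.

```python
import re

# Assumed ID shapes: L<number>, optionally followed by -F<number> or -S<number>.
_ID_RE = re.compile(r"^(L[1-9]\d*)(?:-([FS])([1-9]\d*))?$")


def classify_entity_id(entity_id):
    """Return (level, parent_lexeme_id) for a lexeme, form, or sense ID."""
    match = _ID_RE.match(entity_id)
    if match is None:
        raise ValueError(f"not a lexeme, form, or sense ID: {entity_id!r}")
    lexeme_id, kind, _number = match.groups()
    level = {None: "lexeme", "F": "form", "S": "sense"}[kind]
    return level, lexeme_id
```

A batch-preparation script could use such a check to make sure that, for example, an "item for this sense" statement is only ever attached to an ID classified as a sense.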

Perhaps some commands LAST_SENSE and LAST_FORM might be introduced to allow statements on those (in addition to grammatical features, form representations, and sense glosses) to be added? e.g.

CREATE_LEXEME	Q12107	Q147276	br:"Montroulez"
LAST	P12846	"m/montroulez/"
LAST	ADD_FORM	br:"Montroulez"	Q110786
LAST_FORM	P898	"[mɔ̃tˈʁuːles]"
LAST_FORM	P443	"Br-Montroulez.ogg"
LAST	ADD_SENSE	fr:"commune française"
LAST_SENSE	P5137	Q202368
LAST_SENSE	P18	"Vue_de_Morlaix.JPG"
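For anyone preparing such batches from a script, the following is a rough Python sketch that emits tab-separated commands in the shape proposed above. The command names and column order are copied from the example in this comment; the function `build_lexeme_batch` and its dictionary keys are hypothetical, and it only handles a single grammatical feature per form because the thread does not show how several would be written.

```python
def build_lexeme_batch(language, category, lemma,
                       lexeme_statements=(), forms=(), senses=()):
    """Build one lexeme's commands as tab-separated lines (sketch only).

    Values are passed through unchanged, so callers must supply them already
    formatted, e.g. 'br:"Montroulez"' or '"m/montroulez/"'.
    """
    lines = ["\t".join(["CREATE_LEXEME", language, category, lemma])]
    # Statements on the lexeme itself.
    for prop, value in lexeme_statements:
        lines.append("\t".join(["LAST", prop, value]))
    # Each form, followed by the statements that belong on that form.
    for form in forms:
        lines.append("\t".join(
            ["LAST", "ADD_FORM", form["representation"], form["feature"]]))
        for prop, value in form.get("statements", []):
            lines.append("\t".join(["LAST_FORM", prop, value]))
    # Each sense, followed by the statements that belong on that sense.
    for sense in senses:
        lines.append("\t".join(["LAST", "ADD_SENSE", sense["gloss"]]))
        for prop, value in sense.get("statements", []):
            lines.append("\t".join(["LAST_SENSE", prop, value]))
    return "\n".join(lines)
```

Keeping form and sense statements directly after the ADD_FORM or ADD_SENSE line they belong to means the batch never needs a precomputed form or sense ID.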

I have implemented the idea of @Mahir256 in both PHP and Rust, and put everything live. Please test. Note that I might not be able to reply until tomorrow.

I did not use the new syntax (LAST_SENSE and LAST_FORM), just a simple "add" of a property value on ~800 lexeme forms.
The temporary batch was working fine (see https://www.wikidata.org/w/index.php?title=Lexeme:L1523693&oldid=2475322672 as an example). But then it started erroring with the following API response in the Firefox dev tools:

{
  "status": "OK",
  "command": {
    "action": "add",
    "item": "L1522609-F1",
    "property": "P7481",
    "what": "statement",
    "new_statement": 0,
    "datavalue": {
      "type": "wikibase-entityid",
      "value": {
        "entity-type": "item",
        "id": "Q138786802"
      }
    },
    "meta": {
      "message": "",
      "status": "RUN",
      "id": 0
    },
    "summary": "#temporary_batch_1774469814385",
    "status": "error",
    "message": "Item L1522609-F1 is not available"
  },
  "last_item": "",
  "last_form": "",
  "last_sense": ""
}
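Note that in the response above the top-level "status" is "OK" while the nested "command" object carries "status": "error". A script watching these responses would need to check both. Here is a minimal Python sketch, assuming the response always has the shape shown above; the function name `command_error` is hypothetical.

```python
def command_error(response):
    """Return an error message for a parsed response, or None on success.

    The field names come from the response pasted above; any other
    response shape is an assumption.
    """
    command = response.get("command", {})
    # The per-command status is checked first: it can report an error
    # even when the top-level status says "OK".
    if command.get("status") == "error":
        return command.get("message") or "unknown error"
    if response.get("status") != "OK":
        return f"request failed with status {response.get('status')!r}"
    return None
```

Applied to the response above (parsed with `json.loads`), this returns the nested message rather than treating the call as successful.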

I could somehow get it working again by starting another temporary batch, but it would start failing again after ~100 forms.
Now, even if I try a simple batch with a single line, it fails with a similar "Item Lxxx-Fxx is not available" error.
Attached is the initial, full set of QSv1 commands.

Hi,

I had the same "Item LXXX is not available" problem when trying to create L1560879

The original code started with

CREATE_LEXEME	Q12107	Q147276 	br:"Douarnenez"
LAST	P11068	"douarnenez"

The second line failed and then I tried

L1560879	P11068	"douarnenez"

which also failed...