
Add wikibase:identifiers to RDF representation of lexemes
Open, Needs Triage · Public · Feature

Description

Related task: T144476

As a Query Service user I want to query the number of external ids for a lexeme in order to see the popularity of a lexeme.

Problem:
Using wikibase:identifiers on the Query Service provides the number of external ids for an item. Example: SELECT * { wd:Q1084 wikibase:identifiers ?identifiers }

However, this is only available for Q-items, not for lexemes (L-items).

Example:
Expected: the query SELECT * { wd:L241 wikibase:identifiers ?identifiers } should return the number of external ids for the lexeme L241 (2 at the moment).

Like for Q-items, this would simplify the writing of SPARQL queries related to Wikidata lexicographical data.
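Until the predicate exists for lexemes, a similar number can probably be obtained by counting statements whose property has the external-id type. Below is a minimal sketch that builds such a query. The helper name is hypothetical, the query relies on the wd: and wikibase: prefixes being predefined on the Query Service, and counting every statement via wikibase:claim is an assumption that may not match exactly what wikibase:identifiers would report.

```python
import re


def build_identifier_count_query(entity_id: str) -> str:
    """Build a SPARQL query counting external-id statements on one entity.

    Hypothetical helper for illustration; not part of any Wikibase library.
    """
    # Accept only plain item, lexeme or property IDs such as Q1084 or L241,
    # so that nothing unexpected gets spliced into the query text.
    if not re.fullmatch(r"[QLP][1-9][0-9]*", entity_id):
        raise ValueError(f"unexpected entity ID: {entity_id!r}")
    return (
        "SELECT (COUNT(?statement) AS ?identifiers) WHERE {\n"
        f"  wd:{entity_id} ?claim ?statement .\n"
        "  ?property wikibase:claim ?claim ;\n"
        "            wikibase:propertyType wikibase:ExternalId .\n"
        "}"
    )


print(build_identifier_count_query("L241"))
```

The printed query can be pasted into the Query Service as-is. It is noticeably longer than the one-line wikibase:identifiers form, which is the simplification this task asks for.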

Acceptance criteria:

  • Using wikibase:identifiers on the Query Service provides the number of external ids for lexemes (L-items)

Open questions:
  • Given the Query Service scalability issues, how much additional data will be added?

Notes

As far as I can tell, those numbers come from the page props and for items the page props are set here, while for lexemes they're set here.

For items it has $properties['wb-identifiers'] = $this->getContentHandler()->getIdentifiersCount( $item->getStatements() ); - would adding that to the file for lexemes be all that's needed to make them start being added to the RDF?

And then once they start appearing in the RDF, I assume the lexemes in the query service would need resyncing somehow?
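For readers unfamiliar with the page prop, the counting step in the notes above can be illustrated in Python. This is only a sketch under the assumption that the count is simply the number of statements whose property has the external-id datatype. It is not the Wikibase PHP code; the function name and the property IDs in the example are made up, and the input mimics the "claims" map of Wikibase entity JSON.

```python
def count_external_id_statements(claims: dict) -> int:
    """Count statements whose main snak has the external-id datatype.

    `claims` maps a property ID to a list of statements, each carrying a
    "mainsnak" with a "datatype", as in Wikibase entity JSON.
    """
    return sum(
        1
        for statements in claims.values()
        for statement in statements
        if statement.get("mainsnak", {}).get("datatype") == "external-id"
    )


# Made-up example: two external-id statements and one plain string statement.
example_claims = {
    "P9001": [{"mainsnak": {"datatype": "external-id"}}],
    "P9002": [{"mainsnak": {"datatype": "external-id"}}],
    "P9003": [{"mainsnak": {"datatype": "string"}}],
}

print(count_external_id_statements(example_claims))  # 2
```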

Event Timeline

@Lydia_Pintscher is that something your team is taking on? Or do you expect something from us?

I think this needs changes on our side and should then get into the query service automagically.

What actually needs doing here?


Given the Query Service scalability issues, how much additional data will be added?

There are currently 998,055 lexemes (query), so 998,055 triples.

If the amount of data that would be added to the query service is an issue, I will personally remove the 6.7 million redundant aliases on items for Unicode characters (about 151,000 items and each has the codepoint as an alias in 444 different languages), whether we have mul yet or not, to make some space.

(Also a periodic reminder that over 1.5 billion descriptions, or 10% of all triples, could be made redundant by automatically showing P31 when there's no description - T303677#7789434)

Should the identifier numbers be added only to lexemes or also to forms and senses?

As there are both form identifiers and sense identifiers, having wikibase:identifiers for forms and senses would be useful, and as there are around 12.1 million forms and around 300,000 senses, I am willing to supplement Nikki's offer by removing 6.7 million more unnecessary triples from a variety of places.
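To put the figures from this thread side by side, here is a rough worked calculation. It assumes exactly one wikibase:identifiers triple per entity and treats the approximate form and sense counts quoted above as exact, so the totals are estimates only.

```python
# Entity counts as quoted in the thread; forms and senses are rounded figures.
LEXEMES = 998_055
FORMS = 12_100_000
SENSES = 300_000

# Assumption: one wikibase:identifiers triple per entity.
lexemes_only = LEXEMES
with_forms_and_senses = LEXEMES + FORMS + SENSES

print(f"lexemes only:           {lexemes_only:,} triples")
print(f"plus forms and senses:  {with_forms_and_senses:,} triples")
```

Under these assumptions, extending the predicate to forms and senses would add roughly 13.4 million triples instead of about 1 million.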

(As a follow-up to Nikki's periodic reminder, I'd like to suggest that any postponement of related automatic description functionality of the form "let's wait for Wikifunctions to launch" be rethought on account of the numerous delays in that project.)

Hm, one thing I just noticed: wikibase:statements is also currently only on lexemes, not on forms and senses – and it includes the form and sense statement counts (demo query). So it would be consistent to also have wikibase:identifiers only on the lexeme, counting the identifiers in the whole lexeme including its forms and senses.
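A small sketch of that lexeme-wide option, assuming each form and sense has its own identifier count and that the lexeme-level value is simply the sum. The class names, IDs and numbers are invented for illustration and do not mirror the Wikibase data model classes.

```python
from dataclasses import dataclass, field


@dataclass
class SubEntity:
    """A form or sense with the identifiers attached directly to it."""
    entity_id: str
    own_identifiers: int


@dataclass
class Lexeme:
    entity_id: str
    own_identifiers: int
    forms: list[SubEntity] = field(default_factory=list)
    senses: list[SubEntity] = field(default_factory=list)


def lexeme_wide_identifiers(lexeme: Lexeme) -> int:
    """Sum the identifiers on the lexeme itself and on all its forms and senses."""
    return (
        lexeme.own_identifiers
        + sum(form.own_identifiers for form in lexeme.forms)
        + sum(sense.own_identifiers for sense in lexeme.senses)
    )


# Invented example: 2 on the lexeme, 1 on one form, 3 on one sense.
example = Lexeme(
    "L1",
    own_identifiers=2,
    forms=[SubEntity("L1-F1", 1), SubEntity("L1-F2", 0)],
    senses=[SubEntity("L1-S1", 3)],
)

print(lexeme_wide_identifiers(example))  # 6
```

With this approach only one triple per lexeme is added, at the cost of not being able to query the count for an individual form or sense.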

Apparently properties are also missing wikibase:identifiers, btw.