
Wikidata JSON dumps should include Lexemes
Open, Needs Triage, Public

Description

Hello,

Lexicographical data was deployed almost a year ago (May 2018) and is now a significant part of Wikidata. Despite that, Wikidata JSON dumps include only a subset of the lexicographical data in Wikidata (only the identifiers of lexemes and senses used as values in the main (Q) and Property (P) namespaces). At the moment, the dumps are inconsistent: entities in the Lexeme (L) namespace are not included, even when they are linked from items within the dumps.

Lexemes have been removed from Wikidata JSON dumps for an unknown reason (see T195419).

Is it possible to include them again?

One possible application would be to have an easy way to compute statistics about the usage of all Wikidata properties across all namespaces, without having to gather data from several dumps in various formats.
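As a rough illustration of that use case, the sketch below counts how often each property is used as a main statement across a stream of already-decoded entities. It assumes the usual Wikibase JSON layout (a `claims` mapping per entity, plus `forms` and `senses` lists on lexemes, each with their own `claims`); the function name `count_property_usage` is hypothetical, and qualifiers and references are deliberately not counted.

```python
from collections import Counter
from typing import Iterable


def count_property_usage(entities: Iterable[dict]) -> Counter:
    """Count main statements per property ID across items, properties and lexemes."""
    usage = Counter()
    for entity in entities:
        # Assumption: lexemes nest forms and senses, each with its own claims;
        # items and properties have no such keys, so the lists are empty for them.
        holders = [entity, *entity.get("forms", []), *entity.get("senses", [])]
        for holder in holders:
            # An empty claims value may be serialized as [] rather than {}.
            claims = holder.get("claims") or {}
            for property_id, statements in claims.items():
                usage[property_id] += len(statements)
    return usage


# Tiny in-memory stand-ins for entities decoded from the dumps.
sample = [
    {"id": "Q1", "type": "item", "claims": {"P31": [{}, {}]}},
    {
        "id": "L1",
        "type": "lexeme",
        "claims": {"P31": [{}]},
        "forms": [{"id": "L1-F1", "claims": {"P898": [{}]}}],
        "senses": [{"id": "L1-S1", "claims": []}],
    },
]
usage = count_property_usage(sample)
```

With a single dump covering every namespace, one pass of this kind would be enough; with separate dumps, the same counter has to be fed from each of them.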

Event Timeline

Envlh created this task. Sat, Apr 13, 2:55 PM
hoo added a comment. Mon, Apr 15, 10:12 AM

As we don't want to inflate the current JSON dumps even more, we will not add Lexemes to them.

What we will do is add a new JSON dump covering just Lexemes.

Is anything missing before we can create those separate dumps? Any decisions we still need to take?

hoo added a comment. Tue, Apr 23, 5:16 AM

Is anything missing before we can create those separate dumps? Any decisions we still need to take?

I just test-dumped a Lexeme in production and the JSON (as expected) looks complete.

The only thing left to decide is the naming scheme. Given that the JSON structure should be final (as much as it can be), there is no need to call them BETA:
https://dumps.wikimedia.org/wikidatawiki/entities/20190419/wikidata-20190419-lexemes.json.bz2
https://dumps.wikimedia.org/wikidatawiki/entities/20190419/wikidata-20190419-lexemes.json.gz
And as links to the latest versions of each:
https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.json.gz
https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.json.bz2
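For consumers, a file named like the ones above could be read as a stream without decompressing it to disk first. The following is a minimal sketch, assuming the lexeme dump uses the same layout as the existing entity JSON dumps (one JSON array, one entity per line, each line but the last ending with a comma); `iter_dump` is a hypothetical helper name and the path is expected to point at a locally downloaded file.

```python
import bz2
import gzip
import json
from typing import Iterator


def iter_dump(path: str) -> Iterator[dict]:
    """Yield entities one at a time from a .json.gz or .json.bz2 dump file."""
    # Pick the decompressor from the file extension.
    opener = bz2.open if path.endswith(".bz2") else gzip.open
    with opener(path, mode="rt", encoding="utf-8") as stream:
        for line in stream:
            # Assumption: one entity per line, wrapped in "[" and "]" lines.
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue
            yield json.loads(line)
```

Because the function is a generator, memory use stays bounded by the size of a single entity rather than the whole dump.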

Envlh added a comment. Tue, Apr 23, 6:35 AM

Thank you for your replies. A few comments / questions:

  • While I understand your point, I fear that isolating some data from the main dump is only a temporary solution to its size growth. Sooner or later, it will weigh 1 TB (even compressed), and we'll have to deal with this (as producer or as consumer).
  • Will the lexemes dumps contain the P namespace, or will the consumer have to additionally download the other complete dump to get the data about properties?
  • Will there be one dump per namespace (one for P, one for Q, one for L)?

I'll be fine with a lexemes dump containing only the L namespace (even if it's much less practical), but I prefer to ask these questions to help you anticipate other use cases.

hoo added a comment. Tue, Apr 23, 7:47 AM

While I understand your point, I fear that isolating some data from the main dump is only a temporary solution to its size growth. Sooner or later, it will weigh 1 TB (even compressed), and we'll have to deal with this (as producer or as consumer).

We might indeed need to take further steps in the future because of the size of the dumps. Nevertheless, not all consumers will be interested in Lexemes (or even Items), so it makes sense to distribute them separately. Also, since "merging" the individual dumps is fairly trivial for a consumer, I don't think this will be very obstructive.
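To illustrate what such a merge could look like on the consumer side, here is a minimal sketch. It assumes every dump is a JSON array with one entity per line (as in the existing entity dumps) and simply concatenates the entity lines into one new array; `entity_lines` and `merge_dumps` are hypothetical names, and the inputs are in-memory stand-ins for decompressed dump streams.

```python
import io
from typing import Iterable, Iterator, TextIO


def entity_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield the raw JSON text of each entity, without array brackets or commas."""
    for line in lines:
        line = line.strip()
        if line in ("", "[", "]"):
            continue
        yield line.rstrip(",")


def merge_dumps(sources: Iterable[Iterable[str]], out: TextIO) -> int:
    """Concatenate several dumps into one JSON array and return the entity count."""
    count = 0
    out.write("[\n")
    for source in sources:
        for entity in entity_lines(source):
            if count:
                # Separate entities with a trailing comma, one entity per line.
                out.write(",\n")
            out.write(entity)
            count += 1
    out.write("\n]\n")
    return count


# In-memory stand-ins for an items dump and a lexemes dump.
items = ["[\n", '{"id":"Q1"},\n', '{"id":"Q2"}\n', "]\n"]
lexemes = ["[\n", '{"id":"L1"}\n', "]\n"]
merged = io.StringIO()
total = merge_dumps([items, lexemes], merged)
```

The entities are never parsed here, so the merge is a plain line-level copy; a consumer who only needs to iterate over all entities could skip writing a merged file and chain the streams instead.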

Will the lexemes dumps contain the P namespace, or will the consumer have to additionally download the other complete dump to get the data about properties?

It will not contain properties, thus the other JSON dump will still be needed. In the future we might also add a properties-only JSON dump.
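In practice this means a consumer of the lexeme dump who wants, for example, human-readable property names has to extract them from the other dump first. A minimal sketch, assuming the standard Wikibase JSON shape for entities (`type`, `id`, and a `labels` mapping keyed by language code); `property_labels` is a hypothetical helper and the sample entities are made up for illustration.

```python
from typing import Dict, Iterable


def property_labels(entities: Iterable[dict], language: str = "en") -> Dict[str, str]:
    """Map property IDs to their label in one language, falling back to the ID."""
    labels = {}
    for entity in entities:
        # Skip items (and anything else) found in the full dump.
        if entity.get("type") != "property":
            continue
        label = (entity.get("labels") or {}).get(language, {})
        labels[entity["id"]] = label.get("value", entity["id"])
    return labels


# Made-up stand-ins for entities decoded from the full JSON dump.
full_dump_sample = [
    {"id": "Q1", "type": "item", "labels": {"en": {"language": "en", "value": "example item"}}},
    {"id": "P1", "type": "property", "labels": {"en": {"language": "en", "value": "example property"}}},
    {"id": "P2", "type": "property", "labels": {}},
]
lookup = property_labels(full_dump_sample)
```

The resulting dictionary is small compared with the dumps themselves, so it can be built once and kept in memory while the lexeme dump is processed.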

Will there be one dump per namespace (one for P, one for Q, one for L)?

In the long run, we will probably have one dump per entity type, yes.

I just test-dumped a Lexeme in production and the JSON (as expected) looks complete.

The only thing left to decide is the naming scheme. Given that the JSON structure should be final (as much as it can be), there is no need to call them BETA:
https://dumps.wikimedia.org/wikidatawiki/entities/20190419/wikidata-20190419-lexemes.json.bz2
https://dumps.wikimedia.org/wikidatawiki/entities/20190419/wikidata-20190419-lexemes.json.gz
And as links to the latest versions of each:
https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.json.gz
https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.json.bz2

That looks good to me.