
JSON dumps for Lexemes
Closed, ResolvedPublic

Description

Problem:
We currently don't provide JSON dumps for Wikidata's Lexemes. We should provide them. They should be published in dumps separate from the regular Wikidata dumps containing Items and Properties, so as not to make those even bigger.

Acceptance criteria:

Initial report:

Hello,

Lexicographical data has been deployed for almost a year (since May 2018) and is now a significant part of Wikidata. Despite that, the Wikidata JSON dumps include only a subset of the lexicographical data (only the identifiers of Lexemes and Senses used as values in the main (Q) and Property (P) namespaces). At the moment the dumps are inconsistent, as L entities are not included even when they are linked from other entities within the dumps.

Lexemes have been removed from Wikidata JSON dumps for an unknown reason (see T195419).

Is it possible to include them again?

One possible application would be to have an easy way to compute statistics about the usage of all Wikidata properties across all namespaces, without having to gather data from several dumps in various formats.
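As a sketch of the statistics use case above: assuming the usual Wikidata dump layout (a single JSON array with one entity per line), property usage can be counted in one streaming pass. `count_property_usage` is a hypothetical helper, and it only looks at top-level `claims`; Lexeme Forms and Senses carry their own statements, which this sketch ignores.

```python
import bz2
import json
from collections import Counter

def count_property_usage(path):
    """Count how often each property (P-id) is used in top-level
    claims, streaming a Wikidata-style JSON dump: a JSON array
    with one entity object per line."""
    counts = Counter()
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue  # skip the array brackets and blank lines
            entity = json.loads(line)
            for pid in entity.get("claims", {}):
                counts[pid] += 1
    return counts
```

With a combined Item/Property dump and a Lexeme dump, running this over both files and adding the two Counters would give usage across all namespaces.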

Event Timeline

As we don't want to inflate the current JSON dumps even more, we will not add Lexemes to them.

What we will do is add a new JSON dump covering just Lexemes.

Is anything missing before we can create those separate dumps? Any decisions we still need to take?

> Is anything missing before we can create those separate dumps? Any decisions we still need to take?

I just test-dumped a Lexeme in production and the JSON (as expected) looks complete.

The only thing we still need to decide on is the naming scheme (given that the JSON structure should be final, as much as it can be, there is no need to call them BETA):
https://dumps.wikimedia.org/wikidatawiki/entities/20190419/wikidata-20190419-lexemes.json.bz2
https://dumps.wikimedia.org/wikidatawiki/entities/20190419/wikidata-20190419-lexemes.json.gz
And as links to the latest versions of each:
https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.json.gz
https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.json.bz2

Thank you for your replies. A few comments / questions:

  • While I understand your point, I fear that isolating some data from the main dump is only a temporary solution to its size growth. Sooner or later, it will weigh 1 TB (even compressed), and we'll have to deal with this (as producer or as consumer).
  • Will the lexemes dumps contain the P namespace, or will the consumer have to additionally download the other complete dump to get the data about properties?
  • Will there be one dump per namespace (one for P, one for Q, one for L)?

I'll be fine with a lexemes dump containing only the L namespace (even if it's much less practical), but I prefer to ask these questions to help you anticipate other use cases.

> While I understand your point, I fear that isolating some data from the main dump is only a temporary solution to its size growth. Sooner or later, it will weigh 1 TB (even compressed), and we'll have to deal with this (as producer or as consumer).

We might indeed need to take further steps due to the size of the dumps in the future, but not all consumers will be interested in Lexemes (or even Items), so it makes sense to distribute them separately. Also, as "merging" the individual dumps is fairly trivial for a consumer, I don't think this will be very obstructive.
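A minimal sketch of the "merging is fairly trivial" point, again assuming the one-entity-per-line JSON array layout of the existing dumps; `merge_dumps` is a hypothetical helper, not part of any existing tooling:

```python
import gzip

def merge_dumps(paths, out_path):
    """Concatenate several Wikidata-style JSON array dumps
    (e.g. the Item/Property dump and a Lexeme dump) into one
    combined array, streaming line by line."""
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        out.write("[\n")
        first = True
        for path in paths:
            with gzip.open(path, "rt", encoding="utf-8") as f:
                for line in f:
                    line = line.strip().rstrip(",")
                    if line in ("[", "]", ""):
                        continue  # skip each dump's array brackets
                    if not first:
                        out.write(",\n")
                    out.write(line)
                    first = False
        out.write("\n]\n")
```

Consumers that process entities one at a time don't even need the merged file; they can simply iterate over both dumps in sequence.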

> Will the lexemes dumps contain the P namespace, or will the consumer have to additionally download the other complete dump to get the data about properties?

It will not contain properties, so the other JSON dump will still be needed. In the future we might also add a properties-only JSON dump.
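Until a properties-only dump exists, a consumer can pull just the property data out of the main dump in one streaming pass, e.g. to annotate lexeme statements with labels. A sketch, with `property_labels` as a hypothetical helper and the same one-entity-per-line array layout assumed:

```python
import gzip
import json

def property_labels(main_dump_path, lang="en"):
    """Build a P-id -> label map from the main Item/Property
    dump, skipping every non-property entity."""
    labels = {}
    with gzip.open(main_dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue  # skip the array brackets
            entity = json.loads(line)
            if entity.get("id", "").startswith("P"):
                labels[entity["id"]] = (
                    entity.get("labels", {}).get(lang, {}).get("value", "")
                )
    return labels
```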

> Will there be one dump per namespace (one for P, one for Q, one for L)?

In the long run, we will probably have one dump per entity type, yes.

> I just test-dumped a Lexeme in production and the JSON (as expected) looks complete.
>
> The only thing we still need to decide on is the naming scheme (given that the JSON structure should be final, as much as it can be, there is no need to call them BETA):
> https://dumps.wikimedia.org/wikidatawiki/entities/20190419/wikidata-20190419-lexemes.json.bz2
> https://dumps.wikimedia.org/wikidatawiki/entities/20190419/wikidata-20190419-lexemes.json.gz
> And as links to the latest versions of each:
> https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.json.gz
> https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.json.bz2

That looks good to me.

So... is there a patch to the bash script to be merged? What's next to make these go live?

Addshore subscribed.

@hoo any chance you could point @ArielGlenn in the right direction here?

Hi, I just stumbled upon this task and was wondering whether anything is blocking it? Cheers!

I renew my question above (T220883#5185999); if someone can answer it, I can work with them to make these go live.

Lydia_Pintscher renamed this task from Wikidata JSON dumps should include Lexemes to JSON dumps for Lexemes.Oct 1 2020, 2:49 PM
Lydia_Pintscher updated the task description.
Lydia_Pintscher updated the task description.
WMDE-leszek subscribed.
WMDE-leszek updated the task description.
WMDE-leszek updated the task description.

@hoo During task inspection we concluded that we probably have sufficient know-how on the team, but we should reconvene during a time when you are available. We will create a dedicated task inspection for this one.