
Wikidata JSON dumps do not have the 'ns' (namespace)
Open, Needs Triage, Public

Description

I want to process the Wikidata dump and filter out the qitems that are not in namespace 0 (article). However, the JSON provided in the dump does not seem to contain this field, even though the documentation says it is there.

In this sample, there is the 'ns' field: https://www.wikidata.org/wiki/Special:EntityData/Q1.json
In this other sample, too: https://www.wikidata.org/wiki/Special:EntityData/Q26.json

I am testing with the latest dump from:
https://dumps.wikimedia.org/wikidatawiki/entities/

Event Timeline

Hmm, what's the use case here? Is this for Wikidata dumps? Right now, Items being in NS 0 is a pretty safe assumption; they don't appear anywhere else.

The use case is to process the dumps and filter out qitems which do not relate to articles, this is why we put NS0. The JSON dump sample says there is an ns field, but in the final dump there is no such field.

> The use case is to process the dumps and filter out qitems which do not relate to articles, this is why we put NS0.

That sounds like you are referring to the namespace of the sitelinks of the entity?

On wikidata.org all "qitems" are in the main namespace, which is namespace 0.
The sitelinks held within those items can be on any number of different namespaces on the wikidata clients.
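To illustrate the distinction, here is a minimal Python sketch that keeps only the items carrying at least one Wikipedia sitelink. It assumes the usual dump layout (a JSON array with one entity per line) and the `sitelinks` map found in the entity JSON. The function names and the exclusion list of non-Wikipedia site ids are hypothetical and probably incomplete.

```python
import json

# Site ids that end in "wiki" but are not Wikipedias.
# Assumption: this list is illustrative, not exhaustive.
NON_WIKIPEDIA_SITES = {
    "commonswiki", "wikidatawiki", "specieswiki",
    "metawiki", "mediawikiwiki", "sourceswiki",
}

def iter_entities(lines):
    """Yield entity dicts from the lines of a Wikidata JSON dump."""
    for line in lines:
        # Each entity sits on its own line, followed by a comma;
        # the first and last lines are the array brackets.
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)

def wikipedia_sitelinks(entity):
    """Return {site id: page title} for sitelinks that look like Wikipedias."""
    # "or {}" also covers a missing key or an empty list.
    sitelinks = entity.get("sitelinks") or {}
    return {
        site: link["title"]
        for site, link in sitelinks.items()
        if site.endswith("wiki") and site not in NON_WIKIPEDIA_SITES
    }

def items_with_articles(lines):
    """Yield ids of items that have at least one Wikipedia sitelink."""
    for entity in iter_entities(lines):
        if entity.get("type") == "item" and wikipedia_sitelinks(entity):
            yield entity["id"]
```

For a real dump, the lines could come from something like `bz2.open(path, "rt", encoding="utf-8")`. Note that this only checks whether a sitelink exists. It does not check the namespace of the linked page on the client wiki, so a sitelink to a category or template page still passes.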

> The JSON dump sample says there is an ns field, but in the final dump there is no such field.

This is the namespace of the item itself, not of the sitelinks.
Could you link to those docs? It could be that they are only meant for the API serialization.

I need all the Wikidata qitems that relate to Wikipedia articles. If I understand it correctly, these are qitems that have namespace 0. Although not all qitems with namespace 0 necessarily have sitelinks (they could be just qitems without an article).

The thing is that I'm not sure all Wikidata qitems are in the main namespace (0).

Let me explain what I did.

Since the JSON dump has no namespace field that would let me parse only namespace 0 and skip the rest, I used the Wikidata MySQL replica database instead.

I ran the following query:
select count(page_namespace), page_namespace from page group by page_namespace order by 1 desc;

This is the result:

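+-----------------------+----------------+
| count(page_namespace) | page_namespace |
+-----------------------+----------------+
|              56986053 |              0 |
|                152250 |           1198 |
|                 45022 |              3 |
|                 42573 |            146 |
|                 36204 |              4 |
|                 33320 |              2 |
|                 16541 |              1 |
|                 10874 |           2600 |
|                  7464 |             10 |
|                  7371 |              5 |
|                  5940 |            121 |
|                  5887 |            120 |
|                  3675 |             14 |
|                  3032 |              8 |
|                  1800 |             12 |
|                   462 |            828 |
|                   298 |              9 |
|                   193 |             11 |
|                   131 |             13 |
|                    66 |            829 |
|                    62 |            147 |
|                    14 |             15 |
|                     3 |              7 |
|                     3 |           1199 |
+-----------------------+----------------+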

So it seems that there are many pages in namespaces 1198, 146, 2600, and others,
besides 3, 4, 2, and 1, which are user talk, project, user, and talk pages.

I don't know how many of these are in the dump. But I only need those which are 0. So, the solution that I found is retrieving all the qitems with namespace 0 from the wikidata replica mysql database and storing them into a database.

Then I consult this database when parsing the dump and skip the qitems that were not previously inserted. This way the parsing is shorter.
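A minimal sketch of that lookup step, using Python's built-in `sqlite3` as a stand-in for whatever database is actually used. The table name `ns0_items` and both function names are hypothetical. It assumes the qitem ids for namespace 0 have already been fetched from the replica (for example from the `page_title` column of the `page` table).

```python
import sqlite3

def build_allowlist(conn, item_ids):
    """Store the ids of namespace-0 items in a local lookup table."""
    conn.execute("CREATE TABLE IF NOT EXISTS ns0_items (qid TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO ns0_items (qid) VALUES (?)",
        ((qid,) for qid in item_ids),
    )
    conn.commit()

def is_allowed(conn, qid):
    """Return True if the id was inserted by build_allowlist."""
    row = conn.execute(
        "SELECT 1 FROM ns0_items WHERE qid = ?", (qid,)
    ).fetchone()
    return row is not None

# Example with an in-memory database and two ids from this task.
conn = sqlite3.connect(":memory:")
build_allowlist(conn, ["Q1", "Q26"])
```

While parsing, each entity's `id` would be passed to `is_allowed`, and entities for which it returns `False` would be skipped. The primary key on `qid` keeps each lookup an index search rather than a table scan.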

Do you think there is any other way to do it?
Thanks.