
provide incremental JSON dumps for Wikidata
Open, Low, Public

Description

We should also provide incremental dumps for the JSON dumps.

Details

Reference
bz70246

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:41 AM
bzimport set Reference to bz70246.
bzimport added a subscriber: Unknown Object (MLST).
Lydia_Pintscher removed a subscriber: Unknown Object (MLST).

This would probably be implemented like this: have a script that dumps all entity IDs that have been changed since the last incremental dump, then dump all entities on that list (a rough sketch of this is below).
The first script is not yet implemented, but that shouldn't be too hard.

Potential shortcomings of this (which may or may not also apply to the other incremental dumps, I have no idea): deletions and merges (which turn items into redirects) wouldn't show up that way.
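
For illustration only, here is a minimal sketch of that two-step approach against the public API (action=query&list=recentchanges to collect changed item IDs, then action=wbgetentities to fetch their JSON). The API modules and parameters are real, but the script itself, its timestamps, and its output file are assumptions for the sake of the example, not the actual Wikibase maintenance script.

```python
# Sketch: collect IDs of entities edited between two timestamps, then dump
# their current JSON, one entity per line.
import json
import requests

API = "https://www.wikidata.org/w/api.php"

def changed_entity_ids(since, until):
    """Yield entity IDs (e.g. 'Q42') edited between the two ISO timestamps."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcnamespace": 0,          # items live in the main namespace
        "rcprop": "title",
        "rclimit": "max",
        "rcstart": until,          # recentchanges enumerates newer -> older
        "rcend": since,
        "format": "json",
    }
    seen = set()
    while True:
        data = requests.get(API, params=params).json()
        for change in data["query"]["recentchanges"]:
            title = change["title"]
            if title not in seen:
                seen.add(title)
                yield title
        if "continue" not in data:
            break
        params.update(data["continue"])

def dump_entities(ids, out):
    """Write the current JSON of each listed entity to the open file `out`."""
    ids = list(ids)
    for i in range(0, len(ids), 50):   # wbgetentities accepts up to 50 IDs per call
        batch = ids[i:i + 50]
        data = requests.get(API, params={
            "action": "wbgetentities",
            "ids": "|".join(batch),
            "format": "json",
        }).json()
        for entity in data.get("entities", {}).values():
            # Deleted items come back flagged as "missing"; as noted above,
            # deletions and merges-to-redirects would need separate handling.
            if "missing" in entity:
                continue
            out.write(json.dumps(entity) + "\n")

if __name__ == "__main__":
    with open("incremental-dump.json", "w") as out:
        dump_entities(changed_entity_ids("2015-04-01T00:00:00Z",
                                         "2015-04-08T00:00:00Z"), out)
```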

Nemo_bis lowered the priority of this task from Medium to Low. Apr 9 2015, 7:16 AM
Nemo_bis set Security to None.

I believe I originally asked for this, but the current WDQ would no longer use these, and SPARQL replacements are on the way. If I was the only customer, this task could be closed now.

JanZerebecki claimed this task.

If anyone else wants this please reopen.

Reporting here what I wrote on Wikidata:
The current compressed JSON dump is more than 6 gigabytes, so would it be possible to create JSON dumps containing only the items changed or added since the previous week's dump? That would mean smaller files, and therefore less time spent downloading and decompressing. Useful for those who have slow connections.

It would also be useful for bot operators who run periodic maintenance tasks.

I have a side project that would benefit from daily JSON dumps. Happy to look into providing this if anyone else cares.

Addshore updated the task description.

Hello - I am extremely interested in incremental JSON dumps. The dumps are now over 80 GB, so it feels a bit weird having to process over 88 million records every two weeks just to get the new and updated ones. The amount of unnecessarily downloaded gigabytes grows rapidly; incremental dumps would spare Wikidata a lot of bandwidth, if they care, and save me a lot of processing.