We should provide incremental dumps for the JSON dumps as well.
This would probably be implemented like this: have a script that dumps all entity IDs that have been changed since the last incremental dump, then dump all entities on that list.
The first script is not yet implemented, but that shouldn't be too hard.
A potential shortcoming of this approach (which may or may not also apply to the other incremental dumps, I have no idea): deletions and merges (which turn entities into redirects) wouldn't show up in such a dump.
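The two-step idea above (collect changed IDs, then dump those entities) can be sketched roughly as follows. This is only an illustration, not the actual implementation: the change log here is a plain list of (entity ID, timestamp) pairs, whereas a real script would read from recentchanges or a similar source. The "deleted" list is one hypothetical way to handle the deletions/merges shortcoming mentioned above.

```python
import json

def changed_ids(change_log, since):
    """Return the deduplicated set of entity IDs touched after `since`.

    `change_log` is assumed to be an iterable of (entity_id, timestamp)
    pairs; in reality this would come from the wiki's change tables.
    """
    return {eid for eid, ts in change_log if ts > since}

def incremental_dump(current_entities, change_log, since):
    """Build a JSON dump containing only entities changed after `since`.

    IDs that appear in the change log but are missing from the current
    entity data are listed under "deleted", so consumers can drop them --
    one possible answer to the deletions/merges problem.
    """
    ids = changed_ids(change_log, since)
    dump = {"entities": {}, "deleted": []}
    for eid in sorted(ids):
        if eid in current_entities:
            dump["entities"][eid] = current_entities[eid]
        else:
            dump["deleted"].append(eid)
    return json.dumps(dump)

# Toy usage: Q1 and Q3 changed after timestamp 9; Q3 no longer exists.
log = [("Q1", 10), ("Q2", 5), ("Q3", 12), ("Q1", 11)]
entities = {"Q1": {"labels": {}}, "Q2": {"labels": {}}}
print(incremental_dump(entities, log, since=9))
```

Consumers would apply such a dump on top of the previous full dump: replace everything under "entities", remove everything under "deleted".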
I believe I originally asked for this, but the current WDQ wouldn't use these dumps anymore, and SPARQL replacements are on the way. If I was the only customer, this task could be closed now.
Reporting here what I wrote on Wikidata:
The current compressed JSON dump is more than 6 gigabytes. Would it be possible to create JSON dumps containing only the items changed or added since the previous week's dump? This would produce smaller files, so less time would be needed for download and decompression. That would be useful for those with slow connections.
It would also be useful for bot operators who run periodic maintenance tasks.