
create index for each dump
Open, Stalled, LowPublic


Create an index for each dump that contains which IDs changed and also contains deletes and redirects (with target).
For the full dumps it might be a good idea to have deletes and redirects in separate files.
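One possible shape for such an index is a tab-separated file with one row per entity, recording the change type and, for redirects, the target. This is purely illustrative; the file layout, column names, and change-type values below are assumptions, not an agreed format:

```python
import csv
import io

# Hypothetical per-dump index format (an assumption, not a spec):
# one TSV row per entity: entity_id, change_type, redirect_target.
# change_type is one of "changed", "deleted", "redirected";
# redirect_target is empty unless change_type is "redirected".
SAMPLE_INDEX = """\
Q42\tchanged\t
Q1234\tdeleted\t
Q5678\tredirected\tQ42
"""

def parse_index(text):
    """Parse the hypothetical TSV index into a list of dicts."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    entries = []
    for entity_id, change_type, target in reader:
        entries.append({
            "id": entity_id,
            "type": change_type,
            "target": target or None,
        })
    return entries

entries = parse_index(SAMPLE_INDEX)
```

Keeping deletes and redirects in separate files, as suggested for the full dumps, would simply mean splitting such rows by change type.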

Event Timeline

JanZerebecki raised the priority of this task from to Needs Triage.
JanZerebecki updated the task description.
JanZerebecki changed Security from none to None.
JanZerebecki added subscribers: Aklapper, JanZerebecki.
Lydia_Pintscher triaged this task as Normal priority. Dec 27 2014, 12:06 PM
Lydia_Pintscher added a subscriber: hoo.
Restricted Application added a project: Discovery. Aug 27 2016, 7:24 PM

Copying here the suggestion I made in the mailing list thread "A plea for incremental dumps":

I think Apache CouchDB would be a great fit to address the issue of keeping up to date with the whole database. Quoting the Wikipedia article:

Main features
Distributed Architecture with Replication
CouchDB was designed with bi-directional replication (or synchronization) and off-line operation in mind. That means multiple replicas can have their own copies of the same data, modify it, and then sync those changes at a later time.

Wikimedia could run a CouchDB instance updated live or, if that is not possible, on the same schedule as the dumps. Interested people could either run their own instance live-mirroring the Wikimedia master instance (using replication), or simply make a request from time to time to find out which entities changed (using the _changes endpoint).
I guess the first replication would take more time and be more resource-intensive than a simple file dump, but that would be compensated quickly by the following differential updates.
This would be beautiful :)
Let me know if I can help on making it happen

The _changes endpoint I mentioned would provide the desired list of IDs that changed, plus a few goodies such as the include_docs or filter options.
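For illustration, the `_changes` endpoint is a plain HTTP GET on the database. A minimal sketch of building such a request follows; the base URL and database name `wikidata` are assumptions, since no such instance exists:

```python
from urllib.parse import urlencode

def changes_url(base, db, since="0", include_docs=False):
    """Build a CouchDB _changes request URL.

    `since` is the sequence ID of the last change already seen, so
    repeated polls only return documents changed after that point.
    """
    params = {"since": since}
    if include_docs:
        # Ask CouchDB to inline the full document with each change.
        params["include_docs"] = "true"
    return "%s/%s/_changes?%s" % (base.rstrip("/"), db, urlencode(params))

# Hypothetical instance URL, for illustration only.
url = changes_url("https://couchdb.example.org", "wikidata",
                  since="12345-abc", include_docs=True)
```

A consumer would store the `last_seq` value from each response and pass it as `since` on the next poll, which is exactly the incremental behaviour this task asks for.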

Smalyshev lowered the priority of this task from Normal to Low. Dec 20 2016, 11:18 PM
Smalyshev changed the task status from Open to Stalled. Dec 21 2017, 2:14 AM
Lazhar added a subscriber: Lazhar. Feb 13 2018, 8:24 AM

Hello guys - I am using Wikidata enriched with other data sources, so I must ingest the entire Wikidata JSON dump into a dev graph database of mine. That's easy (yet time-consuming), but once that's done, I want to keep my copy updated by querying the RecentChanges and LogEvents API endpoints to retrieve the changes/deletes/creates that occurred between two timestamps (I'd do so every few minutes) - that's relatively easy too!
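That polling approach can be sketched against the MediaWiki action API. The query construction below is a minimal illustration, not a complete sync client; the property list and limit are choices I have assumed, and deletions would need a second, similar query via list=logevents:

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"

def recent_changes_query(start, end, limit=500):
    """Build a RecentChanges query URL covering [start, end].

    Timestamps are ISO 8601. With rcdir=newer, changes are listed
    oldest-first and rcstart is the older bound of the window.
    """
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcstart": start,
        "rcend": end,
        "rcdir": "newer",
        "rclimit": str(limit),
        "rcprop": "title|ids|timestamp",
        "format": "json",
    }
    return API + "?" + urlencode(params)

url = recent_changes_query("2018-02-13T08:00:00Z", "2018-02-13T08:05:00Z")
```

A real client would also have to follow the `continue` token in each response until the window is exhausted before advancing the timestamps.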

How do I get the cutoff timestamp for a given JSON dump? Where is this available, or how can I figure it out, since the modified timestamp and lastrevid fields aren't present in the JSON dumps?

@Lazhar: Please do not ask the same question in several tasks. Please see where to ask support questions that are not directly related to the task topic. Thanks for your understanding!