
create index for each dump
Open, Stalled, Low · Public

Description

Create an index for each dump that contains which IDs changed and also contains deletes and redirects (with target).
For the full dumps it might be a good idea to have deletes and redirects in separate files.

Event Timeline

JanZerebecki raised the priority of this task from to Needs Triage.
JanZerebecki updated the task description.
JanZerebecki changed Security from none to None.
JanZerebecki added subscribers: Aklapper, JanZerebecki.
Lydia_Pintscher triaged this task as Normal priority. · Dec 27 2014, 12:06 PM
Lydia_Pintscher added a subscriber: hoo.
Restricted Application added a project: Discovery. · Aug 27 2016, 7:24 PM

Copying here the suggestion I made on the mailing list, in the thread "A plea for incremental dumps":

I think Apache CouchDB would be a great fit to address the issue of keeping up to date with the whole database. Quoting the Wikipedia article:

Main features
[...]
Distributed Architecture with Replication
CouchDB was designed with bi-directional replication (or synchronization) and off-line operation in mind. That means multiple replicas can have their own copies of the same data, modify it, and then sync those changes at a later time.

Wikimedia could run a CouchDB instance that is updated live or, if that is not possible, at the same frequency as the dumps. Anyone interested could either run their own instance that mirrors the Wikimedia master instance live (using replication), or simply ask from time to time which entities changed (using the _changes endpoint).
I guess the first replication would take more time and be more resource-intensive than a simple file dump, but that cost would quickly be offset by the differential updates that follow.
This would be beautiful :)
Let me know if I can help make it happen.
Best,
Maxime

The _changes endpoint I mentioned would provide the desired list of IDs that changed, plus a few goodies such as the include_docs and filter options.
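To make the idea concrete, here is a minimal Python sketch of how a client might ask a CouchDB instance which documents changed since a given sequence. No such Wikimedia instance exists; the base URL and the database name "wikidata" used below are purely hypothetical, and the helper names are made up for illustration. The since, include_docs and limit query parameters and the results / last_seq response fields are the ones CouchDB documents for GET /{db}/_changes.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def changes_url(base_url, db, since="0", include_docs=False, limit=None):
    """Build a URL for CouchDB's GET /{db}/_changes endpoint."""
    params = {"since": since}
    if include_docs:
        params["include_docs"] = "true"
    if limit is not None:
        params["limit"] = str(limit)
    return f"{base_url.rstrip('/')}/{db}/_changes?{urlencode(params)}"


def summarize_changes(feed):
    """Split a parsed _changes response into changed and deleted document IDs."""
    changed, deleted = [], []
    for row in feed["results"]:
        # Rows for deleted documents carry "deleted": true.
        (deleted if row.get("deleted") else changed).append(row["id"])
    # last_seq is the checkpoint to pass as "since" on the next request.
    return changed, deleted, feed["last_seq"]


def fetch_changes(base_url, db, since="0"):
    """Fetch one batch of changes (performs a network request)."""
    with urlopen(changes_url(base_url, db, since)) as resp:
        return summarize_changes(json.load(resp))
```

A consumer would store the returned last_seq and pass it as since on the next call, e.g. fetch_changes("http://localhost:5984", "wikidata", since=last_seq), so that each request only returns what changed after the previous one.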

Smalyshev lowered the priority of this task from Normal to Low. · Dec 20 2016, 11:18 PM
Smalyshev changed the task status from Open to Stalled. · Dec 21 2017, 2:14 AM
Lazhar added a subscriber: Lazhar. · Feb 13 2018, 8:24 AM

Hello guys - I am using Wikidata enriched with other data sources, so I must ingest the entire Wikidata JSON dump into a dev graph database of mine. That's easy (yet time-consuming), but once that's done, I want to keep my copy updated by querying the RecentChanges and LogEvents API endpoints to retrieve the changes/deletes/creates that occurred between two timestamps (I'd do so every few minutes) - that's relatively easy too!

How do I get the cutoff timestamp for a given JSON dump? Where is it available, or how can I figure it out, given that the modified timestamp and lastrevid fields aren't present in the JSON dumps?
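For reference, the polling step described above (not the cutoff-timestamp question, which stays open) could look roughly like the Python sketch below. It only builds the request URLs for the MediaWiki Action API's list=recentchanges and list=logevents modules. The function names, the 500-row limit and the restriction to delete log entries are assumptions made for illustration, not something stated in this task.

```python
from urllib.parse import urlencode

API = "https://www.wikidata.org/w/api.php"


def recent_changes_url(start, end, limit=500):
    """URL listing recent changes between two ISO 8601 timestamps, oldest first."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcstart": start,   # with rcdir=newer, start is the older timestamp
        "rcend": end,
        "rcdir": "newer",
        "rcprop": "title|ids|timestamp",
        "rclimit": limit,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


def log_events_url(start, end, limit=500):
    """URL listing deletion log events between two ISO 8601 timestamps."""
    params = {
        "action": "query",
        "list": "logevents",
        "letype": "delete",  # assumption: only the delete log is of interest
        "lestart": start,
        "leend": end,
        "ledir": "newer",
        "lelimit": limit,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"
```

If a window contains more rows than the limit, the JSON response includes a "continue" object whose values have to be sent back with the next request to fetch the remaining rows.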

@Lazhar: Please do not ask the same question in several tasks. Please see https://www.wikidata.org/wiki/Wikidata:Contact_the_development_team for where to ask support questions that are not directly related to the task topic. Thanks for your understanding!