
Rethink 12± hour lag of incremental dumps for Wikidata
Closed, Declined (Public)

Description

The incremental dumps available at https://dumps.wikimedia.org/other/incr/ are roughly 12 hours behind to "give local editing communities time to delete revisions with sensitive information, vulgarities and other vandalism, etc". This lag actually makes it harder to find the listed kinds of problems, since most reports (such as the mandatory constraint violation reports) are based on these dumps. As a result, actively working down constraint violations is very unrewarding.

Is it technically possible to remove this predefined lag, or is there a different solution? I don't think SPARQL queries can replace this, as they would take a long time to run.

Event Timeline

It is technically possible to reduce it to as little as we like, but once the dumps are produced, the data may be archived forever by anyone. This means that revisions with sensitive information may float around the internet forever, and we want to minimize that. The idea is that folks use other tools to discover current revisions with these sorts of issues and remove them.

I'd like to reach a conclusion on this one way or another, either accept or decline. My inclination right now is to decline, but I'm still open to discussion if you have suggestions regarding my previous comment.

ArielGlenn closed this task as Declined. May 8 2018, 7:29 AM

I'm going to go ahead and decline this. If there is new information to take into account at a later date, it can be re-opened or a new ticket created.