
wikidata-exports is using 256G in Tools
Closed, Resolved · Public

Description

Almost all of this is in:

61G old_dumpfiles
196G public_html

In public_html I see files going back to 20140420. Can we clean up the old_dumpfiles and historical public_html? What is the data retention policy for this tool?
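
For reference, a per-directory breakdown like the one above can be reproduced from a Toolforge shell with du; the path below is an assumption based on the usual tool home layout, not taken from the tool's actual configuration.

```bash
# Hypothetical path: Toolforge tool homes normally live under /data/project/<toolname>.
du -h --max-depth=1 /data/project/wikidata-exports | sort -hr | head
```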

Event Timeline

I wasn't able to find any of the maintainers on Phabricator. I emailed Markus via contact information found through this user page.

We have two kinds of large data files: biweekly Wikidata JSON entity dumps, and the RDF exports that we generate from them. The RDF exports are what we offer through our website at http://tools.wmflabs.org/wikidata-exports/rdf/index.php?content=exports.php

We can easily delete old Wikidata dumps. However, the history might be of interest. Is there any other record of Wikidata dumps anywhere? It would be a pity to delete the last place where the actual history is preserved, even though it was not meant to be used for this purpose. (Note that full dumps do not contain deleted pages and are therefore not a full history; it is also a lot of extra work to compute a single point-in-time snapshot from a full dump.) Maybe one should check with the community whether there is historic interest in these dumps before removing them for good.

For our purposes, we could also stop downloading our own dumps altogether if they appeared on labs' /public/dumps/public/wikidatawiki/entities within an hour of their online appearance, and not just a day later.
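
A minimal sketch of what reading from that shared mount could look like, assuming the per-date directory layout there mirrors dumps.wikimedia.org; the symlink target is a hypothetical name, not the tool's actual setup.

```bash
# Sketch: point the tool at the newest JSON entity dump on the shared labs mount
# instead of keeping a private copy. The YYYYMMDD directory naming is an assumption.
DUMPDIR=/public/dumps/public/wikidatawiki/entities
LATEST=$(ls "$DUMPDIR" | grep -E '^[0-9]{8}$' | sort -n | tail -n 1)
ln -sfn "$DUMPDIR/$LATEST" "$HOME/dumpfiles/json-latest"   # hypothetical target path
```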

As for the RDF dumps we have generated and published, we do not have any retention policy so far. There is some historic interest there as well, and I would keep a few snapshots of the past, but one could refer interested users to the old entity dumps, from which they can rebuild the RDF dumps themselves if they wish. This assumes that the old entity dumps are available somewhere.

> We can easily delete old Wikidata dumps. However, the history might be of interest. Is there any other record of Wikidata dumps anywhere? [...]

@Lydia_Pintscher hello! Do you know who could answer this question:

Is there any other record of Wikidata dumps anywhere?

Could you help us understand if the data here is worth keeping?


@mkroetzsch If we do want to keep these historical dumps for some future use case, is there somewhere else they can be stored? Does Wikidata have a dump of historical content? This online NFS share is an expensive place to stash large inactive things away, and it's not a great plan to serve them off of NFS if we can avoid it.

> For our purposes, we could also stop downloading our own dumps altogether if they appeared on labs' /public/dumps/public/wikidatawiki/entities within an hour of their online appearance, and not just a day later.

@ArielGlenn, is hourly syncing of dumps an option?

> As for the RDF dumps we have generated and published, we do not have any retention policy so far. [...] I would keep a few snapshots of the past, but one could refer interested users to the old entity dumps, from which they can rebuild the RDF dumps themselves [...]

I'm not sure what the end game of this is. Can you do this and see where it gets us space-wise?
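
If it helps, the deletion side of that could be roughly the following; the exact paths and the choice to keep the three newest RDF snapshots are assumptions, not the tool's actual layout or policy.

```bash
# Sketch of the proposed cleanup: drop the downloaded entity dumps and keep only
# the three newest published RDF export directories. All paths are assumptions.
cd /data/project/wikidata-exports || exit 1
rm -rf old_dumpfiles
ls -d public_html/rdf/exports/*/ | sort | head -n -3 | xargs -r rm -rf
```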

> @Lydia_Pintscher hello! Do you know who could answer this question: Is there any other record of Wikidata dumps anywhere? Could you help us understand if the data here is worth keeping?

I believe it is worth keeping. There might be some on the Internet Archive or elsewhere in Wikimedia infrastructure but I am not sure. Adding @hoo and @aude who might know.

Yeah, I think sending these to the Internet Archive would make sense. The code that I have written to use the JSON dumps, based on the Wikidata Toolkit, actually uses archive.org as a source for dump files.

I guess in theory it would always be possible for us to create JSON dumps for a past point in time, since we have all revisions, but indeed these would not include deleted items. Then again, why should they?

> Yeah, I think sending these to the Internet Archive would make sense. [...]

Does anyone know how we transition this data to archive.org?
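
One possible route is the internetarchive command-line client ("ia"); the item identifier, metadata, and local path below are made up for illustration, and an archive.org account would be needed.

```bash
# Sketch: upload one RDF export snapshot to archive.org with the "ia" CLI
# (pip install internetarchive). Identifier, metadata and path are hypothetical.
ia configure   # one-time: log in with an archive.org account
ia upload wikidata-exports-rdf-20140420 public_html/rdf/exports/20140420/* \
    --metadata="title:Wikidata RDF exports (2014-04-20)" \
    --metadata="mediatype:data"
```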

I would not rsync every hour; that seems like overkill. I could try to schedule the cron job to run shortly after the entity dump completes, maybe put them together in a little bash script or something.

Ariel
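
For what it's worth, a rough sketch of such a wrapper; the source path, schedule, and file layout are placeholders rather than the real dumps infrastructure.

```bash
#!/bin/bash
# Hypothetical wrapper: run from cron shortly after the entity-dump job usually
# finishes, and only sync when a new dump directory has appeared.
# Example crontab entry (day/time are placeholders):
#   0 4 * * 1  /usr/local/bin/sync-entity-dumps.sh
SRC=/path/on/the/dumps/host/wikidatawiki/entities   # placeholder, not the real path
DST=/public/dumps/public/wikidatawiki/entities      # labs-visible mount from this thread
NEWEST=$(ls "$SRC" | sort -n | tail -n 1)
if [ -n "$NEWEST" ] && [ ! -d "$DST/$NEWEST" ]; then
    rsync -a "$SRC/$NEWEST/" "$DST/$NEWEST/"
fi
```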

Phamhi changed the task status from Open to Stalled. Nov 13 2019, 2:20 PM
Phamhi subscribed.

Closing this as the last update was 3 years ago.

Phamhi claimed this task.