Don't expose partial dumpfiles
Closed, Resolved (Public)

Description

Output files are built in place and are discoverable through the normal HTTP file index while they are still being written. For example, today is March 20th and dumps are being generated: https://dumps.wikimedia.org/other/enterprise_html/runs/20230320/ is already listed in the https://dumps.wikimedia.org/other/enterprise_html/runs/ index, and if I refresh the snapshot index I see .json.tar.gz files changing in size.

This is not ideal because clients will have to guess or use heuristic methods to tell whether a specific dump exists or will exist, and whether it's complete. In the worst-case scenario, clients may download huge amounts of data only to discover that the file is truncated because it wasn't completely written yet.

I would suggest the following:

  • Don't add the snapshot date to the top-level index until the run is finished.
  • Build dump files under a filename like ".<random>.<final_name>", similar to rsync. Once the file is complete, rename it into the final location.
  • Nice-to-have: the final step in the dump pipeline could be to write a machine-readable index file with filenames and sizes for the snapshot date.
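
To make the second and third suggestions concrete, here is a minimal sketch in Python. It is an illustration only, not the actual dump pipeline code: the function names, the random-prefix scheme, and the "manifest.json" filename and layout are all assumptions of mine.

```python
import json
import os
import secrets
from pathlib import Path


def write_atomically(final_path: Path, chunks) -> None:
    """Write chunks to a hidden temporary name, then rename into place."""
    # Hypothetical naming scheme following the ".<random>.<final_name>" idea.
    tmp_path = final_path.with_name(f".{secrets.token_hex(4)}.{final_path.name}")
    try:
        with open(tmp_path, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
        # The temp file lives in the same directory (same filesystem),
        # so on POSIX this rename is atomic: readers see either no file
        # or the complete file, never a partially written one.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Don't leave a stray temp file behind if anything failed.
        tmp_path.unlink(missing_ok=True)
        raise


def write_manifest(snapshot_dir: Path, manifest_name: str = "manifest.json") -> dict:
    """Final pipeline step: record the name and size of every finished file."""
    files = {
        p.name: p.stat().st_size
        for p in sorted(snapshot_dir.iterdir())
        # Skip in-progress (hidden) files and the manifest itself.
        if p.is_file() and not p.name.startswith(".") and p.name != manifest_name
    }
    manifest = {"files": files}
    payload = json.dumps(manifest, indent=2).encode("utf-8")
    write_atomically(snapshot_dir / manifest_name, [payload])
    return manifest
```

With something like this in place, a client could treat the presence of the manifest as the signal that the snapshot is complete, and compare each downloaded file's size against the listed value to detect truncation.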

Event Timeline

I'm going to tag Dumps-Generation here, since they are the ones downloading the snapshots and posting them on the website.

We could download WME dumps to a name that ends with ".inprog" and move them once the download is complete; this is how generation of our sql/xml dumps works, so it would fit right in.

That would be fine for my use case, thanks!

Hmm, we already write to a temp file with a name ending in ".tmp", so I wonder why the size is changing. The file is written into the same directory and moved with a rename, so there should be no period during which the final file is growing.

+1, that's really mysterious, then. I'm certain that I saw the file sizes changing. I guess I'll try to take some screenshots when the next dumps are published on April 1st...

I'm presuming you didn't see any instances of this in the meantime, @awight? Can we close this?

awight claimed this task.

Thanks for the nudge!

These dumps have been a really productive and interpretable new data source and are feeling increasingly solid from my perspective as a consumer.