Output files are built in place and are discoverable through the normal HTTP file index while they are still being written. For example, today is March 20th and dumps are currently being generated: https://dumps.wikimedia.org/other/enterprise_html/runs/20230320/ is already listed in the https://dumps.wikimedia.org/other/enterprise_html/runs/ index, and refreshing the snapshot index shows the .json.tar.gz files changing in size.
This is not ideal because clients will have to guess or use heuristic methods to tell whether a specific dump exists or will exist, and whether it's complete. In the worst-case scenario, clients may download huge amounts of data only to discover that the file is truncated because it wasn't completely written yet.
I would suggest the following:
- Don't add the snapshot date to the top-level index until the run is finished.
- Build each dump file under a temporary hidden name like ".<random>.<final_name>", similar to what rsync does, and rename it into its final location once it is complete (see the first sketch after this list).
- Nice-to-have: the final step in the dump pipeline could be to write a machine-readable index file with filenames and sizes for the snapshot date (a possible format is sketched after this list).
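
For the temporary-name-then-rename suggestion, something like the following would work. This is only a Python sketch under my assumptions (function name, ".<random>.<final_name>" pattern, and cleanup behavior are all illustrative, not the actual pipeline code); the important property is that the rename is atomic as long as the temporary file lives on the same filesystem as the final location, so clients can never observe a half-written file under its final name.

```python
import os
import secrets


def write_dump_atomically(final_path: str, chunks) -> None:
    """Write a dump to a hidden temporary name, then rename into place."""
    directory, final_name = os.path.split(final_path)
    # Temporary name follows the ".<random>.<final_name>" pattern suggested above.
    tmp_path = os.path.join(directory, f".{secrets.token_hex(8)}.{final_name}")
    try:
        with open(tmp_path, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())        # make sure the data has hit disk
        os.rename(tmp_path, final_path)  # atomic within the same filesystem
    except BaseException:
        # Remove the partial temporary file if anything goes wrong.
        try:
            os.remove(tmp_path)
        except FileNotFoundError:
            pass
        raise
```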
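For the machine-readable index, a minimal sketch could look like the code below. The index filename ("index.json") and the JSON schema are just assumptions for illustration; the point is that it lists every dump file with its final size and is written as the very last step, so its presence tells clients the run is complete.

```python
import json
import os


def write_run_index(run_dir: str, index_name: str = "index.json") -> None:
    """Write an index of all finished dump files in a snapshot directory."""
    entries = [
        {"filename": name, "size": os.path.getsize(os.path.join(run_dir, name))}
        for name in sorted(os.listdir(run_dir))
        # Skip in-progress hidden temp files and the index file itself.
        if not name.startswith(".") and name != index_name
    ]
    with open(os.path.join(run_dir, index_name), "w") as f:
        json.dump({"files": entries}, f, indent=2)
```

A client can then fetch the index first, and compare the listed sizes against what it has downloaded to detect truncation without any guesswork.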