
Improve fail-safety and performance of export script
Open, Needs Triage, Public

Description

Our current export script executes the following steps in order:

  1. Export the data to a CSV file
  2. Check the CSV file for broken lines (i.e. count the number of columns in the header and check whether the total number of fields is divisible by it; see the sketch after this list).
  3. Export the data again to a CSV file, this time marking the data in the database as "is exported".
  4. Compress the CSV file
  5. Encrypt the compressed CSV file
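
What the broken-line check in step 2 might look like: a minimal PHP sketch, assuming a header row and a semicolon delimiter. The function name, delimiter, and file handling are assumptions for illustration, not the actual script.

```php
<?php
// Hypothetical sketch of the consistency check in step 2. It counts the
// total number of fields and verifies the total is divisible by the
// number of columns in the header row; a truncated line breaks this.
function csvLooksIntact(string $path, string $delimiter = ';'): bool
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return false;
    }

    $header = fgetcsv($handle, 0, $delimiter);
    if ($header === false) {
        fclose($handle);
        return false;
    }

    $columns = count($header);
    $fields  = $columns; // the header's own fields count too
    while (($row = fgetcsv($handle, 0, $delimiter)) !== false) {
        $fields += count($row);
    }
    fclose($handle);

    return $fields % $columns === 0;
}
```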

This approach has two problems:

  1. Because "export" and "mark as exported" are so intertwined, the export is done twice. This is not a big problem as long as the returned data set fits in the MySQL query cache and the PHP processing is not too slow. Historically, this step was meant to be atomic, but in reality it isn't, because the exporter's SQL statements are not wrapped in a transaction: a donation made between the SELECT statement for the export and the UPDATE statement for marking could be marked as exported without ever appearing in the export file (illustrated in the sketch after this list).
  2. If something goes wrong during compression or encryption, the data sets have already been marked as exported, so a rerun would not pick them up.
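
To make problem 1 concrete, here is a hedged sketch of the race; the table name, columns, and connection details are assumptions, and the real exporter's queries may differ.

```php
<?php
// Hypothetical illustration of problem 1. Without a transaction, the
// two statements below do not necessarily see the same set of rows.
$pdo = new PDO('mysql:host=localhost;dbname=donations', 'user', 'pass');

// First pass: read the rows and write them to the CSV file.
$rows = $pdo->query('SELECT * FROM donation WHERE is_exported = 0')->fetchAll();
// ... CSV is written here; a donation INSERTed at this point is missed ...

// Second pass: this WHERE clause also matches rows created after the
// SELECT above, so they are marked as exported without being in the file.
$pdo->exec('UPDATE donation SET is_exported = 1 WHERE is_exported = 0');
```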

A better way for the export script would be:

  1. Export the data to a CSV file
  2. Check the CSV file for broken lines
  3. Encrypt the CSV file (GPG compresses by default, so a separate compression step is unnecessary; see the sketch after this list)
  4. Mark exactly the data set that was exported as exported
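
Steps 3 and 4 of this flow could look roughly like the following, assuming gpg is on the PATH; the file path and recipient are invented for illustration. The point is that the encryption result gates the database update.

```php
<?php
// Hedged sketch of steps 3 and 4: encrypt first, mark as exported only
// on success. File path and recipient are assumptions.
$csvFile   = '/tmp/export.csv';
$recipient = 'export@example.com';

$command = sprintf(
    'gpg --batch --yes --encrypt --recipient %s --output %s %s',
    escapeshellarg($recipient),
    escapeshellarg($csvFile . '.gpg'),
    escapeshellarg($csvFile)
);
exec($command, $output, $exitCode);

if ($exitCode !== 0) {
    // Encryption failed: abort before touching the database, so the
    // rows stay unmarked and the next run exports them again.
    fwrite(STDERR, "gpg failed with exit code $exitCode\n");
    exit(1);
}

// Only now run step 4 and mark exactly the exported rows.
```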

There are several approaches to improve the process:

  • Create an "export table" that contains all the IDs of the exported data, together with an "export ID". Drawbacks: the data has to be cleaned up from time to time, and exports become slower because additional data has to be written.
  • Use a strict time window (a 24-hour period) for the exports.
  • Move most of the shell script code into the PHP code and wrap the export in a transaction that rolls back when any file creation step fails. This seems to be the most attractive approach (see the sketch after this list).
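
A rough sketch of that third approach, assuming PDO on an InnoDB table and hypothetical helpers writeCsv(), checkCsv() and encryptCsv() that throw on failure; schema and connection details are invented for illustration.

```php
<?php
// Sketch of the transaction-based approach: the "mark as exported"
// UPDATE only becomes visible if every file step succeeded.
$pdo = new PDO('mysql:host=localhost;dbname=donations', 'user', 'pass');

$pdo->beginTransaction();
try {
    // Capture the IDs once, so the UPDATE below touches exactly the
    // rows that went into the file (FOR UPDATE locks them meanwhile).
    $ids = $pdo->query(
        'SELECT id FROM donation WHERE is_exported = 0 FOR UPDATE'
    )->fetchAll(PDO::FETCH_COLUMN);

    writeCsv($pdo, $ids, '/tmp/export.csv'); // step 1
    checkCsv('/tmp/export.csv');             // step 2, throws on broken lines
    encryptCsv('/tmp/export.csv');           // step 3, throws on gpg failure

    if ($ids !== []) {
        $placeholders = implode(',', array_fill(0, count($ids), '?'));
        $stmt = $pdo->prepare(
            "UPDATE donation SET is_exported = 1 WHERE id IN ($placeholders)"
        );
        $stmt->execute($ids);
    }

    $pdo->commit();
} catch (Throwable $e) {
    // Any failure in file creation, checking or encryption rolls the
    // marking back; no row is flagged as exported for a missing file.
    $pdo->rollBack();
    throw $e;
}
```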