To allow parallel downloading of snapshot chunks and querying chunk metadata, we need to update our snapshots service (export handler) to generate chunk tarballs and chunk metadata.
Refer to RfC-chunked-snapshots for details.
To do
- Generate and upload chunks tar.gz and json
For example, for project enwiki and namespace 0, the export handler generates and uploads snapshots/enwiki_namespace_0.json and snapshots/enwiki_namespace_0.tar.gz to s3.
Each tar.gz contains several ndjson files, split according to the uncompressed file size limit. We want each of those ndjson files to be uploaded to s3 as a chunk, as follows:
chunks/enwiki_namespace_0/chunk_0.json
chunks/enwiki_namespace_0/chunk_0.tar.gz
chunks/enwiki_namespace_0/chunk_1.json
chunks/enwiki_namespace_0/chunk_1.tar.gz
...
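The key layout above can be sketched as a small helper; a minimal sketch, assuming a hypothetical `chunkKeys` function (the name and signature are illustrative, not part of the existing codebase):

```go
package main

import "fmt"

// chunkKeys returns the S3 object keys for a snapshot's chunks,
// following the chunks/<identifier>/chunk_<n>.{json,tar.gz} layout
// described above. identifier is e.g. "enwiki_namespace_0".
func chunkKeys(identifier string, n int) []string {
	keys := make([]string, 0, 2*n)
	for i := 0; i < n; i++ {
		keys = append(keys,
			fmt.Sprintf("chunks/%s/chunk_%d.json", identifier, i),
			fmt.Sprintf("chunks/%s/chunk_%d.tar.gz", identifier, i),
		)
	}
	return keys
}

func main() {
	for _, k := range chunkKeys("enwiki_namespace_0", 2) {
		fmt.Println(k)
	}
}
```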
- Update chunks field of snapshots metadata.
In the export handler, we need to update the chunks field of the snapshots metadata to reflect the number of chunks present for a snapshot.
For the above example:
{
  "identifier": "enwiki_namespace_0",
  "version": "637a1410d4e803c0b5ca04ecc6890815",
  "date_modified": "2023-12-21T02:40:14.475051666Z",
  "is_part_of": { "identifier": "enwiki" },
  "in_language": { "identifier": "en" },
  "namespace": { "identifier": 0 },
  "size": { "value": 123374.514e0, "unit_text": "MB" },
  "chunks": ["enwiki_namespace_0_chunk_0", "enwiki_namespace_0_chunk_1", ...]
}
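Populating the chunks field can be sketched as follows; a minimal sketch in which the `Metadata` struct and `withChunks` helper are hypothetical stand-ins for the real metadata type (only the fields relevant to chunking are shown):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Metadata mirrors the snapshot metadata shown above, trimmed to the
// fields this sketch needs; the real type has more fields.
type Metadata struct {
	Identifier string   `json:"identifier"`
	Chunks     []string `json:"chunks"`
}

// withChunks returns a copy of the metadata with the chunks field set
// to one entry per generated chunk, named "<identifier>_chunk_<n>".
func withChunks(m Metadata, n int) Metadata {
	m.Chunks = make([]string, n)
	for i := range m.Chunks {
		m.Chunks[i] = fmt.Sprintf("%s_chunk_%d", m.Identifier, i)
	}
	return m
}

func main() {
	m := withChunks(Metadata{Identifier: "enwiki_namespace_0"}, 2)
	b, _ := json.Marshal(m)
	fmt.Println(string(b))
}
```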
- Switch to enable chunking
Since the snapshots handler is also used to generate batches, we only aim to generate chunks for snapshots.
To enable/disable chunking, add a new field enable_chunking to ExportRequest in protos/snapshots.proto.
Update the protos submodule for the scheduler and snapshots services. Set enable_chunking to false for the batches DAG and true for the snapshots DAG.
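The proto change could look like the following; a sketch only, since the surrounding fields and the field number are assumptions about the real protos/snapshots.proto:

```proto
message ExportRequest {
  // ...existing fields unchanged...

  // enable_chunking controls whether the export handler also generates
  // and uploads per-chunk tar.gz and json objects. The field number 12
  // is illustrative; pick the next free number in the real message.
  bool enable_chunking = 12;
}
```

Since proto3 booleans default to false, existing callers (such as the batches DAG) keep the current non-chunking behavior without any change.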
- Optional: some code refactoring
The team decided this is entirely optional. If refactoring takes time and makes completing this task complex, opt to put good comments in the code instead.
Note that snapshot generation and upload have similar steps and functionality to chunk generation and upload. If possible, consider using common interfaces such as the following (these are just examples):
```go
// TarWriter takes a buffer, creates a tar header using the buffer, then
// writes the tar header and buffer data using the tar writer.
type TarWriter interface {
	TarWriter(buf *buffer, trw *tar.Writer) error
}

// Uploader reads a pipe and uploads it to an s3 bucket using the key.
type Uploader interface {
	Uploader(upl *s3manager.Uploader, prr *nio.PipeReader, bkt string, key string) error
}

// ...
```
Sync with @ROdonnell-WMF as he did some draft refactoring and implementation already.
QA / Acceptance criteria
- Dev deployment and testing
After dev deployment, verify that the right number of chunks is uploaded to s3, and that the chunk and snapshot metadata are updated accordingly.