To make sure that the chunks of a snapshot together contain all the articles in that snapshot, we need to do some testing.
To Do
- In the scheduler repo, update the protos submodule and generate the Python gRPC code.
- Update the snapshots DAG: set the enable_chunking arg of ExportRequest to true.
- Update the batches and structured-snapshot DAGs: set the enable_chunking arg of ExportRequest to false.
- From the scheduler, run a snapshot for a couple of smaller projects. This should produce a snapshot as well as chunks.
- Compare the articles in the snapshot vs. the articles in all the chunks for this snapshot. They should be the same. Take a look at wikimedia-enterprise/experiments/snapshots for inspiration on snapshot testing.
- Run a batches job and a structured-snapshot job from the scheduler. Verify that the batches and structured snapshots are created as usual and that no chunks are generated for them.
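The comparison step above could be sketched roughly as follows. This is only a starting point under stated assumptions: it assumes the snapshot and each chunk can be read as newline-delimited JSON with one article per line, and that each article carries a unique `identifier` field. The function names `article_ids` and `compare_snapshot_to_chunks` are hypothetical and not taken from the scheduler or experiments repos.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List


def article_ids(lines: Iterable[str], key: str = "identifier") -> Counter:
    """Count article ids in NDJSON lines (ASSUMPTION: one article per line)."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        counts[json.loads(line)[key]] += 1
    return counts


def compare_snapshot_to_chunks(
    snapshot_lines: Iterable[str],
    chunks: Iterable[Iterable[str]],
    key: str = "identifier",  # ASSUMPTION: field that uniquely names an article
) -> Dict[str, List]:
    """Report ids that differ between a snapshot and the union of its chunks."""
    snapshot = article_ids(snapshot_lines, key)
    chunked: Counter = Counter()
    for chunk_lines in chunks:
        chunked.update(article_ids(chunk_lines, key))
    return {
        # In the snapshot but in none of the chunks.
        "missing_from_chunks": sorted(set(snapshot) - set(chunked)),
        # In some chunk but not in the snapshot.
        "extra_in_chunks": sorted(set(chunked) - set(snapshot)),
        # Present in both, but more often in the chunks than in the snapshot.
        "duplicated_in_chunks": sorted(
            k for k, n in chunked.items() if k in snapshot and n > snapshot[k]
        ),
    }


if __name__ == "__main__":
    # Tiny in-memory example; real input would be lines read from the files.
    snapshot = ['{"identifier": 1}', '{"identifier": 2}', '{"identifier": 3}']
    chunks = [['{"identifier": 1}', '{"identifier": 2}'], ['{"identifier": 3}']]
    report = compare_snapshot_to_chunks(snapshot, chunks)
    print(report)
    assert not any(report.values()), "snapshot and chunks differ"
```

The three lists are all empty when the chunks hold exactly the articles in the snapshot. Counting ids rather than only collecting them into a set also catches an article that was written to more than one chunk.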
Acceptance criteria / QA
- The articles in a snapshot and the articles in all the chunks for that snapshot are the same.
- The DAGs for batches and structured-snapshot are working as usual.