To allow parallel downloading of snapshot chunks and querying chunk metadata, we need to update our snapshots service (export handler) to generate chunk tarballs and chunk metadata.
Refer to RfC-chunked-snapshots for details.
To do
- Generate and upload chunks tar.gz and json
For example, for project enwiki and namespace 0, the export handler generates and uploads snapshots/enwiki_namespace_0.json and snapshots/enwiki_namespace_0.tar.gz to s3.
Each tar.gz contains several ndjson files, split according to the uncompressed file size limit. We want each of those ndjson files to be uploaded to s3 as a chunk, as follows:
chunks/enwiki_namespace_0/chunk_0.json
chunks/enwiki_namespace_0/chunk_0.tar.gz
chunks/enwiki_namespace_0/chunk_1.json
chunks/enwiki_namespace_0/chunk_1.tar.gz
...
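The key layout above can be sketched as a small helper; a minimal sketch, assuming a hypothetical `chunkKeys` function (the name and signature are illustrative, not part of the existing codebase):

```go
package main

import "fmt"

// chunkKeys returns the S3 object keys for a snapshot's chunks,
// following the chunks/<identifier>/chunk_<n>.{json,tar.gz} layout
// described above. identifier is e.g. "enwiki_namespace_0".
func chunkKeys(identifier string, n int) []string {
	keys := make([]string, 0, 2*n)
	for i := 0; i < n; i++ {
		keys = append(keys,
			fmt.Sprintf("chunks/%s/chunk_%d.json", identifier, i),
			fmt.Sprintf("chunks/%s/chunk_%d.tar.gz", identifier, i),
		)
	}
	return keys
}

func main() {
	for _, k := range chunkKeys("enwiki_namespace_0", 2) {
		fmt.Println(k)
	}
}
```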
- Update chunks field of snapshots metadata.
In the export handler, we need to update the chunks field of the snapshots metadata to reflect the number of chunks present for a snapshot.
For the above example:
{
  "identifier": "enwiki_namespace_0",
  "version": "637a1410d4e803c0b5ca04ecc6890815",
  "date_modified": "2023-12-21T02:40:14.475051666Z",
  "is_part_of": { "identifier": "enwiki" },
  "in_language": { "identifier": "en" },
  "namespace": { "identifier": 0 },
  "size": { "value": 123374.514e0, "unit_text": "MB" },
  "chunks": ["enwiki_namespace_0_chunk_0", "enwiki_namespace_0_chunk_1", ...]
}
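Populating the chunks field can be sketched as follows; a minimal sketch in which the `Metadata` struct and `withChunks` helper are hypothetical stand-ins for the real metadata type (only the fields relevant to chunking are shown):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Metadata mirrors the snapshot metadata shown above, trimmed to the
// fields this sketch needs; the real type has more fields.
type Metadata struct {
	Identifier string   `json:"identifier"`
	Chunks     []string `json:"chunks"`
}

// withChunks returns a copy of the metadata with the chunks field set
// to one entry per generated chunk, named "<identifier>_chunk_<n>".
func withChunks(m Metadata, n int) Metadata {
	m.Chunks = make([]string, n)
	for i := range m.Chunks {
		m.Chunks[i] = fmt.Sprintf("%s_chunk_%d", m.Identifier, i)
	}
	return m
}

func main() {
	m := withChunks(Metadata{Identifier: "enwiki_namespace_0"}, 2)
	b, _ := json.Marshal(m)
	fmt.Println(string(b))
}
```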
- Switch to enable chunking
Since the snapshots handler is also used to generate batches, we only aim to generate chunks for snapshots.
To enable/disable chunking, add a new field enable_chunking to ExportRequest in protos/snapshots.proto.
Update the protos submodule for the scheduler and snapshots services. Set enable_chunking to false for the batches DAG and true for the snapshots DAG.
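The proto change could look like the following; a sketch only, since the surrounding fields and the field number are assumptions about the real protos/snapshots.proto:

```proto
message ExportRequest {
  // ...existing fields unchanged...

  // enable_chunking controls whether the export handler also generates
  // and uploads per-chunk tar.gz and json objects. The field number 12
  // is illustrative; pick the next free number in the real message.
  bool enable_chunking = 12;
}
```

Since proto3 booleans default to false, existing callers (such as the batches DAG) keep the current non-chunking behavior without any change.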
- Optional: some code refactoring
The team decided this is entirely optional. If refactoring takes time and makes completing this task complex, opt to put good comments in the code instead.
Note that snapshot generation and upload have similar steps and functionality to chunk generation and upload. If possible, consider using common interfaces such as the following (these are just examples):
```go
// TarWriter takes a buffer, creates a tar header using the buffer, then
// writes the tar header and buffer data using the tar writer.
type TarWriter interface {
	TarWriter(buf *buffer, trw *tar.Writer) error
}

// Uploader reads a pipe and uploads it to an s3 bucket using the key.
type Uploader interface {
	Uploader(upl *s3manager.Uploader, prr *nio.PipeReader, bkt string, key string) error
}

// ...
```
Sync with @ROdonnell-WMF as he did some draft refactoring and implementation already.
QA / Acceptance criteria
- Dev deployment and testing
After dev deployment, verify that the right number of chunks is uploaded to s3, and that the chunk and snapshot metadata are updated accordingly.