
Create an RFC for chunked snapshots
Closed, Resolved | Public | 8 Estimated Story Points

Description

O6 KR2
In order to produce and serve chunks of larger snapshots, we need to scope and design this work.
Please start with the in-progress RfC-chuncked-snapshots.

To do

  • Finalize API design for chunk metadata and snapshot metadata
  • Finalize schema design for snapshot and chunk metadata
  • Infrastructure - S3 folder/filename key for the chunk metadata and for the chunk itself
  • Implementation - how to intercept the snapshot handler, how to know which snapshots to chunk, number of articles per chunk, etc.
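
As a starting point for the discussion, the chunk key and articles-per-chunk items above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a proposed design: the `ChunkMeta` fields, the `chunks/<snapshot>/chunk_<index>` key layout, the snapshot identifier, and the chunk size are all hypothetical placeholders to be settled in the RfC.

```go
package main

import "fmt"

// ChunkMeta is a hypothetical metadata record for one chunk of a snapshot.
// The field names are assumptions for illustration only.
type ChunkMeta struct {
	Snapshot string // snapshot identifier (assumed format)
	Index    int    // zero-based position of the chunk within the snapshot
	Articles int    // number of articles stored in this chunk
	Key      string // assumed S3 object key for the chunk
}

// chunkKey builds an assumed S3 key layout: chunks/<snapshot>/chunk_<index>.
func chunkKey(snapshot string, index int) string {
	return fmt.Sprintf("chunks/%s/chunk_%d", snapshot, index)
}

// planChunks splits `total` articles into chunks of at most `perChunk`
// articles and returns one metadata record per chunk.
func planChunks(snapshot string, total, perChunk int) []ChunkMeta {
	if total <= 0 || perChunk <= 0 {
		return nil
	}
	var out []ChunkMeta
	for start, i := 0, 0; start < total; start, i = start+perChunk, i+1 {
		n := perChunk
		if remaining := total - start; remaining < n {
			n = remaining // the last chunk may be smaller
		}
		out = append(out, ChunkMeta{
			Snapshot: snapshot,
			Index:    i,
			Articles: n,
			Key:      chunkKey(snapshot, i),
		})
	}
	return out
}

func main() {
	// Hypothetical snapshot name and sizes, chosen only to show the split.
	for _, c := range planChunks("example_snapshot", 25, 10) {
		fmt.Printf("%s: %d articles\n", c.Key, c.Articles)
	}
}
```

With 25 articles and 10 articles per chunk, this plan yields three chunks holding 10, 10, and 5 articles.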

Acceptance criteria
Complete RfC presented to the team

Event Timeline

JArguello-WMF triaged this task as High priority.
JArguello-WMF set the point value for this task to 8.
prabhat updated the task description.
ROdonnell-WMF changed the task status from Open to In Progress. Jan 2 2024, 2:47 AM

I've written up an RFC proposal that would move away from JSON dumps. Instead, I propose Parquet files, with Kafka Connect used to configure the export process.
We'd keep the existing JSON dumps for compatibility, and the new format would add a richer client experience and better API usability.

More details are in the RFC folder, in "Implementing Apache Parquet for WME Dumps".

Looking for team feedback on the RFC titled "Implementing Apache Parquet for WME Dumps".

I looked into the RFC for the Parquet format. It seems unrelated to chunked snapshots. We should not involve or discuss the Parquet format as part of this scoping work. This task is specifically about chunking < 1% of our current larger snapshots.

Based on Prabhat's draft RFC, I have reservations about the complexity the proposal would add to a single file, export.go. The code is already complex and hard to maintain. The proposed changes would add conditions for turning chunking on or off based on file size and on export type. I'd prefer to see this block of code re-architected so that it is more maintainable and less cryptic.

I'd like to discuss it with Prabhat to see if he has insights into making the code more maintainable. I've put the task in the blocked column until I've chatted with Prabhat.
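
For illustration only, one way the size and export-type conditions mentioned above might be kept out of the export code path is a small policy type that the handler queries once. This is a sketch under assumptions, not the contents of export.go or of the draft RFC: `ExportType`, its values, `ChunkPolicy`, and the 1 GiB threshold are all hypothetical names and numbers.

```go
package main

import "fmt"

// ExportType is a hypothetical label for the kind of export being produced.
type ExportType string

const (
	ExportSnapshot ExportType = "snapshot" // assumed name
	ExportBatch    ExportType = "batch"    // assumed name
)

// ChunkPolicy gathers the chunking conditions in one place, so the export
// code asks a single question instead of embedding the branches itself.
type ChunkPolicy struct {
	MinSizeBytes int64               // chunk only exports at least this large
	Enabled      map[ExportType]bool // export types for which chunking is on
}

// ShouldChunk reports whether an export of the given type and size
// should be split into chunks under this policy.
func (p ChunkPolicy) ShouldChunk(t ExportType, sizeBytes int64) bool {
	return p.Enabled[t] && sizeBytes >= p.MinSizeBytes
}

func main() {
	policy := ChunkPolicy{
		MinSizeBytes: 1 << 30, // 1 GiB, an arbitrary example threshold
		Enabled:      map[ExportType]bool{ExportSnapshot: true},
	}
	fmt.Println(policy.ShouldChunk(ExportSnapshot, 5<<30)) // large snapshot
	fmt.Println(policy.ShouldChunk(ExportSnapshot, 1<<20)) // small snapshot
	fmt.Println(policy.ShouldChunk(ExportBatch, 5<<30))    // type not enabled
}
```

Keeping the decision behind one method would let the thresholds and enabled export types change without touching the export flow itself.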

JArguello-WMF changed the task status from In Progress to Open. Feb 26 2024, 2:34 PM