For the purpose of this POC, a minimal schema could be: (wiki_db string, page_id bigint, revision_id bigint, revision_deleted_parts array<string>). It might also be interesting to include an is_latest boolean, both to keep track of which revision is the latest for a page and to see how fast updates to that flag are in Iceberg at our volume.
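Expressed as Spark SQL DDL, that minimal schema might look like the sketch below. This is illustrative only: the table name is a placeholder, and it assumes a Spark session already configured with the Iceberg runtime and an Iceberg-capable catalog.

```python
from pyspark.sql import SparkSession

# Assumes Spark is configured with the Iceberg runtime and a catalog that
# supports Iceberg tables; both are deployment-specific.
spark = SparkSession.builder.appName("iceberg-dumps-poc").getOrCreate()

# The table name is a placeholder for wherever the POC database ends up living.
spark.sql("""
    CREATE TABLE IF NOT EXISTS milimetric.iceberg_minimal_poc (
        wiki_db                STRING,
        page_id                BIGINT,
        revision_id            BIGINT,
        revision_deleted_parts ARRAY<STRING>,
        is_latest              BOOLEAN
    )
    USING iceberg
""")
```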
- study https://iceberg.apache.org/docs/latest/configuration/#write-properties (a sketch of setting these per table follows the list)
- experiment with writing sample content from wmf.mediawiki_wikitext_history to an Iceberg table backed by Parquet files (see the sample-write sketch after this list). Here we have to optimize for:
  - as few Parquet files as possible
  - fast joins with the metadata table (see T323642: Spark Streaming Dumps POC: Backfill metadata table)
  - fast updates from Kafka streams of new revisions, page changes, and visibility changes
- using the results from above, write milimetric.iceberg_wikitext_history with everything available in wmf.mediawiki_wikitext_history (see the backfill sketch after this list). For now this table will mostly be used for performance testing, but the schema is the same.
- test inserting into milimetric.iceberg_wikitext_history from Spark streaming (see the streaming sketch after this list)
- document everything
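For the write-properties item, properties are set per table. The property names below come from the linked Iceberg docs; the values are only illustrative starting points for the experiment (536870912 bytes is 512 MiB, the Iceberg default target file size).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-write-props").getOrCreate()

# Tune Iceberg write behavior per table; values are starting points to
# experiment with, not recommendations.
spark.sql("""
    ALTER TABLE milimetric.iceberg_minimal_poc SET TBLPROPERTIES (
        'write.format.default'            = 'parquet',
        'write.parquet.compression-codec' = 'zstd',
        'write.target-file-size-bytes'    = '536870912',
        'write.distribution-mode'         = 'hash'
    )
""")
```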
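For the sample-write experiment, a minimal sketch: coalescing keeps the Parquet file count down, and sorting within partitions on the join keys should let Iceberg's file-level min/max stats prune files when joining against the metadata table. The snapshot value, wiki filter, and partition count are arbitrary, and the two fields that are not straight copies from the source are stubbed out.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("iceberg-sample-write").getOrCreate()

# Pull a small slice of one wiki; the snapshot value and wiki are arbitrary.
sample = (
    spark.table("wmf.mediawiki_wikitext_history")
    .where(F.col("snapshot") == "2022-10")
    .where(F.col("wiki_db") == "simplewiki")
    .select(
        "wiki_db",
        "page_id",
        "revision_id",
        # Deriving these two fields is part of the experiment itself, so the
        # sketch just stubs them out.
        F.lit(None).cast("array<string>").alias("revision_deleted_parts"),
        F.lit(None).cast("boolean").alias("is_latest"),
    )
)

# coalesce() keeps the Parquet file count low; sorting within partitions on
# the join keys lets file-level min/max stats do the pruning.
(
    sample
    .coalesce(4)
    .sortWithinPartitions("wiki_db", "page_id")
    .writeTo("milimetric.iceberg_minimal_poc")
    .append()
)
```

Iceberg's Spark SQL extensions also allow declaring a table-level sort order (ALTER TABLE ... WRITE ORDERED BY ...), which would be worth comparing against the explicit sortWithinPartitions above.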
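For the full milimetric.iceberg_wikitext_history table, a CTAS keeps the schema identical to the source by construction. Partitioning by wiki_db and the snapshot value are assumptions to validate, not recommendations.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-wikitext-backfill").getOrCreate()

# Copy one full snapshot; '2022-10' is a placeholder. CTAS keeps the Iceberg
# table's schema identical to the source table's.
spark.sql("""
    CREATE TABLE IF NOT EXISTS milimetric.iceberg_wikitext_history
    USING iceberg
    PARTITIONED BY (wiki_db)
    AS SELECT * FROM wmf.mediawiki_wikitext_history
    WHERE snapshot = '2022-10'
""")
```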
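For the streaming test, Iceberg supports Spark Structured Streaming append sinks. The sketch below writes to the minimal POC table so the column mapping stays short; the same pattern would apply to milimetric.iceberg_wikitext_history once all fields are mapped. The broker, topic, event field names, and checkpoint path are all placeholders/assumptions about the revision-create stream.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("iceberg-streaming-insert").getOrCreate()

# Assumed shape of the revision-create event payload; these field names are a
# guess at the relevant subset, not the full event schema.
event_schema = StructType([
    StructField("database", StringType()),
    StructField("page_id", LongType()),
    StructField("rev_id", LongType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-jumbo1001.eqiad.wmnet:9092")  # placeholder broker
    .option("subscribe", "eqiad.mediawiki.revision-create")  # placeholder topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select(
        F.col("e.database").alias("wiki_db"),
        F.col("e.page_id").alias("page_id"),
        F.col("e.rev_id").alias("revision_id"),
        F.lit(None).cast("array<string>").alias("revision_deleted_parts"),
        F.lit(True).alias("is_latest"),  # a brand-new revision starts out as latest
    )
)

# Streaming append into the Iceberg table; the checkpoint path is a
# placeholder and would need to be durable in a real run.
query = (
    events.writeStream
    .format("iceberg")
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .option("checkpointLocation", "/tmp/checkpoints/iceberg_wikitext_poc")
    .toTable("milimetric.iceberg_minimal_poc")
)
query.awaitTermination()
```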