For the purpose of this POC, the minimal schema can be: (wiki_db string, page_id bigint, revision_id bigint, revision_text string)
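A minimal sketch of this schema as Iceberg DDL (the database/table names and the partition spec are assumptions for illustration, not decisions):

```sql
-- Sketch only: names and the partition spec are placeholders.
CREATE TABLE some_test_db.ice_wikitext (
  wiki_db       STRING,
  page_id       BIGINT,
  revision_id   BIGINT,
  revision_text STRING
)
USING iceberg
-- one option for fast (wiki_db, page_id) lookups; the bucket count needs tuning:
PARTITIONED BY (wiki_db, bucket(64, page_id));
```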
- study https://iceberg.apache.org/docs/latest/configuration/#write-properties
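Among the write properties documented there, these look most relevant to this POC (the values are starting points to experiment with, not recommendations):

```sql
-- Sketch: ALTER an existing table; these could also go in CREATE TABLE ... TBLPROPERTIES.
ALTER TABLE some_test_db.ice_wikitext SET TBLPROPERTIES (
  'write.format.default'         = 'avro',       -- back data files with Avro instead of Parquet
  'write.target-file-size-bytes' = '1073741824', -- 1 GiB target => fewer, larger files
  'write.avro.compression-codec' = 'gzip'        -- the default; worth benchmarking alternatives
);
```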
- experiment with writing sample content from wmf.mediawiki_wikitext_history to an Iceberg table backed by Avro files. The Iceberg table will eventually be configured to optimize for:
- as few Avro files as possible
- fast joins with the metadata table (see T323642: Spark Streaming Dumps POC: Backfill metadata table)
- fast lookup of content for a given (wiki_db, page_id)
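To reason about the "as few Avro files as possible" goal, a rough file-count estimate from the table size and the `write.target-file-size-bytes` setting can guide tuning. `estimate_file_count` below is a hypothetical helper, and it assumes the writer packs files up to the target size:

```python
import math

def estimate_file_count(total_bytes: int,
                        target_file_size_bytes: int = 512 * 1024 * 1024) -> int:
    """Rough lower bound on the number of Iceberg data files, assuming
    writes are packed up to write.target-file-size-bytes (512 MiB default)."""
    return max(1, math.ceil(total_bytes / target_file_size_bytes))

# e.g. ~5 TiB of wikitext at the default 512 MiB target:
print(estimate_file_count(5 * 1024**4))  # 10240 files
```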
- with the results from above, populate <some_test_db>.ice_wikitext with everything available in wmf.mediawiki_wikitext_history. This table will mostly be used for performance testing for now, so it doesn't need more than the minimal schema above.
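For that performance testing, the main query pattern to benchmark is presumably the point lookup; a sketch, where the table name and literals are arbitrary:

```sql
-- Sketch of the lookup to benchmark; with a partition spec like
-- (wiki_db, bucket(N, page_id)) this should prune down to a handful of files.
SELECT revision_id, revision_text
FROM some_test_db.ice_wikitext
WHERE wiki_db = 'enwiki'
  AND page_id = 12345;
```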
- run the job to populate a semi-final table in the wmf database
- document the job parameters and explain in detail how the Spark job was tuned
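As a placeholder for that documentation, these are the kinds of parameters worth recording (the launcher, the script name, and every value below are assumptions to be replaced by the actual tuned configuration):

```shell
# Sketch only: hypothetical job script and guessed values.
spark3-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 8G \
  --executor-cores 4 \
  --conf spark.dynamicAllocation.maxExecutors=64 \
  --conf spark.sql.shuffle.partitions=1024 \
  ice_wikitext_backfill.py
```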
NOTE: while investigating this, one interesting question is: should we maintain another table with (wiki_db, page_id, cached_xml_dumps_output) that is updated only when a page sees new revisions?
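If such a cache table existed, Iceberg's row-level MERGE INTO support in Spark SQL would be a natural way to refresh only the changed pages. A sketch, where the target table and the source view are hypothetical:

```sql
-- Sketch: refresh cached output only for pages that saw new revisions.
MERGE INTO some_test_db.ice_wikitext_cache t
USING updated_pages s  -- hypothetical view of (wiki_db, page_id, cached_xml_dumps_output)
ON  t.wiki_db = s.wiki_db
AND t.page_id = s.page_id
WHEN MATCHED THEN UPDATE SET t.cached_xml_dumps_output = s.cached_xml_dumps_output
WHEN NOT MATCHED THEN INSERT *;
```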