Spark Streaming Dumps POC: Backfill content table
Open, Needs Triage, Public

Description

For the purpose of this POC, the minimal schema can be: (wiki_db string, page_id bigint, revision_id bigint, revision_text string)
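
As a rough sketch, that minimal schema could be expressed as an Iceberg table like the one below (a PySpark sketch; the table name reuses the <some_test_db>.ice_wikitext placeholder from this task, and `spark` is assumed to be a SparkSession with an Iceberg catalog already configured):

```lang=python
# Sketch: create an Iceberg table with the minimal POC schema.
# "some_test_db" stands in for whatever test database ends up being used.
spark.sql("""
    CREATE TABLE IF NOT EXISTS some_test_db.ice_wikitext (
        wiki_db       STRING,
        page_id       BIGINT,
        revision_id   BIGINT,
        revision_text STRING
    )
    USING iceberg
""")
```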

  • study https://iceberg.apache.org/docs/latest/configuration/#write-properties
  • experiment with writing sample content from wmf.mediawiki_wikitext_history to an Iceberg table backed by Avro files (see the table-configuration sketch below). The Iceberg table will eventually be configured to optimize for:
    1. as few Avro files as possible
    2. fast joins with the metadata table (see T323642: Spark Streaming Dumps POC: Backfill metadata table)
    3. fast lookup of content for a given (wiki_db, page_id)
  • with the results from above, write <some_test_db>.ice_wikitext with everything available in wmf.mediawiki_wikitext_history (see the backfill sketch below). This will mostly be used for performance testing right now, so it doesn't need more than the minimal schema above.
  • run the job to fill a semi-final table in the wmf db
  • document the job parameters and explain in detail how the Spark job was tuned
NOTE: while investigating this, one interesting question is: should we have another table with (wiki_db, page_id, cached_xml_dumps_output) that would be updated only when a page sees new revisions?
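
For the write-properties experiment, one possible starting configuration is sketched below. The property names come from the Iceberg write-properties docs linked above; the concrete values and the sort order are assumptions to be tested, not recommendations, and the WRITE ORDERED BY statement requires the Iceberg Spark SQL extensions:

```lang=python
# Sketch: point the table at Avro data files and bias the layout toward
# few, large files and clustered (wiki_db, page_id) content.
# Values are starting points for the performance tests, not final.
spark.sql("""
    ALTER TABLE some_test_db.ice_wikitext SET TBLPROPERTIES (
        'write.format.default'         = 'avro',
        'write.target-file-size-bytes' = '1073741824'
    )
""")

# Sort data files on write so that content for a given (wiki_db, page_id)
# ends up clustered together (needs IcebergSparkSessionExtensions).
spark.sql("""
    ALTER TABLE some_test_db.ice_wikitext
    WRITE ORDERED BY wiki_db, page_id
""")
```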
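
And a sketch of the backfill itself, assuming the relevant source columns in wmf.mediawiki_wikitext_history are named wiki_db, page_id, revision_id and revision_text, and that a snapshot partition has to be selected (the snapshot value below is a placeholder):

```lang=python
# Sketch: backfill the Iceberg table from one snapshot of
# wmf.mediawiki_wikitext_history. Source column/partition names and the
# snapshot value should be checked against the actual table.
snapshot = "2022-11"

(
    spark.table("wmf.mediawiki_wikitext_history")
    .where(f"snapshot = '{snapshot}'")
    .selectExpr("wiki_db", "page_id", "revision_id", "revision_text")
    .writeTo("some_test_db.ice_wikitext")
    .append()
)
```

Depending on the Spark/Iceberg versions in use, the table-level sort order may not be applied automatically on append, so an explicit repartition/sortWithinPartitions on (wiki_db, page_id) might be needed; that is exactly the kind of thing the tuning documentation step should capture.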
