Design technical architecture for section topics pipeline
Closed, Resolved (Public)

Description

This ticket tracks the design of the technical architecture for the section topics data pipeline.

See the Miro board at https://miro.com/app/board/uXjVOpiLkTQ=/

Event Timeline

mfossati renamed this task from "Design techncial architecture for section topics pipeline" to "Design technical architecture for section topics pipeline". (Jun 30 2022, 4:51 PM)

@mfossati and @lbowmaker, what's left to do before we can consider this resolved?

I think we need to decide two things:

  • whether to compute diffs
  • whether to write the output dataset to Parquet or Hive

We agreed on the following decisions, based on the Image-Suggestions experience:

  • don't compute diffs: in Image-Suggestions, this was one of the most time-consuming and error-prone tasks
  • write the output Spark DataFrame to Parquet, which is straightforward and effective compared to the tweaks that Hive requires
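
For illustration only, the two decisions could look roughly like the sketch below: each run writes the complete dataset as Parquet and never computes a diff against an earlier run. The function names, the `snapshot=` directory layout, and the example paths are assumptions made for this sketch and are not taken from the actual pipeline. The only Spark calls used are `DataFrame.write`, `DataFrameWriter.mode`, and `DataFrameWriter.parquet`.

```python
def snapshot_path(base_path: str, snapshot: str) -> str:
    """Build the output directory for one full snapshot.

    The `snapshot=<date>` layout is a hypothetical convention for this sketch.
    """
    return f"{base_path.rstrip('/')}/snapshot={snapshot}"


def write_snapshot(df, base_path: str, snapshot: str) -> str:
    """Write the whole DataFrame as Parquet and return the output path.

    `df` is expected to be a pyspark.sql.DataFrame, or any object that
    exposes the same `write.mode(...).parquet(...)` chain.
    """
    path = snapshot_path(base_path, snapshot)
    # No diff step: the complete dataset is written on every run.
    # "overwrite" replaces the output of an earlier run of the same snapshot.
    df.write.mode("overwrite").parquet(path)
    return path
```

With a real Spark session, a call such as `write_snapshot(df, "/tmp/section_topics", "2022-06-27")` would write to `/tmp/section_topics/snapshot=2022-06-27`; both the base path and the date here are placeholders.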

This is reflected in the Miro board.
Closing this ticket.