
Consider renaming event and event_sanitized Hive databases
Open, Medium, Public

Description

At this time, the event database holds unsanitized data for up to 90 days, plus long-retained data such as mediawiki events, which are public.

The event_sanitized database holds data that has been sanitized.

Options:

  • Main events database has all longterm data, including sanitized stuff, and event_fresh* or event_private (TBD) database has recent unsanitized data. This would encourage folks to use the main events database by default, and only go to the unsanitized one when necessary.
  • Have the main events database hold data that is less than 90 days old, and an event_longterm database hold sanitized data as well as data from public events such as mediawiki's.

Either way, we should make the retention settings consistent in these different databases.

Perhaps in a new Iceberg world future, we will not need two different databases?

Event Timeline

We need to

  • Rename event_sanitized to event_archive (or whatever)
  • Make sure all tables in event db are refined into event_archive (possibly via the sanitization refine job)
  • Make sure all Hive partitions in event db are deleted after 90 days
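
The 90-day deletion step could be sketched roughly as follows. This is only an illustration, not the existing purge job: it assumes daily partitions laid out as year/month/day, the table name `some_table` is hypothetical, and the helper names are made up for this example. It only builds HiveQL `ALTER TABLE ... DROP PARTITION` statements and does not run them.

```python
from datetime import date, timedelta

# From the task: partitions in the event db should not outlive 90 days.
RETENTION_DAYS = 90


def expired_partitions(partitions, today, retention_days=RETENTION_DAYS):
    """Return the partition dates strictly older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(p for p in partitions if p < cutoff)


def drop_statement(database, table, day):
    """Build a HiveQL statement dropping one daily partition.

    Assumes a year/month/day partition layout (an assumption of this sketch).
    """
    return (
        f"ALTER TABLE {database}.{table} DROP IF EXISTS "
        f"PARTITION (year={day.year}, month={day.month}, day={day.day})"
    )


if __name__ == "__main__":
    # Hypothetical partitions for a hypothetical table in the event db.
    known = [date(2023, 9, 1), date(2023, 9, 20), date(2023, 12, 1)]
    for day in expired_partitions(known, today=date(2023, 12, 14)):
        print(drop_statement("event", "some_table", day))
```

Keeping the cutoff computation separate from statement generation would let the same retention value be checked against both databases, which is the consistency the task description asks for.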
Ottomata lowered the priority of this task from High to Medium.
Ottomata raised the priority of this task from Medium to High.
Ottomata moved this task from Incoming to Data Quality on the Analytics board.
mforns lowered the priority of this task from High to Medium. Nov 21 2019, 6:17 PM
Ottomata renamed this task from "Rename event_sanitized to event_longterm" to "Consider renaming event and event_sanitized Hive databases". Dec 14 2023, 2:26 PM
Ottomata updated the task description.