
Create a guideline for where to store data for official tables in HDFS
Open, Needs Triage, Public

Description

It has recently been agreed [WMF internal link] that the wmf_* Data Lake databases will hold not only Data Platform Engineering–owned tables, but also "production-grade" tables owned by other teams.

However, this raises the question of where the data for those tables should be stored in HDFS. So far, within /wmf/data, some data has been placed in directories corresponding to the database (e.g. /wmf/data/wmf_readership) and some in directories corresponding to the owning team (e.g. /wmf/data/research). There is no explicit guideline.

Beyond the question of organization, permissions are an obstacle. Currently, most of these directories are writable only by the analytics system user, so if it is agreed that non-DPE-owned data belongs here, we need to decide how non-DPE users will get their tables created.

In my opinion, the simple and sensible answer is:

  1. Data for "production-grade" tables belongs in the directory /wmf/data/{{database}}.
  2. If you need a table created but don't have permissions, ask someone from DPE to do it (this will not happen frequently, so the burden should be minimal).
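
To make point 1 concrete, here is a minimal Python sketch of how a location could be derived under that convention. The helper name `table_location`, the example table name, and the per-table subdirectory below /wmf/data/{{database}} are assumptions for illustration only; the proposal itself fixes only the database-level directory.

```python
import posixpath

# Base directory taken from the proposal above.
BASE_DIR = "/wmf/data"


def table_location(database: str, table: str) -> str:
    """Return the HDFS directory for a table under the database-based layout.

    Assumption: each table gets its own subdirectory named after the table.
    The proposal only specifies /wmf/data/{{database}}.
    """
    for part in (database, table):
        # Reject empty names and names that would escape the intended directory.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"invalid name: {part!r}")
    return posixpath.join(BASE_DIR, database, table)


# "example_table" is a hypothetical table name.
print(table_location("wmf_readership", "example_table"))
```

Under this layout the location depends only on the database and table names, so it does not change when ownership of the table moves between teams.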

An alternative (pointed out by @fkaelin) is to place data in directories corresponding to the owning team (while maintaining the team-agnostic organization of tables). This would make it very easy to set appropriate permissions. This system would go out of date eventually as team structures and ownership change, but it would be possible (if somewhat complex) to move the underlying data accordingly with minimal impact on users.
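
As a rough illustration of the trade-off in the team-based alternative, the Python sketch below derives locations from a table-to-team mapping and lists which directories would have to move after an ownership change. Everything here is hypothetical: the mapping, the table and team names other than `research`, the helper names, and the assumed /wmf/data/{team}/{table} layout are not part of any existing tooling.

```python
import posixpath

BASE_DIR = "/wmf/data"


def team_location(qualified_table: str, owners: dict[str, str]) -> str:
    """Return the HDFS directory for a table under the team-based layout.

    Assumption: the layout is /wmf/data/{team}/{table}, with the team looked
    up from a hypothetical ownership mapping keyed by "database.table".
    """
    _database, table = qualified_table.split(".", 1)
    return posixpath.join(BASE_DIR, owners[qualified_table], table)


def relocations(old: dict[str, str], new: dict[str, str]) -> list[tuple[str, str, str]]:
    """List (table, old_dir, new_dir) for tables whose owning team changed."""
    moves = []
    for name in sorted(old.keys() & new.keys()):
        if old[name] != new[name]:
            moves.append((name, team_location(name, old), team_location(name, new)))
    return moves


# Hypothetical ownership before and after a reorganization.
before = {"wmf_readership.example_table": "research"}
after = {"wmf_readership.example_table": "example_team"}

for table, src, dst in relocations(before, after):
    # The table name users query stays the same; only the data directory moves.
    print(f"{table}: {src} -> {dst}")
```

The table's database-qualified name is unchanged by the move, which is why the impact on users could stay small even though the underlying data has to be relocated.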