Reposting from original email outreach:
My team is seeking guidance on tips and best practices for structuring data in Hive for (1) longer-term reference storage, (2) computation with other datasets, and (3) Superset dashboarding.
Some context: Our team has been working on a project [1] that triangulates metrics across multiple Wikimedia Movement domains (i.e., Editors, Readers, Volunteer Program Leaders, Grantees, and Affiliates) as well as global indicators on freedom, access, and population statistics (i.e., key demographics and equity, diversity & inclusion indices) to understand in which domains, and to what extent, we are improving at diversity, inclusion, and equity in our movement ecosystem. To do this, we create an annual data reference combining external data, product data, and affiliates, grants, and key look-up data references, and triangulate the input measures to calculate output domain metrics by country.
The task at hand: When it comes to structuring the dataset(s) [2] for computation and storage, it is unclear to me what guidance exists and/or what reasons there may be for creating smaller, more focused datasets versus a single comprehensive combined dataset. For instance, if different indicators, metrics, or data sources are combined in a single dataset, it becomes harder to document the specific caveats of each one; and if some input metrics are not public, or their combination could create some sort of identifying risk, it may be useful to keep them separate for legal/privacy review.
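To make the trade-off concrete, here is a minimal sketch of the "smaller focused datasets" option, written in Python/pandas rather than HiveQL, with all table names, columns, caveat text, and values invented for illustration: each input source keeps its own table and its own documented caveats, and the combined per-country view is derived at computation time rather than stored as one wide table.

```python
# Hypothetical sketch: per-source tables (DataFrames standing in for
# Hive tables) with caveats attached per source, joined on demand.
import pandas as pd

# Focused dataset 1: editor metrics, with its caveats documented alongside it.
editors = pd.DataFrame({"country": ["AA", "BB"], "active_editors": [120, 45]})
editors.attrs["caveats"] = "Counts exclude bot accounts; snapshot-based."

# Focused dataset 2: an external freedom index, with its own caveats.
freedom = pd.DataFrame({"country": ["AA", "BB"], "freedom_index": [0.8, 0.6]})
freedom.attrs["caveats"] = "Third-party source; updated annually."

# Derived combined view for computation/dashboarding, built when needed
# instead of being persisted as a single comprehensive dataset.
combined = editors.merge(freedom, on="country", how="inner")
combined["editors_per_freedom_point"] = (
    combined["active_editors"] / combined["freedom_index"]
)
print(combined[["country", "editors_per_freedom_point"]])
```

The design choice this illustrates: caveats and review scope stay attached to each focused source, and only the derived output metric needs to be exposed downstream (e.g., to a Superset dashboard).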
The ask: Do you know of any resources that could guide us on this, or might you have any guidance to share?