
Data structuring guidance request
Closed, ResolvedPublic

Description

Reposting from original email outreach:

My team is seeking guidance on best practices for structuring data in Hive for (1) longer-term data reference storage, (2) use in computation with other datasets, and (3) use in Superset dashboarding.

Some context: Our team has been working on a project [1] that triangulates metrics across multiple Wikimedia Movement domains (i.e., Editors, Readers, Volunteer Program Leaders, Grantees, and Affiliates) together with global indicators on freedom, access, and population statistics (i.e., key demographics and equity, diversity & inclusion indices) to understand in which domains, and to what extent, we are improving at diversity, inclusion, and equity in our movement ecosystem. To do this, we create an annual data reference of external data, product data, and affiliate, grant, and key look-up data references, and triangulate the input measures to calculate output domain metrics by country.

The task at hand: When it comes to structuring the dataset(s) [2] for computation and storage, it is unclear to me what guidance exists, or what reasons there may be, for creating smaller, more focused datasets versus a single comprehensive combined dataset. For instance, if different indicators, metrics, or data sources are combined in a single dataset, it becomes harder to document the specific caveats of each one; conversely, if some input metrics are not public, or their combination could create some kind of identifying risk, it may be useful to keep them separate for legal/privacy review.

The ask: Do you know of any existing resources that could guide us on this, or might you have any guidance to share?

[1] https://docs.google.com/presentation/d/1oixKgsEOnkf6UFonMrEZ5fqL7KEvfswuEBjaUQGN_UA/edit#slide=id.g92e7fc863e_2_53

[2] https://docs.google.com/spreadsheets/d/1X-lIH1hJSyNJPsPfG07Rf9rVbCwbjlVBYrdPyPjTTvw/edit#gid=1536271900

Event Timeline

From a purely structural perspective: if the datasets have differing schemas (columns, in your spreadsheet's case), then you'll likely want them to be different datasets (Hive tables) for sure. If they have the same or mostly similar schemas, you could put them all in the same table.

We often use the same table with time-based Hive partition columns. For most of our data, which arrives very frequently, we partition by hour. It sounds like your reports are annual? In that case, you could have a single table partitioned by year. The year column would then be a special column that Hive uses to organize its data files. I.e., your files in HDFS would end up organized in the table directory under hierarchical partition-column directories, e.g. yourtable/year=2021/.... You'd then be able to use SQL to select data pertaining to a single year, or you could still sum over years.
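For illustration, here's a minimal sketch of what that could look like, assuming PySpark with Hive support on the cluster; the gdi database and the table/column names are hypothetical placeholders:

```
# A minimal sketch, assuming PySpark with Hive support; the gdi database
# and all table/column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# One table for all years, partitioned by year. Hive lays the files out
# as hierarchical partition directories, e.g. .../gdi.db/metrics/year=2021/...
spark.sql("""
    CREATE TABLE IF NOT EXISTS gdi.metrics (
        country_code STRING,
        metric       STRING,
        value        DOUBLE
    )
    PARTITIONED BY (year INT)
    STORED AS PARQUET
""")

# Select a single year's data (Hive prunes to that partition directory)...
single_year = spark.sql("SELECT * FROM gdi.metrics WHERE year = 2021")

# ...or still aggregate across all years in the same table.
across_years = spark.sql("""
    SELECT country_code, metric, SUM(value) AS total
    FROM gdi.metrics
    GROUP BY country_code, metric
""")
```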

There might be other reasons to structure your data differently though; I'll let others comment.

odimitrijevic moved this task from Incoming to Datasets on the Analytics board.

There are broader ongoing conversations about an approach to bringing this data into the data lake.

odimitrijevic moved this task from Incoming (new tickets) to Datasets on the Data-Engineering board.
odimitrijevic added a subscriber: EChetty.

@JAnstee_WMF can you please set up a meeting with @mforns and @EChetty to discuss, and tag this ticket in the agenda? Ideally, any documentation/notes that come out of this would be made available to others too.

Hi @Ottomata, thanks for the clarification. Sorry it took me so long to comment. You are right that our reports are annual, and I also think we should have a single table partitioned by year.

As for the schemas, as @JAnstee_WMF elaborated above, we have metrics for countries, sub-continents, and continents which were computed from, for instance, Editors and Pageviews data that already exists in Hive. Do you think it will be necessary to create separate tables for these sets of metrics when their raw data already exists in Hive? Wouldn't that create too much multiplicity of similar names?

Also, we have external metrics computed from data sourced from external public sources. About 60% of the external metrics are split roughly equally between the Human Development Report (HDR) and the World Bank (WB), while the other 40% are spread across various other sources.

So, let's say we decide to create two tables: one for HDR-related metrics and the other for WB-related metrics. How do you suggest we handle the remaining 40% of the external metrics that are spread almost equally among the many other sites? I suppose it won't be efficient to create several one- or two-column tables all over the place, right?

Hello!

we have metrics for countries, sub-continents, and continents which were computed from, for instance, Editors and Pageviews data that already exists in Hive. Do you think it will be necessary to create separate tables for these sets of metrics when their raw data already exists in Hive? Wouldn't that create too much multiplicity of similar names?

Similar names are not a problem. Depending on the queries and the data being accessed, if you are deriving/aggregating metrics, it will probably be better to store the results in a separate table, so you don't have to repeat the queries on the raw data each time, especially if you are looking at a year's worth of data all at once!
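To sketch what that could look like, assuming PySpark with Hive support; gdi.editor_metrics (the derived table) and raw_db.editor_events (the raw source) are hypothetical names standing in for your actual tables:

```
# A rough sketch, assuming PySpark with Hive support, and assuming
# gdi.editor_metrics already exists and is partitioned by year;
# gdi.editor_metrics and raw_db.editor_events are hypothetical names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Run the expensive aggregation over the raw data once and store the
# result; dashboards and downstream queries then read the small derived
# table instead of re-scanning a year's worth of raw data.
spark.sql("""
    INSERT OVERWRITE TABLE gdi.editor_metrics PARTITION (year = 2021)
    SELECT country_code,
           COUNT(DISTINCT user_id) AS active_editors
    FROM raw_db.editor_events
    WHERE year = 2021
    GROUP BY country_code
""")
```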

So, let's say we decide to create two tables: one for HDR-related metrics and the other for WB-related metrics. How do you suggest we handle the remaining 40% of the external metrics that are spread almost equally among the many other sites? I suppose it won't be efficient to create several one- or two-column tables all over the place, right?

What do these metric tables look like? Are the schemas similar? Do you plan to query them together or sum/group over them? Are these imported raw from external sources, or do you mean some kind of normalized table where the columns are consistent between the external datasets?

I don't think I have enough context to make a recommendation (I betcha @mforns could do better than me anyway :p).

Thanks for your prompt response, @Ottomata. @mforns, please also assist with your expert input.

What do these metric tables look like? Are the schemas similar?

Sorry to bother you. Here is a link to a folder containing samples of the Internal, External, and Output metrics CSVs, if you have some time to look at it.

There are functions already in place for extracting the raw internal and external data, for computing the Internal and External metrics, and for computing the Output metrics from the computed Internal and External metrics.

The plan is to use these functions in an Airflow ETL job that will be scheduled to update these tables annually.
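For reference, a minimal sketch of an annually scheduled DAG, assuming Airflow 2.x; the DAG id, start date, and the callable are hypothetical placeholders for the existing functions:

```
# A minimal sketch, assuming Airflow 2.x; the DAG id, start date, and
# the callable are hypothetical placeholders for the existing functions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def update_metric_tables():
    # Call the existing functions here: extract the raw internal and
    # external data, compute the Internal and External metrics, then
    # compute the Output metrics from them.
    ...

with DAG(
    dag_id="gdi_annual_metrics",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@yearly",  # one run per year
    catchup=False,
) as dag:
    PythonOperator(
        task_id="update_metric_tables",
        python_callable=update_metric_tables,
    )
```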

Do you plan to query them together or sum/group over them?

The code/query for computing the Output metrics currently treats each of the Internal and External metrics tables as a single table. That said, editing the code to pull from however many tables the Internal and External metrics end up being broken into will not be a problem.

Are these imported raw from external sources, or do you mean some kind of normalized table where the columns are consistent between the external datasets?

The code for the external data pulls data from various sites, then cleans, edits, and transforms it before consolidating everything into a single table. So there is no single external dataset to check for consistency against the consolidated external table we arrive at.

What do you think?

Just a quick thought without too much more investigation on my part: If the metrics are distinct, then use different tables. If they are the same but with different dimensional values (but very similar schemas, e.g. it's ok if some columns are null for some combination of dimensions), then use the same table.
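One concrete (hypothetical) shape for that second case: a single "long" external-metrics table where the source is just another column, so HDR, WB, and the smaller providers all share one schema instead of each getting a one- or two-column table:

```
# A hypothetical illustration of the "same table, similar schemas" case:
# a long-format table where the source is a column, so every external
# provider shares one schema. All names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS gdi.external_metrics (
        country_code STRING,
        source       STRING,   -- 'HDR', 'WB', or a smaller provider
        metric       STRING,   -- metric name, e.g. 'hdi'
        value        DOUBLE
    )
    PARTITIONED BY (year INT)
    STORED AS PARQUET
""")

# Per-source slices are then just filters:
wb_metrics = spark.sql("""
    SELECT country_code, metric, value
    FROM gdi.external_metrics
    WHERE source = 'WB' AND year = 2021
""")
```

With this shape, the 40% of external metrics scattered across smaller sources don't need tables of their own; each row just carries its source as a dimension.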

And actually, I betcha the Product Analytics team would have even better advice. Maybe @nshahquinn-wmf or @nettrom_WMF?

Here are the meeting notes from our sync yesterday.

Action items

  • GDI will work to split the external data frame so that absolute-count data is partitioned separately from scale data, for storage within the canonical datasets in Hive
  • GDI will share code with Marcel so he can explore the feasibility of querying HDFS more directly for editor and reader metrics, rather than using Druid tables
  • GDI will continue the conversation about documentation and storage locations for the new internal data tables, as well as data documentation and planning for shared data use

Removing inactive assignee (please do so as part of team offboarding!).