Page MenuHomePhabricator

[SPIKE] Experiment with flattening custom data object
Closed, ResolvedPublic



There have been issues raised with having a schema-less nested custom data object in the Metrics Platform monoschema. To see if we can lower the complexity on the consumer side, we can try a few different options to flatten the schema to simplify querying the data.

User story

As a data scientist/product analyst, I want to see if it's easier to query data from a flattened schema as opposed to a schema with nested json.

As an engineer, I'd like to know where the complexity gets shifted to if the Metrics Platform monoschema is adapted to remove the nested custom data object and flattened custom data properties are provided instead.

Developer notes

In order to test the hypothesis that querying is easier with a flattened schema, we can try sending mock data through the pipelines using the following approaches:

  1. Create a table wherein the custom data object is transformed into a given set of custom data fields i.e. for 3 custom data points, there will be 9 corresponding columns:
    • custom_data1_name, custom_data1_value, custom_data1_type
    • custom_data2_name, custom_data2_value, custom_data2_type
    • custom_data3_name, custom_data3_value, custom_data3_type
    • If client code only sends 1 custom data point, the other 2 sets of custom data columns will be null.
  2. Create a table wherein the custom data object is transformed into a singular column for each name, value, type so that it scales vertically i.e.
    • custom_data_name, custom_data_value, custom_data_type
    • Monoschema core contextual attributes are matched to each row such that:
      • If an event sends 2 custom data points, there will be 2 rows in the table that are linked to the same event id.
      • If an event sends 3 custom data points, there will be 3 rows in the table that are linked to the same event id.
Acceptance criteria
  • Test tables are created per approaches above and populated with sufficient data. << see T340702#9014296
  • Queries are run/captured for each approach. << see Jupyter notebook attached in T340702#9014296
  • Custom data decision matrix elucidates pros/cons/risks of each approach << see spreadsheet link in T340702#9014296


Other Assignee

Event Timeline

To experiment with the different approaches to flatten the custom data object in the Metrics Platform monoschema, I generated mock data in Hive that can be queried by product analysts. The mock data in this case is a clone of the production table event.mediawiki_visual_editor_feature_use which is a current Metrics Platform stream that is collecting events from VE.

Custom data artifacts

The first approach in the ticket description (horizontal solution for including a limited TBD number of custom data items as columns in the events table) can be tested by querying cjming.mp_vefu_core_custom_data_2 (number 1 enumerated below). Note that the ticket description indicated 3 pieces of custom data - I settled on 2 pieces because of the complexity of the queries as a pyspark noob which will suffice for the needs of this experiment.

The second approach in the ticket description (vertical scaling solution for including N number of custom data items with an event) can be tested by querying the join of 2 tables cjming.mp_vefu_copy_with_id and cjming.mp_vefu_custom_data_vertical (numbers 2 and 3 enumerated below).

There are 3 tables in my eponymous database cjming in Hive:

  1. hdfs://analytics-hadoop/user/hive/warehouse/cjming.db/mp_vefu_core_custom_data_2
    • cjming.mp_vefu_core_custom_data_2 is a single table that contains the original columns from event.mediawiki_visual_editor_feature_use joined with a limit of 2 items of custom data per event. This means that if an event submitted 4 custom data items, 2 will be dropped and 2 will populate the 6 columns in the joined table. If an event sends 1 item of custom data, that 1 will populate 3 of the custom_data columns and the other 3 will be null:
      • custom_data_name1
      • custom_data_type1
      • custom_data_value1
      • custom_data_name2
      • custom_data_type2
      • custom_data_value2
    • This table can be queried as is - a predetermined 2 items of custom data are populated in this joined table.
  2. hdfs://analytics-hadoop/user/hive/warehouse/cjming.db/mp_vefu_copy_with_id
    • cjming.mp_vefu_copy_with_id is a copy of event.mediawiki_visual_editor_feature_use with an id inserted that represents the event id in this context.
    • This table contains the original custom data object blob that was submitted with the event.
  3. hdfs://analytics-hadoop/user/hive/warehouse/cjming.db/mp_vefu_custom_data_vertical
    • cjming.mp_vefu_custom_data_vertical is a vertical scaling table of custom data items associated with events by the aforementioned "event id".
    • Each custom data item has 3 fields:
      • custom_data_name
      • custom_data_type
      • custom_data_value
    • A single event can have N items of custom data which are inserted into cjming.mp_vefu_custom_data_vertical N times. In other words, an event with 3 custom data items will have 3 rows in cjming.mp_vefu_custom_data_vertical linked to the original event in cjming.mp_vefu_copy_with_id by the same event id.
    • These 2 tables cjming.mp_vefu_copy_with_id and cjming.mp_vefu_custom_data_vertical will have to be joined to link all custom data items to their respective events.

I am also including the Jupyter notebook to build these dataframes in case it can be of use:

Size / performance

A note about the size of these tables. The event.mediawiki_visual_editor_feature_use (~3500 rows) table was chosen as a representative of a more common use case of the Metrics Platform. The entirety of this production table was copied into the test tables. When checking the size of cjming.mp_vefu_core_custom_data_2, there are 257 parquets totaling 6MB. The other two tables clock in at ~0.5 MB each. This experiment will not be able to test the performant aspect of querying but hopefully reveal the ease or difficulty of writing the queries themselves, as well as what may be required of Data Engineering to do any transforms or joining.

If performance testing is needed on GB of data, the Jupyter notebook can be easily adjusted to query from a larger production table (i.e. event.mediawiki_edit_attempt which contains ~50 million rows) and new tables can be written to from newly derived dataframes (all the queries aside from the initial one would be the same and names of dataframes/tables would need to be adjusted).


If we allow the number of custom data items to be non-deterministic, we will have to solve the problem of how to join an unknown number of custom data items to their corresponding events. As illustrated by the 2 tables needed to accommodate for unknown items of custom data per event, the complexity increases in querying and analysis. This approach provides great flexibility for the producer of events at the cost of greater complexity for the consumer.

By limiting the number of custom data items submitted with an event, this greatly lowers the complexity for the consumer at the cost of losing flexibility for the producer. Producers would then need to submit multiple events to capture more data and there would likely be a need to process them as a single event.

Here is a custom data decision matrix for the options going forward:

Next steps

A product analyst can try querying for custom data in the above tables.

cc @VirginiaPoundstone @WDoranWMF

nettrom_WMF added a project: Product-Analytics.
nettrom_WMF added a subscriber: nettrom_WMF.

Added the Product Analytics tag, and myself as secondary assignee on this task. I've started looking at this and will continue that next week.

hi @nettrom_WMF -- I re-ran the notebook with data from the production event.mediawiki_edit_attempt table from a random day in July (7/26/23).

Here is the Jupyter Notebook for the new dataframes:

And here are the names of the new tables (same notes apply from T340702#9014296):

  1. hdfs://analytics-hadoop/user/hive/warehouse/cjming.db/mp_ea_core_custom_data_2
    • This one is the combined table limited to 2 pieces of custom data (rest are dropped - fwiw, editattemptstep instrument collects quite a bit of custom data points -- like up to 8-10 per event).
    • This table contains 5,850,061 rows
    • Size: 1.3 GB, 257 parquets
  2. hdfs://analytics-hadoop/user/hive/warehouse/cjming.db/mp_ea_copy_with_id
    • This table must be joined with the table below to get all custom data points submitted with a given editattemptstep event.
    • This table contains 5,850,061 rows
    • Size: ~1 GB, 59 parquets
  3. hdfs://analytics-hadoop/user/hive/warehouse/cjming.db/mp_ea_custom_data_vertical
    • This table must be joined with the table above to get all custom data points submitted with a given editattemptstep event.
    • This table contains 43,909,953 rows
    • Size: ~285 MB, 64 parquets

Let me know if this suffices to check for performant querying.

I'll start digging into query performance this week. In the meantime, I want to make sure that previous work is documented. The experiments and notes around working with these flattened tables can be found in this notebook on gitlab.

I've now completed testing using the EditAttemptStep data. There's a summary of the approaches, run times, and some observations and notes in this notebook on Gitlab.

It took a little longer than expected because I had to debug a non-obvious issue with how the _id columns were generated that resulted in them not corresponding to each other across tables. That's described and an improved data generation process is found in this other notebook.