To create some Superset dashboards, we need some data in the data lake.
However, we are unlikely to have real production data for some days (weeks?) yet.
But if we generate some fake data that looks more or less like the community updates data,
we can still play with Superset, see how far we can get with the current assumptions,
and run into problems that we can start thinking about.
This is a Spark SQL query that reads event data from another table with the same web/base schema,
modifies the data to mimic what community updates data would look like,
and writes it to a temporary table in a user database.
https://gist.github.com/marcelrf/157466f507fdc06aad9f3ac419f722c4
This is a Python script that executes the query above once for each hour of a given time interval.
https://gist.github.com/marcelrf/f6bdc954886b358e5cf6658e9b9878aa
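For anyone who wants the general idea without opening the gists, here is a minimal sketch of the hour-by-hour approach. It is not the code from the gists: the query template, the `year`/`month`/`day`/`hour` partition columns, and the names `hours_between` and `render_queries` are all assumptions made for illustration, and the step that rewrites columns to look like community updates data is left as a comment.

```python
from datetime import datetime, timedelta

# Hypothetical template: table names and partition columns are placeholders,
# not the ones used in the linked gists.
QUERY_TEMPLATE = """
INSERT INTO {target_table}
SELECT *  -- the column rewrites that mimic community updates data would go here
FROM {source_table}
WHERE year = {year} AND month = {month} AND day = {day} AND hour = {hour}
"""


def hours_between(start, end):
    """Yield every hour from the hour containing `start` up to, but excluding, `end`."""
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        yield current
        current += timedelta(hours=1)


def render_queries(start, end, source_table, target_table):
    """Return one rendered query string per hour of the interval."""
    return [
        QUERY_TEMPLATE.format(
            source_table=source_table,
            target_table=target_table,
            year=hour.year,
            month=hour.month,
            day=hour.day,
            hour=hour.hour,
        )
        for hour in hours_between(start, end)
    ]


if __name__ == "__main__":
    # Placeholder table names, used only to show the rendered output.
    queries = render_queries(
        datetime(2024, 8, 1), datetime(2024, 9, 1), "some_db.source_events", "user_db.fake_events"
    )
    print(len(queries), "queries rendered; first one:")
    print(queries[0])
```

In a PySpark session, each rendered string could then be passed to `spark.sql(...)`; running one query per hour keeps each job small and makes it easy to re-run only part of the interval.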
I generated one month of data (August 2024) and stored it under mforns.community_updates using the scripts above.
If we need more data, or need to modify it, it's just a matter of re-running the scripts.
But for now, we can play with it a bit from Superset.