Extract edit history data from MySQL for simplewiki. The goal is an algorithm that can reconstruct history correctly; scaling the algorithm to a larger wiki will be handled in a different task.
- Do we want to move data to the event bus first, or go directly to the analytics schemas?
Given that our goal is a prototype of the data pipeline, we will go straight from the db to the analytics schemas. Later we will move data into the eventbus schemas as an intermediate step.
The DB loading is assumed to be a bootstrapping step that happens only once. Updates to past data in the db should arrive as eventbus events, so we are not considering those at this time (DB loading is a one-off).
Technical steps that need to happen:
- SQL to transform db data to JSON (note that the first stab is done with a small wiki, so no need to think about scaling yet)
- Data in the analytics schemas is denormalized, so we need access to, say, user data when we are inserting a page-create event
- It might be that we need to load the data in normalized form first and denormalize it later (this is likely a mini research task within this one)
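The denormalization step above can be sketched as a single join that attaches page and user data to each revision before emitting JSON. This is a minimal sketch only: it uses SQLite as a stand-in for MySQL, and the table and column names (`page`, `actor`, `revision`, `rev_actor`, etc.) loosely follow the MediaWiki schema but are simplified assumptions, not the real DDL.

```python
import json
import sqlite3

# Toy stand-in for the MediaWiki db; column names are simplified assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER, page_title TEXT);
CREATE TABLE actor (actor_id INTEGER, actor_name TEXT);
CREATE TABLE revision (rev_id INTEGER, rev_page INTEGER,
                       rev_actor INTEGER, rev_timestamp TEXT);
INSERT INTO page VALUES (1, 'Main_Page');
INSERT INTO actor VALUES (10, 'ExampleUser');
INSERT INTO revision VALUES (100, 1, 10, '20240101000000');
""")

# One denormalizing join: each output row carries its page and user data,
# so the downstream analytics event needs no extra lookups.
rows = conn.execute("""
    SELECT r.rev_id, r.rev_timestamp, p.page_title, a.actor_name
    FROM revision r
    JOIN page  p ON p.page_id  = r.rev_page
    JOIN actor a ON a.actor_id = r.rev_actor
""").fetchall()

events = [
    {"rev_id": rev_id, "timestamp": ts, "page_title": title, "user": user}
    for rev_id, ts, title, user in rows
]
print(json.dumps(events))
```

On a small wiki this single join is fine; the open question noted above is whether, at scale, we instead load normalized tables first and denormalize in a later pass.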
If during the spike we realize that the data needs an intermediate processing step that leaves it in a shape similar to eventbus, we should reconsider
the decision not to use the eventbus schemas.
This is a prototype to learn about the structure of the data; we are using a small wiki to get away from the scaling problems of joining massive tables.