
Extract edit oriented data from MySQL for simplewiki
Closed, Resolved · Public · 0 Story Points

Description

Extract edit history data from MySQL for simplewiki. The goal is to have an algorithm that can reconstruct history correctly; scaling the algorithm to a larger wiki will be handled in a separate task.

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. May 9 2016, 5:27 PM
JAllemandou renamed this task from Spike - Extract edit oriented dqata from MySQL on Simplewiki to match EventBus schemas to Spike - MySQL edit data extraction. May 9 2016, 5:28 PM
JAllemandou updated the task description.
Nuria renamed this task from Spike - MySQL edit data extraction to Extract edit oriented data from MySQL from simplewiki (small size). May 19 2016, 4:38 PM
Nuria renamed this task from Extract edit oriented data from MySQL from simplewiki (small size) to Extract edit oriented data from MySQL for small wiki.
Nuria added a subscriber: Nuria. May 19 2016, 4:55 PM

First question:

  1. Do we want to move the data to EventBus first, or do we want to go directly to the analytics schemas?

Since our goal is a prototype of the data pipeline, we will go from the DB directly to the analytics schemas. Later we will move the data into EventBus schemas as an intermediate step.

Nuria added a comment. May 19 2016, 5:20 PM

The DB loading is assumed to be a bootstrapping step that happens only once. Updates to past data in the DB should arrive as EventBus events, so we are not considering them at this time (DB loading is a one-off).

Technical steps that need to happen:

  • SQL to transform DB data to JSON (note that the first stab is done with a small wiki, so there is no need to think about scaling yet)

Issues:

  • Data in the analytics schemas is denormalized, so we need access to, say, user data when we are inserting a page create event
  • We might need to load the data in normalized form and denormalize it later (this is likely a mini research task within this one)
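
To make the denormalization issue concrete, here is a minimal sketch of the "SQL to JSON" step. It is an illustration under stated assumptions, not the pipeline built for this task: an in-memory SQLite database stands in for MySQL, the tables carry only a small subset of the MediaWiki `page`, `revision` and `user` columns, and the output field names (`event_type`, `page_title`, and so on) are hypothetical rather than taken from the analytics or EventBus schemas. It also assumes that a revision with no parent marks a page creation, which is a simplification.

```python
import json
import sqlite3

# In-memory SQLite stands in for MySQL; columns are a simplified subset of
# MediaWiki's page, revision and user tables (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (user_id INTEGER PRIMARY KEY, user_name TEXT);
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
    CREATE TABLE revision (
        rev_id INTEGER PRIMARY KEY,
        rev_page INTEGER,
        rev_user INTEGER,
        rev_parent_id INTEGER,
        rev_timestamp TEXT
    );
    INSERT INTO user VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO page VALUES (10, 'Main_Page');
    INSERT INTO revision VALUES
        (100, 10, 1, 0, '20160509172700'),
        (101, 10, 2, 100, '20160519163800');
""")


def extract_events(conn):
    """Join revisions with page and user data into denormalized event dicts."""
    rows = conn.execute("""
        SELECT r.rev_id, r.rev_timestamp, r.rev_parent_id,
               p.page_id, p.page_title,
               u.user_id, u.user_name
        FROM revision r
        JOIN page p ON p.page_id = r.rev_page
        LEFT JOIN user u ON u.user_id = r.rev_user
        ORDER BY r.rev_timestamp, r.rev_id
    """)
    events = []
    for rev_id, ts, parent_id, page_id, title, user_id, user_name in rows:
        events.append({
            # Hypothetical event names; a missing or zero parent is treated
            # as the first revision of the page.
            "event_type": "page_edit" if parent_id else "page_create",
            "timestamp": ts,
            "rev_id": rev_id,
            "page_id": page_id,
            "page_title": title,
            "user_id": user_id,
            "user_name": user_name,
        })
    return events


events = extract_events(conn)
for event in events:
    # One JSON object per line, each carrying its own page and user fields.
    print(json.dumps(event, sort_keys=True))
```

Each emitted event repeats the page and user fields it needs, which is what makes the output denormalized. On a small wiki the joins are cheap; the same query shape is where the scaling problems with massive tables would show up.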

If during the spike we realize that the data needs an intermediate processing step that puts it in a shape similar to the EventBus schemas, we should reconsider the decision not to use the EventBus schemas.

This is a prototype to learn about the structure of the data; we are using a small wiki to avoid the scaling problems of joining massive tables.

Nuria updated the task description. May 19 2016, 5:23 PM
Nuria updated the task description.
Nuria set the point value for this task to 34.
Nuria updated the task description. May 19 2016, 5:25 PM
Nuria added a comment. May 19 2016, 5:27 PM

An idea on how to approach that task: http://www.gv.com/sprint/

Nuria edited projects, added Analytics-Kanban; removed Analytics. May 19 2016, 5:29 PM
Milimetric moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 295693 had a related patch set uploaded (by Milimetric):
[WIP] Process Mediawiki page history

https://gerrit.wikimedia.org/r/295693

Milimetric changed the point value for this task from 34 to 0. Jun 28 2016, 4:07 PM
Nuria renamed this task from Extract edit oriented data from MySQL for small wiki to Extract edit oriented data from MySQL for simplewiki. Jul 27 2016, 7:21 PM
Nuria updated the task description.
Milimetric moved this task from In Progress to Done on the Analytics-Kanban board. Aug 2 2016, 5:43 PM

Change 295693 abandoned by Milimetric:
[WIP] Process Mediawiki page history

Reason:
Joseph has the much better https://gerrit.wikimedia.org/r/#/c/301837/

https://gerrit.wikimedia.org/r/295693

Milimetric reassigned this task from Milimetric to mforns. Aug 2 2016, 5:44 PM
Milimetric moved this task from Done to In Progress on the Analytics-Kanban board.
Milimetric moved this task from In Progress to Done on the Analytics-Kanban board.
Milimetric moved this task from Done to In Progress on the Analytics-Kanban board.
Milimetric added a subscriber: Milimetric.
mforns moved this task from In Progress to Done on the Analytics-Kanban board. Aug 4 2016, 11:25 AM
Nuria moved this task from Done to Parent Tasks on the Analytics-Kanban board. Aug 8 2016, 3:20 PM