Parent task for the extraction, transformation, and loading (ETL) of user history data from MediaWiki into Hadoop.
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | None | T120037 Vital Signs: Please provide an "all languages" de-duplicated stream for the Community/Content groups of metrics |
| Resolved | | None | T120036 Vital Signs: Please make the data for enwiki and other big wikis less sad, and not just be missing for most days |
| Resolved | | odimitrijevic | T130256 Wikistats 2.0. |
| Duplicate | | mforns | T134793 User history in hadoop |
| Resolved | | Ottomata | T134502 Propose evolution of Mediawiki EventBus schemas to match needed data for Analytics need |
| Resolved | | Ottomata | T137287 Update MediaWiki hooks to generate data for new event-bus schemas |
| Declined | | JAllemandou | T134792 Spike - Slowly Changing Dimensions on Druid |
Event Timeline
Build a schema that would help us do analytics. How do we represent that data in Druid? And, before that, how do we represent it in Hadoop?
This task is about schema design. It is a first stab; we might need to revisit this schema later.
We are going to treat it as a spike and devote one week of one person's time to it.
1.1 First, the team needs to internally define the schemas that will be used to calculate metrics. These are not event-based schemas, but the data flowing into them comes from the EventBus event-based data inflow.
1.2 How is this data represented in Hadoop? As analytics schema tables, or as something else?
1.3 How is this data represented in Druid? (We need to know how Druid handles slowly changing dimensions. See the subtask.)
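To make the "slowly changing dimension" question in 1.3 concrete, here is a minimal sketch of one common way to model it: each user attribute value is stored as a row with a validity interval, so a metric can ask what a user looked like at a given time. This is only an illustration, not the schema decided on in this task. The names `UserEvent`, `UserState`, `build_history`, and `name_at`, and the choice of `user_name` as the changing attribute, are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UserEvent:
    """Hypothetical event: a user's name as observed at a point in time."""
    user_id: int
    timestamp: str  # ISO 8601 in UTC, so string comparison matches time order
    user_name: str


@dataclass
class UserState:
    """One row of history: an attribute value and the interval it was valid."""
    user_id: int
    user_name: str
    start: str
    end: Optional[str]  # None means the state is still current


def build_history(events: List[UserEvent]) -> List[UserState]:
    """Turn a stream of events into non-overlapping state intervals per user."""
    history: List[UserState] = []
    current = {}  # user_id -> the open (still current) state for that user
    for ev in sorted(events, key=lambda e: (e.timestamp, e.user_id)):
        prev = current.get(ev.user_id)
        if prev is not None:
            if prev.user_name == ev.user_name:
                continue  # nothing changed, keep the open interval
            prev.end = ev.timestamp  # close the previous interval
        state = UserState(ev.user_id, ev.user_name, ev.timestamp, None)
        history.append(state)
        current[ev.user_id] = state
    return history


def name_at(history: List[UserState], user_id: int, timestamp: str) -> Optional[str]:
    """Look up the name a user had at a given time (start inclusive, end exclusive)."""
    for s in history:
        if s.user_id == user_id and s.start <= timestamp and (s.end is None or timestamp < s.end):
            return s.user_name
    return None


events = [
    UserEvent(1, "2016-01-01T00:00:00Z", "Alice"),
    UserEvent(1, "2016-03-01T00:00:00Z", "Alice2"),
    UserEvent(2, "2016-02-01T00:00:00Z", "Bob"),
]
history = build_history(events)
```

With a table shaped like `history`, a query for user 1 in mid-February resolves to the old name and a query after the rename resolves to the new one. Whether this interval form, or a denormalized form with the attribute copied onto every event, suits Hadoop and Druid better is exactly what 1.2 and 1.3 need to answer.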