
Add data-quality check on mediawiki-history-reduced before druid indexation
Closed, Resolved | Public | 13 Estimated Story Points

Description

Needs to happen after T192482.
Add an Oozie step that asserts the data quality of a new snapshot by comparing it with the previous one. The step must run before the mediawiki-history-reduced data is indexed into Druid to be served by AQS.
Because the mediawiki-history-reduced dataset is quite complex, the job/query needs to be carefully designed and tested.
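
To illustrate the kind of snapshot-to-snapshot comparison described above, here is a minimal sketch in Python. It is not the actual MediawikiHistoryChecker logic: the grouping key `(project, event_type)`, the function name `compare_snapshots`, and the growth thresholds are all assumptions made for illustration. It takes per-key event counts from the previous and new snapshots and reports any key whose count changed more than an accepted range.

```python
from typing import Dict, List, Tuple

# Hypothetical grouping key, assumed for illustration: (project, event_type).
Key = Tuple[str, str]


def compare_snapshots(
    previous: Dict[Key, int],
    new: Dict[Key, int],
    min_growth: float = -0.01,  # assumed threshold: at most a 1% drop
    max_growth: float = 1.0,    # assumed threshold: at most a 100% increase
) -> List[str]:
    """Return human-readable violations; an empty list means the check passes."""
    violations = []
    for key in sorted(set(previous) | set(new)):
        old_count = previous.get(key, 0)
        new_count = new.get(key, 0)
        if old_count == 0:
            # Key appears only in the new snapshot: nothing to compare against.
            continue
        growth = (new_count - old_count) / old_count
        if growth < min_growth or growth > max_growth:
            violations.append(
                f"{key}: {old_count} -> {new_count} ({growth:+.1%})"
            )
    return violations


if __name__ == "__main__":
    prev = {("enwiki", "revision"): 1000, ("frwiki", "revision"): 500}
    curr = {("enwiki", "revision"): 1010, ("frwiki", "revision"): 200}
    for line in compare_snapshots(prev, curr):
        print(line)
```

In a workflow, a non-empty result would be the signal to stop before indexation. How the real job derives its metrics and thresholds is not stated in this task.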

Event Timeline

Milimetric added a project: Analytics.

Change 441341 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Update MWH-reduced to parquet storage

https://gerrit.wikimedia.org/r/441341

Change 441378 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery/source@master] Updating MediawikiHistoryChecker for reduced

https://gerrit.wikimedia.org/r/441378

Operational changes that go with this change:

  • convert existing data (json) into parquet
  • kill old job
  • start new job that indexes parquet data
  • recreate table in parquet format (repair also in order to create partitions)
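
The table recreation step above can be sketched as a small Python helper that builds the HiveQL statements in order. This is an assumption-laden illustration, not the commands actually run: the helper name, table name, location, columns, and partition column are all hypothetical. Only the statement forms (`CREATE EXTERNAL TABLE ... STORED AS PARQUET` and `MSCK REPAIR TABLE`) are standard Hive syntax.

```python
from typing import List, Sequence, Tuple


def recreate_parquet_table_statements(
    table: str,
    location: str,
    columns: Sequence[Tuple[str, str]],
    partition_column: str,
) -> List[str]:
    """Build HiveQL to drop, recreate (as Parquet), and repair a table."""
    column_defs = ", ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return [
        f"DROP TABLE IF EXISTS {table}",
        (
            f"CREATE EXTERNAL TABLE {table} ({column_defs}) "
            f"PARTITIONED BY (`{partition_column}` STRING) "
            f"STORED AS PARQUET LOCATION '{location}'"
        ),
        # The repair step discovers partition directories already on disk.
        f"MSCK REPAIR TABLE {table}",
    ]


if __name__ == "__main__":
    # All names below are placeholders, not the real table definition.
    statements = recreate_parquet_table_statements(
        table="example_db.example_table",
        location="/tmp/example_table",
        columns=[("project", "STRING"), ("events", "BIGINT")],
        partition_column="snapshot",
    )
    for statement in statements:
        print(statement + ";")
```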

Change 441341 merged by Nuria:
[analytics/refinery@master] Update MWH-reduced to parquet storage

https://gerrit.wikimedia.org/r/441341

Change 445373 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Add check to mw-history-reduced druid indexation

https://gerrit.wikimedia.org/r/445373

JAllemandou set the point value for this task to 13. (Aug 9 2018, 3:04 PM)

Change 445373 merged by Joal:
[analytics/refinery@master] Add check to mw-history-reduced druid indexation

https://gerrit.wikimedia.org/r/445373