
Refactor Mediawiki-Database ingestion
Closed, Resolved · Public · 0 Estimated Story Points

Description

So far, mediawiki-database ingestion has been done from labsdb using Sqoop. We have experienced issues (see T209031), and we think the new approach proposed below will be better in terms of reliability, time to data availability, and future usages. The main difficulty is that, since data will be extracted from MariaDB analytics replicas containing production data, the sanitization steps that currently happen before data is made available in labsdb will have to be replicated on the Hadoop side. Sanitization for labsdb is done in two steps: the first uses sql-triggers (code is here), and the second uses sql-views (code is here).
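To make the two layers concrete, here is a toy rendering of what each one does. This is illustrative only: the real trigger and view definitions live in the repositories linked above, and every table and column name below is a placeholder.

```
# Layer 1: sql-triggers blank sensitive columns as rows are replicated.
# (Placeholder names; the actual triggers are in the repo linked above.)
TRIGGER_LIKE_REDACTION = """
CREATE TRIGGER user_redact BEFORE INSERT ON user FOR EACH ROW
SET NEW.user_password = '', NEW.user_email = '', NEW.user_token = ''
"""

# Layer 2: sql-views filter or null out what remains, row by row.
VIEW_LIKE_FILTERING = """
CREATE VIEW revision_sanitized AS
SELECT rev_id, rev_page, rev_timestamp,
       IF(rev_deleted & 4, NULL, rev_user) AS rev_user
FROM revision
"""
```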

The plan:
1 - Modify our sqoop-wrapper script to read the sql-triggers settings, in order to replicate the trigger-based sanitization at the sqoop step (a first sketch follows this list).
2 - Add a sanitization job for the sqooped-out data, defined using the sql-views definitions (see link above). Some adaptation of the SQL may be needed, but we expect it to be minimal (second sketch below).
3 - Reconfigure the sqoop job to read from the MariaDB analytics replicas, starting on the 1st of the month instead of waiting a few days for other jobs to finish. Also raise the num-mappers and num-processors settings, as the 10-connection limitation from labsdb will no longer apply (third sketch below).
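For step 1, a minimal sketch of how the wrapper could replicate the trigger sanitization, assuming it builds a free-form Sqoop query per table. The SANITIZED_COLUMNS mapping and the column names are hypothetical stand-ins for configuration the wrapper would derive from the sql-triggers settings.

```
# Minimal sketch: replicate trigger-style redaction in the sqoop SELECT.
# SANITIZED_COLUMNS is a hypothetical stand-in for configuration read
# from the sql-triggers settings; names are illustrative.
SANITIZED_COLUMNS = {
    'user': ['user_password', 'user_email', 'user_token'],
}

def build_select(table, columns):
    """Build a free-form sqoop query, nulling the columns triggers blank out."""
    redacted = set(SANITIZED_COLUMNS.get(table, []))
    select_list = ', '.join(
        'NULL AS {}'.format(c) if c in redacted else c for c in columns
    )
    # Sqoop free-form queries must keep the $CONDITIONS token for split ranges.
    return 'SELECT {} FROM {} WHERE $CONDITIONS'.format(select_list, table)
```

For example, build_select('user', ['user_id', 'user_name', 'user_password']) yields SELECT user_id, user_name, NULL AS user_password FROM user WHERE $CONDITIONS, so sensitive values are never written to Hadoop in the first place.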
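For step 2, one shape the sanitization job could take, sketched as a Hive query held in a Python string. The database names (wmf_raw, wmf_sanitized) and the rev_deleted predicate are illustrative assumptions; the real filters would be ported from the sql-views definitions linked above.

```
# Hedged sketch of a view-based sanitization step ported to Hive.
# Database, table, and partition names are placeholders.
HIVE_SANITIZE_REVISION = """
INSERT OVERWRITE TABLE wmf_sanitized.revision PARTITION (snapshot='{snapshot}')
SELECT rev_id,
       rev_page,
       rev_timestamp,
       -- Mirror the view logic: null the user when the 'user hidden' bit is set
       IF((rev_deleted & 4) = 0, rev_user, NULL) AS rev_user
FROM wmf_raw.revision
WHERE snapshot = '{snapshot}'
"""
```

The SQL-dialect adaptation mentioned in step 2 would show up exactly here: MariaDB view definitions become HiveQL INSERT ... SELECT statements.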
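For step 3, an illustrative invocation with the relaxed parallelism. Host, database, paths, and the mapper count are placeholders, not the production configuration, and credential handling is omitted.

```
# Illustrative sqoop invocation against an analytics replica (step 3).
import subprocess

subprocess.check_call([
    'sqoop', 'import',
    '--connect', 'jdbc:mysql://analytics-replica.example:3306/enwiki',
    '--query', 'SELECT user_id, user_name, NULL AS user_password '
               'FROM user WHERE $CONDITIONS',
    '--split-by', 'user_id',
    # More mappers than before: the 10-connection labsdb cap no longer applies.
    '--num-mappers', '16',
    '--target-dir', '/wmf/data/raw/mediawiki/tables/user/snapshot=2018-11',
    '--as-avrodatafile',
])
```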

Event Timeline

fdans triaged this task as High priority.
fdans added a project: Analytics-Kanban.
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
JAllemandou set the point value for this task to 0. Nov 13 2018, 8:32 AM
JAllemandou moved this task from In Progress to Parent Tasks on the Analytics-Kanban board.
JAllemandou renamed this task from Long term solution for sqooping comments to Refactor Mediawiki-Database ingestion. Nov 15 2018, 9:27 AM
JAllemandou updated the task description.
elukey updated the task description.
Ottomata added a subscriber: JAllemandou.