Maniphest T209178

Refactor Mediawiki-Database ingestion
Closed, Resolved · Public · 0 Estimated Story Points

Assigned To

None

Authored By

	Milimetric
	Nov 9 2018, 8:13 PM

Description

So far, mediawiki-database ingestion has been done from labsdb using Sqoop. We have experienced issues (see T209031), and we think the new approach proposed below will be better in terms of reliability, time to data availability, and future uses. The main difficulty is that, since data will be extracted from MariaDB analytics replicas containing production data, the sanitization that happens before the data is made available in labsdb will have to be replicated on the Hadoop side. Sanitization for labsdb is done in two steps: the first uses SQL triggers (code is here), and the second uses SQL views (code is here).

The plan:
1 - Modify our sqoop-wrapper script to read the sql-triggers settings, in order to replicate the trigger-based sanitization at the sqoop step.
2 - Add a sanitization job for the sqooped data, defined using the sql-views definitions (see link above). The SQL may need some adaptation, but we think the changes will be minimal.
3 - Reconfigure the sqoop job to read from the MariaDB analytics replicas. Start on the 1st of the month instead of waiting a few days for other jobs to finish. Also change the num-mappers and num-processors settings, as the 10-connection limitation from labsdb will no longer apply.
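
As a rough illustration of step 1, here is a minimal sketch of how a wrapper could turn sanitization rules into a Sqoop free-form query. The rule format (column name mapped to a replacement SQL expression), the function names, and the example table and columns are all assumptions made for illustration; they are not taken from the actual sql-triggers settings or from the existing sqoop-wrapper script. The only Sqoop behaviour relied on is that a `--query` import needs the `$CONDITIONS` token in its WHERE clause, a `--target-dir`, and a `--split-by` column when more than one mapper is used.

```python
from typing import Dict, List

# Hypothetical rule format: column name -> SQL expression that replaces it.
# The real trigger settings referenced in the task are not reproduced here.
SanitizationRules = Dict[str, str]


def build_sanitized_query(table: str, columns: List[str],
                          rules: SanitizationRules) -> str:
    """Build a free-form SELECT that applies per-column replacements."""
    select_parts = []
    for column in columns:
        if column in rules:
            # Replace the sensitive value but keep the column name stable.
            select_parts.append(f"{rules[column]} AS {column}")
        else:
            select_parts.append(column)
    # Sqoop substitutes $CONDITIONS with each mapper's split predicate.
    return f"SELECT {', '.join(select_parts)} FROM {table} WHERE $CONDITIONS"


def build_sqoop_args(jdbc_url: str, table: str, columns: List[str],
                     rules: SanitizationRules, split_by: str,
                     num_mappers: int, target_dir: str) -> List[str]:
    """Assemble the command line for one sanitized table import."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--query", build_sanitized_query(table, columns, rules),
        "--split-by", split_by,
        "--num-mappers", str(num_mappers),
        "--target-dir", target_dir,
    ]


if __name__ == "__main__":
    # Illustrative values only: host, table, columns and rule are made up.
    args = build_sqoop_args(
        jdbc_url="jdbc:mysql://replica.example/somewiki",
        table="user",
        columns=["user_id", "user_name", "user_email"],
        rules={"user_email": "''"},
        split_by="user_id",
        num_mappers=4,
        target_dir="/tmp/sqoop/user",
    )
    print(" ".join(args))
```

Applying the replacement inside the SELECT means unsanitized values never leave the database for that column, which is the property the trigger-based sanitization provides on labsdb. Whether the real trigger settings can be expressed this simply is an open question.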

Related Objects

Status	Assigned	Task
Resolved	None	T209178 Refactor Mediawiki-Database ingestion
Resolved	JAllemandou	T209179 Update log_namespace, page_namespace from bigint to int
Resolved	Milimetric	T209031 Not able to scoop comment table in labs for mediawiki reconstruction process [EPIC}
Resolved	• Banyek	T210693 Create materialized views on Wiki Replica hosts for better query performance
Declined	Milimetric	T210522 Refactor Sqoop, join actor and comment from analytics replicas
Resolved	Milimetric	T210541 Update sqoop to work with the new schema
Resolved	JAllemandou	T210542 Update datasets definitions and oozie jobs for dual-sqoop of comments and actors
Resolved	Milimetric	T210543 Update refinery-source jobs to join labsdb with actor and comment

Event Timeline

Milimetric created this task. Nov 9 2018, 8:13 PM

Restricted Application added a subscriber: Aklapper. Nov 9 2018, 8:13 PM

• fdans assigned this task to JAllemandou. Nov 12 2018, 5:18 PM

• fdans triaged this task as High priority.

• fdans added a project: Analytics-Kanban.

• fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

JAllemandou moved this task from Next Up to In Progress on the Analytics-Kanban board. Nov 12 2018, 5:23 PM

JAllemandou added a subtask: T209031: Not able to scoop comment table in labs for mediawiki reconstruction process [EPIC}. Nov 13 2018, 8:27 AM

JAllemandou set the point value for this task to 0. Nov 13 2018, 8:32 AM

JAllemandou moved this task from In Progress to Parent Tasks on the Analytics-Kanban board.

JAllemandou renamed this task from Long term solution for sqooping comments to Refactor Mediawiki-Database ingestion. Nov 15 2018, 9:27 AM

JAllemandou updated the task description.

elukey updated the task description. Nov 15 2018, 9:33 AM

elukey updated the task description.

JAllemandou updated the task description. Nov 15 2018, 9:38 AM

• Nuria closed subtask T209179: Update log_namespace, page_namespace from bigint to int as Resolved. Dec 10 2018, 5:50 PM

• Nuria closed subtask T209031: Not able to scoop comment table in labs for mediawiki reconstruction process [EPIC} as Resolved. Feb 14 2019, 5:07 AM

• Nuria closed subtask T210522: Refactor Sqoop, join actor and comment from analytics replicas as Declined.

Milimetric removed a project: Analytics-Kanban. Mar 4 2019, 4:57 PM

Ottomata removed JAllemandou as the assignee of this task. Mar 4 2019, 4:58 PM

Ottomata added a subscriber: JAllemandou.

Milimetric added a project: Analytics-Kanban. Mar 4 2019, 4:58 PM

• Nuria closed this task as Resolved. Jul 6 2020, 11:23 PM
