So far, mediawiki-database ingestion has been done from labsdb using Sqoop. We have experienced issues with this approach (see T209031), and we think the new approach proposed below will be better in terms of reliability, time to data availability, and future use cases. The main difficulty is that, since data will be extracted from production slaves, the sanitization that currently happens before data is made available in labsdb will have to be replicated on the Hadoop side. Sanitization for labsdb is done in two steps: the first uses SQL triggers (code is here: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/files/mariadb), and the second uses SQL views (code is here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/templates/labs/db/views/maintain-views.yaml).
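For illustration, the two layers do different things: the triggers blank sensitive column values as rows are replicated, while the views restrict which columns and rows are exposed at read time. A minimal sketch of the two shapes, with hypothetical table and column names (the real definitions live in the puppet links above):

```
# Layer 1 -- trigger-style redaction: the value itself is blanked on the
# sanitized replica, i.e. the equivalent of (hypothetical names):
TRIGGER_STYLE = "UPDATE some_table SET secret_column = NULL"

# Layer 2 -- view-style filtering: only whitelisted columns/rows are exposed
# (hypothetical names):
VIEW_STYLE = """
CREATE VIEW some_table_public AS
SELECT public_col_1, public_col_2
FROM some_table
WHERE deleted_flag = 0
"""
```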
The plan:
1 - Modify our sqoop-wrapper script to read the sql-triggers settings (https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/files/mariadb/filtered_tables.txt) so that the trigger-based sanitization is replicated at the sqoop step (a sketch is given after this list).
2 - Add a sanitization job for the sqooped data, defined using the sql-views definitions (see link above). Some adaptation of the SQL may be needed, but we expect it to be minimal (see the second sketch below).
3 - Reconfigure the sqoop job to start on the 1st of the month instead of waiting a few days for other jobs to finish. Also raise the num-mappers and num-processors settings, since the 10-connection limitation from labsdb will no longer apply (see the last sketch below).
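Sketch for step 1: assuming filtered_tables.txt maps table.column entries to a keep/nullify flag (the exact file format should be double-checked against the puppet file linked above), the sqoop wrapper could build its --query SELECT so that filtered columns are exported as NULL. Function names, the parsing layout, and the flag values are hypothetical:

```
def parse_filtered_tables(path):
    # Assumed layout: whitespace-separated "table.column flag" lines;
    # verify against the real filtered_tables.txt format.
    filtered = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            name, flag = line.split()[:2]
            filtered[name] = flag
    return filtered

def build_select(table, columns, filtered):
    # Replicate the trigger behaviour at sqoop time: export NULL for
    # filtered columns instead of their value.
    exprs = []
    for col in columns:
        if filtered.get('%s.%s' % (table, col)) == 'nullify':
            exprs.append('NULL AS %s' % col)
        else:
            exprs.append(col)
    # $CONDITIONS is required by sqoop in --query mode for parallel imports.
    return 'SELECT %s FROM %s WHERE $CONDITIONS' % (', '.join(exprs), table)
```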
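Sketch for step 2: assuming each entry of maintain-views.yaml gives, per table, the exposed columns and an optional row predicate (the actual YAML schema may differ, and the SQL may need the adaptations mentioned above), the sanitization job could generate Hive statements from those definitions. The database names and the sample definition shape are hypothetical:

```
import yaml  # PyYAML

def build_hive_sanitization(view_def, src_db='wmf_raw', dst_db='wmf_sanitized'):
    # view_def is assumed to be shaped like:
    #   {'table': 'revision', 'columns': ['rev_id', ...], 'where': 'rev_deleted = 0'}
    cols = ', '.join(view_def['columns'])
    query = 'INSERT OVERWRITE TABLE %s.%s SELECT %s FROM %s.%s' % (
        dst_db, view_def['table'], cols, src_db, view_def['table'])
    if view_def.get('where'):
        query += ' WHERE %s' % view_def['where']
    return query

with open('maintain-views.yaml') as f:
    for view_def in yaml.safe_load(f):
        print(build_hive_sanitization(view_def))
```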
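Sketch for step 3: with the labsdb 10-connection cap gone, the wrapper can ask sqoop for much more parallelism. The 64-mapper value and the split column are illustrative only, to be tuned against what the production slaves can sustain (the schedule change itself is a coordinator/cron configuration change, not shown here):

```
import subprocess

def sqoop_import(jdbc_uri, query, split_by, target_dir, num_mappers=64):
    # num_mappers was previously capped by labsdb's 10-connection limit.
    subprocess.check_call([
        'sqoop', 'import',
        '--connect', jdbc_uri,
        '--query', query,          # must contain $CONDITIONS
        '--split-by', split_by,
        '--num-mappers', str(num_mappers),
        '--target-dir', target_dir,
        '--as-avrodatafile',
    ])
```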