This will allow a simpler and smoother dump procedure for the staging database when migrating to the new infrastructure.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | elukey | T172410 Replace the current multisource analytics-store setup
Resolved | | Jclark-ctr | T216491 Decommission dbstore1002
Resolved | | Marostegui | T210478 Migrate dbstore1002 to a multi instance setup on dbstore100[3-5]
Resolved | | Marostegui | T213670 dbstore1002 Mysql errors
Resolved | | elukey | T213706 Convert Aria/Tokudb tables to InnoDB on dbstore1002
Resolved | | JAllemandou | T215450 Sqoop staging.mep_word_persistence to HDFS and drop the table from dbstore1002
Event Timeline
Job started with the command:
sudo -u hdfs sqoop import \
    -D mapred.job.name='sqoop-staging-mep_word_persistence' \
    --username research \
    --password-file /user/hdfs/mysql-analytics-research-client-pw.txt \
    --connect jdbc:mysql://analytics-store/staging \
    --query "select rev_id, convert(rev_timestamp using utf8) rev_timestamp, page_id, page_namespace, convert(page_title using utf8) page_title, user_id, convert(user_text using utf8) user_text, convert(comment using utf8) comment, minor, convert(sha1 using utf8) sha1, revisions_processed, non_self_processed, seconds_possible, tokens_added, persistent_tokens, non_self_persistent_tokens, censored, non_self_censored, sum_log_persisted, sum_log_non_self_persisted, sum_log_seconds_visible from mep_word_persistence where \$CONDITIONS" \
    --target-dir /user/joal/mep_word_persistence \
    --num-mappers 5 \
    --split-by rev_id \
    --as-parquetfile \
    --map-column-java "minor=Boolean,censored=Boolean,non_self_censored=Boolean,sum_log_persisted=Double,sum_log_non_self_persisted=Double,sum_log_seconds_visible=Double"
I hope 5 parallel workers is not too much ...
@JAllemandou dbstore1002 crashed; I would suggest we start the same thing again, but with fewer jobs.
@Marostegui : Sqoop finished yesterday at 23:31 UTC with expected number of rows. Maybe the db crash was unrelated?
@Halfak : Data is available for vetting in your hdfs user folder: /user/halfak/mep_word_persistence. I also created the related hive table:
use halfak;

CREATE EXTERNAL TABLE mep_word_persistence (
    rev_id                      BIGINT,
    rev_timestamp               STRING,
    page_id                     BIGINT,
    page_namespace              BIGINT,
    page_title                  STRING,
    user_id                     BIGINT,
    user_text                   STRING,
    comment                     STRING,
    minor                       BOOLEAN,
    sha1                        STRING,
    revisions_processed         BIGINT,
    non_self_processed          BIGINT,
    seconds_possible            BIGINT,
    tokens_added                BIGINT,
    persistent_tokens           BIGINT,
    non_self_persistent_tokens  BIGINT,
    censored                    BOOLEAN,
    non_self_censored           BOOLEAN,
    sum_log_persisted           DOUBLE,
    sum_log_non_self_persisted  DOUBLE,
    sum_log_seconds_visible     DOUBLE
)
STORED AS PARQUET
LOCATION "/user/halfak/mep_word_persistence";
Let's triple check the data is correct before calling it done :)
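One simple check of that kind, as an illustrative sketch only (the Hive-side count should match what the MySQL source table reports):

-- Illustrative sanity check; compare with SELECT COUNT(*) run against staging on dbstore1002.
SELECT COUNT(*) FROM halfak.mep_word_persistence;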
@JAllemandou at what time did you start the job?
From what I can see it crashed at around 18:32, so the crash happened before the job started (using your comment at T215450#4932773 as a time reference).
@Halfak let's check you've got everything you need, so we can proceed and kill this table as soon as possible, as its size is giving us some headaches :-( (135GB and still growing on InnoDB)
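For context, a size figure like the 135GB above can be read from information_schema; a minimal sketch, assuming it is run against the staging database on dbstore1002:

-- Rough on-disk size as reported by InnoDB statistics (data + indexes), in GB.
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'staging'
  AND table_name = 'mep_word_persistence';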
Great - so the crash happened before. Thanks a lot for helping out here :)
Let's wait for @Halfak to verify he's got everything he needs.
Mentioned in SAL (#wikimedia-operations) [2019-02-07T10:43:30Z] <marostegui> Run mysqldump from dbstore1003 to dump dbstore1002:staging.mep_word_persistence - T215450
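The exact invocation is not recorded in the task; a plausible sketch of such a dump (connection options are assumptions, the output path is the one mentioned in the next comment) would be:

# Hypothetical sketch of the mysqldump noted in the SAL entry above.
# Host and credential options are assumptions; the output path matches the comment below.
mysqldump -h dbstore1002 --single-transaction \
    staging mep_word_persistence > /srv/tmp.staging/staging.sql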
@elukey the .sql file is at: dbstore1003:/srv/tmp.staging/staging.sql
Can you grab it and store it somewhere else, so we have two copies, the Hive one and this one?
I suggest using hdfs://wmf/data/archive/sqldumps as a base for sql-dumps, with content-oriented subfolders, leading to hdfs://wmf/data/archive/sqldumps/mep_word_persistence/staging.sql.
Would that be ok for you @elukey ?
I am moving the dump to stat1004, then I'll move it to the HDFS dir that Joseph proposed. Will update the task once done :)
Naming is not great (yet) but the backup is on HDFS:
elukey@stat1004:~$ sudo -u hdfs hdfs dfs -ls /wmf/data/archive/backup/misc/dbstore1002_backup/staging_dbstore1002_07022019.sql
-rw-r--r--   3 hdfs hadoop 252675730411 2019-02-07 14:49 /wmf/data/archive/backup/misc/dbstore1002_backup/staging_dbstore1002_07022019.sql
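For completeness, landing the file in HDFS roughly amounts to the following (the stat1004-side source path is an assumption; the destination is the path listed above):

# Hypothetical sketch of the copy from stat1004 into HDFS; the local source path is assumed.
sudo -u hdfs hdfs dfs -put /srv/staging_dbstore1002_07022019.sql \
    /wmf/data/archive/backup/misc/dbstore1002_backup/staging_dbstore1002_07022019.sql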
Confirmed that they have the same number of rows (SELECT COUNT(*) FROM mep_word_persistence;). I also confirmed that I can run a useful query and get the same result (SELECT SUM(non_self_persistent_tokens) FROM mep_word_persistence WHERE user_id = 753665;).
This is a measure of the total contribution of https://en.wikipedia.org/wiki/Steven_Pruitt as measured by word tokens added that have persisted.
Both queries arrive at the same result: 16,971,001 persistent word tokens!
I'm confirming that this is good. Querying this in Hive is *so much faster* too. :)
Thanks @Halfak for the double check :) Also, if you like Hive speed, try Spark :D
Massive thanks again to @Marostegui for handling the death of the dbstore1002 beast :)
Thank you! I will kill this table tomorrow and close this task!
@JAllemandou thank you for helping out and placing this into Hive that fast!
Mentioned in SAL (#wikimedia-operations) [2019-02-08T06:07:28Z] <marostegui> Drop staging.mep_word_persistence from dbstore1002 T215450 T213706
root@DBSTORE[staging]> drop table mep_word_persistence;
Query OK, 0 rows affected (5.53 sec)