Page MenuHomePhabricator

Sqoop staging.mep_word_persistence to HDFS and drop the table from dbstore1002
Closed, ResolvedPublic


This will allow a simpler and smoother dump procedure for the staging database when migrating to the new infrastructure.

Event Timeline

elukey triaged this task as High priority.Feb 6 2019, 7:04 PM
elukey created this task.

Job started with the command:

sudo -u hdfs sqoop import \
  -D'sqoop-staging-mep_word_persistence' \
  --username research \
  --password-file /user/hdfs/mysql-analytics-research-client-pw.txt \
  --connect jdbc:mysql://analytics-store/staging \
  --query "select rev_id, convert(rev_timestamp using utf8) rev_timestamp, page_id, page_namespace, convert(page_title using utf8) page_title, user_id, convert(user_text using utf8) user_text, convert(comment using utf8) comment, minor, convert(sha1 using utf8) sha1, revisions_processed, non_self_processed, seconds_possible, tokens_added, persistent_tokens, non_self_persistent_tokens, censored, non_self_censored, sum_log_persisted, sum_log_non_self_persisted, sum_log_seconds_visible from mep_word_persistence where \$CONDITIONS" \
  --target-dir /user/joal/mep_word_persistence \
  --num-mappers 5 \
  --split-by rev_id \
  --as-parquetfile \
  --map-column-java "minor=Boolean,censored=Boolean,non_self_censored=Boolean,sum_log_persisted=Double,sum_log_non_self_persisted=Double,sum_log_seconds_visible=Double"

I hope 5 parallel workers is not too much ...

@JAllemandou dbstore1002 crashed, let's start the same thing but with less jobs I would suggest.

@Marostegui : Sqoop finished yesterday at 23:31 UTC with expected number of rows. Maybe the db crash was unrelated?
@Halfak : Data is available for vetting in your hdfs user folder: /user/halfak/mep_word_persistence. I also created the related hive table:

use halfak;
CREATE EXTERNAL TABLE mep_word_persistence (  
  rev_id BIGINT,
  rev_timestamp STRING,
  page_id BIGINT,
  page_namespace BIGINT,
  page_title STRING,
  user_id BIGINT,
  user_text STRING,
  comment STRING,
  minor BOOLEAN,
  sha1 STRING,
  revisions_processed BIGINT,
  non_self_processed BIGINT,
  seconds_possible BIGINT,
  tokens_added BIGINT,
  persistent_tokens BIGINT,
  non_self_persistent_tokens BIGINT,
  censored BOOLEAN,
  non_self_censored BOOLEAN,
  sum_log_persisted DOUBLE,
  sum_log_non_self_persisted DOUBLE,
  sum_log_seconds_visible DOUBLE
) STORED AS PARQUET LOCATION "/user/halfak/mep_word_persistence";

Let's triple check the data is correct before calling it done :)

@JAllemandou at what time did you start the job?
From what I can see it crashed at around 18:32, so it crashed before then (if we use your comment at T215450#4932773 as a reference for time).

@Halfak let's check you've got everything you need, so we can proceed and kill this table as soon as possible, as its size is giving us some headaches :-( (135GB and still growing on InnoDB)

@Marostegui : Indeed I started the job later (comment time is almostsynchronous).

Great - so the crash happened before. Thanks a lot for helping out here :)
Let's wait for @Halfak to verify he's got everything he needs.

Mentioned in SAL (#wikimedia-operations) [2019-02-07T10:43:30Z] <marostegui> Run mysqldump from dbstore1003 to dump dbstore1002:staging.mep_word_persistence - T215450

I am also mysqldumping that table, just in case.

@elukey the .sql file is at: dbstore1003:/srv/tmp.staging/staging.sql
Can you grab it and store it somewhere else, so we have two copies, the HIVE one and this?

I suggest using hdfs://wmf/data/archive/sqldumps as a base for sql-dumps, with content-oriented subfolders, leading to hdfs://wmf/data/archive/sqldumps/mep_word_persistence/staging.sql.
Would that be ok for you @elukey ?

I am moving the dump to stat1004, then I'll move it to the HDFS dir that Joseph proposed. Will update the task once done :)

elukey added a comment.Feb 7 2019, 2:57 PM

Naming is not great (yet) but the backup is on HDFS:

elukey@stat1004:~$ sudo -u hdfs hdfs dfs -ls /wmf/data/archive/backup/misc/dbstore1002_backup/staging_dbstore1002_07022019.sql
-rw-r--r--   3 hdfs hadoop 252675730411 2019-02-07 14:49 /wmf/data/archive/backup/misc/dbstore1002_backup/staging_dbstore1002_07022019.sql

Thanks @elukey - ok to delete the .sql file I created from dbstore1003?

elukey added a comment.Feb 7 2019, 3:00 PM

Thanks @elukey - ok to delete the .sql file I created from dbstore1003?


Done - thank you

Halfak added a comment.Feb 7 2019, 3:56 PM

Confirmed that they have the same number of rows (SELECT COUNT(*) FROM mep_word_persistence;). I confirmed that I can run a useful query and get the same result (SELECT SUM(non_self_persistent_tokens) FROM mep_word_persistence WHERE user_id = 753665;)

This is a measure of the total contribution of as measured by word tokens added that have persisted.

Both queries arrive at the same result: 16,971,001 persistent word tokens!

I'm confirming that this is good. Querying this in Hive is *so much faster* too. :)

Thanks @Halfak, can we drop the table from dbstore1002 then?

Halfak added a comment.Feb 7 2019, 4:09 PM

Yes. Thanks for your patience.

Thanks @Halfak for the double check :) Also, If you like hive speed, try Spark :D
Massive thanks again to @Marostegui for handling the death of the dbstore1002 beast :)

Yes. Thanks for your patience.

Thank you! I will kill this table tomorrow and close this task!

@JAllemandou thank you for helping out and placing this into HIVE that fast!

fdans closed this task as Resolved.Feb 7 2019, 6:17 PM

Mentioned in SAL (#wikimedia-operations) [2019-02-08T06:07:28Z] <marostegui> Drop staging.mep_word_persistence from dbstore1002 T215450 T213706

root@DBSTORE[staging]> drop table mep_word_persistence;
Query OK, 0 rows affected (5.53 sec)