
Archive PageContentSaveComplete in hdfs while we continue collecting data
Closed, Resolved · Public · 5 Estimated Story Points

Description

PageContentSaveComplete data includes tags about edits that are not available in any other form; the table is very large and is getting in the way of MySQL purging.

Let's:

  • archive most of the PageContentSaveComplete data to a known location on HDFS where it will not get purged (other than the UA, most of the data is public, so it can be retained)
  • continue harvesting events in MySQL, but into a table that will be much smaller

Event Timeline

Nuria set the point value for this task to 3. · Jul 18 2017, 9:01 PM
Nuria renamed this task from Clean up PageContentSaveComplete event if there are no data users to Archive PageContentSaveComplete in hdfs while we continue collecting data. · Jul 25 2017, 6:08 PM
Nuria updated the task description.
elukey triaged this task as High priority. · Aug 1 2017, 3:46 PM
elukey added a project: User-Elukey.

It would be really great to progress this task ASAP so we can free a ton of space on analytics-store :)

The command

time sudo -u hdfs sqoop import --as-avrodatafile --password-file '/user/hdfs/mysql-analytics-research-client-pw.txt' --username research --connect jdbc:mysql://analytics-store.eqiad.wmnet/log --table PageContentSaveComplete_5588433 --target-dir /tmp/PageContentSaveComplete > log-2017-08-08-PageContentSaveComplete_5588433.txt

is failing with:

2017-08-08 22:43:30,542 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1498042433999_174599_m_000002_0: Error: PageContentSaveComplete_5588433 : Unsupported major.minor version 52.0

https://yarn.wikimedia.org/cluster/app/application_1498042433999_174599

Nuria updated the task description.

mmm, Java mismatch? The cluster is running 7 but the local version is set to 8?

https://community.cloudera.com/t5/Data-Ingestion-Integration/Oracle-BigDataLite-4-2-1-Sqoop-Error/td-p/32449

Bingooo! So running sqoop from 1005 does not work; it needs to be run from analytics1003, where the JDK is 1.7:

nuria@analytics1003:~$ java -version
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
java version "1.7.0_131"
OpenJDK Runtime Environment (IcedTea 2.6.9) (7u131-2.6.9-2~deb8u1)
OpenJDK 64-Bit Server VM (build 24.131-b00, mixed mode)
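
For reference, class file major version 52 corresponds to Java 8, so the error simply means Java 8 bytecode reached a Java 7 JVM. A quick way to compare the two hosts (a sketch; assumes SSH access, and the full hostnames are assumptions based on the hosts mentioned in this task):

# "Unsupported major.minor version 52.0" = Java 8 bytecode on a Java 7 JVM.
# java -version prints to stderr, hence the 2>&1 redirect.
for host in analytics1003.eqiad.wmnet analytics1005.eqiad.wmnet; do
  echo "== $host =="
  ssh "$host" 'java -version' 2>&1 | head -n 3
done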

Nuria changed the point value for this task from 3 to 5. · Aug 9 2017, 12:20 AM

The job to create Avro files takes 10 minutes for a 60G table.

The job to create text files takes twice as long, about 20 mins.

17 mins to create the Hive table.

Still having problems importing tinyint fields:
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_mysql_import_of_tinyint_1_from_mysql_behaves_strangely

Command:
time sudo -u hdfs sqoop import --as-avrodatafile --password-file '/user/hdfs/mysql-analytics-research-client-pw.txt' --username research --connect jdbc:mysql://analytics-store.eqiad.wmnet/log?tinyInt1isBit=false --query 'select id,uuid,convert(timestamp using utf8) timestamp,convert(webHost using utf8) webhost,wiki,event_isAPI,event_isMobile,convert(event_revisionId using utf8) event_revisionId from PageContentSaveComplete_5588433 where $CONDITIONS' --map-column-java event_isAPI=Integer,event_isMobile=Integer --target-dir /tmp/PageContentSaveCompleteAvro --split-by id
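
One quick way to eyeball the resulting files (a sketch; assumes an avro-tools jar is available on the host):

# Copy one part file out of HDFS and dump its first records as JSON,
# to check that the tinyint columns came through as ints.
hdfs dfs -copyToLocal /tmp/PageContentSaveCompleteAvro/part-m-00000.avro .
java -jar avro-tools.jar tojson part-m-00000.avro | head -n 5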

Ok, so the Avro import is working fine; I just opened the Avro files to verify:

{
  "id" : { "int" : 1 },
  "uuid" : { "bytes" : "f2becaa8657e523797fdb2d247716543" },
  "timestamp" : { "string" : "20170329170309" },
  "webhost" : { "string" : "ru.wikisource.org" },
  "wiki" : { "bytes" : "ruwikisource" },
  "event_isAPI" : { "int" : 1 },
  "event_isMobile" : { "int" : 0 },
  "event_revisionId" : { "int" : 1172067 }
}

What doesn't work is the mapping of the last three fields to a Hive table; since the data is there, that seems a solvable problem, so I am just going to import the larger table.

In fact, creating the Avro table from the schema, like:

CREATE EXTERNAL TABLE PageContentSaveComplete_5588433
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://analytics-hadoop/tmp/PageContentSaveCompleteAvro'
TBLPROPERTIES ('avro.schema.url'='hdfs://analytics-hadoop/tmp/PageContentSaveCompleteAutoGeneratedSchema.avsc');
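
Once defined, the table can be queried like any other Hive table; a minimal sanity query (a sketch; column names as in the Avro record above):

# Pull a few rows to confirm the SerDe resolves the schema correctly.
hive -e 'SELECT id, wiki, `timestamp` FROM PageContentSaveComplete_5588433 LIMIT 5;'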

Data can be read fine.

The pagecontentsavecomplete_5588433_15423246 table was created in the wmf database.

The size of the Avro files is well below 400G (about 40G). We dropped one column, but that alone seems unlikely to account for such a difference:

nuria@analytics1003:~$ sudo -u hdfs hdfs dfs -ls -h /wmf/data/archive/eventlogging/PageContentSaveComplete_5588433_15423246
Found 5 items
-rw-r--r-- 3 hdfs hadoop 0 2017-08-11 20:46 /wmf/data/archive/eventlogging/PageContentSaveComplete_5588433_15423246/_SUCCESS
-rw-r--r-- 3 hdfs hadoop 9.3 G 2017-08-11 20:46 /wmf/data/archive/eventlogging/PageContentSaveComplete_5588433_15423246/part-m-00000.avro
-rw-r--r-- 3 hdfs hadoop 9.6 G 2017-08-11 20:27 /wmf/data/archive/eventlogging/PageContentSaveComplete_5588433_15423246/part-m-00001.avro
-rw-r--r-- 3 hdfs hadoop 9.8 G 2017-08-11 20:30 /wmf/data/archive/eventlogging/PageContentSaveComplete_5588433_15423246/part-m-00002.avro
-rw-r--r-- 3 hdfs hadoop 9.9 G 2017-08-11 20:29 /wmf/data/archive/eventlogging/PageContentSaveComplete_5588433_15423246/part-m-00003.avro
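
The directory total can be read in one shot with the standard HDFS CLI:

# -du -s sums the whole directory; the four part files above add up to ~38.6G.
sudo -u hdfs hdfs dfs -du -s -h /wmf/data/archive/eventlogging/PageContentSaveComplete_5588433_15423246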

Let's do some spot checking before we delete the MySQL table (cc @elukey).

Been documenting things here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Administration#Dumping_data_via_sqoop_from_eventlogging_to_hdfs

Will document for end users once we call it good

Spot checking (note that the way I set up the Avro table in Hive is really not performant, because the Avro schema is hardcoded in the table, but functionally it works). A scripted version of these paired checks is sketched after the list:

  • number of records is the same

MySQL:
mysql:research@analytics-store.eqiad.wmnet [log]> select count(*) from PageContentSaveComplete_5588433_15423246;
1291270247

Hive:
Total MapReduce CPU Time Spent: 0 days 6 hours 18 minutes 51 seconds 710 msec
1291270247

  • select distinct wiki from PageContentSaveComplete_5588433_15423246 where timestamp like '2015070101%' ;

161 rows returned in both hadoop and mysql

  • select distinct wiki from PageContentSaveComplete_5588433_15423246 where timestamp like '20150701010101' ;

6 rows returned in both systems

  • select * from PageContentSaveComplete_5588433_15423246 where id="650748079";

Same result

  • select distinct wiki from PageContentSaveComplete_5588433_15423246 where timestamp like '2015070101%' and event_isMobile=1 ;

Same result
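
A sketch of how these paired checks could be scripted (hypothetical wrapper; assumes the mysql and hive CLIs, the research credentials used above, and that the Hive table lives in the wmf database):

#!/bin/bash
# Run the same query against MySQL and Hive, then diff the sorted output.
# Depending on the Hive version, `timestamp` may need backticks.
QUERY="select distinct wiki from PageContentSaveComplete_5588433_15423246 where timestamp like '2015070101%'"

mysql -h analytics-store.eqiad.wmnet -u research -p \
  --batch --skip-column-names -e "$QUERY" log | sort > mysql.out
hive -S --database wmf -e "$QUERY" | sort > hive.out

diff mysql.out hive.out && echo "spot check OK"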

ping @elukey: I think we can drop the table on MySQL; please do a couple of selects on Hive to confirm that you have access.

Sanity check from my side:

HDFS:
select count(*) from PageContentSaveComplete_5588433_15423246 ;
1291270247

dbstore1002:
mysql:research@analytics-store.eqiad.wmnet [log]> select count(*) from PageContentSaveComplete_5588433_15423246;
+------------+
| count(*)   |
+------------+
| 1291270247 |
+------------+

Dropped from slaves but not master.

We noticed only a modest 100GB drop in disk usage on slave/store after dropping the table. It seems that what is reported by the following query (which we used to find the 400GB size of the table) does not reflect what TokuDB actually stores on disk:

SELECT
     -- table_name AS `Table`,
     round(((data_length + index_length) / 1024 / 1024 / 1024), 2) `Size in GB`
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'log'
ORDER BY (data_length + index_length) DESC;

On m4-master, which is the only one still holding the table:

root@db1046:/srv/sqldata# du -hsc _log_*PageContentSaveComplete*
693M	_log_PageContentSaveComplete_5588433_key_ix_PageContentSaveComplete_5588433_timestamp_116cc80eb_3_1b_P_0.tokudb
4.1G	_log_PageContentSaveComplete_5588433_key_ix_PageContentSaveComplete_5588433_uuid_116cc80e7_3_1b_B_0.tokudb
9.3G	_log_PageContentSaveComplete_5588433_main_116cc80e4_2_1b.tokudb
64K	_log_PageContentSaveComplete_5588433_status_116cc80e4_1_1b.tokudb
3.1G	_log_sql_6dc5_407_key_ix_PageContentSaveComplete_5588433_timestamp_902ff62_4_1b.tokudb
17G	_log_sql_6dc5_407_key_ix_PageContentSaveComplete_5588433_uuid_902ff62_3_1b.tokudb
32K	_log_sql_a10_4d898c_key_ix_TrackedPageContentSaveComplete_8535426_timestamp_5a0a626c_4_1b.tokudb
32K	_log_sql_a10_4d898c_key_ix_TrackedPageContentSaveComplete_8535426_uuid_5a0a626c_3_1b.tokudb
34G	total

Note that check_size reports this:

Table	Size in MB
PageContentSaveComplete_5588433_15423246	339968.06

So there is a disparity between what is effectively shown on disk and what is shown by information_schema, probably due to TokuDB's compression?

Another test that I did was summing up all the files in dbstore1001:/srv/sqldata/log_* (all files of the log database) and comparing that with the sum of the table sizes reported by information_schema. The results were respectively 1.3T vs 3.3T.
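
That comparison can be reproduced roughly like this on the dbstore host (a sketch; the glob follows the TokuDB file naming shown above, and the mysql invocation assumes socket auth as root):

# What TokuDB actually keeps on disk for the log database...
du -sch /srv/sqldata/_log_* | tail -n 1

# ...versus what information_schema claims for the same schema.
mysql --batch --skip-column-names -e "
  SELECT ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
  FROM information_schema.TABLES
  WHERE TABLE_SCHEMA = 'log';"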

To summarize, we do not get back all the space shown in information_schema when we drop a table.

Today I even restarted mysql on dbstore1002 to ensure that no files were accidentally kept open by the mysql process, but nothing changed.

Mentioned in SAL (#wikimedia-operations) [2017-08-28T14:31:00Z] <elukey> drop PageContentSaveComplete_5588433_15423246 from the log database on db1046 (m4-master) - T170720

Before we close this, we should move the tables that Nuria created in Hive to a new 'archive' database.
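
One possible way to do that move (a sketch; assumes the Hive version in use supports cross-database renames, which older releases do not):

# Create the target database if needed, then move the table out of wmf.
# For an external table this only changes metadata; the Avro files under
# /wmf/data/archive/eventlogging/ stay where they are.
hive -e "
  CREATE DATABASE IF NOT EXISTS archive;
  ALTER TABLE wmf.pagecontentsavecomplete_5588433_15423246
    RENAME TO archive.pagecontentsavecomplete_5588433_15423246;"

If the rename is not supported, recreating the external table in the new database with the same LOCATION and then dropping the old definition achieves the same result.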