Page MenuHomePhabricator

Drop Navigationtiming data entirely from mysql storage?
Closed, ResolvedPublic5 Story Points

Description

Pinging perf team to see if we can drop NavigationTiming data entirely from MySQL storage , data will continue be in hadoop as is now without changes but there is some older data (years older) that is present on this host that we are not sure is needed.

Event Timeline

Nuria created this task.Sep 26 2019, 1:39 AM
elukey added a subscriber: elukey.Sep 26 2019, 8:13 AM

Space that can be potentially recovered:

elukey@db1107:~$ du -hsc /srv/sqldata/_log_NavigationTiming_1* | grep total
74G	total

elukey@db1108:~$ du -hsc /srv/sqldata/_log_NavigationTiming_* | grep total
145G	total
Nuria added a subscriber: Gilles.Sep 27 2019, 5:52 PM

ping @Gilles to confirm this data can be dropped from mysql

There is a lot of old data that isn't in Hadoop (see below). Ideally it should be moved to Hadoop, at minimum the archives in MySQL should be kept. This is very valuable historical data for our team.

YearMySQL records (in millions)Hive records (in millions)
2013400
2014147.80
2015140.00
2016147.30
2017131.117.7
2018245.5247.7

Based on the above, only the following NavigationTiming tables are safe to delete in MySQL right now:

NavigationTiming_17703215
NavigationTiming_17729222
NavigationTiming_17740037
NavigationTiming_17854363
NavigationTiming_17957960
NavigationTiming_17975587
NavigationTiming_18156125
NavigationTiming_18358840
NavigationTiming_18435897
NavigationTiming_18636169
NavigationTiming_18813782
NavigationTiming_18917904
NavigationTiming_18988839
NavigationTiming_19147339
NavigationTiming_19157653

fdans added a subscriber: fdans.Sep 30 2019, 4:04 PM

Dropping 2017 and 2018 data (hive records)

fdans triaged this task as High priority.Sep 30 2019, 4:05 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.
Gilles moved this task from Inbox to Radar on the Performance-Team board.Sep 30 2019, 8:09 PM
Gilles edited projects, added Performance-Team (Radar); removed Performance-Team.

@fdans do you mean that you're going to drop SQL records where we have equivalent records in Hadoop by comparing some ids?

As you can see in my table, 2017 only has very partial data in Hadoop, you can't drop that whole year from SQL without data loss.

Nuria added a comment.Sep 30 2019, 8:13 PM

@Gilles Correction: in this case data drop is only 2018 year

elukey added a comment.Oct 8 2019, 5:54 AM

Data from 2018 in various tables:

elukey@db1108:~$ for table in `cat el_nav_tables`; do echo $table; sudo mysql log -e "select count(1) from log.$table where timestamp like '2018%'"; done
NavigationTiming_10076863
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_10374055
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_10785754
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_12405818
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_13317958
+----------+
| count(1) |
+----------+
|      191 |
+----------+
NavigationTiming_13332008
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_14899847
+----------+
| count(1) |
+----------+
|        3 |
+----------+
NavigationTiming_15033442
+----------+
| count(1) |
+----------+
|        5 |
+----------+
NavigationTiming_15396488
+----------+
| count(1) |
+----------+
|        4 |
+----------+
NavigationTiming_15485142
+----------+
| count(1) |
+----------+
|       69 |
+----------+
NavigationTiming_15485142_15423246
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_16305090
+----------+
| count(1) |
+----------+
|       36 |
+----------+
NavigationTiming_16305090_15423246
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_17216284
+----------+
| count(1) |
+----------+
| 15020983 |
+----------+
NavigationTiming_17703215
+----------+
| count(1) |
+----------+
|  2269526 |
+----------+
NavigationTiming_17729222
+----------+
| count(1) |
+----------+
|  3027325 |
+----------+
NavigationTiming_17740037
+----------+
| count(1) |
+----------+
| 21187339 |
+----------+
NavigationTiming_17854363
+----------+
| count(1) |
+----------+
| 31754109 |
+----------+
NavigationTiming_17957960
+----------+
| count(1) |
+----------+
|  6358689 |
+----------+
NavigationTiming_17975587
+----------+
| count(1) |
+----------+
| 65598839 |
+----------+
NavigationTiming_18156125
+----------+
| count(1) |
+----------+
| 66521111 |
+----------+
NavigationTiming_18358840
+----------+
| count(1) |
+----------+
|  6903790 |
+----------+
NavigationTiming_18435897
+----------+
| count(1) |
+----------+
| 16757706 |
+----------+
NavigationTiming_18636169
+----------+
| count(1) |
+----------+
| 10431016 |
+----------+
NavigationTiming_18813782
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_18917904
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_18988839
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_19147339
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_19157653
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_5323808
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_5333197
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_5336845
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_5832704
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_6703470
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_7494934
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_8365252
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_8477778
+----------+
| count(1) |
+----------+
|        0 |
+----------+
elukey added a comment.Oct 8 2019, 6:07 AM

To recap:

  • Drop table
NavigationTiming_17703215
NavigationTiming_17729222
NavigationTiming_17740037
NavigationTiming_17854363
NavigationTiming_17957960
NavigationTiming_17975587
NavigationTiming_18156125
NavigationTiming_18358840
NavigationTiming_18435897
NavigationTiming_18636169
NavigationTiming_18813782
NavigationTiming_18917904
NavigationTiming_18988839
NavigationTiming_19147339
NavigationTiming_19157653
  • delete from log.$table where timestamp like '2018%'
elukey added a comment.Oct 8 2019, 6:09 AM

Correction: among the tables to drop I can see some of them containing data from 2018:

NavigationTiming_17703215
+----------+
| count(1) |
+----------+
|  2269526 |
+----------+

NavigationTiming_17729222
+----------+
| count(1) |
+----------+
|  3027325 |
+----------+

NavigationTiming_17740037
+----------+
| count(1) |
+----------+
| 21187339 |
+----------+

NavigationTiming_17854363
+----------+
| count(1) |
+----------+
| 31754109 |
+----------+

....

Am I checking the timestamp in the wrong way or should we recheck @Gilles?

Gilles added a comment.Oct 8 2019, 8:18 AM

Yes, data from 2018 is safe to drop since we have it in Hadoop already. This all seems correct to me.

elukey added a comment.Oct 8 2019, 8:19 AM

Yes, data from 2018 is safe to drop since we have it in Hadoop already. This all seems correct to me.

Ah sorry misread your comment before, 2017 is the partial one. Good then :)

elukey added a comment.Oct 8 2019, 8:21 AM

Triple checked 2017 data on the tables listed by Gilles:

elukey@db1108:~$ for table in `cat el_nav_tables_no_2017`; do echo $table; sudo mysql log -e "select count(1) from log.$table where timestamp like '2017%'"; done
NavigationTiming_17703215
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_17729222
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_17740037
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_17854363
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_17957960
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_17975587
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_18156125
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_18358840
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_18435897
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_18636169
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_18813782
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_18917904
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_18988839
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_19147339
+----------+
| count(1) |
+----------+
|        0 |
+----------+
NavigationTiming_19157653
+----------+
| count(1) |
+----------+
|        0 |
+----------+

Safe to drop :)

elukey added a comment.Oct 8 2019, 8:22 AM

To recap:

  • Drop table
NavigationTiming_17703215
NavigationTiming_17729222
NavigationTiming_17740037
NavigationTiming_17854363
NavigationTiming_17957960
NavigationTiming_17975587
NavigationTiming_18156125
NavigationTiming_18358840
NavigationTiming_18435897
NavigationTiming_18636169
NavigationTiming_18813782
NavigationTiming_18917904
NavigationTiming_18988839
NavigationTiming_19147339
NavigationTiming_19157653
  • delete from log.$table where timestamp like '2018%'

@Nuria request for review, then I think we can proceed..

Nuria added a comment.Oct 28 2019, 7:20 PM

Spot checked these and while they have data for 2018/2019 I can not see any for 2017 so per @Gilles comment above they should be safe to drop.

Ottomata reassigned this task from Ottomata to elukey.Oct 30 2019, 3:12 PM
elukey added a subscriber: mforns.Nov 4 2019, 8:52 AM

@mforns When you are online can you ping me? I'd like to drop the above tables but with somebody triple checking what I am doing :)

mforns added a comment.Nov 5 2019, 4:19 PM

@elukey @Nuria @Gilles

EventLogging data was first enabled in Hive on 2017-11-20T19:00:00Z.
I believe that is why we have partial data on 2017, proportion seems to match.
We could theoretically drop data with timestamp >= '20171120190000'.

elukey added a comment.Nov 5 2019, 4:55 PM

To recap:

  • Drop table
NavigationTiming_17703215
NavigationTiming_17729222
NavigationTiming_17740037
NavigationTiming_17854363
NavigationTiming_17957960
NavigationTiming_17975587
NavigationTiming_18156125
NavigationTiming_18358840
NavigationTiming_18435897
NavigationTiming_18636169
NavigationTiming_18813782
NavigationTiming_18917904
NavigationTiming_18988839
NavigationTiming_19147339
NavigationTiming_19157653

Dropped tables on db1107 and db1108 with Marcel, dropped ~70G of data. I think that we could stop here without delete from log.$table where timestamp like '2018%', what do you think @Nuria?

elukey set the point value for this task to 5.Nov 5 2019, 4:55 PM
elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.
Nuria added a comment.Nov 5 2019, 7:15 PM

+1 to @elukey I think we are good

elukey moved this task from In Progress to Done on the Analytics-Kanban board.Nov 6 2019, 12:00 PM
Nuria closed this task as Resolved.Nov 7 2019, 11:08 PM