Restore MobileWebSectionUsage_14321266 and MobileWebSectionUsage_15038458
Closed, Resolved (Public)

Description

As discussed on IRC (log), these two tables were erroneously deleted from analytics-store during a troubleshooting session today, based on the mistaken assumption that they are "unqueriable" anyway (various successful analyses have been done with the data, see e.g. T118041) and apparently without awareness of a recent thread about these tables on Analytics-l.

The suggestion was to access the data in Hadoop instead, but as discussed on IRC, this is not an equivalent alternative and would require at least considerable additional setup and conversion work (e.g. to adapt the existing queries from SQL to HQL).
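For illustration, here is a rough sketch of what adapting one of the existing daily-count queries might look like in HiveQL; the Hive database/table name and the year/month/day partition columns below are assumptions made for the sake of the example, not the actual layout of the EventLogging data in Hadoop:

-- Existing query against analytics-store (MariaDB):
SELECT LEFT(timestamp, 8) AS date, COUNT(*) AS events
FROM log.MobileWebSectionUsage_15038458
GROUP BY date ORDER BY date;

-- Hypothetical HiveQL equivalent (assumes a partitioned Hive table
-- named event.MobileWebSectionUsage_15038458; the real layout may differ):
SELECT CONCAT(CAST(year AS STRING),
              LPAD(CAST(month AS STRING), 2, '0'),
              LPAD(CAST(day AS STRING), 2, '0')) AS dt,
       COUNT(*) AS events
FROM event.MobileWebSectionUsage_15038458
GROUP BY year, month, day
ORDER BY dt;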

(To be clear, this is separate from the decision to blacklist them - i.e. stop logging events. On Monday, @Jdlrobson and I had already resolved to set the sample rate to zero. IMHO the incident also makes clear that we need to improve decision and communication processes about such measures in general - but that's a separate discussion.)

Event Timeline

Tbayer raised the priority of this task to Needs Triage.
Tbayer updated the task description.
Tbayer added a project: Analytics.

Tables will start existing again once the blacklisting is lifted; let us know when the new sampling ratio has taken effect.

The suggestion was to access the data in Hadoop instead, but as discussed on IRC, this is not an equivalent alternative and would require at least considerable additional setup and conversion work (e.g. to adapt the existing queries from SQL to HQL).

We can help you with that as need be.

Tables will start existing again once the blacklisting is lifted; let us know when the new sampling ratio has taken effect.

I understand from Jon that the change he committed (in effect turning off the events) is going live fully today. It's OK to discard the data generated since Monday; this task is about restoring the data collected up to then.

@Tbayer:

Given the many issues we have in our data store right now:

Hardware: https://phabricator.wikimedia.org/T123546
Replication: https://phabricator.wikimedia.org/T123634#1934482
Disk getting full: https://phabricator.wikimedia.org/T120187

It is not likely we will restore the tables until those are resolved.

As we have said, the data is available in Hadoop. Given how problematic it has been to maintain the MySQL instance, and that data on the slaves might not catch up to the master for days, it is probably worth giving it a shot to get the data from the cluster.

Understood that these are timely and severe issues; I really appreciate the Analytics team's hard work on fixing them. I should check with other stakeholders to be certain, but from my perspective further analysis on these two tables could wait until the middle of next week (January 20); would that be a realistic timeframe for restoring them? BTW I talked a bit with @madhuvishy yesterday about what restoring these tables entails technically; I understand from her that it's not a huge technical challenge, but it will require some time for the data transfer from one server to the other to complete.

would that be a realistic timeframe for restoring them?

Before adding more data we need to do the TokuDB conversion; once we have an ETA for that we will update this ticket. The DBA is on vacation, and I am sure this is one of many issues that will need his attention when he comes back.
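(For reference, a per-table engine conversion in MariaDB is usually a single statement like the sketch below, assuming the TokuDB plugin is enabled on the host; the actual procedure the DBA uses may well differ.)

-- Hypothetical sketch only; the real conversion is handled by the DBA.
ALTER TABLE log.MobileWebSectionUsage_15038458 ENGINE=TokuDB;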

@Tbayer:

Given the many issues we have in our data store right now:

Hardware: https://phabricator.wikimedia.org/T123546
Replication: https://phabricator.wikimedia.org/T123634#1934482
Disk getting full: https://phabricator.wikimedia.org/T120187

It is not likely we will restore the tables until those are resolved.

OK, I've added the first and the third as blocking tasks (T123634 has been resolved in the meantime). That's based on the above explanation, as I lack the expertise myself to understand the connection between these two tables and that broken memory chip, or the disk space issue for that matter (I thought that freeing up disk space had not been the main motivation for dropping these; according to https://phabricator.wikimedia.org/T123634#1935198 there are several EL tables that are much larger, up to more than 500 GB - IIRC these two tables were at about 50-ish GB a week or two ago).

...but actually it seems that the newer of the two tables (MobileWebSectionUsage_15038458) has already been fully restored? That would be awesome news.

mysql:research@analytics-store.eqiad.wmnet [(none)]> SELECT table_name, (DATA_LENGTH + INDEX_LENGTH)/1024/1024/1024 as `TOTAL SIZE (GB)` FROM information_schema.tables WHERE table_name LIKE 'MobileWebSection%';
+--------------------------------+-----------------+
| table_name                     | TOTAL SIZE (GB) |
+--------------------------------+-----------------+
| MobileWebSectionUsage_14321266 |  0.000357764773 |
| MobileWebSectionUsage_15038458 | 86.475565684959 |
+--------------------------------+-----------------+

It seems it is also receiving new events (though as noted above, we don't really expect or need it to record new data at this point, as the instrumentation was already deactivated last week - so these might be "stragglers": events generated from cached code or the like):

mysql:research@analytics-store.eqiad.wmnet [(none)]> SELECT MIN(timestamp), MAX(timestamp) FROM log.MobileWebSectionUsage_15038458;
+----------------+----------------+
| MIN(timestamp) | MAX(timestamp) |
+----------------+----------------+
| 20151215231911 | 20160120235636 |
+----------------+----------------+

It seems the "disk getting full" blocking task is now resolved too.

@jcrespo or others, could you confirm that the table that has already reappeared (MobileWebSectionUsage_15038458, see previous comment) can be used again? Looking forward to seeing the other table restored with its previous data as well.

Nuria triaged this task as Medium priority. Jan 26 2016, 5:49 PM
Nuria edited projects, added Analytics-Kanban; removed Analytics.

@Nuria any chance we can bump the priority of this and the associated tasks? The reading team is looking to lazy-load images as part of our quarterly goal, and the results of this analysis are delaying us from progressing. We had been hoping to start this Monday...

...but actually it seems that the newer of the two tables (MobileWebSectionUsage_15038458) has already been fully restored?

I believe so. There was a lot that @Nuria and I did not know about the EL databases when we erroneously deleted these tables. I still don't fully understand all the custom replication and multiple slaves. If it looks like all the data is there, then likely it was just automatically re-synced from somewhere we didn't know about.

Do you need MobileWebSectionUsage_14321266 restored as well?

Oh, @Tbayer says as much. I'll look into it now...

The table MobileWebSectionUsage_15038458 is already there from what I can see; is that sufficient to do the analysis?

@Ottomata: let me know if you think we can restore MobileWebSectionUsage_14321266 from the 'other' db host (was it 1046?).

@Jdlrobson: if we can restore the table from a MySQL copy of it, we might be able to do it in not too much time. Again, the data is not lost; it is in Hadoop.

I think it has been restored...

mysql:research@s1-analytics-slave.eqiad.wmnet [log]> select min(timestamp), max(timestamp) from MobileWebSectionUsage_14321266;
+----------------+----------------+
| min(timestamp) | max(timestamp) |
+----------------+----------------+
| 20151103201401 | 20160127201839 |
+----------------+----------------+

mysql:research@s1-analytics-slave.eqiad.wmnet [log]> select count(*) from MobileWebSectionUsage_14321266;
+-----------+
| count(*)  |
+-----------+
| 119487911 |
+-----------+

Hm, but looking at the size (repeat from https://phabricator.wikimedia.org/T123595#1950564 ), MobileWebSectionUsage_14321266 has not actually been restored yet :(

mysql:research@analytics-store.eqiad.wmnet [(none)]> SELECT table_name, (DATA_LENGTH + INDEX_LENGTH)/1024/1024/1024 as `TOTAL SIZE (GB)` FROM information_schema.tables WHERE table_name LIKE 'MobileWebSection%';
+--------------------------------+-----------------+
| table_name                     | TOTAL SIZE (GB) |
+--------------------------------+-----------------+
| MobileWebSectionUsage_14321266 |  0.001350170933 |
| MobileWebSectionUsage_15038458 | 86.486242502928 |
+--------------------------------+-----------------+

I learned about the miracle today. :)

I'm syncing MobileWebSectionUsage_14321266 from the backup master on db1047 now. It is slowly inserting into analytics-store. Will post with progress tomorrow.

Ok, MobileWebSectionUsage_14321266 has been restored to m4-master. I'll need to do a manual sync from m4-master to analytics-store. @jcrespo's resync of EL tables after last weekend's outage is still happening in T120187, and I don't want to slow analytics-store down any more during the day.

I'll wait until tonight and then start a full sync of MobileWebSectionUsage_14321266.

Just started sync from db1047 to dbstore1002, running on db1047.

mysqldump --single-transaction --insert-ignore --no-create-info --skip-add-locks log MobileWebSectionUsage_14321266 | pv | mysql -h dbstore1002.eqiad.wmnet log
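(For the record: --single-transaction takes a consistent snapshot without locking the source, --insert-ignore emits INSERT IGNORE statements so rows already present on the target are skipped, --no-create-info leaves out the CREATE TABLE statement, --skip-add-locks omits LOCK TABLES around the inserts, and pv just shows transfer progress.)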

Ok!

mysql:research@analytics-store.eqiad.wmnet [log]> SELECT table_name, (DATA_LENGTH + INDEX_LENGTH)/1024/1024/1024 as `TOTAL SIZE (GB)` FROM information_schema.tables WHERE table_name LIKE 'MobileWebSection%';
+--------------------------------+-----------------+
| table_name                     | TOTAL SIZE (GB) |
+--------------------------------+-----------------+
| MobileWebSectionUsage_14321266 | 44.117207354866 |
| MobileWebSectionUsage_15038458 | 86.496098669246 |
+--------------------------------+-----------------+
mysql:research@analytics-store.eqiad.wmnet [log]> select min(timestamp), max(timestamp) from MobileWebSectionUsage_14321266;
+----------------+----------------+
| min(timestamp) | max(timestamp) |
+----------------+----------------+
| 20151103201401 | 20160201145300 |
+----------------+----------------+

Should be good to go, ja?!

Great, thanks everyone!
As a test of the integrity of the restored MobileWebSectionUsage_14321266, I re-ran the first two queries from https://phabricator.wikimedia.org/T118041#1835201 and got almost the same results (511290 now vs. 511290 then, and 156017 now vs. 156018 then, respectively). That should be good enough.

PS (just to record this here, mostly for myself):

I did a quick check of how the total number of events per day developed for both tables (as a sanity check, because I have been running queries over the entire table, and also because Grafana shows 0 until Jan 26). Both stopped sending events on Jan 12/13, but picked up again on January 18 at a very low rate (<1% of the previous average for MobileWebSectionUsage_15038458). Both still have events trickling in even now, at an even lower rate.

SELECT LEFT(timestamp, 8) AS date, COUNT(*) AS events FROM log.MobileWebSectionUsage_15038458 GROUP BY date ORDER BY date
--------------

date    events
20151215        1
20151216        21187
20151217        1608322
20151218        6352221
20151219        6849604
20151220        7725212
20151221        6741613
20151222        6511682
20151223        6714214
20151224        6723976
20151225        7480533
20151226        8128354
20151227        8486421
20151228        7829595
20151229        7706630
20151230        7549203
20151231        7378455
20160101        8635484
20160102        9518818
20160103        10010413
20160104        8212235
20160105        8082635
20160106        8251934
20160107        8152006
20160108        8358187
20160109        9252847
20160110        10073266
20160111        9454764
20160112        8561344
20160113        6613485
20160118        45769
20160119        88011
20160120        40973
20160121        7943
20160122        9151
20160123        9619
20160124        8761
20160125        2624
20160126        1162
20160127        999
20160128        791
20160129        750
20160130        698
20160131        702
20160201        585
20160202        547
20160203        458
20160204        435
20160205        253
20160206        185
20160207        191
20160208        156
20160209        168
20160210        167
20160211        149
20160212        140
20160213        140
20160214        121
20160215        103
20160216        104
20160217        92
20160218        101
20160219        94
20160220        108
20160221        118
20160222        86
20160223        80
20160224        71
20160225        75
20160226        70
20160227        73
20160228        74
20160229        61
20160301        40


SELECT LEFT(timestamp, 8) AS date, COUNT(*) AS events FROM log.MobileWebSectionUsage_14321266 GROUP BY date ORDER BY date
--------------

date    events
20151103        6
20151104        4609
20151105        251269
20151106        2478986
20151107        2850014
20151108        3387017
20151109        2878178
20151110        2775752
20151111        2854553
20151112        2835381
20151113        2729154
20151114        2987378
20151115        3434902
20151116        2946675
20151117        2873347
20151118        2831533
20151119        2853299
20151120        2728653
20151121        3096311
20151122        3377825
20151123        3020065
20151124        2775824
20151125        2825577
20151126        2855724
20151127        2333657
20151128        3045383
20151129        3389315
20151130        2844030
20151201        2622658
20151202        1880859
20151203        1918811
20151204        2708733
20151205        2956732
20151206        3346403
20151207        2895370
20151208        2830068
20151209        2729954
20151210        2736122
20151211        2692762
20151212        3014296
20151213        3395182
20151214        2879138
20151215        2684231
20151216        2696515
20151217        2041320
20151218        56530
20151219        29547
20151220        21888
20151221        15654
20151222        12030
20151223        9815
20151224        8007
20151225        6332
20151226        3764
20151227        3495
20151228        3490
20151229        2563
20151230        2100
20151231        1524
20160101        1577
20160102        1861
20160103        1897
20160104        1280
20160105        1486
20160106        1598
20160107        1367
20160108        1191
20160109        1190
20160110        1108
20160111        1299
20160112        546
20160118        324
20160119        672
20160120        450
20160121        252
20160122        372
20160123        382
20160124        409
20160125        204
20160126        156
20160127        145
20160128        162
20160129        153
20160130        132
20160131        126
20160201        128
20160202        107
20160203        109
20160204        90
20160205        73
20160206        59
20160207        80
20160208        62
20160209        68
20160210        72
20160211        66
20160212        64
20160213        61
20160214        56
20160215        63
20160216        59
20160217        42
20160218        45
20160219        56
20160220        41
20160221        44
20160222        38
20160223        55
20160224        48
20160225        42
20160226        41
20160227        42
20160228        34
20160229        46
20160301        29