Page MenuHomePhabricator

Hive EventLogging tables not updating since January 26
Closed, ResolvedPublic

Description

As noted earlier at T185952#3930033 (similar for other tables from the event database):

SELECT year, month, day, CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')) AS date, COUNT(*) AS events FROM event.navigationtiming WHERE year =2018 GROUP BY year, month, day ORDER BY year, month, day LIMIT 10000;

year	month	day	date	events
2018	1	1	2018-01-01	351864
2018	1	2	2018-01-02	377098
2018	1	3	2018-01-03	374133
2018	1	4	2018-01-04	371390
2018	1	5	2018-01-05	360239
2018	1	6	2018-01-06	367106
2018	1	7	2018-01-07	401630
2018	1	8	2018-01-08	401073
2018	1	9	2018-01-09	383230
2018	1	10	2018-01-10	378587
2018	1	11	2018-01-11	373086
2018	1	12	2018-01-12	361172
2018	1	13	2018-01-13	358843
2018	1	14	2018-01-14	395842
2018	1	15	2018-01-15	396190
2018	1	16	2018-01-16	389693
2018	1	17	2018-01-17	386363
2018	1	18	2018-01-18	379750
2018	1	19	2018-01-19	362491
2018	1	20	2018-01-20	357370
2018	1	21	2018-01-21	401910
2018	1	22	2018-01-22	403286
2018	1	23	2018-01-23	393751
2018	1	24	2018-01-24	385242
2018	1	25	2018-01-25	380481
2018	1	26	2018-01-26	50555
26 rows selected (54.464 seconds)


SELECT MAX(dt) FROM event.navigationtiming WHERE year = 2018 AND month = 1 AND day = 26;

_c0
2018-01-26T03:59:59

Event Timeline

Restricted Application added a project: Analytics. · View Herald TranscriptJan 31 2018, 5:52 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Nuria edited projects, added Analytics-Kanban; removed Analytics.Jan 31 2018, 6:44 PM
Nuria added a subscriber: Nuria.Jan 31 2018, 10:20 PM

There is data on the EL raw datastore:

nuria@stat1004:~$ ls /mnt/hdfs/wmf/data/raw/eventlogging/eventlogging_NavigationTiming/hourly/2018/01/26/23
eventlogging_NavigationTiming.13.0.12479.337252222.1517007600000 eventlogging_NavigationTiming.13.0.1275.337239743.1517007600000
nuria@stat1004:~$ ls /mnt/hdfs/wmf/data/raw/eventlogging/eventlogging_NavigationTiming/hourly/2018/01/30/23
eventlogging_NavigationTiming.13.0.1275.338793978.1517353200000 eventlogging_NavigationTiming.13.0.13094.338807074.1517353200000

Thus issues are with json refine only

Also, as noted earlier at https://phabricator.wikimedia.org/T185952#3930033 ([2]), the MySQL tables seem up to date.

Investigating! The same JsonRefine job has been stuck since Jan 26: https://yarn.wikimedia.org/cluster/app/application_1515441536446_52892

Elapsed:	155hrs, 23mins, 31sec
Nuria added a comment.Feb 1 2018, 5:35 PM

Let's make sure after this investigation is complete we set up some alarms for jsonrefine similar to the ones we have on the mysql end so we can see when throughput (like in this case) has gone to 0.

Ok, still not sure why that one job was stuck, but after killing it, the next scheduled run seemed to have work. @Tbayer, let us know if it is looking better now for ya.

On to more docs and some alarms...

Ok, still not sure why that one job was stuck, but after killing it, the next scheduled run seemed to have work. @Tbayer, let us know if it is looking better now for ya.

Well yes, all the days since January 29 look good now, but we are still missing all events from between January 26 4:00 and January 28 17:00 UTC it seems:

SELECT year, month, day, MIN(dt) AS mindt, MAX(dt) AS maxdt, COUNT(*) AS events FROM event.navigationtiming WHERE year =2018 GROUP BY year, month, day ORDER BY year, month, day LIMIT 10000;

year	month	day	mindt	maxdt	events
2018	1	1	2018-01-01T00:00:00	2018-01-01T23:59:59	351864
2018	1	2	2018-01-02T00:00:00	2018-01-02T23:59:59	377098
2018	1	3	2018-01-03T00:00:00	2018-01-03T23:59:59	374133
2018	1	4	2018-01-04T00:00:00	2018-01-04T23:59:59	371390
2018	1	5	2018-01-05T00:00:00	2018-01-05T23:59:59	360239
2018	1	6	2018-01-06T00:00:00	2018-01-06T23:59:59	367106
2018	1	7	2018-01-07T00:00:00	2018-01-07T23:59:59	401630
2018	1	8	2018-01-08T00:00:00	2018-01-08T23:59:59	401073
2018	1	9	2018-01-09T00:00:01	2018-01-09T23:59:59	383230
2018	1	10	2018-01-10T00:00:00	2018-01-10T23:59:59	378587
2018	1	11	2018-01-11T00:00:00	2018-01-11T23:59:59	373086
2018	1	12	2018-01-12T00:00:00	2018-01-12T23:59:59	361172
2018	1	13	2018-01-13T00:00:00	2018-01-13T23:59:58	358843
2018	1	14	2018-01-14T00:00:00	2018-01-14T23:59:59	395842
2018	1	15	2018-01-15T00:00:00	2018-01-15T23:59:59	396190
2018	1	16	2018-01-16T00:00:00	2018-01-16T23:59:59	389693
2018	1	17	2018-01-17T00:00:00	2018-01-17T23:59:59	386363
2018	1	18	2018-01-18T00:00:00	2018-01-18T23:59:59	379750
2018	1	19	2018-01-19T00:00:00	2018-01-19T23:59:59	362491
2018	1	20	2018-01-20T00:00:00	2018-01-20T23:59:59	357370
2018	1	21	2018-01-21T00:00:00	2018-01-21T23:59:59	401910
2018	1	22	2018-01-22T00:00:00	2018-01-22T23:59:59	403286
2018	1	23	2018-01-23T00:00:00	2018-01-23T23:59:59	393751
2018	1	24	2018-01-24T00:00:00	2018-01-24T23:59:59	385242
2018	1	25	2018-01-25T00:00:00	2018-01-25T23:59:59	380481
2018	1	26	2018-01-26T00:00:00	2018-01-26T03:59:59	50555
2018	1	28	2018-01-28T17:00:00	2018-01-28T23:59:59	144055
2018	1	29	2018-01-29T00:00:00	2018-01-29T23:59:59	403620
2018	1	30	2018-01-30T00:00:00	2018-01-30T23:59:59	385939
2018	1	31	2018-01-31T00:00:00	2018-01-31T23:59:59	378578
2018	2	1	2018-02-01T00:00:00	2018-02-01T23:59:59	369820
31 rows selected (68.34 seconds)


SELECT year, month, day, MIN(dt) AS mindt, MAX(dt) AS maxdt, COUNT(*) AS events 
FROM event.popups WHERE year =2018 GROUP BY year, month, day ORDER BY year, month, day LIMIT 10000;

year	month	day	mindt	maxdt	events
2018	1	1	2018-01-01T00:00:00	2018-01-01T23:59:59	10304594
2018	1	2	2018-01-02T00:00:00	2018-01-02T23:59:59	14870621
2018	1	3	2018-01-03T00:00:00	2018-01-03T23:59:59	15333868
2018	1	4	2018-01-04T00:00:00	2018-01-04T23:59:59	15016878
2018	1	5	2018-01-05T00:00:00	2018-01-05T23:59:59	14333294
2018	1	6	2018-01-06T00:00:00	2018-01-06T23:59:59	11595269
2018	1	7	2018-01-07T00:00:00	2018-01-07T23:59:59	12753601
2018	1	8	2018-01-08T00:00:00	2018-01-08T23:59:59	16651726
2018	1	9	2018-01-09T00:00:00	2018-01-09T23:59:59	16801365
2018	1	10	2018-01-10T00:00:00	2018-01-10T23:59:59	16358371
2018	1	11	2018-01-11T00:00:00	2018-01-11T23:59:59	16049877
2018	1	12	2018-01-12T00:00:00	2018-01-12T23:59:59	14722130
2018	1	13	2018-01-13T00:00:00	2018-01-13T23:59:59	11604586
2018	1	14	2018-01-14T00:00:00	2018-01-14T23:59:59	12695654
2018	1	15	2018-01-15T00:00:00	2018-01-15T23:59:59	16342986
2018	1	16	2018-01-16T00:00:00	2018-01-16T23:59:59	16960167
2018	1	17	2018-01-17T00:00:00	2018-01-17T23:59:59	16514163
2018	1	18	2018-01-18T00:00:00	2018-01-18T23:59:59	16099091
2018	1	19	2018-01-19T00:00:00	2018-01-19T23:59:59	14807773
2018	1	20	2018-01-20T00:00:00	2018-01-20T23:59:59	11441730
2018	1	21	2018-01-21T00:00:00	2018-01-21T23:59:59	13085816
2018	1	22	2018-01-22T00:00:00	2018-01-22T23:59:59	17842035
2018	1	23	2018-01-23T00:00:00	2018-01-23T23:59:59	17391402
2018	1	24	2018-01-24T00:00:00	2018-01-24T23:59:59	16862244
2018	1	25	2018-01-25T00:00:00	2018-01-25T23:59:59	16720180
2018	1	26	2018-01-26T00:00:00	2018-01-26T03:59:59	1763610
2018	1	28	2018-01-28T17:00:00	2018-01-28T23:59:59	4931919
2018	1	29	2018-01-29T00:00:00	2018-01-29T23:59:59	17589425
2018	1	30	2018-01-30T00:00:00	2018-01-30T23:59:59	16996963
2018	1	31	2018-01-31T00:00:00	2018-01-31T23:59:59	16421536
2018	2	1	2018-02-01T00:00:00	2018-02-01T23:59:59	16021907
31 rows selected (62.353 seconds)

Ping - I seem to recall someone saying that we only have one week's worth of data buffered in Kafka for this. Considering that it's now already one week after the gap started to occur, does this mean that the data is already starting to become irretrievably lost from the Hive perspective?

Nuria added a comment.Feb 4 2018, 8:42 PM

@Tbayer: no, see my comment earlier, the raw data is there. Data is persisted in hive at different stages in different forms.

nuria@stat1005:~$ ls -la /mnt/hdfs/wmf/data/raw/eventlogging/eventlogging_Popups/hourly/2017/12/27/12
total 114328
drwxr-xr-x 4 hdfs hadoop 4096 Dec 27 13:06 .
drwxr-xr-x 26 hdfs hadoop 4096 Dec 27 23:06 ..
-rw-r--r-- 1 hdfs hadoop 9863595 Dec 27 12:06 eventlogging_Popups.12.0.46395.1026353423.1514376000000
-rw-r--r-- 1 hdfs hadoop 107200408 Dec 27 13:06 eventlogging_Popups.12.0.504388.1026857874.1514376000000

Hm, ok, thanks Tilman, will investigate this today.

Ah, this was my fault. When I reran the jobs, I only started from 4 days in the past. We def need alarms on missing data. Implementing them here on just missing data is a little more complicated than some other cases, but we need some anyway. We'll figure something out though, you shouldn't have had to been the one to have to alert us about this.

Ottomata claimed this task.Feb 6 2018, 2:58 PM

BTW, I believe this should be fixed, please tell me if otherwise. We'll add better monitoring in T186833

BTW, I believe this should be fixed, please tell me if otherwise.

Thanks! The checks from T186130#3939824 look good now (feel free to adapt these queries next time something like this needs to be tested).

We'll add better monitoring in T186833

I assume you meant T186602 ? Anyway, great!

Oops, I did, thanks!

Tbayer closed this task as Resolved.Feb 9 2018, 7:45 PM