Page MenuHomePhabricator

Data Lake queries abort with HDFS write fail
Closed, DuplicatePublic

Description

This Hive query and variants of it repeatedly fail after several hours with an error message like this in stage 2:

Stage-Stage-2: Map: 627  Reduce: 739   Cumulative CPU: 77.38 sec   HDFS Read: 148778813 HDFS Write: 0 FAIL

The full log including the entire query is also pasted here: P5604

It happens both from stat1004 and from stat1002.

The strange this is that the problem goes away if one simply removes this clause in the WHERE condition of the outer query:

AND event_comment NOT REGEXP '\\{\\{R from '

(or if one replaces both clauses in that line with the one clause event_comment not regexp '[Rr]edir' - that's the version of @Milimetric's query I posted at T149021#3287887, and the failing version above results from applying the correction at T149021#3331707 ).

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJun 21 2017, 5:29 AM

Hi @Tbayer,
I wonder about the reason ... One solution is to use the non-yet-proctionized version of mediawiki_history that has cumulative edit counts (and timestamps in JDBC format YYYY-mm-dd HH:MM:SS): joal.mediawiki_history WHERE snapshot = '2017-05'.
I have transformed your query to match this dataset:

select
  wiki_db,
  substr(event_timestamp, 0, 10) as day,
  count(*) as creations,
  sum( if( ((unix_timestamp(event_timestamp) - unix_timestamp(coalesce(event_user_creation_timestamp, '2005-01-01 00:00:00'))) < 345600) or (revision_user_cumulative_revision_count < 10), 1, 0) ) as non_autoconfirmed_creations
from joal.mediawiki_history
where event_entity = 'revision'
  and event_type = 'create'
  and page_namespace = 0
  and wiki_db = 'enwiki'
  and revision_parent_id = 0
  and LCASE(event_comment) NOT REGEXP 'redir'
  AND event_comment NOT REGEXP '\\{\\{R from '
  and snapshot = '2017-05'
  and event_timestamp > '2017-01-01 00:00:00'
group by
  wiki_db,
  substr(event_timestamp, 0, 10)
order by
  wiki_db,
  day
limit 10000;


Total MapReduce CPU Time Spent: 0 days 4 hours 15 minutes 19 seconds 780 msec
OK
wiki_db day creations non_autoconfirmed_creations
enwiki  2017-01-01  907 45
enwiki  2017-01-02  1115  72
enwiki  2017-01-03  977 87
enwiki  2017-01-04  1088  77
enwiki  2017-01-05  931 76
enwiki  2017-01-06  921 67
enwiki  2017-01-07  913 67
enwiki  2017-01-08  1251  70
enwiki  2017-01-09  1044  83
enwiki  2017-01-10  868 71
enwiki  2017-01-11  912 75
enwiki  2017-01-12  930 75
enwiki  2017-01-13  1154  77
enwiki  2017-01-14  1067  58
enwiki  2017-01-15  996 61
enwiki  2017-01-16  1197  60
enwiki  2017-01-17  879 76
enwiki  2017-01-18  981 74
enwiki  2017-01-19  925 54
enwiki  2017-01-20  990 89
enwiki  2017-01-21  1017  59
enwiki  2017-01-22  971 56
enwiki  2017-01-23  883 80
enwiki  2017-01-24  1072  84
enwiki  2017-01-25  1253  86
enwiki  2017-01-26  996 84
enwiki  2017-01-27  1090  64
enwiki  2017-01-28  1558  66
enwiki  2017-01-29  1337  76
enwiki  2017-01-30  1446  80
enwiki  2017-01-31  939 61
enwiki  2017-02-01  964 66
enwiki  2017-02-02  833 74
enwiki  2017-02-03  909 74
enwiki  2017-02-04  962 53
enwiki  2017-02-05  1110  65
enwiki  2017-02-06  902 64
enwiki  2017-02-07  840 81
enwiki  2017-02-08  956 87
enwiki  2017-02-09  989 59
enwiki  2017-02-10  963 72
enwiki  2017-02-11  928 49
enwiki  2017-02-12  1021  58
enwiki  2017-02-13  969 57
enwiki  2017-02-14  914 66
enwiki  2017-02-15  1022  67
enwiki  2017-02-16  974 70
enwiki  2017-02-17  899 62
enwiki  2017-02-18  1038  59
enwiki  2017-02-19  1038  70
enwiki  2017-02-20  1186  72
enwiki  2017-02-21  1064  98
enwiki  2017-02-22  963 85
enwiki  2017-02-23  1267  77
enwiki  2017-02-24  943 72
enwiki  2017-02-25  1228  72
enwiki  2017-02-26  1351  68
enwiki  2017-02-27  1117  89
enwiki  2017-02-28  977 85
enwiki  2017-03-01  885 93
enwiki  2017-03-02  1004  100
enwiki  2017-03-03  1048  87
enwiki  2017-03-04  1206  126
enwiki  2017-03-05  1262  81
enwiki  2017-03-06  1188  93
enwiki  2017-03-07  1086  87
enwiki  2017-03-08  1004  114
enwiki  2017-03-09  1076  80
enwiki  2017-03-10  1095  92
enwiki  2017-03-11  1118  99
enwiki  2017-03-12  1359  76
enwiki  2017-03-13  1180  105
enwiki  2017-03-14  1548  76
enwiki  2017-03-15  1101  65
enwiki  2017-03-16  1190  77
enwiki  2017-03-17  965 91
enwiki  2017-03-18  1130  82
enwiki  2017-03-19  1202  61
enwiki  2017-03-20  1299  81
enwiki  2017-03-21  1332  69
enwiki  2017-03-22  1183  100
enwiki  2017-03-23  1105  83
enwiki  2017-03-24  946 80
enwiki  2017-03-25  993 69
enwiki  2017-03-26  1084  74
enwiki  2017-03-27  1207  81
enwiki  2017-03-28  1051  111
enwiki  2017-03-29  1371  116
enwiki  2017-03-30  1359  102
enwiki  2017-03-31  872 74
enwiki  2017-04-01  864 78
enwiki  2017-04-02  886 75
enwiki  2017-04-03  899 84
enwiki  2017-04-04  1001  82
enwiki  2017-04-05  870 76
enwiki  2017-04-06  1195  74
enwiki  2017-04-07  1078  86
enwiki  2017-04-08  1251  63
enwiki  2017-04-09  975 78
enwiki  2017-04-10  1086  92
enwiki  2017-04-11  1127  95
enwiki  2017-04-12  1114  75
enwiki  2017-04-13  975 69
enwiki  2017-04-14  1097  60
enwiki  2017-04-15  925 68
enwiki  2017-04-16  851 69
enwiki  2017-04-17  1143  86
enwiki  2017-04-18  1001  64
enwiki  2017-04-19  1126  85
enwiki  2017-04-20  1039  83
enwiki  2017-04-21  1113  105
enwiki  2017-04-22  962 68
enwiki  2017-04-23  891 65
enwiki  2017-04-24  1012  90
enwiki  2017-04-25  959 88
enwiki  2017-04-26  1013  101
enwiki  2017-04-27  991 83
enwiki  2017-04-28  891 85
enwiki  2017-04-29  1021  60
enwiki  2017-04-30  1086  74
enwiki  2017-05-01  1018  80
enwiki  2017-05-02  969 73
enwiki  2017-05-03  1020  70
enwiki  2017-05-04  983 79
enwiki  2017-05-05  994 93
enwiki  2017-05-06  827 58
enwiki  2017-05-07  1180  63
enwiki  2017-05-08  990 85
enwiki  2017-05-09  1105  96
enwiki  2017-05-10  1026  98
enwiki  2017-05-11  894 92
enwiki  2017-05-12  923 74
enwiki  2017-05-13  957 67
enwiki  2017-05-14  1056  62
enwiki  2017-05-15  1058  70
enwiki  2017-05-16  940 69
enwiki  2017-05-17  878 74
enwiki  2017-05-18  956 82
enwiki  2017-05-19  877 75
enwiki  2017-05-20  852 91
enwiki  2017-05-21  986 83
enwiki  2017-05-22  910 71
enwiki  2017-05-23  1006  113
enwiki  2017-05-24  920 109
enwiki  2017-05-25  954 96
enwiki  2017-05-26  802 102
enwiki  2017-05-27  851 94
enwiki  2017-05-28  889 117
enwiki  2017-05-29  954 88
enwiki  2017-05-30  869 110
enwiki  2017-05-31  1040  113

Also @Tbayer:
I managed to have your query finish. HDFS write issues was possibly related to an HDFS fillin-up issue we noticed at the beginning of the week (solved now, plus new alarms on their way in our side for this not happen).

However, the total time and actual time is very much slower using the cumulative edits counts for very similar results.

select
  mediawiki_history.wiki_db,
  substr(event_timestamp, 0, 8) as day,
  count(*) as creations,
  sum( if( ((unix_timestamp(event_timestamp, 'yyyyMMddHHmmss') - unix_timestamp(coalesce(event_user_creation_timestamp, '20050101000000'), 'yyyyMMddHHmmss')) < 345600) or (user_edits_by_hour.edit_count < 10), 1, 0) ) as non_autoconfirmed_creations
from wmf.mediawiki_history
inner join (
  select
    wiki_db,
    event_user_id,
    hour,
    sum(edit_count) over (partition by wiki_db, event_user_id order by hour) as edit_count
  from (
    select
      wiki_db,
      event_user_id,
      substr(event_timestamp, 0, 10) as hour,
      count(*) as edit_count
    from wmf.mediawiki_history
    where event_entity = 'revision'
      and event_type = 'create'
      and wiki_db = 'enwiki'
      and snapshot = '2017-04'
    group by
      wiki_db,
      event_user_id,
      substr(event_timestamp, 0, 10)
    ) user_edits_per_hour
  ) user_edits_by_hour
    ON mediawiki_history.wiki_db = user_edits_by_hour.wiki_db
      AND mediawiki_history.event_user_id = user_edits_by_hour.event_user_id
      AND substr(event_timestamp, 0, 10) = user_edits_by_hour.hour
  where event_entity = 'revision'
  and event_type = 'create'
  and page_namespace = 0
  and mediawiki_history.wiki_db = 'enwiki'
  and revision_parent_id = 0
  and LCASE(event_comment) NOT REGEXP 'redir'
  AND event_comment NOT REGEXP '\\{\\{R from '
  and snapshot = '2017-04'
  and event_timestamp > '20170101000000'
group by
  mediawiki_history.wiki_db,
  substr(event_timestamp, 0, 8)
order by
  mediawiki_history.wiki_db,
  day
limit 10000;


Total MapReduce CPU Time Spent: 1 days 3 hours 51 minutes 37 seconds 510 msec
OK
mediawiki_history.wiki_db	day	creations	non_autoconfirmed_creations
enwiki	20170101	907	43
enwiki	20170102	1116	71
enwiki	20170103	979	87
enwiki	20170104	1091	74
enwiki	20170105	934	74
enwiki	20170106	925	64
enwiki	20170107	915	68
enwiki	20170108	1251	65
enwiki	20170109	1045	78
enwiki	20170110	872	71
enwiki	20170111	914	74
enwiki	20170112	933	71
enwiki	20170113	1156	72
enwiki	20170114	1067	57
enwiki	20170115	1000	55
enwiki	20170116	1198	57
enwiki	20170117	885	75
enwiki	20170118	982	72
enwiki	20170119	926	51
enwiki	20170120	992	87
enwiki	20170121	1019	55
enwiki	20170122	973	54
enwiki	20170123	886	78
enwiki	20170124	1076	84
enwiki	20170125	1255	80
enwiki	20170126	1001	78
enwiki	20170127	1095	62
enwiki	20170128	1564	62
enwiki	20170129	1343	74
enwiki	20170130	1447	77
enwiki	20170131	942	60
enwiki	20170201	965	65
enwiki	20170202	836	68
enwiki	20170203	911	70
enwiki	20170204	964	50
enwiki	20170205	1115	61
enwiki	20170206	904	65
enwiki	20170207	840	79
enwiki	20170208	957	78
enwiki	20170209	992	55
enwiki	20170210	964	70
enwiki	20170211	928	49
enwiki	20170212	1025	53
enwiki	20170213	977	58
enwiki	20170214	913	61
enwiki	20170215	1024	67
enwiki	20170216	981	67
enwiki	20170217	904	63
enwiki	20170218	1048	54
enwiki	20170219	1040	69
enwiki	20170220	1190	70
enwiki	20170221	1067	93
enwiki	20170222	968	82
enwiki	20170223	1272	74
enwiki	20170224	949	72
enwiki	20170225	1228	70
enwiki	20170226	1358	66
enwiki	20170227	1122	89
enwiki	20170228	979	84
enwiki	20170301	890	86
enwiki	20170302	1006	98
enwiki	20170303	1050	83
enwiki	20170304	1210	124
enwiki	20170305	1275	79
enwiki	20170306	1199	88
enwiki	20170307	1094	88
enwiki	20170308	1011	110
enwiki	20170309	1080	78
enwiki	20170310	1101	89
enwiki	20170311	1124	96
enwiki	20170312	1380	78
enwiki	20170313	1186	106
enwiki	20170314	1562	80
enwiki	20170315	1104	63
enwiki	20170316	1199	74
enwiki	20170317	970	86
enwiki	20170318	1138	83
enwiki	20170319	1211	65
enwiki	20170320	1312	80
enwiki	20170321	1342	70
enwiki	20170322	1189	99
enwiki	20170323	1115	87
enwiki	20170324	964	84
enwiki	20170325	1000	69
enwiki	20170326	1094	76
enwiki	20170327	1212	79
enwiki	20170328	1058	111
enwiki	20170329	1375	112
enwiki	20170330	1369	104
enwiki	20170331	882	76
enwiki	20170401	872	77
enwiki	20170402	893	76
enwiki	20170403	915	86
enwiki	20170404	1014	83
enwiki	20170405	878	77
enwiki	20170406	1207	76
enwiki	20170407	1081	85
enwiki	20170408	1257	63
enwiki	20170409	984	76
enwiki	20170410	1100	93
enwiki	20170411	1138	96
enwiki	20170412	1131	76
enwiki	20170413	989	71
enwiki	20170414	1110	58
enwiki	20170415	934	65
enwiki	20170416	866	72
enwiki	20170417	1167	87
enwiki	20170418	1016	68
enwiki	20170419	1144	83
enwiki	20170420	1053	88
enwiki	20170421	1126	110
enwiki	20170422	984	78
enwiki	20170423	916	72
enwiki	20170424	1042	99
enwiki	20170425	1009	108
enwiki	20170426	1068	125
enwiki	20170427	1063	111
enwiki	20170428	934	104
enwiki	20170429	1076	89
enwiki	20170430	1128	93

@Tbayer, I just read more carefully the definition of editcount and found that it includes Flow posts. Since these are not revisions, there could be a significant difference between revision_user_cumulative_revision_count and user_editcount. Because of this, any analysis we've done so far that approximates user_editcount historically could be under-counting the number of autoconfirmed users.

https://www.mediawiki.org/wiki/Manual:User_table#user_editcount

Also @Tbayer:
I managed to have your query finish. HDFS write issues was possibly related to an HDFS fillin-up issue we noticed at the beginning of the week (solved now, plus new alarms on their way in our side for this not happen).

That's T168390, I presume?

However, the total time and actual time is very much slower using the cumulative edits counts for very similar results.

Great, so that means we can save a lot of time using your new table at joal.mediawiki_history, and get more accurate results at the same time from that new field revision_user_cumulative_revision_count. (The nested user_edits_by_hour subquery was merely an approximation that @Milimetric came up with in the absence of such a field, T149021#3260176 ). I'm going to try that for the actual queries I was intending to run (the above example was a simplified version for investigating the HDFS write fail bug; also, the resulting numbers should not be used).

Are there any other things to be aware of when using joal.mediawiki_history instead of the main production version mediawiki_history? Is the added edit count field the only difference?

@Tbayer, I just read more carefully the definition of editcount and found that it includes Flow posts. Since these are not revisions, there could be a significant difference between revision_user_cumulative_revision_count and user_editcount. Because of this, any analysis we've done so far that approximates user_editcount historically could be under-counting the number of autoconfirmed users.
https://www.mediawiki.org/wiki/Manual:User_table#user_editcount

Thanks for the heads-up - replied at T166269#3367820 .

@Tbayer: HDFS issue realted taks is indeed T168390.

Things you should be aware of with the table joal.mediawiki_history:

  • Every timestamps are in JDBC format (YYYY-mm-dd HH:MM:SS) - This differs a lot from the current production one (YYYYmmddHHMMSS)
  • New per-user and per-page revision fields:
revision_user_cumulative_revision_count	        bigint              	In revision events: Cumulative revision count per user for the current event_user_id
revision_user_seconds_to_previous_revision	bigint              	In revision events: seconds elapsed since the previous revision made by the current event_user_id
revision_page_cumulative_revision_count	        bigint              	In revision events: Cumulative revision count per page for the current page_id
revision_page_seconds_to_previous_revision	bigint              	In revision events: seconds elapsed since the previous revision made on the current page_id

Only one snapshot available: 2017-05
You're probably only going to use revision_user_cumulative_revision_count for the moment, but the other are interesting too.
Also, names are (strongly) probably going to change in productionized version of this.

Nuria assigned this task to JAllemandou.Jun 26 2017, 3:57 PM
Nuria edited projects, added Analytics-Kanban; removed Analytics.
Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board.

@Tbayer : Do you ming if I merge this task with T161147? The latter is the solution to the former.

No answer, merging.