Page MenuHomePhabricator

Failure while refining webrequest upload 2018-12-01-14. Upgrade alarms
Closed, ResolvedPublic5 Estimated Story Points

Description

Tracking task to handle the following event:

Data Loss ERROR - Workflow webrequest-load-check_sequence_statistics-wf-upload-2018-12-1-14

This is an ERROR.
This job has failed and refine jobs have been cancelled.

Please, have a look at the error file attached to this email
and the statistics table wmf_raw.webrequest_sequence_stats_hourly for a detailed explanation,
and take necessary action!

Thanks :)
-- Oozie

Attached data:

30 requests (0.0% of total) have incomplete records. 8041489 requests (6.038% of valid ones) were lost.

This is probably a occurrent of the infamous dt:"-" issue (varnishkafka not able to grab the Varnish request's Timestamp:Resp) generating a false alarm, but better to investigate properly. This means that related data will need to hold for a bit before being generated (like pageviews, etc..).

Event Timeline

elukey triaged this task as Medium priority.Dec 3 2018, 10:55 AM
elukey created this task.
elukey raised the priority of this task from Medium to High.Dec 3 2018, 11:15 AM

TL;DR: 83 requests in the failed hour are responsible for the failure. Their sequence-number is fairly bigger than the ones in the current hour, while their timestamp is still assigned to the hour (this is not the dt = '-' case).

Details:
Using spark, I extracted "wmf_raw.webrequest" for the failed-refine hour, with only rowsd from a hostname that had a high error rate in stats.

spark.sql("select * from wmf_raw.webrequest where webrequest_source = 'upload' and year = 2018 and month = 12 and day = 1 and hour = 14 and hostname = 'cp3036.esams.wmnet'").createOrReplaceTempView("wr")

I then queried the extracted data for by-minute stats (min(sequence),max(sequence), their diff and the actuall count).

spark.sql("select substr(dt, 15, 2) as sec, min(sequence) as min_seq, max(sequence) as max_seq, (max(sequence) - min(sequence)) as diff_seq, count(distinct sequence) as count_seq from wr group by substr(dt, 15, 2) order by sec").show(1000)


+---+----------+----------+--------+---------+                                  
|sec|   min_seq|   max_seq|diff_seq|count_seq|
+---+----------+----------+--------+---------+
|   |2969765034|2969765034|       0|        1|
| 00|2969647441|2969735788|   88347|    88348|
| 01|2969735789|2970800139| 1064350|    89172|
| 02|2969824962|2969916503|   91541|    91014|
| 03|2969915975|2970934084| 1018109|    93746|
| 04|2970009720|2970101496|   91776|    91776|
| 05|2970101497|2971157671| 1056174|    93856|
| 06|2970195350|2971213435| 1018085|    90624|
| 07|2970285973|2971277581|  991608|    86507|
| 08|2970372480|2970459450|   86970|    86969|
| 09|2970459451|2970546784|   87333|    87334|
| 10|2970546785|2970893380|  346595|    90122|
| 11|2970636905|2971004502|  367597|    88171|
| 12|2970725076|2970820022|   94946|    94946|
| 13|2970820023|2971092207|  272184|    92202|
| 14|2970912225|2971020718|  108493|    94044|
| 15|2971006270|2971158151|  151881|    89600|
| 16|2971095871|2971400546|  304675|    91426|
| 17|2971187298|2971960116|  772818|    92166|
| 18|2971279463|2971581414|  301951|    89895|
| 19|2971369356|2971870974|  501618|    90228|
| 20|2971459584|2971548397|   88813|    88812|
| 21|2971548398|2971738061|  189663|    88064|
| 22|2971636462|2971728573|   92111|    92111|
| 23|2971728574|2972814455| 1085881|    89689|
| 24|2971818261|2972604362|  786101|    89144|
| 25|2971907405|2971999039|   91634|    91633|
| 26|2971999040|2973003218| 1004178|    90611|
| 27|2972089648|2972179542|   89894|    89895|
| 28|2972179543|2972273677|   94134|    94135|
| 29|2972273678|2973233708|  960030|    94025|
| 30|2972367704|2972458484|   90780|    90781|
| 31|2972458485|2973029308|  570823|    92551|
| 32|2972551035|2972643784|   92749|    92749|
| 33|2972643785|2972737183|   93398|    93399|
| 34|2972737184|2972833119|   95935|    95935|
| 35|2972833120|2972924290|   91170|    91170|
| 36|2972924291|2973016109|   91818|    91818|
| 37|2973016110|2973111251|   95141|    95141|
| 38|2973111252|2973336717|  225465|    95402|
| 39|2973206653|2973301245|   94592|    94592|
| 40|2973301246|2973396909|   95663|    95240|
| 41|2973396486|2973487892|   91406|    91406|
| 42|2973487893|2974240334|  752441|    95654|
| 43|2973583544|2973679576|   96032|    96031|
| 44|2973679577|2974665668|  986091|    93221|
| 45|2973772797|2973863541|   90744|    90723|
| 46|2973863519|2973957151|   93632|    93632|
| 47|2973957152|2974986768| 1029616|    88741|
| 48|2974045889|2974139465|   93576|    93577|
| 49|2974139466|2975158988| 1019522|    94965|
| 50|2974234430|2974326744|   92314|    92313|
| 51|2974326745|2974423126|   96381|    96381|
| 52|2974423127|2974659762|  236635|    92887|
| 53|2974516013|2975552541| 1036528|    87474|
| 54|2974603486|2974696688|   93202|    92187|
| 55|2974695674|2975001933|  306259|    92559|
| 56|2974788233|2975883279| 1095046|    95365|
| 57|2974883596|2975991479| 1107883|    91648|
| 58|2974975244|2975070433|   95189|    95188|
| 59|2975070434|2976106912| 1036478|    95615|
+---+----------+----------+--------+---------+

I picked a minute having discrepancies and checked by second:

spark.sql("select substr(dt, 18, 2) as sec, min(sequence) as min_seq, max(sequence) as max_seq, (max(sequence) - min(sequence)) as diff_seq, count(distinct sequence) as count_seq from wr where  substr(dt, 15, 2) = '26' group by substr(dt, 18, 2) order by sec").show(1000)


+---+----------+----------+--------+---------+
|sec|   min_seq|   max_seq|diff_seq|count_seq|
+---+----------+----------+--------+---------+
| 00|2971999040|2972000432|    1392|     1393|
| 01|2972000433|2972001804|    1371|     1372|
| 02|2972001805|2972003240|    1435|     1436|
| 03|2972003241|2973003218|  999977|     1529|
| 04|2972004769|2972006288|    1519|     1520|
| 05|2972006289|2972007746|    1457|     1458|
| 06|2972007747|2972009037|    1290|     1291|
| 07|2972009038|2972010829|    1791|     1792|
| 08|2972010830|2972012472|    1642|     1643|
| 09|2972012473|2972013977|    1504|     1505|
| 10|2972013978|2972015439|    1461|     1462|
| 11|2972015440|2972017007|    1567|     1568|
| 12|2972017008|2972867952|  850944|     1553|
| 13|2972018560|2972019886|    1326|     1327|
| 14|2972019887|2972021118|    1231|     1232|
| 15|2972021119|2972286055|  264936|     1383|
| 16|2972022501|2972024885|    2384|     1506|
| 17|2972024006|2972025437|    1431|     1430|
| 18|2972025438|2972027407|    1969|     1473|
| 19|2972026910|2972028269|    1359|     1359|
| 20|2972028270|2972029658|    1388|     1350|
| 21|2972029619|2972031219|    1600|     1600|
| 22|2972031220|2972032894|    1674|     1675|
| 23|2972032895|2972034535|    1640|     1641|
| 24|2972034536|2972036120|    1584|     1453|
| 25|2972035988|2972037465|    1477|     1477|
| 26|2972037466|2972039208|    1742|     1743|
| 27|2972039209|2972040460|    1251|     1164|
| 28|2972040372|2972042314|    1942|     1664|
| 29|2972042036|2972043744|    1708|     1708|
| 30|2972043745|2972045412|    1667|     1668|
| 31|2972045413|2972046841|    1428|     1429|
| 32|2972046842|2972048148|    1306|     1307|
| 33|2972048149|2972049662|    1513|     1514|
| 34|2972049663|2972051312|    1649|     1650|
| 35|2972051313|2972052848|    1535|     1536|
| 36|2972052849|2972054357|    1508|     1509|
| 37|2972054358|2972055863|    1505|     1506|
| 38|2972055864|2972057412|    1548|     1549|
| 39|2972057413|2972059793|    2380|     2268|
| 40|2972059680|2972334201|  274521|     1308|
| 41|2972060988|2972062567|    1579|     1580|
| 42|2972062568|2972064275|    1707|     1708|
| 43|2972064276|2972065701|    1425|     1426|
| 44|2972065702|2972067001|    1299|     1201|
| 45|2972066902|2972068624|    1722|     1722|
| 46|2972068625|2972070251|    1626|     1627|
| 47|2972070252|2972071598|    1346|     1347|
| 48|2972071599|2972073141|    1542|     1543|
| 49|2972073142|2972074391|    1249|     1250|
| 50|2972074392|2972075874|    1482|     1483|
| 51|2972075875|2972077217|    1342|     1343|
| 52|2972077218|2972078852|    1634|     1607|
| 53|2972078824|2972081143|    2319|     1722|
| 54|2972080546|2972081886|    1340|     1340|
| 55|2972081887|2972083272|    1385|     1386|
| 56|2972083273|2972084991|    1718|     1719|
| 57|2972084992|2972086538|    1546|     1547|
| 58|2972086539|2972088066|    1527|     1528|
| 59|2972088067|2972089647|    1580|     1581|
+---+----------+----------+--------+---------+

And finally check in seconds showing discrepancies if I could find a row having an outlying sequence number (meaning that current sequence-number is not equal to the previous sequence number found in the data + 1):

spark.sql("""
select
  sequence,
  LAG(sequence) OVER (ORDER by sequence) as prev_seq
from wr
where substr(dt, 15, 2) = '26'
  AND substr(dt, 18, 2) = '03'
HAVING sequence != prev_seq + 1
""").show(1000)

+----------+----------+
|  sequence|  prev_seq|
+----------+----------+
|2973003218|2972004768|
+----------+----------+

I then replicated the last bit of the query on the overall failed hour, with a diff in sequence-numbers to be bigger than 10 (to show only big outliers):

spark.sql("""
select
  hostname,
  sequence,
  LAG(sequence) OVER (PARTITION BY hostname ORDER by sequence) as prev_seq
from wmf_raw.webrequest
where webrequest_source = 'upload'
  and year = 2018
  and month = 12
  and day = 1
  and hour = 14
  and dt != '-'
HAVING (sequence - prev_seq > 10)
""").show(1000)

+------------------+----------+----------+                                      
|          hostname|  sequence|  prev_seq|
+------------------+----------+----------+
|cp3046.esams.wmnet|3289197438|3289196147|
|cp3046.esams.wmnet|3289610551|3289197438|
|cp3046.esams.wmnet|3289669279|3289610551|
|cp3045.esams.wmnet|2036434594|2036431061|
|cp1088.eqiad.wmnet|2148435111|2148416264|
|cp5003.eqsin.wmnet|2410946218|2410945684|
|cp5003.eqsin.wmnet|2410958979|2410946218|
|cp5003.eqsin.wmnet|2410976324|2410958979|
|cp5003.eqsin.wmnet|2411010753|2410976324|
|cp5003.eqsin.wmnet|2411031858|2411010753|
|cp5003.eqsin.wmnet|2411099872|2411031858|
|cp5003.eqsin.wmnet|2411362070|2411099872|
|cp3039.esams.wmnet|2925061520|2924225848|
|cp3039.esams.wmnet|2925170560|2925061520|
|cp3039.esams.wmnet|2925205068|2925170560|
|cp5005.eqsin.wmnet|2455820583|2455818605|
|cp5005.eqsin.wmnet|2455865141|2455820583| + 2455865142
|cp5005.eqsin.wmnet|2455865980|2455865142|
|cp5005.eqsin.wmnet|2455895817|2455865980|
|cp5005.eqsin.wmnet|2455956524|2455895817|
|cp5005.eqsin.wmnet|2455960290|2455956524|
|cp5005.eqsin.wmnet|2456130377|2455960290|
|cp5005.eqsin.wmnet|2456265260|2456130377|
|cp5005.eqsin.wmnet|2456280714|2456265260|
|cp3034.esams.wmnet|1873686782|1873670506|
|cp3034.esams.wmnet|1873717576|1873686782|
|cp3034.esams.wmnet|1873880845|1873717576|
|cp5006.eqsin.wmnet|1972675333|1972633498|
|cp5006.eqsin.wmnet|1972679608|1972675333|
|cp5006.eqsin.wmnet|1972909675|1972679608|
|cp5006.eqsin.wmnet|1972955265|1972909675|
|cp3036.esams.wmnet|2975552541|2975166049|
|cp3036.esams.wmnet|2975883279|2975552541|
|cp3036.esams.wmnet|2975991479|2975883279|
|cp3036.esams.wmnet|2976106912|2975991479|
|cp3037.esams.wmnet|2709124094|2709121768|
|cp3037.esams.wmnet|2709202162|2709124094|
|cp3044.esams.wmnet|2028778802|2028770005|
|cp3044.esams.wmnet|2028859118|2028778802|
|cp3044.esams.wmnet|2029008837|2028859118|
|cp3044.esams.wmnet|2029248813|2029008837|
|cp3044.esams.wmnet|2029579549|2029248813|
|cp5004.eqsin.wmnet|2396115757|2396115681|
|cp5004.eqsin.wmnet|2396136261|2396115757|
|cp5004.eqsin.wmnet|2396200222|2396136261|
|cp5004.eqsin.wmnet|2396224314|2396200222|
|cp5004.eqsin.wmnet|2396240536|2396224314|
|cp5004.eqsin.wmnet|2396243068|2396240536|
|cp5004.eqsin.wmnet|2396260283|2396243068|
|cp5004.eqsin.wmnet|2396276116|2396260283|
|cp5004.eqsin.wmnet|2396331834|2396276116|
|cp5004.eqsin.wmnet|2396468206|2396331834|
|cp5004.eqsin.wmnet|2396580291|2396468206|
|cp5004.eqsin.wmnet|2396624225|2396580291|
|cp5002.eqsin.wmnet|2363062082|2363061882|
|cp5002.eqsin.wmnet|2363100800|2363062082|
|cp5002.eqsin.wmnet|2363119650|2363100800|
|cp5002.eqsin.wmnet|2363123534|2363119650| + 2363123535
|cp5002.eqsin.wmnet|2363171353|2363123535|
|cp5002.eqsin.wmnet|2363318635|2363171353|
|cp5002.eqsin.wmnet|2363324094|2363318635|
|cp5002.eqsin.wmnet|2363419053|2363324094|
|cp5002.eqsin.wmnet|2363655575|2363419053|
|cp3038.esams.wmnet|2678725127|2678656478|
|cp3038.esams.wmnet|2679089425|2678725127|
|cp2005.codfw.wmnet| 523278660| 523278632|
|cp3035.esams.wmnet|2824347424|2823557155|
|cp1084.eqiad.wmnet| 815066339| 815043026|
|cp4022.ulsfo.wmnet| 802226878| 802207151|
|cp5001.eqsin.wmnet| 388607377| 388600413|
|cp5001.eqsin.wmnet| 388609864| 388607377|
|cp5001.eqsin.wmnet| 388611464| 388609864|
|cp5001.eqsin.wmnet| 388619497| 388611464|
|cp5001.eqsin.wmnet| 388641869| 388619497|
|cp5001.eqsin.wmnet| 388672509| 388641869|
|cp5001.eqsin.wmnet| 388678653| 388672509|
|cp5001.eqsin.wmnet| 388879504| 388678653|
|cp5001.eqsin.wmnet| 389046938| 388879504|
|cp5001.eqsin.wmnet| 389146001| 389046938|
|cp3047.esams.wmnet|1581189357|1580778440|
|cp1086.eqiad.wmnet|1942509620|1942509385|
+------------------+----------+----------+

The rows showing another value with a + sign mean that the sequence number in the table and then next one (with the + sign) where present in the dataset.

Finally, I reran a manual production-like stats-job in spark, in which I had removed the problematic sequence numbers found above:

spark.sql("""
WITH
  statistics AS (
    SELECT
      hostname,
      MIN(sequence)                                                  AS sequence_min,
      MAX(sequence)                                                  AS sequence_max,
      COUNT(*)                                                       AS count_actual,
      MAX(sequence) - MIN(sequence) + 1                              AS count_expected,
      MAX(sequence) - MIN(sequence) + 1 - COUNT(hostname)            AS count_different,
      COUNT(*) - COUNT(DISTINCT sequence)                            AS count_duplicate,
      SUM(if(sequence IS NULL,1,0))                                  AS count_null_sequence,
      ((COUNT(*) / (MAX(sequence) - MIN(sequence) + 1)) - 1) * 100.0 AS percent_different
    FROM
      wmf_raw.webrequest
    WHERE
      webrequest_source='upload' AND
      year=2018 AND month=12 AND day=1 AND hour=14 AND
      dt != '-' AND 
      sequence NOT IN (3289197438, 3289610551, 3289669279, 2036434594, 2148435111, 2410946218, 2410958979, 2410976324, 2411010753, 2411031858, 2411099872, 2411362070, 2925061520, 2925170560, 2925205068, 2455820583, 2455865141, 2455865142, 2455865980, 2455895817, 2455956524, 2455960290, 2456130377, 2456265260, 2456280714, 1873686782, 1873717576, 1873880845, 1972675333, 1972679608, 1972909675, 1972955265, 2975552541, 2975883279, 2975991479, 2976106912, 2709124094, 2709202162, 2028778802, 2028859118, 2029008837, 2029248813, 2029579549, 2396115757, 2396136261, 2396200222, 2396224314, 2396240536, 2396243068, 2396260283, 2396276116, 2396331834, 2396468206, 2396580291, 2396624225, 2363062082, 2363100800, 2363119650, 2363123534, 2363123535, 2363171353, 2363318635, 2363324094, 2363419053, 2363655575, 2678725127, 2679089425, 523278660, 2824347424, 815066339, 802226878, 388607377, 388609864, 388611464, 388619497, 388641869, 388672509, 388678653, 388879504, 389046938, 389146001, 1581189357, 1942509620)
    GROUP BY
      hostname, webrequest_source, year, month, day, hour
  ),
  undefined AS (
    SELECT
      hostname,
      SUM(IF(dt='-',1,0)) AS count_incomplete
    FROM
      wmf_raw.webrequest
    WHERE
      webrequest_source='upload' AND
      year=2018 AND month=12 AND day=1 AND hour=14
    GROUP BY
      hostname, webrequest_source, year, month, day, hour
  )

  SELECT
    undefined.hostname,
    sequence_min,
    sequence_max,
    count_actual,
    count_expected,
    count_different,
    count_duplicate,
    count_null_sequence,
    percent_different,
    count_incomplete
  FROM
    statistics RIGHT JOIN undefined
      ON statistics.hostname = undefined.hostname
""").show(1000, false)

The above stats would not have generated even a warning message.

This means we should be safe to rerun the failed refine job with higher error-threshold (errors are due to false positives).

And finally - We need to find the reason for which some rows in VK are assigned a sequence-number long after their timestamp is assigned (which should be the end-request timestamp).

fdans raised the priority of this task from High to Unbreak Now!.Dec 3 2018, 5:38 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.

Operational problem solved: the refinement job has been restarted with higher error-acceptance rates - Blocked jobs have caught up.
Some more analysis to come to try root-causing this.

elukey lowered the priority of this task from Unbreak Now! to High.Dec 3 2018, 7:15 PM

Since the refined data should now be there, lowering the priority to High :)

Problematic requests for end-of-hour upload 2018-12-01T14:

spark.sql("""
SELECT
  hour,
  hostname,
  sequence,
  LAG(sequence) OVER hour_hostname_window AS prev_seq,
  (sequence - LAG(sequence) OVER hour_hostname_window) AS diff_seq,
  dt,
  LAG(dt) OVER hour_hostname_window AS prev_dt
FROM wmf_raw.webrequest
WHERE webrequest_source = 'upload'
  AND year = 2018
  AND month = 12
  AND day = 1
  AND hour = 14
  AND dt != '-'
HAVING (sequence - prev_seq > 10)
WINDOW hour_hostname_window AS (PARTITION BY hour, hostname ORDER by hour, sequence)
""").show(1000)


+----+------------------+----------+----------+--------+-------------------+-------------------+
|hour|          hostname|  sequence|  prev_seq|diff_seq|                 dt|            prev_dt|
+----+------------------+----------+----------+--------+-------------------+-------------------+
|  14|cp5004.eqsin.wmnet|2396115757|2396115681|      76|2018-12-01T14:59:59|2018-12-01T14:59:59|
|  14|cp5004.eqsin.wmnet|2396136261|2396115757|   20504|2018-12-01T14:57:50|2018-12-01T14:59:59|
|  14|cp5004.eqsin.wmnet|2396200222|2396136261|   63961|2018-12-01T14:52:33|2018-12-01T14:57:50|
|  14|cp5004.eqsin.wmnet|2396224314|2396200222|   24092|2018-12-01T14:51:16|2018-12-01T14:52:33|
|  14|cp5004.eqsin.wmnet|2396240536|2396224314|   16222|2018-12-01T14:59:49|2018-12-01T14:51:16|
|  14|cp5004.eqsin.wmnet|2396243068|2396240536|    2532|2018-12-01T14:53:44|2018-12-01T14:59:49|
|  14|cp5004.eqsin.wmnet|2396260283|2396243068|   17215|2018-12-01T14:59:24|2018-12-01T14:53:44|
|  14|cp5004.eqsin.wmnet|2396276116|2396260283|   15833|2018-12-01T14:59:39|2018-12-01T14:59:24|
|  14|cp5004.eqsin.wmnet|2396331834|2396276116|   55718|2018-12-01T14:53:10|2018-12-01T14:59:39|
|  14|cp5004.eqsin.wmnet|2396468206|2396331834|  136372|2018-12-01T14:55:22|2018-12-01T14:53:10|
|  14|cp5004.eqsin.wmnet|2396580291|2396468206|  112085|2018-12-01T14:58:14|2018-12-01T14:55:22|
|  14|cp5004.eqsin.wmnet|2396624225|2396580291|   43934|2018-12-01T14:57:48|2018-12-01T14:58:14|
|  14|cp3036.esams.wmnet|2975552541|2975166049|  386492|2018-12-01T14:53:20|2018-12-01T14:59:59|
|  14|cp3036.esams.wmnet|2975883279|2975552541|  330738|2018-12-01T14:56:54|2018-12-01T14:53:20|
|  14|cp3036.esams.wmnet|2975991479|2975883279|  108200|2018-12-01T14:57:17|2018-12-01T14:56:54|
|  14|cp3036.esams.wmnet|2976106912|2975991479|  115433|2018-12-01T14:59:31|2018-12-01T14:57:17|
|  14|cp3034.esams.wmnet|1873686782|1873670506|   16276|2018-12-01T14:58:30|2018-12-01T14:59:59|
|  14|cp3034.esams.wmnet|1873717576|1873686782|   30794|2018-12-01T14:59:27|2018-12-01T14:58:30|
|  14|cp3034.esams.wmnet|1873880845|1873717576|  163269|2018-12-01T14:59:27|2018-12-01T14:59:27|
|  14|cp2005.codfw.wmnet| 523278660| 523278632|      28|2018-12-01T14:59:59|2018-12-01T14:59:59|
|  14|cp5002.eqsin.wmnet|2363062082|2363061882|     200|2018-12-01T14:59:59|2018-12-01T14:59:59|
|  14|cp5002.eqsin.wmnet|2363100800|2363062082|   38718|2018-12-01T14:59:07|2018-12-01T14:59:59|
|  14|cp5002.eqsin.wmnet|2363119650|2363100800|   18850|2018-12-01T14:58:03|2018-12-01T14:59:07|
|  14|cp5002.eqsin.wmnet|2363123534|2363119650|    3884|2018-12-01T14:58:03|2018-12-01T14:58:03|
|  14|cp5002.eqsin.wmnet|2363171353|2363123535|   47818|2018-12-01T14:58:58|2018-12-01T14:57:23|
|  14|cp5002.eqsin.wmnet|2363318635|2363171353|  147282|2018-12-01T14:55:19|2018-12-01T14:58:58|
|  14|cp5002.eqsin.wmnet|2363324094|2363318635|    5459|2018-12-01T14:53:56|2018-12-01T14:55:19|
|  14|cp5002.eqsin.wmnet|2363419053|2363324094|   94959|2018-12-01T14:56:25|2018-12-01T14:53:56|
|  14|cp5002.eqsin.wmnet|2363655575|2363419053|  236522|2018-12-01T14:59:10|2018-12-01T14:56:25|
|  14|cp1084.eqiad.wmnet| 815066339| 815043026|   23313|2018-12-01T14:59:31|2018-12-01T14:59:59|
|  14|cp5003.eqsin.wmnet|2410946218|2410945684|     534|2018-12-01T14:57:19|2018-12-01T14:59:59|
|  14|cp5003.eqsin.wmnet|2410958979|2410946218|   12761|2018-12-01T14:49:47|2018-12-01T14:57:19|
|  14|cp5003.eqsin.wmnet|2410976324|2410958979|   17345|2018-12-01T14:51:41|2018-12-01T14:49:47|
|  14|cp5003.eqsin.wmnet|2411010753|2410976324|   34429|2018-12-01T14:58:24|2018-12-01T14:51:41|
|  14|cp5003.eqsin.wmnet|2411031858|2411010753|   21105|2018-12-01T14:59:32|2018-12-01T14:58:24|
|  14|cp5003.eqsin.wmnet|2411099872|2411031858|   68014|2018-12-01T14:59:54|2018-12-01T14:59:32|
|  14|cp5003.eqsin.wmnet|2411362070|2411099872|  262198|2018-12-01T14:57:29|2018-12-01T14:59:54|
|  14|cp5001.eqsin.wmnet| 388607377| 388600413|    6964|2018-12-01T14:49:19|2018-12-01T14:59:59|
|  14|cp5001.eqsin.wmnet| 388609864| 388607377|    2487|2018-12-01T14:59:51|2018-12-01T14:49:19|
|  14|cp5001.eqsin.wmnet| 388611464| 388609864|    1600|2018-12-01T14:50:34|2018-12-01T14:59:51|
|  14|cp5001.eqsin.wmnet| 388619497| 388611464|    8033|2018-12-01T14:58:19|2018-12-01T14:50:34|
|  14|cp5001.eqsin.wmnet| 388641869| 388619497|   22372|2018-12-01T14:58:09|2018-12-01T14:58:19|
|  14|cp5001.eqsin.wmnet| 388672509| 388641869|   30640|2018-12-01T14:51:30|2018-12-01T14:58:09|
|  14|cp5001.eqsin.wmnet| 388678653| 388672509|    6144|2018-12-01T14:58:18|2018-12-01T14:51:30|
|  14|cp5001.eqsin.wmnet| 388879504| 388678653|  200851|2018-12-01T14:54:14|2018-12-01T14:58:18|
|  14|cp5001.eqsin.wmnet| 389046938| 388879504|  167434|2018-12-01T14:55:52|2018-12-01T14:54:14|
|  14|cp5001.eqsin.wmnet| 389146001| 389046938|   99063|2018-12-01T14:57:35|2018-12-01T14:55:52|
|  14|cp3044.esams.wmnet|2028778802|2028770005|    8797|2018-12-01T14:57:09|2018-12-01T14:59:59|
|  14|cp3044.esams.wmnet|2028859118|2028778802|   80316|2018-12-01T14:58:33|2018-12-01T14:57:09|
|  14|cp3044.esams.wmnet|2029008837|2028859118|  149719|2018-12-01T14:55:12|2018-12-01T14:58:33|
|  14|cp3044.esams.wmnet|2029248813|2029008837|  239976|2018-12-01T14:54:48|2018-12-01T14:55:12|
|  14|cp3044.esams.wmnet|2029579549|2029248813|  330736|2018-12-01T14:57:14|2018-12-01T14:54:48|
|  14|cp3047.esams.wmnet|1581189357|1580778440|  410917|2018-12-01T14:53:07|2018-12-01T14:59:59|
|  14|cp3039.esams.wmnet|2925061520|2924225848|  835672|2018-12-01T14:59:14|2018-12-01T14:59:59|
|  14|cp3039.esams.wmnet|2925170560|2925061520|  109040|2018-12-01T14:59:07|2018-12-01T14:59:14|
|  14|cp3039.esams.wmnet|2925205068|2925170560|   34508|2018-12-01T14:59:45|2018-12-01T14:59:07|
|  14|cp5006.eqsin.wmnet|1972675333|1972633498|   41835|2018-12-01T14:58:58|2018-12-01T14:59:59|
|  14|cp5006.eqsin.wmnet|1972679608|1972675333|    4275|2018-12-01T14:58:51|2018-12-01T14:58:58|
|  14|cp5006.eqsin.wmnet|1972909675|1972679608|  230067|2018-12-01T14:53:46|2018-12-01T14:58:51|
|  14|cp5006.eqsin.wmnet|1972955265|1972909675|   45590|2018-12-01T14:54:21|2018-12-01T14:53:46|
|  14|cp3045.esams.wmnet|2036434594|2036431061|    3533|2018-12-01T14:57:48|2018-12-01T14:59:59|
|  14|cp1086.eqiad.wmnet|1942509620|1942509385|     235|2018-12-01T14:59:59|2018-12-01T14:59:59|
|  14|cp1088.eqiad.wmnet|2148435111|2148416264|   18847|2018-12-01T14:59:26|2018-12-01T14:59:59|
|  14|cp3038.esams.wmnet|2678725127|2678656478|   68649|2018-12-01T14:50:31|2018-12-01T14:59:59|
|  14|cp3038.esams.wmnet|2679089425|2678725127|  364298|2018-12-01T14:54:16|2018-12-01T14:50:31|
|  14|cp3035.esams.wmnet|2824347424|2823557155|  790269|2018-12-01T14:58:45|2018-12-01T14:59:59|
|  14|cp3046.esams.wmnet|3289197438|3289196147|    1291|2018-12-01T14:59:59|2018-12-01T14:59:59|
|  14|cp3046.esams.wmnet|3289610551|3289197438|  413113|2018-12-01T14:53:26|2018-12-01T14:59:59|
|  14|cp3046.esams.wmnet|3289669279|3289610551|   58728|2018-12-01T14:53:31|2018-12-01T14:53:26|
|  14|cp5005.eqsin.wmnet|2455820583|2455818605|    1978|2018-12-01T14:59:55|2018-12-01T14:59:59|
|  14|cp5005.eqsin.wmnet|2455865141|2455820583|   44558|2018-12-01T14:58:28|2018-12-01T14:59:55|
|  14|cp5005.eqsin.wmnet|2455865980|2455865142|     838|2018-12-01T14:58:08|2018-12-01T14:58:26|
|  14|cp5005.eqsin.wmnet|2455895817|2455865980|   29837|2018-12-01T14:58:45|2018-12-01T14:58:08|
|  14|cp5005.eqsin.wmnet|2455956524|2455895817|   60707|2018-12-01T14:54:23|2018-12-01T14:58:45|
|  14|cp5005.eqsin.wmnet|2455960290|2455956524|    3766|2018-12-01T14:59:18|2018-12-01T14:54:23|
|  14|cp5005.eqsin.wmnet|2456130377|2455960290|  170087|2018-12-01T14:54:04|2018-12-01T14:59:18|
|  14|cp5005.eqsin.wmnet|2456265260|2456130377|  134883|2018-12-01T14:56:37|2018-12-01T14:54:04|
|  14|cp5005.eqsin.wmnet|2456280714|2456265260|   15454|2018-12-01T14:59:01|2018-12-01T14:56:37|
|  14|cp3037.esams.wmnet|2709124094|2709121768|    2326|2018-12-01T14:57:22|2018-12-01T14:59:59|
|  14|cp3037.esams.wmnet|2709202162|2709124094|   78068|2018-12-01T14:53:41|2018-12-01T14:57:22|
|  14|cp4022.ulsfo.wmnet| 802226878| 802207151|   19727|2018-12-01T14:57:59|2018-12-01T14:59:59|
+----+------------------+----------+----------+--------+-------------------+-------------------+

Something to notice is that the problem with the rows listed above occurs all along the hour but only shows-up at the end-border of the calendar-hour since this is the time we finalize the batch. The following queries count how many rows have
big discrepancies in sequence-numbers (next sequence being more than 100 bigger instead of 1) when grouped per minute batches (instead of hour-batches), for the same hour as above:

spark.sql("""
SELECT COUNT(1) FROM (
SELECT
  sequence,
  LAG(sequence) OVER minute_hostname_window AS prev_seq
FROM wmf_raw.webrequest
WHERE webrequest_source = 'upload'
  AND year = 2018
  AND month = 12
  AND day = 1
  AND hour = 14
  AND dt != '-'
-- Catch only requests having a big diff in seq
HAVING abs(sequence - prev_seq) > 100
WINDOW minute_hostname_window AS (PARTITION BY substr(dt, 15, 2), hostname ORDER by hour, sequence)
) t
""").show(1000)


+--------+                                                                      
|count(1)|
+--------+
|    1908|
+--------+

When I try to repro these steps, i get a permission denied error in spark shell?

spark.sql("select * from wmf_raw.webrequest where webrequest_source = 'upload' and year = 2018 and month = 12 and day = 1 and hour = 14 and hostname = 'cp3036.esams.wmnet'").createOrReplaceTempView("wr")

18/12/03 23:31:12 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.

scala> spark.sql("desc wr");
res3: org.apache.spark.sql.DataFrame = [col_name: string, data_type: string ... 1 more field]

scala> spark.sql("select * from wr limit 10"). show(10)
Permission denied: user=nuria, access=READ, inode="/wmf/data/raw/webrequest/webrequest_upload/hourly/2018/12/01/14/webrequest_upload.1001.2.868557.38489041706.1543672800000":hdfs:hadoop:-rw-r-----

I looked at webrequests for "Problematic requests for end-of-hour upload 2018-12-01T14" (all "sequences" in table above) . Of the 82 requests 52 come from asia (at 11pm at nite) but I really cannot find a common pattern in the data.

Also , response sizes do not seem any special:
16264 /wikipedia/commons/thumb/8/80/Mamas_and_the_Papas%27_John_Phillips_in_1967.JPG/438px-Mamas_and_the_Papas%27_John_Phillips_in_1967.JPG
13481 /wikipedia/commons/thumb/c/cd/Bush_meets_Pranab_Mukherjee.jpg/220px-Bush_meets_Pranab_Mukherjee.jpg
174092 /wikipedia/commons/d/db/MaryRose-carpentry_tools1.jpg
257089 /wikipedia/commons/d/d9/OO_Scale_BR_25054.JPG
2405 /wikipedia/commons/thumb/6/6e/Peacock_pose.jpg/110px-Peacock_pose.jpg
4951 /wikipedia/commons/f/f6/Emmeloord-Berkenlaan_2.JPG
84030 /wikipedia/commons/thumb/5/5d/USSArizonaSinkingsimple.png/640px-USSArizonaSinkingsimple.png
187271 /wikipedia/commons/d/d4/BASA-56K-1-6-1-Asen_Belkovski.JPG
11032 /wikipedia/commons/thumb/archive/1/1a/20180918145659%21Engelbert_Dollfuss.png/98px-Engelbert_Dollfuss.png
21498 /wikipedia/commons/thumb/e/e1/Hillary_Clinton_Raleigh_%2829892054003%29.jpg/220px-Hillary_Clinton_Raleigh_%2829892054003%29.jpg
51295 /wikipedia/commons/7/72/USS_Wyoming_BB-32_circa_1912-13.jpg
12299 /wikipedia/commons/thumb/b/bc/%D0%9C%D0%90%D0%9A%D0%A1-2007-%D0%97%D0%92%D0%93-007.jpg/640px-%D0%9C%D0%90%D0%9A%D0%A1-2007-%D0%97%D0%92%D0%93-007.jpg
752 /wikipedia/commons/thumb/a/a4/Echo_I_starboard.jpg/280px-Echo_I_starboard.jpg
14226 /wikipedia/commons/thumb/c/c8/DogenP2.JPG/225px-DogenP2.JPG
19035 /wikipedia/commons/thumb/2/23/Scadbury_Park_path.JPG/1024px-Scadbury_Park_path.JPG
14048 /wikipedia/commons/thumb/3/31/Coat_of_arms_Holy_See.svg/113px-Coat_of_arms_Holy_See.svg.png
85309 /wikipedia/commons/2/2f/Lwówek_Śląski_Ratusz_(3).JPG
1053 /wikipedia/commons/thumb/4/41/Creative-Tail-Halloween-blood.svg/24px-Creative-Tail-Halloween-blood.svg.png
285 /wikipedia/commons/d/d4/Button_hide.png
37427 /wikipedia/id/8/8e/Venturer.png
12924 /wikipedia/commons/9/91/Quercus_falcata_leaves.jpg
13776 /wikipedia/commons/thumb/a/af/MER_STS-107_Memorial.JPG/220px-MER_STS-107_Memorial.JPG
70422 /wikipedia/commons/thumb/b/b8/Bazylika_katedralna_w_Toruniu.jpg/2560px-Bazylika_katedralna_w_Toruniu.jpg
66165 /wikipedia/commons/b/bc/The_Souper_Dress%2C_American_paper_dress%2C_1967.jpg
102791 /wikipedia/commons/thumb/2/24/Merida.jpg/720px-Merida.jpg
8701 /wikipedia/commons/thumb/4/4c/Fleur_de_lys_%28or%29.svg/84px-Fleur_de_lys_%28or%29.svg.png
1592639 /wikipedia/commons/0/04/2018-11-24_Women%27s_World_Cup_at_2018-19_Luge_World_Cup_in_Igls_by_Sandro_Halank%E2%80%93187.jpg
601 /wikipedia/commons/thumb/3/3e/Disambig_colour.svg/20px-Disambig_colour.svg.png
13201 /wikipedia/commons/thumb/d/dc/Tabularium-and-temple.jpg/220px-Tabularium-and-temple.jpg
427564 /wikipedia/commons/4/47/2018-11-24_Women%27s_World_Cup_at_2018-19_Luge_World_Cup_in_Igls_by_Sandro_Halank%E2%80%93181.jpg
33175 /wikipedia/commons/3/3f/Common_Redshank.jpg
12817 /wikipedia/commons/thumb/b/b8/Map_of_BISAN_Area_%2C_Hiroshima_1929.jpg/1024px-Map_of_BISAN_Area_%2C_Hiroshima_1929.jpg
14404 /wikipedia/commons/thumb/6/66/2015_Shamrock_Run.jpg/1280px-2015_Shamrock_Run.jpg
22371 /wikipedia/commons/thumb/a/ae/%E7%9C%BE%E6%81%A9%E7%A5%A0%E4%B8%80%E6%99%AF.jpg/1280px-%E7%9C%BE%E6%81%A9%E7%A5%A0%E4%B8%80%E6%99%AF.jpg
12467 /wikipedia/commons/thumb/d/d9/%E6%88%90%E5%90%89%E6%80%9D%E6%B1%97%E5%B9%BF%E5%9C%BA02.jpg/220px-%E6%88%90%E5%90%89%E6%80%9D%E6%B1%97%E5%B9%BF%E5%9C%BA02.jpg
11559 /wikipedia/commons/thumb/6/6b/American_Beaver.jpg/150px-American_Beaver.jpg
61839 /wikipedia/commons/2/23/ABARTH_124_SPIDER_%28ABARTH_Osaka-Higashi_demo_car%29_rear.jpg
6193 /wikipedia/commons/thumb/6/6f/Beijing_hutong_2005-3.JPG/300px-Beijing_hutong_2005-3.JPG
13928 /wikipedia/ru/7/71/Official_disney_princess.png
4986 /wikipedia/commons/1/1d/Boeing_727-2L5-Adv%2C_Libyan_Arab_Airlines_AN1046683.jpg
4953 /wikipedia/ja/thumb/d/df/Bamboo_Ere.jpg/480px-Bamboo_Ere.jpg
39249 /wikipedia/commons/thumb/a/a4/Ikaruga_Town_Office.jpg/800px-Ikaruga_Town_Office.jpg
25206 /wikipedia/commons/thumb/3/3e/Rose_McGowan_2012.jpg/221px-Rose_McGowan_2012.jpg
3783 /wikipedia/commons/thumb/d/db/Sabina_El_Hierro.jpg/120px-Sabina_El_Hierro.jpg
62735 /wikipedia/commons/thumb/0/06/Ax10akb18.jpg/2880px-Ax10akb18.jpg
20447 /wikipedia/commons/6/69/Belgia%27s_welfare.jpg
44436 /wikipedia/commons/3/34/SCR188.jpg
235053 /wikipedia/commons/thumb/9/9e/Macau_Tower_CE_Centre.jpg/1200px-Macau_Tower_CE_Centre.jpg
1119 /wikipedia/commons/thumb/4/4a/Arabic-dad-letter-2.svg/20px-Arabic-dad-letter-2.svg.png
9267 /wikipedia/commons/thumb/3/35/Ed_Norton_Shankbone_Metropolitan_Opera_2009.jpg/200px-Ed_Norton_Shankbone_Metropolitan_Opera_2009.jpg
4885 /wikipedia/commons/8/80/Waterfall_Naran_valley%27.jpg
79774 /wikipedia/commons/thumb/d/d0/The_Jack_Pine%2C_by_Tom_Thomson.jpg/220px-The_Jack_Pine%2C_by_Tom_Thomson.jpg
9161 /wikipedia/commons/thumb/4/4e/Uranium_series.gif/440px-Uranium_series.gif
13093 /wikipedia/commons/thumb/d/dc/Tabularium-and-temple.jpg/220px-Tabularium-and-temple.jpg
117643 /wikipedia/commons/thumb/7/76/Marquartstein.png/1280px-Marquartstein.png
?lang=ar /osm-intl/13/4896/3331@1.3x.png
6957 /wikipedia/commons/thumb/7/7e/Cultivation.jpg/330px-Cultivation.jpg
110618 /wikipedia/commons/thumb/6/69/Atlantis_welcome_home_ceremony_outside_the_OPF_July_22.png/660px-Atlantis_welcome_home_ceremony_outside_the_OPF_July_22.png
?lang=ja /osm-intl/14/14319/6496@1.3x.png
81399 /wikipedia/commons/thumb/8/81/Stilleven_in_rood_groen_en_koper-Jules_Grandgagnage.jpg/1200px-Stilleven_in_rood_groen_en_koper-Jules_Grandgagnage.jpg
151609 /wikipedia/commons/1/1e/Vincom_City_Tower_02.JPG
156548 /wikipedia/commons/0/00/18th_century_Panchatantra_manuscript_page%2C_the_talkative_turtle.jpg
235265 /wikipedia/commons/b/bd/Absolut_Apeach_Vodka_2.jpg
9039 /wikipedia/commons/d/d5/Dumas_-_Vingt_ans_apr%C3%A8s%2C_1846%2C_figure_page_0445.png
4222 /wikipedia/commons/thumb/3/3f/Tazara_crossing_bridge.jpg/220px-Tazara_crossing_bridge.jpg
3921 /wikipedia/commons/thumb/4/41/Portal_LGBT.svg/50px-Portal_LGBT.svg.png
14762 /wikipedia/commons/thumb/c/ce/Epanastasi.jpg/218px-Epanastasi.jpg
78628 /wikipedia/commons/thumb/7/74/Tuscanny_street.jpg/480px-Tuscanny_street.jpg
40908 /wikipedia/commons/thumb/f/f9/Animal_tales_cats_thai_or_16797_f003v.jpg/1280px-Animal_tales_cats_thai_or_16797_f003v.jpg
9039 /wikipedia/commons/thumb/7/79/Photo_of_Taipei_National_University_of_Arts_02.JPG/600px-Photo_of_Taipei_National_University_of_Arts_02.JPG
66778 /wikipedia/commons/9/98/Eduardo_Zaplana_en_la_rueda_de_prensa_posterior_al_Consejo_de_Ministros_%28cropped%29.jpg
120779 /wikipedia/commons/thumb/9/9a/Hiyoshi-taisha_omiyabashi02s4592.jpg/1920px-Hiyoshi-taisha_omiyabashi02s4592.jpg
29884 /wikipedia/commons/1/10/Suzuki_Karimun_Wagon-R_%28rear_right%29%2C_Denpasar.jpg
111355 /wikipedia/commons/thumb/6/67/Rio_Tefe.jpg/1200px-Rio_Tefe.jpg
311858 /wikipedia/commons/5/5b/Pterocles_decoratus_-Serengeti_National_Park%2C_Tanzania_-pair-8.jpg
4528 /wikipedia/commons/thumb/3/33/Pinocchio%27s_village.jpg/440px-Pinocchio%27s_village.jpg
4355 /wikipedia/commons/e/e5/Red_crown.png
235154 /wikipedia/commons/b/bd/Absolut_Apeach_Vodka_2.jpg
291 /wikipedia/commons/thumb/b/be/Flag_of_England.svg/50px-Flag_of_England.svg.png
88921 /wikipedia/ja/thumb/1/13/Fukutetsu-Step.JPG/744px-Fukutetsu-Step.JPG
4881 /wikipedia/zh/thumb/f/fe/Daddy%27s_Home_Poster.jpg/216px-Daddy%27s_Home_Poster.jpg

As FYI ema told me that https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/477424/ reverted https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/476311, an experiment to disable N-hit-wonder for some days. This caused issues while loading images from commons - https://phabricator.wikimedia.org/T210890 (hence upload).

Timeline:

Text of the revert:

This change appears to have triggered a high rate of problems on
upload frontends with "ExpKill LRU_Exhausted", which is resulting
in random client connection closes before all bytes are
transferred.

Change 478626 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Update webrequest data-loss false positive check

https://gerrit.wikimedia.org/r/478626

Change 478626 merged by Ottomata:
[analytics/refinery@master] Update webrequest data-loss false positive check

https://gerrit.wikimedia.org/r/478626

Nuria renamed this task from Failure while refining webrequest upload 2018-12-01-14 to Failure while refining webrequest upload 2018-12-01-14. Upgrade alarms.Dec 12 2018, 12:23 AM
Nuria moved this task from In Code Review to Ready to Deploy on the Analytics-Kanban board.
Nuria set the point value for this task to 5.