
missing wmf_netflow data, 18:30-19:00 May 31
Closed, ResolvedPublic

Description

0 data points for wmf_netflow in this interval: https://w.wiki/SVo

I checked a few nfacctd exporters and they were all sending data to Kafka in that interval.
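For future reference, a hole like the one in the Grafana graph can be spotted directly from the datasource's timestamps; below is a minimal sketch in Python (the sample series and the 15-minute threshold are illustrative assumptions, not taken from the actual query):

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, max_gap=timedelta(minutes=15)):
    """Return (start, end) pairs where consecutive data points
    are further apart than max_gap."""
    ts = sorted(timestamps)
    return [(a, b) for a, b in zip(ts, ts[1:]) if b - a > max_gap]

# Illustrative series reproducing the 18:30-19:00 hole on May 31st,
# with one data point every 5 minutes on either side of it.
points = [datetime(2020, 5, 31, 18, 0) + timedelta(minutes=5 * i) for i in range(7)]
points += [datetime(2020, 5, 31, 19, 0) + timedelta(minutes=5 * i) for i in range(7)]

for start, end in find_gaps(points):
    print(f"gap: {start} -> {end}")  # prints the 18:30 -> 19:00 hole
```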

Event Timeline

Restricted Application added a subscriber: Aklapper.

The missing data spans May 31st 18:30 to 19:00. I did a quick check via Spark, and the data appears to be present on HDFS:

scala> spark.sql("select stamp_inserted from wmf.netflow where year=2020 and month=05 and day=31 and hour=18 and stamp_inserted like '2020-05-31 18:4%' limit 20").show(20);
+-------------------+
|     stamp_inserted|
+-------------------+
|2020-05-31 18:43:00|
|2020-05-31 18:40:00|
|2020-05-31 18:46:00|
|2020-05-31 18:43:00|
|2020-05-31 18:48:00|
|2020-05-31 18:40:00|
|2020-05-31 18:43:00|
|2020-05-31 18:43:00|
|2020-05-31 18:41:00|
|2020-05-31 18:42:00|
|2020-05-31 18:49:00|
|2020-05-31 18:40:00|
|2020-05-31 18:43:00|
|2020-05-31 18:49:00|
|2020-05-31 18:41:00|
|2020-05-31 18:43:00|
|2020-05-31 18:44:00|
|2020-05-31 18:48:00|
|2020-05-31 18:47:00|
|2020-05-31 18:44:00|
+-------------------+
scala> spark.sql("select count(*) from wmf.netflow where year=2020 and month=05 and day=31 and hour=18").show();
20/06/04 07:09:37 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
+--------+
|count(1)|
+--------+
|12405683|
+--------+


scala> spark.sql("select count(*) from wmf.netflow where year=2020 and month=05 and day=31 and hour=19").show();
+--------+
|count(1)|
+--------+
|12335817|
+--------+


scala> spark.sql("select count(*) from wmf.netflow where year=2020 and month=05 and day=31 and hour=17").show();
+--------+
|count(1)|
+--------+
|12569783|
+--------+

The hole is now gone, but we discovered a major problem in T254383 :(