
Repopulate missing coal data in Graphite for 2019-04-17 outage
Closed, Resolved · Public · May 14 2019

Assigned To
Authored By
Krinkle
Apr 18 2019, 7:20 PM
Referenced Files
F29595269: graphite.wikimedia.png
Jun 17 2019, 11:55 PM
F29595360: graphite.wikimedia.png
Jun 17 2019, 11:55 PM
F29516492: graphite.wikimedia.png
Jun 13 2019, 5:07 PM
F28698215: graphite.wikimedia.png
Apr 18 2019, 7:20 PM

Description

https://graphite.wikimedia.org/render/?width=800&height=400&target=coal.domInteractive&from=20190414&until=18:00_20190418

graphite.wikimedia.png (400×800 px, 34 KB)

Todo in the next couple weeks (e.g. before May 1st, let's say), as long as the data is still in Kafka.

Event Timeline

Gilles triaged this task as Medium priority.

I inspected the code carefully and couldn't find what might have gone wrong with our code.

Seems like it could be this bug in the python-kafka library, since we're running 1.4.3:

https://github.com/dpkp/kafka-python/issues/1590

This is currently the latest version available in Debian: https://packages.debian.org/buster/python-kafka

We can package an update ourselves, I guess. We seem to already have backported 1.4.3 ourselves anyway:

*** 1.4.3-1~stretch1 1001
       1001 http://apt.wikimedia.org/wikimedia stretch-wikimedia/main amd64 Packages

I'll give it a try.

Change 506612 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/coal@master] Make coal write to logs under normal conditions

https://gerrit.wikimedia.org/r/506612

Change 506612 abandoned by Gilles:
Make coal write to logs under normal conditions

https://gerrit.wikimedia.org/r/506612

Change 506626 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Fix coal syslog logging

https://gerrit.wikimedia.org/r/506626

Change 506640 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/coal@master] Add ability to start consuming at specific timestamp

https://gerrit.wikimedia.org/r/506640

Change 506626 merged by Effie Mouzeli:
[operations/puppet@production] Fix coal syslog logging

https://gerrit.wikimedia.org/r/506626

With coal logging fixed and the python-kafka library updated, I think the issue that caused the breakage should be fixed. If not, we'll be better prepared to understand it next time it happens.

Krinkle set Due Date to May 14 2019, 11:00 PM.
Restricted Application changed the subtype of this task from "Task" to "Deadline". · May 4 2019, 3:10 PM
Gilles raised the priority of this task from Medium to High.
Gilles added a subscriber: Gilles.

Change 506640 merged by jenkins-bot:
[performance/coal@master] Add ability to start consuming at specific timestamp

https://gerrit.wikimedia.org/r/506640

Mentioned in SAL (#wikimedia-operations) [2019-05-14T16:04:46Z] <gilles@deploy1001> Started deploy [performance/coal@5a32eb2]: T221401

Mentioned in SAL (#wikimedia-operations) [2019-05-14T16:04:52Z] <gilles@deploy1001> Finished deploy [performance/coal@5a32eb2]: T221401 (duration: 00m 06s)

Currently attempting to reprocess that timeframe with the following command:

/usr/bin/python /srv/deployment/performance/coal/run_coal.py --brokers kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092,kafka-jumbo1003.eqiad.wmnet:9092,kafka-jumbo1004.eqiad.wmnet:9092,kafka-jumbo1005.eqiad.wmnet:9092,kafka-jumbo1006.eqiad.wmnet:9092 --consumer-group coal_eqiad_recovery --schema NavigationTiming --schema SaveTiming --schema PaintTiming --graphite-host graphite-in.eqiad.wmnet --graphite-port 2003 --graphite-prefix coal --start-timestamp 1555459200

I don't see a gap anymore for coal.saveTiming nor for coal.firstPaint, so I'm going to restart the command with it only looking at the NavigationTiming schema, so it doesn't waste time processing already processed events for the other schemas:

/usr/bin/python /srv/deployment/performance/coal/run_coal.py --brokers kafka-jumbo1001.eqiad.wmnet:9092,kafka-jumbo1002.eqiad.wmnet:9092,kafka-jumbo1003.eqiad.wmnet:9092,kafka-jumbo1004.eqiad.wmnet:9092,kafka-jumbo1005.eqiad.wmnet:9092,kafka-jumbo1006.eqiad.wmnet:9092 --consumer-group coal_eqiad_recovery --schema NavigationTiming --graphite-host graphite-in.eqiad.wmnet --graphite-port 2003 --graphite-prefix coal --start-timestamp 1555502400

Ah, the schema option is "append", so I'm bound to waste time processing the other schemas...
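
A minimal sketch of the argparse behaviour this most likely refers to. It assumes coal declares --schema with action="append" and a non-empty default list; the default list below is hypothetical and not taken from coal's source. With that combination, values given on the command line are added to the defaults rather than replacing them:

```python
import argparse

# Hypothetical defaults, for illustration only.
DEFAULT_SCHEMAS = ["NavigationTiming", "SaveTiming", "PaintTiming"]


def build_parser(default):
    parser = argparse.ArgumentParser()
    parser.add_argument("--schema", action="append", default=default)
    return parser


# With a list default, the command-line value is appended to the defaults.
appended = build_parser(list(DEFAULT_SCHEMAS)).parse_args(
    ["--schema", "NavigationTiming"]
).schema

# With default=None, only the values given on the command line are kept,
# and the caller falls back to the defaults when none were given.
explicit = build_parser(None).parse_args(["--schema", "NavigationTiming"]).schema
fallback = build_parser(None).parse_args([]).schema or DEFAULT_SCHEMAS

print(appended)
print(explicit)
print(fallback)
```

If that is indeed the cause, using default=None plus an explicit fallback would let a single --schema restrict the run to one schema.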

It doesn't seem to be working: I've had it running for a while and it's still not filling the gap. This will require manual investigation. Or maybe it's too late already and the oldest data in Kafka is too recent?

I think the retention is worse than we thought, so it looks like it's too late.

First, I've just realized that it's millisecond timestamps, not second timestamps. Regardless, it doesn't help because the oldest data is too recent.

See:

>>> from kafka import KafkaConsumer, TopicPartition
>>> consumer = KafkaConsumer(bootstrap_servers=['kafka-jumbo1001.eqiad.wmnet:9092','kafka-jumbo1002.eqiad.wmnet:9092','kafka-jumbo1003.eqiad.wmnet:9092','kafka-jumbo1004.eqiad.wmnet:9092','kafka-jumbo1005.eqiad.wmnet:9092','kafka-jumbo1006.eqiad.wmnet:9092'], group_id='coal_eqiad_test', enable_auto_commit=False)
>>> timestamps = {}
>>> timestamps[TopicPartition('eventlogging_NavigationTiming', 0)] = 1555502400000
>>> offsets = consumer.offsets_for_times(timestamps)
>>> offsets
{TopicPartition(topic=u'eventlogging_NavigationTiming', partition=0): OffsetAndTimestamp(offset=297101788, timestamp=1557218989146)}

That timestamp we're getting back is from May 7 :(
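
For anyone repeating this, a small sketch of the seconds/milliseconds conversion involved. The helper names are made up for illustration; only the numeric values come from the commands and output above. It uses integer arithmetic so no float rounding is involved:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(dt):
    """Convert a timezone-aware datetime to integer milliseconds since the epoch."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def from_epoch_ms(ms):
    """Convert integer milliseconds since the epoch to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


# Noon UTC on the day of the outage, in milliseconds.
start = to_epoch_ms(datetime(2019, 4, 17, 12, 0, tzinfo=timezone.utc))
print(start)  # 1555502400000

# The timestamp Kafka returned for the oldest available message.
print(from_epoch_ms(1557218989146).isoformat())  # 2019-05-07T08:49:49.146000+00:00

# A seconds value read as milliseconds lands in January 1970.
print(from_epoch_ms(1555502400).year)  # 1970
```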

Oh well, at least we've improved the tooling for future issues of that nature...

Change 510201 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/coal@master] Clarify unit for start-timestamp

https://gerrit.wikimedia.org/r/510201

Aye, confirmed from kafkacat as well.

[17:05 UTC] krinkle at stat1004.eqiad.wmnet in ~
$ kafkacat -C -b 'kafka-jumbo1001.eqiad.wmnet:9092' -t eventlogging_NavigationTiming -o beginning -c 1
{"dt": "2019-05-07T08:49:48Z", "event": {… }, "schema": "NavigationTiming",  …}

That's too bad!

@Gilles one thing I was wondering is how it interacts with the stored offsets from the primary instance (regarding consumer groups). Does this parameter disable that mechanism somehow?

Might be worth writing down a few words in the runbook for how to backfill.

When using start-timestamp while the primary instance is running, you simply need to use a different consumer group. Otherwise the command fails anyway, complaining that there's already a subscriber for that consumer group. So the manual backfill and the ongoing process don't interfere with each other.
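
As a rough illustration of seeking by timestamp, here is a sketch built on the same kafka-python calls used in the REPL session above (offsets_for_times, plus assign, seek and seek_to_end). This is not necessarily how coal implements --start-timestamp; the function name and return value are assumptions made for the example. The consumer is passed in, so the snippet itself needs no broker:

```python
def seek_to_timestamp(consumer, partitions, start_ms):
    """Position `consumer` at the first message at or after `start_ms`.

    `consumer` is expected to behave like kafka.KafkaConsumer and
    `partitions` to be a list of kafka.TopicPartition objects.
    Returns {partition: timestamp in ms of the first message found, or None}.
    """
    consumer.assign(partitions)
    found = consumer.offsets_for_times({tp: start_ms for tp in partitions})
    first_seen = {}
    for tp in partitions:
        hit = found.get(tp)
        if hit is None:
            # No message at or after start_ms in this partition.
            consumer.seek_to_end(tp)
            first_seen[tp] = None
        else:
            consumer.seek(tp, hit.offset)
            first_seen[tp] = hit.timestamp
    return first_seen
```

Comparing the returned timestamps with start_ms before processing anything would reveal up front when the requested period has already aged out of Kafka, as happened here.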

Change 510201 merged by jenkins-bot:
[performance/coal@master] Clarify unit for start-timestamp

https://gerrit.wikimedia.org/r/510201

I added a short blurb about backfilling data on the runbook.

Talked with @Ottomata and he mentioned the data is still in Hadoop, but that is too much work for me to figure out now. But, it's also in /srv/log/eventlogging/archive/all-events.log* on stat1007 in the same format Kafka would provide, which I figured should be easy to pull into our script.

  • The all-events.log-20190417 file is 21G uncompressed.
  • Filtered down to schema=NavigationTiming, it is 1.4G and contains 1.25 million messages. Given an average of about 1,000 NavigationTiming events per minute according to our dashboard (amounts to 1.40 million per day), that is in the same ballpark, so it was likely filtered correctly.
  • Compressed as all-events.log-20190417.NavigationTiming.gz, it is about 134 MB.
  • Transferred via cumin as intermediary host from stat1007 to webperf1001 by @Volans. (Thank you!)
  • Adapted coal.py in my home directory to read from a file instead of Kafka. (https://gist.github.com/Krinkle/c4686a613f51f5202068d648e534f384)
  • Run once with prefix coal_tmp to verify that it works.
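
The filtering step above could be sketched as follows. This is a hypothetical stand-in, not the script from the gist; it assumes one JSON object per line with a top-level "schema" field, as in the kafkacat output earlier in this task:

```python
import json


def filter_schema(lines, schema="NavigationTiming"):
    """Yield only the lines whose JSON object has the given schema."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except ValueError:
            # Skip lines that are not valid JSON instead of aborting the run.
            continue
        if isinstance(event, dict) and event.get("schema") == schema:
            yield line


sample = [
    '{"schema": "NavigationTiming", "dt": "2019-04-17T14:00:01Z"}',
    '{"schema": "SaveTiming", "dt": "2019-04-17T14:00:02Z"}',
    "not json",
]
kept = list(filter_schema(sample))
print(len(kept))  # 1
```

Because it works on any iterable of lines, the same function can be fed an open file handle, which keeps memory use flat on a 21G input.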

graphite.wikimedia.png (400×800 px, 59 KB)

The result looks quite unsophisticated. The line barely moves at all with seemingly few data points. I sorted the whole data set by dt just in case, and ran it again, but the result was nearly identical.

I thought maybe there was just very little data submitted during that time for some reason or another, but looking at it manually shows that for each of the minutes in the gap (14:00 - 19:00), as well as the rest of the day, we got 800-1000 data points, as normal.

Sample count by minute
$ cat all-events.log-20190417_real.NavigationTiming.dtsorted | grep -o '"dt": "2019-04-17T[1]....' | uniq -c
    759 "dt": "2019-04-17T10:00
    734 "dt": "2019-04-17T10:01
    814 "dt": "2019-04-17T10:02
    775 "dt": "2019-04-17T10:03
    827 "dt": "2019-04-17T10:04
    761 "dt": "2019-04-17T10:05
    780 "dt": "2019-04-17T10:06
    771 "dt": "2019-04-17T10:07
    747 "dt": "2019-04-17T10:08
    834 "dt": "2019-04-17T10:09
    783 "dt": "2019-04-17T10:10
    793 "dt": "2019-04-17T10:11
    782 "dt": "2019-04-17T10:12
    848 "dt": "2019-04-17T10:13
    797 "dt": "2019-04-17T10:14
    761 "dt": "2019-04-17T10:15
    775 "dt": "2019-04-17T10:16
    812 "dt": "2019-04-17T10:17
    778 "dt": "2019-04-17T10:18
    815 "dt": "2019-04-17T10:19
    790 "dt": "2019-04-17T10:20
    802 "dt": "2019-04-17T10:21
    811 "dt": "2019-04-17T10:22
    782 "dt": "2019-04-17T10:23
    798 "dt": "2019-04-17T10:24
    754 "dt": "2019-04-17T10:25
    751 "dt": "2019-04-17T10:26
    794 "dt": "2019-04-17T10:27
    816 "dt": "2019-04-17T10:28
    808 "dt": "2019-04-17T10:29
    828 "dt": "2019-04-17T10:30
    776 "dt": "2019-04-17T10:31
    799 "dt": "2019-04-17T10:32
    802 "dt": "2019-04-17T10:33
    803 "dt": "2019-04-17T10:34
    846 "dt": "2019-04-17T10:35
    784 "dt": "2019-04-17T10:36
    781 "dt": "2019-04-17T10:37
    810 "dt": "2019-04-17T10:38
    873 "dt": "2019-04-17T10:39
    800 "dt": "2019-04-17T10:40
    792 "dt": "2019-04-17T10:41
    800 "dt": "2019-04-17T10:42
    825 "dt": "2019-04-17T10:43
    807 "dt": "2019-04-17T10:44
    798 "dt": "2019-04-17T10:45
    858 "dt": "2019-04-17T10:46
    801 "dt": "2019-04-17T10:47
    839 "dt": "2019-04-17T10:48
    823 "dt": "2019-04-17T10:49
    831 "dt": "2019-04-17T10:50
    836 "dt": "2019-04-17T10:51
    792 "dt": "2019-04-17T10:52
    803 "dt": "2019-04-17T10:53
    828 "dt": "2019-04-17T10:54
    830 "dt": "2019-04-17T10:55
    824 "dt": "2019-04-17T10:56
    854 "dt": "2019-04-17T10:57
    828 "dt": "2019-04-17T10:58
    901 "dt": "2019-04-17T10:59
    797 "dt": "2019-04-17T11:00
    793 "dt": "2019-04-17T11:01
    791 "dt": "2019-04-17T11:02
    751 "dt": "2019-04-17T11:03
    792 "dt": "2019-04-17T11:04
    797 "dt": "2019-04-17T11:05
    826 "dt": "2019-04-17T11:06
    803 "dt": "2019-04-17T11:07
    780 "dt": "2019-04-17T11:08
    798 "dt": "2019-04-17T11:09
    805 "dt": "2019-04-17T11:10
    880 "dt": "2019-04-17T11:11
    790 "dt": "2019-04-17T11:12
    844 "dt": "2019-04-17T11:13
    746 "dt": "2019-04-17T11:14
    801 "dt": "2019-04-17T11:15
    790 "dt": "2019-04-17T11:16
    796 "dt": "2019-04-17T11:17
    825 "dt": "2019-04-17T11:18
    792 "dt": "2019-04-17T11:19
    829 "dt": "2019-04-17T11:20
    865 "dt": "2019-04-17T11:21
    758 "dt": "2019-04-17T11:22
    817 "dt": "2019-04-17T11:23
    812 "dt": "2019-04-17T11:24
    838 "dt": "2019-04-17T11:25
    828 "dt": "2019-04-17T11:26
    812 "dt": "2019-04-17T11:27
    800 "dt": "2019-04-17T11:28
    801 "dt": "2019-04-17T11:29
    792 "dt": "2019-04-17T11:30
    859 "dt": "2019-04-17T11:31
    805 "dt": "2019-04-17T11:32
    810 "dt": "2019-04-17T11:33
    804 "dt": "2019-04-17T11:34
    793 "dt": "2019-04-17T11:35
    818 "dt": "2019-04-17T11:36
    848 "dt": "2019-04-17T11:37
    839 "dt": "2019-04-17T11:38
    827 "dt": "2019-04-17T11:39
    826 "dt": "2019-04-17T11:40
    845 "dt": "2019-04-17T11:41
    816 "dt": "2019-04-17T11:42
    845 "dt": "2019-04-17T11:43
    843 "dt": "2019-04-17T11:44
    818 "dt": "2019-04-17T11:45
    828 "dt": "2019-04-17T11:46
    796 "dt": "2019-04-17T11:47
    800 "dt": "2019-04-17T11:48
    818 "dt": "2019-04-17T11:49
    840 "dt": "2019-04-17T11:50
    801 "dt": "2019-04-17T11:51
    795 "dt": "2019-04-17T11:52
    807 "dt": "2019-04-17T11:53
    867 "dt": "2019-04-17T11:54
    841 "dt": "2019-04-17T11:55
    862 "dt": "2019-04-17T11:56
    810 "dt": "2019-04-17T11:57
    800 "dt": "2019-04-17T11:58
    779 "dt": "2019-04-17T11:59
    788 "dt": "2019-04-17T12:00
    781 "dt": "2019-04-17T12:01
    774 "dt": "2019-04-17T12:02
    805 "dt": "2019-04-17T12:03
    816 "dt": "2019-04-17T12:04
    787 "dt": "2019-04-17T12:05
    826 "dt": "2019-04-17T12:06
    814 "dt": "2019-04-17T12:07
    828 "dt": "2019-04-17T12:08
    766 "dt": "2019-04-17T12:09
    824 "dt": "2019-04-17T12:10
    869 "dt": "2019-04-17T12:11
    803 "dt": "2019-04-17T12:12
    858 "dt": "2019-04-17T12:13
    830 "dt": "2019-04-17T12:14
    782 "dt": "2019-04-17T12:15
    836 "dt": "2019-04-17T12:16
    964 "dt": "2019-04-17T12:17
   1038 "dt": "2019-04-17T12:18
    837 "dt": "2019-04-17T12:19
    788 "dt": "2019-04-17T12:20
    849 "dt": "2019-04-17T12:21
    891 "dt": "2019-04-17T12:22
    797 "dt": "2019-04-17T12:23
    898 "dt": "2019-04-17T12:24
    871 "dt": "2019-04-17T12:25
    862 "dt": "2019-04-17T12:26
    802 "dt": "2019-04-17T12:27
    837 "dt": "2019-04-17T12:28
    841 "dt": "2019-04-17T12:29
    788 "dt": "2019-04-17T12:30
    784 "dt": "2019-04-17T12:31
    817 "dt": "2019-04-17T12:32
    856 "dt": "2019-04-17T12:33
    843 "dt": "2019-04-17T12:34
    856 "dt": "2019-04-17T12:35
    860 "dt": "2019-04-17T12:36
    875 "dt": "2019-04-17T12:37
    874 "dt": "2019-04-17T12:38
    787 "dt": "2019-04-17T12:39
    868 "dt": "2019-04-17T12:40
    868 "dt": "2019-04-17T12:41
    850 "dt": "2019-04-17T12:42
    884 "dt": "2019-04-17T12:43
    845 "dt": "2019-04-17T12:44
    855 "dt": "2019-04-17T12:45
    892 "dt": "2019-04-17T12:46
    875 "dt": "2019-04-17T12:47
    884 "dt": "2019-04-17T12:48
    906 "dt": "2019-04-17T12:49
    876 "dt": "2019-04-17T12:50
    888 "dt": "2019-04-17T12:51
    907 "dt": "2019-04-17T12:52
    934 "dt": "2019-04-17T12:53
    850 "dt": "2019-04-17T12:54
    943 "dt": "2019-04-17T12:55
    870 "dt": "2019-04-17T12:56
    912 "dt": "2019-04-17T12:57
    952 "dt": "2019-04-17T12:58
    964 "dt": "2019-04-17T12:59
    915 "dt": "2019-04-17T13:00
    932 "dt": "2019-04-17T13:01
    811 "dt": "2019-04-17T13:02
    950 "dt": "2019-04-17T13:03
    883 "dt": "2019-04-17T13:04
    878 "dt": "2019-04-17T13:05
    874 "dt": "2019-04-17T13:06
    869 "dt": "2019-04-17T13:07
    868 "dt": "2019-04-17T13:08
    944 "dt": "2019-04-17T13:09
    891 "dt": "2019-04-17T13:10
    920 "dt": "2019-04-17T13:11
    926 "dt": "2019-04-17T13:12
    931 "dt": "2019-04-17T13:13
    933 "dt": "2019-04-17T13:14
    941 "dt": "2019-04-17T13:15
    894 "dt": "2019-04-17T13:16
    875 "dt": "2019-04-17T13:17
    898 "dt": "2019-04-17T13:18
    889 "dt": "2019-04-17T13:19
    958 "dt": "2019-04-17T13:20
    928 "dt": "2019-04-17T13:21
    931 "dt": "2019-04-17T13:22
    928 "dt": "2019-04-17T13:23
    942 "dt": "2019-04-17T13:24
    896 "dt": "2019-04-17T13:25
    921 "dt": "2019-04-17T13:26
    966 "dt": "2019-04-17T13:27
    886 "dt": "2019-04-17T13:28
    965 "dt": "2019-04-17T13:29
    955 "dt": "2019-04-17T13:30
    984 "dt": "2019-04-17T13:31
    970 "dt": "2019-04-17T13:32
    961 "dt": "2019-04-17T13:33
    966 "dt": "2019-04-17T13:34
    978 "dt": "2019-04-17T13:35
   1007 "dt": "2019-04-17T13:36
    924 "dt": "2019-04-17T13:37
    991 "dt": "2019-04-17T13:38
    990 "dt": "2019-04-17T13:39
    980 "dt": "2019-04-17T13:40
   1071 "dt": "2019-04-17T13:41
   1020 "dt": "2019-04-17T13:42
    986 "dt": "2019-04-17T13:43
    974 "dt": "2019-04-17T13:44
    950 "dt": "2019-04-17T13:45
   1009 "dt": "2019-04-17T13:46
    955 "dt": "2019-04-17T13:47
    921 "dt": "2019-04-17T13:48
   1026 "dt": "2019-04-17T13:49
    976 "dt": "2019-04-17T13:50
    953 "dt": "2019-04-17T13:51
    986 "dt": "2019-04-17T13:52
    973 "dt": "2019-04-17T13:53
   1004 "dt": "2019-04-17T13:54
    997 "dt": "2019-04-17T13:55
    997 "dt": "2019-04-17T13:56
    999 "dt": "2019-04-17T13:57
    961 "dt": "2019-04-17T13:58
   1005 "dt": "2019-04-17T13:59
    976 "dt": "2019-04-17T14:00
    989 "dt": "2019-04-17T14:01
    970 "dt": "2019-04-17T14:02
    968 "dt": "2019-04-17T14:03
    988 "dt": "2019-04-17T14:04
    944 "dt": "2019-04-17T14:05
    945 "dt": "2019-04-17T14:06
   1000 "dt": "2019-04-17T14:07
   1003 "dt": "2019-04-17T14:08
   1002 "dt": "2019-04-17T14:09
    948 "dt": "2019-04-17T14:10
   1000 "dt": "2019-04-17T14:11
    985 "dt": "2019-04-17T14:12
    939 "dt": "2019-04-17T14:13
    942 "dt": "2019-04-17T14:14
    978 "dt": "2019-04-17T14:15
   1002 "dt": "2019-04-17T14:16
    935 "dt": "2019-04-17T14:17
    978 "dt": "2019-04-17T14:18
    990 "dt": "2019-04-17T14:19
   1045 "dt": "2019-04-17T14:20
    988 "dt": "2019-04-17T14:21
    963 "dt": "2019-04-17T14:22
   1015 "dt": "2019-04-17T14:23
    966 "dt": "2019-04-17T14:24
   1025 "dt": "2019-04-17T14:25
   1015 "dt": "2019-04-17T14:26
   1000 "dt": "2019-04-17T14:27
    978 "dt": "2019-04-17T14:28
   1071 "dt": "2019-04-17T14:29
    893 "dt": "2019-04-17T14:30
    956 "dt": "2019-04-17T14:31
    952 "dt": "2019-04-17T14:32
   1034 "dt": "2019-04-17T14:33
    994 "dt": "2019-04-17T14:34
    943 "dt": "2019-04-17T14:35
    980 "dt": "2019-04-17T14:36
    976 "dt": "2019-04-17T14:37
    982 "dt": "2019-04-17T14:38
   1014 "dt": "2019-04-17T14:39
   1012 "dt": "2019-04-17T14:40
   1012 "dt": "2019-04-17T14:41
   1078 "dt": "2019-04-17T14:42
   1047 "dt": "2019-04-17T14:43
    970 "dt": "2019-04-17T14:44
    998 "dt": "2019-04-17T14:45
    995 "dt": "2019-04-17T14:46
   1057 "dt": "2019-04-17T14:47
    995 "dt": "2019-04-17T14:48
    973 "dt": "2019-04-17T14:49
    971 "dt": "2019-04-17T14:50
    970 "dt": "2019-04-17T14:51
   1003 "dt": "2019-04-17T14:52
    946 "dt": "2019-04-17T14:53
    990 "dt": "2019-04-17T14:54
   1042 "dt": "2019-04-17T14:55
   1022 "dt": "2019-04-17T14:56
    997 "dt": "2019-04-17T14:57
    972 "dt": "2019-04-17T14:58
    947 "dt": "2019-04-17T14:59
    964 "dt": "2019-04-17T15:00
    996 "dt": "2019-04-17T15:01
   1020 "dt": "2019-04-17T15:02
    981 "dt": "2019-04-17T15:03
   1019 "dt": "2019-04-17T15:04
    990 "dt": "2019-04-17T15:05
   1038 "dt": "2019-04-17T15:06
    967 "dt": "2019-04-17T15:07
   1002 "dt": "2019-04-17T15:08
   1025 "dt": "2019-04-17T15:09
   1029 "dt": "2019-04-17T15:10
   1074 "dt": "2019-04-17T15:11
    951 "dt": "2019-04-17T15:12
    998 "dt": "2019-04-17T15:13
   1002 "dt": "2019-04-17T15:14
    977 "dt": "2019-04-17T15:15
    983 "dt": "2019-04-17T15:16
    978 "dt": "2019-04-17T15:17
   1027 "dt": "2019-04-17T15:18
   1000 "dt": "2019-04-17T15:19
    951 "dt": "2019-04-17T15:20
    954 "dt": "2019-04-17T15:21
    954 "dt": "2019-04-17T15:22
    986 "dt": "2019-04-17T15:23
   1015 "dt": "2019-04-17T15:24
   1026 "dt": "2019-04-17T15:25
   1024 "dt": "2019-04-17T15:26
    980 "dt": "2019-04-17T15:27
    967 "dt": "2019-04-17T15:28
   1012 "dt": "2019-04-17T15:29
   1030 "dt": "2019-04-17T15:30
   1017 "dt": "2019-04-17T15:31
    950 "dt": "2019-04-17T15:32
    959 "dt": "2019-04-17T15:33
   1012 "dt": "2019-04-17T15:34
    995 "dt": "2019-04-17T15:35
   1031 "dt": "2019-04-17T15:36
   1006 "dt": "2019-04-17T15:37
   1009 "dt": "2019-04-17T15:38
   1021 "dt": "2019-04-17T15:39
   1051 "dt": "2019-04-17T15:40
   1075 "dt": "2019-04-17T15:41
   1033 "dt": "2019-04-17T15:42
   1042 "dt": "2019-04-17T15:43
    990 "dt": "2019-04-17T15:44
   1061 "dt": "2019-04-17T15:45
    987 "dt": "2019-04-17T15:46
    962 "dt": "2019-04-17T15:47
   1029 "dt": "2019-04-17T15:48
    980 "dt": "2019-04-17T15:49
    996 "dt": "2019-04-17T15:50
    997 "dt": "2019-04-17T15:51
    942 "dt": "2019-04-17T15:52
    988 "dt": "2019-04-17T15:53
    963 "dt": "2019-04-17T15:54
    961 "dt": "2019-04-17T15:55
   1055 "dt": "2019-04-17T15:56
    993 "dt": "2019-04-17T15:57
   1047 "dt": "2019-04-17T15:58
   1041 "dt": "2019-04-17T15:59
    985 "dt": "2019-04-17T16:00
    996 "dt": "2019-04-17T16:01
    988 "dt": "2019-04-17T16:02
   1033 "dt": "2019-04-17T16:03
   1009 "dt": "2019-04-17T16:04
    973 "dt": "2019-04-17T16:05
   1013 "dt": "2019-04-17T16:06
   1000 "dt": "2019-04-17T16:07
    928 "dt": "2019-04-17T16:08
    968 "dt": "2019-04-17T16:09
   1042 "dt": "2019-04-17T16:10
    986 "dt": "2019-04-17T16:11
   1031 "dt": "2019-04-17T16:12
    988 "dt": "2019-04-17T16:13
    961 "dt": "2019-04-17T16:14
   1010 "dt": "2019-04-17T16:15
   1006 "dt": "2019-04-17T16:16
    957 "dt": "2019-04-17T16:17
    983 "dt": "2019-04-17T16:18
    989 "dt": "2019-04-17T16:19
    978 "dt": "2019-04-17T16:20
    997 "dt": "2019-04-17T16:21
    937 "dt": "2019-04-17T16:22
    988 "dt": "2019-04-17T16:23
   1015 "dt": "2019-04-17T16:24
   1003 "dt": "2019-04-17T16:25
   1060 "dt": "2019-04-17T16:26
   1027 "dt": "2019-04-17T16:27
   1005 "dt": "2019-04-17T16:28
   1065 "dt": "2019-04-17T16:29
   1060 "dt": "2019-04-17T16:30
   1003 "dt": "2019-04-17T16:31
   1002 "dt": "2019-04-17T16:32
    972 "dt": "2019-04-17T16:33
   1044 "dt": "2019-04-17T16:34
    989 "dt": "2019-04-17T16:35
    969 "dt": "2019-04-17T16:36
    983 "dt": "2019-04-17T16:37
   1051 "dt": "2019-04-17T16:38
   1119 "dt": "2019-04-17T16:39
    961 "dt": "2019-04-17T16:40
   1028 "dt": "2019-04-17T16:41
    946 "dt": "2019-04-17T16:42
   1036 "dt": "2019-04-17T16:43
   1015 "dt": "2019-04-17T16:44
    972 "dt": "2019-04-17T16:45
   1005 "dt": "2019-04-17T16:46
    949 "dt": "2019-04-17T16:47
    990 "dt": "2019-04-17T16:48
   1035 "dt": "2019-04-17T16:49
    997 "dt": "2019-04-17T16:50
    993 "dt": "2019-04-17T16:51
   1006 "dt": "2019-04-17T16:52
    984 "dt": "2019-04-17T16:53
   1046 "dt": "2019-04-17T16:54
    990 "dt": "2019-04-17T16:55
   1014 "dt": "2019-04-17T16:56
   1034 "dt": "2019-04-17T16:57
   1021 "dt": "2019-04-17T16:58
    960 "dt": "2019-04-17T16:59
    994 "dt": "2019-04-17T17:00
    914 "dt": "2019-04-17T17:01
    967 "dt": "2019-04-17T17:02
    991 "dt": "2019-04-17T17:03
    981 "dt": "2019-04-17T17:04
   1001 "dt": "2019-04-17T17:05
    948 "dt": "2019-04-17T17:06
    942 "dt": "2019-04-17T17:07
    977 "dt": "2019-04-17T17:08
    973 "dt": "2019-04-17T17:09
    942 "dt": "2019-04-17T17:10
   1003 "dt": "2019-04-17T17:11
   1009 "dt": "2019-04-17T17:12
   1039 "dt": "2019-04-17T17:13
    979 "dt": "2019-04-17T17:14
    995 "dt": "2019-04-17T17:15
   1019 "dt": "2019-04-17T17:16
   1004 "dt": "2019-04-17T17:17
    989 "dt": "2019-04-17T17:18
   1009 "dt": "2019-04-17T17:19
    936 "dt": "2019-04-17T17:20
   1022 "dt": "2019-04-17T17:21
    967 "dt": "2019-04-17T17:22
    998 "dt": "2019-04-17T17:23
    948 "dt": "2019-04-17T17:24
   1010 "dt": "2019-04-17T17:25
    978 "dt": "2019-04-17T17:26
    990 "dt": "2019-04-17T17:27
   1070 "dt": "2019-04-17T17:28
    969 "dt": "2019-04-17T17:29
    969 "dt": "2019-04-17T17:30
    980 "dt": "2019-04-17T17:31
    966 "dt": "2019-04-17T17:32
    960 "dt": "2019-04-17T17:33
   1029 "dt": "2019-04-17T17:34
   1048 "dt": "2019-04-17T17:35
    949 "dt": "2019-04-17T17:36
    979 "dt": "2019-04-17T17:37
   1009 "dt": "2019-04-17T17:38
    954 "dt": "2019-04-17T17:39
    979 "dt": "2019-04-17T17:40
    972 "dt": "2019-04-17T17:41
    996 "dt": "2019-04-17T17:42
    987 "dt": "2019-04-17T17:43
   1015 "dt": "2019-04-17T17:44
   1051 "dt": "2019-04-17T17:45
    955 "dt": "2019-04-17T17:46
    955 "dt": "2019-04-17T17:47
    969 "dt": "2019-04-17T17:48
    996 "dt": "2019-04-17T17:49
    968 "dt": "2019-04-17T17:50
   1018 "dt": "2019-04-17T17:51
    963 "dt": "2019-04-17T17:52
   1010 "dt": "2019-04-17T17:53
    983 "dt": "2019-04-17T17:54
   1041 "dt": "2019-04-17T17:55
    983 "dt": "2019-04-17T17:56
   1034 "dt": "2019-04-17T17:57
   1048 "dt": "2019-04-17T17:58
    969 "dt": "2019-04-17T17:59
    992 "dt": "2019-04-17T18:00
    954 "dt": "2019-04-17T18:01
    990 "dt": "2019-04-17T18:02
    965 "dt": "2019-04-17T18:03
    989 "dt": "2019-04-17T18:04
    914 "dt": "2019-04-17T18:05
    947 "dt": "2019-04-17T18:06
    940 "dt": "2019-04-17T18:07
    970 "dt": "2019-04-17T18:08
   1016 "dt": "2019-04-17T18:09
    962 "dt": "2019-04-17T18:10
    976 "dt": "2019-04-17T18:11
    949 "dt": "2019-04-17T18:12
    972 "dt": "2019-04-17T18:13
    947 "dt": "2019-04-17T18:14
    931 "dt": "2019-04-17T18:15
    962 "dt": "2019-04-17T18:16
    971 "dt": "2019-04-17T18:17
    948 "dt": "2019-04-17T18:18
    923 "dt": "2019-04-17T18:19
    932 "dt": "2019-04-17T18:20
   1011 "dt": "2019-04-17T18:21
    971 "dt": "2019-04-17T18:22
    992 "dt": "2019-04-17T18:23
    913 "dt": "2019-04-17T18:24
    979 "dt": "2019-04-17T18:25
    927 "dt": "2019-04-17T18:26
    960 "dt": "2019-04-17T18:27
    988 "dt": "2019-04-17T18:28
    954 "dt": "2019-04-17T18:29
   1017 "dt": "2019-04-17T18:30
    997 "dt": "2019-04-17T18:31
    980 "dt": "2019-04-17T18:32
    946 "dt": "2019-04-17T18:33
    968 "dt": "2019-04-17T18:34
   1038 "dt": "2019-04-17T18:35
    929 "dt": "2019-04-17T18:36
    939 "dt": "2019-04-17T18:37
    987 "dt": "2019-04-17T18:38
    966 "dt": "2019-04-17T18:39
   1025 "dt": "2019-04-17T18:40
    992 "dt": "2019-04-17T18:41
    975 "dt": "2019-04-17T18:42
    899 "dt": "2019-04-17T18:43
    895 "dt": "2019-04-17T18:44
   1027 "dt": "2019-04-17T18:45
    926 "dt": "2019-04-17T18:46
    959 "dt": "2019-04-17T18:47
   1001 "dt": "2019-04-17T18:48
    977 "dt": "2019-04-17T18:49
    942 "dt": "2019-04-17T18:50
    970 "dt": "2019-04-17T18:51
   1017 "dt": "2019-04-17T18:52
    980 "dt": "2019-04-17T18:53
    995 "dt": "2019-04-17T18:54
   1004 "dt": "2019-04-17T18:55
    975 "dt": "2019-04-17T18:56
    933 "dt": "2019-04-17T18:57
    957 "dt": "2019-04-17T18:58
    977 "dt": "2019-04-17T18:59
    975 "dt": "2019-04-17T19:00
    965 "dt": "2019-04-17T19:01
    968 "dt": "2019-04-17T19:02
    894 "dt": "2019-04-17T19:03
    910 "dt": "2019-04-17T19:04
    988 "dt": "2019-04-17T19:05
    975 "dt": "2019-04-17T19:06
    995 "dt": "2019-04-17T19:07
    988 "dt": "2019-04-17T19:08
    938 "dt": "2019-04-17T19:09
    864 "dt": "2019-04-17T19:10
    936 "dt": "2019-04-17T19:11
    901 "dt": "2019-04-17T19:12
    908 "dt": "2019-04-17T19:13
    869 "dt": "2019-04-17T19:14
    969 "dt": "2019-04-17T19:15
    993 "dt": "2019-04-17T19:16
    975 "dt": "2019-04-17T19:17
    951 "dt": "2019-04-17T19:18
    923 "dt": "2019-04-17T19:19
   1008 "dt": "2019-04-17T19:20
    949 "dt": "2019-04-17T19:21
    957 "dt": "2019-04-17T19:22
    940 "dt": "2019-04-17T19:23
    920 "dt": "2019-04-17T19:24
    988 "dt": "2019-04-17T19:25
    999 "dt": "2019-04-17T19:26
    959 "dt": "2019-04-17T19:27
    971 "dt": "2019-04-17T19:28
    927 "dt": "2019-04-17T19:29
    927 "dt": "2019-04-17T19:30
    944 "dt": "2019-04-17T19:31
    966 "dt": "2019-04-17T19:32
    972 "dt": "2019-04-17T19:33
    981 "dt": "2019-04-17T19:34
    905 "dt": "2019-04-17T19:35
    949 "dt": "2019-04-17T19:36
    990 "dt": "2019-04-17T19:37
    972 "dt": "2019-04-17T19:38
    940 "dt": "2019-04-17T19:39
    993 "dt": "2019-04-17T19:40
    990 "dt": "2019-04-17T19:41
   1079 "dt": "2019-04-17T19:42
   1015 "dt": "2019-04-17T19:43
   1005 "dt": "2019-04-17T19:44
   1047 "dt": "2019-04-17T19:45
   1028 "dt": "2019-04-17T19:46
    963 "dt": "2019-04-17T19:47
    957 "dt": "2019-04-17T19:48
    974 "dt": "2019-04-17T19:49
   1002 "dt": "2019-04-17T19:50
    947 "dt": "2019-04-17T19:51
    956 "dt": "2019-04-17T19:52
    951 "dt": "2019-04-17T19:53
    958 "dt": "2019-04-17T19:54
    986 "dt": "2019-04-17T19:55
    904 "dt": "2019-04-17T19:56
    970 "dt": "2019-04-17T19:57
    939 "dt": "2019-04-17T19:58
    955 "dt": "2019-04-17T19:59
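
The same per-minute tally can be sketched in Python. Unlike `uniq -c`, it does not require the input to be sorted by dt first. The function name is made up for this example, and it assumes every line is a JSON object with a "dt" field in the ISO format shown above:

```python
import json
from collections import Counter


def count_per_minute(lines):
    """Return sorted (minute, count) pairs, keyed by 'YYYY-MM-DDTHH:MM'."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        # "2019-04-17T14:00:59Z"[:16] -> "2019-04-17T14:00"
        counts[event["dt"][:16]] += 1
    return sorted(counts.items())


sample = [
    '{"dt": "2019-04-17T14:01:05Z", "schema": "NavigationTiming"}',
    '{"dt": "2019-04-17T14:00:10Z", "schema": "NavigationTiming"}',
    '{"dt": "2019-04-17T14:00:59Z", "schema": "NavigationTiming"}',
]
tally = count_per_minute(sample)
print(tally)  # [('2019-04-17T14:00', 2), ('2019-04-17T14:01', 1)]
```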

The data is in my home directory on webperf1001 for now. But given the above, I've not submitted it to the real metric yet, because it looks broken and I haven't figured out why.

Having ruled out most everything else, I suspect it must have to do with the way I patched the coal.py script. I probably messed something up...

I've run your script and it's sending the right data to graphite, one data point per minute. When queried, graphite only gives back one data point per hour. I imagine this has to do with the default policy for a new whisper metric file to have a granularity of one hour when created on the fly?

I think it's worth pointing this at the real metric/whisper file and it might just work, because its granularity is configured correctly.
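
If the guess about hourly granularity is right, the effect would look roughly like the sketch below. This is a simplified model of a single fixed-step archive and not whisper's actual code: it assumes each timestamp is rounded down to the start of its interval and that a later write to the same slot replaces the earlier one.

```python
def store(points, seconds_per_point):
    """Simplified fixed-step archive: one slot per interval, last write wins.

    `points` is an iterable of (unix_timestamp, value) pairs.
    """
    slots = {}
    for ts, value in points:
        slots[ts - ts % seconds_per_point] = value
    return sorted(slots.items())


start = 1555509600  # 2019-04-17T14:00:00Z
# Two hours of per-minute data points.
minute_points = [(start + 60 * i, float(i)) for i in range(120)]

per_minute = store(minute_points, 60)
per_hour = store(minute_points, 3600)

print(len(per_minute))  # 120
print(per_hour)  # [(1555509600, 59.0), (1555513200, 119.0)]
```

Under that model, the same 120 writes leave 120 points in a one-minute archive but only 2 in a one-hour archive, which would match the near-flat line seen with the coal_tmp prefix.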

Mentioned in SAL (#wikimedia-operations) [2019-06-17T23:22:21Z] <Krinkle> Prune debugging data "coal_tmp2.*" and "coal_tmp3.*" from graphite1004 and graphite2003 from last week, ref T221401

Mentioned in SAL (#wikimedia-operations) [2019-06-17T23:40:28Z] <Krinkle> Repopulating lost "coal.*" data in Graphite from NavigationTiming for 2019-04-17, ref T221401

Krinkle changed the task status from Declined to Resolved. · Jun 17 2019, 11:55 PM