Raw webrequest bits partition for 2014-10-26T21/1H not marked successful
Closed, Resolved, Public

Description

The bits webrequest partition [1] for 2014-10-26T21/1H has not been marked
successful.

What happened?

[1]


qchris@stat1002 jobs: 0 time: 07:51:53 // exit code: 0
cwd: ~
~/cluster-scripts/dump_webrequest_status.sh

+------------------+--------+--------+--------+--------+
| Date             |  bits  | mobile |  text  | upload |
+------------------+--------+--------+--------+--------+

[...]

| 2014-10-26T19/1H |    .   |    .   |    .   |    .   |
| 2014-10-26T20/1H |    .   |    .   |    .   |    .   |
| 2014-10-26T21/1H |    X   |    .   |    .   |    .   |
| 2014-10-26T22/1H |    .   |    .   |    .   |    .   |
| 2014-10-26T23/1H |    .   |    .   |    .   |    .   |

[...]

+------------------+--------+--------+--------+--------+

Statuses:

. --> Partition is ok
M --> Partition manually marked ok
X --> Partition is not ok (duplicates, missing, or nulls)
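To make the three failure modes in the legend concrete, here is a minimal sketch of how a partition could be classified. It is not the logic of dump_webrequest_status.sh; it assumes each record carries a hostname and a per-host sequence number that increases by one, and the function name `classify_partition` is hypothetical.

```python
from collections import Counter


def classify_partition(records):
    """Return '.' if the partition looks ok, 'X' otherwise.

    records: iterable of (hostname, sequence) tuples; either field may be None.
    Assumption: sequence numbers increase by 1 per host (illustrative only).
    """
    seqs_by_host = {}
    for host, seq in records:
        if host is None or seq is None:
            return "X"  # nulls
        seqs_by_host.setdefault(host, []).append(seq)

    for seqs in seqs_by_host.values():
        counts = Counter(seqs)
        if any(n > 1 for n in counts.values()):
            return "X"  # duplicates
        if max(seqs) - min(seqs) + 1 != len(counts):
            return "X"  # missing sequence numbers
    return "."


ok = classify_partition([("cp3019", 1), ("cp3019", 2), ("cp3019", 3)])
gap = classify_partition([("cp3019", 1), ("cp3019", 3)])
```

With these inputs `ok` is "." and `gap` is "X", since sequence number 2 is missing for the host.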

Version: unspecified
Severity: normal

Details

Reference
bz72548

Event Timeline

bzimport raised the priority of this task to Needs Triage. (Nov 22 2014, 3:52 AM)
bzimport set Reference to bz72548.
bzimport added a subscriber: Unknown Object (MLST).

Only cp3019 is affected. For that host, ~55 seconds' worth of data was
lost in the ~1 minute between 2014-10-26T21:16:22 and 2014-10-26T21:17:24.
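For reference, the size of that window can be checked with a few lines of Python. The two timestamps and the ~55 second figure are taken from the comment above; the variable names are only illustrative.

```python
from datetime import datetime

# Boundaries of the affected window, as given in the comment above.
start = datetime.fromisoformat("2014-10-26T21:16:22")
end = datetime.fromisoformat("2014-10-26T21:17:24")

window_seconds = (end - start).total_seconds()
lost_seconds = 55  # approximate figure from the comment

print(window_seconds)                           # 62.0
print(round(lost_seconds / window_seconds, 2))  # 0.89
```

So roughly 55 of the 62 seconds in that window are missing for cp3019.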

I could not find any changes in puppet, DNS, or SAL that look relevant.

cp3019 (like all other esams caches) is gone from ganglia, so it's hard
for non-Ops to see further data from cp3019 itself.

Icinga shows the “Varnishkafka Delivery Errors” service having status
WARNING since 2014-10-24 17:11:57 (but the same holds true for the
other esams caches too).

Kafka logs did not show peculiar entries in the relevant period of time.

Ganglia shows data for the esams caches again, but the data between
~2014-10-24T12 and ~2014-10-27T16 is missing (a range that contains the
minute where we had the cp3019 issues).
Judging from the cumulative counters, neither varnish nor varnishkafka
got restarted on cp3019.

ottomata ... since I cannot find any explanation, does cp3019
or 2014-10-26T21:16 ring a bell for you?

Was there some other migration/testing/network issue that I am missing?

ottomata had a look at the logs on cp3019 and said that there were
produce errors about full buffers.
So we're writing it off as temporary network issues for now.
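As a rough illustration of why full buffers would translate into lost data, here is a toy model of a producer with a bounded send buffer. This is not varnishkafka's implementation; the class `BoundedProducerBuffer` and its methods are hypothetical and only show the general mechanism: while the network is down nothing drains, the buffer fills up, and further messages are rejected.

```python
from collections import deque


class BoundedProducerBuffer:
    """Toy model of a producer-side send buffer (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()
        self.dropped = 0

    def produce(self, message):
        """Queue a message; reject it if the buffer is already full."""
        if len(self.pending) >= self.capacity:
            self.dropped += 1  # corresponds to a "buffer full" produce error
            return False
        self.pending.append(message)
        return True

    def flush(self, network_up):
        """Send everything that is pending, but only if the network is up."""
        if not network_up:
            return 0
        sent = len(self.pending)
        self.pending.clear()
        return sent


buf = BoundedProducerBuffer(capacity=3)
for i in range(5):
    buf.produce(i)                    # network is down, nothing drains
sent_while_down = buf.flush(network_up=False)
sent_after_recovery = buf.flush(network_up=True)
```

In this run two of the five messages are dropped while the network is down, and the three buffered ones are delivered once it recovers. That matches the shape of the incident (a short gap, no restart needed), though the actual cause on cp3019 remains unconfirmed.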