
Raw webrequest partition monitoring did not flag data for 2014-08-18T13:..:.. as valid for text caches
Closed, Declined (Public)

Description

The imported raw webrequests data from text caches for
2014-08-18T13:..:.. at

hdfs://analytics-hadoop/wmf/data/raw/webrequest/webrequest_text/hourly/2014/08/18/13

was not marked as ok.

Is that valid?
What happened?


Version: unspecified
Severity: normal

Details

Reference
bz69854

Event Timeline

bzimport raised the priority of this task from to Needs Triage (Nov 22 2014, 3:32 AM).
bzimport set Reference to bz69854.
bzimport added a subscriber: Unknown Object (MLST).

Created attachment 16256
kafka-requests-per-second-2014-08-17--2014-08-19.png (266×577 px, 23 KB)

Monitoring worked as expected. The partition was not marked as ok because sequence numbers are missing from the data:

+-----------------------------+-----------+---------------------+---------------------+
| Hostname                    | # missing | Start time          | End time            |
+-----------------------------+-----------+---------------------+---------------------+
| amssq37.esams.wmnet         |       155 | 2014-08-18T13:29:37 | 2014-08-18T13:29:39 |
| amssq47.esams.wmnet         |       125 | 2014-08-18T13:29:37 | 2014-08-18T13:29:39 |
| amssq48.esams.wikimedia.org |       149 | 2014-08-18T13:29:37 | 2014-08-18T13:29:39 |
| amssq59.esams.wikimedia.org |        74 | 2014-08-18T13:29:37 | 2014-08-18T13:29:39 |
| cp1052.eqiad.wmnet          |        96 | 2014-08-18T13:29:38 | 2014-08-18T13:29:39 |
| cp4008.ulsfo.wmnet          |       173 | 2014-08-18T13:29:37 | 2014-08-18T13:29:38 |
+-----------------------------+-----------+---------------------+---------------------+
| Total                       |       772 | 2014-08-18T13:29:37 | 2014-08-18T13:29:39 |
+-----------------------------+-----------+---------------------+---------------------+

Those hosts are all text caches, but they are not limited to a single datacenter (esams, eqiad, and ulsfo are all affected).
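For readers unfamiliar with this kind of check, here is a rough sketch of how per-host gaps in sequence numbers could be counted. It is not the actual monitoring job, which is not shown in this task. It assumes each cache host stamps its requests with a sequence number that increases by one per request. The names `missing_per_host` and `partition_is_ok` are made up for illustration.

```python
from collections import defaultdict


def missing_per_host(records):
    """Count missing sequence numbers per host.

    `records` is an iterable of (hostname, sequence) pairs for one hourly
    partition. Assumption: sequence numbers increase by one per request
    on each host, so any hole in the observed range is a lost request.
    """
    seqs = defaultdict(set)
    for host, seq in records:
        seqs[host].add(seq)  # a set ignores duplicate deliveries

    missing = {}
    for host, seen in seqs.items():
        expected = max(seen) - min(seen) + 1
        missing[host] = expected - len(seen)
    return missing


def partition_is_ok(records):
    """Hypothetical validity rule: no host may have any gap."""
    return all(count == 0 for count in missing_per_host(records).values())
```

Note that a check like this only sees holes inside the observed range of each host. Requests lost at the very start or end of the hour would need a comparison against the neighbouring partitions.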

The affected timespan matches a Kafka leader re-election.
See attachment kafka-requests-per-second-2014-08-17--2014-08-19.

There goes Kafka's "at least once" guarantee :-D

otto wrote:

Ha, ah yes, ok, if this corresponds with an election, then this makes sense. The producers themselves hit errors during the time it takes for the partition leadership to change. This shouldn't happen, and is something I need to look into for sure.