Page MenuHomePhabricator

Kafkatee zero files having 10% less requests than udp2log zero files
Closed, InvalidPublic

Description

While the kafkatee zero files and udp2log zero files previously
more or less agreed in number of lines, the kafkatee zero files
currently have ~10% less lines than udp2log zero files. As the
files start to run appart for the 2014-03-27 file [1], it seems
to have started around 2014-03-26.

Is this difference expected?

P.S.: Running

zgrep '^cp3014.*http://en\.m\.wikipedia\.org/wiki/Mamunur_Rashid' /a/{squid/archive,log/webrequest}/zero/zero.tsv.log-20140419.gz

on stat1002 will give you a harmless looking log line that's only in the udp2log stream, and not in the kafka stream.

[1]


qchris@stat1002 0 11:44:00
cwd: ~
for i in /a/squid/archive/zero/zero.tsv.log-20140[34]*.gz ; do echo "$i: $(zcat $i | wc -l)" ; done
[...]
/a/squid/archive/zero/zero.tsv.log-20140324.gz: 4092113
/a/squid/archive/zero/zero.tsv.log-20140325.gz: 3953345
/a/squid/archive/zero/zero.tsv.log-20140326.gz: 3944141
/a/squid/archive/zero/zero.tsv.log-20140327.gz: 4145574
/a/squid/archive/zero/zero.tsv.log-20140328.gz: 6191020
/a/squid/archive/zero/zero.tsv.log-20140329.gz: 8338602
/a/squid/archive/zero/zero.tsv.log-20140330.gz: 10344328
/a/squid/archive/zero/zero.tsv.log-20140331.gz: 11867766
/a/squid/archive/zero/zero.tsv.log-20140401.gz: 12191221
/a/squid/archive/zero/zero.tsv.log-20140402.gz: 13204630
/a/squid/archive/zero/zero.tsv.log-20140403.gz: 13793041
[...]


qchris@stat1002 0 11:44:15
cwd: ~
for i in /a/log/webrequest/zero/zero.tsv.log-20140[34]*.gz ; do echo "$i: $(zcat $i | wc -l)" ; done
[...]
/a/log/webrequest/zero/zero.tsv.log-20140324.gz: 4087376
/a/log/webrequest/zero/zero.tsv.log-20140325.gz: 3946964
/a/log/webrequest/zero/zero.tsv.log-20140326.gz: 3939808
/a/log/webrequest/zero/zero.tsv.log-20140327.gz: 3918716
/a/log/webrequest/zero/zero.tsv.log-20140328.gz: 5545742
/a/log/webrequest/zero/zero.tsv.log-20140329.gz: 7500078
/a/log/webrequest/zero/zero.tsv.log-20140330.gz: 9303477
/a/log/webrequest/zero/zero.tsv.log-20140331.gz: 10671821
/a/log/webrequest/zero/zero.tsv.log-20140401.gz: 10971750
/a/log/webrequest/zero/zero.tsv.log-20140402.gz: 11885403
/a/log/webrequest/zero/zero.tsv.log-20140403.gz: 12417002
[...]


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=71056

Details

Reference
bz64181

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 3:22 AM
bzimport set Reference to bz64181.
bzimport added a subscriber: Unknown Object (MLST).

bingle-admin wrote:

Prioritization and scheduling of this bug is tracked on Mingle card https://wikimedia.mingle.thoughtworks.com/projects/analytics/cards/cards/1560

otto wrote:

I found a line that was in udp2log output, but not in kafkatee output:

cp1046.eqiad.wmnet2014-04-19T07:00:350.000164270X.X.X.Xhit/20017375GEThttp://en.m.wikipedia.org/wiki/Mobile_application_software-text/html; charset=UTF-8--XXXXenzero=410-01

(Note that the sequence number has been removed from this line.)

I checked the webrequest_mobile Hive table in hdfs for this line, and it does exist. That means that the kafkatee missing logs were in the kafka log buffer, as they were consumed properly by Camus into HDFS.

Still sleuthing...

otto wrote:

Also of interest, the kafkatee process on analytics1003 has only been running since March 26th.

Let's prioritize this -- the kafka-tee stuff is pretty critical to get right.

Since it was constantly missing 10% and we have 10 partitions, the
assumption is that some partition did not get properly fed into
varnishkafka.

Ottomata restarted varnishkafka, and the problem disappeared:
(...-$DATE.gz is filename,

2014-04-20T$HOUR:xx:xx is the extracted hour, and
difference is NUMBER_OF_LINES_IN_KAFKATEE_FILE * 100 / NUMBER_OF_LINES_IN_ UDP2LOG_FILE - 100
  (we want difference to be 0)

)

...-20140421.gz: 2014-04-20T07:xx:xx -11% difference
...-20140421.gz: 2014-04-20T08:xx:xx -11% difference
...-20140421.gz: 2014-04-20T09:xx:xx -11% difference
...-20140421.gz: 2014-04-20T10:xx:xx -11% difference
...-20140421.gz: 2014-04-20T11:xx:xx -11% difference
...-20140421.gz: 2014-04-20T12:xx:xx -11% difference
...-20140421.gz: 2014-04-20T13:xx:xx -10% difference
...-20140421.gz: 2014-04-20T14:xx:xx -11% difference
...-20140421.gz: 2014-04-20T15:xx:xx -10% difference
...-20140421.gz: 2014-04-20T16:xx:xx -11% difference
...-20140421.gz: 2014-04-20T17:xx:xx -11% difference
...-20140421.gz: 2014-04-20T18:xx:xx -11% difference
...-20140421.gz: 2014-04-20T19:xx:xx -10% difference
...-20140421.gz: 2014-04-20T20:xx:xx -10% difference
...-20140421.gz: 2014-04-20T21:xx:xx -10% difference
...-20140421.gz: 2014-04-20T22:xx:xx -11% difference
...-20140421.gz: 2014-04-20T23:xx:xx -11% difference
...-20140421.gz: 2014-04-21T00:xx:xx -11% difference
...-20140421.gz: 2014-04-21T01:xx:xx -11% difference
...-20140421.gz: 2014-04-21T02:xx:xx -10% difference
...-20140421.gz: 2014-04-21T03:xx:xx -10% difference
...-20140421.gz: 2014-04-21T04:xx:xx -11% difference
...-20140421.gz: 2014-04-21T05.xx:xx -11% difference
...-20140421.gz: 2014-04-21T06:xx:xx -25% difference <-- log rotation
...-20140422.gz: 2014-04-21T06:xx:xx -4% difference <-- log rotation
...-20140422.gz: 2014-04-21T07:xx:xx -11% difference
...-20140422.gz: 2014-04-21T08:xx:xx -10% difference
...-20140422.gz: 2014-04-21T09:xx:xx -11% difference
...-20140422.gz: 2014-04-21T10:xx:xx -11% difference
...-20140422.gz: 2014-04-21T11:xx:xx -6% difference <-- ?
...-20140422.gz: 2014-04-21T12:xx:xx -11% difference
...-20140422.gz: 2014-04-21T13:xx:xx -10% difference
...-20140422.gz: 2014-04-21T14:xx:xx -10% difference
...-20140422.gz: 2014-04-21T15:xx:xx -10% difference
...-20140422.gz: 2014-04-21T16:xx:xx -10% difference
...-20140422.gz: 2014-04-21T17:xx:xx -10% difference
...-20140422.gz: 2014-04-21T18:xx:xx -9% difference <-- ottomata restarting kafkatee ~18:30
...-20140422.gz: 2014-04-21T19:xx:xx -48% difference
...-20140422.gz: 2014-04-21T20:xx:xx 0% difference <-- back to normal.
...-20140422.gz: 2014-04-21T21:xx:xx 0% difference
...-20140422.gz: 2014-04-21T22:xx:xx -1% difference
...-20140422.gz: 2014-04-21T23:xx:xx -1% difference
...-20140422.gz: 2014-04-22T00:xx:xx -1% difference
...-20140422.gz: 2014-04-22T01:xx:xx -1% difference
...-20140422.gz: 2014-04-22T02:xx:xx -1% difference
...-20140422.gz: 2014-04-22T03:xx:xx -1% difference
...-20140422.gz: 2014-04-22T04:xx:xx -1% difference
...-20140422.gz: 2014-04-22T05:xx:xx -1% difference

How can we guard against partitions not being fed into varnishkafka in
future?

We have other kafka issues we need to resolve before we can address this problem.

fgiunchedi subscribed.

Zero is no more, resolving