analytics1021 getting kicked out of kafka partition leader role on 2014-10-27 ~07:12
Closed, Resolved, Public

Description

analytics1021 again got kicked out of its Kafka partition leader role
on 2014-10-27 ~07:12.

I am not running a leader re-election for now, as ottomata wanted to
run some further experiments if this happened to analytics1021 again.


Version: unspecified
Severity: normal

Details

Reference
bz72550

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:52 AM
bzimport set Reference to bz72550.
bzimport added a subscriber: Unknown Object (MLST).

I ran a leader re-election.
Analytics1021 is leader for a few partitions again.

(Still pending on check whether leader re-election caused loss/duplicates)

This bug is still missing the number of messages lost when
analytics1021 lost its partition leader role.

For the text cluster, it only affected:

amssq34
amssq53.esams.wikimedia.org
amssq56.esams.wikimedia.org
cp4008.ulsfo.wmnet

The affected period was 2014-10-27T07:12:29/2014-10-27T07:12:32, and
in total 100 messages got lost, which is far less than 1 second's
worth of data for text.

For the upload cluster, it affected all caches in that cluster except
for cp4015.
The affected period was 2014-10-27T07:12:29/2014-10-27T07:12:46, and
in total ~51K messages got lost, which is <2 seconds' worth of data for
upload.
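As a rough sanity check on the figures above, the sketch below computes the
length of both affected periods and what the "<2 seconds' worth" statement
implies for upload. It assumes a roughly constant message rate and takes
~51K as exactly 51,000; the helper name window_seconds is made up for this
example.

```python
from datetime import datetime


def window_seconds(interval):
    """Length in seconds of a 'start/end' ISO 8601 interval."""
    start, end = interval.split("/")
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds()


text_window = window_seconds("2014-10-27T07:12:29/2014-10-27T07:12:32")
upload_window = window_seconds("2014-10-27T07:12:29/2014-10-27T07:12:46")

# ~51K lost messages being less than 2 seconds' worth of data implies
# that upload carries more than 51,000 / 2 messages per second.
upload_lost = 51_000  # approximation of "~51K"
implied_min_rate = upload_lost / 2

# Under the constant-rate assumption, the share of upload traffic lost
# during the affected period is therefore below 2 s / window length.
max_lost_fraction = 2 / upload_window

print(f"text window:   {text_window:.0f} s")
print(f"upload window: {upload_window:.0f} s")
print(f"upload rate:   > {implied_min_rate:.0f} msg/s")
print(f"upload loss:   < {max_lost_fraction:.1%} of the window's traffic")
```

So even for upload, only a fraction of the traffic in the affected period
appears to have been lost rather than the whole period's worth.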

When analytics1021 lost its partition leader role, bits, mobile, and
text already had the ACK fix; upload did not. So seeing lost
messages on upload is expected.

It is also expected to see no loss on bits and mobile.

However, I had expected to see no loss on text, as it already had the
ACK fix. It's strange to see exactly 100 lost messages on text.
100 is a suspiciously nice number.

(In reply to christian from comment #1)

(Still pending on check whether leader re-election caused loss/duplicates)

Bug 72679 has details on that.