
Upgrade analytics-eqiad Kafka cluster to Kafka 0.9
Closed, Resolved | Public | 21 Estimated Story Points

Details

Repo                         Branch      Lines +/-
operations/puppet            production  +1 -0
operations/puppet            production  +1 -0
operations/puppet            production  +1 -0
operations/puppet            production  +1 -0
operations/puppet            production  +1 -0
operations/puppet            production  +1 -1
operations/puppet            production  +15 -1
operations/puppet            production  +1 -0
operations/puppet            production  +1 -0
operations/puppet            production  +1 -0
operations/puppet            production  +1 -0
operations/puppet            production  +1 -0
operations/puppet            production  +1 -1
operations/puppet            production  +14 -6
operations/puppet            production  +5 -2
operations/puppet            production  +5 -0
operations/puppet            production  +1 -1
operations/puppet            production  +11 -2
operations/mediawiki-config  master      +2 -0
operations/puppet            production  +146 -82


Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 286660 had a related patch set uploaded (by Ottomata):
Alter role::kafka::analytics::broker to be able to use confluent module during upgrade

https://gerrit.wikimedia.org/r/286660

Change 286660 merged by Ottomata:
Alter role::kafka::analytics::broker to be able to use confluent module during upgrade

https://gerrit.wikimedia.org/r/286660

Change 287106 had a related patch set uploaded (by Ottomata):
Set analytics kafka broker info for labs deployment-prep

https://gerrit.wikimedia.org/r/287106

Change 287106 merged by Ottomata:
Set analytics kafka broker info for labs deployment-prep

https://gerrit.wikimedia.org/r/287106

Change 287627 had a related patch set uploaded (by Ottomata):
Add confluent mirror to get Kafka 0.9 in apt

https://gerrit.wikimedia.org/r/287627

Change 287627 merged by Ottomata:
Add confluent mirror to get Kafka 0.9 in apt

https://gerrit.wikimedia.org/r/287627

Change 287678 had a related patch set uploaded (by Ottomata):
Fix reprepro updates entry for confluent kafka

https://gerrit.wikimedia.org/r/287678

Change 287678 merged by Ottomata:
Fix reprepro updates entry for confluent kafka

https://gerrit.wikimedia.org/r/287678

Ottomata renamed this task from "Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 (or 0.10?)" to "Upgrade analytics-eqiad Kafka cluster to Kafka 0.9". May 9 2016, 8:18 PM

Upgraded the analytics Kafka cluster in deployment-prep today. Along the way I had to create an extra Kafka broker and migrate the original one from Precise to Jessie. Analytics Kafka in deployment-prep now has two brokers, deployment-kafka0[13]. Both are on Jessie and running 0.9.

Change 288009 had a related patch set uploaded (by Ottomata):
Configure kafka1022 as a confluent 0.9 broker

https://gerrit.wikimedia.org/r/288009

Change 288009 merged by Ottomata:
Configure kafka1022 as a confluent 0.9 broker

https://gerrit.wikimedia.org/r/288009

Change 288012 had a related patch set uploaded (by Ottomata):
Fix include and thresholds for icinga alerts for confluent kafka brokers

https://gerrit.wikimedia.org/r/288012

Change 288012 merged by Ottomata:
Fix include and thresholds for icinga alerts for confluent kafka brokers

https://gerrit.wikimedia.org/r/288012

Change 288017 had a related patch set uploaded (by Ottomata):
Fix ferm rule for confluent kafka broker

https://gerrit.wikimedia.org/r/288017

Change 288017 merged by Ottomata:
Fix ferm rule for confluent kafka broker

https://gerrit.wikimedia.org/r/288017

Change 288027 had a related patch set uploaded (by Ottomata):
Fix for kafka broker process alert for confluent brokers

https://gerrit.wikimedia.org/r/288027

Change 288027 merged by Ottomata:
Fix for kafka broker process alert for confluent brokers

https://gerrit.wikimedia.org/r/288027

Change 288189 had a related patch set uploaded (by Ottomata):
Kafka 0.9 on kafka1012

https://gerrit.wikimedia.org/r/288189

Change 288189 merged by Ottomata:
Kafka 0.9 on kafka1012

https://gerrit.wikimedia.org/r/288189

Change 288190 had a related patch set uploaded (by Ottomata):
Kafka 0.9 on kafka1013

https://gerrit.wikimedia.org/r/288190

Change 288190 merged by Ottomata:
Kafka 0.9 on kafka1013

https://gerrit.wikimedia.org/r/288190

Change 288194 had a related patch set uploaded (by Ottomata):
Kafka 0.9 on kafka1014

https://gerrit.wikimedia.org/r/288194

Change 288194 merged by Ottomata:
Kafka 0.9 on kafka1014

https://gerrit.wikimedia.org/r/288194

Change 288200 had a related patch set uploaded (by Ottomata):
Kafka 0.9 on kafka1018

https://gerrit.wikimedia.org/r/288200

Change 288200 merged by Ottomata:
Kafka 0.9 on kafka1018

https://gerrit.wikimedia.org/r/288200

Change 288206 had a related patch set uploaded (by Ottomata):
Kafka 0.9 on kafka1020

https://gerrit.wikimedia.org/r/288206

Change 288206 merged by Ottomata:
Kafka 0.9 on kafka1020

https://gerrit.wikimedia.org/r/288206

All analytics brokers are now upgraded to confluent 0.9!

Tomorrow we will switch off the 0.8 inter-broker protocol version and bounce all brokers again.

Ottomata set the point value for this task to 21. May 12 2016, 4:06 PM

Change 288451 had a related patch set uploaded (by Ottomata):
Add temporary hiera config to contitionally set broker protocol version

https://gerrit.wikimedia.org/r/288451

Change 288451 merged by Ottomata:
Add temporary hiera config to contitionally set broker protocol version

https://gerrit.wikimedia.org/r/288451

Change 288604 had a related patch set uploaded (by Ottomata):
Fix default replication factor in clusters with less than 3 brokers (labs)

https://gerrit.wikimedia.org/r/288604

Change 288604 merged by Ottomata:
Fix default replication factor in clusters with less than 3 brokers (labs)

https://gerrit.wikimedia.org/r/288604

We didn't get a chance to fully restart each broker with inter.broker.protocol.version=0.9.0.X this week. kafka1012 is the only broker with this set. Since it is Friday, I will wait until Monday to do the other 5.
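
For anyone double-checking which brokers have already been switched, here is a minimal sketch of a check that can be run on a broker. It assumes the setting ends up as a plain `key=value` line in the broker's properties file. The function name is made up, and the path in the usage note is an assumption, not something confirmed in this task.

```shell
#!/bin/sh
# broker_protocol_version FILE: print the inter.broker.protocol.version value
# set in a broker properties file, or "unset" if the property is absent.
# Hypothetical helper for illustration; it only reads the file.
broker_protocol_version() {
  # If the property appears more than once, report the last occurrence.
  v=$(sed -n 's/^inter\.broker\.protocol\.version[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1)
  echo "${v:-unset}"
}
```

Usage would be something like `broker_protocol_version /etc/kafka/server.properties`, with the path adjusted to wherever the broker config actually lives.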

Change 288971 had a related patch set uploaded (by Ottomata):
Set inter.broker.protocol = 0.9.0.x for kafka1013

https://gerrit.wikimedia.org/r/288971

Change 288972 had a related patch set uploaded (by Ottomata):
Set inter.broker.protocol = 0.9.0.x for kafka1014

https://gerrit.wikimedia.org/r/288972

Change 288973 had a related patch set uploaded (by Ottomata):
Set inter.broker.protocol = 0.9.0.x for kafka1018

https://gerrit.wikimedia.org/r/288973

Change 288974 had a related patch set uploaded (by Ottomata):
Set inter.broker.protocol = 0.9.0.x for kafka1020

https://gerrit.wikimedia.org/r/288974

Change 288975 had a related patch set uploaded (by Ottomata):
Set inter.broker.protocol = 0.9.0.x for kafka1022

https://gerrit.wikimedia.org/r/288975

Change 288971 merged by Ottomata:
Set inter.broker.protocol = 0.9.0.x for kafka1013

https://gerrit.wikimedia.org/r/288971

Change 288972 merged by Ottomata:
Set inter.broker.protocol = 0.9.0.x for kafka1014

https://gerrit.wikimedia.org/r/288972

Change 288973 merged by Ottomata:
Set inter.broker.protocol = 0.9.0.x for kafka1018

https://gerrit.wikimedia.org/r/288973

Change 288974 merged by Ottomata:
Set inter.broker.protocol = 0.9.0.x for kafka1020

https://gerrit.wikimedia.org/r/288974

Change 288975 merged by Ottomata:
Set inter.broker.protocol = 0.9.0.x for kafka1022

https://gerrit.wikimedia.org/r/288975

So from the past week I can see:

  • kafka1012's log size has increased steadily since roughly 12/05 ~20:00 UTC. Initial value 7.1 TB, final value 11.8 TB.
  • All the brokers' log sizes have increased since 16/05 ~16:30 UTC. Initial value 7.2 TB, final value 8.0 TB.

The latter seems to be an acceptable side effect of the migration; the former is very strange.
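
To put the two observations side by side, here is a small sketch of the arithmetic. It uses only the four values quoted above. The `growth` helper is made up for illustration.

```shell
#!/bin/sh
# growth START_TB END_TB: print the absolute and relative growth of a log size.
# Hypothetical helper; the inputs below are the values quoted in this comment.
growth() {
  awk -v s="$1" -v e="$2" 'BEGIN { printf "+%.1f TB (%.0f%%)\n", e - s, (e - s) / s * 100 }'
}

growth 7.1 11.8   # kafka1012 since 12/05
growth 7.2 8.0    # all brokers since 16/05
```

This prints `+4.7 TB (66%)` for kafka1012 and `+0.8 TB (11%)` for the cluster-wide increase.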

Distribution of the leaders:

elukey@kafka1012:~$ kafka topics --describe | grep Leader | awk '{print $5" "$6}' | sort | uniq -c
     57 Leader: 12
     58 Leader: 13
     67 Leader: 14
     63 Leader: 18
     64 Leader: 20
     57 Leader: 22

Distribution of the partitions:

elukey@kafka1012:~$ kafka topics --describe | grep -v PartitionCount | awk '{print $8}' | sed 's/,/\n/g' | sort | uniq -c
    177 12
    171 13
    187 14
    190 18
    191 20
    182 22
(#partitions, broker)
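
The two pipelines above can also be combined into a single awk pass, which makes it easier to compare leaders and replicas per broker. This is only a sketch: it assumes the same `--describe` line layout the commands above rely on (Leader id in field 6, Replicas list in field 8), and `broker_distribution` is a made-up name.

```shell
#!/bin/sh
# broker_distribution: read `kafka topics --describe` output on stdin and print,
# per broker id, how many partitions it leads and how many replicas it hosts.
# Assumes partition lines look like:
#   Topic: t  Partition: 0  Leader: 12  Replicas: 12,13,14  Isr: 12,13,14
broker_distribution() {
  awk '
    $5 == "Leader:" {
      leaders[$6]++                      # one leader per partition line
      n = split($8, ids, ",")            # replica list is comma separated
      for (i = 1; i <= n; i++) replicas[ids[i]]++
    }
    END {
      for (b in replicas)
        printf "%s leaders=%d replicas=%d\n", b, leaders[b], replicas[b]
    }
  ' | sort -n
}
```

Usage: `kafka topics --describe | broker_distribution`. Topic summary lines (the ones with PartitionCount) are skipped because their fifth field is not `Leader:`.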

The increase in log size correlates to the time at which I set inter.broker.protocol.version=0.9.0.X. We did kafka1012 last week, and the rest of the brokers yesterday.

Since they all now have log size increasing, it doesn't look like it's going to stop! Hm!

I wonder if this has something to do with retention logic changes in 0.9. Investigating.

OK, I believe that when we switched inter.broker.protocol.version and bounced the brokers, they must have touched the data files on disk at startup. Since log retention is based on file mtime, and all of the log files were touched when the broker started up, it will take a full week before any old logs are removed.
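
To illustrate the guess above, here is a rough sketch that mimics an mtime-based retention check with find(1). It is not Kafka's own code, just the same rule applied by hand: a segment only counts as expired once its mtime is older than the retention period, so touching a file resets its clock. The function name is made up, and the 168 hours in the usage note is the one-week retention mentioned above.

```shell
#!/bin/sh
# expired_segments DIR HOURS: list *.log segment files under DIR whose mtime is
# more than HOURS old, i.e. the ones an mtime-based retention check would remove.
# Illustrative only; this approximates the rule, it does not call Kafka.
expired_segments() {
  find "$1" -type f -name '*.log' -mmin +"$(( $2 * 60 ))" | sort
}
```

Running `expired_segments <log dir> 168` on a broker right after the bounce should list nothing if the files really were touched at startup, and should start listing segments again about a week later.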

We will check back on kafka1012 on Thursday or Friday this week to see if my guess is correct. We'll also need to keep an eye on disk usage. If we ever get close to filling up disks, we should set log retention size to something small enough to get it to delete the files, and then bounce the broker.

Grr, these are getting close to full. Luca and I tried to dynamically set topic retention, but kafka didn't seem to care. (I had never tried it before). I'm going to have to bounce the brokers in order to get them to temporarily run with a shorter retention period. Starting this now...

I take it back! The command I had run previously looks like it had a larger retention.ms than the default! Doh! This is working:

kafka configs --alter --entity-type topics --entity-name webrequest_upload --add-config retention.ms=172800000 # 48 hours
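
Since my earlier mistake was a retention.ms value that was too large, here is a tiny sketch for computing the value instead of typing it by hand. `hours_to_ms` is a made-up helper, not part of the kafka CLI.

```shell
#!/bin/sh
# hours_to_ms HOURS: convert a retention period in hours to the millisecond
# value that retention.ms expects.
hours_to_ms() {
  echo "$(( $1 * 60 * 60 * 1000 ))"
}

hours_to_ms 48    # 172800000, the value used above
hours_to_ms 168   # 604800000, one week
```

The result can be substituted directly, e.g. `--add-config retention.ms=$(hours_to_ms 48)`.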

Ok, brokers have deleted webrequest_upload data older than 48 hours. I've removed the topic config override via:

kafka configs --alter --entity-type topics --entity-name webrequest_upload --delete-config retention.ms

Things are a little better now. We might have to do this for webrequest_text too, if things start filling up again before Monday.