T121562 Upgrade analytics-eqiad Kafka cluster to Kafka 0.9

Subject	Repo	Branch	Lines +/-
Set inter.broker.protocol = 0.9.0.x for kafka1022	operations/puppet	production	+1 -0
Set inter.broker.protocol = 0.9.0.x for kafka1020	operations/puppet	production	+1 -0
Set inter.broker.protocol = 0.9.0.x for kafka1018	operations/puppet	production	+1 -0
Set inter.broker.protocol = 0.9.0.x for kafka1014	operations/puppet	production	+1 -0
Set inter.broker.protocol = 0.9.0.x for kafka1013	operations/puppet	production	+1 -0
Fix default replication factor in clusters with less than 3 brokers (labs)	operations/puppet	production	+1 -1
Add temporary hiera config to contitionally set broker protocol version	operations/puppet	production	+15 -1
Kafka 0.9 on kafka1020	operations/puppet	production	+1 -0
Kafka 0.9 on kafka1018	operations/puppet	production	+1 -0
Kafka 0.9 on kafka1014	operations/puppet	production	+1 -0
Kafka 0.9 on kafka1013	operations/puppet	production	+1 -0
Kafka 0.9 on kafka1012	operations/puppet	production	+1 -0
Fix for kafka broker process alert for confluent brokers	operations/puppet	production	+1 -1
Fix ferm rule for confluent kafka broker	operations/puppet	production	+14 -6
Fix include and thresholds for icinga alerts for confluent kafka brokers	operations/puppet	production	+5 -2
Configure kafka1022 as a confluent 0.9 broker	operations/puppet	production	+5 -0
Fix reprepro updates entry for confluent kafka	operations/puppet	production	+1 -1
Add confluent mirror to get Kafka 0.9 in apt	operations/puppet	production	+11 -2
Set analytics kafka broker info for labs deployment-prep	operations/mediawiki-config	master	+2 -0
Alter role::kafka::analytics::broker to be able to use confluent module during upgrade	operations/puppet	production	+146 -82

Status	Assigned	Task
Declined	elukey	T166833 Produce webrequests from varnishkafka to Kafka with Kafka message timestamp set to configurable content field
Resolved	Ottomata	T152015 Provision new Kafka cluster(s) with security features
Resolved	mforns	T121561 Encrypt Kafka traffic, and restrict access via ACLs
Resolved	elukey	T121407 Single Kafka partition replica periodically lags
Resolved	Ottomata	T121562 Upgrade analytics-eqiad Kafka cluster to Kafka 0.9
Resolved	Ottomata	T132595 Experiment with new Kafka versions and verify that they work with existing clients
Resolved	Ottomata	T132631 Puppetize and make useable confluent kafka packages

Change 286660 had a related patch set uploaded (by Ottomata):
Alter role::kafka::analytics::broker to be able to use confluent module during upgrade

https://gerrit.wikimedia.org/r/286660

gerritbot added a project: Patch-For-Review. May 3 2016, 2:43 PM

Change 286660 merged by Ottomata:
Alter role::kafka::analytics::broker to be able to use confluent module during upgrade

https://gerrit.wikimedia.org/r/286660

Change 287106 had a related patch set uploaded (by Ottomata):
Set analytics kafka broker info for labs deployment-prep

https://gerrit.wikimedia.org/r/287106

Change 287106 merged by Ottomata:
Set analytics kafka broker info for labs deployment-prep

https://gerrit.wikimedia.org/r/287106

Ottomata edited projects, added Analytics-Kanban; removed Analytics. May 9 2016, 2:14 PM

Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 287627 had a related patch set uploaded (by Ottomata):
Add confluent mirror to get Kafka 0.9 in apt

https://gerrit.wikimedia.org/r/287627

Change 287627 merged by Ottomata:
Add confluent mirror to get Kafka 0.9 in apt

https://gerrit.wikimedia.org/r/287627

Change 287678 had a related patch set uploaded (by Ottomata):
Fix reprepro updates entry for confluent kafka

https://gerrit.wikimedia.org/r/287678

Change 287678 merged by Ottomata:
Fix reprepro updates entry for confluent kafka

https://gerrit.wikimedia.org/r/287678

Upgraded the analytics Kafka cluster in deployment-prep today. Along the way I had to create an extra Kafka broker and migrate the original one from Precise to Jessie. The analytics Kafka cluster in deployment-prep now has two brokers, deployment-kafka0[13], both on Jessie and running 0.9.

Change 288009 had a related patch set uploaded (by Ottomata):
Configure kafka1022 as a confluent 0.9 broker

https://gerrit.wikimedia.org/r/288009

Change 288009 merged by Ottomata:
Configure kafka1022 as a confluent 0.9 broker

https://gerrit.wikimedia.org/r/288009

Ottomata mentioned this in rOPUP0aca6e5c0f73: Configure kafka1022 as a confluent 0.9 broker. May 10 2016, 6:05 PM

Change 288012 had a related patch set uploaded (by Ottomata):
Fix include and thresholds for icinga alerts for confluent kafka brokers

https://gerrit.wikimedia.org/r/288012

Change 288012 merged by Ottomata:
Fix include and thresholds for icinga alerts for confluent kafka brokers

https://gerrit.wikimedia.org/r/288012

Ottomata mentioned this in rOPUPdc7ec527be5b: Fix include and thresholds for icinga alerts for confluent kafka brokers. May 10 2016, 6:18 PM

Change 288017 had a related patch set uploaded (by Ottomata):
Fix ferm rule for confluent kafka broker

https://gerrit.wikimedia.org/r/288017

Change 288017 merged by Ottomata:
Fix ferm rule for confluent kafka broker

https://gerrit.wikimedia.org/r/288017

Ottomata mentioned this in rOPUP1191d1ce6503: Fix ferm rule for confluent kafka broker. May 10 2016, 6:28 PM

Change 288027 had a related patch set uploaded (by Ottomata):
Fix for kafka broker process alert for confluent brokers

https://gerrit.wikimedia.org/r/288027

Change 288027 merged by Ottomata:
Fix for kafka broker process alert for confluent brokers

https://gerrit.wikimedia.org/r/288027

Ottomata mentioned this in rOPUP729530bf6c53: Fix for kafka broker process alert for confluent brokers. May 10 2016, 7:11 PM

Change 288189 had a related patch set uploaded (by Ottomata):
Kafka 0.9 on kafka1012

https://gerrit.wikimedia.org/r/288189

Change 288189 merged by Ottomata:
Kafka 0.9 on kafka1012

https://gerrit.wikimedia.org/r/288189

Change 288190 had a related patch set uploaded (by Ottomata):
Kafka 0.9 on kafka1013

https://gerrit.wikimedia.org/r/288190

Ottomata mentioned this in rOPUP71d36c57e57d: Kafka 0.9 on kafka1012. May 11 2016, 1:38 PM

Change 288190 merged by Ottomata:
Kafka 0.9 on kafka1013

https://gerrit.wikimedia.org/r/288190

Change 288194 had a related patch set uploaded (by Ottomata):
Kafka 0.9 on kafka1014

https://gerrit.wikimedia.org/r/288194

Ottomata mentioned this in rOPUP5888f9a1f732: Kafka 0.9 on kafka1013. May 11 2016, 1:56 PM

Change 288194 merged by Ottomata:
Kafka 0.9 on kafka1014

https://gerrit.wikimedia.org/r/288194

Ottomata mentioned this in rOPUP51df491cbf5c: Kafka 0.9 on kafka1014. May 11 2016, 2:04 PM

Change 288200 had a related patch set uploaded (by Ottomata):
Kafka 0.9 on kafka1022

https://gerrit.wikimedia.org/r/288200

Change 288200 merged by Ottomata:
Kafka 0.9 on kafka1018

https://gerrit.wikimedia.org/r/288200

Ottomata mentioned this in rOPUPc478b94b1f87: Kafka 0.9 on kafka1018. May 11 2016, 2:15 PM

Change 288206 had a related patch set uploaded (by Ottomata):
Kafka 0.9 on kafka1020

https://gerrit.wikimedia.org/r/288206

Change 288206 merged by Ottomata:
Kafka 0.9 on kafka1020

https://gerrit.wikimedia.org/r/288206

Ottomata mentioned this in rOPUP6a337f395ed9: Kafka 0.9 on kafka1020. May 11 2016, 2:30 PM

All analytics brokers are now upgraded to confluent 0.9!

Tomorrow we will switch off 0.8 inter broker protocol version and bounce all brokers again.
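For readers following along, the two-pass rolling procedure described here can be sketched roughly as below. This is only an illustration: the broker order, the step wording, and the old protocol value `0.8.2.X` are assumptions (the thread only says "0.8"), and `rolling_plan` is a hypothetical helper, not part of any Kafka or puppet tooling.

```python
# Broker names are taken from this task; the order is illustrative only.
BROKERS = ["kafka1012", "kafka1013", "kafka1014", "kafka1018", "kafka1020", "kafka1022"]


def rolling_plan(brokers, old_protocol="0.8.2.X", new_protocol="0.9.0.X"):
    """Build an ordered list of (broker, action, inter.broker.protocol.version).

    Pass 1 upgrades every broker's packages while keeping the old inter-broker
    protocol pinned. Pass 2 switches the protocol, one broker at a time, and
    only starts once every broker is already running the new packages.
    """
    plan = []
    for broker in brokers:
        plan.append((broker, "upgrade packages to 0.9 and restart", old_protocol))
    for broker in brokers:
        plan.append((broker, "set inter.broker.protocol.version and restart", new_protocol))
    return plan


if __name__ == "__main__":
    for broker, action, protocol in rolling_plan(BROKERS):
        print(f"{broker}: {action} (inter.broker.protocol.version={protocol})")
```

The point of the two passes is that no broker speaks the new inter-broker protocol until all of its peers can understand it.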

Nuria closed subtask T132631: Puppetize and make useable confluent kafka packages as Resolved. May 11 2016, 3:45 PM

Ottomata set the point value for this task to 21. May 12 2016, 4:06 PM

Change 288451 had a related patch set uploaded (by Ottomata):
Add temporary hiera config to contitionally set broker protocol version

https://gerrit.wikimedia.org/r/288451

Change 288451 merged by Ottomata:
Add temporary hiera config to contitionally set broker protocol version

https://gerrit.wikimedia.org/r/288451

Ottomata mentioned this in rOPUP68cc88ca5db4: Add temporary hiera config to contitionally set broker protocol version. May 12 2016, 7:04 PM

Change 288604 had a related patch set uploaded (by Ottomata):
Fix default replication factor in clusters with less than 3 brokers (labs)

https://gerrit.wikimedia.org/r/288604

Change 288604 merged by Ottomata:
Fix default replication factor in clusters with less than 3 brokers (labs)

https://gerrit.wikimedia.org/r/288604

Ottomata mentioned this in rOPUP7a41a8134eeb: Fix default replication factor in clusters with less than 3 brokers (labs). May 13 2016, 2:00 PM

We didn't get a chance to fully restart each broker with inter.broker.protocol.version=0.9.0.X this week. kafka1012 is the only broker with this set. Since it is Friday, I will wait until Monday to do the other 5.

Change 288971 had a related patch set uploaded (by Ottomata):
Set inter.broker.protocol = 0.9.0.x for kafka1013

https://gerrit.wikimedia.org/r/288971

Change 288972 had a related patch set uploaded (by Ottomata):
Set inter.broker.protocol = 0.9.0.x for kafka1014

https://gerrit.wikimedia.org/r/288972

Change 288973 had a related patch set uploaded (by Ottomata):
Set inter.broker.protocol = 0.9.0.x for kafka1018

https://gerrit.wikimedia.org/r/288973

Change 288974 had a related patch set uploaded (by Ottomata):
Set inter.broker.protocol = 0.9.0.x for kafka1020

https://gerrit.wikimedia.org/r/288974

Change 288975 had a related patch set uploaded (by Ottomata):
Set inter.broker.protocol = 0.9.0.x for kafka1022

https://gerrit.wikimedia.org/r/288975

Change 288971 merged by Ottomata:
Set inter.broker.protocol = 0.9.0.x for kafka1013

https://gerrit.wikimedia.org/r/288971

Ottomata mentioned this in rOPUP929696f99ca6: Set inter.broker.protocol = 0.9.0.x for kafka1013. May 16 2016, 4:14 PM

Change 288972 merged by Ottomata:
Set inter.broker.protocol = 0.9.0.x for kafka1014

https://gerrit.wikimedia.org/r/288972

Ottomata mentioned this in rOPUPc8e96de55415: Set inter.broker.protocol = 0.9.0.x for kafka1014. May 16 2016, 4:23 PM

Change 288973 merged by Ottomata:
Set inter.broker.protocol = 0.9.0.x for kafka1018

https://gerrit.wikimedia.org/r/288973

Ottomata mentioned this in rOPUP69eadbae0fb7: Set inter.broker.protocol = 0.9.0.x for kafka1018. May 16 2016, 4:29 PM

Change 288974 merged by Ottomata:
Set inter.broker.protocol = 0.9.0.x for kafka1020

https://gerrit.wikimedia.org/r/288974

Ottomata mentioned this in rOPUP01bdd40e8cdb: Set inter.broker.protocol = 0.9.0.x for kafka1020. May 16 2016, 4:44 PM

Change 288975 merged by Ottomata:
Set inter.broker.protocol = 0.9.0.x for kafka1022

https://gerrit.wikimedia.org/r/288975

Ottomata mentioned this in rOPUPf81bf422b54b: Set inter.broker.protocol = 0.9.0.x for kafka1022. May 16 2016, 4:59 PM

AAAnnnd we are done!

Ottomata mentioned this in rOPUP339a620a2d3f: Remove confluent conditional in role::kafka::analytics::broker. May 16 2016, 5:39 PM

Hmm, @elukey, let's keep an eye on Broker Log Size: https://grafana.wikimedia.org/dashboard/db/kafka?panelId=17&fullscreen

So from the past week I can see:

kafka1012's log size has increased steadily since 12/05 at about 20:00 UTC. Initial value 7.1 TB, final value 11.8 TB.
All the brokers' log sizes have increased since 16/05 at about 16:30 UTC. Initial value 7.2 TB, final value 8.0 TB.

The latter seems to be an acceptable side effect of the migration; the former is very strange.

Distribution of the leaders:

elukey@kafka1012:~$ kafka topics --describe | grep Leader | awk '{print $5" "$6}' | sort | uniq -c
     57 Leader: 12
     58 Leader: 13
     67 Leader: 14
     63 Leader: 18
     64 Leader: 20
     57 Leader: 22

Distribution of the partitions:

elukey@kafka1012:~$ kafka topics --describe | grep -v PartitionCount | awk '{print $8}' | sed 's/,/\n/g' | sort | uniq -c
    177 12
    171 13
    187 14
    190 18
    191 20
    182 22
(#partitions, broker)
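The two shell pipelines above can also be expressed as one small parser, which is handy if the counts need to be compared programmatically. This is a minimal sketch: `SAMPLE` is made-up output in the whitespace-separated layout that the awk field numbers above imply (the topic name `example_topic` is hypothetical), and `broker_distribution` is an illustrative helper.

```python
from collections import Counter

# Made-up sample, NOT real cluster output. The layout matches what the awk
# commands above assume: "Leader:" is field 5 and the replica list is field 8.
SAMPLE = """\
Topic:example_topic\tPartitionCount:2\tReplicationFactor:3\tConfigs:
\tTopic: example_topic\tPartition: 0\tLeader: 12\tReplicas: 12,13,14\tIsr: 12,13,14
\tTopic: example_topic\tPartition: 1\tLeader: 13\tReplicas: 13,14,18\tIsr: 13,14,18
"""


def broker_distribution(describe_output):
    """Count partition leaders and partition replicas per broker id."""
    leaders = Counter()
    replicas = Counter()
    for line in describe_output.splitlines():
        fields = line.split()
        # Topic summary lines carry no "Leader:" field and are skipped.
        if "Leader:" not in fields:
            continue
        leaders[fields[fields.index("Leader:") + 1]] += 1
        replicas.update(fields[fields.index("Replicas:") + 1].split(","))
    return leaders, replicas


if __name__ == "__main__":
    leaders, replicas = broker_distribution(SAMPLE)
    print("leaders: ", dict(leaders))
    print("replicas:", dict(replicas))
```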

The increase in log size correlates to the time at which I set inter.broker.protocol.version=0.9.0.X. We did kafka1012 last week, and the rest of the brokers yesterday.

Since they all now have log size increasing, it doesn't look like it's going to stop! Hm!

I wonder if this has something to do with retention logic changes in 0.9. Investigating.

Ok, I believe that when I switched inter.broker.protocol.version and bounced the brokers, they must have touched the data files on disk on startup. Since log retention is based on file mtime, and all of the log files were touched when the broker started up, it will take a full week before any old logs are removed.
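To make the guess above concrete, here is a toy model of mtime-based retention. It is only a sketch of the reasoning, not Kafka's actual code: the `Segment` type, the `expired` helper, and the three example segments are hypothetical, and the 7-day retention is inferred from "a full week" in the comment above.

```python
from dataclasses import dataclass, replace

DAY_MS = 24 * 60 * 60 * 1000
WEEK_MS = 7 * DAY_MS  # assumed retention period, inferred from "a full week"


@dataclass(frozen=True)
class Segment:
    name: str
    mtime_ms: int  # last-modified time, in ms on an arbitrary clock


def expired(segments, now_ms, retention_ms=WEEK_MS):
    """Names of segments whose mtime is older than the retention window."""
    return [s.name for s in segments if now_ms - s.mtime_ms > retention_ms]


# Hypothetical segments last written on day 0, day 2, and day 9.
segments = [Segment("a", 0), Segment("b", 2 * DAY_MS), Segment("c", 9 * DAY_MS)]
restart = 10 * DAY_MS

# Just before the restart, the two oldest segments are already eligible.
before = expired(segments, restart)

# The restart touches every file, so every mtime becomes the restart time.
touched = [replace(s, mtime_ms=restart) for s in segments]

# Nothing is eligible again until a full retention period has passed.
two_days_later = expired(touched, restart + 2 * DAY_MS)
eight_days_later = expired(touched, restart + 8 * DAY_MS)

if __name__ == "__main__":
    print(before, two_days_later, eight_days_later)
```

Under this model, data that would normally have been deleted over the following days keeps accumulating for a week after the bounce, which matches the steady growth seen on the dashboard.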

We will check back on kafka1012 on Thursday or Friday this week to see if my guess is correct. We'll also need to keep an eye on disk usage. If we ever get close to filling up disks, we should set log retention size to something small enough to get it to delete the files, and then bounce the broker.

Grr, these are getting close to full. Luca and I tried to dynamically set topic retention, but Kafka didn't seem to care. (I had never tried it before.) I'm going to have to bounce the brokers in order to get them to temporarily run with a shorter retention period. Starting this now...

I take it back! The command I had run previously looks like it had a larger retention.ms than the default! Doh! This is working:

kafka configs --alter --entity-type topics --entity-name webrequest_upload --add-config retention.ms=172800000 # 48 hours

Ok, brokers have deleted webrequest_upload data older than 48 hours. I've removed the topic config override via:

kafka configs --alter --entity-type topics --entity-name webrequest_upload --delete-config retention.ms
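Since the first attempt failed because the override was larger than the default, a quick sanity check of the retention.ms arithmetic may be useful before running such a command. This is a small sketch: `retention_ms` and `shortens_retention` are hypothetical helpers, and the 7-day default is an assumption based on the "full week" mentioned earlier in this task.

```python
from datetime import timedelta


def retention_ms(**duration):
    """Convert a duration given as timedelta keyword arguments to milliseconds."""
    return timedelta(**duration) // timedelta(milliseconds=1)


def shortens_retention(override_ms, default_ms):
    """True only if the override would delete data sooner than the default."""
    return override_ms < default_ms


DEFAULT_MS = retention_ms(days=7)     # assumed default, "a full week"
OVERRIDE_MS = retention_ms(hours=48)  # the value used in the command above

if __name__ == "__main__":
    print(OVERRIDE_MS, shortens_retention(OVERRIDE_MS, DEFAULT_MS))
```

An override that fails the `shortens_retention` check would be accepted by the broker but would not cause anything to be deleted earlier.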

Things are a little better now. We might have to do this for webrequest_text too, if things start filling up again before Monday.

Nuria closed this task as Resolved. May 30 2016, 9:52 PM

Ottomata mentioned this in rOPUP7f3ef8ed6bac: Set inter.broker.protocol = 0.9.0.x for kafka1018. Jun 17 2016, 6:09 PM

Ottomata mentioned this in rOPUP11170eeb2cf0: Set inter.broker.protocol = 0.9.0.x for kafka1020.

Ottomata mentioned this in rOPUP178cdaa1e960: Set inter.broker.protocol = 0.9.0.x for kafka1022.

Ottomata mentioned this in rOPUPbf5e903ea356: Fix default replication factor in clusters with less than 3 brokers (labs).

Ottomata mentioned this in rOPUP34698edbaa4b: Add temporary hiera config to contitionally set broker protocol version.

Ottomata mentioned this in rOPUP1df5ae45f53d: Add temporary hiera config to contitionally set broker protocol version.

Ottomata mentioned this in rOPUP806b7b2ab31c: Kafka 0.9 on kafka1022.

Ottomata mentioned this in rOPUPe895b89486f0: Fix ferm rule for confluent kafka broker.

Ottomata mentioned this in rOPUP5b96120ae607: Fix include and thresholds for icinga alerts for confluent kafka brokers.

Ottomata mentioned this in rOPUPd19df8154ff5: Fix include and thresholds for icinga alerts for confluent kafka brokers.

Ottomata mentioned this in rOPUP15629c90897a: Add confluent mirror to get Kafka 0.9 in apt.

Ottomata mentioned this in rOPUPdfa2c8aa8c28: Add confluent mirror to get Kafka 0.9 in apt.

Ottomata mentioned this in rOPUP2531d38a52bd: Add confluent mirror to get Kafka 0.9 in apt.

Ottomata mentioned this in rOPUP57c40333da07: Alter role::kafka::analytics::broker to be able to use confluent module….

Ottomata mentioned this in rOPUPd5a17356bb20: Remove confluent conditional in role::kafka::analytics::broker.

Ottomata mentioned this in rOPUPa222ca11eebc: Remove confluent conditional in role::kafka::analytics::broker.

Ottomata mentioned this in rOPUPbb23c1571911: Alter role::kafka::analytics::broker to be able to use confluent module….

Ottomata mentioned this in rOPUPd360b6b90505: Alter role::kafka::analytics::broker to be able to use confluent module….

Upgrade analytics-eqiad Kafka cluster to Kafka 0.9
Closed, Resolved; Public; 21 Estimated Story Points
	Ottomata
	Dec 15 2015, 7:02 PM
