Details
Status | Subtype | Assigned | Task
---|---|---|---
Declined | | elukey | T166833 Produce webrequests from varnishkafka to Kafka with Kafka message timestamp set to configurable content field
Resolved | | Ottomata | T152015 Provision new Kafka cluster(s) with security features
Resolved | | mforns | T121561 Encrypt Kafka traffic, and restrict access via ACLs
Resolved | | elukey | T121407 Single Kafka partition replica periodically lags
Resolved | | Ottomata | T121562 Upgrade analytics-eqiad Kafka cluster to Kafka 0.9
Resolved | | Ottomata | T132595 Experiment with new Kafka versions and verify that they work with existing clients
Resolved | | Ottomata | T132631 Puppetize and make useable confluent kafka packages
Event Timeline
Change 286660 had a related patch set uploaded (by Ottomata):
Alter role::kafka::analytics::broker to be able to use confluent module during upgrade
Change 286660 merged by Ottomata:
Alter role::kafka::analytics::broker to be able to use confluent module during upgrade
Change 287106 had a related patch set uploaded (by Ottomata):
Set analytics kafka broker info for labs deployment-prep
Change 287106 merged by Ottomata:
Set analytics kafka broker info for labs deployment-prep
Change 287627 had a related patch set uploaded (by Ottomata):
Add confluent mirror to get Kafka 0.9 in apt
Change 287678 had a related patch set uploaded (by Ottomata):
Fix reprepro updates entry for confluent kafka
Upgraded the analytics Kafka cluster in deployment-prep today. Along the way I had to create an extra Kafka broker and migrate the original one from Precise over to Jessie. The analytics Kafka cluster in deployment-prep now has two brokers, deployment-kafka0[13], both on Jessie running Kafka 0.9.
Change 288009 had a related patch set uploaded (by Ottomata):
Configure kafka1022 as a confluent 0.9 broker
Change 288012 had a related patch set uploaded (by Ottomata):
Fix include and thresholds for icinga alerts for confluent kafka brokers
Change 288012 merged by Ottomata:
Fix include and thresholds for icinga alerts for confluent kafka brokers
Change 288017 had a related patch set uploaded (by Ottomata):
Fix ferm rule for confluent kafka broker
Change 288027 had a related patch set uploaded (by Ottomata):
Fix for kafka broker process alert for confluent brokers
Change 288027 merged by Ottomata:
Fix for kafka broker process alert for confluent brokers
All analytics brokers are now upgraded to confluent 0.9!
Tomorrow we will switch the inter.broker.protocol.version off of 0.8 and bounce all brokers again.
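For reference, the switch itself boils down to a single broker property. This is just an illustrative excerpt; the real change is driven through puppet/hiera, and the exact file layout on our brokers may differ:
# /etc/kafka/server.properties (sketch)
# While brokers are being upgraded, the 0.9 binaries keep speaking the old wire protocol:
inter.broker.protocol.version=0.8.2
# Once every broker runs 0.9, flip this to the 0.9 value and do a rolling restart:
# inter.broker.protocol.version=0.9.0.X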
Change 288451 had a related patch set uploaded (by Ottomata):
Add temporary hiera config to conditionally set broker protocol version
Change 288451 merged by Ottomata:
Add temporary hiera config to conditionally set broker protocol version
Change 288604 had a related patch set uploaded (by Ottomata):
Fix default replication factor in clusters with less than 3 brokers (labs)
Change 288604 merged by Ottomata:
Fix default replication factor in clusters with less than 3 brokers (labs)
We didn't get a chance to fully restart each broker with inter.broker.protocol.version=0.9.0.X this week. kafka1012 is the only broker with this set. Since it is Friday, I will wait until Monday to do the other 5.
Change 288971 had a related patch set uploaded (by Ottomata):
Set inter.broker.protocol = 0.9.0.x for kafka1013
Change 288972 had a related patch set uploaded (by Ottomata):
Set inter.broker.protocol = 0.9.0.x for kafka1014
Change 288973 had a related patch set uploaded (by Ottomata):
Set inter.broker.protocol = 0.9.0.x for kafka1018
Change 288974 had a related patch set uploaded (by Ottomata):
Set inter.broker.protocol = 0.9.0.x for kafka1020
Change 288975 had a related patch set uploaded (by Ottomata):
Set inter.broker.protocol = 0.9.0.x for kafka1022
Hmm, @elukey let's keep an eye on Broker Log Size: https://grafana.wikimedia.org/dashboard/db/kafka?panelId=17&fullscreen
So from the past week I can see:
- kafka1012's log size has increased steadily since roughly 12/05 ~20:00 UTC, from 7.1 TB to 11.8 TB.
- All the brokers' log sizes have increased since 16/05 ~16:30 UTC, from 7.2 TB to 8.0 TB.
The latter seems to be an acceptable side effect of the migration; the former is very strange.
Distribution of the leaders:
elukey@kafka1012:~$ kafka topics --describe | grep Leader | awk '{print $5" "$6}' | sort | uniq -c
     57 Leader: 12
     58 Leader: 13
     67 Leader: 14
     63 Leader: 18
     64 Leader: 20
     57 Leader: 22
Distribution of the partitions:
elukey@kafka1012:~$ kafka topics --describe | grep -v PartitionCount | awk '{print $8}' | sed 's/,/\n/g' | sort | uniq -c
    177 12
    171 13
    187 14
    190 18
    191 20
    182 22
(#partitions, broker)
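Both distributions look roughly even, so the kafka1012 growth is not explained by skewed leadership or partition placement. To cross-check the Grafana log-size figure on the host itself, something like the following works (the data directory below is an assumption; use whatever log.dirs points at on the broker):
du -sh /var/spool/kafka/data/*    # on-disk size per topic-partition directory
df -h /var/spool/kafka            # overall usage of the Kafka data mount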
The increase in log size correlates to the time at which I set inter.broker.protocol.version=0.9.0.X. We did kafka1012 last week, and the rest of the brokers yesterday.
Since they all now have increasing log sizes, it doesn't look like it's going to stop! Hm!
I wonder if this has something to do with retention logic changes in 0.9. Investigating.
Ok, I believe that when we switched inter.broker.protocol.version and bounced the brokers, they must have touched the data files on disk during startup. Since log retention is based on file mtime, and all of the log files were touched when the brokers started up, it will take a full week before any old logs are removed.
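A quick way to test this guess on a broker (the data directory and topic-partition name below are illustrative; substitute whatever log.dirs points at):
ls -l --time-style=long-iso /var/spool/kafka/data/webrequest_upload-0/*.log | head   # segment file mtimes
find /var/spool/kafka/data -name '*.log' ! -newermt '2016-05-16' | wc -l             # segments older than the restart
If the mtime theory is right, almost every segment will show an mtime around the broker restart, and the second count will be close to zero.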
We will check back on kafka1012 on Thursday or Friday this week to see if my guess is correct. We'll also need to keep an eye on disk usage. If we ever get close to filling up the disks, we should set the log retention size to something small enough to get the brokers to delete the files, and then bounce them.
Grr, these disks are getting close to full. Luca and I tried to dynamically set topic retention, but Kafka didn't seem to care (I had never tried it before). I'm going to have to bounce the brokers to get them to temporarily run with a shorter retention period. Starting this now...
I take it back! The command I had run previously looks like it set a larger retention.ms than the default! Doh! This is working:
kafka configs --alter --entity-type topics --entity-name webrequest_upload --add-config retention.ms=172800000 # 48 hours
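To double-check that the override is in place (assuming the kafka wrapper passes these flags straight through to kafka-configs.sh, which also supports --describe):
kafka configs --describe --entity-type topics --entity-name webrequest_upload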
Ok, brokers have deleted webrequest_upload data older than 48 hours. I've removed the topic config override via:
kafka configs --alter --entity-type topics --entity-name webrequest_upload --delete-config retention.ms
Things are a little better now. We might have to do this for webrequest_text too, if things start filling up again before Monday.
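If it comes to that, it would be the same pair of commands with the other topic name, e.g.:
kafka configs --alter --entity-type topics --entity-name webrequest_text --add-config retention.ms=172800000 # 48 hours
kafka configs --alter --entity-type topics --entity-name webrequest_text --delete-config retention.ms # once disk usage recovers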