This is now a blocker for our eventlogging on Kafka project.
Details
Status | Assigned | Task
---|---|---
Resolved | kevinator | T102225 {stag} EventLogging on Kafka
Resolved | Ottomata | T102831 Prep work for Eventlogging on Kafka {stag}
Resolved | Ottomata | T104228 Puppetize parallel eventlogging-processor {stag} [5 pts]
Resolved | LSobanski | T111653 Encrypt all the things
Resolved | BBlack | T92602 Secure inter-datacenter web request log (Kafka) traffic
Resolved | Ottomata | T106581 Build 0.8.2.1 Kafka package and upgrade Kafka brokers
Declined | Ottomata | T98161 Build Kafka 0.8.1.1 package for Jessie and upgrade Brokers to Jessie.
Resolved | Ottomata | T103106 Create jmxtrans Jessie package
Resolved | Ottomata | T90640 Audit hyperthreading on analytics nodes.
Event Timeline
analytics1046-1049 are online as of today! I have started decommissioning analytics1013, 1014 and 1020 from Hadoop. These nodes will become Kafka brokers soon. Hopefully their blocks will have been replicated elsewhere by Monday.
Change 228826 had a related patch set uploaded (by Ottomata):
Preparing to reinstall and expand Kafka cluster on Jessie at Kafka 0.8.2.1
Change 228826 merged by Ottomata:
Preparing to reinstall and expand Kafka cluster on Jessie at Kafka 0.8.2.1
Change 228832 had a related patch set uploaded (by Ottomata):
Removing analytics1013,1014 and 1018 from hadoop worker list in site.pp
Change 228832 merged by Ottomata:
Removing analytics1013,1014 and 1018 from hadoop worker list in site.pp
Change 228847 had a related patch set uploaded (by Ottomata):
Provisioning analytics1013 as Kafka broker in analytics cluster
Change 228847 merged by Ottomata:
Provisioning analytics1013 as Kafka broker in analytics cluster
Change 228851 had a related patch set uploaded (by Ottomata):
Provision analytics1014 and analytics1020 as kafka brokers
Change 228851 merged by Ottomata:
Provision analytics1014 and analytics1020 as kafka brokers
Change 229012 had a related patch set uploaded (by Ottomata):
Remove newly provisioned kafka nodes from cluster
Change 229035 had a related patch set uploaded (by Ottomata):
Remove analytics1012 1013 1020 from list of kafka brokers in site.pp
Change 229035 merged by Ottomata:
Remove analytics1012 1013 1020 from list of kafka brokers in site.pp
Oof, had some problems yesterday :(
Incident documentation here:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20150803-Kafka
Change 229193 had a related patch set uploaded (by Ottomata):
Updates and fixes for 0.8.2.1-2 release
Phew, ok, Joseph and I tested this migration again in labs:
https://etherpad.wikimedia.org/p/kafka_0.8.2.1_migration_labs
This all went smoothly. I've made a migration plan based on this process here:
https://etherpad.wikimedia.org/p/kafka_0.8.2.1_migration2
Alex is reviewing my latest packaging patch:
https://gerrit.wikimedia.org/r/#/c/229193/
Once we get that settled, I can rebuild the .debs and publish them in our apt repository. I'll then test this package in labs one more time, and then I'll feel comfortable trying the upgrade again.
Change 229961 had a related patch set uploaded (by Ottomata):
Rename analytics1013,1014,1020 to kafka1013,1014,1020
Change 229961 merged by Ottomata:
Rename analytics1013,1014,1020 to kafka1013,1014,1020
Change 230576 had a related patch set uploaded (by Ottomata):
Override kafka jmxtrans metrics to test new config for version 0.8.2.1
Change 230576 merged by Ottomata:
Override kafka jmxtrans metrics to test new config for version 0.8.2.1
Change 230577 had a related patch set uploaded (by Ottomata):
Better alias name for All topic metrics from kafka
Phew, after much difficulty, the 4 original Precise brokers are now running 0.8.2.1. A bug in the version of snappy that 0.8.2.1 depends on caused us a lot of headaches.
Immediate cleanup TODOs:
- Make an awesome grafana dashboard: http://grafana.wikimedia.org/#/dashboard/db/kafka
- Clean up jmxtrans metrics, adapt kafka jmxtrans class, etc.
- Audit Kafka alerts and make sure they are based on the correct metric names.
- Write incident report and document data loss
- Make sure Oozie jobs are running and normal (@JAllemandou is doing this).
- Repackage Kafka for Precise, Trusty and Jessie with snappy 1.1.1.7 fix.
I won't be doing the next steps of this migration until the above is done.
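A minimal sketch of the alert audit from the TODO list above, assuming the metric names each alert references and the metric names the upgraded brokers actually emit can both be exported as plain lists. The function name and the example metric names are hypothetical and for illustration only; the real names should be checked against the brokers' JMX output.

```python
def stale_alert_metrics(alert_metrics, live_metrics):
    """Return alert metric names that no broker currently emits.

    alert_metrics: metric names referenced by alert definitions.
    live_metrics:  metric names currently reported by the brokers.
    Order of alert_metrics is preserved in the result.
    """
    live = set(live_metrics)
    return [name for name in alert_metrics if name not in live]


if __name__ == "__main__":
    # Hypothetical names, only to show the shape of the check.
    alerts = [
        "kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec",
        "kafka.server.ReplicaManager.UnderReplicatedPartitions",
    ]
    live = [
        "kafka.server.BrokerTopicMetrics.MessagesInPerSec",
        "kafka.server.ReplicaManager.UnderReplicatedPartitions",
    ]
    for name in stale_alert_metrics(alerts, live):
        print("alert references a metric that is no longer emitted:", name)
```

Any name this returns points at an alert that would never fire (or would sit in an unknown state) after the upgrade, so it needs to be updated to the new metric name.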
Change 231021 had a related patch set uploaded (by Ottomata):
Update JMX metrics names for Kafka 0.8.2
Change 231028 had a related patch set uploaded (by Ottomata):
Update alerts and jmx for Kafka 0.8.2
Change 232097 had a related patch set uploaded (by Ottomata):
Don't use partman for analytics kafka jessie reinstall, do this part manually
Change 232098 had a related patch set uploaded (by Ottomata):
Rename analytics1012 to kafka1012
Change 232097 merged by Ottomata:
Don't use partman for analytics kafka jessie reinstall, do this part manually
Change 232136 had a related patch set uploaded (by Ottomata):
Rename analytics1012 to kafka1012, site.pp puppetization coming in separate commit
Change 232136 merged by Ottomata:
Rename analytics1012 to kafka1012, site.pp puppetization coming in separate commit
Change 232202 had a related patch set uploaded (by Ottomata):
Puppetize kafka1012 as kafka broker in analytics Kafka cluster
Change 232202 merged by Ottomata:
Puppetize kafka1012 as kafka broker in analytics Kafka cluster
Change 232203 had a related patch set uploaded (by Ottomata):
Use kafka1012 as hostname in Kafka cluster config
Change 232319 had a related patch set uploaded (by Ottomata):
Puppetize systemd override for Kafka LimitNOFILE
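One way to confirm that a LimitNOFILE override like the one in the change above has taken effect is to read the running broker's limits from `/proc/<pid>/limits`. The sketch below only parses that file's text; the helper name and the 8192 value in the sample are assumptions for illustration, not the values used in the patch.

```python
def max_open_files(limits_text):
    """Return (soft, hard) 'Max open files' values from /proc/<pid>/limits text.

    Values are returned as strings because either one may be 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Fields: 'Max', 'open', 'files', <soft>, <hard>, 'files'
            fields = line.split()
            return fields[3], fields[4]
    raise ValueError("no 'Max open files' line found")


# Sample text in the format the kernel uses; the numbers are made up.
SAMPLE = """\
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max open files            8192                 8192                 files
"""

if __name__ == "__main__":
    soft, hard = max_open_files(SAMPLE)
    print("soft:", soft, "hard:", hard)
```

On a broker, the same function could be fed the contents of `/proc/<pid>/limits` for the Kafka process, where `<pid>` is that process's PID.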
Change 232534 had a related patch set uploaded (by Ottomata):
Rename analytics1022 to kafka1022
Change 232535 had a related patch set uploaded (by Ottomata):
Update camus property files with names of new brokers
Change 232535 merged by Ottomata:
Update camus property files with names of new brokers
Change 232542 had a related patch set uploaded (by Ottomata):
Rename analytics1022 -> kafka1022
Change 232557 had a related patch set uploaded (by Ottomata):
Rename analytics1022 -> kafka1022
Change 232559 had a related patch set uploaded (by Ottomata):
Temporarily set expire of PTR for kafka1022 to 5 min so I can reinstall asap
Change 232559 merged by Ottomata:
Temporarily set expire of PTR for kafka1022 to 5 min so I can reinstall asap
Change 232560 had a related patch set uploaded (by Ottomata):
Return expire of kafka1022 PTR to 1H
Change 232769 had a related patch set uploaded (by Ottomata):
Rename analytics1018 -> kafka1018 in linux-host-entries
Change 232769 merged by Ottomata:
Rename analytics1018 -> kafka1018 in linux-host-entries
Change 232774 had a related patch set uploaded (by Ottomata):
Rename A record for analytics1018 -> kafka1018
Change 232776 had a related patch set uploaded (by Ottomata):
Repuppetize kafka1018 as a broker
Change 234265 had a related patch set uploaded (by Ottomata):
Decom analytics1021 as a Kafka broker