It seems that Kafka Jumbo 1003's /srv partition is using more than 90% of the disk space.
One of the issues seems to be:
1.1T atskafka_test_webrequest_text-0
1.1T atskafka_test_webrequest_text-1
We currently only send traffic to that topic from a single cp3xxx host, so the size is way too much. Going to enable snappy compression with Ema to see if things improve.
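For reference, enabling snappy on a librdkafka-based producer is a single configuration property. Below is a minimal sketch using the confluent-kafka Python client to illustrate the setting; the broker address, topic contents, and payload are placeholders, not the actual atskafka production config:

```python
# Minimal sketch: a Kafka producer with snappy compression enabled.
# The broker address and payload below are placeholders.
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'kafka-jumbo1001.example:9092',  # placeholder broker
    'compression.codec': 'snappy',  # compress message batches before sending
})

producer.produce('atskafka_test_webrequest_text', value=b'{"uri": "/wiki/Main_Page"}')
producer.flush()  # block until queued messages are delivered
```

Snappy trades a little CPU for a large reduction in network and on-disk size, which is why it should help here.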
Change 589271 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] atskafka::instance: add snappy compression by default
Change 589271 merged by Elukey:
[operations/puppet@production] atskafka::instance: add snappy compression by default
Mentioned in SAL (#wikimedia-operations) [2020-04-16T09:33:22Z] <elukey> restart atskafka on cp3050 to pick up snappy compression - T250347
Change 589275 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] atskafka::instance: add missing comma in config file
Change 589275 merged by Elukey:
[operations/puppet@production] atskafka::instance: add missing comma in config file
Jumbo1002:
12G eventlogging_PaintTiming-0
19G eventlogging-valid-mixed-0
19G eventlogging-valid-mixed-1
19G eventlogging-valid-mixed-6
19G eventlogging-valid-mixed-7
19G eventlogging-valid-mixed-9
20G eventlogging_InukaPageView-0
21G codfw.resource_change-0
30G eventlogging-client-side-10
31G eventlogging-client-side-1
31G eventlogging-client-side-11
31G eventlogging-client-side-2
31G eventlogging-client-side-4
31G eventlogging-client-side-7
37G eventlogging_VirtualPageView-0
38G eventlogging_VirtualPageView-1
38G eventlogging_VirtualPageView-5
38G eventlogging_VirtualPageView-6
38G eventlogging_VirtualPageView-7
38G eventlogging_VirtualPageView-8
54G eventlogging_MobileWikiAppLinkPreview-0
56G eqiad.mediawiki.job.htmlCacheUpdate-0
106G eqiad.mediawiki.api-request-10
106G eqiad.mediawiki.api-request-2
106G eqiad.mediawiki.api-request-3
106G eqiad.mediawiki.cirrussearch-request-0
106G eqiad.mediawiki.cirrussearch-request-1
106G eqiad.mediawiki.cirrussearch-request-10
106G eqiad.mediawiki.cirrussearch-request-11
106G eqiad.mediawiki.cirrussearch-request-2
106G eqiad.mediawiki.cirrussearch-request-4
106G eqiad.mediawiki.cirrussearch-request-7
107G eqiad.mediawiki.api-request-0
107G eqiad.mediawiki.api-request-4
107G eqiad.mediawiki.api-request-9
113G eventlogging_MobileWikiAppSessions-0
160G webrequest_upload-0
160G webrequest_upload-16
160G webrequest_upload-18
160G webrequest_upload-23
160G webrequest_upload-3
160G webrequest_upload-6
161G webrequest_upload-1
161G webrequest_upload-10
161G webrequest_upload-12
161G webrequest_upload-17
161G webrequest_upload-19
161G webrequest_upload-9
216G eventlogging_SearchSatisfaction-0
316G netflow-0
566G webrequest_text-1
566G webrequest_text-11
566G webrequest_text-12
566G webrequest_text-13
566G webrequest_text-14
566G webrequest_text-19
566G webrequest_text-20
566G webrequest_text-4
566G webrequest_text-5
566G webrequest_text-7
567G webrequest_text-6
1.1T atskafka_test_webrequest_text-1
1.1T atskafka_test_webrequest_text-2
Jumbo1003:
14G eventlogging_NavigationTiming-0
18G eqiad.mediawiki.job.RecordLintJob-0
18G eqiad.wdqs-internal.sparql-query-0
19G eventlogging-valid-mixed-10
19G eventlogging-valid-mixed-11
19G eventlogging-valid-mixed-2
19G eventlogging-valid-mixed-3
19G eventlogging-valid-mixed-4
19G eventlogging-valid-mixed-5
19G eventlogging-valid-mixed-6
19G eventlogging-valid-mixed-9
22G eqiad.wdqs-external.sparql-query-0
29G eqiad.mediawiki.job.refreshLinks-0
30G eventlogging-client-side-10
31G eventlogging-client-side-0
31G eventlogging-client-side-1
31G eventlogging-client-side-11
31G eventlogging-client-side-4
31G eventlogging-client-side-5
31G eventlogging-client-side-7
31G eventlogging-client-side-8
38G eventlogging_VirtualPageView-1
38G eventlogging_VirtualPageView-10
38G eventlogging_VirtualPageView-11
38G eventlogging_VirtualPageView-2
38G eventlogging_VirtualPageView-3
38G eventlogging_VirtualPageView-4
38G eventlogging_VirtualPageView-8
38G eventlogging_VirtualPageView-9
56G eqiad.mediawiki.job.htmlCacheUpdate-0
105G eqiad.mediawiki.cirrussearch-request-8
106G eqiad.mediawiki.api-request-1
106G eqiad.mediawiki.api-request-5
106G eqiad.mediawiki.api-request-6
106G eqiad.mediawiki.api-request-8
106G eqiad.mediawiki.cirrussearch-request-0
106G eqiad.mediawiki.cirrussearch-request-1
106G eqiad.mediawiki.cirrussearch-request-10
106G eqiad.mediawiki.cirrussearch-request-11
106G eqiad.mediawiki.cirrussearch-request-4
106G eqiad.mediawiki.cirrussearch-request-5
106G eqiad.mediawiki.cirrussearch-request-7
107G eqiad.mediawiki.api-request-0
107G eqiad.mediawiki.api-request-11
107G eqiad.mediawiki.api-request-4
107G eqiad.mediawiki.api-request-7
113G eventlogging_MobileWikiAppSessions-0
160G webrequest_upload-0
160G webrequest_upload-13
160G webrequest_upload-20
160G webrequest_upload-22
160G webrequest_upload-3
160G webrequest_upload-4
160G webrequest_upload-5
160G webrequest_upload-6
160G webrequest_upload-7
161G webrequest_upload-10
161G webrequest_upload-12
161G webrequest_upload-14
161G webrequest_upload-15
161G webrequest_upload-19
161G webrequest_upload-21
161G webrequest_upload-9
216G eventlogging_SearchSatisfaction-0
308G netflow-2
312G netflow-1
566G webrequest_text-1
566G webrequest_text-10
566G webrequest_text-14
566G webrequest_text-16
566G webrequest_text-17
566G webrequest_text-18
566G webrequest_text-2
566G webrequest_text-21
566G webrequest_text-22
566G webrequest_text-23
566G webrequest_text-4
566G webrequest_text-5
566G webrequest_text-7
566G webrequest_text-8
566G webrequest_text-9
567G webrequest_text-15
1.1T atskafka_test_webrequest_text-0
1.1T atskafka_test_webrequest_text-1
17T total
So there is a ~4TB difference between the two brokers. Some notes:
- 1002 has 11 webrequest_text partitions of ~550GB each (~6T), while 1003 has 16 (~8.8T)
- 1002 has 12 webrequest_upload partitions of ~160GB each (~1.9T), while 1003 has 16 (~2.5T)
The above accounts for a difference of 2.8T + 0.6T = 3.4T, which is more or less the bulk of the problem.
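The same arithmetic as a quick sanity check, with partition counts and per-partition sizes taken from the listings above:

```python
# Quick sanity check of the per-broker imbalance, using the partition
# counts and per-partition sizes from the du listings above.
text_extra = (16 - 11) * 566    # GB: five extra webrequest_text partitions on 1003
upload_extra = (16 - 12) * 160  # GB: four extra webrequest_upload partitions on 1003

print(f"text: ~{text_extra / 1024:.1f}T, upload: ~{upload_extra / 1024:.1f}T, "
      f"total: ~{(text_extra + upload_extra) / 1024:.1f}T")
# -> text: ~2.8T, upload: ~0.6T, total: ~3.4T
```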
Overall I have the following ideas:
- short term, we could drop the atskafka test topic to free space, and re-create it with more partitions to spread the load across multiple brokers (see the sketch after this list). With snappy compression the impact should be far smaller.
- long term, since shuffling partitions around is not easy, we could increase the number of webrequest_text partitions when the new jumbo hosts arrive, to smooth out the size of each partition over time.
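Here is a sketch of the short-term option through the Kafka admin API, using the confluent-kafka Python client for illustration; the broker address and replication factor are assumptions, not the production values:

```python
# Sketch: drop the oversized test topic and re-create it with more
# partitions so the data spreads across brokers. Broker address and
# replication factor are assumptions, not production values.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({'bootstrap.servers': 'kafka-jumbo1001.example:9092'})

# Delete the old topic; result() raises if the controller rejects it.
for topic, future in admin.delete_topics(['atskafka_test_webrequest_text']).items():
    future.result()

# In practice, wait for the deletion to propagate before re-creating.
# With 12 partitions, leaders get spread round-robin across the brokers.
new_topic = NewTopic('atskafka_test_webrequest_text',
                     num_partitions=12, replication_factor=3)
for topic, future in admin.create_topics([new_topic]).items():
    future.result()
```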
@Ottomata thoughts?
Mentioned in SAL (#wikimedia-operations) [2020-04-16T11:17:26Z] <elukey> stop atskafka on cp3050 to re-create the topic atskafka_test_webrequest_text on Kafka Jumbo - T250347
/srv usage is now around 83% after dropping the ats test topic! I have re-created it with 12 partitions, and atskafka now produces to it with snappy compression.
Hm. Moving partitions is a bit annoying but is possible.
https://docs.cloudera.com/runtime/7.1.0/kafka-managing/topics/kafka-manage-cli-reassign-overview.html
I'd also be ok with adding more partitions.
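For reference, the reassignment plan that kafka-reassign-partitions consumes is plain JSON. A hypothetical sketch of generating one follows; the broker IDs are invented, and a real plan would come from the tool's --generate step against live cluster metadata:

```python
# Hypothetical reassignment plan for kafka-reassign-partitions.
# Broker IDs are invented for illustration only.
import json

plan = {
    "version": 1,
    "partitions": [
        # Move webrequest_text-0 onto a different replica set.
        {"topic": "webrequest_text", "partition": 0, "replicas": [1001, 1002, 1004]},
    ],
}

with open("reassign.json", "w") as f:
    json.dump(plan, f, indent=2)

# Then apply it, as described in the Cloudera docs linked above:
#   kafka-reassign-partitions --reassignment-json-file reassign.json --execute
```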