
kafka-jumbo1003 /srv disk space usage over 90%
Closed, ResolvedPublic

Description

It seems that Kafka Jumbo 1003's /srv partition is using more than 90% of the disk space.

Event Timeline

elukey triaged this task as High priority. Apr 16 2020, 6:34 AM
elukey created this task.

One of the issues seems to be:

1.1T	atskafka_test_webrequest_text-0
1.1T	atskafka_test_webrequest_text-1

We currently send traffic to that topic from only a single cp3xxx host, so the size is far too large. Going to enable snappy compression with Ema to see if things improve.
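
For context, a minimal sketch of what the compression change amounts to on the producer side, using confluent-kafka-python and librdkafka's `compression.codec` property (the broker address and payload below are made-up examples; atskafka itself is configured via the Puppet change below, not via Python):

```lang=python
# Minimal sketch: produce with snappy compression enabled, the same
# librdkafka property that atskafka will pick up from its config.
# Broker address and payload are illustrative, not the real jumbo setup.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka-jumbo1001.eqiad.wmnet:9092",  # assumed endpoint
    "compression.codec": "snappy",  # compress message batches before sending
})

producer.produce("atskafka_test_webrequest_text", value=b'{"uri": "/wiki/Main_Page"}')
producer.flush()
```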

Change 589271 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] atskafka::instance: add snappy compression by default

https://gerrit.wikimedia.org/r/589271

Change 589271 merged by Elukey:
[operations/puppet@production] atskafka::instance: add snappy compression by default

https://gerrit.wikimedia.org/r/589271

Mentioned in SAL (#wikimedia-operations) [2020-04-16T09:33:22Z] <elukey> restart atskafka on cp3050 to pick up snappy compression - T250347

Change 589275 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] atskafka::instance: add missing comma in config file

https://gerrit.wikimedia.org/r/589275

Change 589275 merged by Elukey:
[operations/puppet@production] atskafka::instance: add missing comma in config file

https://gerrit.wikimedia.org/r/589275

Jumbo1002:

12G	eventlogging_PaintTiming-0
19G	eventlogging-valid-mixed-0
19G	eventlogging-valid-mixed-1
19G	eventlogging-valid-mixed-6
19G	eventlogging-valid-mixed-7
19G	eventlogging-valid-mixed-9
20G	eventlogging_InukaPageView-0
21G	codfw.resource_change-0
30G	eventlogging-client-side-10
31G	eventlogging-client-side-1
31G	eventlogging-client-side-11
31G	eventlogging-client-side-2
31G	eventlogging-client-side-4
31G	eventlogging-client-side-7
37G	eventlogging_VirtualPageView-0
38G	eventlogging_VirtualPageView-1
38G	eventlogging_VirtualPageView-5
38G	eventlogging_VirtualPageView-6
38G	eventlogging_VirtualPageView-7
38G	eventlogging_VirtualPageView-8
54G	eventlogging_MobileWikiAppLinkPreview-0
56G	eqiad.mediawiki.job.htmlCacheUpdate-0
106G	eqiad.mediawiki.api-request-10
106G	eqiad.mediawiki.api-request-2
106G	eqiad.mediawiki.api-request-3
106G	eqiad.mediawiki.cirrussearch-request-0
106G	eqiad.mediawiki.cirrussearch-request-1
106G	eqiad.mediawiki.cirrussearch-request-10
106G	eqiad.mediawiki.cirrussearch-request-11
106G	eqiad.mediawiki.cirrussearch-request-2
106G	eqiad.mediawiki.cirrussearch-request-4
106G	eqiad.mediawiki.cirrussearch-request-7
107G	eqiad.mediawiki.api-request-0
107G	eqiad.mediawiki.api-request-4
107G	eqiad.mediawiki.api-request-9
113G	eventlogging_MobileWikiAppSessions-0
160G	webrequest_upload-0
160G	webrequest_upload-16
160G	webrequest_upload-18
160G	webrequest_upload-23
160G	webrequest_upload-3
160G	webrequest_upload-6
161G	webrequest_upload-1
161G	webrequest_upload-10
161G	webrequest_upload-12
161G	webrequest_upload-17
161G	webrequest_upload-19
161G	webrequest_upload-9
216G	eventlogging_SearchSatisfaction-0
316G	netflow-0
566G	webrequest_text-1
566G	webrequest_text-11
566G	webrequest_text-12
566G	webrequest_text-13
566G	webrequest_text-14
566G	webrequest_text-19
566G	webrequest_text-20
566G	webrequest_text-4
566G	webrequest_text-5
566G	webrequest_text-7
567G	webrequest_text-6
1.1T	atskafka_test_webrequest_text-1
1.1T	atskafka_test_webrequest_text-2

Jumbo1003:

14G	eventlogging_NavigationTiming-0
18G	eqiad.mediawiki.job.RecordLintJob-0
18G	eqiad.wdqs-internal.sparql-query-0
19G	eventlogging-valid-mixed-10
19G	eventlogging-valid-mixed-11
19G	eventlogging-valid-mixed-2
19G	eventlogging-valid-mixed-3
19G	eventlogging-valid-mixed-4
19G	eventlogging-valid-mixed-5
19G	eventlogging-valid-mixed-6
19G	eventlogging-valid-mixed-9
22G	eqiad.wdqs-external.sparql-query-0
29G	eqiad.mediawiki.job.refreshLinks-0
30G	eventlogging-client-side-10
31G	eventlogging-client-side-0
31G	eventlogging-client-side-1
31G	eventlogging-client-side-11
31G	eventlogging-client-side-4
31G	eventlogging-client-side-5
31G	eventlogging-client-side-7
31G	eventlogging-client-side-8
38G	eventlogging_VirtualPageView-1
38G	eventlogging_VirtualPageView-10
38G	eventlogging_VirtualPageView-11
38G	eventlogging_VirtualPageView-2
38G	eventlogging_VirtualPageView-3
38G	eventlogging_VirtualPageView-4
38G	eventlogging_VirtualPageView-8
38G	eventlogging_VirtualPageView-9
56G	eqiad.mediawiki.job.htmlCacheUpdate-0
105G	eqiad.mediawiki.cirrussearch-request-8
106G	eqiad.mediawiki.api-request-1
106G	eqiad.mediawiki.api-request-5
106G	eqiad.mediawiki.api-request-6
106G	eqiad.mediawiki.api-request-8
106G	eqiad.mediawiki.cirrussearch-request-0
106G	eqiad.mediawiki.cirrussearch-request-1
106G	eqiad.mediawiki.cirrussearch-request-10
106G	eqiad.mediawiki.cirrussearch-request-11
106G	eqiad.mediawiki.cirrussearch-request-4
106G	eqiad.mediawiki.cirrussearch-request-5
106G	eqiad.mediawiki.cirrussearch-request-7
107G	eqiad.mediawiki.api-request-0
107G	eqiad.mediawiki.api-request-11
107G	eqiad.mediawiki.api-request-4
107G	eqiad.mediawiki.api-request-7
113G	eventlogging_MobileWikiAppSessions-0
160G	webrequest_upload-0
160G	webrequest_upload-13
160G	webrequest_upload-20
160G	webrequest_upload-22
160G	webrequest_upload-3
160G	webrequest_upload-4
160G	webrequest_upload-5
160G	webrequest_upload-6
160G	webrequest_upload-7
161G	webrequest_upload-10
161G	webrequest_upload-12
161G	webrequest_upload-14
161G	webrequest_upload-15
161G	webrequest_upload-19
161G	webrequest_upload-21
161G	webrequest_upload-9
216G	eventlogging_SearchSatisfaction-0
308G	netflow-2
312G	netflow-1
566G	webrequest_text-1
566G	webrequest_text-10
566G	webrequest_text-14
566G	webrequest_text-16
566G	webrequest_text-17
566G	webrequest_text-18
566G	webrequest_text-2
566G	webrequest_text-21
566G	webrequest_text-22
566G	webrequest_text-23
566G	webrequest_text-4
566G	webrequest_text-5
566G	webrequest_text-7
566G	webrequest_text-8
566G	webrequest_text-9
567G	webrequest_text-15
1.1T	atskafka_test_webrequest_text-0
1.1T	atskafka_test_webrequest_text-1
17T	total

So there is a difference of ~4TB between the two brokers. Some notes:

  • 1002 has 11 webrequest_text partitions of ~550GB each (~6T), while 1003 has 16 (~8.8T)
  • 1002 has 12 webrequest_upload partitions of ~160GB each (~1.9T), while 1003 has 16 (~2.5T)

The above accounts for a difference of 2.8T + 0.6T = 3.4T, which is more or less the bulk of the problem.
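
Back-of-envelope check of those numbers (sizes rounded as above):

```lang=python
# Rough per-broker difference implied by the partition counts above (sizes in TB).
text_diff   = (16 - 11) * 0.55   # webrequest_text:   ~2.8 TB more on 1003
upload_diff = (16 - 12) * 0.16   # webrequest_upload: ~0.6 TB more on 1003
print(round(text_diff, 1), round(upload_diff, 1), round(text_diff + upload_diff, 1))
# -> 2.8 0.6 3.4
```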

Overall I have the following ideas:

  • short term, we could drop the atskafka test topic to free space, and re-create it with more partitions to spread the load across multiple brokers (see the sketch after this list). With snappy compression the impact should be far lower.
  • long term, since shuffling partitions around is not easy, we could increase the number of webrequest_text partitions when the new jumbo hosts arrive, in order to even out the size of each partition over time.
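
A minimal sketch of the short-term option using confluent-kafka-python's admin API; the broker address and replication factor are assumptions, and in practice atskafka would be stopped first and the deletion allowed to propagate before re-creating:

```lang=python
# Sketch: drop the oversized test topic and re-create it with more partitions
# so the data spreads across more brokers. Broker address and replication
# factor below are assumptions, not the actual jumbo settings.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "kafka-jumbo1001.eqiad.wmnet:9092"})

topic = "atskafka_test_webrequest_text"

# Delete the existing topic and wait for the request to be accepted.
# (Deletion is asynchronous broker-side; wait for it to complete before re-creating.)
for fut in admin.delete_topics([topic]).values():
    fut.result()

# Re-create it with 12 partitions (replication factor 3 is an assumption).
new_topic = NewTopic(topic, num_partitions=12, replication_factor=3)
for fut in admin.create_topics([new_topic]).values():
    fut.result()
```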

@Ottomata thoughts?

Mentioned in SAL (#wikimedia-operations) [2020-04-16T11:17:26Z] <elukey> stop atskafka on cp3050 to re-create the topic atskafka_test_webrequest_text on Kafka Jumbo - T250347

elukey lowered the priority of this task from High to Medium. Edited Apr 16 2020, 11:30 AM

/srv usage is now around 83% after dropping the ats test topic! I have re-created it with 12 partitions, plus snappy compression from atskafka.

Hm. Moving partitions is a bit annoying, but it is possible.
https://docs.cloudera.com/runtime/7.1.0/kafka-managing/topics/kafka-manage-cli-reassign-overview.html

I'd also be ok with adding more partitions.
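
For reference, the reassignment flow in that link consumes a JSON plan that is then passed to `kafka-reassign-partitions.sh --reassignment-json-file ... --execute` (and later `--verify`). A sketch of building such a plan, with made-up broker IDs and a naive round-robin replica placement:

```lang=python
# Sketch: generate a reassignment plan for kafka-reassign-partitions.sh.
# Broker IDs and the round-robin placement are illustrative only.
import json

brokers = [1001, 1002, 1003, 1004, 1005, 1006]  # assumed broker IDs
topic = "webrequest_text"
partitions = 24
replication = 3

plan = {
    "version": 1,
    "partitions": [
        {
            "topic": topic,
            "partition": p,
            # Rotate replicas around the broker list so leaders spread evenly.
            "replicas": [brokers[(p + i) % len(brokers)] for i in range(replication)],
        }
        for p in range(partitions)
    ],
}

with open("reassignment.json", "w") as f:
    json.dump(plan, f, indent=2)
```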