
Kafka Broker disk usage is imbalanced
Closed, Resolved · Public

Description

Disk usage across brokers looks imbalanced. Here's analytics1022:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdi1       1.8T  770G  1.1T  42% /var/spool/kafka/i
/dev/sdd1       1.8T  1.2T  617G  67% /var/spool/kafka/d
/dev/sde1       1.8T  768G  1.1T  42% /var/spool/kafka/e
/dev/sdl1       1.8T  769G  1.1T  42% /var/spool/kafka/l
/dev/sdc1       1.8T  769G  1.1T  42% /var/spool/kafka/c
/dev/sdh1       1.8T  1.3T  566G  70% /var/spool/kafka/h
/dev/sdg1       1.8T  604G  1.3T  33% /var/spool/kafka/g
/dev/sdk1       1.8T  614G  1.2T  34% /var/spool/kafka/k
/dev/sdf1       1.8T  1.3T  578G  69% /var/spool/kafka/f
/dev/sdj1       1.8T  600G  1.3T  33% /var/spool/kafka/j
/dev/sdb3       1.8T  1.7T  102G  95% /var/spool/kafka/b
/dev/sda3       1.8T  1.2T  587G  68% /var/spool/kafka/a

Why? Webrequests should be sent to random partitions. A week ago I started sending EventLogging data to Kafka, keyed by schema name, using the python Kafka producer. Perhaps it isn't partitioning randomly? If it is hashing keys to partitions, a small number of distinct schema keys could make some partitions much larger than others. Need to look into this.
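For context (not part of the original description): a keyed producer typically picks the partition by hashing the key, so a handful of distinct schema-name keys funnels all EventLogging traffic into a correspondingly small set of partitions. A minimal sketch of that effect; the schema names, partition count, and crc32 hash are illustrative assumptions, not the actual producer's partitioner:

```python
from collections import Counter
from zlib import crc32

# Hypothetical schema names and partition count, for illustration only.
schemas = ['NavigationTiming', 'Edit', 'PageContentSaveComplete', 'ServerSideAccountCreation']
num_partitions = 12

# A hashed-key partitioner pins each key to one partition:
assignments = Counter(crc32(s.encode('utf-8')) % num_partitions for s in schemas)
print(assignments)
# At most len(schemas) of the 12 partitions ever receive keyed messages,
# so the disks holding those partitions fill faster than the rest.
```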

Event Timeline

Ottomata raised the priority of this task to Needs Triage.
Ottomata updated the task description.
Ottomata added subscribers: Ottomata, Gage, JAllemandou.
Dzahn triaged this task as Medium priority. May 26 2015, 9:21 PM
Dzahn subscribed.

Current status:

kafka1014.eqiad.wmnet:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       1.8T  618G  1.2T  35% /var/spool/kafka/a
/dev/sdb3       1.8T  614G  1.2T  34% /var/spool/kafka/b
/dev/sdc1       1.8T  524G  1.3T  29% /var/spool/kafka/c
/dev/sdd1       1.8T  616G  1.2T  34% /var/spool/kafka/d
/dev/sde1       1.8T  616G  1.2T  34% /var/spool/kafka/e
/dev/sdf1       1.8T  629G  1.2T  35% /var/spool/kafka/f
/dev/sdg1       1.8T  525G  1.3T  29% /var/spool/kafka/g
/dev/sdh1       1.8T  615G  1.2T  34% /var/spool/kafka/h
/dev/sdi1       1.8T  534G  1.3T  30% /var/spool/kafka/i
/dev/sdj1       1.8T  523G  1.3T  29% /var/spool/kafka/j
/dev/sdk1       1.8T  531G  1.3T  29% /var/spool/kafka/k
/dev/sdl1       1.8T  529G  1.3T  29% /var/spool/kafka/l

kafka1020.eqiad.wmnet:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       1.8T  634G  1.2T  36% /var/spool/kafka/a
/dev/sdb3       1.8T  612G  1.2T  34% /var/spool/kafka/b
/dev/sdc1       1.8T  539G  1.3T  30% /var/spool/kafka/c
/dev/sdd1       1.8T  640G  1.2T  35% /var/spool/kafka/d
/dev/sde1       1.8T  530G  1.3T  29% /var/spool/kafka/e
/dev/sdf1       1.8T  616G  1.2T  34% /var/spool/kafka/f
/dev/sdg1       1.8T  506G  1.3T  28% /var/spool/kafka/g
/dev/sdh1       1.8T  632G  1.2T  35% /var/spool/kafka/h
/dev/sdi1       1.8T  505G  1.3T  28% /var/spool/kafka/i
/dev/sdj1       1.8T  520G  1.3T  29% /var/spool/kafka/j
/dev/sdk1       1.8T  626G  1.2T  35% /var/spool/kafka/k
/dev/sdl1       1.8T  541G  1.3T  30% /var/spool/kafka/l

kafka1022.eqiad.wmnet:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       1.8T  504G  1.3T  28% /var/spool/kafka/a
/dev/sdb3       1.8T  1.1T  690G  62% /var/spool/kafka/b
/dev/sdc1       1.8T  620G  1.2T  34% /var/spool/kafka/c
/dev/sdd1       1.8T   22G  1.8T   2% /var/spool/kafka/d
/dev/sde1       1.8T  511G  1.3T  28% /var/spool/kafka/e
/dev/sdf1       1.8T   55G  1.8T   3% /var/spool/kafka/f
/dev/sdg1       1.8T  620G  1.2T  34% /var/spool/kafka/g
/dev/sdh1       1.8T   34G  1.8T   2% /var/spool/kafka/h
/dev/sdi1       1.8T  1.1T  719G  61% /var/spool/kafka/i
/dev/sdj1       1.8T  627G  1.2T  35% /var/spool/kafka/j
/dev/sdk1       1.8T  1.1T  709G  62% /var/spool/kafka/k
/dev/sdl1       1.8T  506G  1.3T  28% /var/spool/kafka/l

kafka1013.eqiad.wmnet:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       1.8T  631G  1.2T  35% /var/spool/kafka/a
/dev/sdb3       1.8T  612G  1.2T  34% /var/spool/kafka/b
/dev/sdc1       1.8T  531G  1.3T  29% /var/spool/kafka/c
/dev/sdd1       1.8T  631G  1.2T  35% /var/spool/kafka/d
/dev/sde1       1.8T  654G  1.2T  36% /var/spool/kafka/e
/dev/sdf1       1.8T  613G  1.2T  34% /var/spool/kafka/f
/dev/sdg1       1.8T  505G  1.3T  28% /var/spool/kafka/g
/dev/sdh1       1.8T  631G  1.2T  35% /var/spool/kafka/h
/dev/sdi1       1.8T  506G  1.3T  28% /var/spool/kafka/i
/dev/sdj1       1.8T  505G  1.3T  28% /var/spool/kafka/j
/dev/sdk1       1.8T  505G  1.3T  28% /var/spool/kafka/k
/dev/sdl1       1.8T  541G  1.3T  30% /var/spool/kafka/l

kafka1012.eqiad.wmnet:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc1       1.8T  104G  1.7T   6% /var/spool/kafka/c
/dev/sdj1       1.8T  621G  1.2T  34% /var/spool/kafka/j
/dev/sdf1       1.8T  1.1T  733G  61% /var/spool/kafka/f
/dev/sdd1       1.8T  608G  1.2T  34% /var/spool/kafka/d
/dev/sdi1       1.8T  535G  1.3T  30% /var/spool/kafka/i
/dev/sdh1       1.8T  603G  1.3T  33% /var/spool/kafka/h
/dev/sdk1       1.8T  525G  1.3T  29% /var/spool/kafka/k
/dev/sde1       1.8T  8.6G  1.8T   1% /var/spool/kafka/e
/dev/sdl1       1.8T  619G  1.2T  34% /var/spool/kafka/l
/dev/sdg1       1.8T  532G  1.3T  29% /var/spool/kafka/g
/dev/sdb3       1.8T 1023G  783G  57% /var/spool/kafka/b
/dev/sda3       1.8T  515G  1.3T  29% /var/spool/kafka/a

kafka1018.eqiad.wmnet:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       1.8T  531G  1.3T  30% /var/spool/kafka/a
/dev/sdd1       1.8T   34G  1.8T   2% /var/spool/kafka/d
/dev/sde1       1.8T  635G  1.2T  35% /var/spool/kafka/e
/dev/sdf1       1.8T  649G  1.2T  36% /var/spool/kafka/f
/dev/sdg1       1.8T  1.1T  724G  61% /var/spool/kafka/g
/dev/sdh1       1.8T   41G  1.8T   3% /var/spool/kafka/h
/dev/sdi1       1.8T  516G  1.3T  29% /var/spool/kafka/i
/dev/sdj1       1.8T  1.1T  736G  60% /var/spool/kafka/j
/dev/sdk1       1.8T  615G  1.2T  34% /var/spool/kafka/k
/dev/sdl1       1.8T   35G  1.8T   2% /var/spool/kafka/l
/dev/sdb1       1.8T  1.2T  701G  62% /var/spool/kafka/b
/dev/sdc3       1.8T  508G  1.3T  29% /var/spool/kafka/c
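(Added for reference, not part of the original comment.) One way to see which topic partitions are driving the per-disk imbalance is to sum the size of each partition directory under the broker's log dirs; a minimal sketch, assuming the standard Kafka <log.dir>/<topic>-<partition> layout under the /var/spool/kafka/* mount points shown above:

```python
import glob
import os

def dir_size(path):
    # Total bytes of all files under a partition directory.
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

sizes = {}
for log_dir in glob.glob('/var/spool/kafka/*'):
    for partition_dir in glob.glob(os.path.join(log_dir, '*-*')):
        if os.path.isdir(partition_dir):
            sizes[partition_dir] = dir_size(partition_dir)

# Print the 20 largest partitions to see what is filling the busy disks.
for path, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:20]:
    print('%8.1f GiB  %s' % (size / 2**30, path))
```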

@Ottomata: It indeed seems that message keys are actually input to the partitioner :)

Having 'per-schema' topics might also have an impact.
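(Sketch added for illustration, assuming the kafka-python KafkaProducer API; the broker address and topic name are placeholders, not necessarily what EventLogging uses.) If messages are produced without a key, the default partitioner spreads them across all partitions instead of pinning each schema to one:

```python
import json
from kafka import KafkaProducer  # assumes the kafka-python package

producer = KafkaProducer(
    bootstrap_servers='kafka1012.eqiad.wmnet:9092',  # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

event = {'schema': 'NavigationTiming', 'event': {'duration': 123}}  # made-up payload

# No key: the partition is chosen independently of the schema name,
# so writes are spread roughly evenly across partitions (and disks).
producer.send('eventlogging-mixed', value=event)  # placeholder topic name
producer.flush()
```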

Ottomata lowered the priority of this task from Medium to Lowest. Feb 3 2016, 6:00 PM
Milimetric changed the task status from Open to Stalled. Feb 4 2016, 6:09 PM
Milimetric subscribed.

No actionables. If we reinstall, we'll use RAID.