
Possibly expand Kafka main-{eqiad,codfw} clusters in Q4 2019.
Closed, Resolved · Public

Description

During hardware planning last year, Analytics asked for budget to expand the two Kafka main clusters.

We planned to increase RAM to 64G (+32G) on each of the existing brokers, and to add 2 brokers to each cluster. Analytics would like to put this task in SRE's backlog: the Kafka main clusters should be owned by SRE, so we want to leave it up to them to decide if and when this is needed.

AFAIK, the load on Kafka main is not planned to increase significantly in the near future.

Event Timeline

Ottomata created this task. Feb 28 2019, 6:07 PM
Restricted Application added a subscriber: Aklapper. Feb 28 2019, 6:07 PM
Milimetric moved this task from Incoming to Radar on the Analytics board. Mar 4 2019, 4:36 PM
jbond triaged this task as Medium priority. Mar 4 2019, 7:41 PM
CDanis added a subscriber: CDanis. Mar 13 2019, 4:29 PM
herron added a subscriber: herron. Mar 13 2019, 4:29 PM

According to Netbox, support for hosts kafka[12]00[123] expired in Dec 2018. After discussing a bit with @Ottomata, a server refresh with higher-spec hardware would be a reasonable course of action to address both server age and capacity.

If folks agree with that approach, the next question that comes to mind is: do we have budget to proceed with that as part of Q4 work, or would this need to wait until the new FY?

In the SRE spreadsheet I can see that the suggested replacement FY is 20/21, not the upcoming one. Just adding the info; I'm not sure whether these servers are eligible for refresh before 5 years of usage.

After some discussions off-task, we have permission from SRE management to proceed with a HW refresh onto higher-spec systems. That leads to the question of what spec will be sufficient for the next 2-3 years.

I'm inclined to err on the side of caution and spec hosts in the ballpark of 128G RAM. At the same time, we have a sub-goal (T220389) this quarter to review current capacity and plan for the next 2-3 years, so fleshing out those details seems like a good place to start.

128GB sounds good, the more page cache Kafka has the better :)

I'd also think about SSD disks (IIRC we don't have them now), and about checking whether 10G interfaces are needed. The current traffic, even at per-second granularity analyzed via ifstat, seems to fit well within 1G bandwidth, but I'm not sure about the future. If we plan to add a lot more consumers, for example, TX bandwidth might get close to warning levels (Jumbo is starting to complain about 1G bandwidth, for example - https://phabricator.wikimedia.org/T220700 - but that is of course a different use case).
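
For reference, a minimal sketch of the kind of per-second throughput check described above, sampling /proc/net/dev directly rather than running ifstat. The interface name and sampling interval are illustrative assumptions, not values taken from our hosts.

```python
#!/usr/bin/env python3
# Rough per-second RX/TX throughput estimate for one interface,
# computed from two samples of the /proc/net/dev byte counters.
import time

def read_bytes(iface):
    """Return (rx_bytes, tx_bytes) counters for the given interface."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[0]), int(fields[8])
    raise ValueError(f"interface {iface} not found")

def sample(iface="eno1", interval=1.0):  # hypothetical interface name
    rx1, tx1 = read_bytes(iface)
    time.sleep(interval)
    rx2, tx2 = read_bytes(iface)
    rx_mbit = (rx2 - rx1) * 8 / interval / 1e6
    tx_mbit = (tx2 - tx1) * 8 / interval / 1e6
    print(f"{iface}: RX {rx_mbit:.1f} Mbit/s, TX {tx_mbit:.1f} Mbit/s")

if __name__ == "__main__":
    sample()
```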

+1 for SSDs. We will be directly exposing time-based messages to consumers, so I imagine random disk access will greatly increase in the mid-term.
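
To illustrate the time-based access pattern being referred to, here is a hedged sketch using the kafka-python client's offsets_for_times API. The broker address, topic name, and lookback window are made-up placeholders, not configuration from the main clusters.

```python
# Seek a consumer to the first message at or after a given timestamp.
# Reads starting from arbitrary timestamps land away from the log tail,
# which is where random disk access (and the case for SSDs) comes from.
import time
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="kafka-main-host:9092")  # placeholder broker
tp = TopicPartition("example.topic", 0)  # placeholder topic/partition
consumer.assign([tp])

# Ask the broker for the earliest offset whose timestamp is >= one hour ago.
one_hour_ago_ms = int((time.time() - 3600) * 1000)
offsets = consumer.offsets_for_times({tp: one_hour_ago_ms})

if offsets[tp] is not None:
    consumer.seek(tp, offsets[tp].offset)
    for record in consumer:
        print(record.offset, record.timestamp)
        break
```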

Today we discussed desired hardware configs and expansion strategies in a meeting with @elukey, @mobrovac, @Ottomata, and myself. Here are the outcomes:

  1. Any new hosts added to the Kafka main cluster should be configured with SSD-backed storage.
  2. Mixing host/storage configurations for any period longer than necessary for an online migration should be avoided: load is distributed evenly across brokers regardless of their specs, which would lead to inconsistent performance.
  3. Looking at a 2-3 year roadmap, and considering an upgrade to SSDs, 10G net interfaces are strongly recommended.
  4. 3 hosts per site is the bare minimum cluster size; ideally we expand to 5 hosts per site.
  5. Desired per-host hardware specs are:
    • 128G RAM
    • 8x960G SSDs in RAID-10 (see the capacity sketch below)
    • 10G network interfaces
    • CPU >= current (2x quad-core 3.0GHz)
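
As a rough sanity check of the spec above, a small back-of-the-envelope calculation of usable RAID-10 capacity and page cache headroom. The 5G JVM heap figure is an assumption for illustration, not a value agreed on in this task.

```python
# Back-of-the-envelope numbers for the proposed per-host spec.
disks, disk_gb = 8, 960
raw_gb = disks * disk_gb
usable_gb = (disks // 2) * disk_gb        # RAID-10 mirroring halves raw capacity

ram_gb, assumed_heap_gb = 128, 5          # heap size is a hypothetical figure
page_cache_gb = ram_gb - assumed_heap_gb  # remainder is available as OS page cache

print(f"raw {raw_gb} G, usable ~{usable_gb} G")        # raw 7680 G, usable ~3840 G
print(f"~{page_cache_gb} G available as page cache")   # ~123 G available as page cache
```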

Taking the above into account, and considering that upgrading existing hosts in place is a non-starter, cluster expansion is largely tied to hardware replacement/refresh.

In terms of timing I can think of a few possible approaches:

  1. Split expansion/refresh across the 2019 and 2020 FYs
    • Replace the existing Kafka main cluster with 3 new hosts of the above spec in Q4 2019 FY
    • Add 2 new hosts of the above spec in 2020 FY
  2. Accelerate expansion/refresh
    • Replace the existing Kafka main cluster with 5 new hosts of the above spec in Q4 2019 FY
  3. Postpone expansion/refresh
    • Replace the existing Kafka main cluster with 5 new hosts of the above spec in 2020 FY
herron mentioned this in Unknown Object (Task). Apr 23 2019, 3:44 PM
mobrovac added a subtask: Unknown Object (Task). Apr 25 2019, 11:09 PM
herron moved this task from Backlog to Working on on the User-herron board. May 9 2019, 8:05 PM
Papaul closed subtask Unknown Object (Task) as Resolved. May 21 2019, 8:36 PM
RobH added a subtask: Unknown Object (Task). Jul 16 2019, 3:27 PM
herron closed this task as Resolved. Nov 1 2019, 1:45 PM
herron claimed this task.

To circle back on this, we moved forward with option 2 and are using task T225005 to track the migration effort.

Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:44 AM