Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams
Closed, Resolved · Public

Description

For memcached purges (and later CDN purges) to reliably happen both within and across datacenters, we will need a pubsub system. Ori, Andrew, and I have been talking about using Kafka for this.

A modest 2 node setup in each DC is enough for HA, and the purge streams will consist of small JSON messages which are quickly consumed, so we should not need a huge amount of disk.

A few notes about the usage:

  • The varnish and memcached nodes in codfw will be subscribed to the purge stream cluster in eqiad
  • The varnish and memcached nodes in eqiad will be subscribed to the purge stream cluster in codfw for completeness and simpler fail-over (though little traffic should come this way)
  • MediaWiki will be the only initial producer of memcached and varnish purge JSON messages
  • All subscribers are thus cross-DC
  • Since producers always talk to local Kafka clusters, latency should not be an issue, though the DeferredUpdates class for MediaWiki can be put to use if needed
  • Messages only convey purges, not new values, so they are very small
  • The rate of purges is normally tied to the rate of editing across all sites, which is low (<< 100 Hz)
  • Maintenance scripts sometimes trigger lots of purges, which should still be fine, but is worth thinking about more than normal editing
  • The cluster will likely be expanded and used for the larger event bus project down the road...
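
As a rough illustration of the routing described in the notes above, here is a minimal Python sketch. The field names (`type`, `key`), the function names, and the example key are hypothetical: the task does not define a message schema or a client API, and no Kafka client library is used here.

```python
import json

# The two datacenters named in this task.
DATACENTERS = ("eqiad", "codfw")


def producer_cluster(local_dc):
    """Producers (MediaWiki) always write to the Kafka cluster in their own DC."""
    if local_dc not in DATACENTERS:
        raise ValueError(f"unknown datacenter: {local_dc}")
    return local_dc


def consumer_cluster(local_dc):
    """Varnish and memcached nodes subscribe to the cluster in the other DC."""
    if local_dc not in DATACENTERS:
        raise ValueError(f"unknown datacenter: {local_dc}")
    return next(dc for dc in DATACENTERS if dc != local_dc)


def make_purge_message(kind, key):
    """Build a small JSON purge message (hypothetical schema).

    The message carries only what to purge, never a new value.
    """
    return json.dumps({"type": kind, "key": key}, separators=(",", ":"))


# A memcached node in codfw would read purges produced in eqiad.
print(consumer_cluster("codfw"))
print(make_purge_message("memcached", "example:cache:key"))
```

Because each message names only the key to invalidate, its size stays small regardless of the size of the cached value.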

Since varnish and memcached themselves can handle high purge rates, it would be nice not to bottleneck them with the bus too much, even if we don't purge at high rates *normally*. There is a lot of room for discussion about disk type and RAM. I'd defer to Andrew on those.

Some notes were also logged at https://etherpad.wikimedia.org/p/KafkaPurge

The consumer "pull" logic would be ported from the redis prototype at https://git.wikimedia.org/tree/mediawiki%2Fservices%2Fpython-cache-relay

Update of task from discussion:

eqiad (one of the below): comments from @GWicke support use of the spares for at least a year of projected use.

  • allocate two spare R610 single-CPU out-of-warranty systems, swap in the 250GB disks per @Ottomata's request
  • purchase two single-CPU systems, priced on T117240

codfw (one of the below): Neither option has been selected. The preference depends on which costs less in budget: ordering a new single-CPU machine or using an over-provisioned spare.

  • allocate two over-provisioned Dell PowerEdge R420, Dual Intel Xeon E5-2440, 32 GB Memory, (2) 500GB Disks (we have 4 remaining: two out of warranty as of this year, and two whose warranty expires in January 2016)
  • purchase two single-CPU systems, priced on T117240

Event Timeline

aaron created this task. Sep 29 2015, 10:53 PM
aaron claimed this task.
aaron raised the priority of this task from to Normal.
aaron updated the task description.
aaron added a project: acl*sre-team.
aaron set Security to None.
aaron added subscribers: ori, Gilles, aaron and 4 others.
Restricted Application added a subscriber: Matanya. Sep 29 2015, 10:54 PM
aaron updated the task description. Sep 29 2015, 10:55 PM
aaron updated the task description.

Aye cool!

Quick note before I leave for the day:

These Kafka clusters will likely be used for the production use of EventBus which is currently only in the brainstorm phase. The throughput of the production clusters will likely be much less than the Analytics Kafka cluster, but we should still provision hardware with more use cases in mind than just this one.

aaron removed aaron as the assignee of this task. Oct 2 2015, 6:38 PM
ori edited projects, added hardware-requests; removed procurement. Oct 2 2015, 6:59 PM
Krinkle added a subscriber: Krinkle. Oct 5 2015, 6:15 PM
Ottomata added a subscriber: GWicke.

@ori, @mark, @GWicke, @kevinator

I'm not really sure how to move this forward. Whose budget does this come from? Who approves the procurement request?

Ottomata claimed this task. Oct 23 2015, 3:29 PM
GWicke added a comment. Edited Oct 23 2015, 4:09 PM

@Ottomata, we have some hardware budget left in services that we could potentially use to get this started. However, the hardware requirements should be fairly modest, so old spares or co-location with other services might be worth looking into.

@aaron, do you have a specific reason for wanting three nodes, rather than two? Kafka isn't using quorums, so my understanding is that two nodes would be fine for HA.

ori added a comment. Oct 23 2015, 4:20 PM

We have hardware budget for this in performance, too.

Yea, as long as one node can handle all the production traffic, then 2 is fine for HA.

I forgot about the spares idea. @Cmjohnson, what we got? More RAM and disk is good, but we can live with spares for now. Do we have spare nodes in eqiad and codfw we could use for this?

GWicke added a comment. Edited Oct 23 2015, 5:19 PM

Based on the data we got in labs (1300 events / s on a single-core labs VM) I'd say that any server with >= 16G RAM will be fine. For storage, 120G is sufficient to store seven days' worth of events with a compressed size of 1k per message, produced at a mean of 200 messages / second. Realistic messages are likely going to be smaller, especially with compression. I'd say that anything >= 120G of storage should work. SSDs would be nice, but aren't a strict requirement.
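
The storage figure above can be checked with a short calculation. This is a minimal sketch, assuming "1k" means 1000 bytes and "G" means decimal gigabytes; the function name is hypothetical, and the estimate ignores Kafka's own index and segment overhead.

```python
SECONDS_PER_DAY = 86_400


def storage_needed_gb(msgs_per_sec, bytes_per_msg, retention_days):
    """Return the disk space, in decimal gigabytes, for one copy of the stream."""
    total_bytes = msgs_per_sec * bytes_per_msg * SECONDS_PER_DAY * retention_days
    return total_bytes / 1e9


# Figures quoted in the comment: 200 messages/s mean, 1k per message, 7 days.
estimate = storage_needed_gb(msgs_per_sec=200, bytes_per_msg=1000, retention_days=7)
print(f"{estimate:.1f} GB")  # prints "121.0 GB"
```

Since 1k per message is described as an upper bound, smaller real-world messages would reduce the requirement proportionally.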

There are quite a few candidates in https://wikitech.wikimedia.org/wiki/Server_Spares fitting these criteria. For example, the R610s look more than capable. @RobH, @Cmjohnson: Do you think we could use two of those in eqiad?

GWicke renamed this task from Setup a 3 server Kafka instance in both eqiad and codfw for reliable purge streams to Setup a 2 server Kafka instance in both eqiad and codfw for reliable purge streams. Edited Oct 26 2015, 5:16 PM
GWicke updated the task description.

Updated the ask to two boxes per DC in the description.

In codfw, I see that there are some R420s that would work for this task. I understand that spares are tight though, so the decision on using these spares vs. buying new boxes probably depends on whether other tasks would benefit from more powerful (new) machines.

RobH claimed this task. Oct 26 2015, 5:21 PM
RobH added a comment. Oct 27 2015, 12:59 AM

I meant to get to this today, but other tasks took priority.

I'll investigate the potential spares for this and also notate the costs of them on a linked sub-task (since costs/quotes aren't public). Then with the pricing data and current spares, we'll be able to discuss allocation/budgeting/potential ordering with @mark.

Hm, the R610s look good, although we don't need SSDs. If I had to choose, I'd prefer to go with larger HDDs over the smaller SSDs. @RobH, if we use the R610s, can/should we swap out their SSDs with HDDs?

The R420s are perfect!

RobH reassigned this task from RobH to mark. Oct 30 2015, 5:08 PM

@Ottomata: We can swap the dual 160GB SSDs with dual 250GB HDDs. They are all 2.5" SFF (small form factor) disks, so we don't have any larger capacity in SFF.

Additionally, the R610s are well out of warranty. This means any hardware failures will require replacement parts, or if it's the mainboard/NIC/mgmt, the entire system will need replacement. So while using the old machines has a low (to no) up-front cost, there are backend costs to pay down the road.

This request was initially for both sites, but we only have the R610s in eqiad.

codfw only has: Dell PowerEdge R420, Dual Intel Xeon E5-2440, 32 GB Memory, (2) 500GB Disks

We recently got a quote for a single CPU 32GB dual 500GB system on T117240. That ticket can be checked for pricing. (DO NOT PLACE PRICING IN THIS PUBLIC TASK!) Only those with access

Thus, I see the options as follows (if anyone else has alternative suggestions, please comment):

eqiad (one of the below):

  • allocate a spare R610 single-CPU out-of-warranty system, swap in the 250GB disks per @Ottomata's request
  • purchase a single-CPU system, priced on T117240

codfw (one of the below):

  • allocate an over-provisioned Dell PowerEdge R420, Dual Intel Xeon E5-2440, 32 GB Memory, (2) 500GB Disks (we have 4 remaining: two out of warranty as of this year, and two whose warranty expires in January 2016)
  • purchase a single-CPU system, priced on T117240

Budget: @mark will need to determine with @GWicke (or whoever handles the services budget) and @ori (or whoever handles the performance budget) where (and how much) this will apply.

As this is now at the approval of spares use or purchase of new hardware, I'm escalating this task to @mark for his review/approval/corrections.

I'd like it if these systems were as close to homogeneous as possible, so I think we should go with option 2 for eqiad, and either of the options for codfw.

Thanks @RobH!

What would the ETAs be for either option? I.e. if we go with spares, how much time is needed to image them / put them to work? Likewise, what would be the expected wait time for going with new boxes?

Budget considerations aside, it seems to me the optimal way forward would be to start with the spares and then replace them with new boxes as they arrive.

I'd like it if these systems were as close to homogeneous as possible

Me too, which is why I think we could go with spares initially, but later expansion will be easier if we order new boxes.

Budget considerations aside, it seems to me the optimal way forward would be to start with the spares and then replace them with new boxes as they arrive.

If we are going to replace these soon after we set them up, I'd prefer not to bother. For development purposes, we can just use the existing Analytics Kafka cluster in eqiad until we are ready to productionize.

GWicke added a comment. Edited Oct 30 2015, 8:31 PM

Given the request volumes we expect, the main consideration for homogeneity is probably disk space, with the smaller system (if uneven) driving the uniform retention setting in Kafka. Our use cases don't require events to be kept for more than a couple of days, so going with the existing 2x160G (or 250G) would be fine. It will likely be good for about 30 days' worth of messages.

Given the redundancy of the setup and low / generic requirements (and thus likely availability of suitable spares), I would say that using old systems for this is a good use of our money and time. Ultimately though it's ops' call on whether these spares should be used for something else, or if we should use new hardware in the expectation that it'll be less likely to fail.

For us, the most important consideration is time. We'd love to get started in eqiad in the next few weeks, and hardware orders tend to add a month or so.

All in all, the weaker hardware would be fine as a start, and if we have to have different specs, we can deal with it, especially to start with. I'd just prefer if we didn't have to plan a hardware migration if we choose to upgrade shortly after these are used in production.

GWicke added a comment. Edited Nov 3 2015, 7:37 PM

@Ottomata, based on the data we have so far (see T88459#1600439) even the smallest spares should have about two orders of magnitude throughput headroom over the event volume we expect initially. Those spares should last us a long time.

Mmk!

@GWicke, I can't remember. Have we talked about where the event http service will run? Will we colocate with these Kafka boxes for now?

@Ottomata: Yes, we'd like to co-locate. The benchmarks were done with co-location too.

Ok, good with me. Am fine with the spares decision. :)

RobH updated the task description. Nov 3 2015, 7:59 PM
RobH updated the task description.
RobH added a comment. Nov 3 2015, 8:01 PM

I've updated the task description to reflect the discussion results of using the two spares already in eqiad, and no preference on spare use or new purchase for codfw deployment.

Mark will still need to determine the budgets these pull from with the above teams on task.

Please review, comment accordingly, and assign back to me for followup. Thanks!

mark added a comment. Nov 13 2015, 1:00 PM

We're not deploying new production services on systems out of warranty.

Rob is already in the process of procuring a batch of new miscellaneous servers, and I think we should allocate 2+2 for these.

@RobH: please prioritize that procurement request, as it's now a blocker for this.

@mark, @RobH: What is the timeline for this procurement? Are those servers going to be usable before early December?

RobH claimed this task. Nov 13 2015, 3:33 PM

I'm currently working on the misc system quotes. I expect we'll order them sometime next week or the following, and they have a 2-3 week delivery lead time. This would mean early to mid December.

faidon added a subscriber: faidon. Nov 13 2015, 3:36 PM

Mmk!
@GWicke, I can't remember. Have we talked about where the event http service will run? Will we colocate with these Kafka boxes for now?

@Ottomata: Yes, we'd like to co-locate. The benchmarks were done with co-location too.

Just to be clear, as far as I know, we've never agreed to this extra Event HTTP service. I don't agree with colocating and I'm not sure I even agree with having this in the first place.

This task is about a new Kafka setup for the purposes of the event bus — I'm OK with that.

OK, the above comment caused some confusion — apologies for that. I had a chat with @Ottomata about this last week, I'll clarify here for posterity as well:

  • This hardware request has been for 2+2 servers for the Kafka cluster for EventBus (and to be precise, not even that, the title still says "for reliable purge streams"). This is something that is fairly uncontroversial I think.
  • The term "Event HTTP service" is slightly confusing. Are we all talking about the new EventLogging-based service and not restevent? This was the source of my "having this in the first place" comment. If we all agree that this is Andrew's EL-based service, you can ignore that part.
  • The other part of my disagreement was about colocating the HTTP service on these servers; this was on the premise that this is unnecessary coupling and that it felt more right to colocate this service on e.g. the eventlogXXXX servers. Andrew is worried that we have only a single eventlog[12]001 server per DC though, to which my response was that then this is something that should probably be fixed too :) I realize this is much to ask for the EventBus deployment though, so I'm willing to withdraw my objections if Andrew feels strongly about it.
  • In any case, this task is about a hardware allocation. Let's not make it about the architecture of EventBus :)
GWicke added a comment. Edited Nov 19 2015, 4:28 PM

@RobH, any updates on the timeline? Are we on track for having this hardware ready to be used before everything freezes for Christmas mid-December?

RobH added a comment. Nov 19 2015, 4:31 PM

I have the first vendor's initial quotes in (they require a single correction, being submitted today). With the review of the first vendor done, I now have the base hardware to submit to the second vendor (today).

RobH added a subtask: Unknown Object (Task). Nov 19 2015, 4:40 PM
RobH added a subtask: Unknown Object (Task). Nov 24 2015, 7:13 PM
RobH mentioned this in Unknown Object (Task).
RobH changed the status of subtask Unknown Object (Task) from Open to Stalled. Nov 24 2015, 7:15 PM
RobH closed subtask Unknown Object (Task) as Resolved. Nov 30 2015, 6:29 PM
RobH edited subtasks, added: Unknown Object (Task); removed: Unknown Object (Task). Dec 3 2015, 5:47 PM
RobH added a comment. Dec 3 2015, 5:55 PM

The order was placed for the hardware to fulfill this request, but the estimated delivery date is 2015-12-16. Previous IRC discussion with @GWicke suggests we need to get these in place before the code freeze starts back up at the end of the week they arrive. They arrive on a Wednesday, and we'd have until Friday to get them online.

This may not be enough time, in particular if the shipment is delayed. I'll be following up with Dell to attempt to expedite their delivery, but I wanted to share the info I have at this moment.

Cmjohnson mentioned this in Unknown Object (Task). Dec 15 2015, 4:56 PM
RobH mentioned this in Unknown Object (Task). Dec 15 2015, 6:52 PM
RobH closed this task as Resolved. Dec 16 2015, 9:29 PM

Both tasks for the deployment of these systems are at the service implementation stage and have been assigned to @aaron, as he was the initial requestor.

As such, this request is resolved.

Yeehaw, thank you!

RobH closed subtask Unknown Object (Task) as Resolved. Jul 11 2016, 5:29 PM