eqiad: 1 misc node for the Kerberos KDC service
Closed, Resolved, Public

Description

Please note there are two requests currently open to add a single cpu misc host for Kerberos KDC and Kadmin daemons, one in codfw T227425 and one in eqiad T227288.

The original #hw-requests are for codfw T227425 & eqiad T227288. They are identical in all respects, merely listing one per site.

Site/Location: eqiad
Number of systems: 1
Service: kerberos
Networking Requirements: internal IP, no specific network subnet (it will not need to be in the analytics vlan).
Spec: current one for misc nodes

This is one of the two nodes that will be hosting the Kerberos KDC and Kadmin daemons (one host will act as master and the other one as standby, one in eqiad and the other one in codfw). We don't need any special requirements, one node following the current misc specs will be more than enough. It would be really great, if possible, to purchase and rack eqiad and codfw nodes in Q1.

Event Timeline

elukey created this task. Jul 4 2019, 3:36 PM
elukey updated the task description.

Should these really be both in eqiad? The initial use case is for analytics, but we might very well come up with a use case outside of analytics going forward. If we have one in eqiad and one in codfw we gain DC redundancy and not only row redundancy.

elukey added a comment. Jul 5 2019, 6:27 AM

Should these really be both in eqiad? The initial use case is for analytics, but we might very well come up with a use case outside of analytics going forward. If we have one in eqiad and one in codfw we gain DC redundancy and not only row redundancy.

This is a very good point. Would we have only one KDC per datacenter? Because another solution, given the fact that misc nodes are relatively cheap, could be to have a couple of KDC for each data center with only one acting also as kadmin "globally". This would allow us to be resilient to one host down in each DC, and we'd also have better latencies per-dc. Would it make sense? Basically start with two in eqiad, see how it goes, and if more use cases pop up, then add two more in codfw.

The alternative could be to have one physical misc host for each data center, and possibly a "standby" KDC in Ganeti as well. Or just one Kerberos host per DC, but I'd be worried about cross-DC latencies for hosts that need to authenticate to a KDC outside the local DC (not sure if it makes sense or not).

elukey moved this task from Backlog to Kerberos on the User-Elukey board. Jul 5 2019, 7:00 AM

This is a very good point. Would we have only one KDC per datacenter?

I think having one KDC per data centre is fine. Even if we've "failed over" to codfw (let's assume the row hosting the eqiad KDC failed but Hadoop is not affected), the extra round trips seem entirely negligible: they only apply to the TGT/TGS exchanges, and once the client has a service ticket, the KDC is entirely out of the loop.

Before deploying/purchasing the baremetal hosts for production we could experiment with a KDC in Ganeti/eqiad and Ganeti/codfw for some actual numbers, but I doubt it's really measurable.
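
The latency argument above can be put into rough numbers. The following is a minimal Python sketch, not a measurement: it assumes one AS exchange to obtain the TGT plus one TGS exchange per service, and it ignores retries and any extra pre-authentication round trip. The names `KdcScenario`, `kdc_exchanges`, and `extra_latency_ms`, and the round-trip times in the example, are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class KdcScenario:
    """Hypothetical inputs; none of these numbers come from measurements."""
    rtt_local_ms: float   # assumed round-trip time to a KDC in the same DC
    rtt_remote_ms: float  # assumed round-trip time to a KDC in the other DC
    services: int         # distinct services the client needs tickets for


def kdc_exchanges(services: int) -> int:
    """One AS exchange for the TGT plus one TGS exchange per service."""
    return 1 + services


def extra_latency_ms(s: KdcScenario) -> float:
    """Added time per ticket lifetime when every exchange goes to the remote KDC."""
    return kdc_exchanges(s.services) * (s.rtt_remote_ms - s.rtt_local_ms)


if __name__ == "__main__":
    # Placeholder RTTs; replace them with measured values before drawing conclusions.
    scenario = KdcScenario(rtt_local_ms=0.5, rtt_remote_ms=35.0, services=3)
    print(f"{kdc_exchanges(scenario.services)} KDC exchanges, "
          f"{extra_latency_ms(scenario):.1f} ms extra per ticket lifetime")
```

With these placeholder values the client pays for four KDC exchanges per ticket lifetime, and nothing further once it holds its service tickets. Actual figures would have to come from the Ganeti experiment suggested above.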

Because another solution, given the fact that misc nodes are relatively cheap, could be to have a couple of KDC for each data center with only one acting also as kadmin "globally".

We can only have one kadmin server per realm. Do you mean having a designated kadmin server in addition to the KDCs? That doesn't buy us anything aside from more complexity.

The alternative could be to have one physical misc host for each data center, and possibly a "standby" KDC in Ganeti as well.

I'm not fond of the idea of running a KDC in Ganeti; think of cross-VM information attacks/leaks. We've seen plenty of those already and I'm certain those were not the last :-)

MoritzMuehlenhoff triaged this task as Normal priority. Jul 5 2019, 12:02 PM

Makes sense, the extra latency to codfw shouldn't be a big deal. I know that we can have only one kadmin server, but I was thinking about moving it between the two eqiad nodes if needed (say, if the master host goes down), and possibly also to codfw. But indeed, that would probably increase the complexity of keeping things in sync among the hosts.

If you think that we'll have a future use case for codfw, I am +1 to buy one misc node in eqiad and one in codfw.

+1 also for the ganeti comment, it is indeed not wise :)

elukey added a subscriber: Ottomata. Jul 8 2019, 7:06 AM

Adding @Ottomata for a quick check about the next steps, but it sounds to me like having one Kerberos host per DC is the most flexible solution. If Andrew agrees I'll create another task for codfw :)

If you think that we'll have a future use case for codfw, I am +1 to buy one misc node in eqiad and one in codfw.

There's no specific, written-down use case yet, but there were considerations about e.g. using Kerberos for running Cumin as non-root and I'm pretty sure once the infrastructure is in place, we'll come up with more. And given that the overhead of a non-DC-local KDC seems negligible, I'm in favour of having one in eqiad and one in codfw.

elukey renamed this task from eqiad: 2 misc nodes for the Kerberos KDC service to eqiad: 1 misc node for the Kerberos KDC service. Jul 8 2019, 7:49 AM
elukey updated the task description.
elukey added a comment. Jul 8 2019, 7:52 AM

Amended this task and created T227425 :)

+1 for 1 eqiad and 1 codfw

Milimetric moved this task from Incoming to Radar on the Analytics board. Jul 8 2019, 3:58 PM
elukey added subscribers: wiki_willy, RobH. Edited Jul 16 2019, 2:38 PM

@wiki_willy @RobH hi! I don't mean to jump the queue, but I am wondering if this task and its codfw counterpart could be prioritized over the next few weeks. It would help the Analytics goals a ton, but if that's not possible I'll wait! :)

@elukey @RobH - I've marked it as accelerate on the procurement doc. Rob, can you work on getting these two servers included in this procurement cycle? Much appreciated.

Thanks,
Willy

RobH moved this task from Backlog to Stalled on the hardware-requests board. Jul 16 2019, 5:06 PM
RobH assigned this task to elukey. Jul 16 2019, 6:11 PM
RobH updated the task description.
RobH moved this task from Stalled to In Discussion / Review on the hardware-requests board.

Please note there are two requests currently open to add a single cpu misc host for Kerberos KDC and Kadmin daemons, one in codfw T227425 and one in eqiad T227288.

The original #hw-requests are for codfw T227425 & eqiad T227288. They are identical in all respects, merely listing one per site.

Unfortunately, our spare pool systems in both sites are NOT identical. We have not had to order spare pool systems in codfw as recently as in eqiad, so their specifications do not exactly match. We need the sign-off from @elukey that these being slightly different won't matter. @RobH doesn't think they will, since it's just asking for a minimum specification host in each site, but wants to ensure this is ok with Analytics.

A list of every spare pool system is viewable via this netbox url: https://netbox.wikimedia.org/dcim/devices/?q=&role=server&status=5&mac_address=&has_primary_ip=&console_ports=&console_server_ports=&power_ports=&power_outlets=&interfaces=&pass_through_ports=&cf_owner=&cf_purchase_date=&cf_support_contract=&cf_support_until=&cf_ticket=

CODFW Host

We do not have any single cpu systems available within warranty for this use in codfw. However, it may be cheaper to allocate an existing dual CPU system (which has its warranty countdown clock already started) versus purchasing a new single CPU system. As such, I'll list the info for the dual CPU system, knowing it may be overkill but has already been purchased and is in the rack.

  • PowerEdge R430 - WMF6577 (1 of 3 of these hosts available) - purchased on T166265
    • Dual Intel Xeon E5-2623 v4, 2.6GHz, 4 cores each
    • 64GB RAM
    • 1GbE NIC
    • (4) 4TB LFF SATA disks (software raid only)

EQIAD host

  • Dell PowerEdge R440 1U system - WMF5173 (1 of 3 of these hosts available) - purchased on T216269
    • Single Intel Xeon Silver 4110, 2.1GHz, 8C/16T (the same number of cores as the older dual-CPU system in codfw recommended above)
    • 32GB RAM
    • 1GbE NIC
    • (2) 480GB SSD (software raid only)

Before I create the private S4 procurement tasks for management approval of these allocations (which will include pricing), I wanted to check whether the hardware selection above meets the approval of @elukey / @MoritzMuehlenhoff / @Ottomata.

I'm assigning this to @elukey, since they created the requests, for feedback on the above. Please comment and assign back to me for followup.

Thanks!

Also followed up on the codfw task, but adding here for completeness as well: This looks good to me!

elukey reassigned this task from elukey to RobH. Jul 17 2019, 8:05 AM
elukey added a comment. Aug 7 2019, 9:00 AM

Looks good to me (followed up only on the codfw task). Can we get them repurposed?

elukey mentioned this in Unknown Object (Task). Sep 9 2019, 6:31 AM
RobH reassigned this task from RobH to faidon. Sep 11 2019, 3:04 PM
RobH moved this task from In Discussion / Review to Pending Approval on the hardware-requests board.
RobH added a subscriber: faidon.

@faidon,

Please note that T227425 & T227288 are for spare pool allocations for Kerberos in both codfw and eqiad. As such, I need approvals for allocating a single spare pool system in each location:

CODFW Host

We do not have any single cpu systems available within warranty for this use in codfw. However, it may be cheaper to allocate an existing dual CPU system (which has its warranty countdown clock already started) versus purchasing a new single CPU system. As such, I'll list the info for the dual CPU system, knowing it may be overkill but has already been purchased and is in the rack.

  • PowerEdge R430 - WMF6577 (1 of 3 of these hosts available) - purchased on T166265
    • Dual Intel Xeon E5-2623 v4, 2.6GHz, 4 cores each
    • 64GB RAM
    • 1GbE NIC
    • (4) 4TB LFF SATA disks (software raid only)

EQIAD host

  • Dell PowerEdge R440 1U system - WMF5173 (1 of 3 of these hosts available) - purchased on T216269
    • Single Intel Xeon Silver 4110, 2.1GHz, 8C/16T (the same number of cores as the older dual-CPU system in codfw recommended above)
    • 32GB RAM
    • 1GbE NIC
    • (2) 480GB SSD (software raid only)

Please comment with approval (or questions) and assign back to me for followup.

faidon reassigned this task from faidon to RobH. Sep 13 2019, 12:03 PM

Approved.

It sounds like our spare pools are being drained, so if that's the case feel free to open a task to replenish them.

@RobH let me know if I can help with the host repurpose (also with the codfw one), I can take care of the DNS/puppet/DHCP/etc. steps :)

RobH closed this task as Resolved. Sep 17 2019, 6:40 PM

T233141 created for setup. Resolving this request task!