Determine any impacts to SRE from OIT's planned move to JumpCloud for LDAP
Open, Medium, Public

Description

OIT is evaluating moving LDAP functionality to a cloud provider named JumpCloud. If that happens, any impacts to SRE-maintained production need to be addressed as part of the plan.

OIT's planning document for this project is at: https://docs.google.com/document/d/1Gj3E0NEepTJHaERGTWk6SoOTkbu5hFsZMwQ6jBp9IBk/edit?usp=sharing

Details

Related Gerrit Patches:

Event Timeline

Dsharpe created this task. Feb 10 2020, 9:50 PM
Restricted Application added a subscriber: Aklapper. Feb 10 2020, 9:50 PM

The LDAP replicas are critical to the wikimedia.org mail servers: We currently have a replication setup between two OpenLDAP servers in the production realm (ldap-corp1001.wikimedia.org in Virginia and ldap-corp2001.wikimedia.org in Texas) and ldap1.corp.wikimedia.org in the OIT network. The mail servers then query the ldap-corp* systems in production to determine whether a given @wikimedia.org address is legitimate or not.

So if ldap1 is migrated to JumpCloud, then we need an endpoint somewhere on JumpCloud which can provide LDAP replication using the syncrepl LDAP replication method (https://openldap.org/doc/admin24/replication.html).
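
A minimal sketch of the kind of lookup described above, where a mail server asks the local replica whether an address exists. It only builds the search filter string, escaping the characters that RFC 4515 reserves. The attribute name `mail` and the function names are assumptions for illustration; the actual schema and mail server configuration are not shown in this task.

```python
# Sketch only: the "mail" attribute and the function names are assumptions,
# not taken from the production configuration.
_ESCAPES = {
    "\\": r"\5c",
    "*": r"\2a",
    "(": r"\28",
    ")": r"\29",
    "\x00": r"\00",
}


def escape_filter_value(value: str) -> str:
    """Escape the characters RFC 4515 reserves inside a filter value."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def address_filter(address: str, attribute: str = "mail") -> str:
    """Build an equality filter matching entries that carry this address."""
    return f"({attribute}={escape_filter_value(address.strip())})"
```

An address is treated as legitimate when a search with this filter returns at least one entry. Escaping matters because the address comes from untrusted SMTP input.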

jbond triaged this task as Medium priority. Feb 11 2020, 11:51 AM
jbond added a project: User-jbond.
jbond added a subscriber: jbond.

So if ldap1 is migrated to JumpCloud, then we need an endpoint somewhere on JumpCloud which can provide LDAP replication using the syncrepl LDAP replication method (https://openldap.org/doc/admin24/replication.html).

@HMarcus can you check on this use case for jumpcloud?

chasemp moved this task from Incoming to Waiting on the Security-Team board. Feb 11 2020, 2:22 PM

@MoritzMuehlenhoff I will reach out to our support engineer and see if this can be supported.

Can you elaborate further on why this replication is needed/necessary? I understand it was in place for legacy reasons before the move to Google Enterprise, but I'm curious what need it fills at this point. For instance, when we have had (multiple) LDAP outages at the office, it did not seem to impact mailflow from the SRE side (nobody yelled fire, at least), so it would be great if you could give me a rundown on why this is needed.

For exactly the reason you mentioned: Having a local replica of the LDAP directory within the production environment (which properly pages in case of errors) gives us resilience, so that the mail servers are unaffected by unavailability of the upstream OIT LDAP directory.

Is there a reason why the SRE mx servers need to be the authority for email address validity? Couldn't Google's servers accomplish the same thing? Apologies if this seems like redundant information, but this sounds like a pretty significant blocker in this project if Jumpcloud can't replicate as needed.

Also, I'm not sure if this is related, but the mail team is already offloading aliases from the mx servers to Google per T122144. What would be the impact if we didn't have our mail route through WMF's mx servers?

Dzahn added a subscriber: Dzahn. Feb 12 2020, 6:02 PM

the mail team is already offloading aliases from the mx servers to Google per T122144

That ticket is waiting for updates from your team about acquiring "mail only" licenses from Google to be able to continue that task (for aliases for former staff). So I assume it would at a minimum mean higher costs from the OIT budget.

Also that's limited to personal aliases and there are a lot of other mail aliases.

What would be the impact if we didn't have our mail route through WMF's mx servers?

A lot of things would break. SRE would not get any root/postmaster/hostmaster mail anymore. Security would stop getting abuse mail, netops would stop receiving peering mails, analytics would stop receiving alerts, DMARC reports would stop coming in, the wikimania and fundraising queues in OTRS (where volunteers normally handle them) would stop receiving mails, fundraising's silverpop API setup would stop working, the Wikimedia store would stop getting orders/returns, techcom would stop receiving mails, etc.

Thanks for clarifying that Daniel, and I was actually going to follow up on that ticket with you this week.

That's very helpful, and I appreciate you sharing all of the workflows that this is tied to. I was not aware of any of this. Is there any kind of documentation (maybe a diagram of sorts) that outlines the mail flow? I would love to have some kind of insight into how it all works, especially as it seems more of our systems are tied in with yours than I previously thought.

HMarcus added a subscriber: Wikimedia. Edited Feb 12 2020, 8:13 PM

@MoritzMuehlenhoff I received an answer from Jumpcloud's support engineers:

"It is important to note that we support bind, search, and compare over LDAP, so as long as all this operation is doing is determining if an address is valid or not, we are good. Complications could arise when considering whether or not passwords need to be replicated. Per the description of the operation they are performing, there are potentially ways to automate this outside of LDAP protocol replication, although from their perspective the easiest. From my reading of the operations that they are performing, they are not using the other directory, in this context, for authentication but for internal verification for email addresses. Per their statement, “The mail servers then query the ldap-corp* systems in production to determine whether a given @wikimedia.org address is legitimate or not.” does not imply they are authenticating with the User’s password for the “query” but simply querying that the email address is valid."

Are the replications just validating @Wikimedia addresses, or is there an authentication factor bound into this as well? If it would be easier for us to arrange a call with their support engineers, let me know and I'd be happy to facilitate.

There are two different angles to consider:

  • From Jumpcloud we need the replication (as in a full copy) of the LDAP directory. This is achieved using a standard LDAP protocol called syncrepl (enwiki has a great description of it: https://en.wikipedia.org/wiki/OpenLDAP#Replication). This replication is agnostic of the directory's content; password hashes are replicated like any other bit of directory data.
  • The validation whether a user is legit or not then entirely happens on the local, replicated directory data within the production networks.

So what we need to know from Jumpcloud is whether/how they can provide an endpoint from their solution which supports the "syncrepl" protocol.

@MoritzMuehlenhoff I believe it would be best if we engaged in a call with Jumpcloud's LDAP engineers to find out exactly what can be replicated and how it would be replicated. They got back to me with a few days and times. I believe you are based in Germany, correct me if I'm wrong.

Tuesday (2/18) - 7am PST, 4pm CET
Thursday (2/20) - 7am PST, 4pm CET

Let me know if either of these timeframes work for you.

@HMarcus Sure, we can do that. Let's do Thursday (2/20) - 7am PST, 4pm CET

@HMarcus would you mind looping me in on that meeting?

@MoritzMuehlenhoff @chasemp Thank you for joining the technical call today and helping to push this further along.

Moritz, it sounded like the verification needed to take place can happen from a query perspective, but any user changes (user is terminated, etc) would have to be manually done via LDIF exports (can be automated, although something to consider).

As far as next steps go, you mentioned that this will be a judgment call at this point and you will need to loop in your manager. Please add them to this ticket so we can try to address any concerns that may arise, or feel free to schedule a follow up meeting with myself and @eliza if that would be easier.
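
A rough sketch of how the LDIF-export route mentioned above could be automated on the consuming side: read an export and collect the mail addresses it contains. It handles folded lines and base64-encoded values (`attr:: value`), but it is not a full LDIF parser. The attribute name `mail` and the function names are assumptions; the actual export format Jumpcloud would produce is not described in this task.

```python
import base64


def unfold(ldif_text):
    """Join LDIF continuation lines: a line starting with one space
    continues the previous line."""
    lines = []
    for raw in ldif_text.splitlines():
        if raw.startswith(" ") and lines:
            lines[-1] += raw[1:]
        else:
            lines.append(raw)
    return lines


def mail_addresses(ldif_text, attribute="mail"):
    """Return the lower-cased values of `attribute` found in an LDIF export.

    The attribute name is an assumption about the directory schema.
    """
    wanted = attribute.lower()
    found = set()
    for line in unfold(ldif_text):
        if line.startswith("#") or ":" not in line:
            continue  # comment, blank line, or something we do not understand
        name, _, rest = line.partition(":")
        if name.strip().lower() != wanted:
            continue
        if rest.startswith("<"):
            continue  # "attr:< url" references are out of scope for this sketch
        if rest.startswith(":"):
            # "attr:: <base64>" carries a base64-encoded value
            value = base64.b64decode(rest[1:].strip()).decode("utf-8")
        else:
            value = rest.strip()
        if value:
            found.add(value.lower())
    return found
```

The resulting set could then be compared against the entries already present in production to work out which users were added or terminated since the last export.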

My takeaways

Jumpcloud:

  • does not offer syncrepl access
  • is RFC 2307 compliant
  • does offer LDAP bind and search
  • has a status page and ability to alert on status changes at https://status.jumpcloud.com/

WMF:

  • Uses the mirrored syncrepl entry in the production LDAP for user validation only
    • This is really our core mail domains, and I think relates to the 400-ish staff explicitly
  • Uses some schemes for representing users in the LDAP tree who have left/are gone but whose entry is still present (moving OU, etc.)

It seems as if we want to follow one of these paths. Most of them keep entries in prod LDAP for WMF users so that email addresses can still be validated:

  • Deploy a method of dual entries on creation for new users and dual removal for offboarded users. (Dual being jumpcloud and bare bones entry in prod LDAP)
  • Automated sync (requires us to write something). @MoritzMuehlenhoff and I agree we probably do not really have to have real time sync here.
    • Jumpcloud has an API
    • Jumpcloud allows LDAP bind and search
  • Accept the risk of Jumpcloud as a service provider being unavailable. Which leads to the questions I'll list below that @MoritzMuehlenhoff has the best view of to my knowledge.
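
To illustrate the "automated sync" path, here is a minimal sketch of the reconciliation step: given the addresses known upstream and the addresses currently present locally, work out what to add and what to remove. How each list is obtained (API, LDAP search, LDIF export) is left open, and all names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPlan:
    to_add: frozenset      # present upstream, missing locally
    to_remove: frozenset   # present locally, gone upstream


def _normalise(addresses):
    """Lower-case and strip addresses, dropping empty values."""
    return {a.strip().lower() for a in addresses if a and a.strip()}


def plan_sync(upstream, local):
    """Compare the upstream address list with the local one."""
    up = _normalise(upstream)
    loc = _normalise(local)
    return SyncPlan(to_add=frozenset(up - loc), to_remove=frozenset(loc - up))
```

Because real-time sync is probably not required, a plan like this could be computed on a schedule and reviewed or applied in batches.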

It's worth noting that there are risks to our setup now, and those risks need to be weighed for the organization against Jumpcloud.


Operational impact questions as they relate to Jumpcloud adoption @MoritzMuehlenhoff

What would be the operational outcome of Jumpcloud being down if we were to rely on it for real time lookup? Would emails to addresses we could not validate fail? What are those emails? What would the impact be on the LDAP cluster in prod? Would it stall trying to reach out or would it fail gracefully and continue to function in other respects?

I believe if we answer the above we would then pursue a standard risk assessment and treat this as a business risk where the owner would depend on the quantitative risk determined (could require C level sign off / manager / etc).

jbond moved this task from Unsorted 💣 to Watching 👀 on the User-jbond board. Feb 21 2020, 1:33 PM

As far as next steps go, you mentioned that this will be a judgment call at this point and you will need to loop in your manager. Please add them to this ticket so we can try to address any concerns that may arise, or feel free to schedule a follow up meeting with myself and @eliza if that would be easier.

Sounds good! We'll discuss this in our Infrastructure Foundations SRE meeting on Wednesday.

@HMarcus We talked about this in yesterday's Infrastructure Foundations SRE meeting; we would avoid querying the LDAP endpoint of Jumpcloud (for latency/reliability reasons), but given we only need to know whether a user/alias/group exists, we can instead fetch the user data from the Jumpcloud APIs regularly (or rather fetch that data directly from Google if there's an API we can use). We'd then stop maintaining a local LDAP replica.

So when you move forward to evaluate Jumpcloud let us know (with some heads-up/advance time!) so that we can make the changes on our end.
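
A sketch of what the "fetch regularly" approach could look like on the mail server side, under stated assumptions: the fetched payload is assumed to be JSON with a `results` list of objects carrying an `email` field, the local list is assumed to be a plain text file with one address per line, and the 50% threshold is an arbitrary example. The point is that a failed or suspiciously small fetch keeps the last known good list, so mail validation keeps working while the upstream service is unavailable.

```python
import json
import os
import tempfile


def addresses_from_payload(payload: str) -> set:
    """Extract addresses from a JSON payload (the shape is an assumption)."""
    data = json.loads(payload)
    return {
        user["email"].strip().lower()
        for user in data.get("results", [])
        if isinstance(user.get("email"), str) and user["email"].strip()
    }


def read_list(path: str) -> set:
    """Read the current address list; a missing file means an empty list."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def refresh(path: str, new_addresses: set, min_ratio: float = 0.5) -> bool:
    """Replace the list only if the new data looks sane.

    Returns True if the file was rewritten, False if the old list was kept.
    """
    old = read_list(path)
    if not new_addresses:
        return False  # empty result: treat as a failed fetch
    if old and len(new_addresses) < min_ratio * len(old):
        return False  # list shrank too much: keep the last known good data
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        for address in sorted(new_addresses):
            handle.write(address + "\n")
    os.replace(tmp, path)  # atomic swap, so readers never see a partial file
    return True
```

A rejected refresh would be a natural place to raise an alert, since the local replica's paging behaviour would otherwise be lost.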

@HMarcus @MoritzMuehlenhoff Can we all agree on 6 weeks notice to SRE before going live as a control here?

If so I think that closes this task :)

@MoritzMuehlenhoff @chasemp Yes, we will plan on providing notice six weeks before the product goes live (assuming it passes the other reviews).

Thank you all for helping push this forward in a timely manner, as well as providing our team a better understanding on the LDAP relationship that exists here. It's very much appreciated :)

Yes, we will plan on providing notice six weeks before the product goes live (assuming it passes the other reviews).

You need to include us earlier; when there's a pilot installation of Jumpcloud running (but the old setup is not switched off), loop us in so that we can test the API integration before this goes live.

Also, when you have a high-level project timeline ready, please let us know so that we can plan ahead, thanks.

HMarcus added a comment. Edited Feb 28 2020, 4:30 PM

We are already running a fully functional sandbox environment (free for up to 10 users) that we have been using for testing. I will set you up with an administrator account so you can access what you need. Here are the instructions for obtaining the API key: https://support.jumpcloud.com/support/s/article/jumpcloud-apis1

If you are having issues getting into the portal for any reason, feel free to email/message me directly.

The project implementation plan was attached at the beginning of the thread, but we are looking at a rough deployment timeframe of mid-late July.

@chasemp let me know if you or anyone on your team would also like an admin account provisioned to peruse the dashboard.

chasemp assigned this task to HMarcus. Mar 2 2020, 7:17 PM

Yes please, that works for me. I'm assigning this task to you @HMarcus and I believe this can be considered complete in scope once these credentials are granted to the PoC setup. SRE and Security Team can handle other related work in other tasks :)

@chasemp I've created an administrator account for you and you should've received a password reset email. Please sign in at console.jumpcloud.com when you get a chance (make sure you click "administrator login"), and confirm once you're in and have the access you need.

@HMarcus works as intended. Thanks.

Thanks @chasemp

@MoritzMuehlenhoff please confirm you can access the admin dashboard, and I will go ahead and close this task out.

ping @MoritzMuehlenhoff

@HMarcus I'm going to pick this up and take a look. Are you able to provide me with credentials? Thanks.

@jbond Thank you, I've created an admin account for you and sent you a password reset email. Please sign in at console.jumpcloud.com (click administrator login at the top left), and confirm when you're in and have the access you need.

Change 585501 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] profile::mail::jumpcloud: add new class to manage jumpcloud aliases

https://gerrit.wikimedia.org/r/585501