
Securely connect Wikimedia Enterprise Infrastructure with WMF Kafka Streams
Closed, Declined (Public)

Description

Opening a thread to be the center of discussion around finding a way to connect WMF Kafka securely to Wikimedia Enterprise's external cloud infrastructure.

Problem: Wikimedia Enterprise's data feeds are built on the HTTP EventStreams APIs, which are prone to connection time-outs and potential data loss. Although the team has built solutions to minimize the impact of this (see more in this library), it still represents a major risk to data transfer into the Wikimedia Enterprise systems.

Proposal: Create a strong bridge between the "non-PII containing" Kafka streams and Wikimedia Enterprise's infrastructure, so that it can connect directly to the WMF event platform.

Next Steps: Put together a technical solution that can be passed around to relevant teams to map out the scope of work and timeline - @Ottomata, we have previously discussed this and had some thoughts, I can work with you to document them on this ticket!

Risks to consider / mitigate:

  • Security Implications - Tagging WMF Security on this ticket to be a part of the design to ensure that we are not creating undue risk on the PII-containing streams
  • Reliability - As we scope out a technical solution, we need to determine who will be able to maintain the connection between the two infrastructures, and how much work that will be.

Event Timeline

RBrounley_WMF renamed this task from WMF Kafka Integration to Securely connect Wikimedia Enterprise Infrastructure with WMF Kafka Streams. Apr 26 2021, 3:02 PM
RBrounley_WMF updated the task description.
Ottomata added subscribers: Bstorm, nskaggs.

Here's how this could possibly work:

  • Someone (SRE? Platform Eng? Cloud Services?) provisions and maintains a new Kafka cluster (in each DC) that will only contain non-PII data and is intended to be usable by clients outside of WMF production. (If we wait long enough maybe the new ZooKeeper-less Kafka will be ready!)
  • Kafka MirrorMaker mirrors the topics we want to expose from the Kafka main clusters. For now, the list is probably the same as those exposed by EventStreams.
  • External Kafka broker endpoints are exposed to the web, with some kind of authentication. I'm not sure how to do this exactly, but perhaps:
    • some kind of firewalling like we use for the ElasticSearch cluster that is exposed to Cloud VPS?
    • Using Kafka authentication and ACLs, probably using TLS certs.
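To make the mirroring step concrete, here is a minimal sketch that renders a MirrorMaker 2 style properties file from an explicit topic allowlist. It assumes MirrorMaker 2 (the comment above does not say which MirrorMaker version would be used), and the cluster aliases, bootstrap addresses, and topic names are hypothetical placeholders, not the real WMF values.

```python
def mirror_maker_config(source, target, bootstrap, topics):
    """Render a MirrorMaker 2 properties file that mirrors only allowlisted topics.

    source/target are cluster aliases, bootstrap maps each alias to its
    bootstrap servers, and topics is the allowlist. All values passed in
    below are hypothetical.
    """
    if not topics:
        # An empty allowlist is almost certainly a mistake; fail loudly.
        raise ValueError("refusing to mirror with an empty allowlist")
    lines = [
        f"clusters = {source}, {target}",
        f"{source}.bootstrap.servers = {bootstrap[source]}",
        f"{target}.bootstrap.servers = {bootstrap[target]}",
        # One-way flow only: main -> external, never the reverse.
        f"{source}->{target}.enabled = true",
        # Entries are treated as topic names or regular expressions.
        f"{source}->{target}.topics = {', '.join(sorted(topics))}",
        f"{target}->{source}.enabled = false",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(mirror_maker_config(
        "main",
        "external",
        {"main": "kafka-main.example.org:9092",
         "external": "kafka-external.example.org:9093"},
        ["mediawiki.recentchange", "mediawiki.revision-create"],
    ))
```

Keeping the allowlist explicit (rather than a catch-all pattern) means a new topic on the main clusters is not exposed externally until someone adds it on purpose.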

This external Kafka cluster could be exposed for use by WME as well as Cloud VPS. People in Cloud VPS could then build more powerful tools that use Kafka clients directly rather than having to use HTTP SSE via EventStreams. Anyone in Cloud VPS could spin up whatever stream processing framework they like to do fancy stuff.

The tricky bit is that Kafka clients create some state on the Kafka brokers (consumer offset commits, etc.), so we need to make sure that only trusted users can connect to this Kafka cluster.
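As a rough illustration of what "only trusted users" could mean with TLS certs and ACLs, the toy model below applies a default-deny check keyed on the certificate-derived principal. It is not Kafka's authorizer, only a sketch of the idea; the principal, topic, and group names are hypothetical. In Kafka, committing consumer offsets requires read permission on the consumer group as well as on the topic, which is why the group appears as its own resource.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Acl:
    principal: str      # e.g. "User:CN=wme-consumer" (hypothetical)
    resource_type: str  # "topic" or "group"
    resource_name: str
    operation: str      # "read", "write", ...


def is_allowed(acls, principal, resource_type, resource_name, operation):
    """Default deny: permit only what an ACL explicitly grants."""
    return Acl(principal, resource_type, resource_name, operation) in acls


# Hypothetical grants: one trusted client may read one topic and
# commit offsets under one consumer group. Nobody may write.
ACLS = {
    Acl("User:CN=wme-consumer", "topic", "mediawiki.recentchange", "read"),
    Acl("User:CN=wme-consumer", "group", "wme-ingest", "read"),
}

if __name__ == "__main__":
    print(is_allowed(ACLS, "User:CN=wme-consumer", "topic", "mediawiki.recentchange", "read"))
    print(is_allowed(ACLS, "User:CN=unknown", "topic", "mediawiki.recentchange", "read"))
```

Restricting which group names a principal may use also bounds how much offset state any one client can create on the brokers.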

Also, we'd eventually like to expose only compacted topics from which we can remove things like suppressed revision create events. This would have to be done by the event producers or perhaps by a stream processing app, so it is slightly out of scope for this ticket, even if this work would benefit from it.
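For readers unfamiliar with compaction, the sketch below models the idea in plain Python: a compacted topic keeps only the latest record per key, and a record with a null value (a tombstone) eventually removes the key altogether. This is a simplified model, not broker behaviour in detail (real compaction runs in the background and retains tombstones for a while), and the keys and event fields are hypothetical.

```python
def compact(log):
    """Keep the latest record per key; drop keys whose latest value is None."""
    latest = {}
    for key, value in log:
        # Remove and re-insert so the result is ordered by last write.
        latest.pop(key, None)
        latest[key] = value
    return [(key, value) for key, value in latest.items() if value is not None]


# Hypothetical revision-create events keyed by revision.
log = [
    ("rev:1", {"rev_id": 1, "comment": "first edit"}),
    ("rev:2", {"rev_id": 2, "comment": "to be suppressed"}),
    ("rev:1", {"rev_id": 1, "comment": "updated"}),
    ("rev:2", None),  # tombstone written when the revision is suppressed
]

if __name__ == "__main__":
    for key, value in compact(log):
        print(key, value)
```

This is why something (the producer or a stream processing app) has to emit the tombstone with the same key as the original event: compaction can only remove what it can match by key.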

Lovely, thank you @Ottomata. @nskaggs, is this potentially interesting to you all? Tagging for line of sight.

A few follow-up questions:

  1. Are these steps needed together or different approaches?
  2. Does this imply the need to provision new hardware or is this something we could spin up on our existing infrastructure?
  3. How much total work does this seem to be?
  4. Any idea how much work this would be to maintain? I'm hopeful we can maintain this with our new Sr Software Eng req, but not sure if this is easy or hard to maintain. Tagging @hnowlan here.

Are these steps needed together or different approaches?

Needed together

Does this imply the need to provision new hardware or is this something we could spin up on our existing infrastructure?

Needs new hardware

How much total work does this seem to be?

Could be set up by one engineer in a quarter (if hardware was available).

Any idea how much work this would be to maintain?

Probably not too hard. Although the authentication bits are different from anything we've done before; if we get that wrong the internet trolls could break Kafka and then it'd be more work. :)

Actually, the networking bits are new too, so I'm not sure how hard that would be. We'd have to configure Kafka's advertised.listeners list differently than we've done before, and the routing from public interfaces would be different than we normally do for services. I know that Traffic usually doesn't (and doesn't like to) deal with public connections other than HTTP, so we'd have to talk with them about how to handle this (I expect resistance :) )
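To show roughly what a different advertised.listeners setup might look like, here is a sketch that renders broker properties with separate internal and external listeners. The property keys are standard Kafka broker settings, but the listener names, ports, hostnames, and the choice of PLAINTEXT for the internal listener are all assumptions made for illustration, not the actual WMF configuration.

```python
def listener_config(broker_host, public_host, internal_port=9092, external_port=9093):
    """Render broker properties with one internal and one external listener.

    broker_host is the name reachable inside production; public_host is the
    name external clients would be told to connect to. Both are hypothetical.
    """
    props = {
        # Sockets the broker binds to.
        "listeners": (
            f"INTERNAL://0.0.0.0:{internal_port},"
            f"EXTERNAL://0.0.0.0:{external_port}"
        ),
        # Addresses handed back to clients in metadata responses.
        "advertised.listeners": (
            f"INTERNAL://{broker_host}:{internal_port},"
            f"EXTERNAL://{public_host}:{external_port}"
        ),
        # Assumption: external traffic is TLS, internal is left as PLAINTEXT.
        "listener.security.protocol.map": "INTERNAL:PLAINTEXT,EXTERNAL:SSL",
        "inter.broker.listener.name": "INTERNAL",
        # Require client certificates on TLS listeners.
        "ssl.client.auth": "required",
    }
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


if __name__ == "__main__":
    print(listener_config("kafka-external1001.example.internal",
                          "kafka-external.example.org"))
```

The key point is that clients connect to whatever address the broker advertises for their listener, so the external name has to be routable from outside, which is where the routing question for Traffic comes in.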

Following. This could be an interesting reference implementation for other uses / problems. I would be mindful of the networking requirements here and loop in I/F as well.

Update here - we are currently onboarding DevOps-focused Sr Software Engineers, some of whom will pick this ticket up. Moving this to the backlog for now.

Aklapper added a subscriber: RBrounley_WMF.

Removing inactive task assignee. (Please do so as part of the team's offboarding steps - thanks.)

JArguello-WMF claimed this task.
Ottomata changed the task status from Resolved to Declined. Apr 18 2024, 11:34 AM

Hello! I don't think this task is resolved. Perhaps you meant to decline it?

Being bold and doing so, feel free to revert if I am wrong.

Could you add a comment documenting why you are closing the task? Thank you!