
Kafka Client for MediaWiki
Closed, ResolvedPublic

Description

Step one will be to find or write a usable PHP Kafka client.

I hacked on a possible PHP Kafka client at the Wikimania Hackathon. It doesn't look easy! I made a short attempt at HHVMizing https://github.com/EVODelavega/phpkafka and failed. The other option is the native PHP https://github.com/nmred/kafka-php; however, it has a dependency on the PHP C Zookeeper extension.

We don't need Zookeeper to produce messages to Kafka. kafka-php uses it in the produce logic because it was written for a slightly older version of Kafka. It should be possible to fork kafka-php and edit out the Zookeeper dependencies if we plan on using it only for producing messages.

Event Timeline

Ottomata claimed this task.
Ottomata raised the priority of this task from to Low.
Ottomata updated the task description. (Show Details)
Ottomata subscribed.
Ottomata renamed this task from Kafka Client for Mediawiki to {stag} Kafka Client for Mediawiki. Jul 19 2015, 6:07 AM
Ottomata renamed this task from {stag} Kafka Client for Mediawiki to Kafka Client for Mediawiki.
Ottomata set Security to None.

I took a quick look over the kafka-php codebase, along with the patch that removed zk from the producer's dependencies in kafka[1]. As far as I can tell, each partition of a topic within Kafka has a single leader, and all write requests for that partition go to that leader? If that's true, and we are OK hardcoding the (topic+partition -> brokerid) and (brokerid -> host:port) maps into the MediaWiki config, I think I could manage that relatively easily by extending a couple of classes in the kafka-php repo. It wouldn't be particularly beautiful; there would be a variety of functions and methods that don't work if used, but the producer pathway would be fine. It also looks like the information about brokers and partition mapping is available from Kafka directly, but that would be a much more involved process.
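For illustration, the hardcoded maps in MediaWiki config might look something like the sketch below. The variable names, topic name, and array shape are invented for this example, not an existing convention; the broker hosts are the ones from the metadata dump further down.

<?php
// Hypothetical MediaWiki config sketch: hardcoded Kafka metadata.
// Variable names, topic name, and layout are illustrative only.

// topic => [ partition => broker id ]
$wgKafkaPartitionLeaders = [
	'mediawiki.events' => [
		0 => 18,
		1 => 12,
	],
];

// broker id => host:port
$wgKafkaBrokers = [
	12 => 'analytics1012.eqiad.wmnet:9092',
	18 => 'analytics1018.eqiad.wmnet:9092',
	21 => 'analytics1021.eqiad.wmnet:9092',
];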

Does that seem reasonable? I don't really know enough about Kafka to say. If I were to start hacking on this, what's the best way to get a local Kafka install up and running to test with? I took a quick look in mw-vagrant but didn't see anything interesting.

[1] https://issues.apache.org/jira/browse/KAFKA-369

each partition of a topic within Kafka has a single leader, and all write requests for that partition go to that leader

This is true, but it can change at any time. The reason for using Zookeeper, or more recently Kafka directly, is that clients can subscribe to leader changes for a topic-partition and start producing to the new broker when the leader changes. I don't think we can hardcode this information.

It also looks like the information about brokers and partition mapping is available from Kafka directly, but that would be a much more involved process.

Yes, I started to hack around with this, to see if I could get kafka-php to use the broker metadata instead of Zookeeper, and I don't think it is that hard. What would be more difficult is implementing a change subscriber based on Kafka metadata. However, since the intention here is to use kafka-php to produce a small number of messages resulting from a single HTTP request to MediaWiki, we don't need to maintain long-lived metadata state. I think it should be sufficient for each PHP produce call to ask Kafka for the metadata, or perhaps to cache the metadata from the first produce call, and not worry about metadata (leadership) changes.
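A rough sketch of the "cache it for the first produce call" idea; KafkaMetadataCache and fetchMetadataFromKafka() are hypothetical names standing in for whatever kafka-php ends up exposing:

<?php
// Sketch: fetch broker/partition metadata at most once per PHP request,
// so several produce calls in the same web request share one lookup.
// fetchMetadataFromKafka() is a hypothetical stand-in for the real call.
class KafkaMetadataCache {
	/** @var array|null Metadata cached for the lifetime of this request. */
	private static $metadata = null;

	public static function get( $bootstrapBroker ) {
		if ( self::$metadata === null ) {
			self::$metadata = fetchMetadataFromKafka( $bootstrapBroker );
		}
		return self::$metadata;
	}
}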

If we do this, we might want to just fork kafka-php, strip out everything but produce capabilities, and heavily document the fact that it should be used for short-lived processes that only produce a few messages.

Hang on, that's not quite right...

Hm, never mind, I think you are right; I was misremembering what I worked on at the Hackathon. To do this properly we'd have to implement broker metadata support in the PHP client, which would mean we'd have to understand the Kafka protocol pretty well :/

I was also thinking about stripping everything but the producer out of kafka-php; it would give us something we can deploy without having random broken code sitting around waiting to confuse someone.

I just looked a bit closer at the Kafka protocol[1] and compared it to what's in kafka-php. It looks like the metadata request/response handling might already be written in the \Kafka\Protocol\Encoder::metadataRequest and \Kafka\Protocol\Decoder::metadataResponse methods. These aren't called by anything, but if they already encode/decode as advertised, just issuing the requests can't be too hard.

[1] https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-Partitioningandbootstrapping
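If those methods do encode/decode as advertised, issuing the request might look roughly like this. The \Kafka\Socket class name, constructor arguments, and exact method signatures are assumptions about the kafka-php internals; check the repo for the real ones.

<?php
// Sketch of a metadata request using kafka-php's low-level protocol
// classes. Socket class name and all signatures are assumptions based
// on the library layout; verify against the kafka-php source.
$socket = new \Kafka\Socket( 'analytics1012.eqiad.wmnet', 9092 );
$socket->connect();

$encoder = new \Kafka\Protocol\Encoder( $socket );
$encoder->metadataRequest( [ 'test1' ] );    // ask for one topic's metadata

$decoder = new \Kafka\Protocol\Decoder( $socket );
$metadata = $decoder->metadataResponse();    // brokers + partition leaders

var_dump( $metadata );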

Not sure how I missed it before, but in the examples/ directory there is a script called MetaData.php. Running that in prod while pointed at analytics1012.eqiad.wmnet results in:

array(2) {
  ["brokers"]=>
  array(3) {
    [18]=>
    array(2) {
      ["host"]=>
      string(25) "analytics1018.eqiad.wmnet"
      ["port"]=>
      int(9092)
    }
    [12]=>
    array(2) {
      ["host"]=>
      string(25) "analytics1012.eqiad.wmnet"
      ["port"]=>
      int(9092)
    }
    [21]=>
    array(2) {
      ["host"]=>
      string(25) "analytics1021.eqiad.wmnet"
      ["port"]=>
      int(9092)
    }
  }
  ["topics"]=>
  array(1) {
    ["test1"]=>
    array(2) {
      ["errCode"]=>
      int(0)
      ["partitions"]=>
      array(1) {
        [0]=>
        array(4) {
          ["errCode"]=>
          int(0)
          ["leader"]=>
          int(18)
          ["replicas"]=>
          array(3) {
            [0]=>
            int(18)
            [1]=>
            int(12)
            [2]=>
            int(21)
          }
          ["isr"]=>
          array(2) {
            [0]=>
            int(12)
            [1]=>
            int(18)
          }
        }
      }
    }
  }
}

So we might just be in business. In terms of making these metadata requests: even if we have to make the metadata query on every web request, as long as we push it behind MediaWiki's DeferredUpdates system it will have no effect on end-user latency (at least in WMF prod).
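For instance, something along these lines; produceToKafka() here is a hypothetical wrapper around the metadata lookup and produce request, not an existing function:

<?php
// Sketch: keep the Kafka work off the critical path. DeferredUpdates
// runs the callback at the end of the request (post-send in WMF prod),
// so the metadata query adds nothing to end-user latency.
// produceToKafka() is a hypothetical wrapper.
DeferredUpdates::addCallableUpdate( function () use ( $topic, $message ) {
	produceToKafka( $topic, $message );
} );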

AH Cool! That is what I was looking for! Awesome.

["topics"]=>
array(1) {
  ["test1"]=>

I hope you saw more than just “test1” in prod :)

Yes, then I think that will do it. Each MW produce call will (see the sketch after this list):

  • get metadata
  • randomly choose a partition for desired topic
  • look up the hostname and port of the leader for that topic-partition in the metadata
  • produce message to that topic-partition
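
A hedged sketch of that flow, assuming the metadata array has the same shape as the MetaData.php dump above; fetchMetadataFromKafka() and produceToBroker() are hypothetical stand-ins for the real kafka-php calls:

<?php
// Sketch of one MediaWiki produce call, following the steps above.
// fetchMetadataFromKafka() and produceToBroker() are placeholders.
$topic = 'test1';
$message = 'example payload';

// 1. Get metadata (cached per request, as discussed earlier).
$metadata = fetchMetadataFromKafka( 'analytics1012.eqiad.wmnet:9092' );

// 2. Randomly choose a partition for the desired topic.
$partitions = $metadata['topics'][$topic]['partitions'];
$partitionId = array_rand( $partitions );

// 3. Look up the host and port of that partition's leader.
$leaderId = $partitions[$partitionId]['leader'];
$leader = $metadata['brokers'][$leaderId];

// 4. Produce the message to that topic-partition on the leader.
produceToBroker( $leader['host'], $leader['port'], $topic, $partitionId, $message );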

Change 229172 had a related patch set uploaded (by EBernhardson):
[WIP] Produce monolog messages through kafka avro

https://gerrit.wikimedia.org/r/229172

Change 229172 had a related patch set uploaded (by EBernhardson):
Produce monolog messages through kafka avro

https://gerrit.wikimedia.org/r/229172

In https://gerrit.wikimedia.org/r/229172, Eric wrote:

We might also want to think about Avro schema storage. For this first use case I'm OK with embedding the schema in the config files directly, but moving forward that's not going to be our most manageable solution. EventLogging already has all the right tools for storing schemas on-wiki; we might want to go down that road.

Indeed! @ori and I had talked about this a bit. We think it would be possible to write an Avro Schema ContentHandler (is that the right MW term?) that works like the existing one for jsonschema. If we do this, we think it would be better to leverage Confluent's Schema Registry as the actual backend storage and to use MW as a GUI on top of it. This is mainly because the Schema Registry handles schema validation and evolution; if we wanted MW to do that, we'd have to implement a whole lot ourselves, I think.
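For reference, registering a schema with the Schema Registry is a single REST call, roughly like the sketch below. The registry URL, subject name, and example schema are made up, and real code would need error handling.

<?php
// Sketch: register an Avro schema with Confluent's Schema Registry
// over its REST API. Registry URL and subject name are examples only.
$registryUrl = 'http://schema-registry.example.org:8081';
$subject = 'mediawiki.events-value';

// The registry expects the Avro schema as a JSON-encoded string inside
// a JSON request body.
$avroSchema = [
	'type' => 'record',
	'name' => 'ExampleEvent',
	'fields' => [
		[ 'name' => 'message', 'type' => 'string' ],
	],
];
$body = json_encode( [ 'schema' => json_encode( $avroSchema ) ] );

$ch = curl_init( "$registryUrl/subjects/$subject/versions" );
curl_setopt_array( $ch, [
	CURLOPT_POST => true,
	CURLOPT_POSTFIELDS => $body,
	CURLOPT_HTTPHEADER => [ 'Content-Type: application/vnd.schemaregistry.v1+json' ],
	CURLOPT_RETURNTRANSFER => true,
] );
$response = curl_exec( $ch );
curl_close( $ch );

// On success the registry returns the global schema id, e.g. {"id": 1}.
var_dump( json_decode( $response, true ) );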

But I have a question! I didn't realize y'all were thinking of using Avro for this! I think this is a good idea, and it will make many things easier in the analytics world, but it will also make some things harder. For example, it won't be as trivial to just consume data from Kafka using simple tools like kafkacat or kafkatee. You'll also be introducing an Avro dependency for all downstream consumers of this data. In Hadoop this will be fine, as Avro is very well supported there. But if, say, @Ironholds wants to just collect some data and run some shell commands and R stuff on it, he will have a harder time than if you were just using JSON.

I think Avro is the right thing to do, but I'm not prepared to force it on anyone (yet :) ). I'm excited that y'all want to use it, as it will help us get more real experience with it, but I just want to be sure that your decision to use it was made with these potential difficulties in mind.


Legoktm renamed this task from Kafka Client for Mediawiki to Kafka Client for MediaWiki. Aug 5 2015, 6:41 PM

Thankfully I don't :). If I want to consume it I plan to use Hive queries. I am going to start writing an Avro client in R, but that's a spare-time thing. Will this do anything to, say, the consumability of data through Spark?

Will this do anything to, say, the consumability of data through Spark?

Yes and no. It'll actually make things a little easier, especially if you are using Scala or Java: you'll be able to use the Avro schema to get a first-class object, which gives you named and typed fields.

Change 229172 merged by jenkins-bot:
Produce monolog messages through kafka avro

https://gerrit.wikimedia.org/r/229172

EBernhardson claimed this task.