
Create job to deliver the eventlogging_CentralNoticeImpression topic
Closed, ResolvedPublic

Description

Can we get a new job on americium to pull the eventlogging_CentralNoticeImpression topic to files that we can deliver to the civi box? This is what will replace the kafkatee job that filters for /beacon/impression web hits, so the same delivery frequency and locations would be great.

This topic shouldn't require any filtering or reformatting though.

Event Timeline

Ejegg triaged this task as Normal priority.Mar 15 2018, 9:24 PM
Ejegg created this task.
DStrine moved this task from Triage to FR-Ops on the Fundraising-Backlog board.

This should be as simple as adding another input, but our current puppet module does not appear to support that. I see the upstream version has multiple inputs, but it looks like they are combined and would have to be separated (maybe that's fine). @Ottomata am I understanding that right?

Although I believe we're supposed to be replacing the current one entirely, so maybe a hard cutover is easier.

It would probably be best to just fire up a second kafkatee instance. The most recent kafkatee puppet module supports this, something like:

# Configure and run a kafkatee instance consuming from the topic eventlogging_CentralNoticeImpression
kafkatee::instance { 'CentralNoticeImpression':
    kafka_brokers             => ['kafka-jumbo1001.eqiad.wmnet:9092'], # ..., 'kafka-jumbo1002:9092' ...],
    kafka_offset_store_method => 'broker',
    kafka_group_id            => 'fundraising-00',
    inputs                    => [
        {
            'topic'      => 'eventlogging_CentralNoticeImpression',
            'partitions' => '0',
        }
    ]
}

# Set up some kafkatee outputs for this instance.
kafkatee::output { 'CentralNoticeImpression_log':
    # Instance name must match a declared kafkatee::instance.
    instance_name => 'CentralNoticeImpression',
    destination   => '/path/to/CentralNoticeImpression.json',
}

The default is to treat both input and output as plain strings, so if you don't need any filtering or reformatting, you don't need to specify any encoding=json stuff anywhere.
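
For reference, the kafkatee config that puppet renders for an instance like this would look roughly as follows (syntax as in the kafkatee README; exact details may vary, and a sample rate of 1 means every message is written):

# Consume the topic as plain strings; no JSON decoding or re-encoding.
input [encoding=string] kafka topic eventlogging_CentralNoticeImpression partition 0 from stored

# Write every message (sample rate 1) to the output file.
output file 1 /path/to/CentralNoticeImpression.json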

Ejegg added a comment.Mar 27 2018, 3:51 PM

Yeah, another instance would be great. We'd like to run both in parallel for a while to compare the numbers.

Just curious though: are you sure you want to just write this stuff to a log file? Are you just putting it there for a bit? The less hacky pipeline for y'all would be Kafka -> MySQL directly.

Ejegg added a comment.Mar 27 2018, 5:14 PM

@Ottomata oh right, that might be the way to go now. Our current process does a little bit of filtering via a Python script before dumping into MySQL, mostly dropping bots. Can the kafka->mysql loader do some filtering?

@cwdent can americium talk to the db server?

@Ejegg, you would either:

  • A: Write your own Python (or whatever) based Kafka consumer that reads JSON messages and issues MySQL inserts, via your MySQL client of choice.
  • B: Use the EventLogging codebase's eventlogging-consumer with the mysql:// writer endpoint. This is how we insert EventLogging into the log MySQL database.

A is probably better, because you'll have more control over what tables get inserted into. B would work if you don't care so much and just want the events in MySQL.

As part of the Event Data Platform project next year, we might get a JSON based Kafka Connector which would make integration between Kafka and different datastores much simpler. Next year though...

Happy to help explain more. Come find me in IRC or we can jump on a hangout call anytime!

I betcha A wouldn't be so hard, as you already have the code that reads JSON from log files and inserts into MySQL, right? You'd just need to swap out the part that reads from files, and read from Kafka instead.

@Ottomata thank you for all the info! I think I agree with you that A sounds good. I tried updating the kafkatee puppet module, but our puppet repo has diverged significantly from prod by now and we'd have to fork it. Writing something new can probably reduce overall complexity, and hopefully most of the code is already there.

Cool, at the moment I'd recommend https://github.com/dpkp/kafka-python. We maintain .deb packages for it.
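
A minimal sketch of what option A might look like with kafka-python plus a MySQL client (here mysql-connector-python; the table, columns, connection details, and bot check are all hypothetical placeholders):

import json

import mysql.connector  # or your MySQL client of choice
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'eventlogging_CentralNoticeImpression',
    bootstrap_servers=['kafka-jumbo1001.eqiad.wmnet:9092'],
    group_id='fundraising-00',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

db = mysql.connector.connect(host='db-host', user='fr_user',
                             password='...', database='your_db')
cursor = db.cursor()

for message in consumer:
    event = message.value
    # Filtering step, e.g. drop bot traffic (field name is hypothetical).
    if event.get('bot'):
        continue
    # Table and column are hypothetical.
    cursor.execute(
        'INSERT INTO centralnotice_impressions (event_json) VALUES (%s)',
        (json.dumps(event),),
    )
    db.commit()

Committing per message keeps the sketch simple; at real volume you'd probably batch inserts and commit periodically.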

cwdent added a comment.EditedMar 28 2018, 9:45 PM

@Ottomata thanks, looks legit. I see there are stock packages for stretch, and the codfw banner logger is already stretch, so I am going to try building it out there. Currently working on a test vm for that.

cwdent renamed this task from Create job on americium to deliver eventlogging streams to Create job to deliver the eventlogging_CentralNoticeImpression topic.May 23 2018, 7:25 PM

Thanks to a lot of help from @Ottomata, it looks like this is working: alnitak:/srv/kafkatee/centralnotice-impressions

I believe the next step is importing into MySQL.

@Ejegg @AndyRussG these are now rotated out to /srv/banner_logs; it doesn't appear that anything is using the topic yet.

Quick question: are we currently doing any server-side sampling for this? I believe the previous pipeline sampled at 10%?

We do need to sample server-side rather than client-side, so that we get unsampled data in Hive.

Also, if it's possible to get randomness that's as truly random as is realistic, that'd be great. (Not sure how we currently do random sampling; see T192685: CentralNotice: Truer random selection in JS for discussion of this on the client side.)

I'm not 100% sure how kafkatee does sampling, but I bet you it just filters for every Nth message. E.g. 1/100 would get you every 100th message.

A few notes from discussions on IRC...

Here are some ideas for nice-to-have changes, if they are easy to implement. (None are essential!)

  • Have separate directories for CentralNotice and LandingPage event logs.
  • For CentralNotice events, it might be more consistent to always include the sample rate in the file name, even when it's 100% (though it's never expected to reach that).
  • It would be nice, though it's not essential, to have the sample rate in the filename after the timestamp, to facilitate ordering filenames by timestamp.

CentralNotice event logs should be sampled at 10%, as was the case with the banner logs.

Finally, banner_logs doesn't seem like the best name for a directory for these logs, since LandingPage events don't come from banners, and CentralNotice events don't always indicate a banner display... Maybe user_event_logs?

Thanks!!!!

> I'm not 100% sure how kafkatee does sampling, but I bet you it just filters for every Nth message. E.g. 1/100 would get you every 100th message.

Hmmm... Should that be changed? It seems like it would be biased against types of events that come in short bursts. Would it be easy to make it random instead?

Not easy! It looks like the sampling is done here: https://github.com/wikimedia/analytics-kafkatee/blob/b120eb2955efc5c4e18232c26eee1b54257ecdf6/output.c#L64-L65

So this would need some kafkatee code changes.
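
That said, if the pipeline does move to a custom consumer (option A above), random sampling would be easy to do there instead; a minimal sketch, assuming a 10% rate:

import random

SAMPLE_RATE = 10  # keep roughly 1 in 10 messages

def keep(message):
    # Each message gets an independent 1/SAMPLE_RATE chance of being
    # kept, so short bursts are sampled at the same rate as steady
    # traffic, unlike every-Nth filtering.
    return random.random() < 1.0 / SAMPLE_RATE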

Jgreen claimed this task.Aug 21 2018, 1:46 PM
Jgreen added subscribers: cwdent, Jgreen.

We ended up collecting this as a flat file, in JSON format, sampled 1:10.

Jgreen closed this task as Resolved.Aug 21 2018, 1:46 PM