
Create analytics-centric Cirrus logs and have them import into HDFS
Closed, Resolved · Public

Description

To make it easier to perform both ad-hoc and daily analytics, it would be good to have Cirrus server-side logs in HDFS. Bob Flagg would be a great person to do this since he has an analytics background and it would provide an introduction to both our analytics infrastructure and our search infrastructure.

Tasks:

  1. Sit down with Oliver and work out what fields we want to log;
  2. Create a streaming format in Cirrus that outputs logs containing these fields in a way HDFS can consume;
  3. Work with Ottomata to integrate this stream into HDFS's input.

Definition of "done" for this task:

  • Cirrus logs are available in Hadoop.

Related Objects

Event Timeline

Ironholds raised the priority of this task from to Needs Triage.
Ironholds updated the task description.
Ironholds added a project: Discovery-ARCHIVED.

Work with Ottomata to integrate this stream into HDFS's input

Cirrus search logs are all server side, right?
We should integrate a PHP Kafka producer into MediaWiki.

https://github.com/quipo/kafka-php looks fairly reasonable. We should be able to build out some glue code between monolog and kafka-php such that logs to the CirrusSearchRequests (or whichever) log group go to kafka instead of (or in addition to) fluorine.
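
For illustration, here's a minimal sketch of what that glue might look like, assuming a producer object with a send() method; the class, method, and topic names are hypothetical, not an existing kafka-php or MediaWiki API:

```
<?php
// Hypothetical Monolog handler that forwards formatted log records for one
// channel (e.g. CirrusSearchRequests) to a Kafka topic instead of fluorine.

use Monolog\Logger;
use Monolog\Handler\AbstractProcessingHandler;

class KafkaLogHandler extends AbstractProcessingHandler {
    /** @var object Producer with a send( $topic, $payload ) method (assumed interface). */
    private $producer;
    /** @var string Kafka topic name, e.g. "mediawiki_CirrusSearchRequests" (made up). */
    private $topic;

    public function __construct( $producer, $topic, $level = Logger::INFO, $bubble = true ) {
        parent::__construct( $level, $bubble );
        $this->producer = $producer;
        $this->topic = $topic;
    }

    protected function write( array $record ) {
        // 'formatted' is whatever the attached formatter produced (JSON, Avro blob, ...).
        $this->producer->send( $this->topic, $record['formatted'] );
    }
}
```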

What format should the log messages be in?

Looking closer, quipo/kafka-php only supports Kafka through the 0.7 series, and we are on 0.8. I'm seeing two options for producing logs to Kafka 0.8 in PHP, but both require Zend extensions written in C: one talks to librdkafka (which kafkatee uses) directly, the other talks to Zookeeper (to determine which hosts to send messages to for a particular topic/partition combination). Neither is labeled as a stable implementation, which leaves me a bit wary.

I'll spend some time and see if either of these plays nicely under HHVM's zend-compat layer for extensions, but I'm not holding my breath.

@EBernhardson this is a task for Bob to orient him around our systems; obviously your help would be most appreciated, but no claiming yet :D. I'd have assigned it to him, but evidently nobody has told him to create a Phabricator account yet.

Hm, I think if we are just producing messages, and not consuming, we should not need zookeeper as a dependency. I'm not sure if that extension relies directly on zookeeper, but perhaps if we don't call any consume functions, it will just work?

librdkafka itself is stable; we are using it on all the frontend caches to produce webrequest logs to kafka. I'd likely go with that one.
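
If we did go the librdkafka route, producing would be roughly along these lines; this sketch assumes a php-rdkafka style binding, so treat the exact API as an assumption rather than what either of the extensions above actually exposes, and the broker and topic names are made up:

```
<?php
// Sketch: produce a single log line straight to Kafka via a librdkafka binding.
// Produce-only usage, so no Zookeeper involvement is needed.

$conf = new RdKafka\Conf();
$producer = new RdKafka\Producer( $conf );
$producer->addBrokers( 'kafka1001.example.wmnet:9092' ); // placeholder broker

$topic = $producer->newTopic( 'mediawiki_CirrusSearchRequests' ); // placeholder topic

$payload = json_encode( [ 'query' => 'example', 'hits' => 42, 'tookMs' => 18 ] );
$topic->produce( RD_KAFKA_PARTITION_UA, 0, $payload );
```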

As for message format, maybe we should meet to talk about this? I think a year from now we will have a REST interface to Kafka that can be used to produce Avro messages using a schema registry. This project will be tracked in T102082, and is firstly focused on analytics-type data, which these logs are. I think that system can and should be used for application-type messages too, but that is further down the road; we will see.

Alternatively (and possibly less reliably), we are working on making EventLogging more scalable by using Kafka as its backend. Depending on the volume you want to produce, EventLogging MAY be able to support this in the near term (within a month?). But I would be hesitant to promise this, and producing directly to Kafka will definitely work.

You could still use Avro now without T102082, which would make it easier to transition in the future. It would also make querying in Hadoop more efficient than JSON. Otherwise, just go with JSON and try to not change your schemas too much.
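
As a rough illustration of the Avro-now option, encoding a record against a hand-maintained schema with Apache's avro-php library would look something like the following; the schema and field names here are invented for the example, not a real CirrusSearchRequests schema:

```
<?php
// Sketch: binary-encode one search-request record with Apache's avro-php library.
// Schema and fields are illustrative only.
require_once 'avro.php';

$schemaJson = '{
    "type": "record",
    "name": "CirrusSearchRequest",
    "fields": [
        { "name": "ts",     "type": "long" },
        { "name": "wikiId", "type": "string" },
        { "name": "query",  "type": "string" },
        { "name": "hits",   "type": "int" },
        { "name": "tookMs", "type": "int" }
    ]
}';

$schema  = AvroSchema::parse( $schemaJson );
$io      = new AvroStringIO();
$writer  = new AvroIODatumWriter( $schema );
$encoder = new AvroIOBinaryEncoder( $io );

$writer->write( [
    'ts'     => time(),
    'wikiId' => 'enwiki',
    'query'  => 'example',
    'hits'   => 42,
    'tookMs' => 18,
], $encoder );

$binary = $io->string(); // this is what would get handed to the Kafka producer
```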

The stream here is around 200 million messages per day, probably a bit much for the EventLogging infrastructure.
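(For scale: 200 million messages per day averages out to roughly 2,300 messages per second, before accounting for traffic peaks.)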

I'll step back and let bflagg work on this, but feel free to pull me back in for anything.

@Ironholds It was proposed in the Cirrus backlog grooming that we put this back into the backlog. Can you give us some info on how essential this is for you?

Define "essential"? ;p.

It's not essential; nothing will break if we don't do it. But it will drastically cut down the amount of time it takes to add new things to the log processing. We're talking on the order of turning a 4-day task into a 2-day task (the specific numbers are arbitrary, but you get my point). There will be a non-trivial initial cost while I get up to speed on Oozie jobs and turn all the pythonic streaming functions into Java UDFs, but after that it'll save a lot of time.

FYI, I hacked on a possible PHP Kafka client at the Wikimania Hackathon. It doesn't look easy! I made a short attempt at HHVMizing https://github.com/EVODelavega/phpkafka, and failed. The other option is the native PHP https://github.com/nmred/kafka-php; however, it has a dependency on a PHP C Zookeeper extension.

We don't need Zookeeper to produce messages to Kafka. kafka-php uses it in the produce logic because it was written for a slightly older version of Kafka. It should be possible to fork kafka-php and edit out any of the zookeeper dependencies if we plan on using it only for producing messages. Let's track any future work on this here: T106256

This is blocked by T106256, which has a patch that is awaiting review.

I asked Chris about this before. The plan is still for this security review to be done by the end of this week.

Security review was completed, so this is ready to go out next week. In theory (i.e. provided there are no issues), this means that starting next week we'll have tons of search data in HDFS.

Still awaiting deployment. Also waiting on the schema definition, which is taking place in T112295.

Change 240041 had a related patch set uploaded (by EBernhardson):
Log in new format compatible with avro schema

https://gerrit.wikimedia.org/r/240041

The above patch should be the last code change that needs to be deployed. The logging is gated by a sampling parameter, so we can start it off at something like 1 in 1k requests to make sure we aren't going to get a flood of errors due to something going wrong. Once confident, we can set the sampling to 1 and capture everything.
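
In rough terms, the gate amounts to something like the sketch below; the sample factor value and log fields are made up for illustration, and the real wiring is in the patches linked on this task:

```
<?php
// Hypothetical sketch of the sampling gate described above.
use MediaWiki\Logger\LoggerFactory;

$sampleFactor = 1000; // 1000 => log roughly 1 in 1k requests; 1 => capture everything

if ( $sampleFactor > 0 && mt_rand( 1, $sampleFactor ) === 1 ) {
    LoggerFactory::getInstance( 'CirrusSearchRequests' )->info(
        'search request',
        [ 'query' => 'example', 'hits' => 42 ] // illustrative fields only
    );
}
```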

When do you plan on getting this merged?

Change 240615 had a related patch set uploaded (by EBernhardson):
Refactor monolog handling to point to 1-N sources

https://gerrit.wikimedia.org/r/240615

I plan to get this all ready in time for next week's branch cut. Really I wanted to get it out this week, but last call for deploys is already tomorrow... will see who has a +2 button I can pester :)

Change 240041 merged by jenkins-bot:
Log in new format compatible with avro schema

https://gerrit.wikimedia.org/r/240041

So this is "resolved" but I have no idea where the data is or if it's even importing. Help?

So this is "resolved" but I have no idea where the data is or if it's even importing. Help?

@EBernhardson, this is a question for you.

The code is all written and merged, but the train has not rolled forward in a while. On Thursday we can send out the config patch that turns this on and see what happens.

Change 240615 merged by jenkins-bot:
Refactor monolog handling for kafka logs

https://gerrit.wikimedia.org/r/240615