Create analytics-centric Cirrus logs and have them import into HDFS
Closed, ResolvedPublic

Description

To make it easier to perform both ad-hoc and daily analytics, it would be good to have Cirrus server-side logs in HDFS. Bob Flagg would be a great person to do this since he has an analytics background and it would provide an introduction to both our analytics infrastructure and our search infrastructure.

Tasks:

Sit down with Oliver and work out what fields we want to log;
Create a streaming format in Cirrus that outputs logs containing these fields in a way HDFS can consume;
Work with Ottomata to integrate this stream into HDFS's input

Definition of "done" for this task:

Cirrus logs are available in Hadoop.

Details

	Subject	Repo	Branch	Lines +/-
	Refactor monolog handling for kafka logs	operations/mediawiki-config	master	+499 -46
	Log in new format compatible with avro schema	mediawiki/extensions/CirrusSearch	master	+80 -9


Related Objects

Status	Assigned	Task
Resolved	None	T106257 Send raw server side events to Kafka using a PHP Kafka Client {oryx}
Declined	None	T112846 Display automata and humans separately on zero results rate graph
Resolved	EBernhardson	T103505 Create analytics-centric Cirrus logs and have them import into HDFS
Resolved	EBernhardson	T106256 Kafka Client for MediaWiki
Resolved	• csteipp	T109384 Security review of apache/avro and nmred/kafka-php
Resolved	bd808	T111851 Package the Avro PHP library for easier Composer usage
Resolved	Ironholds	T110618 Make sense of why the zero results rate is still going up in spite of us having tackled prominent zero results generators
Resolved	Ironholds	T112295 Design and agree on an Avro schema for cirrus search request logging to hadoop
Resolved	• Nuria	T113521 Setup pipeline for search logs to travel through kafka and camus into hadoop {hawk} [55 pts]
Resolved	EBernhardson	T115715 Update CirrusSearchRequestSet schema to have a timestamp field

Event Timeline

Ironholds created this task.Jun 23 2015, 1:34 PM

Ironholds raised the priority of this task from to Needs Triage.

Ironholds updated the task description.

Ironholds added a project: Discovery-ARCHIVED.

Ironholds added subscribers: Ironholds, • Manybubbles, • Deskana, Ottomata.

Restricted Application added a subscriber: Aklapper.Jun 23 2015, 1:34 PM

Ironholds moved this task from Needs triage to Search on the Discovery-ARCHIVED board.Jun 23 2015, 1:34 PM

Work with Ottomata to integrate this stream into HDFS's input

Cirrus search logs are all server side, right?
We should integrate a PHP Kafka producer into Mediawiki.

• Deskana updated the task description.Jun 23 2015, 5:16 PM

• Deskana added a project: Discovery-Search (Current work).

• Deskana set Security to None.

https://github.com/quipo/kafka-php looks fairly reasonable. We should be able to build out some glue code between monolog and kafka-php such that logs to the CirrusSearchRequests (or whichever) log group go to kafka instead of (or in addition to) fluorine.

What format should the log messages be in?
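A rough sketch of the routing idea above, using Python's standard `logging` module as a stand-in for monolog. The `ListHandler` class and the `SomeOtherChannel` name are hypothetical and exist only for illustration; a real setup would put a Kafka producer behind one handler and the existing log relay behind the other.

```python
import logging


class ListHandler(logging.Handler):
    """Stand-in sink that only collects messages in memory (hypothetical)."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record.getMessage())


kafka_sink = ListHandler()  # would wrap a Kafka producer in a real setup
file_sink = ListHandler()   # would be the existing log destination

# The search request channel goes to Kafka in addition to the existing sink.
requests_log = logging.getLogger("CirrusSearchRequests")
requests_log.setLevel(logging.INFO)
requests_log.propagate = False
requests_log.addHandler(kafka_sink)
requests_log.addHandler(file_sink)

# Any other channel keeps going only to the existing sink.
other_log = logging.getLogger("SomeOtherChannel")
other_log.setLevel(logging.INFO)
other_log.propagate = False
other_log.addHandler(file_sink)

requests_log.info("search request")
other_log.info("unrelated message")
```

Dropping the second `addHandler` call on `requests_log` would give the "instead of" variant rather than "in addition to".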

EBernhardson claimed this task.Jun 25 2015, 4:03 AM

EBernhardson moved this task from Incoming to not in use - please delete on the Discovery-Search (Current work) board.

Looking closer, quipo/kafka-php is only valid through the 0.7 series of Kafka and we are on 0.8. I'm seeing two options for producing logs to Kafka 0.8 in PHP, but both require Zend extensions written in C: one to talk to librdkafka (which kafkatee uses) directly, the other to talk to ZooKeeper (to determine which hosts to send messages to for a particular topic/partition combination). Neither is labeled as a stable implementation, which leaves me a bit wary.

I'll spend some time and see if either of these plays nicely under HHVM's zend-compat layer for extensions, but I'm not holding my breath.

@EBernhardson this is a task for Bob to orient him around our systems; obviously your help would be most appreciated but no claiming yet :D. I'd have assigned it to him but evidently nobody has told him to create a Phabricator account yet.

Hm, I think if we are just producing messages, and not consuming, we should not need zookeeper as a dependency. I'm not sure if that extension relies directly on zookeeper, but perhaps if we don't call any consume functions, it will just work?

librdkafka itself is stable; we are using it on all the frontend caches to produce webrequest logs to kafka. I'd likely go with that one.

As for message format, maybe we should meet to talk about this? I think a year from now we will have a REST interface to Kafka that can be used to produce Avro messages to Kafka using a schema registry. This project will be tracked in T102082, and is focused first on analytics-type data, which these logs are. I think that system can and should be used for application-type messages too, but that is further down the road; we will see.

Alternatively (and possibly less reliably), we are working on making Eventlogging more scalable by using Kafka as the backend for it. Depending on the volume you want to produce, Eventlogging MAY be able to support this in the near term (within a month?). But, I would be hesitant to promise this, and producing directly to Kafka will definitely work.

You could still use Avro now without T102082, which would make it easier to transition in the future. It would also make querying in Hadoop more efficient than JSON. Otherwise, just go with JSON and try to not change your schemas too much.
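To illustrate what a fixed schema buys over free-form JSON, here is a minimal sketch. The record name and all three fields are hypothetical and are not the schema that was eventually agreed on; the `conforms` helper is a toy check covering only a few primitive types, not a substitute for a real Avro library.

```python
import json

# Hypothetical Avro-style record schema, written in Avro's JSON schema notation.
SCHEMA = json.loads("""
{
  "type": "record",
  "name": "ExampleSearchRequest",
  "fields": [
    {"name": "query", "type": "string"},
    {"name": "hits", "type": "int"},
    {"name": "took_ms", "type": "int"}
  ]
}
""")

# Only the primitive types used in this sketch are mapped.
PYTHON_TYPES = {"string": str, "int": int, "boolean": bool}


def conforms(record: dict, schema: dict) -> bool:
    """Return True if record has exactly the schema's fields with matching types."""
    fields = schema["fields"]
    if set(record) != {f["name"] for f in fields}:
        return False
    return all(
        type(record[f["name"]]) is PYTHON_TYPES[f["type"]] for f in fields
    )


good = {"query": "kafka", "hits": 3, "took_ms": 12}
missing_field = {"query": "kafka", "hits": 3}
wrong_type = {"query": "kafka", "hits": "3", "took_ms": 12}
```

With plain JSON, records like `missing_field` or `wrong_type` would only surface as problems at query time in Hadoop; a schema lets the producer reject them up front.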

The stream here is around 200 million messages per day, probably a bit much for the EventLogging infrastructure.

I'll step back and let bflagg work on this, but feel free to pull me back in for anything.

EBernhardson removed EBernhardson as the assignee of this task.Jun 25 2015, 5:46 PM

EBernhardson moved this task from not in use - please delete to Incoming on the Discovery-Search (Current work) board.

• Manybubbles triaged this task as High priority.Jul 2 2015, 4:45 PM

EBernhardson assigned this task to • bflagg.Jul 2 2015, 5:44 PM

• Deskana removed • bflagg as the assignee of this task.Jul 7 2015, 8:41 PM

@Ironholds It was proposed in the Cirrus backlog grooming that we put this back into the backlog. Can you give us some info on how essential this is for you?

Define "essential"? ;p.

It's not essential; nothing will break if we don't do it. But it will drastically cut down the amount of time it takes to add new things to the log processing. We're talking on the order of turning a 4 day task into a 2 day task, here (this "task" is arbitrary but you get my point). There will be a non-trivial initial cost while I get up to speed on Oozie jobs and turn all the pythonic streaming functions into Java UDFs, but after that it'll save a lot of time.

FYI, I hacked on a possible PHP Kafka client at the Wikimania Hackathon. Doesn't look easy! I made a short attempt at HHVMizing https://github.com/EVODelavega/phpkafka, and failed. The other option is the native PHP client https://github.com/nmred/kafka-php. However, it has a PHP C Zookeeper extension dependency.

We don't need Zookeeper to produce messages to Kafka. kafka-php uses it in the produce logic because it was written for a slightly older version of Kafka. It should be possible to fork kafka-php and edit out any of the zookeeper dependencies if we plan on using it only for producing messages. Let's track any future work on this here: T106256

Ottomata added a subtask: T106256: Kafka Client for MediaWiki.Jul 19 2015, 6:04 AM

Nemo_bis subscribed.Jul 21 2015, 6:26 PM

EBernhardson claimed this task.Aug 5 2015, 11:12 PM

EBernhardson moved this task from Incoming to Needs review on the Discovery-Search (Current work) board.

This is blocked by T106256, which has a patch that is awaiting review.

EBernhardson added a subtask: T109384: Security review of apache/avro and nmred/kafka-php.Aug 18 2015, 12:02 AM

• ksmith moved this task from Search to On Sprint Board on the Discovery-ARCHIVED board.Aug 27 2015, 8:33 PM

• ksmith added a project: Essential-Work.Sep 1 2015, 6:05 PM

I asked Chris about this before. The plan is still for this security review to be done by the end of this week.

• csteipp closed subtask T109384: Security review of apache/avro and nmred/kafka-php as Resolved.Sep 9 2015, 5:14 PM

Security review was completed, so this is ready to go out next week. In theory (i.e. provided there are no issues), this means that starting next week we'll have tons of search data in HDFS.

EBernhardson moved this task from Needs review to not in use - please delete on the Discovery-Search (Current work) board.Sep 16 2015, 7:55 PM

EBernhardson moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board.Sep 16 2015, 8:20 PM

Still awaiting deployment. Also waiting on the schema definition, which is taking place in T112295.

• Deskana added a subtask: T110618: Make sense of why the zero results rate is still going up in spite of us having tackled prominent zero results generators.Sep 17 2015, 8:25 PM

Ironholds added a subtask: T112295: Design and agree on an Avro schema for cirrus search request logging to hadoop.Sep 17 2015, 8:26 PM

• Deskana mentioned this in T110618: Make sense of why the zero results rate is still going up in spite of us having tackled prominent zero results generators.Sep 17 2015, 8:26 PM

EBernhardson moved this task from Needs review to not in use - please delete on the Discovery-Search (Current work) board.Sep 21 2015, 11:29 PM

Change 240041 had a related patch set uploaded (by EBernhardson):
Log in new format compatible with avro schema

https://gerrit.wikimedia.org/r/240041

gerritbot added a project: Patch-For-Review.Sep 22 2015, 6:59 AM

EBernhardson moved this task from not in use - please delete to Needs review on the Discovery-Search (Current work) board.Sep 22 2015, 5:04 PM

• Deskana closed subtask T112295: Design and agree on an Avro schema for cirrus search request logging to hadoop as Resolved.Sep 23 2015, 5:00 AM

• Deskana closed subtask T110618: Make sense of why the zero results rate is still going up in spite of us having tackled prominent zero results generators as Resolved.Sep 23 2015, 5:03 AM

The above patch should be the last code change that needs to be deployed. The logging is gated by a sampling parameter, so we can start it off with something like 1 in 1k requests to make sure we aren't going to get a flood of errors if something goes wrong. Once confident, we can set the sampling to 1 and capture everything.
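A minimal sketch of such a sampling gate, assuming the parameter is a fraction between 0 and 1 (so 0.001 is roughly 1 in 1k and 1 captures everything). The `should_log` name and the fractional interpretation are assumptions for illustration, not the actual CirrusSearch configuration.

```python
import random


def should_log(sample_rate: float, rng: random.Random) -> bool:
    """Return True for roughly `sample_rate` of calls; 1.0 logs every request."""
    if sample_rate >= 1.0:
        return True
    if sample_rate <= 0.0:
        return False
    return rng.random() < sample_rate


rng = random.Random(0)  # seeded only so the sketch is repeatable

# Cautious rollout: about 1 in 1k requests get logged.
logged = sum(should_log(0.001, rng) for _ in range(1_000_000))

# Full capture once confident: every request gets logged.
all_logged = all(should_log(1.0, rng) for _ in range(1_000))
```

Starting low limits the blast radius: if the logging path throws, only a small fraction of requests are affected.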

Ottomata added a subtask: T113521: Setup pipeline for search logs to travel through kafka and camus into hadoop {hawk} [55 pts].Sep 23 2015, 8:29 PM

When do you plan on getting this merged?

EBernhardson closed subtask T106256: Kafka Client for MediaWiki as Resolved.Sep 23 2015, 11:12 PM

Change 240615 had a related patch set uploaded (by EBernhardson):
Refactor monolog handling to point to 1-N sources

https://gerrit.wikimedia.org/r/240615

I plan to get this all ready in time for next week's branch cut. Really I wanted to get it out this week, but last call for deploys is already tomorrow... will see who has a +2 button I can pester :)

• Deskana added a parent task: T112846: Display automata and humans separately on zero results rate graph.Sep 25 2015, 6:31 PM

• Deskana mentioned this in T112846: Display automata and humans separately on zero results rate graph.

bd808 mentioned this in T114733: Determine proper encoding for structured log data sent to Kafka by MediaWiki.Oct 6 2015, 3:27 AM

bd808 subscribed.Oct 14 2015, 6:30 PM

• kevinator closed subtask T113521: Setup pipeline for search logs to travel through kafka and camus into hadoop {hawk} [55 pts] as Resolved.Oct 15 2015, 4:02 PM

Change 240041 merged by jenkins-bot:
Log in new format compatible with avro schema

https://gerrit.wikimedia.org/r/240041

ReleaseTaggerBot added a project: MW-1.27-release (WMF-deploy-2015-10-27_(1.27.0-wmf.4)).Oct 19 2015, 11:00 AM

EBernhardson moved this task from Needs review to Needs Reporting on the Discovery-Search (Current work) board.Oct 20 2015, 4:37 PM

• Deskana closed this task as Resolved.Oct 27 2015, 8:47 AM

• Deskana closed subtask T115715: Update CirrusSearchRequestSet schema to have a timestamp field as Resolved.

• Deskana moved this task from Needs Reporting to Resolved on the Discovery-Search (Current work) board.

So this is "resolved" but I have no idea where the data is or if it's even importing. Help?

In T103505#1756868, @Ironholds wrote:

So this is "resolved" but I have no idea where the data is or if it's even importing. Help?

@EBernhardson, this is a question for you.

The code is all written and merged, but the train has not rolled forward in a while. On Thursday we can send out the config patch that turns this on and see what happens.

Change 240615 merged by jenkins-bot:
Refactor monolog handling for kafka logs

https://gerrit.wikimedia.org/r/240615

fbstj mentioned this in Event-Platform.Dec 4 2015, 2:22 PM
