
Modern Event Platform: Stream Intake Service: Migrate Mediawiki monolog Kafka uses to eventgate-analytics
Closed, Resolved · Public · 0 Estimated Story Points

Description

Migrate MediaWiki event logging (currently happening via Monolog to Kafka) to use the new stream intake service described in T201963 and T201068.

At this time, MediaWiki is logging events to Kafka via Monolog, serialized as Avro. These events go to a Kafka cluster that is maintained solely for this purpose. We want MediaWiki to log to the new analytics Kafka jumbo cluster instead, and we want that logging to happen via JSON, not Avro, per the decision taken in the RFC referenced on this ticket: https://phabricator.wikimedia.org/T198256. This is so:

  • We can consolidate all our event logging to use the same transport protocol (JSON via HTTP, not Avro)
  • We can decommission the old analytics Kafka cluster that is maintained solely for the purpose of receiving these MediaWiki Avro events
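
As a concrete illustration of the target, here is a minimal sketch of producing an event as JSON over HTTP to an EventGate-style intake endpoint. The endpoint URL, schema URI, and stream name are hypothetical placeholders, not the real eventgate-analytics configuration:

```
# Minimal sketch of JSON-over-HTTP event intake. All names below are
# hypothetical placeholders, not the real eventgate-analytics setup.
import json
import urllib.request

EVENTGATE_URL = "https://eventgate.example.org/v1/events"  # hypothetical

event = {
    "$schema": "/example/event/1.0.0",     # hypothetical schema URI
    "meta": {"stream": "example.events"},  # hypothetical stream name
    "message": "hello world",
}

req = urllib.request.Request(
    EVENTGATE_URL,
    data=json.dumps([event]).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # e.g. 201 if the events were accepted
```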

Event Timeline

Ottomata triaged this task as Medium priority. Feb 23 2018, 8:16 PM
Ottomata created this task.

Change 413792 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add duplicate mediawiki avro Camus job to consume from Kafka jumbo and analytics

https://gerrit.wikimedia.org/r/413792

Change 413795 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Point Mediawiki Monolog at Kafka jumbo in deployment-prep

https://gerrit.wikimedia.org/r/413795

Change 413796 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Point Mediawiki Monolog at Kafka jumbo in production

https://gerrit.wikimedia.org/r/413796

Change 413795 merged by Ottomata:
[operations/mediawiki-config@master] Point Mediawiki Monolog at Kafka jumbo in deployment-prep

https://gerrit.wikimedia.org/r/413795

Change 413792 merged by Ottomata:
[operations/puppet@production] Add duplicate mediawiki avro Camus job to consume from Kafka jumbo and analytics

https://gerrit.wikimedia.org/r/413792

Mentioned in SAL (#wikimedia-operations) [2018-03-06T20:35:15Z] <ottomata> pointing mediawiki monolog kafka producers at kafka jumbo-eqiad cluster: T188136

Change 413796 merged by Ottomata:
[operations/mediawiki-config@master] Point Mediawiki Monolog at Kafka jumbo in production

https://gerrit.wikimedia.org/r/413796

Mentioned in SAL (#wikimedia-analytics) [2018-03-06T20:35:30Z] <ottomata> pointing mediawiki monolog kafka producers at kafka jumbo-eqiad cluster: T188136

Mentioned in SAL (#wikimedia-analytics) [2018-03-06T20:44:08Z] <ottomata> reverted change to point mediawiki monolog kafka producers at kafka jumbo-eqiad until deployment train is done T188136

Mentioned in SAL (#wikimedia-operations) [2018-03-06T20:44:29Z] <ottomata> reverted change to point mediawiki monolog kafka producers at kafka jumbo-eqiad until deployment train is done T188136

Mentioned in SAL (#wikimedia-operations) [2018-03-08T15:29:20Z] <ottomata> merging and then deploying mediawiki-config to point monolog avro kafka producer at new kafka jumbo cluster: https://phabricator.wikimedia.org/T188136

Mentioned in SAL (#wikimedia-operations) [2018-03-08T15:32:49Z] <otto@tin> Synchronized wmf-config/ProductionServices.php: Point Mediawiki Monolog at new Kafka jumbo-eqiad cluster: T188136 (duration: 01m 16s)

Change 417294 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Remove no longer needed camus mediawiki-analytics job

https://gerrit.wikimedia.org/r/417294

Change 417294 merged by Ottomata:
[operations/puppet@production] Remove no longer needed camus mediawiki-analytics job

https://gerrit.wikimedia.org/r/417294

Mentioned in SAL (#wikimedia-operations) [2018-03-08T21:15:18Z] <otto@tin> Synchronized wmf-config/ProductionServices.php: Revert: point monolog avro producer back at Kafka analytics. Too many TCP connections? T188136 (duration: 00m 58s)

@elukey, I do think the webrequest_text deploy on Tuesday also correlates: the MirrorMaker instance started flapping then:
https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?orgId=1&var-instance=main-eqiad_to_jumbo-eqiad&from=1520346460368&to=1520354880087

It was able to keep up, though, until today's deploy of mediawiki-avro.

Ah @elukey, I looked more into this and remembered that this might actually cause TCP issues after all. I think we should not move this to jumbo yet, but first look into getting a new PHP Kafka client deployed. Let's hold off on that though, and focus on the Kafka main and MirrorMaker work. This can be our last holdout on Kafka analytics, and we don't need MirrorMaker for it, so it doesn't block us.

Ottomata added a subscriber: EBernhardson.

Hopefully @EBernhardson can help us out somehow with this one? We want to swap the Mediawiki Kafka producer to a more modern client, possibly https://github.com/weiboad/kafka-php or https://github.com/arnaud-lb/php-rdkafka

php-rdkafka would be our best bet, but unfortunately it does not support HHVM, and we are not likely to be rid of HHVM this calendar year.

I spent some time with weiboad (which is apparently v2 of the library we already use, and has an unreleased PHP >= 7.1 v3 rewrite in the master branch) and ran into some odd issues. I set up and tested against Kafka 0.9.0.1 (the default version in MediaWiki-Vagrant). Specifically, I can produce messages from weiboad, but I cannot read them with kafkacat or the Python kafka library (and increasing rdkafka logging to level 7 added nothing). Trying to read them with pykafka, which is a pure-Python implementation, blows up with some sort of message-decoding errors.

Additionally, I'm not sure whether weiboad would help with the particular issue here, as it looks like the underlying socket implementation for open/read/write is exactly the same.
The new cluster we are going to point these at is Kafka 0.10 anyway, so I'm going to try that. But producing malformed messages to 0.9 is worrying.
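
For reference, a minimal sketch of the kind of cross-client sanity check described above, assuming a local broker and a hypothetical test topic. Both sides here use kafka-python so the example is self-contained; in the test above the producer was the PHP weiboad client:

```
# Cross-client sanity check: can an independent consumer decode what was
# produced? (Assumptions: the localhost broker and topic name are
# hypothetical; in the ticket the producer was PHP weiboad, not Python.)
from kafka import KafkaConsumer, KafkaProducer

TOPIC = "mw.avro.test"  # hypothetical test topic

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send(TOPIC, b'{"hello": "world"}')
producer.flush()

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # give up after 5s with no messages
)
for msg in consumer:
    # If the produced bytes were malformed at the protocol level, this
    # iteration raises a decoding error instead of yielding messages.
    print(msg.offset, msg.value)
```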

Potential avenues to investigate:

  • The send timeout on the MediaWiki Kafka producer is 10ms, and it will retry up to 3 times. We could try increasing the timeout, although that should already be more than enough (see the sketch after this list for the knobs in question).
  • Exceptions/errors in Kafka responses etc. are currently logged to the 'wfDebugLogFile' channel, but that channel looks unconfigured in our production logging, so whatever errors it is emitting are all thrown away. We could start logging that channel, or turn on a dedicated channel. (There might also be some uncertainty about logging new messages while the app is shutting down and already flushing logs; hard to say.)
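
To make the first avenue concrete, here is a sketch of the timeout and retry knobs in question, illustrated with kafka-python for consistency with the other examples here; the production client is PHP and spells these options differently, so the parameter names are illustrative only:

```
# Illustration of the send-timeout/retry tuning avenue. Parameter names
# are kafka-python's; the MediaWiki PHP client's option names differ.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # hypothetical broker
    request_timeout_ms=30000,  # well above an aggressive 10ms send timeout
    retries=3,                 # matches the 3 retries described above
    acks=1,                    # wait for the partition leader to acknowledge
)
producer.send("example.topic", b'{"event": "test"}')
producer.flush()
```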

@EBernhardson thanks for looking into this! I'd really like to defer to your best intuition here on what to do. You have a lot more experience in the PHP/MediaWiki world, and I don't have a strong sense of what is best.

This does block T183303, but we can at least move the rest of the clients over to Kafka jumbo-eqiad. Once we've done that (hopefully by June), the mw-avro producer will be the only one still using Kafka analytics-eqiad. However, until ops begs us to free up the rack space, we aren't in a huge hurry.

@EBernhardson would you mind if I assigned this to you?

Assigning, feel free to assign back if you don't like it! :p

Ottomata renamed this task from Migrate Mediawiki Monolog Kafka producer to Kafka Jumbo to Migrate Mediawiki Monolog Kafka producer to new Stream Intake Service (as part of Modern Event Platform). Sep 20 2018, 2:15 PM
Ottomata updated the task description. (Show Details)
Ottomata added subscribers: leila, Stas, daniel and 16 others.

Just edited the task description. If possible, we will try to get these logs off of Kafka analytics by eventually moving them to new Modern Event Platform components.

Ottomata renamed this task from Migrate Mediawiki Monolog Kafka producer to new Stream Intake Service (as part of Modern Event Platform) to Modern Event Platform: Stream Intake Service: Migrate Mediawiki monolog Kafka uses to EventGate. Dec 5 2018, 9:56 PM
Ottomata renamed this task from Modern Event Platform: Stream Intake Service: Migrate Mediawiki monolog Kafka uses to EventGate to Modern Event Platform: Stream Intake Service: Migrate Mediawiki monolog Kafka uses to eventgate-analytics. May 17 2019, 4:57 PM
Ottomata claimed this task.

WOW.