Page MenuHomePhabricator

Subscription: Trending service should be able to subscribe to edits in real time
Closed, ResolvedPublic3 Estimated Story Points

Description

The trending service use Kafka to subscribe to events.
(This has changed from the original Public EventStreams decision. As such we need to review what events will look like)

The trending service should be provided access to all the data specified in the event's schema.

A sample event:

{
  "comment": "",
  "database": "enwiki",
  "meta": {
    "domain": "en.wikipedia.org",
    "dt": "2016-09-14T14:49:42+00:00",
    "id": "788516b6-7a8a-11e6-9a2d-b083fecf1287",
    "request_id": "f9db46d3-c3d7-420f-bf23-cab88bfde5ff",
    "schema_uri": "mediawiki/revision/create/1",
    "topic": "mediawiki.revision-create",
    "uri": "https://en.wikipedia.org/wiki/Cry_to_Me_%28album%29"
  },
  "page_id": 37643813,
  "page_is_redirect": false,
  "page_namespace": 0,
  "page_title": "Cry_to_Me_(album)",
  "performer": {
    "user_groups": [
      "extendedconfirmed",
      "*",
      "user",
      "autoconfirmed"
    ],
    "user_id": 4363489,
    "user_is_bot": false,
    "user_text": "Night Time"
  },
  "rev_content_format": "wikitext",
  "rev_content_model": "wikitext",
  "rev_id": 739412400,
  "rev_len": 3748,
  "rev_minor_edit": true,
  "rev_parent_id": 739411852,
  "rev_sha1": "q090ey5jtsmplmqiqz2oytx0s96cj1u",
  "rev_timestamp": "2016-09-14T14:49:42Z"
}

To prove this is working we will only expose pages from the endpoint which have more than 20 edits in the last hour.

Details

Related Changes in Gerrit:

Event Timeline

Jdlrobson renamed this task from Trending service should be able to subscribe to edits in real time to Subscription: Trending service should be able to subscribe to edits in real time.Sep 13 2016, 8:36 PM
Jdlrobson set the point value for this task to 3.

This data will eventually be available publicly via socket.io as part of the public event streams project.

The ChangeProp rule for this is trivial:

trending_service:
  topic: mediawiki.revision-create
  match:
    performer:
      user_is_bot: false
    page_namespace: 0
  exec:
    method: post
    uri: ''  # to be determined
    headers:
      content-type: application/json
    body: '{{message}}'

EDIT: Updated the rule to match the requirements stated in T145554: Process: Trending service should process real time edits

This data will eventually be available publicly via socket.io as part of the public event streams project.

Yup, @Ottomata, but we don't want to mix production services (inside WMF prod env) and public-consumption-related ones. Additionally, it is more stable to have a push architecture here as it will all be running inside the same VLAN.

+1, think I misunderstood the ticket.

@Jdlrobson you may want to push this ticket back util we have the meeting to discus how we should ingest edits.

FTR @dr0ptp4kt said that this could be removed from Reading-Web-Sprint-85-💩 during the associated sprint kickoff meeting.

To clarify, I indicated my guess was that this particular task wasn't strictly required for making this deployable (my understanding is the existing service's subscription to events was sufficient), but @Fjalapeno would have a better sense (@Jdlrobson was out yesterday, he of course could have spoken to it if not out). I defer to @Fjalapeno and @Jdlrobson for the assessment of whether this particular task ought to be done in #reading-web-sprint-85 right now.

@mobrovac may have a better recollection of the meeting last week as well.

Follow up from the ingestion meeting:

  1. @Jdlrobson is going to set up an interface in the repo to return the amount of time to replay edits
  2. @Pchelolo will then setup the Kafka ingestion logic based on that amount of time

Lets check in over email later this week to see how things are going or if anyone needs anything.

I've just realised one more complication with using kafka directly - we'd need each worker to have a unique consumer group which should be stable across restarts so that each consumer could reuse offsets it has stored (they're stored per consumer group).

Could we construct stable IDs without managing them in puppet or something? Something like trending-service.${os.hostname()}.${worker_seq_number}?

Could we construct stable IDs without managing them in puppet or something? Something like trending-service.${os.hostname()}.${worker_seq_number}?

+1. service-runner can easily send the number of the worker.

Could we construct stable IDs without managing them in puppet or something? Something like trending-service.${os.hostname()}.${worker_seq_number}?

@Pchelolo sounds good me - prefer the we can deterministically create IDs rather than storing a GUID.

@Jdlrobson any thoughts?

Change 320628 had a related patch set uploaded (by Jdlrobson):
WIP: Surface edits via kafka

https://gerrit.wikimedia.org/r/320628

Not really sure what I'm doing yet, specifically how I connect to kafka, but I've started adding some scaffolding.

^ @Pchelolo your guidance would be much welcomed :)

@Jdlrobson So basically you're on the right path. I can probably pick up your patch if you want, might be faster and easier, what do you think?

@Pchelolo you just need a function from @Jdlrobson to return the delay in committing the offset, correct?

@Fjalapeno Not sure whether the delay should be just statically set in the service config of should we be able to vary it by message. I think the static option actually makes more sense since variable offset wouldn't actually work. So basically I can just implement the edit-stream that would emit the message event on every message and do all the kafka stuff internally. If @Jdlrobson is fine with that.

@Jdlrobson, do you use Mediawiki-Vagrant? If so, you can enable the eventbus role, provision, and work from there. That will get a you a Kafka Broker running in mediawiki vagrant on port 9092, as well as Mediawiki logging revision-create (and other) events via the EventBus service posting to Kafka. Basically, the whole thing.

You can then implement a kafka consumer that subscribes to the datacenter1.mediawiki.revision-create events. Edit a mw-vagrant page, and you get an event.

@Ottomata I don't have vagrant - it consumes too much of my cpu and is too slow to actually develop on my machine (thankfully i'm due a laptop upgrade soon). Is there a public service I can use instead?

Ideally I'd like to test on real data as it's going to be hard to write a trending service using local data given I'd need to simulate edits from different users/anons/sizes.

Also is the format documented anywhere? How are moves and deletions captured for example?

EventBus and Kafka are set up in deployment-prep for beta.wmflabs.org. If you have access to an instance in that project, you should be able to point your Kafka client to the brokers deployment-kafka04.deployment-prep.eqiad.wmflabs:9092,deployment-kafka05.deployment-prep.eqiad.wmflabs:9092. These are the 2 brokers in the main-deployment-prep Kafka cluster (defined in [[ deployment-kafka05.deployment-prep.eqiad.wmflabs | kafka_clusters hiera ]]).

The topics to subscribe to there are all prefixed with eqiad., e.g. eqiad.mediawiki.revision-create.

Also is the format documented anywhere? How are moves and deletions captured for example?

Schemas are here: https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema/mediawiki

Oh, and the topic to schema mapping is here:
https://github.com/wikimedia/mediawiki-event-schemas/blob/master/config/eventbus-topics.yaml

So, all events in the eqiad.mediawiki.page-move topic (prefixed with datacenter name) will be in the format of the mediawiki/page/move schema.

@Ottomata thanks for that! I may have to grab you some time next week (I am off tomorrow). I suspect I'm going to have to setup Vagrant as I can't get kafka up and running (node-gyp issue)

@Pchelolo I won't touch the patch until Monday now, so feel free to make any tweaks you want until then.

Hm, a node-gyp issue sounds like a problem with the node-rdkafka client, not with setting up a Kafka broker. Either way, ya, happy to help anytime!

@Jdlrobson Feel free to ping me with your node-gyp problems, I've spent quite some time figuring out the node-kafka build process so I might be able to help.

Thanks to @Pchelolo http://stackoverflow.com/a/24307810/1924141 fixed my node-kafka issue but now when I try and connect to the service using the WIP patch I get this error:

{"name":"trending-edits","hostname":"mediawiki-vagrant","pid":18944,"level":50,"err":{"message":"Local: Timed out","name":"service-trending-edits","stack":"service-trending-edits: Local: Timed out\n    at Function.createLibrdkafkaError [as create] (/vagrant/trending-edits/node_modules/node-rdkafka/lib/error.js:232:10)\n    at /vagrant/trending-edits/node_modules/node-rdkafka/lib/client.js:195:30\n    at /vagrant/trending-edits/node_modules/node-rdkafka/lib/client.js:326:9","code":-185,"errno":-185,"origin":"kafka","levelPath":"error"},"msg":"Local: Timed out","time":"2016-11-15T10:39:59.292Z","v":0}

vagrant@mediawiki-vagrant:/vagrant/trending-edits$ sudo service kafka status
kafka start/running, process 9987

I'm configuring broker_list to be localhost:9987

Any ideas?

Are other tools able to work with the broker at localhost:9987? Does kafkacat -L -b localhost:9987

Also, just curious, why 9987 instead of the usual 9092? Did you configure that?

Oh, just checking, are you running the trending service from your host machine or from within vagrant? If from host machine, localhost won't work (unless we set up some vagrant port forwarding). Not sure what your vagrant IP is, but I think it is usually 10.11.12.13. You could try 10.11.12.13:9987 as your broker_list. That is, IF you are running node-rdkafka from your host machine.

Isn't 9987 the pid of the kafka process? As far as I know we don't reconfigure the kafka port in vagrant, so it should be localhost:9092

When I use 9092 I get the same issue.. I've tried from both my local machine and within vagrant (within vagrant I get no timeout but I don't get any events here). When I edit at http://dev.wiki.local.wmftest.net:8080/w/index.php?title=Foo I'm not seeing any events. Do I need to install more things? Do I need to configure it differently? Edit at a different place?

Given I'm not familiar with the code styling conventions of the services repo I'd appreciate a review/merge of the jscs patch first to avoid introducing code style inconsistencies.

Then https://gerrit.wikimedia.org/r/#/c/320628/

Change 320628 merged by jenkins-bot:
Surface edits via kafka

https://gerrit.wikimedia.org/r/320628

@Jdlrobson any instructions on how to test this locally?

I'll put these on mediawiki.org when they stabilise and are a less of pain but for the time being...

You'll need vagrant
Enable the role eventbus.
Clone the repo in any folder in vagrant
vagrant ssh
npm install the contents
You'll need to update the config just as in this patchset https://gerrit.wikimedia.org/r/323238

To test you can either verify npm test passes or if you are less confident..

  1. Verify event bus is setup correctly - an edit on localhost should show up in sudo tail -f /var/log/upstart/eventlogging-service-eventbus.log
  2. Verify the service is running...
sudo service eventlogging-service-eventbus status
sudo service eventlogging-service-eventbus start
  1. Add the line console.log(edit); in lib/processor.js process method
  2. Edit the wiki (on localhost domain) and check that the edit gets processed by processor.