Investigate sending Gerrit events to our data lake
Closed, Resolved (Public)

Description

Gerrit has an event stream, available with appropriate permissions by running ssh -p 29418 gerrit.wikimedia.org gerrit stream-events. It generates a stream of JSON events, which Zuul consumes. https://gerrit.wikimedia.org/r/Documentation/cmd-stream-events.html

There is a Gerrit plugin, events-kafka, which can turn Gerrit into a Kafka producer generating events:
https://gerrit.googlesource.com/plugins/events-kafka/+/refs/heads/master/src/main/resources/Documentation

We should check how to set it up and whether we can send that to our data lake. From a brief discussion with analytics, they require some specific fields: https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#Required_fields

A JSONSchema is mentioned; I don't think I have ever seen one in Gerrit core.

Would also need to read https://wikitech.wikimedia.org/wiki/Event_Platform

Event Timeline

From https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines, the events have to be extended to include a few more fields:

$schema: A URI identifying the JSONSchema for this event. This should match an event schema $id in a schema repository. E.g. /schema_name/1.0.0
meta: Event meta data sub-object. This field contains common meta data fields for an event.
meta.stream: Name of the stream/queue that this event belongs in. This is used for routing incoming events to specific streams and downstream 'datasets'. This will likely correspond with a Kafka topic and a Hive table.
meta.dt: A date timestamp for the event, in ISO-8601 format. This should be the event ingestion time. This is a required property in schemas, but will be filled in by EventGate on the server side if a client does not set it. If you are using EventGate, you should leave this field blank in your event data and allow EventGate to set it to server side receive time.
dt: A date timestamp for the event, in ISO-8601 format. This should be set by your client and should represent the 'event time' of the event. That is, this is the actual time the event happened.

Inside Gerrit, events are represented by Java classes (e.g. PatchSetCreatedEvent) which are serializable to JSON. I could not find the exact mechanism doing the serialization, and I don't think we can configure it out of the box to add other fields.

The events-kafka plugin listens to Gerrit events and serializes them to JSON as is, with no option to add extra fields. I am guessing the plugin would need code added to let us inject fields. The non-varying ones, such as $schema and meta.stream, sound straightforward.

The meta.dt field is automatically filled in by EventGate; I am assuming it is created on the fly and does not need to be generated at the source.

For dt, the Gerrit Java events all inherit from an Event class which declares long eventCreatedOn = TimeUtil.nowMs() / 1000L; which, if I get it right, is the Unix epoch time in seconds. Example when using the privileged command ssh -p 29418 gerrit.wikimedia.org gerrit stream-events:

{
  "project": "mediawiki/extensions/CodeMirror",
  "ref": "refs/changes/06/777806/meta",
  "targetNode": "gerrit2001.wikimedia.org",
  "status": "succeeded",
  "refStatus": "OK",
  "type": "ref-replicated",
  "eventCreatedOn": 1649414423
}

We would need the plugin to retrieve Event.eventCreatedOn, convert it to an ISO-8601 date, and add it to the Kafka message as dt.
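
As a minimal sketch of that conversion, assuming eventCreatedOn really is a count of seconds since the Unix epoch, java.time.Instant can produce the ISO-8601 string directly. The class and method names here are made up for illustration and are not part of Gerrit or the events-kafka plugin.

```java
import java.time.Instant;

class EventTimeSketch {
    /** Converts an eventCreatedOn value (assumed Unix epoch seconds) to an ISO-8601 UTC string. */
    static String toIso8601(long eventCreatedOn) {
        return Instant.ofEpochSecond(eventCreatedOn).toString();
    }

    public static void main(String[] args) {
        // Value taken from the ref-replicated event shown above.
        System.out.println(toIso8601(1649414423L)); // prints 2022-04-08T10:40:23Z
    }
}
```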

The crafted Kafka message format is described at https://gerrit.googlesource.com/plugins/events-kafka/+/refs/heads/master/src/main/resources/Documentation/message.md

Key: Current time in nanoseconds.
Payload: Payload is JSON string. (same as gerrit-events)

The message is crafted using org.apache.kafka.clients.producer.ProducerRecord:

producer.send(new ProducerRecord<>(topic, "" + System.nanoTime(), messageBody));

The messageBody being the json serialized event:

public ListenableFuture<Boolean> publish(String topic, Event event) {
  return session.publish(topic, getPayload(event));
}

private String getPayload(Event event) {
  return gson.toJson(event);
}

My understanding from https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#Required_fields is that the required fields should be injected in the payload. I am not quite sure where the code should be added for that though :-\
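
One place this could happen is around getPayload(), after the event has been serialized. The sketch below uses only the JDK and splices the required fields into an already serialized JSON object; a real plugin would more likely manipulate a Gson JsonObject instead. The schema URI, the stream name, and the class and method names are hypothetical placeholders. meta.dt is left out on purpose so that EventGate can set it.

```java
import java.time.Instant;

class EnvelopeSketch {
    // Hypothetical values: the real ones would come from the schema repository.
    static final String SCHEMA = "/gerrit/event/1.0.0";
    static final String STREAM = "gerrit.event";

    /**
     * Appends $schema, meta.stream and dt to a serialized JSON object.
     * Assumes payload is a JSON object, e.g. the output of gson.toJson(event),
     * and that it does not already contain those keys.
     */
    static String addRequiredFields(String payload, long eventCreatedOn) {
        String body = payload.trim();
        if (body.length() < 2 || !body.startsWith("{") || !body.endsWith("}")) {
            throw new IllegalArgumentException("expected a JSON object");
        }
        // Strip the outer braces so the extra members can be appended.
        String inner = body.substring(1, body.length() - 1).trim();
        String extra = "\"$schema\":\"" + SCHEMA + "\","
            + "\"meta\":{\"stream\":\"" + STREAM + "\"},"
            + "\"dt\":\"" + Instant.ofEpochSecond(eventCreatedOn) + "\"";
        return "{" + (inner.isEmpty() ? "" : inner + ",") + extra + "}";
    }

    public static void main(String[] args) {
        String event = "{\"type\":\"ref-replicated\",\"eventCreatedOn\":1649414423}";
        System.out.println(addRequiredFields(event, 1649414423L));
    }
}
```

String splicing keeps the example free of dependencies, but it does not detect duplicate keys or malformed input beyond the outer braces.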

Change 791642 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/software/gerrit/jsonschemagenerator@master] Json schema from Gerrit Java event classes

https://gerrit.wikimedia.org/r/791642

Change 791644 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] Add operations/software/gerrit/jsonschemagenerator

https://gerrit.wikimedia.org/r/791644

Change 791644 merged by jenkins-bot:

[integration/config@master] Add operations/software/gerrit/jsonschemagenerator

https://gerrit.wikimedia.org/r/791644

P27828 is a CommentAdded json event

P27829 is a schema generated from Gerrit Java code using https://victools.github.io/jsonschema-generator/

I have no idea how to validate the former with the latter. An exercise for next week ;)

I have made a bit more progress today and managed to get a Json Schema which validates a comment added event from the Gerrit Java class!

An issue I had was that the upstream library would show a generic "type": "object" for a property such as:

public java.util.function.Supplier<AccountAttribute> author;

The reason is that this kind of object has only a get() method and no properties, so the library considers it a basic object. I wrote some code on Friday to extract the supplied AccountAttribute, had it reviewed over the weekend, and completely overhauled it based on upstream feedback. There are still a few unknowns, though.

https://github.com/victools/jsonschema-generator/pull/254
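
To illustrate the underlying idea (this is not the code from the pull request), plain Java reflection can recover the type argument of a field declared as Supplier<T>. AccountAttribute and FakeEvent below are hypothetical stand-ins for the Gerrit classes, and the helper name is invented.

```java
import java.lang.reflect.Field;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.function.Supplier;

class SupplierUnwrapSketch {
    // Hypothetical stand-ins for Gerrit's AccountAttribute and an event class.
    static class AccountAttribute { public String name; }
    static class FakeEvent { public Supplier<AccountAttribute> author; }

    /** Returns T for a field declared as Supplier<T>, otherwise the declared generic type. */
    static Type suppliedType(Class<?> owner, String fieldName) {
        Field field;
        try {
            field = owner.getDeclaredField(fieldName);
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(e);
        }
        Type generic = field.getGenericType();
        if (generic instanceof ParameterizedType) {
            ParameterizedType parameterized = (ParameterizedType) generic;
            if (Supplier.class.equals(parameterized.getRawType())) {
                // Supplier has exactly one type parameter: the supplied type.
                return parameterized.getActualTypeArguments()[0];
            }
        }
        return generic;
    }

    public static void main(String[] args) {
        System.out.println(suppliedType(FakeEvent.class, "author").getTypeName());
    }
}
```

A schema generator could then describe the supplied type in place of the generic "type": "object".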

I will look at polishing that up. My aim is to find all Event children classes and generate a JSON schema for each of them.

My pull request has been merged and the maintainer addressed all the concerns directly via https://github.com/victools/jsonschema-generator/pull/255/ (merged as well).

The last field I have to tweak for CommentAddedEvent is project, which fails validation with:

[
  {
    instancePath: '/project',
    schemaPath: '#/properties/project/type',
    keyword: 'type',
    params: { type: 'object' },
    message: 'must be object'
  }
]

In the event, project is a string, but the schema has:

"project" : {
  "type" : "object",
  "properties" : {
    "name" : {
      "type" : "string"
    },
    "serialVersionUID" : {
      "type" : "integer",
      "const" : 1
    }
  }
},

I believe I have fixed the above issues. I have published the generated schemas at https://people.wikimedia.org/~hashar/T304947/schemas/

The comment-added schema now has:

{
  "title" : "com.google.gerrit.server.events.CommentAddedEvent",
  ...
  "properties" : {
    "project" : {
      "type" : "string"
    },
    "type" : {
      "const" : "comment-added"
    },
    ...
  }
}

More or less a success

Change 807538 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/software/gerrit/jsonschemagenerator@master] Json schema from Gerrit Java event classes

https://gerrit.wikimedia.org/r/807538

Change 807538 abandoned by Hashar:

[operations/software/gerrit/jsonschemagenerator@master] Json schema from Gerrit Java event classes

Reason:

https://gerrit.wikimedia.org/r/807538

Change 809305 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] Run bazel test on jsonschemagenerator

https://gerrit.wikimedia.org/r/809305

Change 809305 merged by jenkins-bot:

[integration/config@master] Run bazel test on jsonschemagenerator

https://gerrit.wikimedia.org/r/809305

Change 814725 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/software/gerrit/plugins/events-wikimedia@master] Json schema from Gerrit Java event classes

https://gerrit.wikimedia.org/r/814725

Change 791642 abandoned by Hashar:

[operations/software/gerrit/jsonschemagenerator@master] Json schema from Gerrit Java event classes

Reason:

I have renamed the repository and sent a single change to the new one https://gerrit.wikimedia.org/r/c/operations/software/gerrit/plugins/events-wikimedia/+/814725

https://gerrit.wikimedia.org/r/791642

Change 814728 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] Rename jsonschema-generator -> events-wikimedia

https://gerrit.wikimedia.org/r/814728

Change 814728 merged by jenkins-bot:

[integration/config@master] Rename jsonschema-generator -> events-wikimedia

https://gerrit.wikimedia.org/r/814728

Change 830654 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/software/gerrit/plugins/events-wikimedia@master] Json schema from Gerrit Java event classes

https://gerrit.wikimedia.org/r/830654

Gerrit 3.4.6 has been released and includes a change I have made in order to access the list of events currently registered in a Gerrit instance. I filed T319513 to upgrade our Gerrit instances.

Change 830654 abandoned by Hashar:

[operations/software/gerrit/plugins/events-wikimedia@master] Json schema from Gerrit Java event classes

Reason:

https://gerrit.wikimedia.org/r/830654

Change 814725 abandoned by Hashar:

[operations/software/gerrit/plugins/events-wikimedia@master] Implement REST API and Ssh commands

Reason:

https://gerrit.wikimedia.org/r/814725

This was a proof of concept from back in 2022. The idea was to write a plugin which reacts to events and sends them to EventGate. I had a few interesting side tracks, such as:

The JSON Schema part can probably be upstreamed to Gerrit; it might serve various purposes, such as self-documenting the API.

That was a great experiment overall, and I have learned a lot in the process.

The series of changes is https://gerrit.wikimedia.org/r/q/project:operations%252Fsoftware%252Fgerrit%252Fplugins%252Fevents-wikimedia