
Improve schema update process on EventBus production instance
Closed, Resolved | Public | 8 Estimated Story Points

Description

The current process of updating schema definitions needs a serious overhaul. Currently, the event-schemas repo is being automatically updated by Puppet. There is a Puppet rule that is supposed to reload EventBus, but it doesn't seem to be applied. Ultimately, when there is a schema change, the current process involves

  • waiting for Puppet to run and update the repo
  • restarting EventBus

The process is tedious and error-prone (cf T140848: Regression: "Unable to deliver event: 400: 0 out of 1 events were accepted."). Ideally, EventBus should be gracefully restarted (i.e. reloaded) on each update.

Event Timeline

For emergencies, we should also ensure that common administrative tasks like service restarts & deployments are documented to a level that allows random roots to perform those tasks. I believe right now there are several bits missing in the docs at https://wikitech.wikimedia.org/wiki/EventBus/Administration. Let's fill those in, so that opsens can do the needful in emergencies.

{{done}} today, credits to @elukey, thnx a lot!

Restarts are indeed documented now. Thanks, @elukey!

Shall we add deploy documentation as well, so that code changes can be deployed?

Nuria added subscribers: Ottomata, Nuria.

Seems that one action item here is to document deployment to EventBus (cc @Ottomata )

Is services working on the restart problem? (cc @GWicke )

Nuria renamed this task from Improve schema update process to Improve schema update process on EventBus production instance. Jul 20 2016, 3:10 PM

I honestly think this is up to Analytics to handle. The HTTP proxy service has been developed and is maintained by your team, according to the initial agreement between our teams. That said, of course we will assist and help wherever possible (we discovered the problem and gave hints to the solution, after all).

@mobrovac: do you now have ssh access to the eventbus machines? (your team should)

Nope:

$ ssh kafka1001.eqiad.wmnet
Password: 

AFAIK, nobody from Services has access to it.

Hi! Will be working on this today. There are two problems here (aside from documentation and access).

  1. eventlogging-service did not properly reload schemas on SIGHUP.
  2. puppet git pull of event-schemas.

I'll look into 1. today and see if I can figure out what's going on. As for 2, I spent a lot of time trying to figure out an elegant way to configure scap to deploy event-schemas without coupling the scap config with eventlogging-service. event-schemas should be independent of any particular service. This should be possible using scap environments, but I had a really hard time getting it to do what was needed. I worked with Tyler on this, and we never got to anything satisfactory. At the time, puppet was the easiest/best way to do this. Since all schemas should be backwards compatible[1], an auto deploy via puppet of event-schemas for eventlogging-service should be fine. That is, IFF eventlogging-service properly reloads schemas after the deployment. Since in this case it didn't, it broke.

If we can find a better and satisfactory way to deploy event-schemas other than a naive puppet git pull, I'm all for it. :)

[1] well, except for this one time.
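The reload-on-SIGHUP behaviour from problem 1 can be sketched roughly as follows. This is not the eventlogging-service code: `SchemaCache` and `install_sighup_reload` are hypothetical names, and the sketch assumes, purely for illustration, that schemas are JSON files under a single directory (the real event-schemas layout and format may differ).

```python
import json
import signal
from pathlib import Path


class SchemaCache:
    """Hypothetical in-memory cache of schemas loaded from a directory."""

    def __init__(self, schema_dir):
        self.schema_dir = Path(schema_dir)
        self.schemas = {}
        self.reload()

    def reload(self, signum=None, frame=None):
        # Accepts (signum, frame) so it can be used directly as a signal handler.
        # Build the new mapping completely before swapping it in, so a failed
        # load (e.g. a malformed file) leaves the previous schemas in place.
        fresh = {}
        for path in sorted(self.schema_dir.rglob("*.json")):
            key = path.relative_to(self.schema_dir).as_posix()
            with path.open() as f:
                fresh[key] = json.load(f)
        self.schemas = fresh


def install_sighup_reload(cache):
    # Assumption: the service runs on a Unix host where SIGHUP is available
    # and this is called from the main thread.
    signal.signal(signal.SIGHUP, cache.reload)
```

With something like this in place, whatever updates the schema checkout (puppet, scap, or anything else) would only need to send SIGHUP to the service afterwards; the process keeps running and requests are validated against the new schemas once the swap completes.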

@mobrovac Please file an access request so that the Services team can ssh to the eventbus machines, restart the service, and do deployments (cc @GWicke )

The administration of eventbus should be similar to eventlogging (the other service we run) in that there is a group of administrators (that includes services and analytics folks) that has sudo on the machines for some commands (not all).

+1. Probably should have a new group eventbus-admins.

Ottomata set the point value for this task to 8. Jul 21 2016, 4:22 PM
Milimetric triaged this task as Medium priority. Aug 8 2016, 4:53 PM