
Regression: "Unable to deliver event: 400: 0 out of 1 events were accepted."
Closed, Resolved | Public | PRODUCTION ERROR

Description

Spotted in the 1.28.0-wmf.11 deploy today, exclusively on test/mediawikiwiki. Digging down, the ultimate cause seems to be "Additional properties are not allowed (u'rev_by_bot' was unexpected)"
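For readers unfamiliar with this class of error: the message looks like what a JSON Schema validator reports when a schema forbids additional properties and the event carries a field the schema does not list. The sketch below is a simplified, stdlib-only stand-in for that one check, not the validator EventBus actually uses. Apart from `rev_by_bot`, the schema contents and field names are hypothetical.

```python
def unexpected_properties(schema, event):
    """Return the event keys that a strict schema would reject.

    Simplified: only handles additionalProperties set to the literal False,
    and ignores every other JSON Schema keyword.
    """
    if schema.get("additionalProperties", True) is not False:
        return []
    allowed = set(schema.get("properties", {}))
    return sorted(key for key in event if key not in allowed)


# Hypothetical schema as the service might still hold it in memory.
stale_schema = {
    "type": "object",
    "properties": {"rev_id": {}, "page_title": {}},
    "additionalProperties": False,
}

# Hypothetical updated schema that knows about the new field.
updated_schema = {
    "type": "object",
    "properties": {"rev_id": {}, "page_title": {}, "rev_by_bot": {}},
    "additionalProperties": False,
}

# Event as the newer producer code would emit it.
event = {"rev_id": 1, "page_title": "Main_Page", "rev_by_bot": False}

print(unexpected_properties(stale_schema, event))
print(unexpected_properties(updated_schema, event))
```

With the stale schema the new field is flagged; with the updated one nothing is, which matches the failure mode described here (producer ahead of the schema the service has loaded).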

Fairly spammy already so I'd like to block group1 rollout tomorrow pending a fix.

Logstash view for the first few hours of wmf.11 being on group0: https://logstash.wikimedia.org/goto/fcc51879ed52f7287c86c4e2c3daf6e0

Event Timeline

The problem might be that the schemas were not updated or the EventBus service was not restarted. I don't have access to the Event-Platform machine, so I am unable to verify.

Probably will be fixed in the EU morning.

cc @mobrovac @Ottomata

It looks like a schema change for eventbus was not actually properly deployed in time. We don't have access to those machines, so to fix this we need to find somebody with access rights for eventbus. Alternatively, we can roll back the change that introduced the extra attribute in the EventBus extension.

I am trying to reach @Ottomata, who has deploy access to the eventbus service. All roots should have the requisite access as well, but deploy documentation in https://wikitech.wikimedia.org/wiki/EventBus/Administration or the parent page looks rather thin.

I pinged @Ottomata per SMS and hangout. In the security channel, no roots were available to restart the eventbus service. Unless somebody with the rights intervenes, this will have to wait until @mobrovac wakes up.

Thanks for the quick diagnosis! It's not super urgent I don't think since we know what's going on, just needs to happen before tomorrow's deploy :)

Info from @Ottomata per SMS:

Schemas are cloned by puppet at either /etc or /srv event-schemas
Pretty sure if you push a new commit to event-schemas puppet will pull and then sighup eventbus to reload schemas
Logs are at /var/log/eventlogging-service-eventbus
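The reload-on-SIGHUP behaviour described in the SMS can be pictured roughly as follows. This is a minimal sketch of the general pattern (keep schemas in memory, re-read them when a HUP arrives), not the EventBus service's real code. `SchemaCache`, `DISK` and `load_from_disk` are made-up names, and the dictionary stands in for the schema checkout on disk.

```python
import os
import signal


class SchemaCache:
    """Holds schemas in memory and refreshes them only when told to."""

    def __init__(self, loader):
        self._loader = loader
        self.schemas = loader()

    def handle_sighup(self, signum, frame):
        # Signal handlers receive (signal number, current stack frame).
        self.schemas = self._loader()


# Stand-in for the schema checkout on disk (hypothetical contents).
DISK = {"mediawiki/revision_create/1": ["rev_id"]}


def load_from_disk():
    # Copy, so later changes on "disk" do not leak into the cache.
    return {name: list(fields) for name, fields in DISK.items()}


if __name__ == "__main__":
    cache = SchemaCache(load_from_disk)
    signal.signal(signal.SIGHUP, cache.handle_sighup)

    # Simulates a new schema commit landing on disk.
    DISK["mediawiki/revision_create/1"].append("rev_by_bot")
    print("before HUP:", cache.schemas)  # still the stale in-memory copy

    # Send ourselves a HUP, as an operator or puppet would.
    os.kill(os.getpid(), signal.SIGHUP)
    print("after HUP: ", cache.schemas)
```

The point of the sketch is the gap between the two prints: updating the files alone changes nothing for a running process until the signal (or a restart) makes it re-read them.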

Mentioned in SAL [2016-07-20T07:28:08Z] <elukey> restarting evenbus on kafka100[12] (T140848)

For the record, I am adding here what Marko and I did yesterday to manually force a SIGHUP to eventbus on kafka100[12].

ps aux | grep eventbus --> found leader process pid (Ss) --> kill -HUP pid
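The manual steps above can be sketched in Python as well. This is only an illustration: the sample `ps aux` output is invented (user names and command lines are hypothetical, the PID is borrowed from the log lines in this task), and `find_session_leader` is a made-up helper. It relies on the usual `ps aux` column layout, where a lowercase `s` in the STAT column marks a session leader.

```python
import os
import signal


def find_session_leader(ps_output, needle):
    """Return the PID of the first session leader whose command contains needle."""
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)  # the 11th field is the full command
        if len(fields) < 11:
            continue
        pid, stat, command = fields[1], fields[7], fields[10]
        if needle in command and "s" in stat:
            return int(pid)
    return None


def send_hup(pid):
    """Equivalent of `kill -HUP pid`."""
    os.kill(pid, signal.SIGHUP)


# Invented sample output; real column values will differ.
SAMPLE_PS = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
eventlog  1877  0.1  0.5 300000 40000 ?        Ss   Jul18   1:02 /usr/bin/python eventlogging-service-eventbus
eventlog  1901  0.3  0.6 310000 45000 ?        S    Jul18   2:10 /usr/bin/python eventlogging-service-eventbus
elukey    4242  0.0  0.0  12000   900 pts/0    S+   10:33   0:00 grep eventbus
"""

leader = find_session_leader(SAMPLE_PS, "eventbus")
print(leader)
# send_hup(leader) would then deliver the signal; left commented out on purpose.
```

Filtering on the STAT column also skips the `grep eventbus` process itself, which matches the search string but is not a session leader.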

Checked logs:

Jul 19 10:33:27 kafka1001 eventlogging-service-eventbus[1877]: (MainThread) Got SIGHUP, reloading topic_config and local schemas...
Jul 19 10:33:27 kafka1001 eventlogging-service-eventbus[1877]: (MainThread) Loading local schemas from /srv/event-schemas/jsonschema
Jul 19 10:33:27 kafka1001 eventlogging-service-eventbus[1877]: (MainThread) Loading local schemas from /srv/event-schemas/jsonschema
Jul 19 10:33:27 kafka1001 eventlogging-service-eventbus[1877]: (MainThread) Loading schema from file:///srv/event-schemas/jsonschema/test/event/1.yaml
Jul 19 10:33:27 kafka1001 eventlogging-service-eventbus[1877]: (MainThread) Loading schema from file:///srv/event-schemas/jsonschema/error/1.yaml
Jul 19 10:33:27 kafka1001 eventlogging-service-eventbus[1877]: (MainThread) Loading schema from file:///srv/event-schemas/jsonschema/resource_change/1.yaml

this one was the target ----^

Jul 19 10:33:27 kafka1001 eventlogging-service-eventbus[1877]: (MainThread) Loading schema from file:///srv/event-schemas/jsonschema/change-prop/retry/1.yaml
Jul 19 10:33:27 kafka1001 eventlogging-service-eventbus[1877]: (MainThread) Loading schema from file:///srv/event-schemas/jsonschema/change-prop/continue/1.yaml
Jul 19 10:33:27 kafka1001 eventlogging-service-eventbus[1877]: (MainThread) Loading schema from file:///srv/event-schemas/jsonschema/mediawiki/revision_create/1.yaml
Jul 19 10:33:27 kafka1001 eventlogging-service-eventbus[1877]: (MainThread) Loading schema from file:///srv/event-schemas/jsonschema/mediawiki/page_move/1.yaml
Jul 19 10:33:27 kafka1001 eventlogging-service-eventbus[1877]: (MainThread) Loading schema from file:///srv/event-schemas/jsonschema/mediawiki/page_delete/1.yaml
Jul 19 10:33:27 kafka1001 eventlogging-service-eventbus[1877]: (MainThread) Loading schema from file:///srv/event-schemas/jsonschema/mediawiki/page_restore/1.yaml
Jul 19 10:33:27 kafka1001 eventlogging-service-eventbus[1877]: (MainThread) Loading schema from file:///srv/event-schemas/jsonschema/mediawiki/user_block/1.yaml
Jul 19 10:33:27 kafka1001 eventlogging-service-eventbus[1877]: (MainThread) Loading schema from
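Checking by eye that a given schema shows up in the reload output is easy to get wrong. A small sketch of doing the same check programmatically is below. The two sample lines are copied from the log excerpt above; `loaded_schemas` and `was_reloaded` are hypothetical helper names, and the regular expression only assumes the "Loading schema from file://..." wording seen in these logs.

```python
import re

# Matches the path part of "Loading schema from file:///path/to/schema.yaml".
SCHEMA_LINE = re.compile(r"Loading schema from file://(\S+)")


def loaded_schemas(log_text):
    """Return every schema path mentioned in a reload log, in order."""
    return SCHEMA_LINE.findall(log_text)


def was_reloaded(log_text, schema_suffix):
    """True if some loaded schema path ends with schema_suffix."""
    return any(path.endswith(schema_suffix) for path in loaded_schemas(log_text))


SAMPLE_LOG = (
    "Jul 19 10:33:27 kafka1001 eventlogging-service-eventbus[1877]: (MainThread) "
    "Loading schema from file:///srv/event-schemas/jsonschema/resource_change/1.yaml\n"
    "Jul 19 10:33:27 kafka1001 eventlogging-service-eventbus[1877]: (MainThread) "
    "Loading schema from file:///srv/event-schemas/jsonschema/mediawiki/revision_create/1.yaml\n"
)

print(was_reloaded(SAMPLE_LOG, "mediawiki/revision_create/1.yaml"))
```

Note that this only confirms the file was re-read, not that the copy on disk already contains the new field.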

Today we restarted the eventbus HTTP proxy and the validation errors disappeared.

mobrovac claimed this task.
mobrovac added a project: User-mobrovac.

After the restart, there are no more events being rejected, so I'm declaring victory in this instance. I.e. the train deployment is safe to continue.

I apologise for the inconvenience in this matter. We have a bug in the process of updating the EventBus system, and we'll address it ASAP. I'll create follow-up tickets.

Aye yai yai, sorry yall! Thanks for responding. Will be back at work tomorrow and will check it out and try to fix.

Everything looks good to me now. Train un-halted. Thanks everyone!

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:10 PM