
Empty body in EventBus request
Closed, Resolved · Public · 5 Story Points

Description

From time to time I see the following logs in logstash under Event-Platform tag:

message: "Unable to deliver event: 400: Must provide body in request"
responce: {
  "body": ""
}

Looks like this log line is written from here. The reason is not clear; we need to investigate.

Details

Related Gerrit Patches:
mediawiki/extensions/EventBus : wmf/1.28.0-wmf.23: Add logging and check for empty JSON encoded body
mediawiki/extensions/EventBus : master: Don't set binary data for page property values
mediawiki/extensions/EventBus : wmf/1.29.0-wmf.1: Add logging and check for empty JSON encoded body
mediawiki/extensions/EventBus : master: Add logging and check for empty JSON encoded body
mediawiki/extensions/EventBus : wmf/1.28.0-wmf.23: Log more EventBus HTTP request/response context for HTTP errors
mediawiki/extensions/EventBus : wmf/1.29.0-wmf.1: Log more EventBus HTTP request/response context for HTTP errors
mediawiki/extensions/EventBus : master: Log more EventBus HTTP request/response context for HTTP errors

Event Timeline

Restricted Application added a subscriber: Aklapper. · Oct 14 2016, 10:06 PM

tcpdumped some headers:

14:23:07.612856 IP mw1168.eqiad.wmnet.47056 > eventbus.svc.eqiad.wmnet.8085: Flags [P.], seq 643239481:643239656, ack 3060863324, win 58, options [nop,nop,TS val 235210544 ecr 220201869], length 175
POST /v1/events HTTP/1.1
Host: eventbus.svc.eqiad.wmnet:8085
Accept: */*
content-type: application/json
user-agent: wikimedia/multi-http-client v1.0
Content-Length: 0


14:23:07.614114 IP eventbus.svc.eqiad.wmnet.8085 > mw1168.eqiad.wmnet.47056: Flags [P.], seq 1:170, ack 175, win 59, options [nop,nop,TS val 220201869 ecr 235210544], length 169
HTTP/1.1 400 Must provide body in request.
Date: Mon, 31 Oct 2016 14:23:07 GMT
Content-Length: 0
Content-Type: text/html; charset=UTF-8
Server: TornadoServer/4.2

Something from MediaWiki is sending these pretty regularly.

Most of these are from commonswiki, see logstash

The rate of these seems to have been growing steadily over the last month: https://logstash.wikimedia.org/goto/ac2d10ece7027df2a14fd15732078256

Also of note, these seem to be from

/rpc/RunJobs.php?wiki=commonswiki&type=refreshLinksPrioritized&maxtime=60&maxmem=300M

Change 319407 had a related patch set uploaded (by Ottomata):
Log more EventBus HTTP request/response context for HTTP errors

https://gerrit.wikimedia.org/r/319407

Ottomata triaged this task as Medium priority. Nov 2 2016, 9:02 PM
Ottomata added a project: Analytics-Kanban.
Ottomata set the point value for this task to 5.

This seems to have gotten noisier with MW-1.29-release (WMF-deploy-2016-11-01_(1.29.0-wmf.1))

That's strange; the last change deployed to the extension came with wmf/1.28.0-wmf.22.

Change 319407 merged by jenkins-bot:
Log more EventBus HTTP request/response context for HTTP errors

https://gerrit.wikimedia.org/r/319407

Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board. Nov 3 2016, 3:04 PM

Change 319586 had a related patch set uploaded (by Ottomata):
Log more EventBus HTTP request/response context for HTTP errors

https://gerrit.wikimedia.org/r/319586

Change 319587 had a related patch set uploaded (by Ottomata):
Log more EventBus HTTP request/response context for HTTP errors

https://gerrit.wikimedia.org/r/319587

Change 319587 merged by jenkins-bot:
Log more EventBus HTTP request/response context for HTTP errors

https://gerrit.wikimedia.org/r/319587

Change 319586 merged by jenkins-bot:
Log more EventBus HTTP request/response context for HTTP errors

https://gerrit.wikimedia.org/r/319586

Mentioned in SAL (#wikimedia-operations) [2016-11-03T18:19:51Z] <thcipriani@tin> Synchronized php-1.29.0-wmf.1/extensions/EventBus/EventBus.php: SWAT: [[gerrit:319587|Log more EventBus HTTP request/response context for HTTP errors (T148251)]] (duration: 00m 49s)

Mentioned in SAL (#wikimedia-operations) [2016-11-03T18:21:16Z] <thcipriani@tin> Synchronized php-1.28.0-wmf.23/extensions/EventBus/EventBus.php: SWAT: [[gerrit:319586|Log more EventBus HTTP request/response context for HTTP errors (T148251)]] (duration: 00m 52s)

Change 319646 had a related patch set uploaded (by Ottomata):
Add logging and check for empty JSON encoded body

https://gerrit.wikimedia.org/r/319646

I'm going to be monitoring this closely when I deploy MW-1.29-release (WMF-deploy-2016-11-01_(1.29.0-wmf.1)) to group2; if the frequency increases, it could cause a huge logging spike and may need further attention.

Unless the bug itself starts happening more often, this logging change shouldn't increase the number of logs. It may increase the size of the log $context, as I'm including the stack trace. We can remove this if you are worried.

Change 319646 merged by jenkins-bot:
Add logging and check for empty JSON encoded body

https://gerrit.wikimedia.org/r/319646

@Ottomata no, I'm not worried about what you've been doing. I'm only worried about the increase in frequency that I noticed yesterday. Something elsewhere started calling this codepath more often (from job queues, as far as I could see).

Ah, yeah. @Pchelolo mayyybe can comment more, but this error is connected to the LinksUpdateComplete hook, which I think is fired by a job in the job queue. If there are more links to update, it will happen more often.

We've recently deployed the page-properties-change event that's fired from the JobQueue. That happened around October 5 - that must be the major contributor to the increase in the error rate.

Change 319660 had a related patch set uploaded (by Ottomata):
Add logging and check for empty JSON encoded body

https://gerrit.wikimedia.org/r/319660

Change 319661 had a related patch set uploaded (by Ottomata):
Add logging and check for empty JSON encoded body

https://gerrit.wikimedia.org/r/319661

Change 319661 merged by jenkins-bot:
Add logging and check for empty JSON encoded body

https://gerrit.wikimedia.org/r/319661

Mentioned in SAL (#wikimedia-operations) [2016-11-03T23:23:33Z] <thcipriani@tin> Synchronized php-1.29.0-wmf.1/extensions/EventBus/EventBus.php: SWAT: [[gerrit:319661|Add logging and check for empty JSON encoded body (T148251)]] (duration: 00m 47s)

Hm, so ok. Stuff has been SWATed. The 400 errors are no longer happening, but I would expect the new direct error logging to surface a JSON-encoding error in logstash. Nothing yet. Hm.

Will wait a bit and see I guess...

Pchelolo added a comment. Edited Nov 4 2016, 10:40 PM

Some new log entries appeared in the logs. They're indeed related to json_encode errors when serialising page_properties to JSON. There are errors for graph_spec, templatedata and other page properties. Investigating.

The logs suggest that the problem is JSON_ERROR_UTF8, so presumably the page_prop value contains binary data that is not valid UTF-8.

In the Kartographer page property case, its value is gzipped JSON, so it's indeed binary data being incorrectly treated as UTF-8.

So, ya, here's a minimal example that illustrates what's happening:

// The base64 payload decodes to the gzipped string "{}": binary, not valid UTF-8.
$data = base64_decode( 'H4sIAAAAAAAAA6uuBQBDv6ajAgAAAA==' );
echo json_encode( $data );  // prints nothing: json_encode() returns false
echo json_last_error();     // prints 5, i.e. JSON_ERROR_UTF8

The data we're trying to JSON-encode is the gzipped {} string, and it looks like PHP's json_encode() doesn't support binary content, only UTF-8 strings. Not sure how to proceed here. On one hand it would be nice to get rid of these errors; on the other hand we don't really care about cases like this. However, there are examples in the logs where multiple page_props changed at the same time and event delivery failed because of one of the properties, while another one was relevant to us (like page_image).

Pchelolo added a comment. Edited Nov 4 2016, 11:29 PM

I think the best option to proceed is to set the JSON_PARTIAL_OUTPUT_ON_ERROR flag and simply ignore the binary properties. That way we at least wouldn't lose the relevant events.

@Ottomata @mobrovac What's your opinion here?

This is a good avenue to pursue, but we need to figure out what to do if/when the field in question is a required one, as that would still result in a rejected event.

This definitely implies that we have to json_encode() every event separately, not the whole array, before sending them. On the event level, perhaps we could introduce a field missing_fields which would list the fields that were originally present but could not be encoded. That way the HTTP proxy service could inspect it and not reject the event if a specific field does not have a value.
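A minimal sketch of what per-event encoding could look like (hypothetical helper, not the actual extension code):

function encodeEventsSeparately( array $events ) {
  $encoded = [];
  foreach ( $events as $event ) {
    $json = json_encode( $event );
    if ( $json === false ) {
      // Skip (or mark) just this event instead of failing the whole batch.
      error_log( 'Event failed JSON encoding: ' . json_last_error_msg() );
      continue;
    }
    $encoded[] = $json;
  }
  return $encoded;
}

The request body could then be built as '[' . implode( ',', $encoded ) . ']', so a single unencodable event no longer turns the whole POST body into an empty string.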

HmmM, interesting!

Why are we getting gzipped data? Can we just manually exclude that data? Which field is this in?

JSON encoding each event separately is probably a good idea.

perhaps we could introduce a field missing_fields which would list the fields that were originally present but could not be encoded

Could we just set the field to a null or some other value that would indicate that the field was not encodable? Then we wouldn't need any special logic on the validation side.

Could we just set the field to a null or some other value

Oh, will JSON_PARTIAL_OUTPUT_ON_ERROR do this?

Could we just set the field to a null or some other value

Oh, will JSON_PARTIAL_OUTPUT_ON_ERROR do this?

Yes, that's exactly what it does.
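For illustration, a minimal demo of the flag's behaviour (made-up values):

$event = [
  'good' => 'page_image',
  'bad'  => "\xB1\x31", // invalid UTF-8 byte sequence
];
echo json_encode( $event, JSON_PARTIAL_OUTPUT_ON_ERROR );
// prints {"good":"page_image","bad":null}; the unencodable value is replaced with null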

So, as I understand it, the plan here is the following:

  • Figure out how to encode each event separately and send them separately to the Event-Platform service. A simpler option is to basically make one request per event; it wouldn't create any overhead in production since we don't normally create multiple events from a single hook. Any other ideas?
  • Update all schemas to have some errors or warnings enum array and set a json_serialization_failed property in it? That seems more generic than just a boolean missing_fields property. Unfortunately, we wouldn't be able to find out which properties exactly are missing, because PHP doesn't have any way of giving us this info. We could potentially decode the event back on error and diff it with the original event array, but I'm not sure it's worth the effort.
  • Update the Event-Platform service to handle null values on missing fields. I'd like to make it a bit more intelligent than just allowing nulls everywhere when there's a sign of failed serialisation; that would basically destroy the whole idea of schema validation. Could we perhaps whitelist in the schemas the properties where serialisation could fail? Explicitly allow nulls (see the schema sketch below)?
  • Update the extension to use the JSON_PARTIAL_OUTPUT_ON_ERROR flag and set the serialisation failure flag.

Does this sound like a plan?
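A hedged sketch of what explicitly allowing nulls for a whitelisted property might look like in a schema (illustrative fragment only, not the actual page-properties-change schema):

"added_properties": {
  "type": "object",
  "additionalProperties": {
    "type": [ "string", "number", "boolean", "null" ]
  }
}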

Hm, you know, the more I think about it, the more I wonder if we should consider events that fail JSON encoding as just invalid from the start. As long as we don't block unrelated events (by encoding each event individually), I'd be ok with this. Unless missing the events with the binary data would be problematic for you guys. Which field on page-properties-change is this in? Could we just add a check in the hook for binary data in this field, before the JSON encoding happens?

@Ottomata it's in the added_properties or removed_properties. The only possible problem is when two different properties have been changed during one run of the UpdateLinks job. In that case we might care about a non-binary property, while event sending would fail because of a binary property we don't care about (and probably never will, because we can't deserialise it in ChangeProp anyway).

Since we know the fields that might have binary data, can we just check for it while building the event in the hook, and then overwrite it with something that indicates that it has been removed? Something like:

if (is_binary($attrs['field'])) {
  $attrs['field'] = '_BINARY_DATA_REMOVED_';
}

?
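(For reference, is_binary() above is pseudocode, not a PHP builtin; one hedged way to implement the check, assuming "binary" means "not valid UTF-8":)

function is_binary( $value ) {
  // Treat any string that is not well-formed UTF-8 as binary.
  return is_string( $value ) && !mb_check_encoding( $value, 'UTF-8' );
}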

Change 320264 had a related patch set uploaded (by Ppchelko):
Don't set binary data for page property values

https://gerrit.wikimedia.org/r/320264

I actually like the idea of replacing those values; I've created a patch, let's discuss on the patch.

Since we know the fields that might have binary data, can we just check for it while building the event in the hook, and then overwrite it with something that indicates that it has been removed? Something like:

if (is_binary($attrs['field'])) {
  $attrs['field'] = '_BINARY_DATA_REMOVED_';
}

?

How about base64-encoding the field instead of removing the data entirely?

Not a bad idea either! +1

Maybe with some little prefix to indicate that it's base64? Like base64:axssdcasdcascsc?

Or an extra meta field indicating if and which fields have been encoded?

@mobrovac eh... Adding this generic extra field sounds less nice than following the base64: prefixing path. Prefixing is a pretty common thing to do, I think.

Indeed. Let's just go with the prefix for now. We can implement other strategies at a later point easily enough.

I've updated https://gerrit.wikimedia.org/r/320264 to send a normal data URI instead of the binary data.
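Presumably something along these lines (a sketch only; the exact MIME type and guard condition are assumptions, see the patch for the real code):

if ( is_string( $value ) && !mb_check_encoding( $value, 'UTF-8' ) ) {
  // Replace raw binary with a data URI so the event stays JSON-encodable.
  $value = 'data:application/octet-stream;base64,' . base64_encode( $value );
}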

Change 320264 merged by Mobrovac:
Don't set binary data for page property values

https://gerrit.wikimedia.org/r/320264

Change 319660 abandoned by Ottomata:
Add logging and check for empty JSON encoded body

Reason:
Unneeded on this branch

https://gerrit.wikimedia.org/r/319660

Pchelolo closed this task as Resolved. Nov 14 2016, 7:07 PM

Fix deployed; no occurrences of this log error over the last week. Resolving.