
Malformed HTTP message in EventBus logs
Closed, Resolved · Public · 3 Estimated Story Points

Description

There's a message "Malformed HTTP message from 10.64.16.68: Content-Length too long" in the Event-Platform logs. Apparently some jobs are even larger than Tornado's max_buffer_size. We should increase the buffer size to match the Kafka max message size.

Event Timeline

Restricted Application added a subscriber: Aklapper.
fdans set the point value for this task to 3.

Pretty sure this isn't caused by Tornado's max_buffer_size. From what I can tell, this defaults to 100M.
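
For context, that limit is a constructor argument on Tornado's HTTPServer (100 MB by default). A minimal sketch, assuming a bare Tornado app rather than the actual eventlogging-service wiring:

import tornado.httpserver
import tornado.ioloop
import tornado.web

class EventsHandler(tornado.web.RequestHandler):
    def post(self):
        # Requests whose Content-Length exceeds max_buffer_size never reach
        # this handler; Tornado logs "Malformed HTTP message ...:
        # Content-Length too long" and drops the connection.
        self.write('{}')

app = tornado.web.Application([(r'/v1/events', EventsHandler)])
# max_buffer_size defaults to 100 MB; 200 MB here is the hypothetical
# temporary bump discussed further down in this thread.
server = tornado.httpserver.HTTPServer(app, max_buffer_size=200 * 1024 * 1024)
server.listen(8085)
tornado.ioloop.IOLoop.current().start()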

Also, I just tried submitting some really large files to the eventbus service on Vagrant:

# Find some big files lying around on vagrant:
vagrant@mediawiki-vagrant:/vagrant/srv/eventlogging/config$ sudo find / -type f -size +100M -exec ls -lh {} \;
# ...
-r--r--r-- 1 vagrant www-data 151M Jun  9 14:19 /vagrant/cache/composer/vcs/https---github.com-squizlabs-PHP-CodeSniffer.git/objects/pack/pack-c6823f8c1a9e90cc560cc187bc0a2bda82cdf583.pack
-r--r--r-- 1 vagrant www-data 239M Jun  7 19:35 /vagrant/mediawiki/.git/objects/pack/pack-6713204c27d1ef90b83214cae55c60fe68910254.pack
# Send a 151M file:
curl -H 'Content-Type: application/json' -X POST -d@/vagrant/cache/composer/vcs/https---github.com-squizlabs-PHP-CodeSniffer.git/objects/pack/pack-c6823f8c1a9e90cc560cc187bc0a2bda82cdf583.pack  localhost:8085/v1/events

# eventbus logs after 151M file
Oct 26 19:07:39 mediawiki-vagrant eventlogging-service[10862]: 2017-10-26 19:07:39,625 [10862] (MainThread) UnicodeError while utf-8 decoding
Oct 26 19:07:39 mediawiki-vagrant eventlogging-service[10862]: 2017-10-26 19:07:39,626 [10862] (MainThread) 400 POST /v1/events (::1) 134.67ms
# Send a 239M file:
curl -H 'Content-Type: application/json' -X POST -d@/vagrant/mediawiki/.git/objects/pack/pack-6713204c27d1ef90b83214cae55c60fe68910254.pack  localhost:8085/v1/events

# eventbus logs after 239M file
Oct 26 19:08:44 mediawiki-vagrant eventlogging-service[10862]: 2017-10-26 19:08:44,804 [10862] (MainThread) UnicodeError while utf-8 decoding
Oct 26 19:08:44 mediawiki-vagrant eventlogging-service[10862]: 2017-10-26 19:08:44,805 [10862] (MainThread) 400 POST /v1/events (::1) 134.45ms
Oct 26 19:09:12 mediawiki-vagrant eventlogging-service[10862]: 2017-10-26 19:09:12,307 [10862] (MainThread) Malformed HTTP message from ::1: Content-Length too long

So, I'm not totally sure what this value is in prod, but our Kafka max message size in prod is 4K, way under whatever Tornado is accepting.

So, we have an event that's > 100 MB of serialized JSON... Wonderful. Do you think it's possible to dump it somewhere on the filesystem so we can find out what it is? It's probably creating issues in the current JobQueue and MediaWiki as well; it's even more abnormal than the previously found 17 MB events. Perhaps temporarily increase max_buffer_size to something like 200 MB so that Tornado can at least accept the event and log it?

Is it possible to get this from MediaWiki instead? If EventBus returns a 400, MediaWiki logs the error somewhere, right? Does it possibly send the event that failed to logstash?

@Ottomata, pushing > 100 megs into logstash? I don't think that's a good idea.

Actually, digging a bit deeper: we are indeed trying to log these giant events, but that makes PHP fail with an out-of-memory error (T179280). I will fix that bug, but it means that trying to log these events on the MediaWiki side is practically impossible.

Pushing > 100 megs to logstash is infeasible even without that bug, and since it crashes in the logging code, even dumping it locally to a file is impossible.

Right now I'm planning to change that logger to only log the meta property of each event and omit the "meat" of the event; let's see where that brings us.
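
Conceptually (the actual change lives in MediaWiki's PHP EventBus extension, so this Python sketch is only illustrative), the idea is to keep just the event's meta block when logging a failure:

def summarize_for_log(event):
    # Keep only the small 'meta' block (id, topic, dt, request_id, ...)
    # so a 100+ MB payload never gets serialized into the error log.
    return {'meta': event.get('meta', {})}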

Obviously, we need to get rid of such huge jobs, so one temporary solution that ought to be safe enough is to use the EventBus service to log to a file any message crossing a reasonable threshold (I'm thinking a couple of MBs at most) and then work on resolving individual jobs' problems as they arise.

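As a hedged sketch of that threshold check inside the service (the function name, dump path and 2 MB limit below are made up for illustration):

import json

MAX_REASONABLE_EVENT_BYTES = 2 * 1024 * 1024  # hypothetical ~2 MB threshold

def maybe_dump_oversize_event(event, dump_path='/var/log/eventbus/oversize-events.log'):
    body = json.dumps(event)
    if len(body) <= MAX_REASONABLE_EVENT_BYTES:
        return False
    # Append the offending event to a local file instead of producing it,
    # so we can inspect which job is generating it.
    with open(dump_path, 'a') as f:
        f.write(body + '\n')
    return True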

I'm not strongly opposed to doing this, but I think logging this from MediaWiki (even if just the meta) is more appropriate. EventBus does log invalid/erroneous events to a log file, but this event is denied by Tornado before it ever reaches EventBus code. In order to accept it, we'd likely have to raise the request size limit in Tornado, and then add specific logic in EventBus to detect a huge event and log it locally instead of producing it to Kafka. I'd rather not add that kind of debugging-specific logic to EventBus in prod.

logging this from MediaWiki (even if just the meta) is more appropriate.

Let's see where it gets us when my change is deployed (please have a look, would be nice to get it on this week's train). If it's not enough we will continue digging.

add specific logic in EventBus to detect a huge event and log it locally instead of producing it to Kafka. I'd rather not add that kind of debugging-specific logic to EventBus in prod.

We don't actually need any additional logic: the gargantuan event will be rejected by Kafka with a message-size-too-large error (we limit messages at 4 megs), and then our current failed-events logging code will kick in.
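
For illustration only (the service has its own producer wiring): with kafka-python, a record larger than max_request_size already fails at send() time with MessageSizeTooLargeError, and the except branch is where the existing failed-events logging would take over.

import json
import logging
from kafka import KafkaProducer
from kafka.errors import MessageSizeTooLargeError

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    max_request_size=4 * 1024 * 1024,  # mirror the ~4 MB prod limit
)

def produce_or_log(topic, event):
    try:
        producer.send(topic, json.dumps(event).encode('utf-8'))
    except MessageSizeTooLargeError:
        # This is where the existing failed-events logging would kick in;
        # log only the meta block rather than the full payload.
        logging.error('Oversize event rejected; meta=%s', event.get('meta'))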

We don't actually need any additional logic

Oh hm, that is probably true.

How often do these events happen? I don't want to fill up local disks with giant event error logs.

Let's see where it gets us when my change is deployed

Agree, let's try this first and see what we learn before modifying the Tornado setting.

I think we can close this. Reopen if not.