Page MenuHomePhabricator

EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?)
Closed, ResolvedPublic

Description

Currently in EL, the MultimediaViewerNetworkPerformance schema receives a remarkable number of events that do not pass the validation. Around 600 per hour (in high activity periods).

The invalid logs seem to be complete and correct, all the fields in the schema are there and are from the expected type. The only thing I could spot they have in common is the type. 99,9% of them are of type 'image'.

The correct events' distribution is the following (in one hour of traffic):

1417 type": "imageinfo"
 727 type": "image"
 432 type": "filerepoinfo"
 124 type": "userinfo"
  38 type": "thumbnailinfo"

So such a high percentage of 'image' errors is not expected. There must be a specific problem with those kind of events.

Example (all invalid logs are truncated to 1k, but the event part is complete):

eventlogging_processor-client-side-events.log.1.gz:2015-03-01 22:00:00,999 (MainThread) Unable to decode: ?%7B%22event%22%3A%7B%22type%22%3A%22image%22%2C%22contentHost%22%3A%22pl.wikipedia.org%22%2C%22isHttps%22%3Afalse%2C%22total%22%3A492%2C%22urlHost%22%3A%22upload.wikimedia.org%22%2C%22status%22%3A200%2C%22XCache%22%3A%22cp1063%20miss%20(0)%2C%20cp3016%20hit%20(1)%2C%20cp3018%20frontend%20miss%20(0)%22%2C%22varnish1%22%3A%22cp1063%22%2C%22varnish1hits%22%3A0%2C%22varnish2%22%3A%22cp3016%22%2C%22varnish2hits%22%3A1%2C%22varnish3%22%3A%22cp3018%22%2C%22varnish3hits%22%3A0%2C%22XVarnish%22%3A%22265764709%2C%20583884279%20576752145%2C%204160384575%22%2C%22contentLength%22%3A111608%2C%22age%22%3A3968%2C%22timestamp%22%3A1425247200%2C%22lastModified%22%3A1383480710%2C%22redirect%22%3A0%2C%22dns%22%3A0%2C%22tcp%22%3A38%2C%22request%22%3A221%2C%22response%22%3A205%2C%22cache%22%3A28%2C%22uploadTimestamp%22%3A%2220070823000000%22%2C%22imageWidth%22%3A1024%2C%22country%22%3A%22PL%22%7D%2C%22revision%22%3A11030254%2C%22schema%22%3A%22MultimediaViewerNetworkPerformance%22%2C%22webHost%22%3A%22pl.wikipedia.org%2        cp3020.esams.wikimedia.org      66124967        2015-03-01T22:00:00     77.255.103.93   Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36

Correspondent decoded json:

{
  "type": "image",
  "contentHost": "pl.wikipedia.org",
  "isHttps": false,
  "total": 492,
  "urlHost": "upload.wikimedia.org",
  "status": 200,
  "XCache": "cp1063 miss (0), cp3016 hit (1), cp3018 frontend miss (0)",
  "varnish1": "cp1063",
  "varnish1hits": 0,
  "varnish2": "cp3016",
  "varnish2hits": 1,
  "varnish3": "cp3018",
  "varnish3hits": 0,
  "XVarnish": "265764709, 583884279 576752145, 4160384575",
  "contentLength": 111608,
  "age": 3968,
  "timestamp": 1425247200,
  "lastModified": 1383480710,
  "redirect": 0,
  "dns": 0,
  "tcp": 38,
  "request": 221,
  "response": 205,
  "cache": 28,
  "uploadTimestamp": "20070823000000",
  "imageWidth": 1024,
  "country": "PL"
}

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
mforns raised the priority of this task from to Needs Triage.Mar 3 2015, 12:41 AM
mforns added a project: Analytics-EventLogging.
mforns added a subscriber: mforns.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 3 2015, 12:41 AM
Tgr added a subscriber: Tgr.Mar 3 2015, 1:06 AM

Image is a CORS request, all others (with the possible exception of userinfo) are local AJAX requests, so the headers are different due to CORS restrictions (and different across browsers, depending on their level of CORS support).

Still, this seems perfectly valid;

var data = {...};
mw.loader.using('schema.MultimediaViewerNetworkPerformance', function () {
    mw.eventLog.validate(data, mw.eventLog.schemas.MultimediaViewerNetworkPerformance.schema);
} );

yields no errors.

Nuria set Security to None.
Nuria added a subscriber: Nuria.Mar 3 2015, 1:22 AM

I "suspect" this is URL decoding not working as well as it should.

mforns added a comment.Mar 3 2015, 9:47 PM

On the user agent:

tl:dr
I think the errors are not bound to any particular browser.

Long version:
From a total of 122k errors found in the last week...

Chrome has a share of 62k, the most common version being v40.

19678 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36
 8552 Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36
 7277 Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36
 4116 Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36
 1516 Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36
 1125 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36 OPR/27.0.1689.76
 1048 Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36
 1015 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36
  659 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.111 Safari/537.36
  650 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 YaBrowser/14.12.2125.10034 Safari/537.36

Firefox has 34k, mainly from v35 and v36:

8751 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0
4750 Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0
2940 Mozilla/5.0 (Windows NT 6.3; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0
2735 Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0
2648 Mozilla/5.0 (Windows NT 6.1; rv:35.0) Gecko/20100101 Firefox/35.0
1668 Mozilla/5.0 (Windows NT 6.3; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0
1307 Mozilla/5.0 (Windows NT 6.1; rv:36.0) Gecko/20100101 Firefox/36.0
1306 Mozilla/5.0 (Windows NT 5.1; rv:36.0) Gecko/20100101 Firefox/36.0
 797 Mozilla/5.0 (Windows NT 6.0; rv:35.0) Gecko/20100101 Firefox/35.0
 488 Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:35.0) Gecko/20100101 Firefox/35.0

MSIE has 3k, mostly v10:

1113 Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)
 878 Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)
 253 Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)
  44 Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)
  43 Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0; ASU2JS)
  35 Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0; MAARJS)
  27 Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Win64; x64; Trident/6.0)
  26 Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0; MALNJS)
  24 Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0; MASMJS)
  22 Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0; MATMJS)

These 3 browser only get ~100k from 122k, so it seems to me that the proportions are roughly analog to the browser's market share. So this makes me think that the validation errors are not bound to any particular browser.

Nuria added a comment.Mar 3 2015, 10:17 PM

These 3 browser only get ~100k from 122k, so it seems to me that the proportions are roughly analog to the browser's market share.

We definitely have more traffic from IE11 than 10 so I wouldn't say this represents market share.

FYI, IE10 does not support sendBeacon, the other two do.

Are we sure events are not getting "cut" on the raw stream? (i e. they have ";" at the end on the raw stream)

I managed to reproduce the error from the browser.

Setting this beforehand in the browser's console disables sampling (Thanks Gergo):
mw.config.get('wgMultimediaViewer').networkPerformanceSamplingFactor = 1.0;

When loading or paginating images, events of type='image' are sent.
I took one and inserted the keyword 'debugTest' and resent it.

After that, it appeared as a validation error by grepping 'debugTest' in the eventlogging_processor-client-side-events.log.

The thing is, the string sent by the browser is parsed without problems by python's urllib.unquote,
which is the method EL uses for decoding URLs.

So, still no conclusion...

Sorry, I forgot the examples:

Here is the url from the browser:

https://bits.wikimedia.org/event.gif?%7B%22event%22%3A%7B%22type%22%3A%22image%22%2C%22contentHost%22%3A%22debugTest%22%2C%22isHttps%22%3Atrue%2C%22total%22%3A682%2C%22urlHost%22%3A%22upload.wikimedia.org%22%2C%22status%22%3A200%2C%22XCache%22%3A%22cp1062%20miss%20(0)%2C%20cp3006%20hit%20(2)%2C%20cp3004%20frontend%20miss%20(0)%22%2C%22varnish1%22%3A%22cp1062%22%2C%22varnish1hits%22%3A0%2C%22varnish2%22%3A%22cp3006%22%2C%22varnish2hits%22%3A2%2C%22varnish3%22%3A%22cp3004%22%2C%22varnish3hits%22%3A0%2C%22XVarnish%22%3A%222477278816%2C%201301462374%201253852873%2C%203827936402%22%2C%22contentLength%22%3A38787%2C%22age%22%3A26185%2C%22timestamp%22%3A1425424440%2C%22lastModified%22%3A1382470976%2C%22redirect%22%3A0%2C%22dns%22%3A56%2C%22tcp%22%3A285%2C%22request%22%3A105%2C%22response%22%3A236%2C%22cache%22%3A1%2C%22uploadTimestamp%22%3A%2220120628000000%22%2C%22imageWidth%22%3A640%2C%22country%22%3A%22BR%22%7D%2C%22revision%22%3A11030254%2C%22schema%22%3A%22MultimediaViewerNetworkPerformance%22%2C%22webHost%22%3A%22debugTest%22%2C%22wiki%22%3A%22enwiki%22%7D;

And here is the validation error log:

2015-03-03 23:25:14,362 (MainThread) Unable to decode: ?%7B%22event%22%3A%7B%22type%22%3A%22image%22%2C%22contentHost%22%3A%22debugTest%22%2C%22isHttps%22%3Atrue%2C%22total%22%3A682%2C%22urlHost%22%3A%22upload.wikimedia.org%22%2C%22status%22%3A200%2C%22XCache%22%3A%22cp1062%20miss%20(0)%2C%20cp3006%20hit%20(2)%2C%20cp3004%20frontend%20miss%20(0)%22%2C%22varnish1%22%3A%22cp1062%22%2C%22varnish1hits%22%3A0%2C%22varnish2%22%3A%22cp3006%22%2C%22varnish2hits%22%3A2%2C%22varnish3%22%3A%22cp3004%22%2C%22varnish3hits%22%3A0%2C%22XVarnish%22%3A%222477278816%2C%201301462374%201253852873%2C%203827936402%22%2C%22contentLength%22%3A38787%2C%22age%22%3A26185%2C%22timestamp%22%3A1425424440%2C%22lastModified%22%3A1382470976%2C%22redirect%22%3A0%2C%22dns%22%3A56%2C%22tcp%22%3A285%2C%22request%22%3A105%2C%22response%22%3A236%2C%22cache%22%3A1%2C%22uploadTimestamp%22%3A%2220120628000000%22%2C%22imageWidth%22%3A640%2C%22country%22%3A%22BR%22%7D%2C%22revision%22%3A11030254%2C%22schema%22%3A%22MultimediaViewerNetworkPerformance%22%2C%22webHost%22%3A%22debugTest%22%2C%22wiki%	cp3019.esams.wikimedia.org	77219408	2015-03-03T23:25:14	91.198.174.89	Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36
Nuria added a comment.Mar 5 2015, 12:11 AM

Any updates on this?

@Nuria,

I reproduced the MultimediaViewerNetworkPerformance:image logs from Beta Labs.
However, I did not find any piece of code breaking within the forwarder.
The forwarder does not parse nor unquote anything, and seems to be fine handling unicode.

So I think the error is earlier in the pipeline. I would say the string is coming without semicolon from the udp stream.

At that point, I did not know clearly where to go next, so I switched to another task, hoping you might help me tomorrow.

Nuria added a comment.Mar 5 2015, 12:36 AM

Ok, let's talk tomorrow.

mforns added a comment.Mar 5 2015, 8:17 PM

I've tested again in Beta Labs, this time listening to the udp logs coming from varnish, the logs I sent from the browser came in the following form:

?%7B%22event%22%3A%7B%22type%22%3A%22image%22%2C%22contentHost%22%3A%22en.wikipedia.org%22%2C%22isHttps%22%3Atrue%2C%22total%22%3A583%2C%22urlHost%22%3A%22upload.wikimedia.org%22%2C%22status%22%3A200%2C%22XCache%22%3A%22cp1064%20hit%20(25)%2C%20cp3016%20hit%20(3)%2C%20cp3010%20frontend%20miss%20(0)%22%2C%22varnish1%22%3A%22cp1064%22%2C%22varnish1hits%22%3A25%2C%22varnish2%22%3A%22cp3016%22%2C%22varnish2hits%22%3A3%2C%22varnish3%22%3A%22cp3010%22%2C%22varnish3hits%22%3A0%2C%22XVarnish%22%3A%22598562358%20567337670%2C%201059408435%201051057839%2C%203889109054%22%2C%22contentLength%22%3A32166%2C%22age%22%3A26094%2C%22timestamp%22%3A1425572831%2C%22lastModified%22%3A1383188919%2C%22redirect%22%3A0%2C%22dns%22%3A67%2C%22tcp%22%3A255%2C%22request%22%3A92%2C%22response%22%3A168%2C%22cache%22%3A1%2C%22uploadTimestamp%22%3A%2220050406000000%22%2C%22imageWidth%22%3A320%2C%22country%22%3A%22BR%22%7D%2C%22revision%22%3A11030254%2C%22schema%22%3A%22MultimediaViewerNetworkPerformance%22%2C%22webHost%22%3A%22en.wi  deploymen?%7B%22event%22%3A%7B%22version%22%3A1%2C%22action%22%3A%22ready%22%2C%22editor%22%3A%22visualeditor%22%2C%22platform%22%3A%22desktop%22%2C%22integration%22%3A%22page%22%2C%22page.id%22%3A43959%2C%22page.title%22%3A%22User%3AEtonkovidova%22%2C%22page.ns%22%3A2%2C%22page.revid%22%3A204069%2C%22page.length%22%3A-1%2C%22editingSessionId%22%3A%228d53a536bee5d7c4%22%2C%22user.id%22%3A4462%2C%22user.editCount%22%3A836%2C%22mediawiki.version%22%3A%221.25alpha%22%2C%22action.ready.timing%22%3A1557%7D%2C%22revision%22%3A11319708%2C%22schema%22%3A%22Edit%22%2C%22webHost%22%3A%22en.wikipedia.beta.wmflabs.org%22%2C%22wiki%22%3A%22enwiki%22%7D;        deployment-cache-bits01.eqiad.wmflabs   251032  2015-03-05T19:15:56     198.73.209.1    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.111 Safari/537.36

If you Ctrl+F for 'deploymen', you'll see that the MultimediaViewer log is truncated before it ends (after 'en.wi') and then some whitespace and the word 'deploymen' and imediately after that the next log begins (without end of line in between!).

Another example with a shorter log from the browser:

?%7B%22event%22%3A%7B%22type%22%3A%22image%22%2C%22isHttps%22%3Atrue%2C%22total%22%3A583%2C%22urlHost%22%3A%22upload.wikimedia.org%22%2C%22status%22%3A200%2C%22XCache%22%3A%22cp1064%20hit%20(25)%2C%20cp3016%20hit%20(3)%2C%20cp3010%20frontend%20miss%20(0)%22%2C%22varnish1%22%3A%22cp1064%22%2C%22varnish1hits%22%3A25%2C%22varnish2%22%3A%22cp3016%22%2C%22varnish2hits%22%3A3%2C%22varnish3%22%3A%22cp3010%22%2C%22varnish3hits%22%3A0%2C%22XVarnish%22%3A%22598562358%20567337670%2C%201059408435%201051057839%2C%203889109054%22%2C%22contentLength%22%3A32166%2C%22age%22%3A26094%2C%22timestamp%22%3A1425572831%2C%22lastModified%22%3A1383188919%2C%22redirect%22%3A0%2C%22dns%22%3A67%2C%22tcp%22%3A255%2C%22request%22%3A92%2C%22response%22%3A168%2C%22cache%22%3A1%2C%22uploadTimestamp%22%3A%2220050406000000%22%2C%22imageWidth%22%3A320%2C%22country%22%3A%22BR%22%7D%2C%22revision%22%3A11030254%2C%22schema%22%3A%22MultimediaViewerNetworkPerformance%22%2C%22webHost%22%3A%22en.wikipedia.org%22%2C%22wiki%22%3A%22enwiki%22%7D  deploymen?%7B%22event%22%3A%7B%22type%22%3A%22image%22%2C%22isHttps%22%3Atrue%2C%22total%22%3A583%2C%22urlHost%22%3A%22upload.wikimedia.org%22%2C%22status%22%3A200%2C%22XCache%22%3A%22cp1064%20hit%20(25)%2C%20cp3016%20hit%20(3)%2C%20cp3010%20frontend%20miss%20(0)%22%2C%22varnish1%22%3A%22cp1064%22%2C%22varnish1hits%22%3A25%2C%22varnish2%22%3A%22cp3016%22%2C%22varnish2hits%22%3A3%2C%22varnish3%22%3A%22cp3010%22%2C%22varnish3hits%22%3A0%2C%22XVarnish%22%3A%22598562358%20567337670%2C%201059408435%201051057839%2C%203889109054%22%2C%22contentLength%22%3A32166%2C%22age%22%3A26094%2C%22timestamp%22%3A1425572831%2C%22lastModified%22%3A1383188919%2C%22redirect%22%3A0%2C%22dns%22%3A67%2C%22tcp%22%3A255%2C%22request%22%3A92%2C%22response%22%3A168%2C%22cache%22%3A1%2C%22uploadTimestamp%22%3A%2220050406000000%22%2C%22imageWidth%22%3A320%2C%22country%22%3A%22BR%22%7D%2C%22revision%22%3A11030254%2C%22schema%22%3A%22MultimediaViewerNetworkPerformance%22%2C%22webHost%22%3A%22en.wikipedia.org%22%2C%22wiki%22%3A%22enwiki%22%7D deploymen?%7B%22event%22%3A%7B%22type%22%3A%22image%22%2C%22isHttps%22%3Atrue%2C%22total%22%3A583%2C%22urlHost%22%3A%22upload.wikimedia.org%22%2C%22status%22%3A200%2C%22XCache%22%3A%22cp1064%20hit%20(25)%2C%20cp3016%20hit%20(3)%2C%20cp3010%20frontend%20miss%20(0)%22%2C%22varnish1%22%3A%22cp1064%22%2C%22varnish1hits%22%3A25%2C%22varnish2%22%3A%22cp3016%22%2C%22varnish2hits%22%3A3%2C%22varnish3%22%3A%22cp3010%22%2C%22varnish3hits%22%3A0%2C%22XVarnish%22%3A%22598562358%20567337670%2C%201059408435%201051057839%2C%203889109054%22%2C%22contentLength%22%3A32166%2C%22age%22%3A26094%2C%22timestamp%22%3A1425572831%2C%22lastModified%22%3A1383188919%2C%22redirect%22%3A0%2C%22dns%22%3A67%2C%22tcp%22%3A255%2C%22request%22%3A92%2C%22response%22%3A168%2C%22cache%22%3A1%2C%22uploadTimestamp%22%3A%2220050406000000%22%2C%22imageWidth%22%3A320%2C%22country%22%3A%22BR%22%7D%2C%22revision%22%3A11030254%2C%22schema%22%3A%22MultimediaViewerNetworkPerformance%22%2C%22webHost%22%3A%22en.wikipedia.org%22%2C%22wiki%22%3A%22enwiki%22%7D deploymen?%7B%22event%22%3A%7B%22xhrSupported%22%3Atrue%2C%22xdomainSupported%22%3Afalse%2C%22imgAttributeSupported%22%3Atrue%2C%22scriptAttributeSupported%22%3Atrue%2C%22scriptLoaded%22%3Atrue%2C%22samplingFactor%22%3A1%2C%22isHttps%22%3Afalse%2C%22isAnon%22%3Atrue%2C%22country%22%3A%22US%22%7D%2C%22revision%22%3A10884476%2C%22schema%22%3A%22ImageMetricsCorsSupport%22%2C%22webHost%22%3A%22en.m.wikipedia.beta.wmflabs.org%22%2C%22wiki%22%3A%22enwiki%22%7D;     deployment-cache-bits01.eqiad.wmflabs   251899  2015-03-05T20:01:48     162.222.73.15   Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.94 Safari/537.36

Here the query string is complete (not truncated), but the same 'deploymen' keyword and right after that the next log starts. Note that this happens 3 consecutive times.

This happened with every event I sent (MultimediaViewerNetworkPerformance), but it seems to happen also with other logs:

?%7B%22event%22%3A%7B%22xhrSupported%22%3Atrue%2C%22xdomainSupported%22%3Afalse%2C%22imgAttributeSupported%22%3Atrue%2C%22scriptAttributeSupported%22%3Atrue%2C%22scriptLoaded%22%3Atrue%2C%22samplingFactor%22%3A1%2C%22isHttps%22%3Afalse%2C%22isAnon%22%3Atrue%2C%22country%22%3A%22US%22%7D%2C%22revision%22%3A10884476%2C%22schema%22%3A%22ImageMetricsCorsSupport%22%2C%22webHost%22%3A%22en.wiki?%7B%22event%22%3A%7B%22type%22%3A%22image%22%2C%22isHttps%22%3Atrue%2C%22total%22%3A583%2C%22urlHost%22%3A%22upload.wikimedia.org%22%2C%22status%22%3A200%2C%22XCache%22%3A%22cp1064%20hit%20(25)%2C%20cp3016%20hit%20(3)%2C%20cp3010%20frontend%20miss%20(0)%22%2C%22varnish1%22%3A%22cp1064%22%2C%22varnish1hits%22%3A25%2C%22varnish2%22%3A%22cp3016%22%2C%22varnish2hits%22%3A3%2C%22varnish3%22%3A%22cp3010%22%2C%22varnish3hits%22%3A0%2C%22XVarnish%22%3A%22598562358%20567337670%2C%201059408435%201051057839%2C%203889109054%22%2C%22contentLength%22%3A32166%2C%22age%22%3A26094%2C%22timestamp%22%3A1425572831%2C%22lastModified%22%3A1383188919%2C%22redirect%22%3A0%2C%22dns%22%3A67%2C%22tcp%22%3A255%2C%22request%22%3A92%2C%22response%22%3A168%2C%22cache%22%3A1%2C%22uploadTimestamp%22%3A%2220050406000000%22%2C%22imageWidth%22%3A320%2C%22country%22%3A%22BR%22%7D%2C%22revision%22%3A11030254%2C%22schema%22%3A%22MultimediaViewerNetworkPerformance%22%2C%22webHost%22%3A%22en.wikipedia.org%22%2C%22wiki%22%3A%22enwiki%22%7D   deploymen?%7B%22event%22%3A%7B%22xhrSupported%22%3Atrue%2C%22xdomainSupported%22%3Afalse%2C%22imgAttributeSupported%22%3Atrue%2C%22scriptAttributeSupported%22%3Atrue%2C%22scriptLoaded%22%3Atrue%2C%22samplingFactor%22%3A1%2C%22isHttps%22%3Afalse%2C%22isAnon%22%3Atrue%2C%22country%22%3A%22US%22%7D%2C%22revision%22%3A10884476%2C%22schema%22%3A%22ImageMetricsCorsSupport%22%2C%22webHost%22%3A%22en.wikipedia.beta.wmflabs.org%22%2C%22wiki%22%3A%22enwiki%22%7D;       deployment-cache-bits01.eqiad.wmflabs   251794  2015-03-05T19:54:20     162.222.73.11   Mozilla/5.0 (X11; Linux i686; rv:35.0) Gecko/20100101 Firefox/35.0

The first log belongs to the schema ImageMetricsCorsSupport and is also truncated after 'en.wiki', and right after that, the next log.

Thoughts?
Maybe something to do with upd packet loss (that has a higher probability to affect long logs like MultimediaViewerNetworkPerformance ones)?

Nuria added a comment.Mar 5 2015, 8:33 PM

Things to check would be:

When/If we know our limits we can talk to mediaviewer team to "shorten" their events.

Trying another way to read the UDP logs from varnishncsa I outputed the read lines from inside the forwarder.

The '2 events in one line' issue has disappeared, so I suppose it was a problem of (my) netcat call. Sorry for that :/

But the truncation of the logs continues, and I found out that the actual limit is 1014 bytes.

Whenever the query string is >1014 chars (including leading '?' and trailing ';'), the corresponding log comes truncated from varnishncsa. If the query string is <=1014, the log comes correctly.

1014-char long query string (I added some random letters to make it the exact size):

?%7B%22event%22%3A%7B%22type%22%3A%22image%22%2C%22conte1234567hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhntHost%22%3A%22en.wikipedia.org%22%2C%22isHttps%22%3Atrue%2C%22total%22%3A276%2C%22urlHost%22%3A%22upload.wikimedia.org%22%2C%22status%22%3A200%2C%22XCache%22%3A%22cp1061%20hit%20(17)%2C%20cp3007%20hit%20(74)%2C%20cp3010%20frontend%20miss%20(0)%22%2C%22varnish1%22%3A%22cp1061%22%2C%22varnish1hits%22%3A17%2C%22varnish2%22%3A%22cp3007%22%2C%22varnish2hits%22%3A74%2C%22varnish3%22%3A%22cp3010%22%2C%22varnish3hits%22%3A0%2C%22XVarnish%22%3A%222044642878%202018120993%2C%201173698881%201001102349%2C%203973102219%22%2C%22contentLength%22%3A26877%2C%22age%22%3A112169%2C%22timestamp%22%3A1425594727%2C%22lastModified%22%3A1397651528%2C%22uploadTimestamp%22%3A%2220100114000000%22%2C%22imageWidth%22%3A320%2C%22country%22%3A%22BR%22%7D%2C%22revision%22%3A11030254%2C%22schema%22%3A%22MultimediaViewerNetworkPerformance%22%2C%22webHost%22%3A%22en.wikipedia.org%22%2C%22wiki%22%3A%22enwiki%22%7D;	deployment-cache-bits01.eqiad.wmflabs	252847	2015-03-05T22:46:05	79.154.217.8	Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36

1015-char long query string:

?%7B%22event%22%3A%7B%22type%22%3A%22image%22%2C%22conte12345687hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhntHost%22%3A%22en.wikipedia.org%22%2C%22isHttps%22%3Atrue%2C%22total%22%3A276%2C%22urlHost%22%3A%22upload.wikimedia.org%22%2C%22status%22%3A200%2C%22XCache%22%3A%22cp1061%20hit%20(17)%2C%20cp3007%20hit%20(74)%2C%20cp3010%20frontend%20miss%20(0)%22%2C%22varnish1%22%3A%22cp1061%22%2C%22varnish1hits%22%3A17%2C%22varnish2%22%3A%22cp3007%22%2C%22varnish2hits%22%3A74%2C%22varnish3%22%3A%22cp3010%22%2C%22varnish3hits%22%3A0%2C%22XVarnish%22%3A%222044642878%202018120993%2C%201173698881%201001102349%2C%203973102219%22%2C%22contentLength%22%3A26877%2C%22age%22%3A112169%2C%22timestamp%22%3A1425594727%2C%22lastModified%22%3A1397651528%2C%22uploadTimestamp%22%3A%2220100114000000%22%2C%22imageWidth%22%3A320%2C%22country%22%3A%22BR%22%7D%2C%22revision%22%3A11030254%2C%22schema%22%3A%22MultimediaViewerNetworkPerformance%22%2C%22webHost%22%3A%22en.wikipedia.org%22%2C%22wiki%22%3A%22enwiki%22%7D	deployment-cache-bits01.eqiad.wmflabs	252848	2015-03-05T22:46:19	79.154.217.8	Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36

Note that the semicolon isn't there any more.
This behavior repeats itself with longer logs.

mforns added a comment.EditedMar 5 2015, 11:09 PM

This person had the same problem, no answer:
https://www.varnish-cache.org/forum/topic/126

Tgr added a comment.Mar 5 2015, 11:14 PM

Per P365 it seems MultimediaViewerNetworkPerformance has the largest payload amongst frequent events, but this bug will affect other events as well, given that the main source of length fluctuation is probably the UA string which can add several hundred bytes (or more in extreme cases).

Tgr renamed this task from Some events not validating for MultimediaViewerNetworkPerformance to EventLogging query strings are truncated to 1014 bytes by varnishncsa.Mar 5 2015, 11:17 PM
Nuria added a comment.Mar 5 2015, 11:36 PM

@Tgr, not sure it i s a bug, rather a limit we need to abide to given varnish logging.

Will consult with ops on wether this can be changed but i suspect is probably not that easy.

And looking at varnishncsa code there seems to be some code that trims the query string and as a side effect truncates it, if I see it well...
https://www.varnish-cache.org/trac/browser/bin/varnishncsa/varnishncsa.c?rev=79c2d962221bb7ce3582caa1a0be5df4841e0832#L271

Nuria renamed this task from EventLogging query strings are truncated to 1014 bytes by varnishncsa to EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?).Mar 6 2015, 12:00 AM
mforns claimed this task.Mar 6 2015, 12:05 AM
mforns moved this task from Next Up to In Progress on the Analytics-Kanban board.
mforns added a subscriber: Ottomata.Mar 6 2015, 4:55 PM
mforns reassigned this task from mforns to Ottomata.Mar 12 2015, 2:19 PM
Nuria added a comment.EditedMar 18 2015, 3:38 PM

We are going to enable our event-ingesting pipeline to use varnishkafka rtaher than varnishncsa.

Also, please see: https://phabricator.wikimedia.org/T91918 events over a certain length will not be sent by the client as there is no point in sending events we cannot process.

So, we have implemented a varnishkafka eventlogging endpoint that works well and dandy. But! It does not solve this problem. Both varnishkafka and varnishncsa query strings are truncated to the same length. I suspect this is an issue with the varnish share memory logs, but I am not sure.

summoning @BBlack ! he might have an idea whether varnish truncates query strings for logging purposes

fgiunchedi triaged this task as Normal priority.Apr 1 2015, 11:05 AM
BBlack added a subscriber: ori.Apr 1 2015, 1:30 PM

I think @ori and I both looked at various parts of this on IRC once sometime in the past couple of weeks. I think we confirmed that varnish's shm log isn't inherently truncating this at the source, and found a possible varnishkafka config parameter to try changing (which is set to 2K whereas the observed trunc is closer to 1K, but still...). I don't recall hearing back from @Ottomata if that experiment worked out.

BBlack added a comment.Apr 1 2015, 1:32 PM

The parameter was tag.size.max, which defaults to 2048 and was set explicitly to 2048 in some observed configs.

Nuria added a comment.Apr 1 2015, 2:34 PM

We looked at this in detail and:

  1. it's varnish, not varnishncsa who is "cutting" the request , there is no config for log size on varnish thus it is likely related to varnish in-memory buffer size.
  1. since varnish is cutting the request whether you populate EL from varnishncsa or kafka there is no difference when it comes to the requests being "cut"
  1. we have since then made changes so events we know are too long are not sent by clients and report an error.

I did try increasing tag.size.max for varnishkafka, but it didn't change anything. The fact that the exact same behavior is exhibited by both varnishkafka and varnishncsa indicates to me that this is more likely somewhere above those clients. Not sure though.

What is the status of this bug? Is any further work planned, and if so, is there a rough schedule for it?

Nuria added a comment.Apr 9 2015, 2:06 PM

No further work is planned. The limitation on the length of events is on varnish.Thus far, lengthy events will not be sent.

kevinator moved this task from Paused to Done on the Analytics-Kanban board.Apr 27 2015, 11:59 PM
Tgr reopened this task as Open.Apr 28 2015, 4:00 PM

Sending a warning message when the logging call fails does not fix the issue of the logging call failing, by any reasonable definition. Predicting the length of a log message is practically impossible, given that most of it is going to be taken up by data the developer has no control over (the UA string, predominantly). Feel free to lower the priority if the Analytics team does not feel this is important to fix, but that does not make the bug into a feature.

kevinator lowered the priority of this task from Normal to Low.Apr 30 2015, 3:44 PM

There seems to be some conflict in the investigations here, as I did say earlier "I think we confirmed that varnish's shm log isn't inherently truncating this at the source". I can dig through the varnish source again and look harder, but what's the basis for thinking "it's varnish, not varnishncsa who is cutting the request" ? My theory on the correlation back then was simply that whatever varnishncsa's code was doing that caused the truncation, varnishkafka copied the same basic code/technique.

Oh, I guess it's amazing sometimes what a fresh look can reveal. I thought we had investigated this fairly comprehensively before, but today I was pretty quickly able to find that there's a varnish parameter controlling this, and it is set to 1024 (it's a buffer size, the other 10 bytes are probably other overhead). We could raise this parameter, but there are doubtless many tradeoffs on logging buffer size and performance and all that, so we certainly can't just set it to a crazy value like 65536. Is there a reasonable cutoff value greater than 1024 that brings some pragmatic benefit?

ori added a comment.Apr 30 2015, 6:01 PM

Oh, I guess it's amazing sometimes what a fresh look can reveal. I thought we had investigated this fairly comprehensively before, but today I was pretty quickly able to find that there's a varnish parameter controlling this, and it is set to 1024 (it's a buffer size, the other 10 bytes are probably other overhead). We could raise this parameter, but there are doubtless many tradeoffs on logging buffer size and performance and all that, so we certainly can't just set it to a crazy value like 65536. Is there a reasonable cutoff value greater than 1024 that brings some pragmatic benefit?

4096 ought to be enough. Doubling that gives us 8192 bytes, which will definitively be enough.

kevinator moved this task from Done to Next Up on the Analytics-Kanban board.Apr 30 2015, 11:57 PM

We probably want to bump up the shm_workspace as well (which already occasionally has overflow-flush behavior at the current default of 8192). There's potentially some issues with getting it bumped without daemon restarts, probably (see old ticket info of unknown validity here: https://www.varnish-cache.org/trac/ticket/1055 ). At this point I'm thinking of maybe doubling shm_workspace to 16K and then setting shm_reclen to 8192. What makes 8192 definitive anyways?

Restricted Application added a subscriber: Matanya. · View Herald TranscriptJul 10 2015, 4:39 PM
awight added a subscriber: awight.Aug 10 2015, 11:37 PM

It seems like wasted effort to increase the length limit beyond 2,000 chars. See this discussion and survey,
http://stackoverflow.com/questions/417142/what-is-the-maximum-length-of-a-url-in-different-browsers

Tgr added a comment.Aug 10 2015, 11:57 PM

It seems like wasted effort to increase the length limit beyond 2,000 chars. See this discussion and survey,
http://stackoverflow.com/questions/417142/what-is-the-maximum-length-of-a-url-in-different-browsers

That probably made sense in 2006, when the article that SO post is based on was written. These days, IE 7-9 combined (the three affected browsers, given that we don't support javascript at all for IE 6) form about 2.5% of our traffic. (And IE 8-9 actually support long URLs as long as you don't use certain settings such as proxies.)

That probably made sense in 2006, when the article that SO post is based on was written. These days, IE 7-9 combined (the three affected browsers, given that we don't support javascript at all for IE 6) form about 2.5% of our traffic. (And IE 8-9 actually support long URLs as long as you don't use certain settings such as proxies.)

Good point. EventLogging traffic is not essential, and doesn't degrade the user experience if it's broken, so we can relax the 99%-supported rule of thumb that I'm familiar with.

Hehe, we could assume that if URLs are getting truncated by your browser, then sendBeacon is definitely missing as well. Provide a callback to the EventLogging AJAX, which expects the 204 response code, but if the server sends back an error, for example when the EL record doesn't validate, we can set a compatibility cookie which causes future long messages to be sent in multiple packets, or as a POST.

Even if browsers allow >2K URLs, they seem like a poor idea in general. Even a 1K URL seems insane to me.

Tgr added a comment.Aug 11 2015, 7:15 AM

So why don't we just use POST? sendBeacon actually does that, we just abuse it currently by putting the payload into the url instead of the POST data. Most browsers that we care about support either CORS or XDomainRequest, except IE 7 and Opera Mini which are together around 1.3% of total traffic and it's probably fine if we just make them drop large messages.

Probably because beacon is used with the analytics pipeline rather than the appserver code, which doesn't log POST data.

Tgr added a comment.Aug 12 2015, 1:30 AM

Would it be difficult to change that? (Also, is it documented somewhere what those things (analytics pipeline + appserver code) are? I found Analytics/Cluster/EventLogging_Pipeline but that's not really useful.)

I did a quick 5 minute googling, and it doesn't look like the varnishlog API has access to POST data :(

I don't think we'd want to even if we could, TBH. The privacy implications are bigger in the general case.

We are eventually going to need a way to get more than 1014 bytes of data from clients. Not sure how yet, maybe it won't be via varnish.

Tgr added a comment.Aug 12 2015, 8:08 PM

So how about bumping the limit @BBlack found in T91347#1249751 to 8192, raising the limit in https://gerrit.wikimedia.org/r/#/c/196599/ to 2000, closing this ticket (if the varnish fix indeed works), opening a reminder ticket about the browser length issue and revisiting it whenever a limit of 2000 bytes becomes a problem? Right now it probably isn't, and with some luck IE 7-8 dies by that time and there that changes.

We are eventually going to need a way to get more than 1014 bytes of data from clients. Not sure how yet, maybe it won't be via varnish.

In a URL? What kind of data are considering eventlogging here that's so long?

In T91347#1532517, @Tgr wrote:

So how about bumping the limit @BBlack found in T91347#1249751 to 8192, raising the limit in https://gerrit.wikimedia.org/r/#/c/196599/ to 2000, closing this ticket (if the varnish fix indeed works), opening a reminder ticket about the browser length issue and revisiting it whenever a limit of 2000 bytes becomes a problem? Right now it probably isn't, and with some luck IE 7-8 dies by that time and there that changes.

Raising the limit to 8192 in varnish is non-trivial. It carries some serious risks and possible tradeoffs (because the log buffering is implemented at a very low level in the varnish code, the kind of level where incorrect tuning has potential for raw segfaults and performance issues - we've had assert() panics from related tunables in the past). I'd like to at least see some real examples of where and why we need it, and a limit that isn't completely arbitrary. Also, 2000-ish is a whole lot more palatable than 8192.

Trying to answer this for myself, the original truncated URL from the top of this ticket's data is:

{"event":{"type":"image","contentHost":"pl.wikipedia.org","isHttps":false,"total":492,"urlHost":"upload.wikimedia.org","status":200,"XCache":"cp1063 miss (0), cp3016 hit (1), cp3018 frontend miss (0)","varnish1":"cp1063","varnish1hits":0,"varnish2":"cp3016","varnish2hits":1,"varnish3":"cp3018","varnish3hits":0,"XVarnish":"265764709, 583884279 576752145, 4160384575","contentLength":111608,"age":3968,"timestamp":1425247200,"lastModified":1383480710,"redirect":0,"dns":0,"tcp":38,"request":221,"response":205,"cache":28,"uploadTimestamp":"20070823000000","imageWidth":1024,"country":"PL"},"revision":11030254,"schema":"MultimediaViewerNetworkPerformance","webHost":"pl.wikipedia.org

A lot of this seems like it could be compressed/reduced, and/or duplicates information available in the headers of the original request and response.

In a URL? What kind of data are considering eventlogging here that's so long?

Not necessarily in a URL, just in general we will. If we can get POST from VSL, then something else perhaps. We will see!

A lot of this seems like it could be compressed/reduced, and/or duplicates information available in the headers of the original request and response.

Welcome to JSON! :)

Tgr added a comment.Aug 12 2015, 8:33 PM

Also, 2000-ish is a whole lot more palatable than 8192.

Sure, 2048 would be fine - if we have a limit of 2000 on the client side, it makes no difference.

A lot of this seems like it could be compressed/reduced, and/or duplicates information available in the headers of the original request and response.

Actually all of that except the webHost comes from the headers of a different request. The user agent (already cut off in your example) could be taken from the headers though, and it can be quite long. (See also T108757.) And sure, we could do things like replacing enums with numbers, but that seems to be the wrong level to tackle this problem.

In T91347#1532587, @Tgr wrote:

Actually all of that except the webHost comes from the headers of a different request.

Yeah, but they're there in the original request we're logging about right? Couldn't we just correlate the two with an ID number to get most of it? Maybe the existing Varnish transaction ID is unique enough within a certain window.

The user agent (already cut off in your example) could be taken from the headers though, and it can be quite long.

That seems like a sensible thing to do.

Tgr added a comment.Aug 12 2015, 10:18 PM

Yeah, but they're there in the original request we're logging about right? Couldn't we just correlate the two with an ID number to get most of it? Maybe the existing Varnish transaction ID is unique enough within a certain window.

That sounds possible, although it would make development much more complicated (unless this is somehow provided as an easily available option that is supported by the EventLogging core).

BBlack closed this task as Resolved.Sep 24 2015, 1:25 PM

The above commit should be live on all prod + beta varnishes now, raising the varnish log shm limit to ~2K.

BBlack moved this task from Triage to Done on the Traffic board.Sep 24 2015, 1:26 PM

Awesome, thanks @BBlack!