
Eventlogging should transparently split large event payloads
Closed, Declined · Public

Description

We've written a workaround in CentralNotice for the arbitrary message length truncation (T91347) affecting client-side EventLogging. Ideally, we'd be free to choose any reasonable message size, and EventLogging should include a continuation mechanism that kicks off transparently when we risk going over the truncation limit du jour. This would improve our data quality and decrease calling code complexity.
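
For illustration only, a minimal sketch of what transparent client-side splitting might look like (the production client is JavaScript; the Python here and the streamId/seq/total/chunk envelope fields are invented for this sketch, not an existing EventLogging feature):

# Sketch: split an over-long event into continuation chunks that the server
# could reassemble. MAX_PAYLOAD and the envelope fields are assumptions.
import json
import uuid

MAX_PAYLOAD = 1000           # assumed safe size after URL-encoding overhead
ENVELOPE_ROOM = 100          # space reserved for the chunk envelope itself


def split_event(event):
    """Return one or more wire payloads for a single logical event."""
    encoded = json.dumps(event, separators=(',', ':'))
    if len(encoded) <= MAX_PAYLOAD:
        return [encoded]
    stream_id = uuid.uuid4().hex                  # ties the chunks together
    size = MAX_PAYLOAD - ENVELOPE_ROOM
    parts = [encoded[i:i + size] for i in range(0, len(encoded), size)]
    return [json.dumps({'streamId': stream_id, 'seq': n,
                        'total': len(parts), 'chunk': part},
                       separators=(',', ':'))
            for n, part in enumerate(parts)]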

Event Timeline

awight created this task. Sep 29 2015, 5:50 AM
awight assigned this task to Ottomata.
awight raised the priority of this task to Normal.
awight updated the task description.
awight added subscribers: AndyRussG, awight, Matanya and 9 others.
Nuria added a comment. Sep 29 2015, 3:03 PM

Please note that the varnish logging limit has been increased to 2000 characters (https://gerrit.wikimedia.org/r/#/c/240922/). While I see a use case for somewhat larger messages, there are two points to consider:

  • Message size will never be unbounded. These messages use the user's bandwidth, and we should think about whether we need to log this much; this is especially true on mobile, where this type of messaging just drains the user's battery.
  • We have to evaluate whether the complexity needed to split messages is worth it. If larger payloads make up 1% of our use cases, I am not sure it is worth complicating the client for them.

BTW, cc @Tgr, as I believe he had a ticket in this regard.

Tgr added a comment. Sep 29 2015, 8:37 PM

BTW, cc @Tgr, as I believe he had a ticket in this regard.

T108757 has some effect on event sizes, but not much.

For more context, we're sending data very occasionally: one message on far less than 1% of pageviews. There's no real bandwidth concern for the reader; I think the increased code complexity is the main question here. I think we need to test for message truncation anyway, so the new code would consist of:

  • If the message is over the size limit, split it and send it as multiple messages.
  • On the receiving end, piece the continuation messages back together. That seems like the tricky part, since messages may arrive out of order (see the sketch after this list).
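
To make the tricky part concrete, here is a minimal reassembly sketch (purely illustrative Python, not EventLogging code; it assumes each chunk carries hypothetical streamId, seq, total and chunk fields):

# Hypothetical server-side reassembly of chunked events. Chunks may arrive
# in any order; they are buffered until the full set for a streamId is seen.
import json
from collections import defaultdict

buffers = defaultdict(dict)  # streamId -> {seq: chunk string}


def receive(raw):
    """Return the reassembled event once all chunks have arrived, else None."""
    msg = json.loads(raw)
    if 'streamId' not in msg:                 # ordinary, unsplit event
        return msg
    parts = buffers[msg['streamId']]
    parts[msg['seq']] = msg['chunk']          # arrival order does not matter
    if len(parts) < msg['total']:
        return None                           # still waiting for more chunks
    joined = ''.join(parts[i] for i in range(msg['total']))
    del buffers[msg['streamId']]
    return json.loads(joined)

In practice the buffer would also need an expiry for chunk sets that never complete, and in a parallel consumer setup all chunks of one stream would have to reach the same worker.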

I'm with Nuria in that we have to evaluate whether the complexity in the back-end is worth it.

Right now, we on the Analytics team are working on deploying a new EventLogging processor, which will work in parallel. This parallelism will really help in scaling the system. Having to piece messages back together in that scenario would be even more difficult. I'd say that the fact that events are one piece now helps the performance and scalability of EventLogging.

BBlack added a comment (edited). Sep 30 2015, 3:20 PM

Logging events and stats without introducing significant complexity or perf issues into the normal request flow is always going to be tricky. It would be better if we could simply focus on sending shorter messages in all cases. Looking at the example from the previous ticket that prompted the jump to the 2K limit, the URL data and the JSON decoded from it were:

eventlogging_processor-client-side-events.log.1.gz:2015-03-01 22:00:00,999 (MainThread) Unable to decode: ?%7B%22event%22%3A%7B%22type%22%3A%22image%22%2C%22contentHost%22%3A%22pl.wikipedia.org%22%2C%22isHttps%22%3Afalse%2C%22total%22%3A492%2C%22urlHost%22%3A%22upload.wikimedia.org%22%2C%22status%22%3A200%2C%22XCache%22%3A%22cp1063%20miss%20(0)%2C%20cp3016%20hit%20(1)%2C%20cp3018%20frontend%20miss%20(0)%22%2C%22varnish1%22%3A%22cp1063%22%2C%22varnish1hits%22%3A0%2C%22varnish2%22%3A%22cp3016%22%2C%22varnish2hits%22%3A1%2C%22varnish3%22%3A%22cp3018%22%2C%22varnish3hits%22%3A0%2C%22XVarnish%22%3A%22265764709%2C%20583884279%20576752145%2C%204160384575%22%2C%22contentLength%22%3A111608%2C%22age%22%3A3968%2C%22timestamp%22%3A1425247200%2C%22lastModified%22%3A1383480710%2C%22redirect%22%3A0%2C%22dns%22%3A0%2C%22tcp%22%3A38%2C%22request%22%3A221%2C%22response%22%3A205%2C%22cache%22%3A28%2C%22uploadTimestamp%22%3A%2220070823000000%22%2C%22imageWidth%22%3A1024%2C%22country%22%3A%22PL%22%7D%2C%22revision%22%3A11030254%2C%22schema%22%3A%22MultimediaViewerNetworkPerformance%22%2C%22webHost%22%3A%22pl.wikipedia.org%2        cp3020.esams.wikimedia.org      66124967        2015-03-01T22:00:00     77.255.103.93   Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36
{
  "type": "image",
  "contentHost": "pl.wikipedia.org",
  "isHttps": false,
  "total": 492,
  "urlHost": "upload.wikimedia.org",
  "status": 200,
  "XCache": "cp1063 miss (0), cp3016 hit (1), cp3018 frontend miss (0)",
  "varnish1": "cp1063",
  "varnish1hits": 0,
  "varnish2": "cp3016",
  "varnish2hits": 1,
  "varnish3": "cp3018",
  "varnish3hits": 0,
  "XVarnish": "265764709, 583884279 576752145, 4160384575",
  "contentLength": 111608,
  "age": 3968,
  "timestamp": 1425247200,
  "lastModified": 1383480710,
  "redirect": 0,
  "dns": 0,
  "tcp": 38,
  "request": 221,
  "response": 205,
  "cache": 28,
  "uploadTimestamp": "20070823000000",
  "imageWidth": 1024,
  "country": "PL"
}

Just in terms of the data in that structure, there are a few obvious things:

  1. isHttps is mostly unnecessary now. We have a few corner cases where non-https traffic still flows into wikis, but they're rare and well-known and on the chopping block to be eliminated.
  2. Fields like varnish1, varnish1hits, etc. are clearly redundant, being derived from the XCache data. The decoding of XCache should be done server-side after event reception, saving several keys and values here (see the sketch after this list).
  3. XVarnish - is analytics actually looking at this? It correlates to some internal varnish transaction IDs, but AFAIK there's not much else to correlate them with in practice.
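
As a concrete example (a sketch only; this is not part of any existing decoder, and it assumes the XCache format shown in the payload above), the varnishN/varnishNhits fields could be reconstructed server-side from XCache:

# Sketch: derive varnish1/varnish1hits etc. server-side from the XCache
# value, so the client does not need to send the redundant fields.
import re


def expand_xcache(xcache):
    fields = {}
    for n, entry in enumerate(xcache.split(', '), start=1):
        m = re.match(r'(\S+).*\((\d+)\)', entry)
        if m:
            fields['varnish%d' % n] = m.group(1)
            fields['varnish%dhits' % n] = int(m.group(2))
    return fields


print(expand_xcache('cp1063 miss (0), cp3016 hit (1), cp3018 frontend miss (0)'))
# {'varnish1': 'cp1063', 'varnish1hits': 0, 'varnish2': 'cp3016',
#  'varnish2hits': 1, 'varnish3': 'cp3018', 'varnish3hits': 0}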

Issues with unnecessary data within the structure aside, we can also look at simply encoding the data more efficiently. A basically random (from an encoding POV) set of keys and values encoded to JSON and then URL-encoded is nowhere near as space-efficient as we could be with this stuff.

The first and most obvious thing we could do is build a dictionary (exported to the eventlogging JS and the server-side decoding) of the keys so that they can be sent in a minimal form, such as just integers: key 1 is "contentHost", and so on. This is easy to expand and update over time, and cuts over half of the raw JSON data in each message. Similarly, you may want to use dictionaries for some of the values in cases where there is only a small distinct set of possible values. For instance, if type is only ever one of image, html, or json, those could again be a dictionary of integer values known to both sides.
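
A minimal sketch of what such a shared dictionary could look like (the numbering and the key list below are invented for illustration; a real mapping would have to be versioned and shared between the JS client and the server-side decoder):

# Illustrative only: send well-known keys (and small enumerated values) as
# integer codes agreed on by both sides.
KEYS = ['type', 'contentHost', 'isHttps', 'total', 'urlHost', 'status',
        'XCache', 'contentLength', 'age', 'timestamp', 'country']
KEY_TO_ID = {k: i for i, k in enumerate(KEYS)}

TYPE_VALUES = ['image', 'html', 'json']   # small, known value set


def pack(event):
    """Replace known keys (and 'type' values) with their integer codes."""
    packed = {}
    for key, value in event.items():
        if key == 'type' and value in TYPE_VALUES:
            value = TYPE_VALUES.index(value)
        packed[KEY_TO_ID.get(key, key)] = value
    return packed


print(pack({'type': 'image', 'contentHost': 'pl.wikipedia.org', 'total': 492}))
# {0: 0, 1: 'pl.wikipedia.org', 3: 492}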

Another thing to look at is whether there's a more efficient way to encode the data than JSON here. JSON requires a bunch of characters that aren't URL-clean and thus expand to 3 characters, e.g. " becoming %22, : becoming %3A (which I don't think is actually necessary; the colon is legal in a URL), etc. We could use an alternate data-structure format here which is natively nearly or completely URL-clean, without so much expansion of metacharacters.

Today, for example, the first two key:value pairs in the example structure, "type": "image", "contentHost": "pl.wikipedia.org", end up URL-encoded as:

%22type%22%3A%22image%22%2C%22contentHost%22%3A%22pl.wikipedia.org%22%2C

They could instead be something like this, directly in URL-legal characters (ignoring the previous points about integer dictionary keys and values):

type:image@contentHost:pl.wikipedia.org@

since the JSON quoting isn't necessary for most of the data here. You just need to decide on metacharacters that are URL-legal and make sense for the data. If you really need quoting, you could use [ ] as URL-legal bracketing characters that don't require %-encoding as well.
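
A sketch of such an encoding and its inverse (using the ':' and '@' separators from the example above; values containing the separators would need an escaping rule, which this sketch omits):

# Sketch of a URL-clean key:value format with ':' and '@' as separators.
def encode(event):
    return '@'.join('%s:%s' % (k, v) for k, v in event.items()) + '@'


def decode(flat):
    pairs = (item.split(':', 1) for item in flat.rstrip('@').split('@'))
    return {k: v for k, v in pairs}


flat = encode({'type': 'image', 'contentHost': 'pl.wikipedia.org'})
print(flat)          # type:image@contentHost:pl.wikipedia.org@
print(decode(flat))  # {'type': 'image', 'contentHost': 'pl.wikipedia.org'}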

With or without any of the techniques above, you could also look at running an actual compression algorithm over the data before URL-encoding. However, I think once techniques like the above are applied, probably the excess %-encoding of binary bytes would outweigh the gains of a generic compression algorithm.
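
One rough way to check whether compression would pay off for a given payload (deflate via zlib, re-encoded with a URL-safe base64 alphabet, compared against plain percent-encoded JSON; the event below is a trimmed version of the example above):

# Rough size comparison: percent-encoded JSON vs. deflate + URL-safe base64.
# Whether compression wins depends entirely on the payload.
import base64
import json
import urllib.parse
import zlib

event = {'type': 'image', 'contentHost': 'pl.wikipedia.org',
         'XCache': 'cp1063 miss (0), cp3016 hit (1), cp3018 frontend miss (0)',
         'contentLength': 111608, 'country': 'PL'}

encoded = json.dumps(event, separators=(',', ':'))
plain = urllib.parse.quote(encoded)
packed = base64.urlsafe_b64encode(zlib.compress(encoded.encode()))

print(len(plain), len(packed))   # the two wire sizes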

Truncation for rare cases will probably always be something your protocol has to account for, but that problem is pretty simple in itself: the client-side should know what the safe limit is, and should simply replace the whole string with a flag like "DataTruncated" in that case.
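
That client-side guard is small; a sketch (SAFE_LIMIT and the flag value are placeholders for whatever the protocol settles on):

# Sketch of the truncation guard described above.
SAFE_LIMIT = 2000


def payload_or_flag(encoded_event):
    """Send the event if it fits, otherwise a flag the server can count."""
    if len(encoded_event) <= SAFE_LIMIT:
        return encoded_event
    return 'DataTruncated'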

Nuria added a comment. Sep 30 2015, 5:11 PM

There are many good points in @BBlack's reply.

We could give a little more thought to how to set up schemas most efficiently, so that the most information can be emitted within the allotted limit.
Rather than encoding, I think the main driving factor for the size of the payload is the schema setup. For example, there are several instances of 'catch-all' schemas that would benefit from being split up. We see this often when a schema is used to follow actions along a pipeline.

In the particular case of the CentralNoticeBannerHistory schema, I see the field names have been reduced to one-char strings.
That will help reduce the size of the event payload. Maybe it would also help if, instead of storing:

{"b":"BANNER_NAME","c":"CAMPAIGN_NAME","t":"TIMESTAMP","s":"STATUS"}

for each history item, we stored just a string with banner, campaign, and status ids instead of names, like:

"B_ID|C_ID|TIMESTAMP|S_ID"

Depending on the length of banner, campaign and status names, this could make the schema hold 2x, 3x, or more history items.
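
A sketch of that packing (the id lookup tables here are hypothetical; the point is only how much shorter the per-item string becomes):

# Sketch: pack a banner-history item into a single pipe-delimited string of
# ids, as suggested above. The id tables are invented for illustration.
BANNER_IDS = {'BANNER_NAME': 17}
CAMPAIGN_IDS = {'CAMPAIGN_NAME': 4}
STATUS_IDS = {'STATUS': 2}


def pack_item(item):
    """{'b': ..., 'c': ..., 't': ..., 's': ...} -> 'B_ID|C_ID|TIMESTAMP|S_ID'"""
    return '|'.join(str(x) for x in (
        BANNER_IDS[item['b']], CAMPAIGN_IDS[item['c']],
        item['t'], STATUS_IDS[item['s']]))


print(pack_item({'b': 'BANNER_NAME', 'c': 'CAMPAIGN_NAME',
                 't': 'TIMESTAMP', 's': 'STATUS'}))
# 17|4|TIMESTAMP|2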

By the way, I think having one-character names for history items is OK, but maybe it is not needed for the root-level schema fields, because those will be the column names in the database tables, and they may be confusing to someone who doesn't know the schema.

In the case of this schema, https://meta.wikimedia.org/wiki/Schema:CentralNoticeBannerHistory, the array of items (the banner history) should probably not be unbounded. It seems the client should establish what is a reasonable number of items to transmit and analyze; the MySQL field where these items will end up also has a size limit.

What is happening at this time is that oversized events are not getting truncated; rather, they are not transmitted at all (they never leave the client).

We can help if you need to split the schema into events so that, rather than sending the whole history for one client (which seems to imply you will be sending repeated data as the history grows), you send "viewing events" one by one.

Tgr added a comment. Oct 1 2015, 2:02 AM

Issues with unnecessary data within the structure aside, we can also look at simply encoding the data more efficiently.

That's one low-hanging fruit (there is typically an 18-byte overhead per key-value pair, since the six JSON metacharacters around a quoted key and value each expand to three bytes when percent-encoded, while a more reasonable format could get away with three or four bytes). The other easy win is referring to the property names and enumerated values by index instead of by name.

With or without any of the techniques above, you could also look at running an actual compression algorithm over the data before URL-encoding. However, I think once techniques like the above are applied, probably the excess %-encoding of binary bytes would outweigh the gains of a generic compression algorithm.

Theoretically you could compress and then base64-encode, or just do Huffman coding with the base64 set as the target alphabet. In practice we probably don't want to incur any significant processing cost on the client machine.

Rather than encoding, I think the main driving factor for the size of the payload is the schema setup. For example, there are several instances of 'catch-all' schemas that would benefit from being split up. We see this often when a schema is used to follow actions along a pipeline.

As long as EL maps schemas to DB tables directly, splitting up is going to be complicated as it rearranges your database completely and makes certain types of queries more difficult.

Anyone mind if I decline this task?

Tgr added a comment. Dec 14 2015, 10:48 PM

Is there an alternative path towards making it possible to log non-tiny amounts of data? I'm not partial to this particular solution, but I don't think the problem itself should be ignored.

ori added a comment. Dec 14 2015, 10:50 PM

I don't think it's a problem; excessively large event payloads are a misuse of the API, as far as I'm concerned.

Tgr added a comment. Dec 14 2015, 11:17 PM

"Excessively" seems to be currently defined as "more than 500 characters" (the example in T114078#1689560 has a JSON payload of 581 chars).

BBlack added a comment.

The JSON payload size isn't what's important here, just the 2K URL limit. We should be able to fit within that for "logging" purposes if we're careful about what we log and how we format the data. The comment you linked to is all about how inefficient our current stuff is (both the data and the encoding of it).

Ottomata added a comment.

A sane solution would be to accept events via POST, instead of encoded query params via varnish logs.

Tgr added a comment. Dec 15 2015, 12:59 AM

@BBlack, yes, which is why I am saying something should be done to improve the current limits. That could be packet splitting, packet compression, a more efficient serialization format, a better encoding... but the status quo is a problem IMO.

@Ottomata AIUI that would mean not relying on Varnish at all since it does not see into POST data. Also not sure how that would work with browsers that don't support the Beacon API.

Ottomata added a comment.

Yeah, it would have to be a special endpoint, and it'd likely only work for direct POSTs. I guess this wouldn't work in your use cases?

Nuria added a comment. Dec 15 2015, 8:03 PM

@Tgr: Sorry, but I certainly disagree that we want to allow events of *any* size, and the 2000-character URL limit seems plenty.
We do not want to optimize the API for an uncommon case, and the reality is that most events fit within the payload limit.

Can you give any other examples besides the very verbose MediaViewer events?
(FYI, CentralNoticeBannerHistory is a poor example, because that EventLogging post is NOT sending a single event but a set of them bundled together as one, which we discouraged FR from doing.)

Milimetric closed this task as Declined. Feb 11 2016, 6:16 PM
Milimetric added a subscriber: Milimetric.

We decided not to split and merge events; this would be too complex. We're happy to increase the limit or think of another solution for the CentralNotice use case.