
Publish TLS related info to webrequest via varnish
Closed, Resolved · Public

Description

Publish TLS related info to webrequest via varnish so it can be correlated to the rest of webrequest info:

There are about five values that we want to track; all concatenated they are probably smaller than a 255-character string. We can later refine this comma-separated value (or similar) into a map that we can query, similar to how we do for the user agent. This data is currently in Grafana, but it will be much more useful if it lives alongside the rest of the webrequest data.

https://grafana.wikimedia.org/d/000000452/tls-ciphers-by-data-center?orgId=1
https://grafana.wikimedia.org/d/000000458/tls-ciphersuite-explorer?orgId=1
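
As an illustration of the map pattern mentioned above, querying the existing user_agent_map in Hive looks roughly like this (the partition values below are just placeholders):

select user_agent_map['browser_family'], count(1) as c
from wmf.webrequest
where webrequest_source = 'text'
  and year = 2019 and month = 9 and day = 25 and hour = 0
group by user_agent_map['browser_family'];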

Event Timeline

Restricted Application added a subscriber: Aklapper.
herron triaged this task as Medium priority. Sep 25 2019, 3:46 PM
herron added a project: observability.
Milimetric raised the priority of this task from Medium to High. Sep 30 2019, 4:29 PM
Milimetric moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

Ping @BBlack to give us some priorities around this work

From looking at the dashboards, it looks like the entire set of values we want to collect is what is currently displayed in the TLS Ciphersuite Explorer graph.
I wonder how that data is presented by Varnish: a single string (as in the explorer graph) or one string per ciphersuite parameter.

@JAllemandou it's currently split like this:

-   VCL_Log        CP-TLS-Version: TLSv1.2
-   VCL_Log        CP-TLS-Session-Reused: 0
-   VCL_Log        CP-Key-Exchange: X25519
-   VCL_Log        CP-Auth: ECDSA
-   VCL_Log        CP-Cipher: CHACHA20-POLY1305-SHA256
-   VCL_Log        CP-Full-Cipher: ECDHE-ECDSA-CHACHA20-POLY1305

Thanks @Vgutierrez - I think representing those values in a map (or an array) is probably the easiest and most flexible way.
@elukey: Will you teach me varnishkafka so that I can get those values from VCL_Log into a new field?

Sure! The JSON format of what we collect from Varnish for webrequest is in profile::cache::kafka::webrequest:

format                       => "%{fake_tag0@hostname?${::fqdn}}x %{@sequence!num?0}n ${timestamp_formatter} %{Varnish:time_firstbyte@time_firstbyte!num?0.0}x %{X-Client-IP@ip}o %{X-Cache-Status@cache_status}o %{@http_status}s %{@response_size!num?0}b %{@http_method}m %{Host@uri_host}i %{@uri_path}U %{@uri_query}q %{Content-Type@content_type}o %{Referer@referer}i %{User-Agent@user_agent}i %{Accept-Language@accept_language}i %{X-Analytics@x_analytics}o %{Range@range}i %{X-Cache@x_cache}o %{Accept@accept}i %{Server@backend}o",

Given https://github.com/wikimedia/varnishkafka/blob/master/varnishkafka.c#L906-L907 the new fields should be available with something like %{VCL_Log:CP-TLS-Version@TLS-Version}x (that should translate to TLS-Version: TLSv1.2 in the final payload)
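
For illustration, a rough sketch of how the format line could be extended for the remaining values, following the same %{VCL_Log:...}x pattern (the output field names after the @ are placeholders here, not the final ones):

%{VCL_Log:CP-TLS-Version@tls_version}x %{VCL_Log:CP-Key-Exchange@tls_key_exchange}x %{VCL_Log:CP-Auth@tls_auth}x %{VCL_Log:CP-Cipher@tls_cipher}x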

I set up a test webrequest.conf on cp2001, and confirmed that the solution works!

Side note: varnishkafka set with output = stdout still needs some Kafka-specific configs due to https://github.com/wikimedia/varnishkafka/blob/master/varnishkafka.c#L1969; that is a bug.

Some maths about the data-size increase, using approximate ratios for the values TLS-Version, Key-Exchange, Auth and Cipher from https://grafana.wikimedia.org/d/000000458/tls-ciphersuite-explorer?orgId=1. I have assumed that we also need Full-Cipher, and took the values from Cipher as an approximation.

Currently, one day of webrequest is ~9.18 billion lines averaging ~880 bytes per line.
Adding the full TLS data with long, fully spelled-out JSON names (we can do better) will grow every webrequest line by ~180 bytes on average (~20%).
Assuming compression stays about as efficient as it is now, one day of webrequest will grow from 1.78TB to 2.15TB.
As we keep one month of raw webrequest in the cluster, this means growing from 45TB to 54TB.

I can't really do the math for refined webrequest as it is stored in Parquet, but I think the impact will be a lot smaller given the columnar nature of the file format.

Ping @Ottomata for an opinion on potential impact on kafka.

I don't expect addition of these fields to really impact Kafka or much else! :)

I think representing those values in a map (or an array) is probably the easiest and most flexible way.

I don't think sub-objects or arrays are supported by varnishkafka. We'll have to set each one as a different field.

Or, perhaps some VCL code can combine these into a single var or header, kinda like x-cache or x-analytics? If we could encode this in the same way as x-analytics, it'd be easier to refine into a map in wmf.webrequest.

Or...just put these into X-Analytics? Maybe not...:)

If we could encode this in the same way as x-analytics, it'd be easier to refine into a map in wmf.webrequest.

+1 to this. It makes sense to add a TLS field perhaps, but certainly not 4 columns. Let's please make changes in VCL to consolidate these fields into a map.

Hmm, maybe it could make sense to store some TLS data like the ciphersuite, version or elliptic curve as integers, assuming that all of them have constants assigned in OpenSSL; that would reduce the storage requirements significantly.

Hmm, maybe it could make sense to store some TLS data like the ciphersuite, version or elliptic curve as integers, assuming that all of them have constants assigned in OpenSSL...

I think the size of the strings is not such a big problem, and unless @JAllemandou disagrees, I think we should aim for a map, as this data logically belongs together, similar to the data in X-Analytics.

@Nuria - what you're asking for is something like a combined TLS field with separators? e.g. we construct a 4-part semicolon-delimited string like V=1.2;K=X25519;A=ECDSA;C=CHACHA20POLY1305, and then set that in webrequest as the single field TLS?

I think we don't have strong opinions about the output format (that's more up to you for storage and processing), so long as the net result is that we can use some Turnilo dashboard to slice and dice on all of these axes independently and correlate them with other webrequest data (e.g. ask questions like "What percentage uses K=X25519 && V=1.2?", or "Show me the User-Agent values which use C=AES128-SHA && V!=1.0", or "Show me the percentage of V=1.0 && A=RSA for Asia", etc.).

@BBlack Right, Varnish would send items to varnishkafka similar to how it is done for x-analytics: https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/analytics.inc.vcl.erb#L221

These values are later parsed and stored as a map. Once that is done, they can be picked up by the code that ingests the data into Turnilo, but even in Hive you could query like this:

select user_agent_map from webrequest where tls['CP-Full-Cipher'] = "blah" and year=blah and month=blah and day=blah;

Using the example above, the field Varnish sends to varnishkafka via VCL_Log would look like:
CP-TLS-Version: TLSv1.2;CP-TLS-Session-Reused: 0;CP-Key-Exchange: X25519;CP-Auth: ECDSA;CP-Cipher: CHACHA20-POLY1305-SHA256;CP-Full-Cipher: ECDHE-ECDSA-CHACHA20-POLY1305;
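
As a very rough sketch (not the actual patch), the consolidation in VCL could look something like the following, mirroring how X-Analytics is assembled; the header names and subkeys here are placeholders only:

vcl 4.0;
import std;

sub vcl_deliver {
    # Hypothetical sketch: gather the per-connection TLS attributes (assumed
    # here to be available as request headers; names are placeholders) into one
    # X-Analytics-style key/value string and emit it as a single VCL_Log entry
    # that varnishkafka can then pick up as one field.
    std.log("TLS: vers=" + req.http.X-TLS-Version +
            ";keyx=" + req.http.X-TLS-Key-Exchange +
            ";auth=" + req.http.X-TLS-Auth +
            ";ciph=" + req.http.X-TLS-Cipher);
}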

We probably don't need to send the reused value (it's not that useful for analysis at this level, IMHO), and we don't need to send the full-cipher value either (that's the original string from which some of these fields are extracted, but it doesn't cover the Key-Exchange part fully or the Version at all). All we need is the 4 derived fields: Version, Key-Exchange, Auth, and Cipher. The "CP-" prefix isn't really descriptive and is just an implementation detail. Also I'm assuming from how X-Analytics is set up that the format is k1=v1;k2=v2;.... (equal sign rather than colon).

Does it matter if the labels carry meaningful but redundant "TLS-" prefixes, or is the name of the overall field enough (or does that make things confusing later in Turnilo)? In other words, given a webrequest field named TLS, which of these is preferable?

Version=1.2;KeyExchange=X25519;Auth=ECDSA;Cipher=CHACHA20POLY1305
TLS-Version=1.2;TLS-KeyExchange=X25519;TLS-Auth=ECDSA;TLS-Cipher=CHACHA20POLY1305

Also I'm assuming from how X-Analytics is set up that the format is k1=v1;k2=v2;.... (equal sign rather than colon).

yes, correct.

I think this is fine and it is more concise:
Version=1.2;KeyExchange=X25519;Auth=ECDSA;Cipher=CHACHA20POLY1305

It is similar to how we do it for the user agent: once ingested into Turnilo, fields in user_agent_map are prefixed with "UA" for clarity.

Thanks for the fast feedback loop on the format, @Nuria and @BBlack.
Indeed, having a single field named TLS formatted as described above is best (we could even go for smaller keys, but that would make it less human-readable).
Hadoop will then process that into an explicit map, and we'll load the values into Turnilo as separate fields.
Something to keep in mind: Turnilo is sampled (1 over 128), so for values that show up very rarely, full data processing in Spark will be needed.
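
For reference, a minimal sketch of how a k1=v1;k2=v2 string can be turned into a map during refinement, using Hive's built-in str_to_map (the table and column names here are illustrative, not the actual refine job):

-- Hypothetical refine step: turn the raw "k1=v1;k2=v2" TLS string into a map.
select str_to_map(tls, ';', '=') as tls_map
from wmf_raw.webrequest
where year = 2019 and month = 11 and day = 7 and hour = 0
limit 10;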

@BBlack: we can take a stab at modifying the VCL code if you can CR it, since that needs to happen before the varnishkafka changes.

Hm, another idea...

I don't think sub-objects or arrays are supported by varnishkafka. We'll have to set each one as a different field.

Perhaps we could just use varnishkafka's string formatter and manually format the JSON string? Then we could add sub-objects just by adding { } in the right places.

Not sure if that is better or worse.

Pros:

  • No new custom VCL stuff
  • No parsing of custom key=val;key=val string needed in analytics cluster
  • Could use a struct or a map type in Hive (not 100% sure we can use map easily with our Hive load job, but I think we can?)

Cons:

  • Changes the serialization code path we've been using to produce webrequest for years

We discussed this in the Analytics standup today. We think this is a big enough con to keep us from pursuing this idea. We'll have to move away from X-Analytics when we move to ATS one day anyway, so we'll wait to spend time on this formatting problem until that happens.

Let's go with a new field and key=value;key=value string.

Agreed, let's not go down that road right here (because we have a burning need for this data pronto), but side note to keep in mind: "one day" is really really soon (like, we'll probably be migrating varnishkafka stuff to ATS next quarter).

Change 548443 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] TLS Analytics: send to VCL log in condensed form

https://gerrit.wikimedia.org/r/548443

Change 548444 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] TLS Analytics: pick up VCL Log in webrequest

https://gerrit.wikimedia.org/r/548444

Patches above look sane? I went ahead and shortened the key names down to the minimum to prevent bloat at these layers. We can always give them better descriptive names when they're pulled back to e.g. Turnilo in queries. V is Version, K is Key Exchange, A is Auth, C is Cipher. Too short?

Hm, would be ok with me, but likely whatever we choose we'll be stuck with forever. I tend to prefer descriptive names in general, but Joseph might be more concerned with the efficiency. Let's use lowercase though please! :)

Hm, would be ok with me, but likely whatever we choose we'll be stuck with forever. I tend to prefer descriptive names in general, but Joseph might be more concerned with the efficiency.

Whichever is fine with me, too. I'm not sure if there are limitations on the overall length of the varnishkafka inputs/outputs.

Let's use lowercase though please! :)

Do you mean in the "TLS" label at the varnishkafka level, or the 4x subkey names? (or both?)

Nevermind, I see it in the gerrit comments

I'm not sure if there are limitations on the overall length of the varnishkafka inputs/outputs.

Shouldn't be from varnishkafka, but we've got Kafka brokers configured at a 4MB message size limit.

I checked message lengths in one day of webrequest, and we top out at 4916 bytes.
I think Kafka will be fine message-size-wise, and I assume it'll also be fine on global data growth.
Same for Hadoop.
Let's make this happen.

Change 548443 merged by BBlack:
[operations/puppet@production] TLS Analytics: send to VCL log in condensed form

https://gerrit.wikimedia.org/r/548443

@BBlack: once we deploy the VCL/varnishkafka changes we need to change our refine pipeline to read these values. When we deploy those changes the values will be available in the webrequest table, and after that we will re-do the indexing of the webrequest dataset into Turnilo.

Change 548444 merged by BBlack:
[operations/puppet@production] TLS Analytics: pick up VCL Log in webrequest

https://gerrit.wikimedia.org/r/548444

Agreed, let's not go down that road right here (because we have a burning need for this data pronto), but side note to keep in mind: "one day" is really really soon (like, we'll probably be migrating varnishkafka stuff to ATS next quarter).

Probably this is not the best place to talk about this, but next quarter seems really close :) Who is going to replace Varnishkafka? Is there any plan from Traffic, or should we (as Analytics) schedule time for it? I know that with fifo-log-demux the job of the new "atskafka" should be relatively easy (read from a socket and push to Kafka), but we really need metrics (like we have now) to build monitoring on top of it. If the migration is so close, can we open a task (if there isn't one) and start the discussion? :)

Yes, next quarter is closer every day :) We probably should at least start talking informally about it and deciding who can line up time and making a task (if you'd like to work on it and have time that'd be awesome, but otherwise we will because it will be our next blocker on that front).

@BBlack: once we deploy the VCL/varnishkafka changes we need to change our refine pipeline to read these values. When we deploy those changes the values will be available in the webrequest table, and after that we will re-do the indexing of the webrequest dataset into Turnilo.

I think we're ready for this, data's flowing from our end.

Change 548736 had a related patch set uploaded (by Joal; owner: Joal):
[analytics/refinery@master] Add TLS information to webrequest hive/druid

https://gerrit.wikimedia.org/r/548736

Yes, next quarter is closer every day :) We probably should at least start talking informally about it and deciding who can line up time and making a task (if you'd like to work on it and have time that'd be awesome, but otherwise we will because it will be our next blocker on that front).

I'd love to, but I'd need to check with @Nuria and my team first of course!

Change 548736 merged by Joal:
[analytics/refinery@master] Add TLS information to webrequest hive/druid

https://gerrit.wikimedia.org/r/548736

Done! Example:

spark2-shell --master yarn --driver-memory 4G --executor-memory 8G --executor-cores 4 --conf spark.dynamicAllocation.maxExecutors=32 --jars /srv/deployment/analytics/refinery/artifacts/refinery-job.jar

spark.sql("select tls_map['ciph'], count(1) as c from wmf.webrequest where webrequest_source = 'text' and year = 2019 and month = 11 and day = 7 and hour = 0 group by tls_map['ciph']").show(100, false)

Change 550932 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Add protocol to TLS analytics fields

https://gerrit.wikimedia.org/r/550932

Change 550933 had a related patch set uploaded (by BBlack; owner: BBlack):
[operations/puppet@production] Add session to TLS analytics fields

https://gerrit.wikimedia.org/r/550933

Change 550933 abandoned by BBlack:
Add session to TLS analytics fields

Reason:
squished elsewhere

https://gerrit.wikimedia.org/r/550933

Change 550932 abandoned by BBlack:
Add prot and sess fields to TLS analytics

Reason:
squished elsewhere

https://gerrit.wikimedia.org/r/550932

Change 550932 restored by BBlack:
Add prot and sess fields to TLS analytics

Reason:
err hmmm don't abandon this one

https://gerrit.wikimedia.org/r/550932

Change 550932 merged by BBlack:
[operations/puppet@production] Add prot and sess fields to TLS analytics

https://gerrit.wikimedia.org/r/550932