Page MenuHomePhabricator

Augment NEL reports with GeoIP country code and network AS number
Closed, ResolvedPublic

Description

It's going to be hard to make much use of this data without both of these, especially AS number.

@Ottomata what's the best mechanism to get these attached? Is the short-term answer different than the long-term answer (which might be stream processing stuff?)

In the very short term, I don't think it would be too hard to have the Traffic layer attach both of these pieces of data as a header that it inserts before sending the request upstream, and then we could insert it into a field from within eventgate-wikimedia...

Event Timeline

The long-term answer (which might be stream processing stuff?)

is stream processing stuff

In the very short term, I don't think it would be too hard to have the Traffic layer attach both of these pieces of data as a header that it inserts before sending the request upstream, and then we could insert it into a field from within eventgate-wikimedia...

Right now, eventgate-wikimedia does set some values in the events from HTTP headers, however which headers are set is hardcoded.

I recently had an idea though!
T263466: EventGate idea: use presence of schema properties in http.(request|response)_headers to automatically set header values in event data

I think that would help with your use case: you have the traffic layer set the headers, and then if those properties are explicitly set in http.(request|response)_headers in your schema, eventgate-wikimedia will add them into your event.
It requires a bit of coding work to make sure having both properties and additionalProperties set in the same schema object doesn't break some of our ingestion logic, but it shouldn't be hard to do.

I cannot think of case of any piece of data we hold that includes an IP where we would not want the geo localization. So even doing that by default at all times makes sense. I know we want to be as explicit as possible with the data schemas but going from IP-> geo seems a very sane default so maybe geo data could be an "optional" field in "properties" or "additionalproperties" ?

Nuria, except in T262626: Remove http.client_ip from EventGate default schema (again), this is exactly what Timo wants. He thinks that we should remove IPs from the data, and use the GeoIP country header that varnish sets for geolocation purposes by default, and only include http.client_ip in rare cases where it is really needed.

Additionally, Chris' NEL is going to eventgate-logging-external and into logstash. He wants to have these fields in logstash, and we currently only do auto geocoding of IPs as part of the Refine step for Hive ingestion.

The long-term answer (which might be stream processing stuff?)

is stream processing stuff

In the very short term, I don't think it would be too hard to have the Traffic layer attach both of these pieces of data as a header that it inserts before sending the request upstream, and then we could insert it into a field from within eventgate-wikimedia...

Right now, eventgate-wikimedia does set some values in the events from HTTP headers, however which headers are set is hardcoded.

I recently had an idea though!
T263466: EventGate idea: use presence of schema properties in http.(request|response)_headers to automatically set header values in event data

I think that would help with your use case: you have the traffic layer set the headers, and then if those properties are explicitly set in http.(request|response)_headers in your schema, eventgate-wikimedia will add them into your event.
It requires a bit of coding work to make sure having both properties and additionalProperties set in the same schema object doesn't break some of our ingestion logic, but it shouldn't be hard to do.

All makes sense, thanks! This sounds like the right angle and fairly generic/elegant. Let me know if I can help with the eventgate-wikimedia patch or other work.

I'll work on the Traffic angle here -- that layer doesn't have immediate access to the ASN, but hopefully it won't be too hard to modify what MMDB stuff is already there.

Question on the need for data @CDanis : Is the data augmentation needed in stream, or would refinement on the cluster be sufficient?

Question on the need for data @CDanis : Is the data augmentation needed in stream, or would refinement on the cluster be sufficient?

Currently these reports are going to Logstash; I don't think there's any refinement possible there?

We also value near-realtime for this stream, which I'm not sure but I think complicates any Hive/refinement discussion?

Happy to be wrong ofc :)

Currently these reports are going to Logstash; I don't think there's any refinement possible there?

Not the refinement we do usually on the cluster indeed.

We also value near-realtime for this stream, which I'm not sure but I think complicates any Hive/refinement discussion?

Correct - We use hourly batch in spark, but also wait for late data. near real-time implies stream-processing :)

Thanks for helping my understanding of your usage :)

@JAllemandou I think adding geo info (or rather swapping IP by Geo info ) is something that would need to happen in this case (in the absence of stream processing before logstash) in EventGate upon ingestion /validation and that seems a choice we agree on , right @Ottomata?

He thinks that we should remove IPs from the data, and use the GeoIP country header that varnish sets for geolocation purposes by default, and only include http.client_ip in rare cases where it is really needed.

Right, I agree with this idea sorry my comment was not clear. I think in almost all cases what we want is geo info and not IPs.

Change 630314 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] geoip VCL: init/free functions are now reusable

https://gerrit.wikimedia.org/r/630314

Change 630315 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] geoip VCL: add a 'which' param to get_geo_xcip

https://gerrit.wikimedia.org/r/630315

Change 630316 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] VCL: Attach a variety of GeoIP info as bereq headers

https://gerrit.wikimedia.org/r/630316

Hey @Ottomata, I meant to get around to this last quarter but didn't. Would very much like to get some mechanism in place soon -- do you have any ideas on the best path forward? Maybe we should set up a quick meeting to chat through options?

The long term solution here is still not clear and is very tied up with some other yet undefined long term projects, like Data Governance.

Let's just make this happen in the short term just like we have for client ip and user agent.

You can set headers, right? If you like, make a patch to eventgate-wikimedia adding an entry to http.request_headers (or http.response_headers if this is a response header?) just like this.

The long term solution here is still not clear and is very tied up with some other yet undefined long term projects, like Data Governance.

Let's just make this happen in the short term just like we have for client ip and user agent.

Sounds good to me!

You can set headers, right? If you like, make a patch to eventgate-wikimedia adding an entry to http.request_headers (or http.response_headers if this is a response header?) just like this.

Perfect, yes. https://gerrit.wikimedia.org/r/630316 does exactly this from inside VCL. I'll work on getting that merged.

Thanks!

Change 630314 merged by CDanis:
[operations/puppet@production] geoip VCL: init/free functions are now reusable

https://gerrit.wikimedia.org/r/630314

Change 630315 merged by CDanis:
[operations/puppet@production] geoip VCL: add a 'which' param to get_geo_xcip

https://gerrit.wikimedia.org/r/630315

Mentioned in SAL (#wikimedia-operations) [2021-02-02T19:52:52Z] <cdanis> ❌cdanis@cumin1001.eqiad.wmnet ~ πŸ•’β˜• sudo cumin A:cp 'disable-puppet "cdanis deploying I7003b7b6 and Idd0e124f5 T263496"'

Mentioned in SAL (#wikimedia-operations) [2021-02-02T20:12:18Z] <cdanis> βœ”οΈ cdanis@cumin1001.eqiad.wmnet ~ πŸ•’β˜• sudo cumin A:cp 'enable-puppet "cdanis deploying I7003b7b6 and Idd0e124f5 T263496"' # test on cp2027 looks good, perhaps slightly-increased Varnish CPU consumption but hard to be sure

Mentioned in SAL (#wikimedia-operations) [2021-02-04T18:28:27Z] <cdanis> βœ”οΈ cdanis@cumin1001.eqiad.wmnet ~ πŸ•œβ˜• sudo cumin A:cp 'disable-puppet "cdanis deploying I498a0c4af T263496"'

Change 630316 merged by CDanis:
[operations/puppet@production] VCL: Attach a variety of GeoIP info as bereq headers; test GeoIP

https://gerrit.wikimedia.org/r/630316

Mentioned in SAL (#wikimedia-operations) [2021-02-04T18:45:56Z] <cdanis> T263496 deployed I498a0c4af on cp2027 at 18:29; now deploying on cp3060

Mentioned in SAL (#wikimedia-operations) [2021-02-04T19:06:18Z] <cdanis> βœ”οΈ cdanis@cumin1001.eqiad.wmnet ~ πŸ•œβ˜• sudo cumin A:cp 'enable-puppet "cdanis deploying I498a0c4af T263496"'

Mentioned in SAL (#wikimedia-operations) [2021-02-16T15:11:38Z] <cdanis> βœ”οΈ cdanis@cumin1001.eqiad.wmnet ~ πŸ•™β˜• sudo cumin 'A:cp-upload and A:eqsin' 'disable-puppet "cdanis deploying Iab4d211 T263496"'

@Ottomata just one more question for you!

The long term solution here is still not clear and is very tied up with some other yet undefined long term projects, like Data Governance.

Let's just make this happen in the short term just like we have for client ip and user agent.

Sounds good to me!

You can set headers, right? If you like, make a patch to eventgate-wikimedia adding an entry to http.request_headers (or http.response_headers if this is a response header?) just like this.

Perfect, yes. https://gerrit.wikimedia.org/r/630316 does exactly this from inside VCL. I'll work on getting that merged.

That patch is now merged and live.

Would you prefer I special-cased the headers involved here? Or would you prefer that I allow schema users to 'opt in' to these headers by adding them as properties under their http.request_headers definitions? FWIW I've already implemented the latter, but could easily go back and implement the former.

Let's do the former, I think doing the latter (using schemas to configure included data) is NOT going to be the right solution after all. So, special case these headers just like the other ones. Thanks!

(Sorry, just edited ^, somehow a very important 'not' did not make it through my typing fingers)

Change 667948 had a related patch set uploaded (by CDanis; owner: CDanis):
[eventgate-wikimedia@master] Add 'headers_to_fields' config option

https://gerrit.wikimedia.org/r/667948

Change 667948 merged by Ottomata:
[eventgate-wikimedia@master] Add 'http_request_headers_to_fields' config option

https://gerrit.wikimedia.org/r/667948

Change 670281 had a related patch set uploaded (by CDanis; owner: CDanis):
[eventgate-wikimedia@master] use defaultsDeep to construct eventgate-wikimedia options

https://gerrit.wikimedia.org/r/670281

Change 670281 merged by CDanis:
[eventgate-wikimedia@master] use defaultsDeep to construct eventgate-wikimedia options

https://gerrit.wikimedia.org/r/670281

Change 670288 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/deployment-charts@master] bump eventgate-logging-external & attach geoip

https://gerrit.wikimedia.org/r/670288

Change 670288 merged by jenkins-bot:
[operations/deployment-charts@master] bump eventgate-logging-external & attach geoip

https://gerrit.wikimedia.org/r/670288

ASN, ISP/organization, country, & subdivision are now visible in Logstash!