Page MenuHomePhabricator

Remove http.client_ip from EventGate default schema (again)
Closed, ResolvedPublic

Description

It looks like this was inherited from EventLogging for compatibility, but it'd be great if we could phase this out sooner rather than later to avoid technical debt from accumulating that would need additional work and migration in the future.

In the EventLogging system, we were actually very close to removing it. We used a hashed IP for any legacy use cases and for compat, but to my knowledge all uses of have been long migrated by now.

For example, for events that need to be associated by country, instrumentions generally send the Geo.country value directly as part of the schema. (Schema example, Instrumentation JS example).

In fact, while I didn't realise this until now, it seems we actually did remove it from the EL infra in 2016:

I guess it slipped back in at some point to facillitate compatibility for derivative fields that in the new system are generated in the eventgate-wikimedia layer, and I guess it was just natural/easier to pass the varnishkafka intake onto the stream consumers unchanged.

Would be great to phase this out again and document easy ways to achieve any related use cases as they come up.

Outcome criteria

For the "default" output of EventGate streams as seen in Kafka (such as those consumed by Logstash for client-errors, and by webperf/navtiming in the future), to not contain any http.client_ip fields.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

We should coordinate with @jlinehan and @Mholloway on the client error logging change, but yea I think that would work.

We should coordinate with @mpopov and Product Analytics to remove http.client_ip from the analytics common schema, so that IPs are not sent for analytics instrumentation events by default.

jlinehan triaged this task as High priority.
jlinehan moved this task from Inbox to Next on the Product-Data-Infrastructure board.

@mpopov If you prepare a schema patch I'll CR and make a patch to bump the version on the clientError instrument.

In this case...I think this will be a backwards incompatible change, so we should probably bump the major schema version number...or force it through Jenkins. I _think_ things will be ok in Hive....not 100% though.

In this case...I think this will be a backwards incompatible change, so we should probably bump the major schema version number...or force it through Jenkins. I _think_ things will be ok in Hive....not 100% though.

Okay, I'll be interested to see since this kind of thing is bound to happen again. By the way, what is the code that loads data from Kafka into Hive?

We should coordinate with @mpopov and Product Analytics to remove http.client_ip from the analytics common schema, so that IPs are not sent for analytics instrumentation events by default.

secondary:/fragment/analytics/common references primary:/fragment/http and since client_ip can't be removed from that fragment, I guess we'll have to duplicate all fields except that one from the http fragment into analytics common and then re-materialize all the schemas in secondary, which I can do but I want make sure we're all in agreement that this is how it should be done

secondary:/fragment/analytics/common references primary:/fragment/http and since client_ip can't be removed from that fragment, I guess we'll have to duplicate all fields except that one from the http fragment into analytics common and then re-materialize all the schemas in secondary

No objection but it does seem like something we should be able to do more naturally. Perhaps we should revisit this at some point between now and the start of mass schema migration, unless there are other solutions or folks are willing to take it up now, which will block this in the meantime.

FYI @CDanis is working on T257527: automatically collect network error reports from users' browsers (Network Error Logging API) which has expects to have http.client_ip in logstash.

It doesn't necessarily *have* to be logstash, but, we want this data in some system that allows for easy ad-hoc aggregation, as well as allowing to view 'full' individual events. AFAIK Logstash/Kibana is the only such system we have that fits?

AFAIK Logstash/Kibana is the only such system we have that fits?

Superset + Presto/Hive?

Probably not quite as easy as logstash though; have to write some SQL to do it.

Change 627600 had a related patch set uploaded (by Bearloga; owner: Bearloga):
[schemas/event/secondary@master] Remove http.client_ip field

https://gerrit.wikimedia.org/r/627600

jlinehan lowered the priority of this task from High to Low.Sep 21 2020, 3:19 PM
jlinehan moved this task from Doing to Reviewing on the Product-Data-Infrastructure board.

Patch is out there but requires further discussion.

Sorry for the drive-by comment and apologies if I'm missing something here, but it seems if we drop IPs we'll lose the ability to filter client side errors impacting one IP address. The IPs have been really useful for distinguishing between "this is impacting one user who is running a faulty gadget" and "this is impacting lots of users"

See T264453 and T265131 for more background on this product need.

http.client_ip was not intentionally added to this schema. It was inherited from an unrelated default for more restricted or programmatic usecases. It would likely not have been deployed if it was specifically asked for. It was removed from EventLogging years ago for similar reasons, and I don't think we should go back on this, that's not pariticularly mission aligned imho.

See T264453#6561910 for various ways individual use cases can be addressed in more sensitive ways.

I understand and I'm not saying keep IPs. I'm just pointing out this will create challenges for us with diagnosing client-side errors effectively using tooling that people are already relying on (T265131#6533078) and I'd love if we could find solutions to those problems before pushing forward with this change.

@Jdlrobson and @Krinkle - what about hashing the IPs? A hashed IP would still tell you how many IPs are involved, without revealing any individual IP.

A hashed IP would still tell you how many IPs are involved, without revealing any individual IP

For it to be truly not revealing on an 2^32 space it will probably needs to be salted.

If the intent is to decide whether all errors are from same user you can send the number of errors for that session of that type and that would tell you the piece of info you want to know. A count of 1 that appears 100 times would tell you errors are widespread, a count of say, 20 that you see a few times would tell you errors are concentrated on a few users. Makes sense?

Change 635304 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[schemas/event/primary@master] Remove http.client_ip from fragment/http, add new fragment/http/client_ip

https://gerrit.wikimedia.org/r/635304

Ok so in https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/635304 I've removed client_ip from fragment/http, made a new fragment/http/client_ip schema, and $refed them both in the ref-ing schemas (except for mediawiki/client/error) to keep them them same. I've effectively removed http.client_ip from mediawiki/client/error, which should be ok in this case, because logstash does not care about the schema, and we don't ingest the data anywhere else.

@CDanis @Krinkle this does leave http.client_ip in the w3c/reportingapi/network_error events, which do get ingested into logstash. Should we remove it there too, or leave it?

If the intent is to decide whether all errors are from same user you can send the number of errors for that session of that type and that would tell you the piece of info you want to know.

Flushing this a bit more. The number of errors for a device does not need to be per session but rather can be a tally:

  1. error A happens
  2. compute some error identification (md5 of trace?)
  3. increment local counter of this error on session storage
  4. send error with local counter

Just like session id a browser re -start will clear error counters

@CDanis @Krinkle this does leave http.client_ip in the w3c/reportingapi/network_error events, which do get ingested into logstash. Should we remove it there too, or leave it?

For the network error events, I would really prefer that we leave IP addresses.

At the most basic level of troubleshooting data, until we figure out the mechanism we use to fix T263496, we don't have important data like GeoIP country or AS number in logstash, so looking that data up afterwards based on IP address is our only option to get it.

Even once we fix that -- when diagnosing weird network-level issues, it's often necessary to either gather packet captures filtering on the specific IP, and more commonly to perform traceroutes to the specific IPs experiencing trouble from multiple different locations (often both from your own servers/routers, and also from the routers of some intermediate networks).

The specific IPs are often necessary because the address-based hashing performed by techniques like ECMP means that a different IP on the same subnet might not traverse the same paths (and thus, not expose the same issues).

Alternatively, instead of producing these to eventgate-logging-external and to logstash, we could produce them to eventgate-main or eventgate-analytics-external, and you could query them with SQL in Hive and in Superset. If necessary, we could also mirror these to kafka logging (we don't have any mirroring for kafka logging atm though) and ingest into logstash with a custom logstash filter removing the IPs.

Krinkle raised the priority of this task from Low to Medium.Oct 22 2020, 8:44 PM

Raising priority per parent task.

This task is about the default and by extent the client error schema.

Should we create a separate task for network_error/reportingapi? I suppose that could indeed be addressed in various ways, either by partial or complete omission from Logstash and using more restricted stores instead, or by e.g. adding the data Geo and ASN data in a processing step between edge intake to EventGate and Kafka/Logstash. E.g. a simple python processor like we have for Webperf events could presumably add any and all properties we want fairly trivially, including to perhaps preserve a very loose/small hash of the IP (one that has a small address space, but would e.g. help to quickly confirm or analyse events from likely the same IP within a narrow timeframe).

If the intent is to decide whether all errors are from same user you can send the number of errors for that session of that type and that would tell you the piece of info you want to know.

Flushing this a bit more. The number of errors for a device does not need to be per session but rather can be a tally:

  1. error A happens
  2. compute some error identification (md5 of trace?)
  3. increment local counter of this error on session storage
  4. send error with local counter

I think for client errors in particular, this would not be practical. There is too much variance between logically equivalent errors, and storing the data would also consume a lot of space over time. Noting in particular that browsers still lack, short of cookies, any form of native LRU or TTL-based store, and that other stores tend to be slow to access and persist quite long including through restarts due to session restoring. There's also to consider the added cost of the logic shipped on all pages for powering this.

However, using a random ID locally could certainly help us group things on the backend in Logstash, similar to what we unintentionally ended up doing with IP addresses. Either a random ID persisted for a short time for this purpose, or a small address space (fnv32, 8 bytes?) hash over the effective UserID/UserIP, seems fine from my perspective.

The value of mw.user.sessionId() would work for my needs.

Ok, next week I'll work on merging https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/635304, making sure it disables client_ip for mediawiki/client/error, and then also removing it from the analytics/common schemas in the schemas/event/secondary repo so that it isn't enabled by default for new schemas.

We'll examine what it would mean to remove it for existing event streams and schemas after that.

Change 635304 merged by Ottomata:
[schemas/event/primary@master] Remove http.client_ip from fragment/http, add new fragment/http/client_ip

https://gerrit.wikimedia.org/r/635304

Change 636656 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[schemas/event/secondary@master] Bump reference to fragment/http to 1.2.0 to remove client_ip field

https://gerrit.wikimedia.org/r/636656

Change 636656 abandoned by Ottomata:
[schemas/event/secondary@master] Bump reference to fragment/http to 1.2.0 to remove client_ip field

Reason:
Reusing change-id at https://gerrit.wikimedia.org/r/c/schemas/event/secondary/ /627600, abandoning this one

https://gerrit.wikimedia.org/r/636656

Change 636659 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[schemas/event/secondary@master] Remove http.client_ip from session_tick without bumping version

https://gerrit.wikimedia.org/r/636659

https://gerrit.wikimedia.org/r/635304 has been merged and a new 1.1.0 version of mediawiki/client/error has been created. Jason and team will also be creating a new 1.2.0 version to add a field for the tags naming conflict shortly. Instrumentation code needs to be updated to make use of either of these new schema versions, so we'll wait for Product Infra to make and update to the new mediawiki/client/error/1.2.0. Once that change has been made and deployed, http.client_ip should stop showing up in those events in logstash.

Once https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/627600 is merged, no new instrumentation schemas will automatically add http.client_ip. I've updated documentation for this here: https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#http_information

For the few existing analytics schemas that have http.client_ip, I'd rather just leave them as they are for now, so we don't accidentally upset users. (I betcha SearchSatisfaction folks use this). This data only goes to Hadoop and not to logstash.

When I migrate legacy EventLogging schemas as part of T259163: Migrate legacy metawiki schemas to Event Platform, I will not include http.client_ip in them. These schemas currently DO collect this data via the ip field from eventcapsule, and the migration will end up dropping that data. I think in this case it is better to do this now rather than later, especially since we had meant to do this LONG ago in T126366: Add IP field only to schemas that need it. Remove it from EL capsule and do not collect it by default {mole}.

I suppose that could indeed be addressed in various ways, either by partial or complete omission from Logstash and using more restricted stores instead, or by e.g. adding the data Geo and ASN data in a processing step between edge intake to EventGate and Kafka/Logstash. E.g. a simple python processor like we have for Webperf events could presumably add any and all properties we want fairly trivially,

Yes, indeed; we've been discussing this in a mix of T263496, T263466, and a bit in T263672.

Alternatively, instead of producing these to eventgate-logging-external and to logstash, we could produce them to eventgate-main or eventgate-analytics-external, and you could query them with SQL in Hive and in Superset. If necessary, we could also mirror these to kafka logging (we don't have any mirroring for kafka logging atm though) and ingest into logstash with a custom logstash filter removing the IPs.

At this point I would be fine with abandoning our use of Logstash for NEL and using Hive/Druid/Superset/Turnilo instead. Perhaps we should just open a new task for that migration?

@Ottomata: Product-Analytics doesn't use client IPs when analyzing event logging data but would strongly prefer to keep geocoded data down to subdivision level (continent, country, subdivision). It's okay to scrub more granular stuff like postal code and city, but we have in the past relied on analyzing by subdivision (especially for countries like India).

Please feel free to follow up with me or @SNowick_WMF

Could we selectively choose when we need this data, rather than collecting it by default? Right now, if http.client_ip is a field in the schema, client_ip will be collected and then data will be geocoded. The change I'm making is that http.client_ip will not be included by default, but a schema author can still choose to include it manually.

I'd like to stop collecting client IPs for existing legacy EventLogging schemas as we migrate to event platform as part of T259163: Migrate legacy metawiki schemas to Event Platform. I guess: as I migrate, I will attempt to contact the schema owners to find out if client IPs and geocoded data are necessary for that schema, and if not, stop collecting it. How's that sound?

I'd like to stop collecting client IPs for existing legacy EventLogging schemas as we migrate to event platform as part of T259163: Migrate legacy metawiki schemas to Event Platform. I guess: as I migrate, I will attempt to contact the schema owners to find out if client IPs and geocoded data are necessary for that schema, and if not, stop collecting it. How's that sound?

That sounds very reasonable!

Change 627600 merged by Ottomata:
[schemas/event/secondary@master] Bump reference to fragment/http to 1.2.0 to remove client_ip field

https://gerrit.wikimedia.org/r/627600

Change 636659 merged by Ottomata:
[schemas/event/secondary@master] Remove http.client_ip from session_tick without bumping version

https://gerrit.wikimedia.org/r/636659

@Nuria
Just like session id a browser re -start will clear error counters

I just wanted to raise that this approach has some issues, because sessionStorage isn't "browser session"-local, it's "current tab"-local. So anyone who browses wikipedia using multiple tabs or by opening new windows will read to this approach as multiple people. (Krinkle's comment may have covered this, but it was sort of obliquely.)

I don't actually have any data on preferences for different styles here, I just know that I'd personally have my encountered-errors wildly overrepresented in their apparent spread in this paradigm. So, wait, let's do it. :D