
System administrator reviews API usage by client
Open, MediumPublic

Description

"As a System Administrator, I want to review the usage of the API for a given OAuth 2.0 client ID, to evaluate scale, investigate misuse or troubleshoot problems with API usage."

I think this mainly means making sure we log the OAuth 2.0 client ID when reporting via Kafka, and having a way to review those in the dashboard.

Event Timeline

eprodromou reassigned this task from eprodromou to Pchelolo.Jul 22 2020, 4:05 PM
eprodromou added a subscriber: Pchelolo.

I've moved this ticket from the Core REST API to the Wikimedia API Gateway initiative to make it clearer.

It would be good to also log the client IDs of Action API calls and Core REST API using OAuth 1.0 or 2.0 that don't go through the API Gateway, but that may be a separate discussion.

I've added @Pchelolo to help get this resolved. Hopefully the hardest part is reporting to Kafka and making sure there's a way to sort requests by Client ID in the analytics UI.

eprodromou added a subscriber: Ottomata.
fdans moved this task from Incoming to Event Platform on the Analytics board.Jul 27 2020, 3:48 PM
Pchelolo moved this task from Ready to Doing on the Platform Team Workboards (Green) board.EditedAug 5 2020, 5:54 PM

Where do we start from

The API Gateway envoy instance can write access logs to a file or stdout in JSON format, specified here. These are piped into syslog on the host, then shipped to kafka and eventually to logstash. Once the request rate to the gateway reaches any significant numbers, we would need to stop shipping the access logs to logstash because of the sheer volume of the logs.
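For context, envoy's file access logger is configured with a `json_format` map of command operators. A minimal sketch of that shape (field names here are illustrative, not the actual api-gateway schema, and the exact config layout varies by envoy version):

```yaml
access_log:
  - name: envoy.access_loggers.file
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
      # /dev/stdout makes the logs available for the syslog pipeline described above
      path: /dev/stdout
      log_format:
        json_format:
          method: "%REQ(:METHOD)%"
          uri: "%REQ(:PATH)%"
          status_code: "%RESPONSE_CODE%"
          total_time_ms: "%DURATION%"
          user_agent: "%REQ(USER-AGENT)%"
```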

With the current, basic setup, a single kafka topic (rsyslog-notice) contains NOTICE-level logs for all hosts/applications, with the application specified in the log message.

Where we want to go

As a start, on one side we want to reach feature parity with action API request logging. Thus, we want a schemaed log to be delivered to the kafka-jumbo cluster and ingested into the analytics infrastructure. Further analysis can be done at that point.

Additionally, we need to explore whether we can attach a client_id label to prometheus metrics. Most likely the cardinality would be too high, but we should still check.

How do we get there

Originally I envisioned mirroring the logs from logging kafka to jumbo kafka, but that would not be an easy solution, since a lot of filtering would need to happen: logging kafka topics don't differentiate by application. Additionally, the logging kafka topics for api-gateway would contain both request logs and application logs, requiring even more sophisticated filtering. Changing the logging pipeline for this one-off doesn't seem like a good solution either.

We need to explore alternatives. One possibility we've discussed with @Ottomata was to create a version of eventgate that reads input from a file instead of listening on HTTP, and deploy it as a sidecar. From the standpoint of unifying event ingestion this seems like the best option, but it brings a lot of dependencies into the chart - we would need a sidecar container with nodejs in it...

Alternatively, we could look for third-party (or in-house) software capable of redirecting a file into a kafka topic - that would be simpler, but it would bypass schema validation.

Alternatively, we could contribute native support for directing access logs to kafka in envoy. That again would bypass schema validation, and is in general a very significant amount of work. On the bright side, offloading this feature to upstream would make our lives easier in the long run.

Alternatively.... <I guess I could come up with more ideas; this is just a conversation starter>

Alternatively, we could create a sidecar that reads from a file/stdin and POSTs via HTTP to eventgate
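That sidecar idea can be sketched as a small read-and-forward loop. In production the delivery command would be an HTTP POST of each JSON line to eventgate (e.g. `curl -s -X POST -H 'Content-Type: application/json' -d @- https://eventgate-analytics.discovery.wmnet:4592/v1/events`, using the endpoint mentioned later in this thread); in this offline sketch `cat` stands in for it:

```shell
# Hypothetical sidecar loop: read JSON access-log lines from stdin and hand
# each one to a delivery command ($1). With `cat` as the delivery command the
# sketch just echoes the lines; swap in curl to actually POST to eventgate.
forward_logs() {
  deliver="$1"
  while IFS= read -r line; do
    printf '%s\n' "$line" | $deliver
  done
}

# Demo on two sample lines:
printf '%s\n' '{"route":"core"}' '{"route":"other"}' | forward_logs cat
```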

> Alternatively, we could look for third-party (or in-house) software capable of redirecting a file into a kafka topic

You mean kafkacat? :p

> You mean kafkacat? :p

:) why not?

Pchelolo added a subscriber: Joe.Aug 6 2020, 3:40 PM

@Joe @Ottomata I would really appreciate your view on the general approach to ingesting events from gateway T251812#6363665

I kind of like the idea of a simple

`stdin/file > http_poster https://eventgate-analytics.discovery.wmnet:4592/v1/events`

process in a sidecar. Or `> kafkacat`; not sure which is simpler.

Nuria added a subscriber: Nuria.Aug 6 2020, 4:28 PM

Not sure if you discarded this idea (or if it is really what you called out above as "sophisticated filtering"), but we could have a custom kafka consumer of the rsyslog-notice topic that filters the log events we are interested in and posts them (via HTTP) to EventGate Analytics (with a schema similar to the one you linked above). It seems easier to think of a kafka consumer/producer that pipes to another topic than to create a whole new path for posting events to EventGate.

This producer/consumer consumes from kafka via the kafka protocol but produces to EventGate Analytics via HTTP, so it interacts with kafka directly only on the consumer side.
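The filtering step of this idea can be sketched with jq. In production the input would come from a kafka consumer (e.g. `kafkacat -C -b <logging-kafka-broker> -t rsyslog-notice -u`), and the matching lines would be POSTed to EventGate Analytics; the `programname` field is an assumption about the rsyslog envelope, and here sample lines stand in for the topic:

```shell
# Hypothetical filter: keep only api-gateway records from the mixed
# rsyslog-notice stream; everything else is dropped.
filter_gateway() {
  jq -c 'select(.programname == "api-gateway")'
}

# Demo with one matching and one non-matching record:
printf '%s\n' \
  '{"programname":"api-gateway","msg":"GET /core/v1/wiktionary"}' \
  '{"programname":"mediawiki","msg":"unrelated"}' | filter_gateway
```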

Nuria added a comment.Aug 6 2020, 4:36 PM

I think we will implement measures such as throttling (and blocking) in eventgate, so with that in mind I think we want to avoid scenarios of producing directly to kafka, and instead encourage all clients to produce to eventgate (via HTTP, file, or whatever means is needed).

Change 619341 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/deployment-charts@master] WIP: Modify api-gateway access logging to conform to schema

https://gerrit.wikimedia.org/r/619341

Ok... an unexpected complication: the envoy JSON access_log formatter currently only supports single-level JSON (i.e. no nested objects), so producing anything conforming to our schema is impossible. I'm going to investigate the reasoning behind this and report back with a plan.

Ok. I've submitted https://github.com/envoyproxy/envoy/issues/12582 to support nested format structures - that would be the cleanest way of moving forward.

In the meantime, we can have some workarounds.

We can flatten our schema for envoy with some separator, like `_`, and use jq with a filter like `reduce to_entries[] as $kv ({}; setpath($kv.key|split("_"); $kv.value))` to 'unflatten' the JSON structure before sending it on to eventgate. This will be quite error-prone, but I believe it is a reasonable workaround until we get a proper solution.
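The workaround can be demonstrated end to end with the jq filter from the comment (the sample keys are illustrative):

```shell
# Unflatten: rebuild nested JSON from flat keys joined by "_", using the
# jq filter quoted above, before the event is forwarded to eventgate.
unflatten() {
  jq -c 'reduce to_entries[] as $kv ({}; setpath($kv.key|split("_"); $kv.value))'
}

echo '{"http_method":"GET","route":"core"}' | unflatten
# → {"http":{"method":"GET"},"route":"core"}
```

This also shows why the approach is error-prone: a legitimately flat key that contains `_`, such as `total_time_ms`, would be wrongly exploded into `total.time.ms`.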

Change 619512 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/docker-images/production-images@master] Create api-gateway-logstream image.

https://gerrit.wikimedia.org/r/619512

Change 619512 merged by Giuseppe Lavagetto:
[operations/docker-images/production-images@master] Resurrect fluent-bit image

https://gerrit.wikimedia.org/r/619512

Change 619341 merged by jenkins-bot:
[operations/deployment-charts@master] Modify api-gateway access logging to conform to schema

https://gerrit.wikimedia.org/r/619341

Change 621308 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/deployment-charts@master] Enhancements to access logging for api-gateway

https://gerrit.wikimedia.org/r/621308

Change 621308 merged by jenkins-bot:
[operations/deployment-charts@master] Enhancements to access logging for api-gateway

https://gerrit.wikimedia.org/r/621308

Change 621329 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/deployment-charts@master] Filter out null values using fluent-bit

https://gerrit.wikimedia.org/r/621329

Change 621329 merged by jenkins-bot:
[operations/deployment-charts@master] Filter out null values using fluent-bit

https://gerrit.wikimedia.org/r/621329

Boom!

kafkacat -b kafka-jumbo1001.eqiad.wmnet -t staging.api-gateway.request -p 0 -c 1
{"date":1597860892.193604,"$schema":"/api-gateway/request/1.0.0","route":"core","total_time_ms":55,"meta":{"dt":"2020-08-19T18:14:52Z","domain":"en.wiktionary.org","stream":"api-gateway.request","uri":"/core/v1/wiktionary/en/page/cat?a=d","request_id":"1d085724-e34a-4bcf-8d74-1c09b15d588a","id":"8a4a7c98-609e-4f65-b977-3c0cb2e6fb1a"},"http":{"method":"GET","status_code":200,"client_ip":"10.64.0.247","protocol":"HTTP/1.1","request_headers":{"user-agent":"curl/7.52.1"}}}

Note that the example event does not contain the client_id because it's an anonymous request made with curl. After jumping through a lot of hoops, this is deployed to staging. Authenticated requests should contain the client_id; that is something to verify once deployed to production.

Next we need to set up TLS for the fluent-bit -> eventgate request.

So, we now have access logs conforming to the schema shipped to kafka-jumbo in staging and production.

But is this ticket about using those events, or what? What else needs to be done here?

I think if we are getting the data into analytics, we're good. I'll follow up on getting it visible in Turnilo or some other way.