System administrator reviews API usage by client
Closed, ResolvedPublic

Description

"As a System Administrator, I want to review the usage of the API for a given OAuth 2.0 client ID, to evaluate scale, investigate misuse or troubleshoot problems with API usage."

I think this mainly means making sure we log the OAuth 2.0 client ID when reporting via Kafka, and having a way to review those in the dashboard.

Event Timeline

eprodromou added a subscriber: Pchelolo.

I've moved this ticket from the Core REST API to the Wikimedia API Gateway initiative to make it clearer.

It would be good to also log the client IDs of Action API calls and Core REST API using OAuth 1.0 or 2.0 that don't go through the API Gateway, but that may be a separate discussion.

I've added @Pchelolo to help get this resolved. Hopefully the hardest part is reporting to Kafka and making sure there's a way to sort requests by Client ID in the analytics UI.

Where do we start from

The API Gateway envoy instance is able to write access logs to a file or stdout in JSON format, specified here. This is piped into syslog on the host, which is then shipped to kafka and eventually gets to logstash. When the rate of requests to the gateway reaches any significant numbers, we would need to turn off shipping the access logs to logstash because of the volume of the logs.

With the current, basic setup, a single kafka topic (rsyslog-notice) contains NOTICE-level logs for all hosts/applications, with the application specified in the log message.

Where we want to go

As a start, on one side we want to reach feature parity for request logging with the action API. Thus, we want a schema-ed log to be delivered to the kafka-jumbo cluster and injected into the analytics infrastructure. Further analysis can be done at that point.

Additionally, we need to explore if we can attach a client_id label to prometheus metrics. Most likely this would be too much cardinality, but we should still check.

How do we get there

Originally I was envisioning mirroring the logs from logging kafka to jumbo kafka, but it seems like that would not be a very easy solution, since a lot of filtering will need to happen - logging kafka topics don't differentiate by application. Additionally, logging kafka topics for api-gateway would contain both request logs and application logs, which would require more sophisticated filtering. Changing the logging pipeline for this one-off doesn't seem like a good solution either.

We need to explore alternatives. One possibility we've discussed with @Ottomata was to create a version of eventgate that would read input from a file instead of listening to HTTP, and deploy it as a sidecar to eventgate. From the point of view of unifying event injection this seems like the best option, but it brings a lot of dependencies to the eventgate chart - we need a sidecar container with nodejs in it...

Alternatively, we could look for third-party (or in-house) software capable of redirecting a file into a kafka topic - that will be simpler, but we would bypass schema validation.

Alternatively, we could contribute support for directing access logs to kafka natively in envoy. That again will bypass schema validation, and is in general a very significant amount of work. On the bright side, offloading this feature to upstream will make our lives easier in the long run.

Alternatively.... <I guess I come up with more ideas, this is just a conversation starter>

Alternatively, we could create a sidecar that reads from a file/stdin and POSTs via HTTP to eventgate

Alternatively, we could look for third-party (or in-house) software capable of redirecting a file into a kafka topic

You mean kafkacat? :p

:) why not?

@Joe @Ottomata I would really appreciate your view on the general approach to ingesting events from gateway T251812#6363665

I kind of like the idea of a simple

`stdin/file > http_poster https://eventgate-analytics.discovery.wmnet:4592/v1/events`

process in a sidecar. Or > kafkacat, not sure which is simpler.
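A minimal sketch of what such an http_poster could look like, assuming EventGate accepts a single JSON event as the body of a POST to the `/v1/events` URL above with a JSON content type. The function names are hypothetical, and there is no batching, retrying or TLS configuration here.

```python
import json
import urllib.request

# URL taken from the comment above; everything else is an illustrative assumption.
EVENTGATE_URL = "https://eventgate-analytics.discovery.wmnet:4592/v1/events"


def build_request(line, url=EVENTGATE_URL):
    """Turn one access-log line into a POST request, or None if it is not a JSON object."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(event, dict):
        return None
    return urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_lines(stream):
    """Read log lines from a file-like object and POST each JSON event."""
    for line in stream:
        request = build_request(line)
        if request is None:
            continue  # skip application logs and other non-JSON output
        with urllib.request.urlopen(request, timeout=5) as response:
            response.read()  # drain the response; error handling is left out
```

The sidecar entry point would call `post_lines(sys.stdin)`. Keeping `build_request` separate from the network call makes the line handling easy to check without a running EventGate.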

Not sure if you dismissed this idea (or if it is really what you called out above as "sophisticated filtering"), but we could have a custom kafka consumer of the rsyslog-notice topic that filters the log events we are interested in and posts those (via HTTP) to EventGate Analytics (with a schema similar to the one you have linked above). It seems easier to think of a kafka consumer/producer that pipes to another topic than to create a whole new path to post events to EventGate.

This producer/consumer consumes from kafka via the kafka protocol but produces to EventGate Analytics via HTTP, so it interacts with kafka directly only on the consumer side.
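The filtering step of such a consumer could be sketched roughly as below. This only covers selecting the messages; the kafka consumer and the HTTP POST are left out. The `program` field name and the use of a `$schema` key to tell request logs from application logs are assumptions for illustration, not the actual rsyslog-notice message layout.

```python
import json


def select_access_logs(raw_messages, app_name="api-gateway", app_field="program"):
    """Yield parsed request-log events for one application from a mixed log stream."""
    for raw in raw_messages:
        try:
            record = json.loads(raw)
        except (TypeError, ValueError):
            continue  # not JSON at all
        if not isinstance(record, dict):
            continue
        # Assumption: the application name is carried in a top-level field.
        if record.get(app_field) != app_name:
            continue
        # Assumption: request logs carry a "$schema" key, application logs do not.
        if "$schema" not in record:
            continue
        yield record
```

Every message on the shared topic still has to be read and parsed just to be discarded, which is the cost of the topic not being split by application.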

I think we will implement measures such as throttling (and blocking) in EventGate, so with that in mind I think we want to avoid scenarios of directly producing to kafka and rather encourage all clients to produce to EventGate (via HTTP, file or whatever means is needed).

Change 619341 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/deployment-charts@master] WIP: Modify api-gateway access logging to conform to schema

https://gerrit.wikimedia.org/r/619341

Ok... unexpected complication - the envoy JSON access_log formatter currently only supports single-level JSON (i.e. no nested objects), so producing anything conforming to our schema is impossible. I'm going to investigate what the reasoning behind it is and report back with a plan.

Ok. I've submitted https://github.com/envoyproxy/envoy/issues/12582 to support nested format structures - that would be the cleanest way of moving forward.

In the meantime, we can have some workarounds.

We can flatten our schema for envoy with some separator, like _, and use jq with a filter like `reduce to_entries[] as $kv ({}; setpath($kv.key|split("_"); $kv.value))` to 'unflatten' it back before sending the JSON structure to eventgate. This will be very prone to errors, but I believe it will be a reasonable workaround until we get a proper solution.
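To make the 'unflatten' step and its failure mode concrete, here is a rough Python equivalent of what that jq filter does. It is only an illustration of the idea (the function name is made up), not what would run in the pipeline.

```python
def unflatten(flat, separator="_"):
    """Rebuild nested objects from flat keys, splitting every key on the separator."""
    nested = {}
    for key, value in flat.items():
        parts = key.split(separator)
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


print(unflatten({"route": "core", "meta_dt": "2020-08-19T18:14:52Z", "http_method": "GET"}))
# {'route': 'core', 'meta': {'dt': '2020-08-19T18:14:52Z'}, 'http': {'method': 'GET'}}

# The error-prone part: field names that contain the separator are split too.
print(unflatten({"total_time_ms": 55}))
# {'total': {'time': {'ms': 55}}}
```

So with _ as the separator, fields such as total_time_ms or request_id would need either a different separator or renaming on the envoy side.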

Change 619512 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/docker-images/production-images@master] Create api-gateway-logstream image.

https://gerrit.wikimedia.org/r/619512

Change 619512 merged by Giuseppe Lavagetto:
[operations/docker-images/production-images@master] Resurrect fluent-bit image

https://gerrit.wikimedia.org/r/619512

Change 619341 merged by jenkins-bot:
[operations/deployment-charts@master] Modify api-gateway access logging to conform to schema

https://gerrit.wikimedia.org/r/619341

Change 621308 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/deployment-charts@master] Enhancements to access logging for api-gateway

https://gerrit.wikimedia.org/r/621308

Change 621308 merged by jenkins-bot:
[operations/deployment-charts@master] Enhancements to access logging for api-gateway

https://gerrit.wikimedia.org/r/621308

Change 621329 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/deployment-charts@master] Filter out null values using fluent-bit

https://gerrit.wikimedia.org/r/621329

Change 621329 merged by jenkins-bot:
[operations/deployment-charts@master] Filter out null values using fluent-bit

https://gerrit.wikimedia.org/r/621329

Boom!

kafkacat -b kafka-jumbo1001.eqiad.wmnet -t staging.api-gateway.request -p 0 -c 1
{"date":1597860892.193604,"$schema":"/api-gateway/request/1.0.0","route":"core","total_time_ms":55,"meta":{"dt":"2020-08-19T18:14:52Z","domain":"en.wiktionary.org","stream":"api-gateway.request","uri":"/core/v1/wiktionary/en/page/cat?a=d","request_id":"1d085724-e34a-4bcf-8d74-1c09b15d588a","id":"8a4a7c98-609e-4f65-b977-3c0cb2e6fb1a"},"http":{"method":"GET","status_code":200,"client_ip":"10.64.0.247","protocol":"HTTP/1.1","request_headers":{"user-agent":"curl/7.52.1"}}}

Note that the example event does not contain the client_id because it's an anonymous request made with curl. After jumping through a lot of hoops, this is now deployed to staging. Authenticated requests will hopefully contain client_id; that is something to verify when deployed to production.
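For the verification, something like the sketch below could be run over a sample of events consumed from the topic, one JSON event per line. Where client_id sits in the event is not shown here, so the sketch searches the whole event for a key with that name; the "anonymous" label and the function names are made up for illustration.

```python
import json
from collections import Counter


def find_client_id(value):
    """Return the first "client_id" value found anywhere in a decoded event, else None."""
    if isinstance(value, dict):
        if "client_id" in value:
            return value["client_id"]
        children = value.values()
    elif isinstance(value, list):
        children = value
    else:
        return None
    for child in children:
        found = find_client_id(child)
        if found is not None:
            return found
    return None


def requests_per_client(lines):
    """Count events per client_id; events without one are counted as "anonymous"."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        counts[find_client_id(event) or "anonymous"] += 1
    return counts
```

An event like the one above would land in the "anonymous" bucket; if production traffic from authenticated clients only ever shows that bucket, the client_id is not making it into the log.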

Next we need to set up TLS for fluent-bit -> eventgate request.

So, we have access logs conforming to the schema shipped to kafka-jumbo in staging and production.

But is this ticket about using those events, or what? What else needs to be done here?

I think if we are getting the data into analytics, we're good. I'll follow up on getting it visible in Turnilo or some other way.

Adding Platform Engineering, as Platform Team Workboards (Green) was archived and open tasks should have an active project tag.

Change #1051407 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/docker-images/production-images@master] Revert "Resurrect fluent-bit image"

https://gerrit.wikimedia.org/r/1051407

4 years later, we don't see any data flowing in the kafka topic created back then. This feature has apparently never been used, but it is costing us maintenance effort, as the image is on buster and we want to remove those images from the registry. Hence, after some discussions in the #wikimedia-serviceops IRC channel, we have decided to disable the functionality in api-gateway and delete the fluent-bit docker image from our repo, as this pipeline is the only user of it. If anyone ever reaches this task and comment and is interested in the functionality implemented during work on this task, it can always be resurrected, assuming it's properly resourced.

akosiaris claimed this task.

I am resolving the task given the comments from 4 years ago. However, to repeat: the functionality added in the course of this task 4 years ago is going to be removed, since it's unused and causes a maintenance burden.

Change #1051407 merged by Alexandros Kosiaris:

[operations/docker-images/production-images@master] Revert "Resurrect fluent-bit image"

https://gerrit.wikimedia.org/r/1051407

Change #1052314 had a related patch set uploaded (by Alexandros Kosiaris; author: Alexandros Kosiaris):

[operations/deployment-charts@master] api-gateway: Remove eventgate logging support

https://gerrit.wikimedia.org/r/1052314

Change #1052314 merged by jenkins-bot:

[operations/deployment-charts@master] api-gateway: Remove eventgate logging support

https://gerrit.wikimedia.org/r/1052314