redioscope: periodically publish top clients to the data lake
Open, Low, Public

Description

Redioscope can generate reports of the top n clients. It would be useful to be able to query those reports from Superset and the stats servers.

Rationale:
While most of the relevant data is already present in the webrequests data stream, the user ID is not. Based on the webrequests data we could generate a list of the most active clients by user agent or by IP address, but not by user ID.

Note that putting the user ID (or the full rate limit key) into the webrequest data stream has been declined for privacy reasons (this was discussed in the context of T417864). Per-user data would need to be pre-aggregated.
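To illustrate what "pre-aggregated" could mean here, the following is a minimal sketch, not Redioscope's actual implementation. It assumes per-request records are available as dicts with a hypothetical user_id field, and it reduces them to a top-n list of counts so that no per-request rows leave the aggregation step.

```python
from collections import Counter


def top_clients(requests, n):
    """Reduce per-request records to a top-n list of per-user request counts.

    `requests` is an iterable of dicts; the "user_id" key is a hypothetical
    field name used only for this illustration.
    """
    counts = Counter(record["user_id"] for record in requests)
    # most_common(n) returns the n most frequent keys, highest count first.
    return [
        {"user_id": user_id, "request_count": count}
        for user_id, count in counts.most_common(n)
    ]


if __name__ == "__main__":
    sample = [
        {"user_id": "bot-a"},
        {"user_id": "bot-a"},
        {"user_id": "bot-a"},
        {"user_id": "user-b"},
        {"user_id": "user-b"},
        {"user_id": "user-c"},
    ]
    for row in top_clients(sample, 2):
        print(row)
```

Only the aggregated rows would then be published; the raw request records stay where they are.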

Implementation:
Data lake intake can be done via Kafka events or HDFS (needs Kerberos). HDFS REST is currently disabled.

Event Timeline

daniel updated the task description.

@daniel can you provide background information on how this supports KR or metrics work?

@daniel can you provide background information on how this supports KR or metrics work?

It allows us to gauge and predict the impact of rate limit changes on power users, enabling us to reach out to affected community members to avoid disruption. So far, I am collecting and evaluating this data manually. Doing so has enabled us to spot edge cases and e.g. reach out to Mike Peel before his bot got hit by rate limits.

On a higher level, our goal is to impact the community as little as possible. So the set of established users over the limit should be small. If it grows, we need to know and be able to take action. Knowing the individuals affected allows us to do so in a targeted way.

You can think of this as the unspoken second part of the KR "reduce the percentage of unidentified automated traffic", which is "...without disrupting community processes".

Change #1285886 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[operations/deployment-charts@master] WIP: redioscope: add redioscate cron job

https://gerrit.wikimedia.org/r/1285886

Hm.

@daniel let's discuss! Just putting data in Kafka does not automatically get it into a queryable data lake table.

See: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/Hadoop_Event_Ingestion_Lifecycle

In general, if you follow https://wikitech.wikimedia.org/wiki/Event_Platform/Producer_Requirements#Events, they will automatically be ingested into a Hive table.

If you can't set all the required fields, there may be a workaround. But an event schema and a stream declared in EventStreamConfig are probably required.

Also, you can produce the JSON to Kafka via kcat, but alternatively you could just POST it to an eventgate.

Happy to help!

Thank you for looking into this @Ottomata!

I can use kcat, and I can add the required fields (assuming I can figure out the authentication bit). I can also come up with a trivial schema... how hard is it to update that?

Also, in order to get ingested into a Hive table, what topic would I need to publish to, and what do I put into the stream field?

And finally, how would I access that data in that table?

The best docs I have for you are unfortunately focused on EventLogging analytics instrumentation. They are still relevant! Just ignore anything EventLogging specific.

I can also come up with a trivial schema

Creating a schema: https://wikitech.wikimedia.org/wiki/Event_Platform/Instrumentation_How_To#Creating_a_new_schema

what do I put into the stream field?

Declaring a stream: https://wikitech.wikimedia.org/wiki/Event_Platform/Instrumentation_How_To#Configuring_the_stream

how hard is it to update that?

Rules for schema changes: https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#Backwards_compatible_modifications_only

Often when we are developing new schemas & streams, we put schemas in a development/ namespace directory, and then give the streams a suffix such as .dev0 or .rc0. Whenever we make a backwards-incompatible change during development, we can just make a new major schema version and use a new stream name.

See: https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Configuration#Stream_versioning

The meta.stream field should be set to the exact stream name you declare in EventStreamConfig.
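As a rough illustration of the fields discussed above, here is a minimal sketch of wrapping a report row in an event envelope. The stream name, schema URI, and payload fields are hypothetical placeholders, and the set of fields shown ($schema, meta.stream, dt) is an assumption based on my reading of the Producer Requirements page linked earlier; that page is the authoritative list.

```python
from datetime import datetime, timezone

# Hypothetical names for illustration only; the real values are whatever
# gets declared in the schema repository and in EventStreamConfig.
STREAM_NAME = "redioscope.top_clients.dev0"
SCHEMA_URI = "/development/redioscope/top_clients/1.0.0"


def build_event(payload, stream=STREAM_NAME, schema_uri=SCHEMA_URI, now=None):
    """Return a copy of `payload` with the envelope fields added."""
    if now is None:
        now = datetime.now(timezone.utc)
    event = dict(payload)
    event["$schema"] = schema_uri
    # meta.stream must match the declared stream name exactly.
    event["meta"] = {"stream": stream}
    # ISO 8601 timestamp in UTC.
    event["dt"] = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return event


if __name__ == "__main__":
    print(build_event({"user_id": "bot-a", "request_count": 3}))
```

Keeping the stream name and schema URI as module-level constants makes it easy to switch to a new .devN stream after a backwards-incompatible schema change.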

what topic would I need to publish to

Since you are producing directly with kcat, you'll need to produce to a datacenter-prefixed topic, e.g. eqiad.<my_stream_name_here>. It also sort of depends on which Kafka cluster you are producing to.
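The topic naming rule can be sketched as a small helper that also assembles a kcat producer invocation. This is only an illustration: the broker address and stream name in the example are hypothetical, and any authentication options the cluster requires are deliberately left out.

```python
def datacenter_topic(datacenter, stream):
    """Prefix the stream name with the datacenter, e.g. eqiad.<stream>."""
    return f"{datacenter}.{stream}"


def kcat_produce_command(broker, datacenter, stream):
    """Build the argv for a kcat producer that reads messages from stdin.

    -P selects producer mode, -b sets the broker, -t sets the topic.
    Authentication settings are not included in this sketch.
    """
    return ["kcat", "-P", "-b", broker, "-t", datacenter_topic(datacenter, stream)]


if __name__ == "__main__":
    # Hypothetical broker and stream name, for illustration only.
    print(kcat_produce_command("kafka.example:9092", "eqiad", "redioscope.top_clients.dev0"))
```

The resulting list could be passed to subprocess.run with the JSON events supplied on stdin, one event per line.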

Q: are you sure you don't want to just POST the events? These could probably go to eventgate-analytics. You could do something like:

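redioscan ... | jq ... | curl -H 'Content-Type: application/json' -d @- https://eventgate-analytics.discovery.wmnet:4592/v1/events
# or something like that (curl -d defaults to a form content type, so the JSON header is set explicitly)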

And you could skip all the kerberos Kafka auth stuff, or worrying about which topic to produce to.
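For a cron job written in Python, the same POST could be sketched with the standard library alone. The URL is the one from the curl example above; that the endpoint accepts a JSON array of events with a JSON content type is an assumption here, and the function names are made up for this illustration.

```python
import json
import urllib.request

# Endpoint taken from the curl example in this thread.
EVENTGATE_URL = "https://eventgate-analytics.discovery.wmnet:4592/v1/events"


def build_eventgate_request(events, url=EVENTGATE_URL):
    """Build (but do not send) a POST request carrying the events as JSON."""
    body = json.dumps(events).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_events(events, url=EVENTGATE_URL, timeout=10):
    """Send the events and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so a
    rejected batch surfaces as an exception rather than a return value.
    """
    request = build_eventgate_request(events, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Splitting request construction from sending keeps the payload easy to inspect or log before anything goes over the network.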

daniel triaged this task as Low priority. Mon, Jun 1, 9:38 PM