Redioscope can generate reports of the top n clients. It would be useful to be able to query these reports from Superset and the stats servers.
Rationale:
While most of the relevant data is already present in the webrequest data stream, the user ID is not. Based on the webrequest data, we could generate a list of the most active clients by user agent or by IP address, but not by user ID.
Note that putting the user ID (or the full rate limit key) into the webrequest data stream has been declined for privacy reasons (this was discussed in the context of T417864). Per-user data would therefore need to be pre-aggregated before it reaches the data lake.
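To illustrate what pre-aggregation could mean here, the following is a minimal sketch, not the actual Redioscope implementation. It assumes the per-request user IDs are available as a plain iterable; the function name `top_n_clients` and the sample IDs are hypothetical. Only the aggregated counts would leave the process, not the individual requests.

```python
from collections import Counter
from typing import Iterable


def top_n_clients(user_ids: Iterable[str], n: int) -> list[tuple[str, int]]:
    """Return the n most frequent user IDs with their request counts.

    Hypothetical helper: the input is assumed to hold one user ID per request.
    """
    counts = Counter(user_ids)
    # most_common sorts by count, highest first, and truncates to n entries.
    return counts.most_common(n)


# Hypothetical sample data: one entry per observed request.
sample = ["u1", "u2", "u1", "u3", "u1", "u2"]
report = top_n_clients(sample, 2)
```

The resulting list of (user ID, count) pairs is the kind of pre-aggregated record that could then be shipped to the data lake instead of raw per-request data.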
Implementation:
Data lake intake can be done via Kafka events or HDFS (needs Kerberos). HDFS REST is currently disabled.