Design hourly aggregate tables that can be used to generate the monthly reports on Action API usage requested in T102079:
- Number of user agents coming from Labs or third party services, on a monthly basis + all time (DevRel, to check whether adoption of our APIs is increasing)
- Volume of API requests coming from Labs or third party services, on a monthly basis (DevRel, to track the usage trend of our APIs)
- Ranking of user agents coming from Labs or third party services with the highest activity, on a monthly basis + all time (DevRel, to help identify the services making intensive use of our APIs)
- Ranking of most requested actions/parameters, on a monthly basis + all time (DevRel, to help identify how our APIs are used and check that against our documentation, APIs we should promote...)
Retaining full request information in Hadoop, similar to api.log, to support 30-day reporting is undesirable for several reasons:
- Large amount of data to store and query
- Possibility of leaking private/sensitive information in the event of a data breach
- Possibility of correlating events in ways that create privacy issues (e.g. a distinct user's reading history)
Aggregate tables holding hourly data can let us answer these questions by summing simple counts, and can keep the data separated in such a way that undesirable correlations are not possible.
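A minimal sketch of the idea, assuming two hypothetical hourly tables (one keyed by user agent, one keyed by action) held in memory as counters. The event fields, table names, and sample values are illustrative only and are not taken from api.log or any existing schema; filtering to Labs or third party traffic is left out.

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw events: (timestamp, user agent, action).
events = [
    (datetime(2015, 6, 1, 10, 5), "ToolA/1.0", "query"),
    (datetime(2015, 6, 1, 10, 40), "ToolA/1.0", "edit"),
    (datetime(2015, 6, 1, 11, 15), "ToolB/2.3", "query"),
    (datetime(2015, 7, 2, 9, 0), "ToolA/1.0", "query"),
]


def hour_bucket(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


# Two separate hourly tables. No row links a user agent to an action,
# so the two dimensions cannot be correlated after aggregation.
agent_hourly = Counter()
action_hourly = Counter()
for ts, agent, action in events:
    hour = hour_bucket(ts)
    agent_hourly[(hour, agent)] += 1
    action_hourly[(hour, action)] += 1


def monthly_totals(hourly):
    """Sum hourly counts into {(year, month): Counter({key: count})}."""
    months = {}
    for (hour, key), count in hourly.items():
        months.setdefault((hour.year, hour.month), Counter())[key] += count
    return months


agents_by_month = monthly_totals(agent_hourly)
actions_by_month = monthly_totals(action_hourly)

june = (2015, 6)
request_volume = sum(agents_by_month[june].values())  # total requests in June
distinct_agents = len(agents_by_month[june])          # user agents seen in June
top_agents = agents_by_month[june].most_common(10)    # ranking by activity
top_actions = actions_by_month[june].most_common(10)  # ranking of actions
```

All four requested reports reduce to sums over hourly rows here. The all-time figures would follow from the same summation without the month grouping.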
Once the reporting tables are designed, we can work backwards to determine what source data we need to capture to populate them, and further restrict the possibility of unwanted correlations by not logging data that the current reporting does not need.
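Working backwards could look roughly like the following sketch: the set of fields to log is derived from the key columns of the aggregate tables, and everything else in a request record is dropped before storage. The table definitions and record fields are hypothetical placeholders, not a proposed schema.

```python
# Hypothetical aggregate table definitions: name -> key columns.
AGGREGATE_TABLES = {
    "agent_hourly": ("hour", "user_agent"),
    "action_hourly": ("hour", "action"),
}


def required_fields(tables):
    """Union of all key columns: the only source fields worth capturing."""
    return set().union(*tables.values())


def minimize(record, fields):
    """Keep only the fields that some aggregate table needs."""
    return {key: value for key, value in record.items() if key in fields}


# Hypothetical raw request record with more detail than reporting needs.
raw = {
    "hour": "2015-06-01T10",
    "user_agent": "ToolA/1.0",
    "action": "query",
    "ip": "192.0.2.1",
    "params": "titles=Example",
}

needed = required_fields(AGGREGATE_TABLES)
logged = minimize(raw, needed)
```

Adding a report later means adding a table definition, which makes any growth in the logged fields explicit.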