
Design aggregate tables to drive Action API reports
Closed, Resolved · Public

Description

Design hourly aggregate tables that can be used to generate monthly reports on Action API usage requested in T102079:

  • Number of user agents coming from Labs or third-party services, on a monthly basis + all time (DevRel, to check whether our APIs are increasing adoption)
  • Volume of API requests coming from Labs or third-party services, on a monthly basis (DevRel, to check the trend of usage of our APIs)
  • Ranking of user agents coming from Labs or third-party services with the highest activity, on a monthly basis + all time (DevRel, to help identify the services making intensive use of our APIs)
  • Ranking of most requested actions/parameters, on a monthly basis + all time (DevRel, to help identify usage of our APIs and check it against our documentation, APIs we should promote, etc.)

Retaining the full request information, similar to api.log, in Hadoop to support 30-day reporting is undesirable for several reasons:

  • Large amount of data to store and query
  • Possibility of leaking private/sensitive information in the event of a data breach
  • Possibility of correlating events in ways that create privacy issues (e.g. reconstructing a distinct user's reading history)

Tables of hourly aggregate data would allow us to answer these questions by summing simple counts, and to separate the data in such a way that undesirable correlations are avoided.

Once the reporting tables are designed, we can work backwards to determine what source data we need to capture to populate them, and further restrict the possibility of unwanted correlations by not logging data that is unneeded for the current reporting.
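
As a sketch of that pattern (using the draft action_ua_hourly table proposed later in this task; the partition values, the 'labs' ip_class bucket, and the LIMIT are illustrative), a monthly user-agent ranking reduces to summing hourly counts:

-- Sketch only: monthly ranking of user agents seen from Labs, derived
-- entirely from hourly aggregate rows; no raw request data is needed.
SELECT user_agent,
       SUM(view_count) AS monthly_requests
FROM action_ua_hourly
WHERE year = 2015 AND month = 10
  AND ip_class = 'labs'
GROUP BY user_agent
ORDER BY monthly_requests DESC
LIMIT 50;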

Related Objects

Event Timeline

bd808 claimed this task.
bd808 raised the priority of this task from to Medium.
bd808 updated the task description. (Show Details)
bd808 added subscribers: Tnegrin, Spage, bd808, Qgil.

I have intentionally excluded "Counts of errors (T113672) by action and user agent, in order to identify problem areas and proactively reach out to API clients getting errors (DevRel and documentation)", which is mentioned in T102079. We may ultimately be able to provide a solution for that use case, but it seems to be much more of an operational concern than an analytics concern and may need to be solved with a different toolset.

Thanks!

> Retaining the full request information similar to api.log in Hadoop to support 30 day reporting is undesirable

Fine. How long will the data in T108618: Publish detailed Action API request information to Hadoop be available somewhere?

> Ranking of most requested actions/parameters, on a monthly basis

How will we bucket API parameters for this? Presumably we won't count every combination of API parameters, as the number of buckets would be vast and most of them unnecessary. For example, we might want to know which props are most popular for action=parse, but not which titles are parsed the most.

As a minimal effort we need to separately aggregate the query submodules (the list, meta, and prop values).
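
One possible shape for that separate aggregation (a sketch only, not a settled design; it assumes the params MAP column of the raw action_request table proposed later in this task) is to explode the multi-valued submodule parameters so each submodule is counted once:

-- Sketch only: count each query submodule separately, so that a request
-- like action=query&prop=info|revisions&list=search contributes the rows
--   ('query', 'prop', 'info'), ('query', 'prop', 'revisions'),
--   ('query', 'list', 'search')
SELECT 'query' AS action, p.pname, v.pvalue, COUNT(*) AS view_count
FROM action_request
  LATERAL VIEW explode(params) p AS pname, pvals
  LATERAL VIEW explode(split(p.pvals, '\\|')) v AS pvalue
WHERE params['action'] = 'query'
  AND p.pname IN ('prop', 'list', 'meta')
GROUP BY p.pname, v.pvalue;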

>> Retaining the full request information similar to api.log in Hadoop to support 30 day reporting is undesirable
>
> Fine. How long will the data in T108618: Publish detailed Action API request information to Hadoop be available somewhere?

TBD, but I would hope ~3 days or less. Basically we just need to hold the raw data long enough to ensure that the ETL job aggregating it into the summary tables that drive the reports has run.

The problem with hanging on to this data in an easily queryable format is that it leaks sensitive data. The combination of IP and user-agent, along with the API request details, would allow (for example) reconstructing the browsing history of any native mobile device user. This is obviously user tracking, and we currently don't do user tracking.
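
One way such a short retention window could be enforced (a sketch only; the partition values are examples) is to drop each raw hourly partition as soon as its aggregation job has succeeded:

-- Illustrative only: purge a raw hourly partition once the hourly
-- rollup for it has been verified.
ALTER TABLE action_request
  DROP IF EXISTS PARTITION (year = 2015, month = 10, day = 1, hour = 0);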

>> Ranking of most requested actions/parameters, on a monthly basis
>
> How will we bucket API parameters for this? Presumably we won't count every combination of API parameters, as the number of buckets would be vast and most of them unnecessary. For example, we might want to know which props are most popular for action=parse, but not which titles are parsed the most.
>
> As a minimal effort we need to separately aggregate the query submodules (the list, meta, and prop values).

"We might want to know" needs to go out the window and be replaced with "these are the exact reports we need". We are talking about creating KPI tracking reports here, not expansive data mining systems. Again, the raw data that will feed these aggregations is toxic data to hold on to. If it were leaked through a network breach or a bad actor with insider access, it would reveal things about our users that could be damaging.

I agree with bd808 -- we need to decide what data we need to support our use cases and record just that, subject to normal privacy/ethical concerns of course. We should take this approach across the board.

-Toby

Draft plan at https://www.mediawiki.org/wiki/User:BDavis_%28WMF%29/Projects/Action_API_request_analytics

Currently proposed schema:

CREATE TABLE IF NOT EXISTS action_request (
  dt         STRING             COMMENT 'Timestamp in ISO 8601',
  client_ip  STRING             COMMENT 'IP of requesting user-agent (follows XFF)',
  user_agent STRING             COMMENT 'User-Agent header of request',
  params     MAP<STRING,STRING> COMMENT 'Request parameters',
  ip_class   STRING             COMMENT 'IP based origin derived from client_ip; can be internal, external or labs'
)
COMMENT 'Action API requests'
PARTITIONED BY (
  year  INT COMMENT 'Unpadded year of request',
  month INT COMMENT 'Unpadded month of request',
  day   INT COMMENT 'Unpadded day of request',
  hour  INT COMMENT 'Unpadded hour of request'
);

-- NOTE: action_request.client_ip is not needed at all if we can do the IP classification as part of the import from Kafka.


CREATE TABLE IF NOT EXISTS action_ua_hourly (
  user_agent STRING COMMENT 'Raw user-agent',
  ip_class   STRING COMMENT 'IP based origin, can be internal, external or labs',
  view_count BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by user-agent'
PARTITIONED BY (
  year  INT COMMENT 'Unpadded year of request',
  month INT COMMENT 'Unpadded month of request',
  day   INT COMMENT 'Unpadded day of request',
  hour  INT COMMENT 'Unpadded hour of request'
);


CREATE TABLE IF NOT EXISTS action_action_hourly (
  action     STRING COMMENT 'Action parameter value',
  ip_class   STRING COMMENT 'IP based origin, can be internal, external or labs',
  view_count BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by action'
PARTITIONED BY (
  year  INT COMMENT 'Unpadded year of request',
  month INT COMMENT 'Unpadded month of request',
  day   INT COMMENT 'Unpadded day of request',
  hour  INT COMMENT 'Unpadded hour of request'
);


CREATE TABLE IF NOT EXISTS action_param_hourly (
  action     STRING COMMENT 'Action parameter value',
  param      STRING COMMENT 'Parameter name, can be prop, list, meta, generator, etc',
  value      STRING COMMENT 'Parameter value',
  ip_class   STRING COMMENT 'IP based origin, can be internal, external or labs',
  view_count BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by action, parameter and value'
PARTITIONED BY (
  year  INT COMMENT 'Unpadded year of request',
  month INT COMMENT 'Unpadded month of request',
  day   INT COMMENT 'Unpadded day of request',
  hour  INT COMMENT 'Unpadded hour of request'
);
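
A minimal sketch (not the actual ETL; the partition values are examples) of how the hourly user-agent rollup could be derived from the raw action_request table above:

-- Sketch only: aggregate one hour of raw requests into the summary table.
INSERT OVERWRITE TABLE action_ua_hourly
  PARTITION (year = 2015, month = 10, day = 1, hour = 0)
SELECT user_agent,
       ip_class,
       COUNT(*) AS view_count
FROM action_request
WHERE year = 2015 AND month = 10 AND day = 1 AND hour = 0
GROUP BY user_agent, ip_class;
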
-- Create tables for Action API stats
--
-- Usage:
--     hive -f create-action-tables.sql --database wmf

CREATE EXTERNAL TABLE IF NOT EXISTS action_ua_hourly (
  userAgent STRING COMMENT 'Raw user-agent',
  wiki      STRING COMMENT 'Target wiki (e.g. enwiki)',
  ipClass   STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
  viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by user-agent and wiki'
PARTITIONED BY (
  year      INT COMMENT 'Unpadded year of request',
  month     INT COMMENT 'Unpadded month of request',
  day       INT COMMENT 'Unpadded day of request',
  hour      INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;


CREATE EXTERNAL TABLE IF NOT EXISTS action_action_hourly (
  action    STRING COMMENT 'Action parameter value',
  wiki      STRING COMMENT 'Target wiki (e.g. enwiki)',
  ipClass   STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
  viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by action and wiki'
PARTITIONED BY (
  year      INT COMMENT 'Unpadded year of request',
  month     INT COMMENT 'Unpadded month of request',
  day       INT COMMENT 'Unpadded day of request',
  hour      INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;


CREATE EXTERNAL TABLE IF NOT EXISTS action_param_hourly (
  action    STRING COMMENT 'Action parameter value',
  param     STRING COMMENT 'Parameter name, can be prop, list, meta, generator, etc',
  value     STRING COMMENT 'Parameter value',
  wiki      STRING COMMENT 'Target wiki (e.g. enwiki)',
  ipClass   STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
  viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by action, parameter, value and wiki'
PARTITIONED BY (
  year      INT COMMENT 'Unpadded year of request',
  month     INT COMMENT 'Unpadded month of request',
  day       INT COMMENT 'Unpadded day of request',
  hour      INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;

-- NOTE: there are many params whose distinct values we would not want to
-- count at all (e.g. maxlag, smaxage, maxage, requestid, origin,
-- centralauthtoken, titles, pageids). It will be easier to whitelist
-- parameter names in the ETL process than to selectively blacklist them.
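
A minimal sketch of that whitelist in the aggregation query (the parameter list and partition values are illustrative, and the raw action_request table is assumed to carry the wiki and ipClass fields used by this revision of the schema):

-- Sketch only: whitelist parameter names during aggregation so that
-- sensitive or high-cardinality values (titles, pageids, tokens, ...)
-- never reach action_param_hourly.
INSERT OVERWRITE TABLE action_param_hourly
  PARTITION (year = 2015, month = 10, day = 1, hour = 0)
SELECT params['action'],
       p.pname,
       p.pvalue,
       wiki,
       ipClass,
       COUNT(*)
FROM action_request
  LATERAL VIEW explode(params) p AS pname, pvalue
WHERE year = 2015 AND month = 10 AND day = 1 AND hour = 0
  AND p.pname IN ('prop', 'list', 'meta', 'generator', 'format')
GROUP BY params['action'], p.pname, p.pvalue, wiki, ipClass;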

Please document these tables once they are final, similar to how the rest of the final tables are documented on wikitech: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest

Having a page on wikitech is convenient for documenting changes and known issues with the dataset.

> Please document these tables once they are final, similar to how the rest of the final tables are documented on wikitech: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest
>
> Having a page on wikitech is convenient for documenting changes and known issues with the dataset.

https://wikitech.wikimedia.org/wiki/Analytics/Data/ApiAction