
Design aggregate tables to drive Action API reports
Closed, Resolved · Public

Description

Design hourly aggregate tables that can be used to generate monthly reports on Action API usage requested in T102079:

  • Number of user agents coming from Labs or third-party services, on a monthly basis + all time (DevRel, to check whether our APIs are increasing adoption)
  • Volume of API requests coming from Labs or third-party services, on a monthly basis (DevRel, to check the trend of usage of our APIs)
  • Ranking of user agents coming from Labs or third-party services with the highest activity, on a monthly basis + all time (DevRel, to help identify the services making intensive use of our APIs)
  • Ranking of most requested actions/parameters, on a monthly basis + all time (DevRel, to help identify usage of our APIs and check it against our documentation, APIs we should promote, etc.)

Retaining the full request information, similar to api.log, in Hadoop to support 30-day reporting is undesirable for several reasons:

  • Large amount of data to store and query
  • Possibility of leaking private/sensitive information in the event of a data breach
  • Possibility of correlating events in ways that create privacy issues (e.g. reconstructing a distinct user's reading history)

Tables of hourly aggregate data would allow us to answer these questions by summing simple counts, and to separate the data in such a way that undesirable correlations are avoided.

Once the reporting tables are designed, we can work backwards to determine what source data we need to capture to populate them, and further restrict the possibility of unwanted correlations by not logging data that is unneeded for the current reporting.
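
As a sketch of that pattern (using the draft action_ua_hourly table proposed later in this task; the partition values, the 'labs' ip_class bucket, and the LIMIT are illustrative), a monthly user-agent ranking reduces to summing hourly counts:

-- Sketch only: monthly ranking of user agents seen from Labs, derived
-- entirely from hourly aggregate rows; no raw request data is needed.
SELECT user_agent,
       SUM(view_count) AS monthly_requests
FROM action_ua_hourly
WHERE year = 2015 AND month = 10
  AND ip_class = 'labs'
GROUP BY user_agent
ORDER BY monthly_requests DESC
LIMIT 50;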

Related Objects

Event Timeline

bd808 claimed this task.
bd808 raised the priority of this task from to Medium.
bd808 updated the task description. (Show Details)
bd808 added subscribers: Tnegrin, Spage, bd808, Qgil.

I have intentionally excluded "Counts of errors (T113672) by action and user agent, in order to identify problem areas and proactively reach out to API clients getting errors (DevRel and documentation)", which is mentioned in T102079. We may ultimately be able to provide a solution for that use case, but it seems to be much more of an operational concern than an analytics concern and may need to be solved with a different toolset.

Thanks!

> Retaining the full request information similar to api.log in Hadoop to support 30 day reporting is undesirable

Fine. How long will the data in T108618: Publish detailed Action API request information to Hadoop be available somewhere?

> Ranking of most requested actions/parameters, on a monthly basis

How will we bucket API parameters for this? Presumably we won't count every combination of API parameters, as the number of buckets would be vast and most of them unnecessary. For example, we might want to know which props are most popular for action=parse, but not which titles are parsed the most.

As a minimal effort we need to separately aggregate the query submodules (the list, meta, and prop values).
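
One possible shape for that separate aggregation (a sketch only, not a settled design; it assumes the params MAP column of the raw action_request table proposed later in this task) is to explode the multi-valued submodule parameters so each submodule is counted once:

-- Sketch only: count each query submodule separately, so that a request
-- like action=query&prop=info|revisions&list=search contributes the rows
--   ('query', 'prop', 'info'), ('query', 'prop', 'revisions'),
--   ('query', 'list', 'search')
SELECT 'query' AS action, p.pname, v.pvalue, COUNT(*) AS view_count
FROM action_request
  LATERAL VIEW explode(params) p AS pname, pvals
  LATERAL VIEW explode(split(p.pvals, '\\|')) v AS pvalue
WHERE params['action'] = 'query'
  AND p.pname IN ('prop', 'list', 'meta')
GROUP BY p.pname, v.pvalue;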

>> Retaining the full request information similar to api.log in Hadoop to support 30 day reporting is undesirable
>
> Fine. How long will the data in T108618: Publish detailed Action API request information to Hadoop be available somewhere?

TBD, but I would hope ~3 days or less. Basically we just need to hold the raw data long enough to ensure that the ETL job aggregating it into the summary tables that drive the reports has run.

The problem with hanging on to this data in an easily queryable format is that it leaks sensitive data. The combination of IP and user-agent, along with the API request details, would allow (for example) reconstructing the browsing history of any native mobile device user. This is obviously user tracking, and we currently don't do user tracking.
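
One way such a short retention window could be enforced (a sketch only; the partition values are examples) is to drop each raw hourly partition as soon as its aggregation job has succeeded:

-- Illustrative only: purge a raw hourly partition once the hourly
-- rollup for it has been verified.
ALTER TABLE action_request
  DROP IF EXISTS PARTITION (year = 2015, month = 10, day = 1, hour = 0);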

>> Ranking of most requested actions/parameters, on a monthly basis
>
> How will we bucket API parameters for this? Presumably we won't count every combination of API parameters, as the number of buckets would be vast and most of them unnecessary. For example, we might want to know which props are most popular for action=parse, but not which titles are parsed the most.
>
> As a minimal effort we need to separately aggregate the query submodules (the list, meta, and prop values).

"We might want to know" needs to go out the window and be replaced with "these are the exact reports we need". We are talking about creating KPI tracking reports here, not expansive data mining systems. Again, the raw data that will feed these aggregations is toxic data to hold on to. If it were leaked through a network breach or a bad actor with insider access, it would reveal things about our users that could be damaging.

I agree with bd808 -- we need to decide what data we need to support our use cases and record just that, subject to normal privacy/ethical concerns of course. We should take this approach across the board.

-Toby

Draft plan at https://www.mediawiki.org/wiki/User:BDavis_%28WMF%29/Projects/Action_API_request_analytics

Currently proposed schema:

CREATE TABLE IF NOT EXISTS action_request (
  dt         STRING             COMMENT 'Timestamp in ISO 8601',
  client_ip  STRING             COMMENT 'IP of requesting user-agent (follows XFF)',
  user_agent STRING             COMMENT 'User-Agent header of request',
  params     MAP<STRING,STRING> COMMENT 'Request parameters',
  ip_class   STRING             COMMENT 'IP based origin derived from client_ip; can be internal, external or labs'
)
COMMENT 'Action API requests'
PARTITIONED BY (
  year  INT COMMENT 'Unpadded year of request',
  month INT COMMENT 'Unpadded month of request',
  day   INT COMMENT 'Unpadded day of request',
  hour  INT COMMENT 'Unpadded hour of request'
);

-- NOTE: action_request.client_ip is not needed at all if we can do the IP classification as part of the import from Kafka.


CREATE TABLE IF NOT EXISTS action_ua_hourly (
  user_agent STRING COMMENT 'Raw user-agent',
  ip_class   STRING COMMENT 'IP based origin, can be internal, external or labs',
  view_count BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by user-agent'
PARTITIONED BY (
  year  INT COMMENT 'Unpadded year of request',
  month INT COMMENT 'Unpadded month of request',
  day   INT COMMENT 'Unpadded day of request',
  hour  INT COMMENT 'Unpadded hour of request'
);


CREATE TABLE IF NOT EXISTS action_action_hourly (
  action     STRING COMMENT 'Action parameter value',
  ip_class   STRING COMMENT 'IP based origin, can be internal, external or labs',
  view_count BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by action'
PARTITIONED BY (
  year  INT COMMENT 'Unpadded year of request',
  month INT COMMENT 'Unpadded month of request',
  day   INT COMMENT 'Unpadded day of request',
  hour  INT COMMENT 'Unpadded hour of request'
);


CREATE TABLE IF NOT EXISTS action_param_hourly (
  action     STRING COMMENT 'Action parameter value',
  param      STRING COMMENT 'Parameter name, can be prop, list, meta, generator, etc',
  value      STRING COMMENT 'Parameter value',
  ip_class   STRING COMMENT 'IP based origin, can be internal, external or labs',
  view_count BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by action, parameter and value'
PARTITIONED BY (
  year  INT COMMENT 'Unpadded year of request',
  month INT COMMENT 'Unpadded month of request',
  day   INT COMMENT 'Unpadded day of request',
  hour  INT COMMENT 'Unpadded hour of request'
);
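
A minimal sketch (not the actual ETL; the partition values are examples) of how the hourly user-agent rollup could be derived from the raw action_request table above:

-- Sketch only: aggregate one hour of raw requests into the summary table.
INSERT OVERWRITE TABLE action_ua_hourly
  PARTITION (year = 2015, month = 10, day = 1, hour = 0)
SELECT user_agent,
       ip_class,
       COUNT(*) AS view_count
FROM action_request
WHERE year = 2015 AND month = 10 AND day = 1 AND hour = 0
GROUP BY user_agent, ip_class;
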
-- Create tables for Action API stats
--
-- Usage:
--     hive -f create-action-tables.sql --database wmf

CREATE EXTERNAL TABLE IF NOT EXISTS action_ua_hourly (
  userAgent STRING COMMENT 'Raw user-agent',
  wiki      STRING COMMENT 'Target wiki (e.g. enwiki)',
  ipClass   STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
  viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by user-agent and wiki'
PARTITIONED BY (
  year      INT COMMENT 'Unpadded year of request',
  month     INT COMMENT 'Unpadded month of request',
  day       INT COMMENT 'Unpadded day of request',
  hour      INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;


CREATE EXTERNAL TABLE IF NOT EXISTS action_action_hourly (
  action    STRING COMMENT 'Action parameter value',
  wiki      STRING COMMENT 'Target wiki (e.g. enwiki)',
  ipClass   STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
  viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by action and wiki'
PARTITIONED BY (
  year      INT COMMENT 'Unpadded year of request',
  month     INT COMMENT 'Unpadded month of request',
  day       INT COMMENT 'Unpadded day of request',
  hour      INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;


CREATE EXTERNAL TABLE IF NOT EXISTS action_param_hourly (
  action    STRING COMMENT 'Action parameter value',
  param     STRING COMMENT 'Parameter name, can be prop, list, meta, generator, etc',
  value     STRING COMMENT 'Parameter value',
  wiki      STRING COMMENT 'Target wiki (e.g. enwiki)',
  ipClass   STRING COMMENT 'IP based origin, can be wikimedia, wikimedia_labs or internet',
  viewCount BIGINT COMMENT 'Number of requests'
)
COMMENT 'Hourly summary of Action API requests bucketed by action, parameter, value and wiki'
PARTITIONED BY (
  year      INT COMMENT 'Unpadded year of request',
  month     INT COMMENT 'Unpadded month of request',
  day       INT COMMENT 'Unpadded day of request',
  hour      INT COMMENT 'Unpadded hour of request'
)
STORED AS PARQUET;

-- NOTE: there are many params whose distinct values we would not want to
-- count at all (e.g. maxlag, smaxage, maxage, requestid, origin,
-- centralauthtoken, titles, pageids). It will be easier to whitelist
-- parameter names in the ETL process than to selectively blacklist them.
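
A minimal sketch of that whitelist in the aggregation query (the parameter list and partition values are illustrative, and the raw action_request table is assumed to carry the wiki and ipClass fields used by this revision of the schema):

-- Sketch only: whitelist parameter names during aggregation so that
-- sensitive or high-cardinality values (titles, pageids, tokens, ...)
-- never reach action_param_hourly.
INSERT OVERWRITE TABLE action_param_hourly
  PARTITION (year = 2015, month = 10, day = 1, hour = 0)
SELECT params['action'],
       p.pname,
       p.pvalue,
       wiki,
       ipClass,
       COUNT(*)
FROM action_request
  LATERAL VIEW explode(params) p AS pname, pvalue
WHERE year = 2015 AND month = 10 AND day = 1 AND hour = 0
  AND p.pname IN ('prop', 'list', 'meta', 'generator', 'format')
GROUP BY params['action'], p.pname, p.pvalue, wiki, ipClass;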

Please document these tables once they are final, similar to how the rest of the final tables are documented on wikitech: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest

Having a page on wikitech is convenient for documenting changes and known issues with the dataset.

> Please document these tables once they are final, similar to how the rest of the final tables are documented on wikitech: https://wikitech.wikimedia.org/wiki/Analytics/Data/Webrequest
>
> Having a page on wikitech is convenient for documenting changes and known issues with the dataset.

https://wikitech.wikimedia.org/wiki/Analytics/Data/ApiAction