
[REQUEST] Data for external API Requests. Harden API Gateway
Open, Needs Triage, Public

Description

Name for main point of contact and contact preference
Virginia Poundstone via phab task comment/slack

What teams or departments is this for?
Platform Engineering (API Value Stream)

What are the details of your request? Include relevant timelines or deadlines
We need to create a dashboard to compare the requests made per endpoint for the following APIs:

It would be a great starting point to see the following available data per endpoint:

  • user_agent
  • user_agent_map
  • access_method
  • ip
  • HTTP Method
  • HTTP Status
  • URI Path
  • URI Query
  • Content Type
  • Response Size

Data Analysis goal:
Define the major "buckets" or types of client callers so we can design an API architecture with their use cases in mind.
A secondary goal is to determine which services are used the most and the least, by whom, for what purpose, and how much.

How will you use this data or analysis?
This data will help inform:

  1. Product decisions based on current API usage
  2. What endpoints we prioritize to maintain and enhance

Is this request urgent or time-sensitive?
It supports our ongoing work; until we have this data, we are making decisions based on guesses. That said, this feels urgent, but it is not technically blocking us.

Event Timeline

@sdkim this is not something we'll be able to take on in the foreseeable future, so we're putting this in our backlog for future reference.

@mpopov may have some pointers to share if someone on your team is able to query the data

@kzimmerman Understood! Thanks for the fast feedback.
@mpopov tips and tricks would be great! I could see if @daniel could maybe help; otherwise I can take notes and create a task on our end.

Prerequisite: if you don't already have someone who can query production databases, whoever you pick would need to go through the production access process. Refer to https://wikitech.wikimedia.org/wiki/Production_access for more information; it may also be helpful to refer to https://www.mediawiki.org/wiki/Product_Analytics/Onboarding#Access_and_security. From there, I recommend prototyping the Hive query in https://hue.wikimedia.org, as it will give helpful messages if there are any issues or errors.

The underlying dataset is the Webrequest table; it is the source of all the data you need. There is A LOT of data in there, so when you're prototyping the query I highly recommend working with just 1 hour of data, for example:

SELECT
  uri_host, http_method, http_status, -- dimensions (examples; use whichever you need)
  SUM(response_size) AS total_size,
  COUNT(1) AS n_requests -- measures
FROM webrequest
WHERE year = 2021 AND month = 6 AND day = 1 AND hour = 12 -- always filter on the partition columns
  AND webrequest_source = 'text'
  -- additional filters, e.g. looking for api.php in uri_path and checking for specific query parameters
GROUP BY uri_host, http_method, http_status -- same dimensions as in the SELECT

A possible starting point is the query that was used to measure search API usage for the decommissioned Discovery/Search dashboard: https://gerrit.wikimedia.org/r/plugins/gitiles/wikimedia/discovery/golden/+/refs/heads/master/modules/metrics/search/search_api_usage. A note on search_classify(): that UDF just helped with processing URI path & query; you can instead use HiveQL's string functions to look for relevant API requests and extract the relevant parameters from URI path & query. If curious, refer to this definition and this supporting code.
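
For example, a minimal sketch of that approach, assuming the webrequest columns uri_host and uri_path and using purely illustrative path patterns, could look like:

SELECT
  uri_host,
  COUNT(1) AS n_requests
FROM webrequest
WHERE year = 2021 AND month = 6 AND day = 1 AND hour = 12
  AND webrequest_source = 'text'
  -- illustrative patterns: MediaWiki action API (api.php) and REST API (/api/rest_v1/)
  AND (uri_path LIKE '%api.php%' OR uri_path LIKE '/api/rest_v1/%')
GROUP BY uri_host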

Dimensions

The following requested dimensions would cause problems:

  • URI Query
  • Response Size

For response size, one option is to use it as a measure instead of a dimension, which would allow you to have a chart showing SUM(total_size) (i.e. total bandwidth) broken down by whatever dimensions you want (if any). You could also compute SUM(total_size)/SUM(n_requests) as the average response size per request.
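
In HiveQL terms, a sketch of that measure approach might look like the following (the dimension shown is just an illustrative example):

SELECT
  http_method, -- example dimension
  SUM(response_size) AS total_size, -- total bandwidth
  COUNT(1) AS n_requests,
  SUM(response_size) / COUNT(1) AS avg_response_size -- average response size per request
FROM webrequest
WHERE year = 2021 AND month = 6 AND day = 1 AND hour = 12
  AND webrequest_source = 'text'
GROUP BY http_method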

Alternatively, if you decide to have response size as a dimension – meaning the only measure/metric you'll have is the count of requests – you would need to bucketize it (say 0-10 bytes, 11-20, etc.), just like you would with time (cf. https://wikitech.wikimedia.org/wiki/Analytics/Systems/Hive_to_Druid_Ingestion_Pipeline#Time_measure_buckets).
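
A rough sketch of that bucketing in HiveQL, where the 10-byte bucket width is just the example from above:

SELECT
  FLOOR(response_size / 10) * 10 AS response_size_bucket, -- buckets 0-9, 10-19, ...
  COUNT(1) AS n_requests
FROM webrequest
WHERE year = 2021 AND month = 6 AND day = 1 AND hour = 12
  AND webrequest_source = 'text'
GROUP BY FLOOR(response_size / 10) * 10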

For uri query, you'll want to start with just extracting the action parameter from MW API requests since other parameters are conditional on action.
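
A minimal sketch of that extraction using Hive's built-in regexp_extract, assuming the uri_path and uri_query columns as above:

SELECT
  regexp_extract(uri_query, '(^|[?&])action=([^&]+)', 2) AS api_action,
  COUNT(1) AS n_requests
FROM webrequest
WHERE year = 2021 AND month = 6 AND day = 1 AND hour = 12
  AND webrequest_source = 'text'
  AND uri_path LIKE '%api.php%'
GROUP BY regexp_extract(uri_query, '(^|[?&])action=([^&]+)', 2)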

Productionizing

Once you have a query, we recommend working with the Analytics team to add it as an Oozie job in their refinery (since Airflow is not yet available). The pipeline is essentially: run the Hive query on Webrequest and insert the results into a Hive table, then ingest that Hive table into Druid.
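
As a rough illustration of that first step, the hourly job could run something like the following, where api_requests_hourly and its columns are a hypothetical target table, not an existing one:

-- hypothetical aggregate table, partitioned like the source data
INSERT OVERWRITE TABLE api_requests_hourly
  PARTITION (year = 2021, month = 6, day = 1, hour = 12)
SELECT
  http_method,
  http_status,
  COUNT(1) AS n_requests
FROM webrequest
WHERE year = 2021 AND month = 6 AND day = 1 AND hour = 12
  AND webrequest_source = 'text'
GROUP BY http_method, http_status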

For example, the pageviews_hourly dataset in Superset is created & maintained with the following:

See also the event.mediawiki_api_request table in Hive. It comes from the mediawiki.api-request stream which has this schema.

I think this covers only MediaWiki Action API requests. To get really good data for all the sources you mentioned, it'd be nice if they all logged request events that could then be ingested into Hive tables.

Oh, I forgot: there is also the event.api_gateway_request Hive table, which comes from the api-gateway.request stream with this schema.

Thanks, @mpopov and @Ottomata , for the very helpful replies! It all makes sense at a high level, but it'll take me a bit to absorb the details. I did confirm that I could execute queries via https://hue.wikimedia.org/, so I think I have everything I need to do some experimenting.

One question before I really dive in: is there anything I should know in order to be a good citizen of this system? In particular, do I need to be concerned about executing heavy/slow queries that could affect the overall system for other users? Especially given that I don't yet have much sense of what might constitute an impolite query...

FWIW, I don't imagine I'll need to do anything other than SELECTs at this stage (if ever).

There are some helpful hints on this page, but this is the most important one:

https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Queries#Use_partitions

BTW, Hue is our (the DE team's) least favorite interface, and we will one day deprecate it. However, it works! And you are welcome to use it. :)

Alternatively, you could try Spark, the Hive CLI directly, Jupyter notebooks, or Superset SQL Lab (via Presto; not all tables are available).

Let us know if you need any help. You can chat about analytics stuff in the #analytics room in slack and/or the #wikimedia-analytics IRC room.

@VirginiaPoundstone: Is this still needed or is it safe to close/decline this task? If the data is still needed, do you want to adjust the request details?

JArguello-WMF renamed this task from "[REQUEST] Data for external API Requests" to "[REQUEST] Data for external API Requests. Harden API Gateway". Feb 1 2023, 9:19 PM

Updated description. Scheduling a consultation hour to discuss next steps.

Aklapper changed the subtype of this task from "Deadline" to "Task". Apr 26 2023, 8:43 AM