
Top API user agents stats
Closed, Resolved (Public)

Description

We would like to get a better understanding of the top API user agents, for both the REST & Action APIs. These statistics should cover both cache misses & hits.

We do have user agents in web request logs, but as far as I know we currently only expose UA stats aggregated across all requests. Would it be possible to set up separate UA stats for requests matching /w/api.php and /api/rest_v1/ (separately + combined)?
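As a rough illustration of the breakdown being requested (per API, plus combined), here is a minimal Python sketch. The `(uri_path, user_agent)` record shape and the function names are hypothetical and are not the actual webrequest schema; only the two path patterns come from the description above.

```python
from collections import Counter

# Path patterns taken from the task description.
ACTION_API_PATH = "/w/api.php"
REST_API_PREFIX = "/api/rest_v1/"


def classify_api(uri_path):
    """Return 'action', 'rest', or None for a request path."""
    if uri_path == ACTION_API_PATH:
        return "action"
    if uri_path.startswith(REST_API_PREFIX):
        return "rest"
    return None


def top_user_agents(requests, n=10):
    """Tally user agents per API and combined.

    `requests` is an iterable of (uri_path, user_agent) pairs. This record
    shape is an assumption made for the sketch.
    """
    counts = {"action": Counter(), "rest": Counter(), "combined": Counter()}
    for uri_path, user_agent in requests:
        api = classify_api(uri_path)
        if api is None:
            continue  # not an API request
        counts[api][user_agent] += 1
        counts["combined"][user_agent] += 1
    return {name: counter.most_common(n) for name, counter in counts.items()}
```

Because the tally runs over every request it is given, cache hits are included as long as the input comes from the edge logs rather than from a backend.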

See also: T122245: REST API entry point web request statistics at the Varnish level

Event Timeline

GWicke renamed this task from Top API user agents stat to Top API user agents stats. Aug 4 2016, 8:57 PM

That information exists for the PHP API in the API tables: https://wikitech.wikimedia.org/wiki/Analytics/Data/ApiAction

API data in webrequest is partial, so this data is published by the API itself.

hive (wmf_raw)> desc apiaction;
OK
col_name data_type comment
ts int
ip string
useragent string
wiki string
timespentbackend int
haderror boolean
errorcodes array<string>
params map<string,string>
year string
month string
day string
hour string

# Partition Information
# col_name data_type comment

year string
month string
day string
hour string

@Nuria, we are interested in all requests, including cache hits. Requests recorded by backends like the PHP API would not include those.

We do have user agents in web request logs, but as far as I know we currently only expose UA stats aggregated across all requests.

A small clarification: we expose UAs aggregated across pageviews rather than requests. A pageview that involves 1 HTML fetch and 20 JavaScript fetches is counted as "1 pageview" and thus reports "1 UA"; otherwise our UA reporting would over-represent browsers with JS support.
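To make the difference concrete, here is a small sketch with made-up data: one JS-capable browser loading a page with 20 script fetches, and one browser that fetches only the HTML. The `(user_agent, is_pageview)` record shape is hypothetical.

```python
from collections import Counter

# Hypothetical request log: (user_agent, is_pageview). One page load is one
# HTML fetch (the pageview) plus any number of sub-resource fetches.
requests = (
    [("js-browser", True)]            # the HTML fetch
    + [("js-browser", False)] * 20    # 20 JavaScript fetches for the same page
    + [("text-browser", True)]        # a page load with no JS fetches
)

# Counting every request inflates the share of the JS-capable browser.
per_request = Counter(ua for ua, _ in requests)

# Counting only pageviews gives each page load a weight of one.
per_pageview = Counter(ua for ua, is_pageview in requests if is_pageview)
```

Per request, the JS-capable browser accounts for 21 of 22 entries; per pageview, the two browsers are tied at one each.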

Milimetric triaged this task as Medium priority. Aug 8 2016, 4:47 PM
Milimetric moved this task from Incoming to Backlog (Later) on the Analytics board.

ApiAction is collected via varnishkafka and does include cached requests. Stats aggregation is T137321.

Ping @Tgr what is the status of this?

ApiAction is collected via varnishkafka and does include cached requests. Stats aggregation is T137321.

Actually, ApiAction is collected via the PSR-3 logger in MediaWiki, which sends Avro objects to Kafka. So it does not include cached requests (but does include requests which are not logged by Kafka, such as POST requests). Sorry for the misinformation.

Ping @Tgr what is the status of this?

Action API requests which hit a backend server are logged (in a way that enables top UA stats), but the logging code is hacky and needs to be finalized (T137321: Run ETL for wmf_raw.ActionApi into wmf.action_* aggregate tables). Logging cached API requests via varnishkafka and joining them with the existing stats is T155478: Copy cached API requests from raw webrequests table to ApiAction; as you said there, it probably should be done on top of webrequest tagging.

I don't know much about RESTBase stats. For ORES (not mentioned in the task but would be the next logical thing to add) we don't collect UA data at all AFAIK.

I saw this ticket go by; much has changed since it was filed. The ActionAPI table has not been updated for a while. A much more reliable flow of data can be found in mediawiki_api_request:

hive (event)> desc mediawiki_api_request;
OK
col_name data_type comment
_schema string
meta struct<uri:string,request_id:string,id:string,dt:string,domain:string,stream:string>
http struct<method:string,client_ip:string,request_headers:map<string,string>>
database string
backend_time_ms bigint
api_error_codes array<string>
params map<string,string>
datacenter string
year bigint
month bigint
day bigint
hour bigint

# Partition Information
# col_name data_type comment

datacenter string
year bigint
month bigint
day bigint
hour bigint
Time taken: 0.229 seconds, Fetched: 21 row(s)

Also, we have API requests (sampled 1/128) in Turnilo.
This is the tally of the top user agents on the PHP API for POST requests for yesterday:

https://bit.ly/2YBoCSG

Screen Shot 2019-07-05 at 1.12.05 PM.png
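For anyone who wants to reproduce a similar tally from records shaped like the mediawiki_api_request schema above, here is a minimal Python sketch. The `user-agent` header key is an assumption (the schema only says `http.request_headers` is a `map<string,string>`), and the function names and `sampling_factor` parameter are hypothetical. Pass `sampling_factor=128` to scale counts from 1/128-sampled data up to a rough estimate.

```python
from collections import Counter


def user_agent_of(record, header="user-agent"):
    """Pull the user agent out of a record shaped like the schema above.

    The header key name is an assumption; adjust it to whatever key the
    request_headers map actually uses.
    """
    headers = record.get("http", {}).get("request_headers", {})
    return headers.get(header, "(missing)")


def top_post_user_agents(records, n=10, sampling_factor=1):
    """Count user agents for POST requests only.

    `sampling_factor` multiplies each count, e.g. 128 for data sampled at
    1/128. The scaled numbers are estimates, not exact totals.
    """
    counts = Counter(
        user_agent_of(r)
        for r in records
        if r.get("http", {}).get("method") == "POST"
    )
    return [(ua, c * sampling_factor) for ua, c in counts.most_common(n)]
```

Note that this only sees requests that reached MediaWiki, so cached responses served at the edge are not counted.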

I don't think this is relevant anymore or that any further work should be done, but it contains some good info, so moving to Icebox.