Copy cached API requests from raw webrequests table to ApiAction
Open, Low · Public

Description

The current pipeline for Action API statistics is MediaWiki -> Monolog -> Kafka -> Hadoop. This has the disadvantage that responses served from the Varnish cache are ignored. We can work around this by copying all requests from the webrequest table where the endpoint is api.php and the request was a cache hit (cache_status is hit).
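For illustration only, such a filter over the raw webrequest data could look something like the sketch below. Field names (uri_path, cache_status, the year/month/day partitions, webrequest_source) follow the usual wmf.webrequest layout, and the 'hit%' match and partition values are assumptions, not the actual job:

-- illustrative sketch only: cached api.php requests from the raw data
select dt,
       uri_host,
       uri_query,
       user_agent
  from wmf.webrequest
 where uri_path = '/w/api.php'
   and cache_status like 'hit%'
   and webrequest_source = 'text'
   and year = 2017 and month = 1 and day = 17;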

To be able to provide error information, we would need T116658: Add Application errors for Mediawiki API to x-analytics.

Tgr created this task. · Jan 17 2017, 6:58 AM
Restricted Application added a subscriber: Aklapper. · Jan 17 2017, 6:58 AM
Tgr updated the task description. · Jan 17 2017, 7:09 AM
Nuria moved this task from Incoming to Q1 (July 2018) on the Analytics board. · Jan 23 2017, 4:51 PM
Nuria edited projects, added Analytics-Kanban; removed Analytics.
Nuria added a subscriber: Nuria.
This comment was removed by Nuria.

Sorry if I'm missing something obvious, but why does this need research's attention? We could make a job to pull out api.php requests. I would love to have a single job that goes over all webrequest records and applies a set of regexes to them, splitting out the smaller data sets needed in situations like this one. We already do this on a case-by-case basis, and it's just a waste of resources to go over the data multiple times.

I would be happy to use this as an opportunity to build such a job: a simple Oozie job with a small, configurable UDF. The UDF could tag a webrequest as matching a certain set of regexes, and we could store the results in a simple table with the schema:

(date, webrequest_tags, ua_map, geo_map, etc...)

Then people could query this like:

select ua_map['browser'],
       count(*)
  from tagged_webrequests
 where array_contains(webrequest_tags, 'api')
   and date = '2017-01-01'
 group by ua_map['browser'];
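To make the tagging idea above concrete, here is a purely hypothetical sketch of how regex-based tagging could be expressed in plain Hive before any UDF exists. The table name, the regexes, and the use of user_agent_map/geocoded_data from wmf.webrequest are all assumptions, not the proposed UDF or job:

-- hypothetical sketch: derive tags with rlike and collect the non-null ones
-- into an array per request
create table tagged_webrequests as
select dt,
       user_agent_map as ua_map,
       geocoded_data  as geo_map,
       split(concat_ws(',',
             if(uri_path rlike '^/w/api\\.php$',         'api',    null),
             if(uri_host rlike '^www\\.wikipedia\\.org$', 'portal', null)),
             ',') as webrequest_tags
  from wmf.webrequest
 where year = 2017 and month = 1 and day = 1;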
Tgr added a subscriber: bd808. · Jan 25 2017, 6:30 PM
Nuria edited projects, added Analytics; removed Analytics-Kanban. · Jan 26 2017, 4:42 PM
Nuria added a comment. (edited) · Mar 20 2017, 4:15 PM

Why don't we have a meeting to talk about how we want to evolve API error logs? We now have the API publishing into Hadoop, but we (as in WMF) are also maintaining the udp2log infrastructure. Could we clarify which path we want to follow regarding logs and the API?

Nuria triaged this task as Low priority. · Mar 20 2017, 4:15 PM
Anomie added a subscriber: Anomie. · Mar 20 2017, 4:23 PM

We now have the API publishing into Hadoop, but we are also maintaining the udp2log infrastructure.

I note the Hadoop data and the udp2log data serve different purposes. For example, I don't think the Hadoop data is particularly amenable to tailing to watch for certain active queries. But the udp2log data is poorly suited to finding out how many hits some particular endpoint is getting over a time period longer than the past few minutes.

Ottomata added a subscriber: Ottomata. (edited) · Mar 20 2017, 4:52 PM

We now have the API publishing into Hadoop, but we are also maintaining the udp2log infrastructure.

This really should be stated as "API logs are published to Kafka and then imported into Hadoop for later analysis". kafkatee is a drop-in replacement for udp2log that reads data from Kafka and pipes it to outputs, including local files.

Kafka is just a distributed seekable log buffer, so it is great for tailing to watch for active queries.

Tgr added a comment. · Mar 20 2017, 5:29 PM

This seems like an "if it works, don't try to fix it" thing to me. udp2log is used by all MediaWiki logging; having the API as an additional client is no maintenance burden. Replacing it with Kafka does not seem to have any benefit.

Nuria added a comment. · Mar 20 2017, 6:12 PM

Replacing it with Kafka does not seem to have any benefit.

I think it does. We have an outstanding request to publish cached requests from webrequest into some Hadoop API tables. Having MediaWiki requests published to Kafka, covering cached and non-cached requests as well as error logs, would make importing all that data into one place a lot easier.

Tgr added a comment. · Mar 20 2017, 7:09 PM

Having MediaWiki requests published to Kafka, covering cached and non-cached requests as well as error logs, would make importing all that data into one place a lot easier.

We already send data about non-cached requests (which are the only kind of request as far as the API code in MediaWiki is concerned) to Kafka, in the form of Avro packets tailored to the data the API metrics need. We also send it, in the form of fully formatted log lines, to udp2log, which is where log lines from all other MediaWiki sources are sent.

Replacing udp2log with Kafka does not seem useful at all: fully formatted log lines are not particularly useful for Hadoop; we don't want to maintain formatting code in two places; we don't want to divert logs away from logfiles (that would force deployers/ops to guess which logs need tail and which need kafkatee, and to rewrite all the existing tooling); and we don't want to complicate the data flow for server logfiles by introducing a new channel when there is no tangible benefit to it.

Nuria added a comment. · Jun 12 2017, 4:00 PM

This ticket has several requests regarding being able to harvest cached API requests:

@Tgr: I think part of this work can be addressed with the tagging changes that are now a work in progress: we can tag requests as coming from the API, and those will get grouped into a webrequest "subtable". See the tagging WIP code: https://gerrit.wikimedia.org/r/#/c/353287/

The other pieces of work regarding error logging and the API we can tackle in other tickets.

Nuria added a comment. · Jun 15 2017, 4:05 PM

@Tgr, this will benefit from the changes happening around tagging of requests. We can easily tag the requests that need to be "copied", and I think it will be trivial to copy those once tagging is done. Code changes for tagging are here; see the example for "portal": https://gerrit.wikimedia.org/r/#/c/353287/
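As a purely illustrative sketch (the tags field name and the destination table are assumptions on my part, not what the Gerrit change implements), once tagging lands the "copy" of cached API requests could reduce to a filtered select:

-- hypothetical only: copy cached, api-tagged requests into a separate table
insert into table api_cached_requests partition (year = 2017, month = 6, day = 15)
select dt,
       uri_host,
       uri_query,
       user_agent
  from wmf.webrequest
 where array_contains(tags, 'api')   -- tag field name is an assumption
   and cache_status like 'hit%'
   and year = 2017 and month = 6 and day = 15;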

Nuria added a comment. · Jun 15 2017, 4:06 PM

We think this work can happen next quarter.

Linking to task T142139 because I think it is related; @Tgr, let us know otherwise.

fdans moved this task from Q1 (July 2018) to Deprioritized on the Analytics board. · Oct 26 2017, 4:16 PM