Page MenuHomePhabricator

Port usage of mediawiki_ApiAction to mediawiki_api_request
Open, HighPublic

Description

Change the roll-up scripts that run as bd808 on stat1007 to use mediawiki_api_request OR get T137321: Run ETL for wmf_raw.ActionApi into wmf.action_* aggregate tables

Event Timeline

Ottomata created this task.May 1 2019, 2:51 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMay 1 2019, 2:51 PM

I just tried a version of one of @bd808's queries to compare results from ApiAction vs api_request:

mediawiki_ApiAction query
ADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION network_origin AS 'org.wikimedia.analytics.refinery.hive.NetworkOriginUDF';

set hivevar:year=2019;
set hivevar:month=4;
set hivevar:day=30;
set hivevar:hour=0;

SELECT
    userAgent,
    wiki,
    network_origin(ip),
    COUNT(1) as cnt
FROM wmf_raw.ApiAction
WHERE year = ${year}
  AND month = ${month}
  AND day = ${day}
  AND hour = ${hour}
GROUP BY
    userAgent,
    wiki,
    network_origin(ip)
ORDER BY  cnt desc
limit 10
;

Results:

useragent	wiki	_c2	cnt
Parsoid/0.10.0+git	enwiki	wikimedia	276082
WikipediaApp/6.2.2.1605 (iOS 12.2; Phone)	enwiki	internet	209858
	enwiki	internet	187714
citationhunt (https://tools.wmflabs.org/citationhunt)	enwiki	wikimedia_labs	172830
Parsoid/0.10.0+git	dewiki	wikimedia	158411
Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Mobile/15E148 Safari/604.1	enwiki	internet	144135
Object Revision Evaluation Service <ahalfaker@wikimedia.org>	wikidatawiki	wikimedia	137611
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36	enwiki	internet	134738
Parsoid/0.10.0+git	ruwiki	wikimedia	113636
Parsoid/0.10.0+git	commonswiki	wikimedia	102749
mediawiki_api_request query
ADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar;
CREATE TEMPORARY FUNCTION network_origin AS 'org.wikimedia.analytics.refinery.hive.NetworkOriginUDF';

set hivevar:year=2019;
set hivevar:month=4;
set hivevar:day=30;
set hivevar:hour=0;

SELECT
    http.request_headers['user-agent'],
    `database`,
    network_origin(http.client_ip),
    COUNT(1) as cnt
FROM event.mediawiki_api_request
WHERE year = ${year}
  AND month = ${month}
  AND day = ${day}
  AND hour = ${hour}
GROUP BY
    http.request_headers['user-agent'],
    `database`,
    network_origin(http.client_ip)
ORDER BY cnt desc
limit 10
;

Results:

_c0	database	_c2	cnt
Parsoid/0.10.0+git	enwiki	wikimedia	552939
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36	enwiki	internet	432342
WikipediaApp/6.2.2.1605 (iOS 12.2; Phone)	enwiki	internet	419224
NULL	enwiki	internet	374836
citationhunt (https://tools.wmflabs.org/citationhunt)	enwiki	wikimedia_labs	345037
Parsoid/0.10.0+git	dewiki	wikimedia	316665
Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Mobile/15E148 Safari/604.1	enwiki	internet	287920
Object Revision Evaluation Service <ahalfaker@wikimedia.org>	wikidatawiki	wikimedia	274048
Parsoid/0.10.0+git	ruwiki	wikimedia	228212
Parsoid/0.10.0+git	commonswiki	wikimedia	204986

Those are pretty different! Notably the mediawiki_api_request data is showing many more requests, about twice as many... hm.

Ottomata added a comment.EditedMay 1 2019, 3:26 PM
select count(*) 
FROM wmf_raw.ApiAction
WHERE year = ${year}
  AND month = ${month}
  AND day = ${day}
  AND hour = ${hour}

9809748


select count(*) 
FROM event.mediawiki_api_request
WHERE year = ${year}
  AND month = ${month}
  AND day = ${day}
  AND hour = ${hour}

19601806

Hm ok, I don't fully know what is going on here, but I found some api requests that are missing from ApiAction. I focused on one specifically, and verified that it is in the udp2log generated api.log on mwlog1001.

Find some logs that are in mediawiki_api_request but not in ApiAction:


set hivevar:year=2019;
set hivevar:month=5;
set hivevar:day=1;
set hivevar:hour=8;

SELECT A.*
FROM event.mediawiki_api_request as A
LEFT OUTER JOIN wmf_raw.ApiAction as B
ON (
      B.wiki              = A.`database`  AND
      B.userAgent           = A.http.request_headers["user-agent"]  AND
      B.params["action"]    = A.params["action"] AND
      B.params["search"]    = A.params["search"] AND
      B.timespentbackend    = A.backend_time_ms AND
      B.ip                  = A.http.client_ip 
AND B.year = ${year}
AND B.month = ${month}
AND B.day = ${day}
AND B.hour = ${hour}
)
WHERE B.params IS null
AND A.params["action"] = "opensearch"
AND A.`database` = 'enwiki'
AND A.http.request_headers["user-agent"] = 'Python-urllib/2.7'
AND A.year = ${year}
AND A.month = ${month}
AND A.day = ${day}
AND A.hour = ${hour}
;

This query resulted in 42 rows that were in mediawiki_api_request but not in ApiAction. I decided to look for one of these in api.log.

/mediawiki/api/request/0.0.1    {"uri":null,"request_id":"XMlfAgpAEMQAADLJSA8AAABK","id":"df0bf6b5-6bee-11e9-a9a8-1866da9950b2","dt":"2019-05-01T08:55:30Z","domain":"en.wikipedia.org","stream":"mediawiki.api-request"}   {"method":"GET","client_ip":"XXXXXXXXXXX","request_headers":{"user-agent":"Python-urllib/2.7"}}    enwiki  24  NULL    {"action":"opensearch","limit":"1","search":"Kaawsat letters world","format":"xml"} eqiad   2019    5   1   8

I found it:

2019-05-01 08:55:30 [XMlfAgpAEMQAADLJSA8AAABK] mw1315 enwiki 1.34.0-wmf.1 api INFO: API GET XXXXXXXX XXXXXXXX T=24ms action=opensearch format=xml search=Kaawsat%20letters%20world limit=1

So, I think mediawiki_api_request is correct. Is it possible that LOTS of data is somehow missing from ApiAction, and has been?

Nuria added a comment.May 1 2019, 6:01 PM

I think mediawiki_api_request is correct. Is it possible that LOTS of data is somehow missing from ApiAction, and has been?

We can verify this by counting api requests on webrequest table right?

Tgr added a comment.May 1 2019, 6:54 PM

AIUI the webrequest table only contains GET requests but includes requests that Varnish serves from cache (Varnish caching is not typical for the Action API but is used by a few tools). In contrast, api.log (and in theory ApiAction, which should be based on the exact same log stream) include both GET and POST but only requests which reach the appservers.

Webrequest will contain the POST requests too, but it won't have any of the POST body parameters.

Haven't yet counted totals (a bit hard and expensive to join properly here), but as expected the request I'm looking at is also in webrequest:

select *
from wmf.webrequest where 
uri_host = 'en.wikipedia.org'
AND uri_path = '/w/api.php'
AND uri_query LIKE '%action=opensearch%'
AND uri_query LIKE '%search=Kaawsat+letters+world%'
AND webrequest_source="text"
AND year = ${year}
AND month = ${month}
AND day = ${day}
AND hour = ${hour}
limit 10;

cp3030.esams.wmnet  22078811449 2019-05-01T08:55:30 0.138076  XXXXXXXX  miss  200 185 GET en.wikipedia.org  /w/api.php  ?action=opensearch&search=Kaawsat+letters+world&limit=1&format=xml  text/xml; charset=utf-8 - NULL  Python-urllib/2.7 - ns=-1;special=Badtitle;https=1;nocookies=1  - false 0.0.21  XXXXXXXX  {"XXXXXXXXX"}  cp1079 pass, cp3033 miss, cp3030 miss {"browser_family":"Python-urllib","os_major":"-","wmf_app_version":"-","browser_major":"2","os_minor":"-","os_family":"Other","device_family":"Spider"} {"special":"Badtitle","ns":"-1","nocookies":"1","https":"1"}  2019-05-01 08:55:30 desktop spider  false none  {"project_class":"wikipedia","project":"en","qualifiers":[],"tld":"org","project_family":"wikipedia"} NULL  NULL  -1    []  {XXXXXXXX}  - text  2019  5 1 8
Nuria added a comment.May 1 2019, 7:47 PM

I think there is no need to join, plain counts in webrequest and apiAction (old table) should match closely for api requests with MISS status in webrequest, right? I would count(*) for couple hours in two different days and see differences among the three sources grouping by miss/cached status.

Gathered some counts:

set hivevar:year=2019;
set hivevar:month=4;
set hivevar:day=30;
set hivevar:hour=0;

wmf_raw.ApiAction
9809748

event.mediawiki_api_request
19601806

wmf.webrequest by http_method

http_method	cnt
GET	       24009722
POST	        1888533
OPTIONS	         397522
HEAD	           5151

wmf.webrequest by cache_status

cache_status	cnt
pass	       14031925
hit-front	7835177
miss	        2044842
hit-local	1304081
int-front	 787150
hit-remote	 297568
-	            183
int-local	      2

I don't see any particular discrepancy correlation with cache_status or http_method. I also tried to count total api.php webrequests for POST, GET and cache_status not a hit: 16461519

Nothing lines up! It'd be nice to know how to query webrequest with exact parameters that also will result in an Mediawiki api request, but I'm not sure how to do it. It does seem clear that ApiAction is missing data though.

Queries:


set hivevar:year=2019;
set hivevar:month=4;
set hivevar:day=30;
set hivevar:hour=0;

SELECT cache_status, count(*) as cnt from wmf.webrequest
where uri_path LIKE '%api.php%'
AND webrequest_source = 'text'
AND year = ${year}
AND month = ${month}
AND day = ${day}
AND hour = ${hour}
group by cache_status
order by cnt desc
LIMIT 20;

cache_status  cnt
pass        14031925
hit-front    7835177
miss         2044842
hit-local    1304081
int-front     787150
hit-remote    297568
-                183
int-local          2


SELECT http_method, count(*) as cnt from wmf.webrequest
where uri_path LIKE '%api.php%'
AND webrequest_source = 'text'
AND year = ${year}
AND month = ${month}
AND day = ${day}
AND hour = ${hour}
group by http_method
order by cnt desc
LIMIT 20;

http_method cnt
GET    24009722
POST    1888533
OPTIONS  397522
HEAD       5151


SELECT count(*) from wmf.webrequest
where uri_path LIKE '%api.php%'
AND http_method IN ('GET', 'POST')
AND cache_status NOT LIKE 'hit%'
AND webrequest_source = 'text'
AND year = ${year}
AND month = ${month}
AND day = ${day}
AND hour = ${hour};
16461519

SELECT count(*) from wmf_raw.ApiAction
where year = ${year}
AND month = ${month}
AND day = ${day}
AND hour = ${hour};
  9809748


SELECT count(*) from event.mediawiki_api_request
where year = ${year}
AND month = ${month}
AND day = ${day}
AND hour = ${hour};
  19601806
Tgr added a comment.May 1 2019, 10:46 PM

Not counting all the hit-* requests is expected since those never reach the appservers and ApiAction is logged by the appservers. (Not sure what int-* is but there's not many of those anyway.) Not counting OPTIONS is expected, those are probably CORS preflight requests, not real API requests, handled but not logged by the appservers (but there's not many of those either). Beyond that, not sure. Maybe worth checking whether the discrepancy is uniform in time (vs. being related to some event propagation mechanism temporarily breaking) and in API request type?

It'd be nice to know how to query webrequest with exact parameters that also will result in an Mediawiki api request, but I'm not sure how to do it.

Probably WHERE (uri_path LIKE '/w/api.php' OR uri_path LIKE '/w/api.php?%') AND http_method IN ('GET', 'POST') AND cache_status NOT LIKE 'hit%'. There are probably some types of errors that happen before the main API handler code is invoked and so don't get logged, like an invalid OAuth signature, but I doubt there's many of those.

Ottomata removed Ottomata as the assignee of this task.May 2 2019, 3:34 PM

So in Camus logs, I see errors like

May  1 09:15:07 an-coord1001 camus-mediawiki-analytics[69524]: 19/05/01 09:15:07 ERROR kafka.CamusJob: Offset range from kafka metadata is outside the previously persisted offset, mediawiki_ApiAction#011uri:tcp://kafka1013.eqiad.wmnet:9092#011leader:13#011partition:2#011earliest_offset:46734685522#011offset:28906055925#011latest_offset:47032737089#011avg_msg_size:319#011estimated_size:5782411291316

for a few ApiAction partitions. I don't fully know why this is happening, but it looks like this Camus import job is misconfigured. It doesn't set kafka.move.to.earliest.offset=true, so when it tries to use an offset that no longer exists in Kafka, it just skips that partition (or topic?). Not sure about this, I'd expect it to just start from latest when this happens.

Anyway, this seems to have been going on for a LONG time (who knows how long). The new mediawiki.api-request topics should be better monitored, since they use the same system and configuration we use for all events.

Since we have a culprit (even if we don't know the full cause), I feel ok with just proceeding to port over Bryan's Hive queries and use the new event topic, if there are no objections.

Ottomata triaged this task as High priority.May 2 2019, 4:50 PM
Ottomata moved this task from Incoming to Modern Event Platform on the Analytics board.
Ottomata raised the priority of this task from High to Needs Triage.
Ottomata triaged this task as High priority.
Tgr added a comment.May 2 2019, 6:53 PM

Where does mediawiki_api_request come from? The EventBus extension?

Yes and no. It comes via monolog, just like ApiAction in ApiMain.php. The EventBus extension now has a Monolog handler that is used for the api-request logging.

See the parent task(s) :)

Ottomata added a comment.EditedMay 2 2019, 7:54 PM

I modified the mediawiki-analytics (avro) camus job to set kafka.move.to.earliest.offset=true and ran it a a few times. It seems to have fixed its offset and is now importing all data from the last 7 days. This should allow us to vet the new data and new queries against the old ones.

Expect a big jump in the downstream data that computes from ApiAction...

Tgr added a comment.May 2 2019, 8:46 PM

Thanks, so this is from the api-request MediaWiki log channel created in 491887. I missed when that happened.

Nuria added a comment.May 2 2019, 10:31 PM

@Tgr: the code was merged a while back but it is only as of (last week?) that this is deployed to all wikis.

Looking a bit better!

SELECT count(*) from wmf_raw.ApiAction
where year = 2019
AND month = 5
AND day = 3
AND hour BETWEEN 9 AND 12
AND ts >= 1556877600 AND ts < 1556881200;

22786681

SELECT count(*) from event.mediawiki_api_request
where year = 2019
AND month = 5
AND day = 3
AND hour BETWEEN 9 AND 12
AND meta.dt >= "2019-05-03T10:00:00Z" AND meta.dt < "2019-05-03T11:00:00Z";

22851410

Still missing 0.02% of api requests in ApiAction, but maybe that's ok? Hm.

Ah, and I just realized there will be more actual API requests than webrequests. Internal API requests likely do not hit frontend varnishes (or do they?). They wouldn't be logged via varnishkafka.

Ottomata moved this task from Backlog to In Progress on the EventBus board.May 3 2019, 6:00 PM

@bd808 @Tgr
https://github.com/bd808/action-api-analytics/pull/1

I tested the selects on theses queries, I think they all work as expected. It'd be good to have someone more familiar with the data check.

@EvanProdromou whatcha think about a timeline on this? Turning off the ApiAction events is a goal for this quarter, and this task has to be resolved before we do that.

@EvanProdromou @bd808, @Tgr: I'm still not sure who the owner of this data is, but we need to set a deadline I think! We need to turn of the ApiAction events soon so we have time to decommission it's source Kafka cluster this quarter. So, I'd appreciate if the queries were ported over by June 7th. If there isn't any movement here before that, we'll go ahead and turn the ApiAction events off.

Thanks!

Ottomata moved this task from Next Up to Paused on the Analytics-Kanban board.Thu, Jun 6, 5:04 PM

FYI, we will be disabling the ApiAction stream this week.

Change 516303 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Disable ApiAction log channel

https://gerrit.wikimedia.org/r/516303

Bump! @bd808 if I don't hear anything by Wednesday my morning I'm going to turn off this stream! :o