Change the roll-up scripts that run as bd808 on stat1007 to use mediawiki_api_request OR get T137321: Run ETL for wmf_raw.ActionApi into wmf.action_* aggregate tables
Description
Details
Status | Subtype | Assigned | Task | ||
---|---|---|---|---|---|
Open | None | T185233 Modern Event Platform | |||
Resolved | Ottomata | T201068 Modern Event Platform: Stream Intake Service | |||
Resolved | Ottomata | T206785 Modern Event Platform: Stream Intake Service (EventGate): Implementation | |||
Resolved | Ottomata | T214080 Rewrite Avro schemas (ApiAction, CirrusSearchRequestSet) as JSONSchema and produce to EventGate | |||
Resolved | EvanProdromou | T222267 Port usage of mediawiki_ApiAction to mediawiki_api_request |
Event Timeline
I just tried a version of one of @bd808's queries to compare results from ApiAction vs api_request:
mediawiki_ApiAction query
ADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar; CREATE TEMPORARY FUNCTION network_origin AS 'org.wikimedia.analytics.refinery.hive.NetworkOriginUDF'; set hivevar:year=2019; set hivevar:month=4; set hivevar:day=30; set hivevar:hour=0; SELECT userAgent, wiki, network_origin(ip), COUNT(1) as cnt FROM wmf_raw.ApiAction WHERE year = ${year} AND month = ${month} AND day = ${day} AND hour = ${hour} GROUP BY userAgent, wiki, network_origin(ip) ORDER BY cnt desc limit 10 ;
Results:
useragent wiki _c2 cnt Parsoid/0.10.0+git enwiki wikimedia 276082 WikipediaApp/6.2.2.1605 (iOS 12.2; Phone) enwiki internet 209858 enwiki internet 187714 citationhunt (https://tools.wmflabs.org/citationhunt) enwiki wikimedia_labs 172830 Parsoid/0.10.0+git dewiki wikimedia 158411 Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Mobile/15E148 Safari/604.1 enwiki internet 144135 Object Revision Evaluation Service <ahalfaker@wikimedia.org> wikidatawiki wikimedia 137611 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 enwiki internet 134738 Parsoid/0.10.0+git ruwiki wikimedia 113636 Parsoid/0.10.0+git commonswiki wikimedia 102749
mediawiki_api_request query
ADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar; CREATE TEMPORARY FUNCTION network_origin AS 'org.wikimedia.analytics.refinery.hive.NetworkOriginUDF'; set hivevar:year=2019; set hivevar:month=4; set hivevar:day=30; set hivevar:hour=0; SELECT http.request_headers['user-agent'], `database`, network_origin(http.client_ip), COUNT(1) as cnt FROM event.mediawiki_api_request WHERE year = ${year} AND month = ${month} AND day = ${day} AND hour = ${hour} GROUP BY http.request_headers['user-agent'], `database`, network_origin(http.client_ip) ORDER BY cnt desc limit 10 ;
Results:
_c0 database _c2 cnt Parsoid/0.10.0+git enwiki wikimedia 552939 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36 enwiki internet 432342 WikipediaApp/6.2.2.1605 (iOS 12.2; Phone) enwiki internet 419224 NULL enwiki internet 374836 citationhunt (https://tools.wmflabs.org/citationhunt) enwiki wikimedia_labs 345037 Parsoid/0.10.0+git dewiki wikimedia 316665 Mozilla/5.0 (iPhone; CPU iPhone OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1 Mobile/15E148 Safari/604.1 enwiki internet 287920 Object Revision Evaluation Service <ahalfaker@wikimedia.org> wikidatawiki wikimedia 274048 Parsoid/0.10.0+git ruwiki wikimedia 228212 Parsoid/0.10.0+git commonswiki wikimedia 204986
Those are pretty different! Notably the mediawiki_api_request data is showing many more requests, about twice as many... hm.
select count(*) FROM wmf_raw.ApiAction WHERE year = ${year} AND month = ${month} AND day = ${day} AND hour = ${hour} 9809748 select count(*) FROM event.mediawiki_api_request WHERE year = ${year} AND month = ${month} AND day = ${day} AND hour = ${hour} 19601806
Hm ok, I don't fully know what is going on here, but I found some api requests that are missing from ApiAction. I focused on one specifically, and verified that it is in the udp2log generated api.log on mwlog1001.
Find some logs that are in mediawiki_api_request but not in ApiAction:
set hivevar:year=2019; set hivevar:month=5; set hivevar:day=1; set hivevar:hour=8; SELECT A.* FROM event.mediawiki_api_request as A LEFT OUTER JOIN wmf_raw.ApiAction as B ON ( B.wiki = A.`database` AND B.userAgent = A.http.request_headers["user-agent"] AND B.params["action"] = A.params["action"] AND B.params["search"] = A.params["search"] AND B.timespentbackend = A.backend_time_ms AND B.ip = A.http.client_ip AND B.year = ${year} AND B.month = ${month} AND B.day = ${day} AND B.hour = ${hour} ) WHERE B.params IS null AND A.params["action"] = "opensearch" AND A.`database` = 'enwiki' AND A.http.request_headers["user-agent"] = 'Python-urllib/2.7' AND A.year = ${year} AND A.month = ${month} AND A.day = ${day} AND A.hour = ${hour} ;
This query resulted in 42 rows that were in mediawiki_api_request but not in ApiAction. I decided to look for one of these in api.log.
/mediawiki/api/request/0.0.1 {"uri":null,"request_id":"XMlfAgpAEMQAADLJSA8AAABK","id":"df0bf6b5-6bee-11e9-a9a8-1866da9950b2","dt":"2019-05-01T08:55:30Z","domain":"en.wikipedia.org","stream":"mediawiki.api-request"} {"method":"GET","client_ip":"XXXXXXXXXXX","request_headers":{"user-agent":"Python-urllib/2.7"}} enwiki 24 NULL {"action":"opensearch","limit":"1","search":"Kaawsat letters world","format":"xml"} eqiad 2019 5 1 8
I found it:
2019-05-01 08:55:30 [XMlfAgpAEMQAADLJSA8AAABK] mw1315 enwiki 1.34.0-wmf.1 api INFO: API GET XXXXXXXX XXXXXXXX T=24ms action=opensearch format=xml search=Kaawsat%20letters%20world limit=1
So, I think mediawiki_api_request is correct. Is it possible that LOTS of data is somehow missing from ApiAction, and has been?
I think mediawiki_api_request is correct. Is it possible that LOTS of data is somehow missing from ApiAction, and has been?
We can verify this by counting api requests on webrequest table right?
AIUI the webrequest table only contains GET requests but includes requests that Varnish serves from cache (Varnish caching is not typical for the Action API but is used by a few tools). In contrast, api.log (and in theory ApiAction, which should be based on the exact same log stream) include both GET and POST but only requests which reach the appservers.
Webrequest will contain the POST requests too, but it won't have any of the POST body parameters.
Haven't yet counted totals (a bit hard and expensive to join properly here), but as expected the request I'm looking at is also in webrequest:
select * from wmf.webrequest where uri_host = 'en.wikipedia.org' AND uri_path = '/w/api.php' AND uri_query LIKE '%action=opensearch%' AND uri_query LIKE '%search=Kaawsat+letters+world%' AND webrequest_source="text" AND year = ${year} AND month = ${month} AND day = ${day} AND hour = ${hour} limit 10; cp3030.esams.wmnet 22078811449 2019-05-01T08:55:30 0.138076 XXXXXXXX miss 200 185 GET en.wikipedia.org /w/api.php ?action=opensearch&search=Kaawsat+letters+world&limit=1&format=xml text/xml; charset=utf-8 - NULL Python-urllib/2.7 - ns=-1;special=Badtitle;https=1;nocookies=1 - false 0.0.21 XXXXXXXX {"XXXXXXXXX"} cp1079 pass, cp3033 miss, cp3030 miss {"browser_family":"Python-urllib","os_major":"-","wmf_app_version":"-","browser_major":"2","os_minor":"-","os_family":"Other","device_family":"Spider"} {"special":"Badtitle","ns":"-1","nocookies":"1","https":"1"} 2019-05-01 08:55:30 desktop spider false none {"project_class":"wikipedia","project":"en","qualifiers":[],"tld":"org","project_family":"wikipedia"} NULL NULL -1 [] {XXXXXXXX} - text 2019 5 1 8
I think there is no need to join, plain counts in webrequest and apiAction (old table) should match closely for api requests with MISS status in webrequest, right? I would count(*) for couple hours in two different days and see differences among the three sources grouping by miss/cached status.
Gathered some counts:
set hivevar:year=2019; set hivevar:month=4; set hivevar:day=30; set hivevar:hour=0;
wmf_raw.ApiAction
9809748
event.mediawiki_api_request
19601806
wmf.webrequest by http_method
http_method cnt GET 24009722 POST 1888533 OPTIONS 397522 HEAD 5151
wmf.webrequest by cache_status
cache_status cnt pass 14031925 hit-front 7835177 miss 2044842 hit-local 1304081 int-front 787150 hit-remote 297568 - 183 int-local 2
I don't see any particular discrepancy correlation with cache_status or http_method. I also tried to count total api.php webrequests for POST, GET and cache_status not a hit: 16461519
Nothing lines up! It'd be nice to know how to query webrequest with exact parameters that also will result in an Mediawiki api request, but I'm not sure how to do it. It does seem clear that ApiAction is missing data though.
Queries:
set hivevar:year=2019; set hivevar:month=4; set hivevar:day=30; set hivevar:hour=0; SELECT cache_status, count(*) as cnt from wmf.webrequest where uri_path LIKE '%api.php%' AND webrequest_source = 'text' AND year = ${year} AND month = ${month} AND day = ${day} AND hour = ${hour} group by cache_status order by cnt desc LIMIT 20; cache_status cnt pass 14031925 hit-front 7835177 miss 2044842 hit-local 1304081 int-front 787150 hit-remote 297568 - 183 int-local 2 SELECT http_method, count(*) as cnt from wmf.webrequest where uri_path LIKE '%api.php%' AND webrequest_source = 'text' AND year = ${year} AND month = ${month} AND day = ${day} AND hour = ${hour} group by http_method order by cnt desc LIMIT 20; http_method cnt GET 24009722 POST 1888533 OPTIONS 397522 HEAD 5151 SELECT count(*) from wmf.webrequest where uri_path LIKE '%api.php%' AND http_method IN ('GET', 'POST') AND cache_status NOT LIKE 'hit%' AND webrequest_source = 'text' AND year = ${year} AND month = ${month} AND day = ${day} AND hour = ${hour}; 16461519 SELECT count(*) from wmf_raw.ApiAction where year = ${year} AND month = ${month} AND day = ${day} AND hour = ${hour}; 9809748 SELECT count(*) from event.mediawiki_api_request where year = ${year} AND month = ${month} AND day = ${day} AND hour = ${hour}; 19601806
Not counting all the hit-* requests is expected since those never reach the appservers and ApiAction is logged by the appservers. (Not sure what int-* is but there's not many of those anyway.) Not counting OPTIONS is expected, those are probably CORS preflight requests, not real API requests, handled but not logged by the appservers (but there's not many of those either). Beyond that, not sure. Maybe worth checking whether the discrepancy is uniform in time (vs. being related to some event propagation mechanism temporarily breaking) and in API request type?
It'd be nice to know how to query webrequest with exact parameters that also will result in an Mediawiki api request, but I'm not sure how to do it.
Probably WHERE (uri_path LIKE '/w/api.php' OR uri_path LIKE '/w/api.php?%') AND http_method IN ('GET', 'POST') AND cache_status NOT LIKE 'hit%'. There are probably some types of errors that happen before the main API handler code is invoked and so don't get logged, like an invalid OAuth signature, but I doubt there's many of those.
Hm, comparing throughput in Kafka:
These look to be about the same, and at least defintely differ by a factor of 2.
So, there must be something either in the Camus import or in the Hive Avro stuff that is missing.
So in Camus logs, I see errors like
May 1 09:15:07 an-coord1001 camus-mediawiki-analytics[69524]: 19/05/01 09:15:07 ERROR kafka.CamusJob: Offset range from kafka metadata is outside the previously persisted offset, mediawiki_ApiAction#011uri:tcp://kafka1013.eqiad.wmnet:9092#011leader:13#011partition:2#011earliest_offset:46734685522#011offset:28906055925#011latest_offset:47032737089#011avg_msg_size:319#011estimated_size:5782411291316
for a few ApiAction partitions. I don't fully know why this is happening, but it looks like this Camus import job is misconfigured. It doesn't set kafka.move.to.earliest.offset=true, so when it tries to use an offset that no longer exists in Kafka, it just skips that partition (or topic?). Not sure about this, I'd expect it to just start from latest when this happens.
Anyway, this seems to have been going on for a LONG time (who knows how long). The new mediawiki.api-request topics should be better monitored, since they use the same system and configuration we use for all events.
Since we have a culprit (even if we don't know the full cause), I feel ok with just proceeding to port over Bryan's Hive queries and use the new event topic, if there are no objections.
Yes and no. It comes via monolog, just like ApiAction in ApiMain.php. The EventBus extension now has a Monolog handler that is used for the api-request logging.
I modified the mediawiki-analytics (avro) camus job to set kafka.move.to.earliest.offset=true and ran it a a few times. It seems to have fixed its offset and is now importing all data from the last 7 days. This should allow us to vet the new data and new queries against the old ones.
Expect a big jump in the downstream data that computes from ApiAction...
Thanks, so this is from the api-request MediaWiki log channel created in 491887. I missed when that happened.
@Tgr: the code was merged a while back but it is only as of (last week?) that this is deployed to all wikis.
Looking a bit better!
SELECT count(*) from wmf_raw.ApiAction where year = 2019 AND month = 5 AND day = 3 AND hour BETWEEN 9 AND 12 AND ts >= 1556877600 AND ts < 1556881200; 22786681 SELECT count(*) from event.mediawiki_api_request where year = 2019 AND month = 5 AND day = 3 AND hour BETWEEN 9 AND 12 AND meta.dt >= "2019-05-03T10:00:00Z" AND meta.dt < "2019-05-03T11:00:00Z"; 22851410
Still missing 0.02% of api requests in ApiAction, but maybe that's ok? Hm.
Ah, and I just realized there will be more actual API requests than webrequests. Internal API requests likely do not hit frontend varnishes (or do they?). They wouldn't be logged via varnishkafka.
@bd808 @Tgr
https://github.com/bd808/action-api-analytics/pull/1
I tested the selects on theses queries, I think they all work as expected. It'd be good to have someone more familiar with the data check.
@EvanProdromou whatcha think about a timeline on this? Turning off the ApiAction events is a goal for this quarter, and this task has to be resolved before we do that.
@EvanProdromou @bd808, @Tgr: I'm still not sure who the owner of this data is, but we need to set a deadline I think! We need to turn of the ApiAction events soon so we have time to decommission it's source Kafka cluster this quarter. So, I'd appreciate if the queries were ported over by June 7th. If there isn't any movement here before that, we'll go ahead and turn the ApiAction events off.
Thanks!
Change 516303 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Disable ApiAction log channel
Bump! @bd808 if I don't hear anything by Wednesday my morning I'm going to turn off this stream! :o
Change 516303 merged by Ottomata:
[operations/mediawiki-config@master] Disable ApiAction log channel
Mentioned in SAL (#wikimedia-operations) [2019-06-19T13:56:04Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Disabling Avro ApiAction Monolog channel - T222267 (duration: 00m 57s)
Change 517864 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/core@master] Remove deprecated ApiAction logging code from ApiMain.php
Change 517918 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] camus - ApiAction stream has been disabled, only import CirrusSearchRequestSet
Change 517918 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] camus - ApiAction stream has been disabled, only import CirrusSearchRequestSet
Change 517918 merged by Ottomata:
[operations/puppet@production] camus - ApiAction stream has been disabled, only import CirrusSearchRequestSet
Change 517864 merged by jenkins-bot:
[mediawiki/core@master] Remove deprecated ApiAction logging code from ApiMain.php