Page MenuHomePhabricator

Fix Reading List Service dashboard
Open, NormalPublic

Event Timeline

Tgr created this task.Mar 9 2018, 12:51 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 9 2018, 12:51 AM
Tgr added a comment.Mar 9 2018, 11:39 PM

Fixed some bugs (summed statsd items instead of counting them so ended up with total request time in ms as request count... awkward), made formatting a bit nicer, added request duration charts.
The tables seem somewhat accurate now, they show ~2000 5xx results for /changes/since/ and I get the same in Logstash as well. There are a bunch of other 5xx errors that do not correlate with the MW exception logs though. And the singlestat numbers at the top do not even resemble the detailed counts.

bearND added a subscriber: bearND.Mar 9 2018, 11:52 PM

Wanna add links to the patches here? I guess Grafana would still show the issues until those patches are deployed, right?

Tgr added a comment.Mar 9 2018, 11:58 PM

I meant fixing bugs in the dashboard metrics definitions (although I did fix T189272 in the meanwhile). Those are handled realtime via the Grafana admin interface (I think there is some way to store dashboards in git but I don't think it's worth the overhead unless a lot of people use that dashboard), so what you see now is the (somewhat) fixed version.

Ah, I was hoping the error rate would be reduced as well. 66% is very high.

Tgr added a comment.Mar 10 2018, 1:01 AM

The restbase.external.v1_data_lists* statsd metrics shows all kinds of 4xx and 5xx errors, about 7500 in total for the last week; the RESTBase logs cannot be easily grouped by request path (that should be fixed) but show about 20,000 log entries for the same period.

The two singlestats counters at the top of the dasboard (reads / writes) display sumSeries(restbase.external.v1_data_lists*.GET.2xx.count) and sumSeries(exclude(restbase.external.v1_data_lists*.*.2xx.count, 'GET')), the "OK" chart at the middle left shows restbase.external.v1_data_lists_*.*.2xx.count, so they should be roughly equal, but the counters show ~2.5K hits for the last week, and the chart sums up to around 15K.

Tgr added a comment.Mar 10 2018, 1:05 AM

Ah, I was hoping the error rate would be reduced as well. 66% is very high.

Well, they show data for the last 7 days by default so they won't change anytime soon. But in any case the errors seem to be dominated by unauthenticated requests from the app (for the whole week, 17K of those vs. 2,5K of the errors caused by T189272).

Tgr added a comment.Mar 10 2018, 1:06 AM

But in any case the errors seem to be dominated by unauthenticated requests from the app

Or maybe something other than the app, I guess that's worth verifying.

Yeah, insteresting. In the RESTBase logs see 17k of "MW API call error notloggedin" with Android app user agents like "WikipediaApp/2.7.225-alpha-2018-02-22 (Android 7.0; Phone) Alpha Channel".
That is probably worth filing another task for. The app should not try to sync if not logged in. Or it could be that the user session has expired?

Tgr added a comment.Mar 10 2018, 1:08 AM

Also the singlestat for error rate says 66% but just by eyeballing the two charts, there are about twice as much successful requests than failed ones...

Tgr added a comment.Mar 10 2018, 1:24 AM

Raw API request counts from Kafka/Hive:

0: jdbc:hive2://analytics1003.eqiad.wmnet:100> select day, count(*) from ApiAction where year = 2018 and month = 3 and (params['action'] = 'readinglists' or params['meta'] = 'readinglists' or params['list'] = 'readinglistentries') group by day;
...
day     _c1
1       7272
2       15638
3       4133
4       6752
5       40991
6       13175
7       13773
8       13480
9       6471
9 rows selected (446.841 seconds)

That's 120K requests for the month, 100K-ish for the last 7 days. Grafana only shows around 20K. (Also, February has multiple 100K+ request days. What's up with those request counts?)

Tgr added a comment.EditedMar 10 2018, 1:36 AM
0: jdbc:hive2://analytics1003.eqiad.wmnet:100> select userAgent ua, count(*) count from ApiAction where year = 2018 and month = 3 and day = 8 and (params['action'] = 'readinglists' or params['meta'] = 'readinglists' or params['list'] = 'readinglistentries') group by userAgent having count > 100 order by count desc limit 100;
...
ua      count
WikipediaApp/5.8.0.0 (iOS 11.2; Phone)  6256
WikipediaApp/5.8.0.0 (iOS 10.3.1; Phone)        5863
WikipediaApp/5.8.0.0 (iOS 11.1; Phone)  231
WikipediaApp/5.8.0.0 (iOS 11.2; Tablet) 216
WikipediaApp/5.8.0.0 (iOS 10.3.1; Tablet)       178
WikipediaApp/2.7.226-beta-2018-03-06 (Android 6.0.1; Phone) Google Play Beta Channel    159
WikipediaApp/2.7.226-beta-2018-03-06 (Android 7.0; Phone) Google Play Beta Channel      104
7 rows selected (142.125 seconds)

So yeah, the requests do come from the Android client.
So most requests actually come from the iOS client. Part of their testing with huge lists maybe?

Tgr added a comment.Mar 10 2018, 2:01 AM
0: jdbc:hive2://analytics1003.eqiad.wmnet:100> select count(*) count, client, status from (select case when userAgent like 'WikipediaApp/%iOS%' then 'iOS' when userAgent like 'WikipediaApp/%Android%' then 'Android' else 'other' end client, concat_ws(',', errorCodes) status from ApiAction where year = 2018 and month = 3 and day = 8 and (params['action'] = 'readinglists' or params['meta'] = 'readinglists' or params['list'] = 'readinglistentries'))x group by client, status having count > 100 order by count desc limit 100;
...
count   client  status
7106    iOS
4140    iOS     readinglists-db-error-no-such-list-entry
773     iOS     notloggedin
747     iOS     internal_api_error_LogicException
488     Android 
112     Android readinglists-db-error-not-set-up
6 rows selected (172.125 seconds)

Same interval in RESTBase logs - the numbers match up reasonably well (given that the Hive query has some filtering).
So apparently RESTBase only logs 5xx errors (which makes sense although it would be cool to be able to see bad requests as well). notloggedin is not on the list of not-500 errors, that needs to be fixed.

Tgr added subscribers: Mhurd, Dbrant.Mar 10 2018, 2:30 AM

Same query for 7 days (and without the having limit):

count   client  status
67803   iOS
11381   iOS     notloggedin
8920    iOS     readinglists-db-error-no-such-list-entry
4257    iOS     readinglists-db-error-entry-limit
3119    Android 
1678    iOS     internal_api_error_LogicException
640     iOS     readinglists-db-error-list-entry-deleted
376     Android readinglists-db-error-not-set-up
237     Android readinglists-db-error-no-such-project
177     iOS     readinglists-db-error-not-set-up
61      iOS     readinglists-db-error-already-set-up
37      Android notloggedin
20      iOS     readinglists-db-error-duplicate-page
18      Android internal_api_error_LogicException
17      other   
10      other   notloggedin
10      Android readinglists-db-error-list-limit
10      iOS     badtoken
9       other   mustpostparams
8       Android readinglists-db-error-entry-limit
5       Android readinglists-db-error-already-set-up
3       Android readinglists-db-error-no-such-list
3       Android badtoken
2       Android readinglists-db-error-duplicate-page
2       Android readinglists-db-error-list-deleted
1       other   readinglists-db-error-duplicate-page
1       other   invalidparammix

@Dbrant @Mhurd

Tgr added a comment.Mar 10 2018, 2:36 AM

(Which is a 28% error rate and matches the Grafana tables well enough. So it looks like only the singlestats are broken.)

Tgr updated the task description. (Show Details)Mar 10 2018, 2:42 AM
Tgr added a comment.Mar 10 2018, 2:44 AM

(Which is a 28% error rate and matches the Grafana tables well enough. So it looks like only the singlestats are broken.)

The actual counts don't match though, the above gives ~60K successful and ~30K failed requests, with Grafana it's more like 16K and 8K.

Tgr updated the task description. (Show Details)Mar 10 2018, 2:45 AM
Tgr added a comment.Mar 10 2018, 3:00 AM

Raw API request counts from Kafka/Hive:

0: jdbc:hive2://analytics1003.eqiad.wmnet:100> select day, count(*) from ApiAction where year = 2018 and month = 3 and (params['action'] = 'readinglists' or params['meta'] = 'readinglists' or params['list'] = 'readinglistentries') group by day;
...
day     _c1
...
8       13480
9       6471

Sanity check on fluorine:

tgr@mwlog1001:/srv/mw-log$ head -n1 /srv/mw-log/archive/api.log-20180309 | cut -d' ' -f1-2
2018-03-08 07:40:43
tgr@mwlog1001:/srv/mw-log$ tail -n1 /srv/mw-log/archive/api.log-20180309 | cut -d' ' -f1-2
2018-03-09 07:44:26
tgr@mwlog1001:~$ cat /srv/mw-log/archive/api.log-20180309 | ack readinglist | wc -l
28945

It would be nice to find two metrics which agree for once :/

Could theoretically be due to some client combining the action=query request with other API modules (the grepping would catch that, the Hive query wouldn't) but seems unlikely that anyone would do that.

Tgr added a comment.EditedMar 10 2018, 3:37 AM
0: jdbc:hive2://analytics1003.eqiad.wmnet:100> select count(*) count, type, hadError from (select hadError, case when params['action'] = 'readinglists' then params['command'] when params['meta'] = 'readinglists' then 'query+readinglists' when params['list'] = 'readinglistentries' then 'query+readinglistentries' else '?' end type from ApiAction where year = 2018 and month = 3 and day > 2 and (params['action'] = 'readinglists' or params['meta'] = 'readinglists' or params['list'] = 'readinglistentries'))x group by type, hadError order by count desc limit 100;
...
count   type    haderror
45024   query+readinglistentries        false
15744   query+readinglists      false
12327   query+readinglists      true
9571    deleteentry     true
6221    deleteentry     false
5494    createentry     true
2870    createentry     false
813     create  false
262     delete  true
131     delete  false
119     setup   false
115     create  true
75      setup   true
14      teardown        false
13      teardown        true
8       query+readinglistentries        true
3       update  false
1       update  true
18 rows selected (172.255 seconds)

Comparing that to Grafana:
OK

GET v1/data/lists/{id}/entries/	9.522 K
DELETE v1/data/lists/{id}/entries/{entry-id}	3.074 K
GET v1/data/lists/changes/since/{date}	2.117 K
POST v1/data/lists/{id}/entries/batch	700
POST v1/data/lists/	301
GET v1/data/lists/	298
POST v1/data/lists/setup	160
POST v1/data/lists/{id}/entries/	152
POST v1/data/lists/batch	101
DELETE v1/data/lists/{id}	84
POST v1/data/lists/teardown	21
PUT v1/data/lists/{id}	4
GET v1/data/lists/pages/{project}/{title}	0

Error:

DELETE v1/data/lists/{id}/entries/{entry-id} 4xx	4.088 K
GET v1/data/lists/changes/since/{date} 5xx	1.766 K
POST v1/data/lists/{id}/entries/batch 4xx	1.103 K
GET v1/data/lists/ 4xx	355
POST v1/data/lists/{id}/entries/batch 5xx	219
GET v1/data/lists/changes/since/{date} 4xx	202
POST v1/data/lists/setup 4xx	64
POST v1/data/lists/ 5xx	63
DELETE v1/data/lists/{id} 5xx	56
POST v1/data/lists/{id}/entries/ 4xx	52
GET v1/data/lists/ 5xx	50
POST v1/data/lists/batch 5xx	15
DELETE v1/data/lists/{id} 4xx	7
POST v1/data/lists/ 4xx	7
POST v1/data/lists/setup 5xx	6
GET v1/data/lists/{id}/entries/ 5xx	5
POST v1/data/lists/teardown 4xx	4
POST v1/data/lists/teardown 5xx	3
GET v1/data/lists/{id}/entries/ 4xx	2
DELETE v1/data/lists/{id}/entries/{entry-id} 5xx	1

So,

MW API moduleRESTBase endpointhas error?Hive countGrafana countratio
query+readinglists, query+readinglistentriesGET v1/data/lists/, GET v1/data/lists/{id}/entries/, GET v1/data/lists/changes/since/{date}15744 + 45024300 + 9522 + 211723% (*)
query+readinglists, query+readinglistentriesGET v1/data/lists/, GET v1/data/lists/{id}/entries/, GET v1/data/lists/changes/since/{date}X12327 + 0405 + 7 + 196835% (*)
deleteentryDELETE v1/data/lists/{id}/entries/{entry-id}X9571408943%
deleteentryDELETE v1/data/lists/{id}/entries/{entry-id}6221307449%
createentryPOST v1/data/lists/{id}/entries/ (**)X5494137425%
createentryPOST v1/data/lists/{id}/entries/ (**)287085230%
createGET v1/data/lists/81340249%

(cutting off at 500).
(*) /changes/since/ typically hits both API modules
(**) + batched version

The ratios are at least in the same ballpark, so maybe RESTBase statsd logs are sampled? Although I would still expect more consistent numbers then.

Tgr updated the task description. (Show Details)Mar 10 2018, 3:42 AM
Tgr added a subscriber: bd808.Mar 10 2018, 3:57 AM

Could theoretically be due to some client combining the action=query request with other API modules (the grepping would catch that, the Hive query wouldn't) but seems unlikely that anyone would do that.

Indeed, running the query with params['action'] = 'readinglists' or params['meta'] like '%readinglists%' or params['list'] like '%readinglistentries%' yields the exact same results.

@bd808 any idea what might cause the discrepancy in T189274#4039924 between ApiAction and api.log on fluorine?
(FWIW the same grep gives 10289 for api.log-20180308.gz which seems possible, given that the logfile rotation is off by a couple hours from UTC days.)

bd808 added a comment.Mar 10 2018, 5:56 PM

@bd808 any idea what might cause the discrepancy in T189274#4039924 between ApiAction and api.log on fluorine?

Nothing more than the obvious "data out of sync" answer. The data flowing into hadoop has to jump through several hoops including validation and exotic transit so I wouldn't be amazed if you found ways that the signal is more lossy than the more simple udp2log records. I can't remember if these two views of the request are generated at the exact same point in the application flow or not either. Its been 2 years since I worked on that code. :)

Pinging also @NHarateh_WMF for T189274#4039890 (various reading list errors iOS clients generate). Not causing any problem right now, just FYI.

Tgr removed Tgr as the assignee of this task.Jun 30 2018, 11:53 AM