Frequent client-side errors for reading list service
Closed, Resolved (Public)

Description

0: jdbc:hive2://analytics1003.eqiad.wmnet:100> select count(*) count, client, status
from (
  select
    case
      when userAgent like 'WikipediaApp/%iOS%' then 'iOS'
      when userAgent like 'WikipediaApp/%Android%' then 'Android'
      else 'other'
    end client,
    concat_ws(',', errorCodes) status
  from ApiAction
  where year = 2018 and month = 4 and day >= 20
    and (params['action'] = 'readinglists' or params['meta'] = 'readinglists' or params['list'] = 'readinglistentries')
) x
group by client, status
order by count desc
limit 100;
count   client  status
256063  iOS
113351  Android 
109745  iOS     readinglists-db-error-not-set-up
19911   Android readinglists-db-error-not-set-up
12895   Android readinglists-db-error-no-such-project
2928    iOS     readinglists-db-error-no-such-project
2454    iOS     notloggedin
1909    Android notloggedin
777     iOS     readinglists-db-error-entry-limit
577     iOS     readinglists-db-error-already-set-up
112     Android badtoken
85      Android readinglists-db-error-no-such-list
42      Android readinglists-too-old
41      Android readinglists-db-error-already-set-up
37      iOS     badtoken
33      iOS     readinglists-db-error-list-entry-deleted
33      other   
21      other   readinglists-db-error-not-set-up
14      Android readinglists-db-error-list-entry-deleted
13      Android readinglists-db-error-list-limit
8       Android maxbytes
7       other   notloggedin
4       Android readinglists-db-error-no-such-list-entry
4       iOS     readinglists-db-error-not-own-list
2       Android readinglists-db-error-list-deleted

That means ~30% of all iOS requests and ~15% of all Android requests fail because the backend has not been set up. Given that (AIUI) sync does not start until the user opts in, that seems like a logic error in those apps (or in the backend, but that's less likely).
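
For reference, the percentages can be reproduced from the query output. The following is a minimal Python sketch that parses the rows as pasted above and computes, per client, the share of requests that ended in not-set-up. The names `ROWS` and `error_share` are only illustrative.

```python
from collections import defaultdict
from typing import Dict

# Query output copied from the description (count, client, status).
# An empty status means the request completed without an error code.
ROWS = """\
256063  iOS
113351  Android
109745  iOS     readinglists-db-error-not-set-up
19911   Android readinglists-db-error-not-set-up
12895   Android readinglists-db-error-no-such-project
2928    iOS     readinglists-db-error-no-such-project
2454    iOS     notloggedin
1909    Android notloggedin
777     iOS     readinglists-db-error-entry-limit
577     iOS     readinglists-db-error-already-set-up
112     Android badtoken
85      Android readinglists-db-error-no-such-list
42      Android readinglists-too-old
41      Android readinglists-db-error-already-set-up
37      iOS     badtoken
33      iOS     readinglists-db-error-list-entry-deleted
33      other
21      other   readinglists-db-error-not-set-up
14      Android readinglists-db-error-list-entry-deleted
13      Android readinglists-db-error-list-limit
8       Android maxbytes
7       other   notloggedin
4       Android readinglists-db-error-no-such-list-entry
4       iOS     readinglists-db-error-not-own-list
2       Android readinglists-db-error-list-deleted
"""

NOT_SET_UP = "readinglists-db-error-not-set-up"


def error_share(rows: str, status: str) -> Dict[str, float]:
    """Return, per client, the fraction of requests that ended with `status`."""
    totals = defaultdict(int)
    matching = defaultdict(int)
    for line in rows.splitlines():
        parts = line.split()
        if not parts:
            continue
        count, client = int(parts[0]), parts[1]
        row_status = parts[2] if len(parts) > 2 else ""
        totals[client] += count
        if row_status == status:
            matching[client] += count
    return {client: matching[client] / totals[client] for client in totals}


shares = error_share(ROWS, NOT_SET_UP)
for client, share in sorted(shares.items()):
    print(f"{client}: {share:.1%}")
```

With these rows the result is about 29.5% for iOS and about 13.4% for Android, which matches the rough figures quoted above.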

Also, there are plenty of invalid project name errors; it would be nice to see where those are coming from.

None of that is too much of a problem, but it makes actual problems harder to find.

Event Timeline

Speaking only for Android, we're basically relying on the changes/since endpoint to tell us whether the backend is set up. That is, when the user logs in, we make a request to changes/since, and expect it to return either not-set-up or actual valid content. It's difficult to tell if the 15% figure is within the realm of expectation, but I think it's possible, since we have numerous onboarding materials that are encouraging new and existing users to log in for the first time and sync their lists.
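
To illustrate the check described above, here is a rough Python sketch (not the actual Android code). It assumes the response is a JSON body with an optional `error` object carrying a `code` field; that shape, and the names `SyncState` and `classify_changes_since`, are assumptions for illustration only.

```python
from enum import Enum
from typing import Any, Mapping

NOT_SET_UP = "readinglists-db-error-not-set-up"


class SyncState(Enum):
    NOT_SET_UP = "not set up"
    SET_UP = "set up"
    FAILED = "failed"


def classify_changes_since(body: Mapping[str, Any]) -> SyncState:
    """Map a (hypothetical) changes/since response body to a sync state."""
    error = body.get("error")
    if error is None:
        # No error object: the backend is set up and returned content.
        return SyncState.SET_UP
    code = error.get("code") if isinstance(error, Mapping) else None
    if code == NOT_SET_UP:
        # Expected for users who have not enabled sync yet, although the
        # request still shows up as an error in the server-side logs.
        return SyncState.NOT_SET_UP
    return SyncState.FAILED


print(classify_changes_since({"error": {"code": NOT_SET_UP}}))
```

The point of the sketch is that the client treats not-set-up as a normal answer to "is sync enabled?", while the server counts the same response as a failed request.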

We'll investigate the no-such-project errors further. I'm betting that it's a small number of users that managed to add a malformed item to a list, and it's producing this error on every sync attempt.

The origin of not-set-up errors on iOS is similar. We're also calling changes/since to check if the user enabled sync for the account on another device. We make that call every ~15 seconds.
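
A back-of-the-envelope Python sketch of what that polling interval means for request volume, assuming (purely for illustration) that the app stays in the foreground and polls without interruption:

```python
POLL_INTERVAL_SECONDS = 15  # approximate interval mentioned above


def polls_per_hour(interval_seconds: float) -> float:
    """Number of changes/since calls one device makes per hour of active use."""
    return 3600 / interval_seconds


print(polls_per_hour(POLL_INTERVAL_SECONDS))  # 240.0
```

Under that assumption, every hour of active use by a logged-in user who has not enabled sync would add about 240 not-set-up responses to the logs.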

We'll look into no-such-project errors.

Re-ran the query for recent days (27th and later) to see if the recent increase in response times resulted in different errors, but there is no obvious difference:

count   client  status
627923  iOS
518077  iOS     readinglists-db-error-not-set-up
100839  Android 
13558   Android readinglists-db-error-no-such-project
13515   iOS     readinglists-db-error-no-such-project
9358    Android readinglists-db-error-not-set-up
3086    iOS     notloggedin
1591    Android notloggedin
1002    Android badtoken
610     iOS     readinglists-db-error-already-set-up
131     Android readinglists-db-error-no-such-list
118     iOS     readinglists-db-error-entry-limit
54      Android readinglists-too-old
47      iOS     readinglists-db-error-no-such-list
43      iOS     badtoken
39      iOS     readinglists-db-error-list-entry-deleted
39      Android readinglists-db-error-already-set-up
20      Android readinglists-db-error-list-entry-deleted
11      Android maxbytes
10      iOS     readinglists-db-error-not-own-list
8       Android readinglists-db-error-list-deleted
7       other   mustpostparams
5       Android readinglists-db-error-list-limit
4       other   notloggedin
4       iOS     readinglists-db-error-duplicate-list
4       Android Talk:Talk:Gaulish_language
2       other   
1       other   badtimestamp_rlechangedsince

(no idea WTF is the error code Talk:Talk:Gaulish_language...)

@Tgr

@Dbrant is going to write a script to see if he can break the service by adding pages from every project that the app supports.

2 questions:

  1. All WMF projects are valid to be added to the RL service, right? Are there any obvious domains that would fail? Is anything blacklisted?
  2. If Dmitry is unsuccessful in breaking the service, what do you estimate the effort would be to log the failing project server-side (so we know where the failure comes from)?

cc @JoeWalsh

All (non-private) WMF projects that show up in SiteMatrix. (Which is pretty exhaustive I think.)
Logging is easy (probably a one-line patch); if the only reason for writing the script is to identify which domain name fails, don't bother with it IMO.

It was simple enough to whip up a loop that adds a page from all known wikipedia subdomains to a list.
And actually... there are three (valid) subdomains that seem to result in a no-such-project error:

https://gor.wikipedia.org
https://inh.wikipedia.org
https://lfn.wikipedia.org

This doesn't necessarily mean that these are the projects represented in the error counts listed above, and it would still be useful to know which projects are invalid from the server's perspective.
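
A loop of that kind could look roughly like the following Python sketch. It is not the script that was actually used: `add_entry` is a hypothetical callable standing in for the real request to the service, and `fake_add_entry` only simulates the three rejections reported above.

```python
from typing import Callable, Iterable, List, Optional

NO_SUCH_PROJECT = "readinglists-db-error-no-such-project"


def find_rejected_projects(
    languages: Iterable[str],
    add_entry: Callable[[str, str], Optional[str]],
    title: str = "Main Page",
) -> List[str]:
    """Try to add one page per wiki and return the origins the service rejects.

    `add_entry(project, title)` is a hypothetical callable that performs the
    request and returns the error code, or None on success.
    """
    rejected = []
    for lang in languages:
        project = f"https://{lang}.wikipedia.org"
        if add_entry(project, title) == NO_SUCH_PROJECT:
            rejected.append(project)
    return rejected


def fake_add_entry(project: str, title: str) -> Optional[str]:
    """Stand-in for the real service call; rejects the three wikis found above."""
    unknown = {
        "https://gor.wikipedia.org",
        "https://inh.wikipedia.org",
        "https://lfn.wikipedia.org",
    }
    return NO_SUCH_PROJECT if project in unknown else None


print(find_rejected_projects(["en", "gor", "inh", "lfn"], fake_add_entry))
```

Passing the request function in as a parameter keeps the loop itself testable without touching the production service.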

These three are all fairly new wikis, see T192678. I think RI needs to add these to the project whitelist for the Readinglist MW extension. Maybe @Tgr can walk the rest of RI through how this is done during the Hackathon?

Thanks! I didn't think of that; the whitelist does indeed not track site changes. That's unlikely to be the cause of ~10% of Android requests having an invalid domain, though.

> Maybe @Tgr can walk the rest of RI through how this is done during the Hackathon?

It's just a DB table with domain names (well, origins) in it. You could update it by hand:

tgr@terbium:~$ sql wikishared --write
wikiadmin@10.64.48.19(wikishared)> insert into reading_list_project (rlp_project) values ('https://gor.wikipedia.org'), ('https://inh.wikipedia.org'), ('https://lfn.wikipedia.org');

but you can also just run populateProjectsFromSiteMatrix.php. That should probably be added to a cronjob.
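
Conceptually, a periodic job only has to insert the origins that SiteMatrix knows about but the table does not. A minimal Python sketch of that comparison follows; it is not how populateProjectsFromSiteMatrix.php is implemented, and the function name and sample inputs are hypothetical.

```python
from typing import Iterable, List


def missing_projects(
    site_matrix_origins: Iterable[str],
    whitelisted_origins: Iterable[str],
) -> List[str]:
    """Return origins listed in SiteMatrix but absent from the whitelist, sorted."""
    return sorted(set(site_matrix_origins) - set(whitelisted_origins))


# Hypothetical inputs: SiteMatrix already lists the new wikis, the table does not.
site_matrix = [
    "https://en.wikipedia.org",
    "https://gor.wikipedia.org",
    "https://inh.wikipedia.org",
    "https://lfn.wikipedia.org",
]
whitelist = ["https://en.wikipedia.org"]

print(missing_projects(site_matrix, whitelist))
```

Each origin in the result would then become one row in reading_list_project, as in the manual insert above.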

Change 433385 had a related patch set uploaded (by Dbrant; owner: Dbrant):
[apps/android/wikipedia@master] Be more fault-tolerant with malformed reading list pages.

https://gerrit.wikimedia.org/r/433385

Change 433385 merged by jenkins-bot:
[apps/android/wikipedia@master] Be more fault-tolerant with malformed reading list pages.

https://gerrit.wikimedia.org/r/433385

Our latest production release should lead to a marked decrease in no-such-project errors.

Are there any other unexplained errors at this time? (can this task be closed?)

Sorry, forgot about this.

Stats for the first ten days of December:

count	client	status
2020973	iOS	
998886	iOS	readinglists-db-error-not-set-up
298553	Android	
193189	iOS	readinglists-too-old
26781	Android	readinglists-too-old
18297	iOS	readinglists-db-error-no-such-project
11375	Android	readinglists-db-error-not-set-up
6799	iOS	readinglists-db-error-list-limit
5707	iOS	notloggedin
5288	iOS	readinglists-db-error-no-such-list-entry
5081	iOS	readinglists-db-error-entry-limit
3635	Android	notloggedin
2279	iOS	readinglists-db-error-list-entry-deleted
1792	other	
1597	iOS	badtoken
1562	Android	readinglists-db-error-no-such-list
843	iOS	readinglists-db-error-already-set-up
612	iOS	readinglists-db-error-duplicate-list
574	iOS	readinglists-db-error-not-own-list
414	Android	readinglists-db-error-no-such-project
191	Android	readinglists-db-error-list-limit
128	Android	readinglists-db-error-list-entry-deleted
64	Android	readinglists-db-error-list-deleted
57	Android	readinglists-db-error-duplicate-list
54	Android	readinglists-db-error-already-set-up
49	Android	badtoken
39	iOS	readinglists-db-error-no-such-list
15	Android	readinglists-db-error-no-such-list-entry
14	other	mustpostparams
12	Android	invalidtitle
10	iOS	maxbytes
10	iOS	readinglists-db-error-too-long
9	other	readinglists-db-error-not-set-up
7	other	notloggedin
5	Android	maxbytes
4	iOS	invalidtitle
4	iOS	readinglists-db-error-list-deleted
3	iOS	internal_api_error_LogicException
3	other	nocommand
2	Android	readinglists-db-error-not-own-list
1	Android	readinglists-db-error-user-required

Some of those I would not expect to happen: the not-own-list error, although it's much less frequent on Android now; the no-such-list / no-such-list-entry errors; the notloggedin error (one for every ~300 successful requests). I guess the deleted / already-set-up / not-set-up errors can just be race conditions between two devices.

As I said earlier, none of this is a problem for the server. Feel free to close the task if it does not seem useful or it does not seem worth looking into the errors.

Dbrant claimed this task.

Thanks!