Page MenuHomePhabricator

Get UploadWizard's uncaught exceptions on Commons to fewer than 50/day
Closed, ResolvedPublic

Description

Some time ago, before the new Multimedia team started working seriously on UploadWizard again, it was throwing thousands of exceptions every day, according to our logging data.

DateUncaught exceptions
2015100210927
2015100311431
2015100415274
2015100518870
2015100612872

We're in a lot better situation today, but it's still kind of terrible:

DateUncaught exceptions
201605191366
20160520634
20160521419
20160522716
20160523517

So let's focus on this for a bit.

I glanced through recent data and I think 50/day is a reasonable goal to aim for. We should be able to get it down to 150 or so without breaking a sweat, there are lots of identical errors which look kind of obvious (mentioning variables by a name we only use once in the whole codebase, etc.). 50 will be challenging, and any further than that might not really be possible, since by that point we'll be probably left with unreproduceable random failures and useless error messages. We have things like "Script error" or "Out of memory" in the logs that don't really look like we'll be able to guess what causes them.

Related Objects

StatusAssignedTask
Resolvedmatmarex
DuplicateNone
Declinedmatmarex
Resolvedmatmarex
Resolvedmatmarex
Resolvedmatmarex
Resolvedmatmarex
Resolvedmatmarex
Resolvedmatmarex
Resolvedmatmarex
Resolvedmatmarex
Resolvedmatmarex
Resolvedmatmarex
Resolvedmatmarex
Resolvedmatmarex
OpenNone
ResolvedAnomie
DuplicateNone
Resolvedmatmarex
Resolvedmatmarex
Resolvedmatmarex
Resolvedmatmarex

Event Timeline

matmarex created this task.May 25 2016, 7:28 PM
Restricted Application added a project: Multimedia. · View Herald TranscriptMay 25 2016, 7:28 PM
Restricted Application added subscribers: Zppix, Steinsplitter, Aklapper. · View Herald Transcript
Restricted Application added a subscriber: Matanya. · View Herald TranscriptMay 25 2016, 7:37 PM

Should be enough subtasks for now ;) I'll file more when we fix these. If anyone wants to work on anything, they're free for taking. These are the most common error messages in the few days' sample I was looking through.

The patterns are rather interesting. Some of the very prominent errors seem to affect just a single user or two on a particular day, and for some reason they occur thousands of times and spam our data. At least two of the ones I'm seeing ("this.get(...).querySelectorAll is not a function" and "clearSelection is not defined") are clearly bogus and probably caused by some buggy browser extensions the user has installed. So… I guess we'll need to identify and filter them out to reach the 50/day target ;)

Here are counts of the most common error messages from the last days, plotted by day and by message ("NULL" signifies all other errors, grouped together):


message20160518201605192016052020160521201605222016052320160524201605252016052620160527201605282016052920160530201605312016060120160602
NULL281201016472126975111413040123871479526
Argument konstruktoru typovaného pole je neplatný.0000015469000039000
Für diesen Vorgang ist nicht genügend Speicher verfügbar.07154063274301004973202
NS_ERROR_NOT_AVAILABLE:00001254301312710100
Not enough storage is available to complete this operation.748443851205620003
TypeError: str.replace is not a function3614617724949527122101204867027510
TypeError: this.descriptionsDetails.getSerialized(...).descriptions[0] is undefined099717991813812121731291
Typed array constructor argument is invalid0120470405100046778000
Unbekannter Fehler.01017016410010030311700
Uncaught ReferenceError: clearSelection is not defined0000000016560000000
Uncaught TypeError: Cannot read property 'code' of undefined0171739293422316414483977634751
Uncaught TypeError: Cannot read property 'description' of undefined5473161643959515643463658612223
Uncaught TypeError: Cannot read property 'getMainText' of null0025182571157161332312659
Uncaught TypeError: Cannot read property 'getName' of null04353104103117482536
Uncaught TypeError: Cannot read property 'toLowerCase' of null02011082301601145170126
Uncaught TypeError: str.replace is not a function1120313014919019513419472135101121156181370
Uncaught TypeError: this.get(...).querySelectorAll is not a function0522900000033122001146000
Uncaught TypeError: undefined is not a function02911014110018021774770000
Ungültiges Konstruktorargument in typisiertem Array.00270169321000001463500
Unknown operation "tls"1111056104598110267527700000000
select
	message,
	coalesce(sum(case when ts='20160518' then num end), 0) as `20160518`,
	coalesce(sum(case when ts='20160519' then num end), 0) as `20160519`,
	coalesce(sum(case when ts='20160520' then num end), 0) as `20160520`,
	coalesce(sum(case when ts='20160521' then num end), 0) as `20160521`,
	coalesce(sum(case when ts='20160522' then num end), 0) as `20160522`,
	coalesce(sum(case when ts='20160523' then num end), 0) as `20160523`,
	coalesce(sum(case when ts='20160524' then num end), 0) as `20160524`,
	coalesce(sum(case when ts='20160525' then num end), 0) as `20160525`,
	coalesce(sum(case when ts='20160526' then num end), 0) as `20160526`,
	coalesce(sum(case when ts='20160527' then num end), 0) as `20160527`,
	coalesce(sum(case when ts='20160528' then num end), 0) as `20160528`,
	coalesce(sum(case when ts='20160529' then num end), 0) as `20160529`,
	coalesce(sum(case when ts='20160530' then num end), 0) as `20160530`,
	coalesce(sum(case when ts='20160531' then num end), 0) as `20160531`,
	coalesce(sum(case when ts='20160601' then num end), 0) as `20160601`,
	coalesce(sum(case when ts='20160602' then num end), 0) as `20160602`
from (
	select trim(trailing '\r\n' from event_message) as message, ts, num from (
		select event_message, left(timestamp,8) as ts, count(*) as num
		from UploadWizardExceptionFlowEvent_11772722
		where timestamp > '2016' and event_message in (
			select event_message 
			from UploadWizardExceptionFlowEvent_11772722
			where timestamp > '2016'
			group by event_message
			having count(*)>100
		)
		group by event_message, left(timestamp,8)
		union
		select null, left(timestamp,8) as ts, count(*) as num
		from UploadWizardExceptionFlowEvent_11772722
		where timestamp > '2016' and event_message not in (
			select event_message 
			from UploadWizardExceptionFlowEvent_11772722
			where timestamp > '2016'
			group by event_message
			having count(*)>100
		)
		group by left(timestamp,8)
	) u
	order by ts, num, message
) v
group by message;
matmarex triaged this task as Normal priority.Jun 4 2016, 5:18 PM

message201605192016052020160521201605222016052320160524201605252016052620160527201605282016052920160530201605312016060120160602201606032016060420160605201606062016060720160608201606092016061020160611201606122016061320160614
NULL1158154501248545107117341138213990265523393732561125137513340
Argument konstruktoru typovaného pole je neplatný.000015469000039000000000000000
Für diesen Vorgang ist nicht genügend Speicher verfügbar.71540632743010049732020220010101600
NS_ERROR_NOT_AVAILABLE:0001254301312710101021634021061515
Not enough storage is available to complete this operation.4844385120562000511480823101353018
QuotaExceededError000000000000000000000535241245322
QuotaExceededError: DOM Exception 2200000000000000000000040941165420
Script error.41861817131013122212172141005143
TypeError: str.replace is not a function14617724949527122101204867027510000000000000
TypeError: this.descriptionsDetails.getSerialized(...).descriptions[0] is undefined9971799181381212173129882015354000000
Typed array constructor argument is invalid12047040510004677800023340158231026
Unbekannter Fehler.10170164100100303117000010010010300
Uncaught ReferenceError: clearSelection is not defined0000000165600000000039000000048300
Uncaught TypeError: Cannot read property 'code' of undefined171739293422316414483977634766464397463928000000
Uncaught TypeError: Cannot read property 'description' of undefined473161643959515643463658612232171930193534000000
Uncaught TypeError: Cannot read property 'getMainText' of null02518257115716133231265151331366551810482657
Uncaught TypeError: Cannot read property 'getName' of null435310410311748253632611728006102248
Uncaught TypeError: Cannot read property 'resolve' of undefined12441554357463372183214972901
Uncaught TypeError: Cannot read property 'toLowerCase' of null2011082301601145170126636126500362003
Uncaught TypeError: str.replace is not a function20313014919019513419472135101121156181370000000000000
Uncaught TypeError: this.get(...).querySelectorAll is not a function522900000033122001146003420060280130240000
Uncaught TypeError: undefined is not a function2911014110018021774770000000006000000
Ungültiges Konstruktorargument in typisiertem Array.02701693210000014635000000028000000
Unknown operation "tls"1056104598110267527700000000000000000000

Tgr added a subscriber: Tgr.Jun 14 2016, 4:27 PM

You should be able to get rid of much of the noise by filtering on event_url. Browser extensions will either have a chrome:// or similar URL or some external domain (if the extension pulls it from the web).

We have things like "Script error" or "Out of memory" in the logs that don't really look like we'll be able to guess what causes them.

Script error means the error message was hidden for security reasons. This only happens when some cross-domain resource was involved and I don't think UW uses any, so those can be ignored.

You should be able to get rid of much of the noise by filtering on event_url. Browser extensions will either have a chrome:// or similar URL or some external domain (if the extension pulls it from the web).

Ooh, yeah, that would be pretty smart. I don't see any chrome:// URLs, but there are plenty from various weird shady domains.

Filtering out everything that's not from 'https://commons.wikimedia.org':


message201605192016052020160521201605222016052320160524201605252016052620160527201605282016052920160530201605312016060120160602201606032016060420160605201606062016060720160608201606092016061020160611201606122016061320160614
NULL11073324676843094102278946120701345172029315378154331518
Argument konstruktoru typovaného pole je neplatný.000015469000039000000000000000
Für diesen Vorgang ist nicht genügend Speicher verfügbar.71540632743010049732020220010101600
NS_ERROR_NOT_AVAILABLE:0001254301312710101021634021061515
Not enough storage is available to complete this operation.48443851205520005114808227351811
TypeError: str.replace is not a function14617724949527122101204867027510000000000000
TypeError: this.descriptionsDetails.getSerialized(...).descriptions[0] is undefined9971799181381212173129882015354000000
Typed array constructor argument is invalid12047040510004677800023340158231026
Unbekannter Fehler.10170164100100303117000010010010300
Uncaught TypeError: Cannot read property 'code' of undefined171739293422316414483977634766464397463928000000
Uncaught TypeError: Cannot read property 'description' of undefined473161643959515643463658612232171930193534000000
Uncaught TypeError: Cannot read property 'getMainText' of null02518257115716133231265151331366551810482658
Uncaught TypeError: Cannot read property 'getName' of null435310410311748253632611728006102248
Uncaught TypeError: Cannot read property 'resolve' of undefined12441554357463372183214972904
Uncaught TypeError: Cannot read property 'toLowerCase' of null2011082301601145170126636126500362003
Uncaught TypeError: str.replace is not a function20313014919019513419472135101121156181370000000000000
Ungültiges Konstruktorargument in typisiertem Array.02701693210000014635000000028000000
Tgr added a comment.Jun 14 2016, 8:17 PM

Note that we load our scripts (most of the time, on most browsers) by reading the source from localStorage and eval-ing it, in which case the URL will be empty. (Plus there are the magic URLs you added in 17b40ce7/f563e187.)

Almost all of the error messages with event_url='' is stuff that is very definitely not coming from our code. (And the magic URLs account for a very small number of entries and in general are mostly harmless.)

mysql:research@dbstore1002.eqiad.wmnet [log]> select event_message, count(*) as n from UploadWizardExceptionFlowEvent_11772722 where event_url='' and timestamp>'2016' group by event_message having n>5;
+---------------------------------------------------------------------------------------------------+------+
| event_message                                                                                     | n    |
+---------------------------------------------------------------------------------------------------+------+
| Die Eigenschaft "code" eines undefinierten oder Nullverweises kann nicht abgerufen werden.        |    9 |
| Die Eigenschaft "getMainText" eines undefinierten oder Nullverweises kann nicht abgerufen werden. |    6 |
| Die Eigenschaft "resolve" eines undefinierten oder Nullverweises kann nicht abgerufen werden.     |    7 |
| Die Eigenschaft "toLowerCase" eines undefinierten oder Nullverweises kann nicht abgerufen werden. |    8 |
| L’objet ne gère pas la propriété ou la méthode « replace »                                        |    9 |
| Object doesn't support property or method 'replace'                                               |   11 |
| Out of memory                                                                                     |    7 |
| Script error.                                                                                     |   81 |
| Unable to get property 'code' of undefined or null reference                                      |    8 |
| Unable to get property 'toLowerCase' of undefined or null reference                               |   13 |
| Uncaught Error: Java exception was raised during method invocation                                |   10 |
| Uncaught ReferenceError: clearOverlappingSelection is not defined                                 |   13 |
| Uncaught ReferenceError: clearSelection is not defined                                            | 2529 |
| Uncaught ReferenceError: initMultiSelection is not defined                                        |   13 |
| Uncaught TypeError: Cannot read property 'addEventListener' of null                               |    9 |
| Uncaught TypeError: Cannot read property 'style' of null                                          |    6 |
| Uncaught TypeError: Failed to execute 'removeChild' on 'Node': parameter 1 is not of type 'Node'. |   10 |
| Uncaught TypeError: this.get(...).querySelectorAll is not a function                              | 2389 |
| Uncaught TypeError: undefined is not a function                                                   | 1184 |
| uncaught exception: out of memory                                                                 |   38 |
+---------------------------------------------------------------------------------------------------+------+
20 rows in set (3.90 sec)

n<=5 yields 32 more rows, which I'm omitting for privacy/security reasons.


Commit 046cbe84a29d1e4a3473d7497415f94889ca6182 (https://gerrit.wikimedia.org/r/297120) allows us to finally filter out non-UW exceptions, as all UW code should now be served by ResourceLoader in a single separate load.php request. So if we check for event_url like '%ext.uploadWizard%', we know for sure that all of the results are actually coming from UploadWizard, and we can cease wild goose chases like T136237.

However, it will not include some legitimate entries, e.g. when the browser just reports https://commons.wikimedia.org/wiki/Special:UploadWizard as the URL. So I'll be only using this to exclude specific error messages that appear for like 'https://commons.wikimedia.org/%' but never for like '%ext.uploadWizard%', not for the counts in the chart.

Status update:


This chart excludes errors related to T136237 and T139130 (which I'm pretty sure are not coming from our code, and I can't identify their source).

Most of the remaining logs are from T136235 (Internet-Explorer-specific) and T136831 (Firefox-specific). Both have patches pending.

I think we're pretty close to the goal of <50/day. As long as we don't introduce new ones ;)

All subtasks are resolved and I'm pretty sure we're now where I wanted us to be. I'll keep this open for a week or two for new data to come in, make some more charts, and close if it all looks good.

Status update:


Spikes on 2016-07-19, 2016-07-24 and 2016-07-26 are T136235. Spike on 2016-07-28 is T141529. Spikes on 2016-08-12, 2016-08-16 and 2016-08-17 are a return of T136831, I'm reopening that one. The spike on 2016-08-17 also has some other weird stuff that I haven't investigated yet.

(I did some tweaking of the spreadsheets for the chart and the SQL query, to make them more correct, more automatic and prettier, so they differ a bit from previous ones.)

Status update:


Spike on 2016-08-16 is T136831, as I said before. Spike on 2016-08-17 is actually three small spikes, T136831 and two instances of T145341, with a couple different errors seen by two users (in Spanish and German). Spike on 2016-08-23 is T136831 again and Error: Permission denied to access property "toString", which I think is probably coming from a browser extension. Spikes on 2016-08-30 and 2016-09-04 are T145341. Huge spikes on 2016-09-05–2016-09-11 are all T145341 too.

So… it's basically all T136831 and T145341, both of which are harmless but noisy. I think we're nearly done here, let's just eliminate them.

matmarex closed this task as Resolved.Sep 30 2016, 9:20 PM

Status update:


During the last two weeks, every day has been below the stated goal of 50 uncaught exceptions per day, so I'm going to close this, yay!

There is still a slow trickle of exceptions that might be legitimately caused by bugs in UploadWizard (as opposed to browser extensions and user scripts and non-UploadWizard deployed code and user's browsers running out of memory), and I'll keep running this query every week or so and trying to sift through the noise. But we're in a pretty good state right now.