Page MenuHomePhabricator

Fix empty error logging from GrowthTasksApi
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error
normalized_message
exception.trace
Impact

We get errors logged but no details about the error message. It looks like all the errors occur on Special:Homepage; there's probably an error that occurs when fetching tasks or metadata for tasks.

Notes

handleError in GrowthTasksApi.js looks like this:

GrowthTasksApi.prototype.handleError = function ( error, details ) {
    var message,
        isRealError = true;

    if ( error === 'http' && details && details.textStatus === 'abort' ) {
        // XHR abort, not a real error
        message = null;
        isRealError = false;
    } else if ( error === 'http' ) {
        // jQuery AJAX error; textStatus is AJAX status, exception is status code text
        // from server
        message = ( typeof details.exception === 'string' ) ? details.exception : details.textStatus;
    } else if ( error === 'ok-but-empty' ) {
        // maybe a PHP fatal; not much we can do
        message = error;
    } else if ( error instanceof Error ) {
        // JS error in our own code
        message = error.name + ': ' + error.message;
    } else {
        // API error code
        message = error;
        // DEBUG T238945
        if ( !error ) {
            // log a snippet from the API response. Errors are at the front so
            // hopefully this will show what's going on.
            message = JSON.stringify( details ).substr( 0, 1000 );
        }
    }

    if ( isRealError ) {
        mw.log.error( 'Fetching task suggestions failed: ' + message, error, details );
        mw.errorLogger.logError( error instanceof Error ? error : new Error( message ), 'error.growthexperiments' );
    }

    return $.Deferred().reject( message ).promise();
}

We should either find the relevant branch statement to ensure that message has a value, or (easier) check if ( isRealError && message ) before logging.

Event Timeline

kostajh lowered the priority of this task from Unbreak Now! to Medium.Oct 5 2021, 8:29 AM

See also T238945: Newcomer tasks: log error details where we ran into the same problem.

Also T285759: SuggestedEdits: Error card shown temporarily if user clicks before fetch tasks request completes. Per the conversation in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/698805/3#message-22f31f4c33541772e253ae1609087543aab34cdd this is probably caused by Firefox aborting AJAX reqeusts in a weird way when the user is leaving the page.

Logs - it seems that actually Firefox, Safari and mobile Opera all cause this (but not Chrome, which is weird because mobile Opera is Chrome-based).

Krinkle renamed this task from Fix empty error logging to Fix empty error logging from GrowthTasksApi.Nov 4 2021, 6:41 PM
Krinkle edited projects, added GrowthExperiments; removed Wikimedia-production-error.
mewoph subscribed.

Per the documentation for mw.Api:

On other types of errors, rejects with ( 'http', details ) where details is an object with three fields: xhr (the jqXHR object), textStatus, and exception. The meaning of the last two fields is as follows:

  • When the request is aborted (the abort method of the promise is called), textStatus and exception are both set to "abort".
  • On a network timeout, textStatus and exception are both set to "timeout".
  • On a network error, textStatus is "error" and exception is the empty string.
  • When the HTTP response code is anything other than 2xx or 304 (the API does not use such response codes but some intermediate layer might), textStatus is "error" and exception is the HTTP status text (the text following the status code in the first line of the server response). For HTTP/2, exception is always an empty string.
  • When the response is not valid JSON but the previous error conditions aren't met, textStatus is "parsererror" and exception is the exception object thrown by JSON.parse.

We're using exception as the message in the case of HTTP errors, which can be an empty string. I was able to trigger an empty client error by briefly interrupting the network while fetching tasks https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-2022.01.18?id=hf-3bn4BMAtV7EgGfwPz

Change 755017 had a related patch set uploaded (by MewOphaswongse; author: MewOphaswongse):

[mediawiki/extensions/GrowthExperiments@master] GrowthTasksApi: Set error message for HTTP errors

https://gerrit.wikimedia.org/r/755017

There was some related discussion in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/698805/3#message-22f31f4c33541772e253ae1609087543aab34cdd FWIW - tl;dr Firefox seems to generate such errors when the user navigates away while the AJAX request is still pending.

Change 755017 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] GrowthTasksApi: Set error message for HTTP errors

https://gerrit.wikimedia.org/r/755017

Etonkovidova subscribed.

Checked wmf.19 and group0 and group1 on wmf.20 - empty errors are not present.

As expected we're getting a high number of HTTP errors now in the logstash client dashboard, like this one.

I can't find very useful information in the data tracked a part from the stack trace and the URL where this happened. Would it be useful to also print the XHR request config here or we are sure all these are "fair requests aborts"? Maybe the stack trace is enough to debug and differentiate these errors, I might need some help with it. My concern is I can't easily triage if these are FF aborts or some other relevant error. What do you think? @mewoph @Tgr

Here's the details object when there is an HTTP error (reproduced via network throttle)

Screen Shot 2022-02-03 at 9.38.50 AM.png (814×776 px, 142 KB)

Since there's not much useful diagnostic information, I wonder if we should be logging these at all and if we should treat it like aborts (where the error isn't logged).

Tgr added a subscriber: Jdlrobson.

Aborts are a noop; on HTTP errors we want to show the error card. Not logging would make sense but probably just filtering it in Logstash is easier.
(@Jdlrobson any thoughts? If we filter these from the mw-client-growth-team-errors dashboard, will it cause confusion on other dashboards?)

I can't find very useful information in the data tracked a part from the stack trace and the URL where this happened. Would it be useful to also print the XHR request config here or we are sure all these are "fair requests aborts"?

In theory, not necessarily aborts, could be real HTTP errors (such as a connection timeout). But most of the errors seem to be from Firefox + Safari, despite those being less popular browsers, so probably they are actually aborts. (33 out of 543 in the last 7 days have a Chrome user agent, despite Chrome's market share being ~60%, which would suggest 90% of the errors are aborts and 10% connection errors.)

Ideally we'd differentiate between the two; abort means the user is shown an error card for a split second before navigating away, which is somewhat jarring. We could:

  1. resurrect c698805 and ignore the 10%;
  2. or reproduce the abort and the connection error in those two browsers, compare the XHR object's properties and hope there is something to differentiate by;
  3. or set a flag suppressing this error when the user starts navigating away (maybe an onbeforeunload handler, or a click handler on the card).

The second option is the nicest in that we could fix the issue in core so other teams don't have to deal with this problem. It's also the least likely to succeed. The first is the simplest and dirtiest.

(@Jdlrobson any thoughts? If we filter these from the mw-client-growth-team-errors dashboard,

Ideally we wouldn't log things unless we plan to do something about that, and when we do log it would be best if the error messages were meaningful. e.g. instead of empty string 'Generic growth team error". We actively monitor the number of errors and use it to block train roll outs, so the more reliable the data is describing errors in our system, the better. As long as the volume isn't too high, I don't think it will be a problem.