Page MenuHomePhabricator

Newcomer tasks: log error details
Closed, ResolvedPublic

Description

We log an error event when loading a task card in the Newcomer Tasks / Suggested Edits module fails, but not the error itself; we should do that.
As a first approximation just truncate the error message to some reasonable length and put it into the data. Later we might want to consider a separate schema with relevant details like language or user agent.
(We should probably check first where the effort to have generic JS error logging is at, maybe a one-off solution is not worth the effort today.)

Also make sure that errors are logged as such; it seems they sometimes get logged as an empty event.

Event Timeline

MMiller_WMF renamed this task from Log Newcomer Tasks error details to Newcomer Tasks: log error details.Dec 10 2019, 1:28 AM
MMiller_WMF renamed this task from Newcomer Tasks: log error details to Newcomer tasks: log error details.
MMiller_WMF added subscribers: Catrope, kostajh, marcella and 2 others.

As of 2019-12-09, there have been 45 events with action = "se-task-pseudo-impression". Of those, here is the breakdown:

  • type=empty: 2
  • type=end: 16
  • type=error: 27

The errors have happened to 21 distinct users (out of 752 who have activated the module, which is about 3%). I think that's enough errors that we should cut down on them. Therefore, I'll look into prioritizing this task.

Due to T244928, there were quite few error details recorded in betalabs eventlogging in the following format (all errors were found in deployment-eventlog05:/srv/log/eventlogging) :

"dt": "2020-02-11T21:14:23Z", "event": {"action": "se-task-pseudo-impression", "action_data": "type=error;errorMessage=http-bad-status"

The type=empty example:

dt": "2020-02-11T21:23:23Z", "event": {"action": "se-task-pseudo-impression", "action_data": "taskTypes=links;taskCount=24;topics=arts,sports,literature,religion,popular-culture,entertainment,food-drink,biography,military,technology,film,philosophy,crafts-hobbies,games,health;type=empty"

Once there was an incomplete errorMessage (I added the full record in case it may provide some clues):

{"dt": "2020-02-11T21:23:23Z", "event": {"action": "se-task-pseudo-impression", "action_data": "taskTypes=links;taskCount=24;topics=arts,sports,literature,religion,popular-culture,entertainment,food-drink,biography,military,technology,film,philosophy,crafts-hobbies,games,health;type=error;errorMessage=", "homepage_pageview_token": "gm7vbsopmktppkno7am5f3r7d95091ca", "is_mobile": false, "mode": "desktop", "module": "suggested-edits", "state": "activated", "user_editcount": 1, "user_id": 63}, "ip": "198.73.209.241", "recvFrom": "deployment-cache-text05.deployment-prep.eqiad.wmflabs", "revision": 19713759, "schema": "HomepageModule", "seqId": 7866871, "userAgent": {"browser_family": "Firefox", "browser_major": "72", "browser_minor": "0", "device_family": "Other", "is_bot": false, "is_mediawiki": false, "os_family": "Mac OS X", "os_major": "10", "os_minor": "10", "wmf_app_version": "-"}, "uuid": "4d2a3c1e0ad95c0785ced257da56a639", "webHost": "cs.wikipedia.beta.wmflabs.org", "wiki": "cswiki"}

@Etonkovidova -- should this still be in your column? Or are there next steps that someone should take?

This year we had 109 type=error events with no errorMessage at all, 41 with an empty errorMessage, 19 with ok-but-empty, 10 with timeout and 1 with Gateway Timeout.

@Etonkovidova -- should this still be in your column? Or are there next steps that someone should take?

There are two questions to answer before marking this task either as Resolved or Need more work

  1. If the error logging is ok for type=error? Especially since there are some empty records.
  2. And if the #1 is not ok, will (or could)l we do something about it?

@Tgr - what do you think?
Also ping to @nettrom_WMF.

@Tgr @Etonkovidova -- I'm not really sure where we are here. So we are currently logging errors, but most don't have error messages? Why do they not have error messages?

@Tgr @Etonkovidova -- I'm not really sure where we are here. So we are currently logging errors, but most don't have error messages? Why do they not have error messages?

The last update from @Tgr was that the issue needs to be prioritized.

@Tgr -- is going to write some explanation on here.

Moving back to Ready for Development for @Tgr to look more into how we can get useful errors here.

Error logging happens in a single place. The promise rejection either happens in mediawiki.api or jQuery.

There are only two branches there that can result in an empty/null/undefined error message, the API error code one (API error codes are of course never empty, but maybe the logic assuming that branch is always an API error code is wrong) and the http one. The latter comes from jQuery, and textStatus/statusText is set to error when it's empty so that can't be it. The other would imply a bug in mediawiki.api as far as I can see.

Log events with no errorMessage at all stopped happening on Feb 3. So that's normal, rEGREdd1cd2050565: Improve suggested edits error logging got deployed around that time, we didn't log messages before that. Empty strings are not related to any specific code change and the majority (75% of events since Feb 3); not sure what's up with that. I added some extra logging in case that's somehow related to mediawiki.api not working correctly, although that seems unlikely.

Change 584601 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Add API debug data when we'd log an empty error message.

https://gerrit.wikimedia.org/r/584601

Change 584601 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Add API debug data when we'd log an empty error message.

https://gerrit.wikimedia.org/r/584601

This is on back-burner for a few weeks while we are collecting logs.

As of 2019-12-09, there have been 45 events with action = "se-task-pseudo-impression". Of those, here is the breakdown:

  • type=empty: 2
  • type=end: 16
  • type=error: 27

The errors have happened to 21 distinct users (out of 752 who have activated the module, which is about 3%). I think that's enough errors that we should cut down on them. Therefore, I'll look into prioritizing this task.

@MMiller_WMF do you mind re-running this analysis at some point, so we could reassess if we want to put more effort into tracking this down?

Prioritization aside, there is a proper JS error logging infrastructure now, so eventually errors should probably be logged there and not into our analytics schema.