Page MenuHomePhabricator

Investigate: Log and store errors
Closed, ResolvedPublic13 Estimated Story Points

Description

In general, we should be able to 1. log errors as they happen and 2. possibly be paged by emergent critical errors. Please investigate:

  • If we get any of this logging out of the box with service-runner (the repo security-api is forked from)
  • If not, how we should implement this logging

This may deserve its own ticket, but while importing data I noticed a few lines failing without any additional details. Presumably, the data was in a format that the import script didn't expect but without being able to log the failing data, there was no way to determine next steps to fixing it. If as part of this investigate there's an easy way to implement this logging, please either do so or file an additional ticket if it's more complicated than expected.

Event Timeline

Niharika set the point value for this task to 13.Dec 20 2022, 5:28 PM
Tchanders subscribed.

Do we get error logging out of the box?

Yes.

Example:

  • I forced an error on this line by returning undefined from feedUtil.createConnection
  • I then checked out http://localhost:6927/feed/v1/ip/185.186.83.70
    • An error was displayed: {"status":500,"type":"internal_error","title":"TypeError","detail":"Cannot read property 'query' of undefined","method":"GET","uri":"/feed/v1/ip/185.186.83.70"}
    • This was logged (viewable via docker log <container ID>), including the stack trace: "stack":"TypeError: Cannot read properties of undefined (reading 'query')\n at /srv/service/routes/feed/v1.js:32:29\n at processTicksAndRejections (node:internal/process/task_queues:96:5)"

How it's working:

  • Error handling is implemented in lib/util.js
  • app.js calls setErrorHandler from lib/util.js, which logs the error and formats the response body (I tinkered with the error/response formatting to prove this)

What about outside the web app?

  • None of the above applies to the data import pipeline, or web app setup
  • I tried forcing the same error as above from the data import pipeline, and the error was logged with a helpful stack trace: "TypeError: Cannot read properties of undefined (reading 'query') at addTables (/srv/service/init-db.js:22:22) [...]"
  • I tried forcing an error during web startup by introducing a typo, and again the error was logged with a helpful stack trace: "stack":"ReferenceError: expressssssss is not defined\n at initApp (/srv/service/app.js:24:17)\n [...]"

@STran Was this what you wanted to know?

This may deserve its own ticket, but while importing data I noticed a few lines failing without any additional details. Presumably, the data was in a format that the import script didn't expect but without being able to log the failing data, there was no way to determine next steps to fixing it.

Could you provide any more details about these, or are they addressed by the errors you were fixing recently?

@STran Was this what you wanted to know?

@Tchanders When I wrote this ticket, I thought it would be possibly be good to allow errors to fail and be logged but not halt the entire import since it's 20+M lines and we'd have to redo it from the beginning since there isn't a concept of resuming built into the app. I thought those errors could be logged for a human to check out (eg. the data coming in is bad and we should ignore it or there was an error in import and we should import it, manually or otherwise).

Could you provide any more details about these, or are they addressed by the errors you were fixing recently?

I've fixed the ones I found but that doesn't guarantee more won't show up. Some of the possible errors are described in and will be fixed in https://gerrit.wikimedia.org/r/c/wikimedia/security/security-api/+/921036 (the ipoid version of this). As mentioned, if more show up for whatever reason, it would be good if a human could double check it.

For instance, that patch increases the org column to 1280 characters not because I thought any valid org would have a name that long, but it covers this entry (which, upon any sort of human inspection, is invalid):

The activity you have detected originates from a dynamic hosting environment. For fastest response, please submit abuse reports at http://aws-portal.amazon.com/gp/aws/html-forms-controller/contactus/AWSAbuse For more information regarding EC2 see: http://ec2.amazonaws.com/ All reports MUST include: * src IP * dest IP (your IP) * dest port * Accurate date/timestamp and timezone of activity * Intensity/frequency (short log extracts) * Your contact details (phone and email) Without these we will be unable to identify the correct owner of the IP address at that point in time.

I would prefer not to increase the org character limit to that size and let this pass through but did it as part of my fixes so that I could parse the rest of the dataset, since any error would kill the process.

Change 924552 had a related patch set uploaded (by Tchanders; author: Tchanders):

[mediawiki/services/ipoid@master] WIP Log errors with database writes from init-db.js

https://gerrit.wikimedia.org/r/924552

@STran Was this what you wanted to know?

@Tchanders When I wrote this ticket, I thought it would be possibly be good to allow errors to fail and be logged but not halt the entire import since it's 20+M lines and we'd have to redo it from the beginning since there isn't a concept of resuming built into the app. I thought those errors could be logged for a human to check out (eg. the data coming in is bad and we should ignore it or there was an error in import and we should import it, manually or otherwise).

Thanks @STran. This should be captured by T338016, and the patch above (or its equivalent in Gitlab) is tagged against that task instead.

Change 924552 abandoned by Tchanders:

[mediawiki/services/ipoid@master] WIP Log errors with database writes from init-db.js

Reason:

Moved to GitLab: https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/3

https://gerrit.wikimedia.org/r/924552