Page MenuHomePhabricator

Client side error logging production launch
Closed, ResolvedPublic

Description

This umbrella task will be used to track production launch of "client side error logging" project.

Paraphrasing from https://etherpad.wikimedia.org/p/clients-error-logging, we'll be lining up as many ducks as we can in Q1, with launch in Q2 to many/most wikis. End of Q2 is naturally risking for high traffic wikis because of donation campaigns, we'll have to weight benefit/risks of launching to big wikis too.

Q2 FY2019/2020

  • Schema validation for eventgate events
  • kubernetes setup for eventgate deployment (i.e. backend component) initially set to receive errors from low traffic wikis - @Ottomata
  • Write minimal client to send errors without attempting normalization for MVP
  • Security review of the js client we'll be using, preliminary task at T232820
  • Performance review of the js client we'll be using (Timo reviewed code)
  • Verify we have enough ingestion capacity on the Logstash side @fgiunchedi
  • Deploy and enable on one low traffic wiki - T246030
  • Related to the above, make sure deduplication/rate limiting in depth (i.e. both on the client side, and on the backend side) is in place before high traffic wikis launch.
  • Verify events show up in logstash and we have a Kibana dashboard available @fgiunchedi

@dr0ptp4kt @phuedx @Ottomata @Milimetric @CDanis @colewhite @Tgr @Krinkle please check and adjust the above plan as needed! What do you think?

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Breakdown by wiki domain (derived from webpage url?). If you do this, I'd recommend using server

If you mean domain, we have meta.domain which can and should be set. This could be en.wikipedia.org or en.m.wikipedia.org.

Yep, I meant domain name – And, no, it is not currently set in the data I see in Logstash.

The url key is currently set to the web page url,

@Krinkle, just to clarify, you think url should be location.href and a new error_url should be the Error object's url?

Yes.

The meta.request_id key does not appear to hold the request ID of the page view.

Interesting. EventGate should set this to the value of the X-Request-ID header if it is not set in the event data. I will investigate.

We could set it client side, but we chose to make EventGate fill in some defaults to avoid some duplicate transmitted bytes (since this and other info is already in the header). […] I wouldn't mind calling it reqid, but then we'd have to revisit all of the other event schemas too.

I'm not sure where that comes form then. Afaik internal edge traffic via Nginx/ATS/Varnish doesn't (yet) add a request ID header. The only server I'm aware of doing that are the Node.js services, and MW Apaches.

Maybe a proxy between ATS and EventGate is creating its own request ID. That would mean it describes the /beacon request itself, not the page view it came from. This wouldn't have diagnostic value in relation to the error that is being logged.

Setting it from the client would make sense indeed. Although we need to be careful about propagating request IDs internally. I think we should (until further notice) never "whitewash" request ID values provided by external clients into our own infrastructure. Besides, it might also be useful for you to be able to trace beacon requests as their own thing throughout our infrastructure. After alll, the beacon request from the web browser is not a "dependency" of the MediaWiki backend request. They could be days apart from each other (given CDN caching) and in no way part of each other.

I'd recommend storing the page's reqId in the stream separately from meta.request-id. If you don't (yet) need such meta request-id, perhaps we can also omit that from the Logstash stream to avoid confusion? As for the naming, naming it reqid would not help so migiht as well make it more obviously differnet from reqId. Don't worry! I'm actually slightly against reqid as that could be unproductive in terms of debugging confusion. It is too easily misread as the thing we see elsewhere, which would then not match in any queries etc.

Putting the MW request ID into a JS error event seems not so useful anyway. It's pretty unlikely the JS error was caused by something server-side when rendering the page, and that something would be identifiable from the logs. Maybe if it could be connected to API requests on the same page as well, that could be useful, but we have no such functionality currently.

Change 578951 had a related patch set uploaded (by Jason Linehan; owner: Jason Linehan):
[mediawiki/extensions/WikimediaEvents@master] clientError: Fixes behavior around 'type' and 'url' fields.

https://gerrit.wikimedia.org/r/578951

I'm not sure where that comes form then.

Ah, service-template-node is doing it. (And more specifically this is the code that eventgate has.)

I could make eventgate not do this if that makes more sense.

@Krinkle and @Tgr, to summarize your comments it sounds like it would be best not to send mw.config.get('wgRequestId') from the client at all, right?

The type key is clashing with Logstash-wide key

Let's just change the name of this field to error_type

Maybe let's do error_name? The field we're going to be using to populate this will be Error.name in JavaScript. Or can we use plain name without collision?

Should this be added in errorLogging.js or in WME? I made a patchset for WME but it's simple to make a patch for errorLogging.js too.

Shouldn't the stack trace be structured? These look like just the native browser stack trace.

@Tgr they look as expected with the code we wrote now, which should produce a string that looks like a generic native browser stack trace. By structured do you mean JSON? I think it would be straightforward to do that, maybe if you make an example of what you'd like it to look like I can make a patch.

Submitted patchset to change URL and type behavior, will add everybody as reviewers.

Hm, what service-template-node should most properly do is what varnish frontends do: only propagate X-Request-Id if client.ip == wikimedia_trust. Getting that wikimedia_trust list into something that service-template-node can use is probably not that easy tho. Ping @Pchelolo.

Hm, what service-template-node should most properly do is what varnish frontends do: only propagate X-Request-Id if client.ip == wikimedia_trust.

Honestly... IMHO this logic doesn't belong in Varnish or Eventgate, but it's a separate issue..

Getting that wikimedia_trust list into something that service-template-node can use is probably not that easy tho.

If we could get it to eventgate puppet role, we could add it to deployment-vars and carry it into the config. From there it should be relatively easy to get the same logic into template-node. The puppet part I have no idea how to do, but the rest seems quite doable?

If we could get it to eventgate puppet role

Kubernetes :/

Putting the MW request ID into a JS error event seems not so useful anyway. It's pretty unlikely the JS error was caused by something server-side [..]

Just one example from recent memory, T228746. A server-side problem causes mw.config to be missing all its page-specific entries. This easily could, and probably already does, cause uncaught errors in JavaScript.

The SRE Observability folks are about to propose standardized logging schema conventions based on Elastic Common Schema. Since Event Platform has its own schema conventions and requirements, we can't totally conform to theirs, but we should try to do our best.

The mediawiki/client/error schema is live, but it isn't widely used yet so we can still tolerate some schema changes at this time. We should review the ECS with @colewhite and modify our schema to conform where we can. E.g. we probably should use this error field, or at least name our fields accordingly. error_stack, or error.stack, etc.

We should review the ECS with @colewhite and modify our schema to conform where we can

+1

Maybe let's do error_name? The field we're going to be using to populate this will be Error.name in JavaScript. Or can we use plain name without collision?

Not sure that Error.name would include any info for generic (uncaught) javascript errors as it is filled in with the Error.prototype.string so it would be always "Error" (unless we have a class that extends Error, the exceptions coming from those classes are probably caught already)

Change 580980 had a related patch set uploaded (by Jason Linehan; owner: Jason Linehan):
[schemas/event/primary@master] Updates mediawiki/client/error schema (without versioning).

https://gerrit.wikimedia.org/r/580980

Ok, https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/580980 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/578951 are ready to be merged. I'd rather deploy and SWAT them out together, so I will merge and deploy changes needed on Monday. Here's the plan.

I'm going to disable client side error logging now so that hopefully by Monday all (or most) clients won't be sending any incompatible events.

On Monday:

  1. Merge changes.
  2. Rebuild eventgate-wikimedia image with latest schema repo and deploy to eventgate-logging-external
  3. Backport WikimediaEvents change
  4. Deploy modules/ext.wikimediaEvents/clientError.js
  5. Enable client side error logging

Change 582804 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Temporarily disable client side error logging for a deploy

https://gerrit.wikimedia.org/r/582804

Change 580980 merged by Ottomata:
[schemas/event/primary@master] Updates mediawiki/client/error schema (without versioning).

https://gerrit.wikimedia.org/r/580980

Change 582806 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventgate-wikimedia@master] Bump schema repo versions to get mediawiki/client/error in primary

https://gerrit.wikimedia.org/r/582806

Change 582804 merged by Ottomata:
[operations/mediawiki-config@master] Temporarily disable client side error logging for a deploy

https://gerrit.wikimedia.org/r/582804

Change 582806 merged by Ottomata:
[eventgate-wikimedia@master] Bump schema repo versions to get mediawiki/client/error in primary

https://gerrit.wikimedia.org/r/582806

Mentioned in SAL (#wikimedia-operations) [2020-03-23T13:40:22Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Temporarily disable client side error logging for a deploy - T226986 (duration: 01m 01s)

Change 582808 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-logging-external - bump image version to 2020-03-23-133653-production

https://gerrit.wikimedia.org/r/582808

Change 582808 merged by Ottomata:
[operations/deployment-charts@master] eventgate-logging-external - bump image version to 2020-03-23-133653-production

https://gerrit.wikimedia.org/r/582808

Change 578951 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] clientError: Changes event fields.

https://gerrit.wikimedia.org/r/578951

Change 582819 had a related patch set uploaded (by Ottomata; owner: Jason Linehan):
[mediawiki/extensions/WikimediaEvents@wmf/1.35.0-wmf.24] clientError: Changes event fields.

https://gerrit.wikimedia.org/r/582819

Change 582819 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@wmf/1.35.0-wmf.24] clientError: Changes event fields.

https://gerrit.wikimedia.org/r/582819

Mentioned in SAL (#wikimedia-operations) [2020-03-23T15:00:06Z] <otto@deploy1001> Synchronized php-1.35.0-wmf.24/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: [[gerrit:578951|clientError: Changes event fields (T226986)]] (duration: 01m 01s)

Mentioned in SAL (#wikimedia-operations) [2020-03-23T15:04:08Z] <otto@deploy1001> Synchronized php-1.35.0-wmf.24/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: [[gerrit:578951|clientError: Changes event fields (T226986)]] (take 2) (duration: 00m 59s)

Change 582822 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Re-enable client side error logging on group0 and hawwiki

https://gerrit.wikimedia.org/r/582822

Change 582822 merged by Ottomata:
[operations/mediawiki-config@master] Re-enable client side error logging on group0 and hawwiki

https://gerrit.wikimedia.org/r/582822

Mentioned in SAL (#wikimedia-operations) [2020-03-23T15:56:56Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Re-enablle client side error logging for group0 and hawwike - T226986 (duration: 01m 00s)

Mentioned in SAL (#wikimedia-operations) [2020-03-23T15:59:02Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Re-enablle client side error logging for group0 and hawwike - T226986 (take 2) (duration: 00m 59s)

Ok! changes deployed. I verified that mw.track('global.error', ...} will produce the expected error event with the new schema:

var obj = {errorObject: new Error("WOW"), errorMessage: "OTTO WOW", stackTrace: "WOW WOW"}
mw.track('global.error', obj);

Results in the following event

{
  "http": {
    "request_headers": {
      "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36"
    },
    "client_ip": "69.113.153.28"
  },
  "stack_trace": "WOW WOW",
  "url": "https://haw.wikipedia.org/wiki/Ka_papa_kinohi",
  "message": "OTTO WOW",
  "error_class": "Error",
  "$schema": "/mediawiki/client/error/1.0.0",
  "meta": {
    "request_id": "b0d2659e-1685-4b08-b387-3350481bd635",
    "dt": "2020-03-23T16:02:27.017Z",
    "id": "bf094f29-bace-4447-87f6-afa53574f2c7",
    "domain": "haw.wikipedia.org",
    "stream": "mediawiki.client.error"
  }
}

Ah we haven't yet resolved the x-request-id stuff.

In T226986#5959396, @Tgr wrote:
Putting the MW request ID into a JS error event seems not so useful anyway. It's pretty unlikely the JS error was caused by something server-side [..]

Just one example from recent memory, T228746. A server-side problem causes mw.config to be missing all its page-specific entries. This easily could, and probably already does, cause uncaught errors in JavaScript.

@Krinkle and @Tgr, what should we do?

  • Should EventGate stop generating an X-Request-ID header if one is not set? It sounds like yes.
  • Should WikimediaEvents set meta.request_id to mw.config.get('wgRequestId'), or not?

It seems like it might be nice to have request_id set to the original X-Request-ID value that the server had when it generated the JS, but perhaps not? Would it somehow enable us to track client errors from the same 100% cached page in varnish (is that a thing?)

We'll do whatever you two agree is best.

Change 582804 merged by Ottomata:
[operations/mediawiki-config@master] Temporarily disable client side error logging for a deploy

https://gerrit.wikimedia.org/r/582804

Just checking in regarding the convo at schemas/event/primary/+/580980, I thought we didn't want to disable the client code during the deploy? Did something happen to change that, or is this something you'd still be okay with trying for a next time?

Oh, I understood that we did want to just for good measure, but that it didn't really matter at this point if we did or didn't.

I think either way it would have been hard to evaluate the 'long tail'. The volume of error events is currently very low. I did see a few invalid events during the migration time, even though client code was disabled during that time, so there were indeed some cached clients still sending invalid events.

Ah we haven't yet resolved the x-request-id stuff.

Putting the MW request ID into a JS error event seems not so useful anyway. [..]

Just one example from recent memory, T228746. A server-side problem causes mw.config to be missing all its page-specific entries. This easily could, and probably already does, cause uncaught errors in JavaScript.

@Krinkle and @Tgr, what should we do?

  • Should EventGate stop generating an X-Request-ID header if one is not set? It sounds like yes.
  • Should WikimediaEvents set meta.request_id to mw.config.get('wgRequestId'), or not?

For the purposes monitoring and investigating client-side errors, I do not see any use in generating request IDs within the EventGate pipeline, as they do not correlate with anything in neither the MW backend nor the client-side code that caused or emitted the error. But, it seems totally reasonable for this field to exist in its current form. I can see how it might be useful for developers of EventGate and/or the client-side error pipeline. I can't speak to that.

Per my comment, yes, I think it is important that we log the wgRequestId from the MW request that generated the page that produced the error. But, even without a request ID for EventGate, you might to use a field other than meta.request_id to avoid confusion and to leave the option open in the future to use it for that purpose. Given that this request ID would not be of the beacon request. And also because it would not correlate with page views or even user sessions. A cached page will be viewed by many different users, from different countries, with a different device/browser, and interacting with different features on that page. Some other field might be clearer there, like page_request_id or some such.

Can we close this task and call the production launch done?

Closing cc-ing @dcipoletti so he knows that client side error logging is available (which is relevant for web teams, specially in the light of the desktop refresh) and that documentation (to be provided by product infra team) is still pending. Please see: T248884: Documentation of client side error logging capabilities on mediawiki

hey @jlinehan -- do you have any concerns about me re-routing the DNS of intake-logging.wikimedia.org to resolve not to the nearest edge datacenter to the user, but to the second-nearest edge datacenter? This would help us get Network Error Logging reports in realtime, while also not really negatively impacting the client JS error use case AFAICT. There's some more details about this specific change in T261340, and the larger context in T257527 (to which I think you're subscribed).

There are ways to implement this for NEL without impacting client JS error logging, but they involve nontrivial additional work, so if above sounds good to you, it would save me some time. Please let me know, and thanks!

For EventLogging, we specifically moved away from separate domains to using /beacon so that the majority of non-deferred events that are sent off during user interactions don't require separate connections to be established etc. It has been a while since we last quantified the benefits of this choice, so it's certainly worth revisiting.

I see that EventGate, which is not used yet for most events, uses a separate domain again at https://intake-logging.wikimedia.org. That'll involve a DNS query, but given it points to dyna.wikimedia.org same as text-lb, I'm assuming this means it is handled through the same connection and traffic layer as other requests.

This might not be a deal breaker per the above, but it would require additional research and evaluation, and certainly makes things less flexible in the future so from my perspective it'd be a safe choice to expose this alternate routing separately and used only for those events. The EventGate client is set up to take full URLs as its way of configuring so that would not require any changes or additional flexibility there.

@CDanis this should be fine, the errors are fire-and-forget so this shouldn't cause any slowdown e.g. increased RTT waiting for a response, and there is no timing or anything that depends on this. I can't see any issue that would result, the DNS routing should be totally opaque.

Great, thank you! That was my thinking as well, but I wanted to confirm.

For EventLogging, we specifically moved away from separate domains to using /beacon so that the majority of non-deferred events that are sent off during user interactions do require separate connections to be established etc. It has been a while since we last quantified the benefits of this choice, so it's certainly worth revisiting.

If we're trying to reduce DNS queries, yeah, we are already making a separate one for the logging endpoint. Consolidating those is certainly worth talking through, but Is that related to @CDanis's proposal?

DNS is relatively quick and well-cached on-device, local network, and in middleware networks. Starting to establish more than one primary connection on a majority of page views is something we phased out 5+ years ago and would be great not to bring back without further research and consideration first.

The latency of events is indeed not critical, but there is a performance cost. It's all happening on the same device and network, consuming limited battery and network resources.

DNS is relatively quick and well-cached on-device, local network, and in middleware networks. Starting to establish more than one primary connection on a majority of page views is something we phased out 5+ years ago and would be great not to bring back without further research and consideration first.

The latency of events is indeed not critical, but there is a performance cost. It's all happening on the same device and network, consuming limited battery and network resources.

@Krinkle got it. I'll create a task for this then. Is this something that should block @CDanis's request or something that we should use the client error and network error traffic to study?

Huh, I'm pretty sure I discussed this with bblack and/or traffic team when we were first setting up eventgate-analtyics-external, and they preferred that the intake service got its own unique URL, rather than serving it in the wiki domains.

I can't find a public discussion of this in Phab, just https://phabricator.wikimedia.org/T233629#557637, so we must have discussed it in IRC or elsewhere.

Huh, I'm pretty sure I discussed this with bblack and/or traffic team when we were first setting up eventgate-analtyics-external, and they preferred that the intake service got its own unique URL, rather than serving it in the wiki domains.

I think that the issue isn't so much the URL being distinct as the fact that https://intake-logging.wikimedia.org would resolve differently than /beacon or the other endpoint for events and hence require another connection.

Right, but I asked originally which we should do, host at same wiki domain at /beacon, or use a separate domain, and I was told to use a separate domain.

Change 628935 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/dns@master] point intake-logging.wikimedia.org to text-next (second-best DC)

https://gerrit.wikimedia.org/r/628935

Change 628935 merged by CDanis:
[operations/dns@master] point intake-logging.wikimedia.org to text-next (second-best DC)

https://gerrit.wikimedia.org/r/628935