Client side error logging production launch
Closed, ResolvedPublic
Actions

Assigned To

None

Authored By

	fgiunchedi
	Jul 1 2019, 12:26 PM

Description

This umbrella task will be used to track production launch of "client side error logging" project.

Paraphrasing from https://etherpad.wikimedia.org/p/clients-error-logging, we'll be lining up as many ducks as we can in Q1, with launch in Q2 to many/most wikis. End of Q2 is naturally risking for high traffic wikis because of donation campaigns, we'll have to weight benefit/risks of launching to big wikis too.

Q2 FY2019/2020

Schema validation for eventgate events
kubernetes setup for eventgate deployment (i.e. backend component) initially set to receive errors from low traffic wikis - @Ottomata
Write minimal client to send errors without attempting normalization for MVP
Security review of the js client we'll be using, preliminary task at T232820
Performance review of the js client we'll be using (Timo reviewed code)
Verify we have enough ingestion capacity on the Logstash side @fgiunchedi
Deploy and enable on one low traffic wiki - T246030
Related to the above, make sure deduplication/rate limiting in depth (i.e. both on the client side, and on the backend side) is in place before high traffic wikis launch.
Verify events show up in logstash and we have a Kibana dashboard available @fgiunchedi

@dr0ptp4kt @phuedx @Ottomata @Milimetric @CDanis @colewhite @Tgr @Krinkle please check and adjust the above plan as needed! What do you think?

Details

Subject	Repo	Branch	Lines +/-
point intake-logging.wikimedia.org to text-next (second-best DC)	operations/dns	master	+2 -2
Set wgWMEClientErrorIntakeURL in onResourceLoaderGetConfigVars	mediawiki/extensions/WikimediaEvents	master	+2 -1
Re-enable client side error logging on group0 and hawwiki	operations/mediawiki-config	master	+2 -4
clientError: Changes event fields.	mediawiki/extensions/WikimediaEvents	wmf/1.35.0-wmf.24	+17 -25
clientError: Changes event fields.	mediawiki/extensions/WikimediaEvents	master	+17 -25
eventgate-logging-external - bump image version to 2020-03-23-133653-production	operations/deployment-charts	master	+3 -3
Bump schema repo versions to get mediawiki/client/error in primary	eventgate-wikimedia	master	+2 -2
Temporarily disable client side error logging for a deploy	operations/mediawiki-config	master	+4 -2
Updates mediawiki/client/error schema (without versioning).	schemas/event/primary	master	+61 -22
logstash: adjust client error topics for eqiad/codfw	operations/puppet	production	+29 -2
Set wgWMEClientErrorIntakeURL in onResourceLoaderGetConfigVars	mediawiki/extensions/WikimediaEvents	wmf/1.35.0-wmf.21	+2 -1
eventgate-logging-external - fix mediawiki/client/error schema title	operations/deployment-charts	master	+2 -2
eventgate-logging-external - bump image version to 2020-02-25-183224-production	operations/deployment-charts	master	+3 -3
prometheus: add alerts for eventgate-logging-external latency/errors	operations/puppet	production	+22 -0

Related Objects
Search...

Status	Assigned	Task
Open	None	T49145 Formally deprecate jQuery UI after we've stopped using jQuery UI in extensions and core
Open	None	T100270 Replace use of jQuery UI and MW UI with OOUI across all Wikimedia-deployed extensions and core
Open	None	T85394 Use OOUI suggestions/autocompletion components only (instead of jquery.suggestions, jquery.ui.autocomplete)
Open	None	T125725 [epic] Update autocomplete search box with metadata and remove and delete the old searchSuggest system
Open	None	T177251 Dead keys prevent autocomplete in search box
Resolved	ovasileva	T244392 [GOAL] Deploy the new Vue.js search experience
Resolved	ovasileva	T275200 Analyze results of A/B test for new search widget
Resolved	ovasileva	T249297 Deploy the new Vue.js search experience
Resolved	• jlinehan	T255585 [EPIC] Extend client-side error logging coverage to include English Wikipedia
Resolved	Ottomata	T185233 Modern Event Platform
Resolved	Ottomata	T201068 Modern Event Platform: Stream Intake Service
Resolved	phuedx	T217142 [Proposal] Use the Kafka-Logstash logging infrastructure to log client-side errors
Resolved	None	T226986 Client side error logging production launch
Resolved	• jlinehan	T229442 Create client side error schema
Resolved	• jlinehan	T235189 Develop small client-side error logger for MediaWiki
Resolved	Ottomata	T236386 Set up eventgate-logging-external in production
Resolved	Ottomata	T246030 Enable client side error logging in prod for small wiki

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

In T226986#5956043, @Ottomata wrote:

Breakdown by wiki domain (derived from webpage url?). If you do this, I'd recommend using server

If you mean domain, we have meta.domain which can and should be set. This could be en.wikipedia.org or en.m.wikipedia.org.

Yep, I meant domain name – And, no, it is not currently set in the data I see in Logstash.

In T226986#5956043, @Ottomata wrote:

The url key is currently set to the web page url,

@Krinkle, just to clarify, you think url should be location.href and a new error_url should be the Error object's url?

Yes.

In T226986#5956043, @Ottomata wrote:

The meta.request_id key does not appear to hold the request ID of the page view.

Interesting. EventGate should set this to the value of the X-Request-ID header if it is not set in the event data. I will investigate.

We could set it client side, but we chose to make EventGate fill in some defaults to avoid some duplicate transmitted bytes (since this and other info is already in the header). […] I wouldn't mind calling it reqid, but then we'd have to revisit all of the other event schemas too.

I'm not sure where that comes form then. Afaik internal edge traffic via Nginx/ATS/Varnish doesn't (yet) add a request ID header. The only server I'm aware of doing that are the Node.js services, and MW Apaches.

Maybe a proxy between ATS and EventGate is creating its own request ID. That would mean it describes the /beacon request itself, not the page view it came from. This wouldn't have diagnostic value in relation to the error that is being logged.

Setting it from the client would make sense indeed. Although we need to be careful about propagating request IDs internally. I think we should (until further notice) never "whitewash" request ID values provided by external clients into our own infrastructure. Besides, it might also be useful for you to be able to trace beacon requests as their own thing throughout our infrastructure. After alll, the beacon request from the web browser is not a "dependency" of the MediaWiki backend request. They could be days apart from each other (given CDN caching) and in no way part of each other.

I'd recommend storing the page's reqId in the stream separately from meta.request-id. If you don't (yet) need such meta request-id, perhaps we can also omit that from the Logstash stream to avoid confusion? As for the naming, naming it reqid would not help so migiht as well make it more obviously differnet from reqId. Don't worry! I'm actually slightly against reqid as that could be unproductive in terms of debugging confusion. It is too easily misread as the thing we see elsewhere, which would then not match in any queries etc.

Putting the MW request ID into a JS error event seems not so useful anyway. It's pretty unlikely the JS error was caused by something server-side when rendering the page, and that something would be identifiable from the logs. Maybe if it could be connected to API requests on the same page as well, that could be useful, but we have no such functionality currently.

Change 578951 had a related patch set uploaded (by Jason Linehan; owner: Jason Linehan):
[mediawiki/extensions/WikimediaEvents@master] clientError: Fixes behavior around 'type' and 'url' fields.

https://gerrit.wikimedia.org/r/578951

I'm not sure where that comes form then.

Ah, service-template-node is doing it. (And more specifically this is the code that eventgate has.)

I could make eventgate not do this if that makes more sense.

@Krinkle and @Tgr, to summarize your comments it sounds like it would be best not to send mw.config.get('wgRequestId') from the client at all, right?

In T226986#5956043, @Ottomata wrote:

The type key is clashing with Logstash-wide key

Let's just change the name of this field to error_type

Maybe let's do error_name? The field we're going to be using to populate this will be Error.name in JavaScript. Or can we use plain name without collision?

Should this be added in errorLogging.js or in WME? I made a patchset for WME but it's simple to make a patch for errorLogging.js too.

Shouldn't the stack trace be structured? These look like just the native browser stack trace.

@Tgr they look as expected with the code we wrote now, which should produce a string that looks like a generic native browser stack trace. By structured do you mean JSON? I think it would be straightforward to do that, maybe if you make an example of what you'd like it to look like I can make a patch.

Submitted patchset to change URL and type behavior, will add everybody as reviewers.

Hm, what service-template-node should most properly do is what varnish frontends do: only propagate X-Request-Id if client.ip == wikimedia_trust. Getting that wikimedia_trust list into something that service-template-node can use is probably not that easy tho. Ping @Pchelolo.

Ottomata updated the task description. (Show Details)Mar 11 2020, 1:25 PM

In T226986#5960122, @Ottomata wrote:

Hm, what service-template-node should most properly do is what varnish frontends do: only propagate X-Request-Id if client.ip == wikimedia_trust.

Honestly... IMHO this logic doesn't belong in Varnish or Eventgate, but it's a separate issue..

Getting that wikimedia_trust list into something that service-template-node can use is probably not that easy tho.

If we could get it to eventgate puppet role, we could add it to deployment-vars and carry it into the config. From there it should be relatively easy to get the same logic into template-node. The puppet part I have no idea how to do, but the rest seems quite doable?

If we could get it to eventgate puppet role

Kubernetes :/

In T226986#5959396, @Tgr wrote:

Putting the MW request ID into a JS error event seems not so useful anyway. It's pretty unlikely the JS error was caused by something server-side [..]

Just one example from recent memory, T228746. A server-side problem causes mw.config to be missing all its page-specific entries. This easily could, and probably already does, cause uncaught errors in JavaScript.

The SRE Observability folks are about to propose standardized logging schema conventions based on Elastic Common Schema. Since Event Platform has its own schema conventions and requirements, we can't totally conform to theirs, but we should try to do our best.

The mediawiki/client/error schema is live, but it isn't widely used yet so we can still tolerate some schema changes at this time. We should review the ECS with @colewhite and modify our schema to conform where we can. E.g. we probably should use this error field, or at least name our fields accordingly. error_stack, or error.stack, etc.

We should review the ECS with @colewhite and modify our schema to conform where we can

Maybe let's do error_name? The field we're going to be using to populate this will be Error.name in JavaScript. Or can we use plain name without collision?

Not sure that Error.name would include any info for generic (uncaught) javascript errors as it is filled in with the Error.prototype.string so it would be always "Error" (unless we have a class that extends Error, the exceptions coming from those classes are probably caught already)

Change 580980 had a related patch set uploaded (by Jason Linehan; owner: Jason Linehan):
[schemas/event/primary@master] Updates mediawiki/client/error schema (without versioning).

https://gerrit.wikimedia.org/r/580980

Ok, https://gerrit.wikimedia.org/r/c/schemas/event/primary/+/580980 and https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/578951 are ready to be merged. I'd rather deploy and SWAT them out together, so I will merge and deploy changes needed on Monday. Here's the plan.

I'm going to disable client side error logging now so that hopefully by Monday all (or most) clients won't be sending any incompatible events.

On Monday:

Merge changes.
Rebuild eventgate-wikimedia image with latest schema repo and deploy to eventgate-logging-external
Backport WikimediaEvents change
Deploy modules/ext.wikimediaEvents/clientError.js
Enable client side error logging

Change 582804 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Temporarily disable client side error logging for a deploy

https://gerrit.wikimedia.org/r/582804

Change 580980 merged by Ottomata:
[schemas/event/primary@master] Updates mediawiki/client/error schema (without versioning).

https://gerrit.wikimedia.org/r/580980

Change 582806 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventgate-wikimedia@master] Bump schema repo versions to get mediawiki/client/error in primary

https://gerrit.wikimedia.org/r/582806

Change 582804 merged by Ottomata:
[operations/mediawiki-config@master] Temporarily disable client side error logging for a deploy

https://gerrit.wikimedia.org/r/582804

Change 582806 merged by Ottomata:
[eventgate-wikimedia@master] Bump schema repo versions to get mediawiki/client/error in primary

https://gerrit.wikimedia.org/r/582806

Mentioned in SAL (#wikimedia-operations) [2020-03-23T13:40:22Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Temporarily disable client side error logging for a deploy - T226986 (duration: 01m 01s)

Change 582808 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-logging-external - bump image version to 2020-03-23-133653-production

https://gerrit.wikimedia.org/r/582808

Change 582808 merged by Ottomata:
[operations/deployment-charts@master] eventgate-logging-external - bump image version to 2020-03-23-133653-production

https://gerrit.wikimedia.org/r/582808

Change 578951 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@master] clientError: Changes event fields.

https://gerrit.wikimedia.org/r/578951

Change 582819 had a related patch set uploaded (by Ottomata; owner: Jason Linehan):
[mediawiki/extensions/WikimediaEvents@wmf/1.35.0-wmf.24] clientError: Changes event fields.

https://gerrit.wikimedia.org/r/582819

Change 582819 merged by jenkins-bot:
[mediawiki/extensions/WikimediaEvents@wmf/1.35.0-wmf.24] clientError: Changes event fields.

https://gerrit.wikimedia.org/r/582819

Mentioned in SAL (#wikimedia-operations) [2020-03-23T15:00:06Z] <otto@deploy1001> Synchronized php-1.35.0-wmf.24/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: [[gerrit:578951|clientError: Changes event fields (T226986)]] (duration: 01m 01s)

ReleaseTaggerBot edited projects, added MW-1.35-notes (1.35.0-wmf.24; 2020-03-17); removed MW-1.35-notes (1.35.0-wmf.21; 2020-02-25).Mar 23 2020, 3:00 PM

Mentioned in SAL (#wikimedia-operations) [2020-03-23T15:04:08Z] <otto@deploy1001> Synchronized php-1.35.0-wmf.24/extensions/WikimediaEvents/modules/ext.wikimediaEvents/clientError.js: [[gerrit:578951|clientError: Changes event fields (T226986)]] (take 2) (duration: 00m 59s)

Change 582822 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Re-enable client side error logging on group0 and hawwiki

https://gerrit.wikimedia.org/r/582822

Change 582822 merged by Ottomata:
[operations/mediawiki-config@master] Re-enable client side error logging on group0 and hawwiki

https://gerrit.wikimedia.org/r/582822

Mentioned in SAL (#wikimedia-operations) [2020-03-23T15:56:56Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Re-enablle client side error logging for group0 and hawwike - T226986 (duration: 01m 00s)

Mentioned in SAL (#wikimedia-operations) [2020-03-23T15:59:02Z] <otto@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Re-enablle client side error logging for group0 and hawwike - T226986 (take 2) (duration: 00m 59s)

ReleaseTaggerBot edited projects, added MW-1.35-notes (1.35.0-wmf.25; 2020-03-24); removed MW-1.35-notes (1.35.0-wmf.24; 2020-03-17).Mar 23 2020, 4:00 PM

Ok! changes deployed. I verified that mw.track('global.error', ...} will produce the expected error event with the new schema:

var obj = {errorObject: new Error("WOW"), errorMessage: "OTTO WOW", stackTrace: "WOW WOW"}
mw.track('global.error', obj);

Results in the following event

{
  "http": {
    "request_headers": {
      "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36"
    },
    "client_ip": "69.113.153.28"
  },
  "stack_trace": "WOW WOW",
  "url": "https://haw.wikipedia.org/wiki/Ka_papa_kinohi",
  "message": "OTTO WOW",
  "error_class": "Error",
  "$schema": "/mediawiki/client/error/1.0.0",
  "meta": {
    "request_id": "b0d2659e-1685-4b08-b387-3350481bd635",
    "dt": "2020-03-23T16:02:27.017Z",
    "id": "bf094f29-bace-4447-87f6-afa53574f2c7",
    "domain": "haw.wikipedia.org",
    "stream": "mediawiki.client.error"
  }
}

Ah we haven't yet resolved the x-request-id stuff.

In T226986#5959396, @Tgr wrote:
Putting the MW request ID into a JS error event seems not so useful anyway. It's pretty unlikely the JS error was caused by something server-side [..]

Just one example from recent memory, T228746. A server-side problem causes mw.config to be missing all its page-specific entries. This easily could, and probably already does, cause uncaught errors in JavaScript.

@Krinkle and @Tgr, what should we do?

Should EventGate stop generating an X-Request-ID header if one is not set? It sounds like yes.
Should WikimediaEvents set meta.request_id to mw.config.get('wgRequestId'), or not?

It seems like it might be nice to have request_id set to the original X-Request-ID value that the server had when it generated the JS, but perhaps not? Would it somehow enable us to track client errors from the same 100% cached page in varnish (is that a thing?)

We'll do whatever you two agree is best.

ReleaseTaggerBot edited projects, added MW-1.35-notes (1.35.0-wmf.24; 2020-03-17); removed MW-1.35-notes (1.35.0-wmf.25; 2020-03-24).Mar 23 2020, 5:00 PM

ReleaseTaggerBot edited projects, added MW-1.35-notes (1.35.0-wmf.25; 2020-03-24); removed MW-1.35-notes (1.35.0-wmf.24; 2020-03-17).Mar 23 2020, 6:00 PM

ReleaseTaggerBot edited projects, added MW-1.35-notes (1.35.0-wmf.24; 2020-03-17); removed MW-1.35-notes (1.35.0-wmf.25; 2020-03-24).Mar 23 2020, 7:00 PM

ReleaseTaggerBot edited projects, added MW-1.35-notes (1.35.0-wmf.25; 2020-03-24); removed MW-1.35-notes (1.35.0-wmf.24; 2020-03-17).Mar 23 2020, 8:01 PM

In T226986#5991713, @gerritbot wrote:

Change 582804 merged by Ottomata:
[operations/mediawiki-config@master] Temporarily disable client side error logging for a deploy

https://gerrit.wikimedia.org/r/582804

Just checking in regarding the convo at schemas/event/primary/+/580980, I thought we didn't want to disable the client code during the deploy? Did something happen to change that, or is this something you'd still be okay with trying for a next time?

Oh, I understood that we did want to just for good measure, but that it didn't really matter at this point if we did or didn't.

I think either way it would have been hard to evaluate the 'long tail'. The volume of error events is currently very low. I did see a few invalid events during the migration time, even though client code was disabled during that time, so there were indeed some cached clients still sending invalid events.

ReleaseTaggerBot edited projects, added MW-1.35-notes (1.35.0-wmf.24; 2020-03-17); removed MW-1.35-notes (1.35.0-wmf.25; 2020-03-24).Mar 23 2020, 9:00 PM

Peter mentioned this in T240978: Performance review of Wikipedia KaiOS app.Mar 25 2020, 12:53 PM

@Krinkle can you give me some guidance about https://phabricator.wikimedia.org/T226986#5992337 ?

SBisson mentioned this in T248615: Capture and send back client-side errors.Mar 26 2020, 7:22 PM

In T226986#5992337, @Ottomata wrote:

Ah we haven't yet resolved the x-request-id stuff.

In T226986#5959396, @Tgr wrote:

Putting the MW request ID into a JS error event seems not so useful anyway. [..]

In T226986#5965925, @Krinkle wrote:

Just one example from recent memory, T228746. A server-side problem causes mw.config to be missing all its page-specific entries. This easily could, and probably already does, cause uncaught errors in JavaScript.

@Krinkle and @Tgr, what should we do?

Should EventGate stop generating an X-Request-ID header if one is not set? It sounds like yes.

Should WikimediaEvents set meta.request_id to mw.config.get('wgRequestId'), or not?

For the purposes monitoring and investigating client-side errors, I do not see any use in generating request IDs within the EventGate pipeline, as they do not correlate with anything in neither the MW backend nor the client-side code that caused or emitted the error. But, it seems totally reasonable for this field to exist in its current form. I can see how it might be useful for developers of EventGate and/or the client-side error pipeline. I can't speak to that.

Per my comment, yes, I think it is important that we log the wgRequestId from the MW request that generated the page that produced the error. But, even without a request ID for EventGate, you might to use a field other than meta.request_id to avoid confusion and to leave the option open in the future to use it for that purpose. Given that this request ID would not be of the beacon request. And also because it would not correlate with page views or even user sessions. A cached page will be viewed by many different users, from different countries, with a different device/browser, and interacting with different features on that page. Some other field might be clearer there, like page_request_id or some such.

• Nuria closed subtask T246030: Enable client side error logging in prod for small wiki as Resolved.Apr 9 2020, 3:12 PM

fgiunchedi moved this task from In progress to Externally blocked on the observability board.May 11 2020, 3:08 PM

Aklapper edited projects, added Analytics-Radar; removed Analytics.Jun 10 2020, 6:33 AM

Nirmos subscribed.Jun 16 2020, 4:33 PM

• jlinehan added a subtask: T255585: [EPIC] Extend client-side error logging coverage to include English Wikipedia.Jun 16 2020, 4:51 PM

Tgr mentioned this in T91357: Update server-side JS error-logging RfC.Jun 20 2020, 2:35 PM

CDanis mentioned this in T257527: automatically collect network error reports from users' browsers (Network Error Logging API).Jul 8 2020, 10:48 PM

Jdlrobson added a subtask: T255204: [Spike, 1hr]: Analyse Catalan Wikipedia error logs for iOS error.Jul 9 2020, 4:21 PM

• Niedzielski mentioned this in T257673: Uncaught RangeError: Maximum call stack size exceeded in MessageTable.resize().Jul 10 2020, 3:56 PM

Can we close this task and call the production launch done?

Closing cc-ing @dcipoletti so he knows that client side error logging is available (which is relevant for web teams, specially in the light of the desktop refresh) and that documentation (to be provided by product infra team) is still pending. Please see: T248884: Documentation of client side error logging capabilities on mediawiki

• Nuria closed this task as Resolved.Jul 14 2020, 4:09 PM

abi_ awarded a token.Jul 16 2020, 5:17 AM

Jdlrobson closed subtask T255204: [Spike, 1hr]: Analyse Catalan Wikipedia error logs for iOS error as Resolved.Aug 4 2020, 4:15 PM

• jlinehan removed a subtask: T255585: [EPIC] Extend client-side error logging coverage to include English Wikipedia.Aug 5 2020, 1:34 PM

• jlinehan added a parent task: T255585: [EPIC] Extend client-side error logging coverage to include English Wikipedia.

• jlinehan removed a parent task: T255585: [EPIC] Extend client-side error logging coverage to include English Wikipedia.

• jlinehan added a parent task: T255585: [EPIC] Extend client-side error logging coverage to include English Wikipedia.Aug 5 2020, 1:36 PM

• jlinehan removed a subtask: T255204: [Spike, 1hr]: Analyse Catalan Wikipedia error logs for iOS error.

• jlinehan removed a subtask: T248884: Documentation of client side error logging capabilities on mediawiki.Aug 6 2020, 2:31 PM

hey @jlinehan -- do you have any concerns about me re-routing the DNS of intake-logging.wikimedia.org to resolve not to the nearest edge datacenter to the user, but to the second-nearest edge datacenter? This would help us get Network Error Logging reports in realtime, while also not really negatively impacting the client JS error use case AFAICT. There's some more details about this specific change in T261340, and the larger context in T257527 (to which I think you're subscribed).

There are ways to implement this for NEL without impacting client JS error logging, but they involve nontrivial additional work, so if above sounds good to you, it would save me some time. Please let me know, and thanks!

For EventLogging, we specifically moved away from separate domains to using /beacon so that the majority of non-deferred events that are sent off during user interactions don't require separate connections to be established etc. It has been a while since we last quantified the benefits of this choice, so it's certainly worth revisiting.

I see that EventGate, which is not used yet for most events, uses a separate domain again at https://intake-logging.wikimedia.org. That'll involve a DNS query, but given it points to dyna.wikimedia.org same as text-lb, I'm assuming this means it is handled through the same connection and traffic layer as other requests.

This might not be a deal breaker per the above, but it would require additional research and evaluation, and certainly makes things less flexible in the future so from my perspective it'd be a safe choice to expose this alternate routing separately and used only for those events. The EventGate client is set up to take full URLs as its way of configuring so that would not require any changes or additional flexibility there.

@CDanis this should be fine, the errors are fire-and-forget so this shouldn't cause any slowdown e.g. increased RTT waiting for a response, and there is no timing or anything that depends on this. I can't see any issue that would result, the DNS routing should be totally opaque.

Great, thank you! That was my thinking as well, but I wanted to confirm.

In T226986#6467370, @Krinkle wrote:

For EventLogging, we specifically moved away from separate domains to using /beacon so that the majority of non-deferred events that are sent off during user interactions do require separate connections to be established etc. It has been a while since we last quantified the benefits of this choice, so it's certainly worth revisiting.

If we're trying to reduce DNS queries, yeah, we are already making a separate one for the logging endpoint. Consolidating those is certainly worth talking through, but Is that related to @CDanis's proposal?

DNS is relatively quick and well-cached on-device, local network, and in middleware networks. Starting to establish more than one primary connection on a majority of page views is something we phased out 5+ years ago and would be great not to bring back without further research and consideration first.

The latency of events is indeed not critical, but there is a performance cost. It's all happening on the same device and network, consuming limited battery and network resources.

In T226986#6467398, @Krinkle wrote:

DNS is relatively quick and well-cached on-device, local network, and in middleware networks. Starting to establish more than one primary connection on a majority of page views is something we phased out 5+ years ago and would be great not to bring back without further research and consideration first.

The latency of events is indeed not critical, but there is a performance cost. It's all happening on the same device and network, consuming limited battery and network resources.

@Krinkle got it. I'll create a task for this then. Is this something that should block @CDanis's request or something that we should use the client error and network error traffic to study?

Huh, I'm pretty sure I discussed this with bblack and/or traffic team when we were first setting up eventgate-analtyics-external, and they preferred that the intake service got its own unique URL, rather than serving it in the wiki domains.

I can't find a public discussion of this in Phab, just https://phabricator.wikimedia.org/T233629#557637, so we must have discussed it in IRC or elsewhere.

In T226986#6467482, @Ottomata wrote:

Huh, I'm pretty sure I discussed this with bblack and/or traffic team when we were first setting up eventgate-analtyics-external, and they preferred that the intake service got its own unique URL, rather than serving it in the wiki domains.

I think that the issue isn't so much the URL being distinct as the fact that https://intake-logging.wikimedia.org would resolve differently than /beacon or the other endpoint for events and hence require another connection.

Also relevant: T262996: Don't set cookies in traffic layer for non-user facing domains (avoid false third-party cookie warning) ?

Right, but I asked originally which we should do, host at same wiki domain at /beacon, or use a separate domain, and I was told to use a separate domain.

• jlinehan mentioned this in T263049: Avoid extra HTTPS connections for most Event Platform beacons.Sep 16 2020, 6:43 PM

Tgr mentioned this in T501: Create WMF endpoint for error logging - part 1 (producer).Sep 16 2020, 11:49 PM

Tgr mentioned this in T502: Create WMF endpoint for error logging - part 2 (consumer).Sep 16 2020, 11:52 PM

Tgr mentioned this in T500: Create basic endpoint for JS error logging.Sep 16 2020, 11:54 PM

Tgr mentioned this in T499: Create error logging JS module.Sep 17 2020, 12:00 AM

Tgr mentioned this in T106915: Use Sentry in production.Sep 17 2020, 2:12 AM

Tgr mentioned this in T93138: Procure hardware for Sentry.Sep 17 2020, 2:19 AM

Tgr mentioned this in Sentry.Sep 17 2020, 3:11 AM

Change 628935 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/dns@master] point intake-logging.wikimedia.org to text-next (second-best DC)

https://gerrit.wikimedia.org/r/628935

Change 628935 merged by CDanis:
[operations/dns@master] point intake-logging.wikimedia.org to text-next (second-best DC)

https://gerrit.wikimedia.org/r/628935

Krinkle mentioned this in T71519: Run automated browser tests across set of Wikimedia wiki pages to ensure proper JavaScript.Aug 6 2023, 10:21 PM

Maintenance_bot removed a project: Patch-For-Review.Aug 6 2023, 10:31 PM

Aklapper mentioned this in T352108: Archive Gerrit repository "operations/software/sentry" (20150926).Nov 27 2023, 10:14 PM