
Significant and unexpected drop in JavaScript error logging
Closed, Resolved · Public

Description

The drop looks too good to be true. Did we break the pipeline or change how we log errors?
https://grafana.wikimedia.org/d/000000566/overview?orgId=1&viewPanel=16

(Screenshot: Screen Shot 2022-10-03 at 6.04.49 PM.png)

Marking as UBN until we can confirm what happened here.

In the last 24 hours only 557 errors were logged.
In the week before the deploy we were averaging 50,000 errors a day.

Event Timeline

@Tgr @phuedx pinging you given the activity in WikimediaEvents last week (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/700242)
I am marking this as a train blocker for now, as it could potentially mean errors leak into production that we don't know about. Feel free to adapt once you've assessed! Thanks in advance!

https://logstash.wikimedia.org/goto/42fcb0aed180c0a197cecdd26bc9d8ae

(Screenshot: Screenshot Capture - 2022-10-04 - 04-55-07.png)

Failed loading schema at /mediawiki/client/error/2.0.0 with ENOENT (meaning the schema file was not found).

The new schema was defined in c829295 and enabled in 829299. The only change to the schema name / path was a version bump; not sure what to look for there.

(As an aside, the schema version is hardcoded in puppet here; maybe that should be updated. But that's for the beta cluster.)

I think that the issue is related to how eventgate-logging-external reads the schemas, namely from local disk only:

https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate#eventgate-logging-external
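
Roughly speaking, EventGate resolves a $schema URI like /mediawiki/client/error/2.0.0 against its configured schema base URIs, and for this service those point only at the schemas baked into the Docker image. Conceptually the config looks something like this (an illustrative sketch, not the actual helm values; the on-disk path matches the pod listing below):

schema_base_uris:
  - file:///srv/service/schemas/event/primary/jsonschema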

Meanwhile I can see it updated in the schema registry: https://schema.wikimedia.org/#!//primary/jsonschema/mediawiki/client/error
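
The registry serves the raw schema files over HTTP as well, so the presence of 2.0.0 can be double-checked from a shell (the /repositories/... path is an assumption based on the browse URL above):

curl -sf https://schema.wikimedia.org/repositories/primary/jsonschema/mediawiki/client/error/2.0.0 && echo 'present in registry'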

Verified on a pod that the new schema is not there:

root@deploy1002:~# kubectl exec eventgate-logging-external-production-5f797889fb-9j5wl -n eventgate-logging-external -- ls /srv/service/schemas/event/primary/jsonschema/mediawiki/client/error
1.0.0
1.0.0.json
1.0.0.yaml
1.1.0
1.1.0.json
1.1.0.yaml
CHANGELOG.md
current.yaml
latest
latest.json
latest.yaml
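# (note: no 2.0.0, 2.0.0.json or 2.0.0.yaml entries in the listing above)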

In theory a rolling restart of the Kubernetes pods should resolve the problem. I'll ask Service Ops for advice before proceeding.
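
For reference, the kind of restart I have in mind (the deployment name here is inferred from the pod name above, so treat it as an assumption):

kubectl -n eventgate-logging-external rollout restart deployment/eventgate-logging-external-production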

Tried a rolling restart of the codfw pods; it was not successful.

It seems that the Docker image needs to be rebuilt with a new version of the schema registry; see this previous commit: https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/625966/1/.pipeline/blubber.yaml
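
In other words, the image bakes the schema repository in at build time at a pinned commit, so the build effectively does the following (a schematic sketch; see the linked commit for the actual blubber.yaml, and the sha is a placeholder):

# at image build time (schematic):
git clone https://gerrit.wikimedia.org/r/schemas/event/primary /srv/service/schemas/event/primary
cd /srv/service/schemas/event/primary
git checkout <pinned-sha>   # this sha is what needs bumping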

Change 838067 had a related patch set uploaded (by Elukey; author: Elukey):

[eventgate-wikimedia@master] blubber: update primary schema registry's git sha

https://gerrit.wikimedia.org/r/838067

Change 838067 merged by Btullis:

[eventgate-wikimedia@master] blubber: update primary schema registry's git sha

https://gerrit.wikimedia.org/r/838067

Change 838107 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Bump version of eventgate image that is in use

https://gerrit.wikimedia.org/r/838107

Change 838107 merged by jenkins-bot:

[operations/deployment-charts@master] Bump version of eventgate image that is in use

https://gerrit.wikimedia.org/r/838107

I have merged @elukey's change to eventgate-wikimedia (https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/838067) which built a new production eventgate image.

Next I have merged a change to deployment-charts (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/838107) so that this new image will be deployed.

I will proceed to restart the eventgate-logging-external service.
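
For the record, the deploy/restart goes through helmfile on the deployment host; roughly, and assuming the standard per-service path (a sketch, not a transcript):

cd /srv/deployment-charts/helmfile.d/services/eventgate-logging-external
helmfile -e codfw -i apply   # then the same for eqiad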

I have deployed eventgate-logging-external and the new schema appears to be in place.

(Screenshot: image.png)
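
The same on-pod check as earlier should now show the new version (the expected listing is an assumption, and the pod name will have changed with the redeploy):

kubectl -n eventgate-logging-external exec <new-pod-name> -- ls /srv/service/schemas/event/primary/jsonschema/mediawiki/client/error
# expect 2.0.0, 2.0.0.json and 2.0.0.yaml alongside the 1.x entries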

I can see an uptick here, but it is gradual: https://grafana.wikimedia.org/d/000000566/overview?orgId=1&viewPanel=16&from=now-1h&to=now&refresh=30s
Is anyone else able to confirm whether or not this fix has been successful?

Both logstash and grafana look really good, I think that the task can be closed!

> Is anyone else able to confirm whether or not this fix has been successful?

I added a line like

setTimeout( function () { throw new Error( 'T319261' ); }, 5000 );

to my common.js, refreshed the page, and saw a POST request to https://intake-logging.wikimedia.org/v1/events?hasty=true that returned HTTP 202, so 👍.
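
For anyone who wants to check without editing their common.js, a rough command-line equivalent (the minimal payload and stream name are assumptions, and note that ?hasty=true makes EventGate respond before validation, so a 202 mainly confirms the intake endpoint is up):

curl -s -o /dev/null -w '%{http_code}\n' -X POST 'https://intake-logging.wikimedia.org/v1/events?hasty=true' -H 'Content-Type: application/json' -d '{"$schema": "/mediawiki/client/error/2.0.0", "meta": {"stream": "mediawiki.client.error"}}'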

elukey lowered the priority of this task from Unbreak Now! to Medium. Oct 4 2022, 11:39 AM

Thank you all! Yes, only the 'analytics' eventgate clusters use dynamic schema lookup. All others rely on schemas baked into their local Docker image, so they don't have to be runtime-coupled to a remote schema service.
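
For contrast, a dynamic-lookup cluster like eventgate-analytics adds the remote registry to its schema base URIs, conceptually (illustrative values again):

schema_base_uris:
  - file:///srv/service/schemas/event/primary/jsonschema
  - https://schema.wikimedia.org/repositories/primary/jsonschema

The trade-off is exactly what this task hit: baked-in schemas avoid a runtime dependency on the registry, but a new schema version requires an image rebuild and redeploy.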

brennen claimed this task.

Resolving per discussion here and normal-looking graphs.

Thanks for the quick response here!