
Significant and unexpected drop in JavaScript error logging
Closed, Resolved · Public

Description

The drop looks too good to be true. Did we break the pipeline or change how we log errors?
https://grafana.wikimedia.org/d/000000566/overview?orgId=1&viewPanel=16

(Screenshot: Screen Shot 2022-10-03 at 6.04.49 PM.png)

Marking as UBN until we can confirm what happened here.

In the last 24 hours only 557 errors were logged.
In the week before the deploy we were averaging 50,000 errors a day.

Event Timeline

@Tgr @phuedx pinging you given the activity in WikimediaEvents last week (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/700242)
I am marking this as a train blocker for now, as it could potentially mean errors leak into production that we don't know about. Feel free to adapt once you've assessed! Thanks in advance!

https://logstash.wikimedia.org/goto/42fcb0aed180c0a197cecdd26bc9d8ae

(Screenshot: Screenshot Capture - 2022-10-04 - 04-55-07.png)

Failed loading schema at /mediawiki/client/error/2.0.0 with ENOENT (meaning the schema file was not found).

The new schema was defined in c829295 and enabled in 829299. The only change to the schema name / path was a version bump; not sure what to look for there.

(As an aside, the schema version is hardcoded in puppet here; maybe that should be updated. But that's for the beta cluster.)

I think that the issue is related to how eventgate-logging-external reads the schemas, namely from local disk only:

https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate#eventgate-logging-external
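
Roughly speaking, EventGate resolves a $schema URI like /mediawiki/client/error/2.0.0 against its configured schema base URIs, and for this service those point only at the schemas baked into the Docker image. Conceptually the config looks something like this (an illustrative sketch, not the actual helm values; the on-disk path matches the pod listing below):

schema_base_uris:
  - file:///srv/service/schemas/event/primary/jsonschema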

Meanwhile I can see it updated in the schema registry: https://schema.wikimedia.org/#!//primary/jsonschema/mediawiki/client/error
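
The registry serves the raw schema files over HTTP as well, so the presence of 2.0.0 can be double-checked from a shell (the /repositories/... path is an assumption based on the browse URL above):

curl -sf https://schema.wikimedia.org/repositories/primary/jsonschema/mediawiki/client/error/2.0.0 && echo 'present in registry'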

Verified on a pod that the new schema is not there:

root@deploy1002:~# kubectl exec eventgate-logging-external-production-5f797889fb-9j5wl -n eventgate-logging-external -- ls /srv/service/schemas/event/primary/jsonschema/mediawiki/client/error
1.0.0
1.0.0.json
1.0.0.yaml
1.1.0
1.1.0.json
1.1.0.yaml
CHANGELOG.md
current.yaml
latest
latest.json
latest.yaml
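# (note: no 2.0.0, 2.0.0.json or 2.0.0.yaml entries in the listing above)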

In theory a rolling restart of the Kubernetes pods should resolve the problem. I'll ask Service Ops for advice before proceeding.
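
For reference, the kind of restart I have in mind (the deployment name here is inferred from the pod name above, so treat it as an assumption):

kubectl -n eventgate-logging-external rollout restart deployment/eventgate-logging-external-production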

Tried a rolling restart of the codfw pods; it was not successful.

It seems that the Docker image needs to be rebuilt with a new version of the schema registry; see this previous commit: https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/625966/1/.pipeline/blubber.yaml
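
In other words, the image bakes the schema repository in at build time at a pinned commit, so the build effectively does the following (a schematic sketch; see the linked commit for the actual blubber.yaml, and the sha is a placeholder):

# at image build time (schematic):
git clone https://gerrit.wikimedia.org/r/schemas/event/primary /srv/service/schemas/event/primary
cd /srv/service/schemas/event/primary
git checkout <pinned-sha>   # this sha is what needs bumping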

Change 838067 had a related patch set uploaded (by Elukey; author: Elukey):

[eventgate-wikimedia@master] blubber: update primary schema registry's git sha

https://gerrit.wikimedia.org/r/838067

Change 838067 merged by Btullis:

[eventgate-wikimedia@master] blubber: update primary schema registry's git sha

https://gerrit.wikimedia.org/r/838067

Change 838107 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Bump version of eventgate image that is in use

https://gerrit.wikimedia.org/r/838107

Change 838107 merged by jenkins-bot:

[operations/deployment-charts@master] Bump version of eventgate image that is in use

https://gerrit.wikimedia.org/r/838107

I have merged @elukey's change to eventgate-wikimedia (https://gerrit.wikimedia.org/r/c/eventgate-wikimedia/+/838067) which built a new production eventgate image.

Next I have merged a change to deployment-charts (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/838107) so that this new image will be deployed.

I will proceed to restart the eventgate-logging-external service.
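
For the record, the deploy/restart goes through helmfile on the deployment host; roughly, and assuming the standard per-service path (a sketch, not a transcript):

cd /srv/deployment-charts/helmfile.d/services/eventgate-logging-external
helmfile -e codfw -i apply   # then the same for eqiad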

I have deployed eventgate-logging-external and the new schema appears to be in place.

(Screenshot: image.png)
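
The same on-pod check as earlier should now show the new version (the expected listing is an assumption, and the pod name will have changed with the redeploy):

kubectl -n eventgate-logging-external exec <new-pod-name> -- ls /srv/service/schemas/event/primary/jsonschema/mediawiki/client/error
# expect 2.0.0, 2.0.0.json and 2.0.0.yaml alongside the 1.x entries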

I can see an uptick here, but it is gradual: https://grafana.wikimedia.org/d/000000566/overview?orgId=1&viewPanel=16&from=now-1h&to=now&refresh=30s
Is anyone else able to confirm whether or not this fix has been successful?

Both logstash and grafana look really good, I think that the task can be closed!

> Is anyone else able to confirm whether or not this fix has been successful?

I added a line like

setTimeout( function () { throw new Error( 'T319261' ); }, 5000 );

to my common.js, refreshed the page, and saw a POST request to https://intake-logging.wikimedia.org/v1/events?hasty=true that returned HTTP 202, so 👍.
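
For anyone who wants to check without editing their common.js, a rough command-line equivalent (the minimal payload and stream name are assumptions, and note that ?hasty=true makes EventGate respond before validation, so a 202 mainly confirms the intake endpoint is up):

curl -s -o /dev/null -w '%{http_code}\n' -X POST 'https://intake-logging.wikimedia.org/v1/events?hasty=true' -H 'Content-Type: application/json' -d '{"$schema": "/mediawiki/client/error/2.0.0", "meta": {"stream": "mediawiki.client.error"}}'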

elukey lowered the priority of this task from Unbreak Now! to Medium. Oct 4 2022, 11:39 AM

Thank you all! Yes, only the 'analytics' eventgate clusters use dynamic schema lookup. All others rely on schemas baked into their local Docker image, so they don't have to be runtime-coupled to a remote schema service.
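
For contrast, a dynamic-lookup cluster like eventgate-analytics adds the remote registry to its schema base URIs, conceptually (illustrative values again):

schema_base_uris:
  - file:///srv/service/schemas/event/primary/jsonschema
  - https://schema.wikimedia.org/repositories/primary/jsonschema

The trade-off is exactly what this task hit: baked-in schemas avoid a runtime dependency on the registry, but a new schema version requires an image rebuild and redeploy.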

brennen claimed this task.

Resolving per discussion here and normal-looking graphs.

Thanks for the quick response here!