
Interface admins, engineers and deployers must check JavaScript error metrics when deploying code
Open, Needs Triage, Public

Description

Who?
This concerns anyone deploying code, enabling a banner, or making an edit to site JS (e.g. MediaWiki:Common.js or site-wide gadgets).

What?

When making changes, please check the public client error graph to confirm that they haven't introduced errors to our ecosystem.

An alert has also been set up: if more than 3k error events occur within a 5-minute period, it will fire to the wikimedia-operations channel and email the Readers Web team. Notifications can be sent to other alerting groups if defined (please ask here).
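To make the threshold concrete, here is a minimal sketch of a "more than N events in a sliding window" check. It is not the real alerting configuration: `createErrorRateAlert` and its parameters are hypothetical names, and only the 3k and 5 minute figures come from the description above. It assumes events are recorded in non-decreasing time order.

```javascript
// Hypothetical sketch; these names do not come from the real alerting setup.
const WINDOW_MS = 5 * 60 * 1000; // 5 minute period, from the description
const THRESHOLD = 3000;          // "3k error events", from the description

function createErrorRateAlert(thresholdCount = THRESHOLD, windowMs = WINDOW_MS) {
  // Arrival times (in ms) of events that are still inside the window.
  const timestamps = [];
  return {
    // Record one error event; returns true when the alert should fire.
    record(nowMs) {
      timestamps.push(nowMs);
      // Drop events that have aged out of the window.
      const cutoff = nowMs - windowMs;
      while (timestamps.length > 0 && timestamps[0] <= cutoff) {
        timestamps.shift();
      }
      // Fire only when the count is strictly above the threshold.
      return timestamps.length > thresholdCount;
    }
  };
}
```

A spike from a single busy wiki trips this quickly, while a broken script on a small wiki may never reach the threshold, which is the limitation raised later in this thread.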

cc @thcipriani @dduvall @Pcoombe @Esanders @Tgr @Urbanecm @kostajh @santhosh @egardner @Ladsgroup - please communicate to your peers so they know to check during deploys of banners/code. I also suggest setting up a highlight word in IRC for *wikimedia-client-errors-alerts* so that the alert is not lost in the log spam.

More background can be found on T264665. Thank you @colewhite for your help here!

Event Timeline

Tgr added a subscriber: Urbanecm.

wikitech:Backport windows/Deployers#SSH Connections and Error Logs already lists the Logstash dashboard, which I assume is superior. The IRC highlight would be a useful addition - can it be named something clearer and more generic (like client-error-alerts)? Right now it sounds as if it's intended for one specific team.

CentralNotice admins and interface editors don't have access to Logstash sadly, and the Grafana board does not seem all that useful to them - you can't filter by wiki, so errors caused on smaller sites will hardly register. In any case, I don't think a Grafana board is really digestible to CentralNotice admins without further documentation. Also Tech News isn't really sufficient for setting guidelines - is there an outreach plan beyond that?

CentralNotice admins and interface editors don't have access to Logstash sadly, and the Grafana board does not seem all that useful to them - you can't filter by wiki, so errors caused on smaller sites will hardly register.

True. I think the main people I'm aiming this at are on the larger wikis. Most recently, Portuguese and Chinese Wikipedia introduced errors in site-wide JavaScript, and those would have displayed a noticeable spike on the graph. Any kind of awareness about this would be helpful. In an ideal world all interface admins would have access to Logstash to investigate further. Any help in moving in that direction would be appreciated.

In any case, I don't think a Grafana board is really digestible to CentralNotice admins without further documentation.

I'm curious what documentation would help here. A spike is all that they'd need to look for. In case of spikes, I'm hoping a quick query with Wikimedia staff in IRC would suffice.

Also Tech News isn't really sufficient for setting guidelines - is there an outreach plan beyond that?

I wasn't intending to set guidelines, but more recommendations through awareness. Certain admins have been wonderful during the last few months in responding to errors. Having a public graph is at least a starting point to a conversation.

I'm currently writing a post for the Diff blog with further recommendations, one of which is Wikimedia playing a more active role in supporting better-informed gadgets.

@Jdlrobson I think maybe waiting for that blog post would be a good idea, so it could go with the Tech News item?

I don't disagree with the general idea but I'm slightly unsure about having an alert on user-based client-side errors. I've done around 50K edits to help with legacy deprecation and other work so far as a global interface editor and I found it quite a mess that needs an underlying solution.

First of all, it's not a small codebase, it's definitely millions and millions of lines of code, mostly copied from an old version of someone else's code. For example, I replaced more than 2k copies of the same exact function with the same exact spaces. Some are beyond rotten. Some are minified (how am I supposed to fix minified code?). Some of my edits got reverted because the user was like "this is my personal space, you're not allowed" (your gadget will break soon though *shrugs*), and all sorts of complexities make it really hard to actually account for all edge cases.

Of course, I should not break a whole wiki or damage common.js, but beyond that, it's the responsibility of the maintainer to fix code that was deprecated years ago, and if errors go up when we pull the plug, so be it.

I don't disagree with the general idea but I'm slightly unsure about having an alert on user-based client-side errors. I've done around 50K edits to help with legacy deprecation and other work so far as a global interface editor and I found it quite a mess that needs an underlying solution.

Yeh, it's a big mess. The root problem is that user and site scripts run in global JavaScript and cannot be filtered out programmatically. While that's the case, alerts can and will be triggered by site JS, and there's little we can do about that if we want to keep logging errors, except dealing with it when it happens.

This is a big problem and we need a long-term strategy for it, and I'm now switching gears to communicating that problem more broadly. One long-term strategy I'm exploring is to limit errors in a given user session (leaning on mw.storage.session) and a trip-switch preference that disables error logging for buggy clients, but that needs a lot more thought. I'm writing down my thoughts on this so far for diff.wikimedia.org with some straw man proposals and I'll share it as soon...
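As a rough illustration of the per-session limit only (the trip-switch preference is not modelled), the idea could look something like the sketch below. This is a straw man, not an actual implementation: `createSessionErrorLimiter`, the key name and the limit of 5 are all made up for illustration. The `store` argument is assumed to be anything with `get(key)` and `set(key, value)` methods, which is the shape of `mw.storage.session`; an in-memory stand-in is included so the sketch runs outside a browser.

```javascript
// Hypothetical sketch; the function name, key name and default limit are
// illustrative assumptions, not part of MediaWiki.
function createSessionErrorLimiter(store, maxErrors = 5, key = 'client-error-count') {
  return {
    // Returns true if this error should still be logged in this session.
    shouldLog() {
      // A missing or unreadable value is treated as zero errors so far.
      const count = parseInt(store.get(key), 10) || 0;
      if (count >= maxErrors) {
        return false; // limit reached for this session, stop logging
      }
      store.set(key, String(count + 1));
      return true;
    }
  };
}

// In-memory stand-in with the same get/set shape, for use outside a browser.
function createMemoryStore() {
  const data = new Map();
  return {
    get: (k) => (data.has(k) ? data.get(k) : null),
    set: (k, v) => { data.set(k, String(v)); return true; }
  };
}
```

In a browser with MediaWiki loaded, the stand-in would presumably be swapped for `mw.storage.session`, so that the count resets when the session ends.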

That is good but I think it needs a bigger solution. Like global gadgets, or git-based gadgets so people stop copying javascript across different wikis. It would reduce the security vectors and improve the user experience for editors in general.

@Jdlrobson I think maybe waiting for that blog post would be a good idea, so it could go with the Tech News item?

I've published the post now to converse about this some more :) https://diff.wikimedia.org/2021/03/08/sailing-steady%e2%80%8a-%e2%80%8ahow-you-can-help-keep-wikimedia-sites-error-free/

That is good but I think it needs a bigger solution. Like global gadgets, or git-based gadgets so people stop copying javascript across different wikis. It would reduce the security vectors and improve the user experience for editors in general.

Please can you carry this conversation over to T262493 so it's all in one place?