Page MenuHomePhabricator

Broken graphs on wikis accounting for 45% of our client side errors
Closed, ResolvedPublic

Event Timeline

Jdlrobson renamed this task from Broken graphs on wikis. to Broken graphs on wikis accounting for 20% of our client side errors.Feb 11 2021, 3:58 PM
Jdlrobson triaged this task as Unbreak Now! priority.
Jdlrobson updated the task description. (Show Details)

This issue appears to be getting worse as broken grphs are copied and pasted to other Covid articles.

😱 It's 45% of our errors now, which is making it really hard to see other errors in our software.

Milimetric renamed this task from Broken graphs on wikis accounting for 20% of our client side errors to Broken graphs on wikis accounting for 45% of our client side errors.Feb 19 2021, 4:38 PM

Don't know what exactly I did, but I corrected the graph here https://en.wikipedia.org/wiki/Rape_statistics#Norway and the graph looks working. Did this correction lowered the error rate, even slightly ?

(what I did was the removal of |showSymbols=)

This issue appears to be getting worse as broken grphs are copied and pasted to other Covid articles.

You could say the errors are going... viral.
(Sorry, just had to.)

Anything I can help with here? I think given the level of support Graphs currently experiences, the best outcome here would likely be to wrap the rendering of a Vega graph in a try / catch block for now. The errors will still existing, but at least won't clog our error pipeline.

Right now this is blocking us from expanding our pipeline to load errors from Chrome browser that occur on start up (T266517)

Seems like a number of these still happening on mobile pages from ext.graph.vega module it's definitely looks like the majority of messages on the mw-client logstash dashboard are coming from graphs-related problems: https://logstash.wikimedia.org/goto/2c3fb224d2e2fb7759f8327fbef7f307

All these graphs should fix, and/or we should consider try / catch block for Vega errors. […]

I guess some questions to consider answering might be:

  • Is the root cause user-error in the form of a graph spec that isn't meant to be renderable? Or did this kind of graph work before and something on our end broke it?
  • Why did it start now? Is it because we've stopped rendering them server-side? Did these work previously on the server-side? Did the errors merely go to a different log then? If so, perhaps the client-side version of Graph is doing something wrong and thus not able to render the graphs that worked fine before. E.g maybe a slightly different Vega version or configuration values.
  • More generally, what do we expect to happen when a user creates an unrenderable graph spec (regardless of whether that's the case in these articles). Does it always throw uncaught exceptions, or does that only happen when it's an error we don't anticipate? In other words, if we try-catch these, does that mean we're merely doing the right thing and not recording issues with user-input on the server (fine), or would it mean we're temporarily ignoring all Graph-related issues including those caused by our own fault? Ideally we'd be able to distinguish between invalid input, and a truly unexpected error. (\cc @Milimetric might know.)

Why did it start now? Is it because we've stopped rendering them server-side?

Yes. Previously these errors only occurred when the graph was clicked and rendered. These errors are happening with much greater frequency now as a direct consequence of rendering graphs on page load.

Why did it start now? Is it because we've stopped rendering them server-side?

Yes. Previously these errors only occurred when the graph was clicked and rendered. These errors are happening with much greater frequency now as a direct consequence of rendering graphs on page load.

I didn't mean it in that way. I understand that client-side we didn't see it before. What I mean is, did the graphs themselves render correctly when we rendered them server-side previously? Or did they produce holes in the page back then as well, and produces a similar error in the Node.js service before?

Change 667028 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/Graph@master] Do not log graph errors to WMF servers

https://gerrit.wikimedia.org/r/667028

Change 667028 merged by jenkins-bot:
[mediawiki/extensions/Graph@master] Do not log graph errors to WMF servers

https://gerrit.wikimedia.org/r/667028

Change 666999 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/Graph@wmf/1.36.0-wmf.32] Do not log graph errors to WMF servers

https://gerrit.wikimedia.org/r/666999

Change 666999 merged by jenkins-bot:
[mediawiki/extensions/Graph@wmf/1.36.0-wmf.32] Do not log graph errors to WMF servers

https://gerrit.wikimedia.org/r/666999

Mentioned in SAL (#wikimedia-operations) [2021-02-26T00:14:30Z] <urbanecm@deploy1001> Synchronized php-1.36.0-wmf.32/extensions/Graph/: 9d5cf348f5dda32f8889d5160bb1fe34a4e07f8c: Do not log graph errors to WMF servers (T274557) (duration: 01m 36s)

Jdlrobson claimed this task.

Since this UBN was open for over a week, I decided to roll the above patch to disable logging errors for user-generated errors in graphs. These graphs are likely rendering broken but that's not WMF's job to support IMO. The error is now caught and logged to the error console so editors can still fix these if needed. A user notice has been created to inform editors (T275833).

In future, if we want to restore the error logging here, we can revert this change and begin logging these again. I have not seen an error from the graph extension in the last 20 minutes so am resolving this ticket.

To @Krinkle's point from above, there's not much inside of Jon's try block that could be our fault, because it's 99.99% third-party logic running in there, save for what version of vega we call and so on. If we do something bad, I think graphs would all break and we'd know about it. And the future of the graph extension is very uncertain right now anyway, these regressions will hopefully drive some product thinking.