Review Wikidata error dashboard. Create/adjust phabricator ticket priorities
Closed, Resolved (Public)

Description

Following up on T255585, @Jdlrobson has set up a dashboard for tracking client errors at: https://logstash.wikimedia.org/app/kibana#/dashboard/AXTmr5djLNRtRo5XelM_ (link updated to point to the currently saved version that includes the new "normalized messages per numbers of IPs" metric requested in T265131)

We want to look at the errors and provide an overview and breakdown of the errors that currently occur.
We intend to define the process for addressing different kinds of errors. This will follow once the overview of the error landscape has been created.

We might want to write Phabricator tickets for these errors and put them in the general Wikidata backlog.

Some of the tickets that we create may be for code that we do not own or control (such as gadgets).
We still want to create these tickets so that the community has an opportunity to fix the issues.

Information to include in the overview (potentially incomplete list)

  • source of the error (at least classified as: Wikibase code, non-Wikibase MediaWiki, gadget; preferably with a link to the source line causing the error or to the gadget in question)
  • frequency of occurrence (e.g. within a weekly window)
  • link to the Phabricator task tracking the error cluster

ACs:

  • Wikidata.org errors have been reviewed
  • We know the source of the errors
  • We have Phabricator tickets for the errors that we want to fix

Original write up

Following up on T255585, I've set up a dashboard for tracking client errors at:
https://logstash.wikimedia.org/goto/0e48f48aaeb915d759f53eedc4230000
This filters various known gadget problems to allow you to focus on the errors that matter.

Right now Wikidata seems to cause more errors than any other project that has error tracking enabled: about 8,988 errors in the last 24 hrs. We should strive to get these down to 5,000 a day to be at similar levels to the other projects.

@Esanders has created a useful tool for shortening stack traces that may be useful:
https://edg2s.github.io/short-trace/

From experience, anything with a count of over 200 in 12hrs is probably something worth fixing. Troublesome IP addresses usually indicate a faulty user script that is also worth fixing to cut down log noise.
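The rule of thumb above can be sketched in a few lines. This is only an illustration, not how the dashboard works: the event shape (`message`, `ip`, `timestamp` in milliseconds) and the function name `errorsWorthFixing` are assumptions made up for the example; only the threshold of 200 and the 12-hour window come from the write-up.

```javascript
// Hypothetical event shape (an assumption, not the real Logstash schema):
//   { message: string, ip: string, timestamp: number /* ms since epoch */ }
const THRESHOLD_COUNT = 200;              // "a count of over 200"
const WINDOW_MS = 12 * 60 * 60 * 1000;    // "in 12hrs"

// Returns the messages that occurred more than `threshold` times within
// the window ending at `now`, most frequent first.
function errorsWorthFixing(events, now, threshold = THRESHOLD_COUNT, windowMs = WINDOW_MS) {
  const counts = new Map();
  for (const event of events) {
    const inWindow = event.timestamp > now - windowMs && event.timestamp <= now;
    if (inWindow) {
      counts.set(event.message, (counts.get(event.message) || 0) + 1);
    }
  }
  return [...counts.entries()]
    .filter(([, count]) => count > threshold)
    .sort((a, b) => b[1] - a[1])
    .map(([message, count]) => ({ message, count }));
}
```

Calling `errorsWorthFixing(events, Date.now())` on an exported list of events would give a shortlist to triage first; the threshold is a heuristic and probably needs tuning per project.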

The biggest priority should be diagnosing and fixing the error TypeError: context is undefined

Let me know if I can help with anything relating to triage or the dashboard itself! Have fun!

Results

The following table should cover over 70% of the errors that were logged in the last 24 hours:

| Error message | Source or steps to reproduce | Frequency per 24h | Phabricator task |
| --- | --- | --- | --- |
| TypeError: context is undefined and Uncaught TypeError: Cannot read property 'config' of undefined | Type something in the small search box; the error appears after pressing enter. Works while logged out (-> not a gadget?) | 2300 | T105637 (fix should be deployed with the next train) |
| Error: invalid entity serialization | Source: Termbox interaction with mobile Lexeme. Steps: go to https://m.wikidata.org/wiki/Lexeme:L13657 | 1160 (these are probably just the Firefox errors) | T264893 |
| TypeError: $.widget is not a function | Probably this user script: User:Ch1902/ancestry.js (see logstash) | 550 | T265015 |
| NS_ERROR_FILE_CORRUPTED: | Unclear; all these errors come from only 2 IPv6 addresses (see logstash). Probably one of these gadgets: AuthorityControl, Descriptions, DragNDrop, DuplicateReferences, EasyQuery, Merge, NewSection, Preview, PrimarySources, ProtectionIndicators, RequestDeletion, Search, SiteIdToInterwiki, autoEdit, currentDate, formWizard, imagelinks, labelLister, linkscount, relateditems | 400 | T265022 |
| Uncaught ReferenceError: wgCanonicalSpecialPageName is not defined | Seems to be caused by the undefined variable wgCanonicalSpecialPageName in https://meta.wikimedia.org/wiki/User:HakanIST/global.js and maybe other copies of the confirmWatchlistRollback script. Fixed by Esanders | 280 | None |
| Uncaught TypeError: $(...).css(...).draggable is not a function | Seems to be caused by https://www.wikidata.org/w/index.php?title=User:Magnus_Manske/mixnmatch_gadget.js | 230 | T265035 |
| ReferenceError: wikibase is not defined | Likely caused by https://www.wikidata.org/w/index.php?title=User:Mat%C4%9Bj_Such%C3%A1nek/moveClaim.js | 125 | T265037 |
| SyntaxError: expected expression, got '<' | Unclear what is going on. The file_url is given as https://www.wikidata.org/w/index.php?title=MediaWiki:Gadget-DraggableSitelinks&action=raw&ctype=text/javascript, so maybe someone tries to include an actual HTML page? Or is that gadget page broken? (see logstash) | 162 | None |
| Uncaught TypeError: $(...).dialog is not a function | Seems to stem from the following user script: https://www.wikidata.org/w/index.php?title=User:Bargioni/viaf.js (see logstash) | 164 | T265053 |
| TypeError: OO.ui is undefined | Precise source unclear, see the comment below. Seems to be affecting <10 users, all on Firefox | 440 | TBC |

Event Timeline

It would probably be useful to summarize these on a wiki page, similar to User:Jdlrobson/User scripts with client errors, so volunteers without Logstash access can also look into them. I expect many of these errors are in gadgets and user scripts (though some might be in Wikibase code, too).

Addshore renamed this task from Review Wikidata error dashboard, fix gadgets and create/adjust phabricator ticket priorities to Review Wikidata error dashboard. Create/adjust phabricator ticket priorities. (Oct 6 2020, 10:22 AM)
Addshore updated the task description.
Addshore added a subscriber: Esanders.

Hint from Lydia: Project to tag possible gadget related errors with https://phabricator.wikimedia.org/project/view/1278/

list of errors has been moved to main ticket description for better visibility

TypeError: context is undefined is T105637, which I've also fixed.

TypeError: OO.ui is undefined is often being thrown by https://www.wikidata.org/w/index.php?title=MediaWiki:Gadget-Merge.js, which is strange because https://www.wikidata.org/wiki/MediaWiki:Gadgets-definition correctly adds a dependency on oojs-ui-windows. I searched for users who were pulling in this script directly and also added dependencies there, so I don't know why this is still happening.

TypeError: context is undefined is T105637, which I've also fixed.

Cool, thank you!

@Esanders Regarding the NS_ERROR_FILE_CORRUPTED errors: it seems they stem from only two IP addresses, probably some gadget. Would it be possible to amend the script that is sending these errors to logstash to include the user, if available? That shouldn't be much of a privacy issue since we are already collecting all the IP addresses, and it would allow us to talk to them directly and ask what they are doing.

TypeError: OO.ui is undefined is often being thrown by https://www.wikidata.org/w/index.php?title=MediaWiki:Gadget-Merge.js, which is strange because https://www.wikidata.org/wiki/MediaWiki:Gadgets-definition correctly adds a dependency on oojs-ui-windows. I searched for users who were pulling in this script directly and also added dependencies there, so I don't know why this is still happening.

It hasn't happened for some hours now. Maybe you got all of them?

Otherwise, one could probably build a check for that module being loaded into the script itself.

Still happening

You could add the loader to the script, but that just papers over the issue, which is that the gadget isn't fetching dependencies properly. CC @Krinkle @Catrope
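For reference, "adding the loader to the script" could look roughly like the sketch below. It relies on `mw.loader.using()`, which returns a promise that resolves once the requested ResourceLoader modules are loaded. The wrapper name `runWhenOouiReady` and passing `mw` in as a parameter are assumptions for illustration (in a gadget, `mw` is simply the global). As said above, this only works around the symptom and does not explain why the declared dependency is not being fetched.

```javascript
// Minimal sketch: defer the gadget's setup until oojs-ui-windows is loaded.
// `mw` is passed in so the sketch can be exercised outside a browser;
// inside a gadget or user script it would be the global MediaWiki object.
function runWhenOouiReady(mw, init) {
  // mw.loader.using() resolves after the listed modules (and their own
  // dependencies) have been loaded, so OO.ui is defined when init runs.
  return mw.loader.using(['oojs-ui-windows']).then(init);
}

// Usage inside a gadget (hypothetical init function):
//   runWhenOouiReady(mw, function () {
//     // code that touches OO.ui goes here
//   });
```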

I created a task to get maybe a more useful ordering of the normalized messages than just number of occurrences: T265131: Kibana: Sort normalized messages by how many users they affect
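To illustrate why that ordering might be more useful: ranking by distinct IPs pushes down errors that are frequent but come from very few clients (like the NS_ERROR_FILE_CORRUPTED cluster from 2 addresses). The sketch below is an assumption-laden illustration, not the Kibana implementation; the event shape (`message`, `ip`) and the name `rankByAffectedClients` are made up for the example, and distinct IPs are only a rough proxy for distinct users.

```javascript
// Hypothetical event shape (an assumption): { message: string, ip: string }
// Groups events by normalized message and ranks the groups by the number
// of distinct IPs, using the raw occurrence count only as a tie-breaker.
function rankByAffectedClients(events) {
  const byMessage = new Map(); // message -> { count, ips: Set }
  for (const { message, ip } of events) {
    if (!byMessage.has(message)) {
      byMessage.set(message, { count: 0, ips: new Set() });
    }
    const entry = byMessage.get(message);
    entry.count += 1;
    entry.ips.add(ip);
  }
  return [...byMessage.entries()]
    .map(([message, { count, ips }]) => ({ message, count, clients: ips.size }))
    .sort((a, b) => b.clients - a.clients || b.count - a.count);
}
```

With this ordering, an error logged 400 times from 2 addresses ranks below one logged 5 times from 5 different addresses, which matches the intent of the requested metric.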

WMDE-leszek added a subscriber: Michael.

Thanks @Michael for the initial breakdown. I believe this should be repeated once or a few times after the "big" issues are resolved. I have a feeling not all common errors have been unearthed yet.

@WMDE-leszek Should this stay on the campsite, or rather be resolved and brought up again if we want to take another look?
As I see it, the ACs have all been met.

Closing as the initial breakdown happened and there seemed to be little motivation to do another iteration. The topic should resurface at some point in order to keep the dashboard meaningful.