Correct logging of failure types on FileImporter Grafana board
Closed, Resolved · Public · 8 Story Points

Description

Motivation
The FileImporter Grafana dashboard currently only displays failures around user rights. However, it should also report all other failures, such as AbuseFilter issues, issues around the config file, or already existing content. (For more info see T224230 and T189570)

Acceptance Criteria

  • Whenever an import is blocked (i.e. the user sees an unresolvable error), this error shows up in the error graph of the grafana board
  • The "Failure type" dropdown at the top of the grafana board is removed

Notes

  • It might make sense to first create a mediawiki page or so listing all of the errors in one place
  • If some errors are easier to implement than others, implement the easy ones and let's discuss how much we need the difficult ones.
  • We might be able to do this iteratively, from broad buckets to more detail (e.g. AbuseFilter, ConfigFile, ExistingContent, Broken Content)
  • Document any new events in docs/metrics.md.

Event Timeline

Restricted Application added a project: TCB-Team. · Jun 12 2019, 10:39 AM
Lea_WMDE set the point value for this task to 8.
awight updated the task description. · Jun 13 2019, 9:34 AM
awight updated the task description. · Jun 13 2019, 3:07 PM
awight added a subscriber: awight. · Edited · Jun 13 2019, 3:28 PM

I'm getting a little confused, so will leave some breadcrumbs about where various error stats come from in the current code:

  • MediaWiki.FileImporter.specialPage.execute.fail.* - Permission and user block errors when first opening the special page are tallied under this prefix, which should be used with care because it almost overlaps with the ...fail.plan.* below.
  • MediaWiki.FileImporter.specialPage.execute.fail.plan.total and detailed MediaWiki.FileImporter.specialPage.execute.fail.plan.byType.* tally errors when building the ImportPlan.
  • Mediawiki.FileImporter.import.result.exception is produced by Importer::import called when action=submit.

The Grafana dashboard shows all these errors now, and sums correctly over the past 24hr. Next steps are to distinguish between recoverable and unrecoverable errors during the planning stage. Then we can provide more granularity where desired--probably best if @Lea_WMDE looks over the dashboard to help scope where we need this granularity.

Notes from discussion:

  • Definitely go ahead with creating a MediaWiki page to document what the buckets are and what specific errors are in each one.
  • Link to our documentation from the "i" information popup on the failure graph, and summarize to help users with interpretation.
  • Minimum granularity for this task is to split out unrecoverable from recoverable errors, and to expose exact counts for AbuseFilter matches.

Notes on next implementation steps:

  • Our exceptions should all respond to getCode, with a string constant naming the error type.
  • Take recoverable errors out of the planning phase error count and report them separately.
  • FileImporter.import.result.exception is missing two errors, "bad edit token" and "bad import hash". Move import submit stats logging up to the exception handler to catch these.
  • Drop the "plan" vs. "submit" distinction; it doesn't seem to matter.
  • Report time taken to fail?
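A minimal sketch of the first three steps above, written in Python purely for illustration (the extension itself is PHP). It assumes every import exception exposes a string code and a recoverable flag, and that all tallying happens in one exception handler so that no error type is missed. The class names, codes, and metric keys here are hypothetical, not the ones used in FileImporter.

```python
from collections import Counter


class ImportException(Exception):
    """Base class: every import error carries a string code (hypothetical names)."""
    code = "unknown"
    recoverable = False

    def get_code(self) -> str:
        return self.code


class BadTokenException(ImportException):
    # Hypothetical code for the "bad edit token" case.
    code = "badToken"


class TitleConflictException(ImportException):
    # Hypothetical example of an error the user can resolve and retry.
    code = "titleConflict"
    recoverable = True


# Stand-in for a stats backend; keys are illustrative only.
stats = Counter()


def run_import(step):
    """Run one import step and tally the outcome in a single place."""
    try:
        step()
        stats["import.success"] += 1
    except ImportException as e:
        bucket = "recoverable" if e.recoverable else "unrecoverable"
        stats[f"error.{bucket}.{e.get_code()}"] += 1
```

Because the counting lives in the handler rather than inside the import routine, errors raised before the import proper starts (such as a bad token) end up in the same tally as everything else.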

Report time taken to fail?

I don't think we need that yet. Right now we are interested in step 1: Knowing how often which errors appear.

Change 517081 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/FileImporter@master] Give all import exceptions a code

https://gerrit.wikimedia.org/r/517081

Change 517061 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/FileImporter@master] Standardize error buckets

https://gerrit.wikimedia.org/r/517061

Change 517081 merged by jenkins-bot:
[mediawiki/extensions/FileImporter@master] Give all import exceptions a code

https://gerrit.wikimedia.org/r/517081

Confirmed that AbuseFilter errors are reported with error codes like abusefilter-disallowed, so they'll be easy to visually group in Grafana.

Reminder that grouping automatically is not trivial until we have tagged metrics.
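For illustration only, here is a small Python sketch of what grouping by code prefix could look like if done by hand outside Grafana. It assumes error codes use a hyphen after the family name, as in abusefilter-disallowed; the function name, the other codes, and the counts are made-up sample data.

```python
from collections import defaultdict


def group_by_prefix(counts, separator="-"):
    """Sum per-code counts into buckets keyed by the part before the separator."""
    grouped = defaultdict(int)
    for code, n in counts.items():
        prefix = code.split(separator, 1)[0]
        grouped[prefix] += n
    return dict(grouped)


# Made-up sample counts, not real dashboard data.
sample = {"abusefilter-disallowed": 3, "abusefilter-warning": 2, "badtoken": 1}
grouped = group_by_prefix(sample)
```

Codes without a separator simply form their own bucket, so nothing is dropped.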

Split out obsessive documentation into T225951, a non-blocking subtask.

This task will need a short iteration after merge and beta deployment, to update and test Grafana panels for the new metrics.

Change 517061 merged by jenkins-bot:
[mediawiki/extensions/FileImporter@master] Standardize error buckets

https://gerrit.wikimedia.org/r/517061

Now we're testing with beta cluster Grafana, then updating production. Oh and ¡surprise!, these changes were merged just in time to get dropped into the MediaWiki train tonight (testwiki) and tomorrow. We'll need to monitor that on Wednesday.

Trouble testing on beta: I'm unable to see any FileImporter metrics whatsoever on https://grafana-labs-admin.wikimedia.org/dashboard/db/awight-beta-fileimporter?orgId=1

I ran into this when building the production graphs, T226100. I'm not ready to trust the data due to this glitch--either something I'm doing wrong, or Graphite bugging out.

awight updated the task description. · Jun 19 2019, 12:38 PM
awight claimed this task. · Jun 19 2019, 1:05 PM

Worked around the glitch, I think the graphs are correct now. After demo and approval, we can remove the old graphs.

Change 517877 had a related patch set uploaded (by Awight; owner: Awight):
[mediawiki/extensions/FileImporter@master] Document new error metrics

https://gerrit.wikimedia.org/r/517877

Change 517877 merged by jenkins-bot:
[mediawiki/extensions/FileImporter@master] Document new error metrics

https://gerrit.wikimedia.org/r/517877

The dashboard is looking great now!
I just have one thing:

  • The graph contains both recoverable and unrecoverable errors. Let's focus on unrecoverable errors for now (but if it is a 2 minute thing to do, I would be interested in a separate graph for recoverable errors, too)

Sure, that's easy. I have a crude workaround for the lack of tagged metrics (keyword bingo!), so we can filter on byRecoverable as part of the metric path.
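A rough Python sketch of the idea, assuming the recoverable flag is encoded as a path segment and that wildcards match one dot-separated segment at a time. The path layout and the matches helper are hypothetical and only model the filtering; they are not the dashboard's actual queries.

```python
from fnmatch import fnmatchcase


def matches(pattern, metric):
    """Return True if each dot-separated segment of metric matches the pattern's segment."""
    p, m = pattern.split("."), metric.split(".")
    return len(p) == len(m) and all(
        fnmatchcase(seg, pat) for pat, seg in zip(p, m)
    )


# Hypothetical metric paths with the flag encoded in the path itself.
metrics = [
    "FileImporter.error.byRecoverable.true.byType.titleConflict",
    "FileImporter.error.byRecoverable.false.byType.abusefilter-disallowed",
    "FileImporter.error.byRecoverable.false.byType.badToken",
]

unrecoverable = [
    name for name in metrics
    if matches("FileImporter.error.byRecoverable.false.byType.*", name)
]
```

Swapping false for true in the pattern gives the series for a separate recoverable-errors graph.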

Ready for re-review.

Lea_WMDE closed this task as Resolved. · Jun 25 2019, 5:37 PM
Lea_WMDE moved this task from Demo to Done on the WMDE-QWERTY-Sprint-2019-06-12 board.