Improve database application performance monitoring visibility
Closed, Resolved (Public)

Assigned To

Authored By

	jcrespo
	Oct 9 2017, 3:35 PM

Description

Platform and performance team members (especially @Legoktm and @aaron) have sometimes told me that they need to know when servers are lagging, because in some cases the cause is not an infrastructure issue but new code causing too much load, concurrency issues, or some other logical-level reason. I suppose this is especially relevant to their work on making multi-datacenter possible.

This ticket, which does not yet have clear actionables, is a way to start a conversation and coordinate with them to produce alerts/metrics that are useful to them. However, we cannot page all ops every time the application layer has a temporary hiccup. We can use existing technology (like Prometheus), introduce new ones, or change our model of monitoring/metrics/alerting/observability. It is especially important to separate "things are down" from "things are up but degraded".

Too many metrics/alerts will make people ignore them rather than focus on the important stuff. We also have to make false positives clearly identifiable (for example, when a server goes down it is normal for it to accumulate lag, but it is also depooled). This has been partially done with the MediaWiki-based monitoring on Graphite.
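
To make the distinction above more concrete, here is a minimal sketch of how a check could separate "down" from "up but degraded" and suppress the depooled false positive. It is only an illustration to start the discussion: the names (`ReplicaStatus`, `Severity`, `classify`) and the 5-second lag threshold are hypothetical assumptions, not anything that exists in our monitoring today.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    OK = "ok"
    DEGRADED = "degraded"      # up but lagging: notify code owners, no page
    DOWN = "down"              # pooled but unreachable: candidate for a page
    SUPPRESSED = "suppressed"  # depooled: lag is expected, not an alert


@dataclass
class ReplicaStatus:
    """Hypothetical snapshot of one replica, as a check might see it."""
    reachable: bool
    pooled: bool
    lag_seconds: Optional[float]  # None when lag could not be measured


def classify(status: ReplicaStatus, lag_threshold: float = 5.0) -> Severity:
    """Map a replica snapshot to a severity (threshold is an assumed value)."""
    # A depooled server serves no traffic, so lag or downtime is a false positive.
    if not status.pooled:
        return Severity.SUPPRESSED
    # Pooled but unreachable (or lag unmeasurable) is treated as "down".
    if not status.reachable or status.lag_seconds is None:
        return Severity.DOWN
    # Pooled, reachable, but behind: "up but degraded".
    if status.lag_seconds > lag_threshold:
        return Severity.DEGRADED
    return Severity.OK
```

With something like this, only `Severity.DOWN` would reach ops as a page, while `Severity.DEGRADED` could feed a dashboard or a lower-priority channel for platform and performance people.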

Related Objects

Status	Assigned	Task
Open	None	T143896 MySQL metrics monitoring
Resolved	None	T172492 Database alerting
Resolved	aaron	T177778 Improve database application performance monitoring visibility

Event Timeline

jcrespo created this task.Oct 9 2017, 3:35 PM

jcrespo edited parent tasks, added: T143896: MySQL metrics monitoring; removed: T172492: Database alerting.Oct 9 2017, 3:43 PM

jcrespo added a parent task: T172492: Database alerting.

jcrespo mentioned this in T177782: Reduce false positives on database pages.Oct 9 2017, 4:19 PM

• Marostegui moved this task from Triage to Meta/Epic on the DBA board.Oct 10 2017, 6:36 AM

• Gilles moved this task from Inbox, needs triage to Radar on the Performance-Team board.Oct 11 2017, 5:26 PM

• Gilles edited projects, added Performance-Team (Radar); removed Performance-Team.

Aklapper renamed this task from Improve database aplication performance monitoring visibility to Improve database application performance monitoring visibility.Oct 12 2017, 10:26 AM

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.Nov 6 2017, 8:55 PM

CCicalese_WMF moved this task from Inbox to Watching on the MediaWiki-Platform-Team-Archived board.Jan 2 2018, 6:19 PM

CCicalese_WMF edited projects, added Core-Platform-Team-Old; removed MediaWiki-Platform-Team-Archived.Jul 12 2018, 12:19 AM

CCicalese_WMF moved this task from Inbox to Watching on the Core-Platform-Team-Old board.

CCicalese_WMF edited projects, added Platform Team Legacy (Watching / External); removed Core-Platform-Team-Old.Oct 1 2018, 4:44 PM

Some actionables were done: Graphite lag metrics are now visible on Grafana, and of course we have the MySQL Prometheus dashboards. Logstash monitoring has also improved, with increased visibility.

I think we can resolve this and continue working on database monitoring metrics in T143896, unless someone else wants to bring up more concrete actionables.
