
Features to be put into dbbackups monitoring dashboards
Open, Needs Triage, Public

Description

This ticket is for discussing and agreeing on which features the dashboard should have.

Currently, we have the following:

As a non-technical person, I want to see if something is wrong with the backups, at a glance

Dedicate one panel to this, e.g. the one below:

image.png (370×1 px, 46 KB)

As a member of DBOps, I want to know where exactly the issue is

Dedicate one panel or a page to this, e.g. the one below (TODO -- make it report by backup type rather than by datacenter):

image.png (608×1 px, 52 KB)

As a member of DBOps, I want granular data on which backup is failing [whether snapshot or dump -- we need to refine]

An example is below (this would be shown per backup type -- snapshot or dump).
As discussed, we should be able to see:

  • what type of backup is failing,
  • whether the failure is due to a size difference,
  • whether it is caused by unfresh (stale) backups,
  • whether it is caused by persistent backup failures,
  • which datacenter and server took the most recent successful backup (where is this stored? What is the definition of a successful backup?).

image.png (788×1 px, 75 KB)
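To make the checks listed above concrete, here is a minimal Python sketch of how one section and backup type could be diagnosed. Everything in it is an assumption for discussion: the `BackupRun` fields, the thresholds (8 days, 5% size change, 3 consecutive failures) and the label names are hypothetical and not taken from the existing dbbackups schema. In particular, "successful" is just a boolean here, since its definition is still an open question in this ticket.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class BackupRun:
    # Hypothetical record; field names are NOT the real dbbackups schema.
    section: str
    backup_type: str      # "snapshot" or "dump"
    datacenter: str
    host: str
    finished_at: datetime
    succeeded: bool       # what counts as "successful" is still to be defined
    size_bytes: int


def diagnose(
    runs: List[BackupRun],
    now: datetime,
    max_age: timedelta = timedelta(days=8),   # assumed freshness threshold
    max_size_change: float = 0.05,            # assumed: 5% relative change
    persistent_failures: int = 3,             # assumed: 3 failed runs in a row
) -> List[str]:
    """Return problem labels for the runs of one section and backup type."""
    runs = sorted(runs, key=lambda r: r.finished_at)
    good = [r for r in runs if r.succeeded]
    problems = []

    # Unfresh: no successful backup at all, or the latest one is too old.
    if not good or now - good[-1].finished_at > max_age:
        problems.append("unfresh")

    # Size difference: compare the two most recent successful backups.
    if len(good) >= 2:
        prev, last = good[-2], good[-1]
        if prev.size_bytes > 0:
            change = abs(last.size_bytes - prev.size_bytes) / prev.size_bytes
            if change > max_size_change:
                problems.append("size_difference")

    # Persistent failures: the last N runs all failed.
    recent = runs[-persistent_failures:]
    if len(recent) == persistent_failures and not any(r.succeeded for r in recent):
        problems.append("persistent_failures")

    return problems


def last_successful(runs: List[BackupRun]) -> Optional[BackupRun]:
    """Most recent successful run; it carries the datacenter and host."""
    good = [r for r in runs if r.succeeded]
    return max(good, key=lambda r: r.finished_at) if good else None
```

A detail panel could then show the labels from `diagnose` next to the `datacenter` and `host` of `last_successful`, once per backup type.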

This ticket is to discuss which features you would like to see as part of the final evaluation for GSoC, and which of them would be deemed essential.

Event Timeline

@Marostegui @jcrespo feel free to add your thoughts and suggestions, and what you would like to see for the final evaluation :-)

I think you rightly captured what we meant during the meeting.
It would be 3 layers, as you describe there: a general overview of the status (good/bad), then a bit more granular (good/bad per section), and finally, if you want to dig into a specific section, the last screenshot captures it: more detail and a historical list of events that have happened for that particular section (and even per type of backup: logical/snapshot).
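As a rough illustration of the 3 layers, the two upper ones can be derived from the most granular one. This is only a sketch under assumptions: the input mapping, the function names and the rule "a section is good only if all of its backup types are good" are hypothetical, not something already agreed or implemented.

```python
from typing import Dict, Tuple

# Layer 3 (most granular): (section, backup_type) -> True when that backup is good.
# This mapping is a hypothetical input; how it is computed is out of scope here.
Status = Dict[Tuple[str, str], bool]


def per_section(status: Status) -> Dict[str, bool]:
    """Layer 2: a section is good only if every backup type in it is good."""
    result: Dict[str, bool] = {}
    for (section, _backup_type), ok in status.items():
        result[section] = result.get(section, True) and ok
    return result


def overall(status: Status) -> bool:
    """Layer 1: the at-a-glance panel is good only if everything is good."""
    return all(status.values())


example = {
    ("s1", "snapshot"): True,
    ("s1", "dump"): False,
    ("s2", "snapshot"): True,
}
print(per_section(example))  # s1 is bad because its dump is bad
print(overall(example))
```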

I wonder whether it would help, for navigational organization, to have a plan of ALL the potential features we would like to have in the long term, including those that are impossible to deliver within 3 months (so we can say what is important and what we discard), or whether that would be more confusing for @h.krishna?

Thanks @Marostegui for your input and feedback.
I think that sounds like a good plan @jcrespo -- some of the features might need to become stretch goals. First I need to analyse how hard the features are to implement (so that I get an idea of whether we have enough time to implement them by the end of GSoC).
Let me have a look at which features are a definite yes. I think the only feature that needs to be refined further is the "event" logs (e.g. "S3 entered RED state due to X, due to backup freshness, which was caused by failing backups").
With regard to measuring size issues, there was a mention of using past data to come up with an average change. This sounds like it would take a lot of effort, so it might have to be a stretch goal.
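For reference, the "average change from past data" idea could look roughly like the sketch below. It is one possible reading, not an agreed design: it assumes we have the sizes of past successful backups in chronological order, and the function names and the 10% tolerance are made up for illustration.

```python
from statistics import mean
from typing import List


def average_relative_change(sizes: List[int]) -> float:
    """Mean relative change between consecutive past backup sizes."""
    changes = [(b - a) / a for a, b in zip(sizes, sizes[1:]) if a > 0]
    return mean(changes) if changes else 0.0


def is_unusual(sizes: List[int], new_size: int, tolerance: float = 0.10) -> bool:
    """Flag new_size if it deviates from the size the past trend predicts.

    `tolerance` (10%) is an assumed value, not an agreed threshold.
    """
    if not sizes or sizes[-1] <= 0:
        return False  # no usable history, nothing to compare against
    expected = sizes[-1] * (1 + average_relative_change(sizes))
    if expected <= 0:
        return new_size > 0
    return abs(new_size - expected) / expected > tolerance


history = [100, 110, 121]          # grows by about 10% per backup
print(is_unusual(history, 133))    # close to the predicted size
print(is_unusual(history, 200))    # far from the predicted size
```

A plain mean is sensitive to one-off jumps (e.g. a large table being dropped), so a median or a rolling window might be worth comparing during the mockup.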

I need to do a quick analysis/mockup of some of these features and see how feasible they are. I will post the functional requirements here later today, narrowed down into "Doable within GSoC", "Doable as a stretch goal", and "Doable in the future". That way we will all have a good idea of the scope.