While chatting with the team, it became clear that we don't know much about how to investigate an issue when ORES starts to emit a lot of errors, for example score_errored ones.
We should:
- Review our dashboards and see what needs fixing (some graphs are not displayed, etc.).
- Figure out how to debug an issue, for example where to look when somebody wants to investigate why a spike of score_errored is happening.
- Add basic alarms (SRE moved to AlertManager and we still have to add alarms to it; we have never done it, and the old alarms are gone).