= Background information
In our project we follow the observability guidelines established by our SREs, described here:
- [[ https://wikitech.wikimedia.org/wiki/Observability | wiki/Observability ]]
- [[ https://wikitech.wikimedia.org/wiki/Observability/Dashboard_guidelines | wikitech/Observability/Dashboard_guidelines ]]
Here is what our template Grafana dashboard looks like:
- [[ https://grafana.wikimedia.org/d/stpmz_7Wz/template-dashboard?orgId=1&refresh=1m | Grafana template dashboard ]]
Some example dashboards from our services:
- [[ https://grafana.wikimedia.org/d/NQO_pqvMk/push-notifications?orgId=1&refresh=1m | Example of a Node.js service dashboard (our push notifications service) ]]
- [[ https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1 | Example of Go service (sessionstore) ]]
Most system metrics are generated directly by our Kubernetes setup, for example:
- Memory/CPU/Disk metrics
- Network metrics
- HTTP metrics
- Kubernetes metrics
Here are some ideas for Tegola specifically:
- Go runtime
  - Memory
    - Garbage collection
      - Timings
      - Usage
    - The Prometheus Go client exposes a variety of memory usage metrics by default
      - `go_memstats_xxx`
  - Goroutines
    - Number of goroutines
- HTTP endpoints
  - Timings for the request/response cycle
    - Per status code type/method
    - Additional metadata related to tile/map/layer/zoom level, if applicable?
  - Request/response sizes
- Cache
  - Number of hits/misses
  - Get/put timings
  - Seeding/purging stats
  - Connection status (active/failed/reconnects) to the cache backend
- Database connection
  - Number of connections
  - Query timings
===== Upstream issue
https://github.com/go-spatial/tegola/issues/714
===== Why track upstream?
One reason we chose Tegola for our infrastructure is that it is an upstream project well supported by the OSS maps community. With a few changes, such as the one in this task, Tegola will meet our infrastructure needs. We track this on Phabricator to be transparent about what is going on in the project, but all work and discussion will happen on the Tegola GitHub page so that we can communicate better with its community.
= Open questions
- Do we need fine-grained observability of the Swift cache (i.e. connection pool)?
= Acceptance criteria
- [ ] Documentation is updated
- [ ] Grafana dashboard is created to monitor the aforementioned metrics