There is a lot of overlap in the use cases that the Data Platform Engineering and SRE Observability teams support, especially when it comes to metrics and logging.
When emitting metric data, WMF engineers have to choose one platform or the other based on what they need to do with the data. As a result, some data is available in one platform, some in the other, and data is rarely available in both.
How might we rectify this?
This is a parent task to explore the issue and potential solutions together.
== Data Platform | Observability
WMF **[[ https://wikitech.wikimedia.org/wiki/Data_Platform | Data Platform ]]** is a collection of systems and services that enable data producers and consumers to discover, use, and collect data to derive insights, conduct research, and build new data products. The Data Platform is primarily maintained by the [[ https://www.mediawiki.org/wiki/Data_Platform_Engineering | Data Platform Engineering team ]].
The Data Platform Engineering team's mission is to empower Wiki Communities and the Wikimedia Foundation to gain insights, conduct research, and build compelling user experiences, through access to privacy-aware data and data platform services.
**[[ https://wikitech.wikimedia.org/wiki/SRE/Observability/About#What_is_observability? | Observability ]]** is about being able to understand what's happening inside a system just by observing it from the outside, without needing to interfere with its operation. [...] Observability typically involves collecting **metrics**, **logs**, and traces from applications and infrastructure, and then using this data to monitor system health, troubleshoot issues, understand system behavior, and improve performance and reliability.
The [[ https://wikitech.wikimedia.org/wiki/SRE/Observability/About | SRE Observability team's ]] mission is to equip teams across SRE and Technology with the tools, platforms, and insights they need to understand how their systems and services are performing.
== Product Metrics | Operational Metrics
Product metrics and logging are about understanding the way users interact with and use product features.
Operational metrics and logging are about understanding how a system is operating 'under the hood'.
However, there is no fundamental difference between these two kinds of data. Metrics and logs are both 'temporal' or 'event-like': a metric or logging data point contains data about something happening at a specific time.
E.g.
- //user A clicked the blue button at 05:00//
- //wiki page B was edited at 06:00//
- //encountered an error in the JS client at 16:00//
- //the amount of used memory for the service was 500 MB at 13:00//
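As a minimal sketch of this point, all four examples fit one record shape. The `Event` class, its field names, the event names, and the date are hypothetical illustrations, not an existing WMF schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """Hypothetical record: something happened at a specific time."""
    name: str
    timestamp: datetime
    attributes: dict = field(default_factory=dict)


def at(hour):
    # Arbitrary illustrative date; only the hour comes from the examples.
    return datetime(2024, 1, 1, hour, 0, tzinfo=timezone.utc)


# Product and operational data points share the same structure.
events = [
    Event("button_click", at(5), {"user": "A", "button": "blue"}),
    Event("page_edit", at(6), {"page": "B"}),
    Event("js_client_error", at(16)),
    Event("memory_used_bytes", at(13), {"value": 500_000_000}),
]
```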
The difference between operational and product 'metrics' of this kind is not about the data, but about the ways in which the data is queried.
Operational metrics:
- need to be queried close to real time
- are primarily used to ensure that software systems are operating well
- alerting is a primary need
- historical data is rarely required
Product metrics:
- rarely need to be queried in real time (usually hourly is more than timely enough)
- are used to make product decisions
- alerting is sometimes needed for data quality
- historical data is often required
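The contrast between the two lists can be sketched as two queries over the same timestamped data points. The function names, the five-minute window, and the alert threshold are assumptions chosen for illustration only.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def recent_error_count(points, now, window=timedelta(minutes=5)):
    """Operational-style query: how many errors in the last few minutes?"""
    return sum(
        1 for ts, name in points
        if name == "error" and now - window <= ts <= now
    )


def hourly_counts(points, name):
    """Product-style query: counts per hour across the whole history."""
    return Counter(
        ts.replace(minute=0, second=0, microsecond=0)
        for ts, n in points
        if n == name
    )


now = datetime(2024, 1, 1, 16, 0, tzinfo=timezone.utc)
points = [
    (now - timedelta(minutes=2), "error"),
    (now - timedelta(hours=3), "error"),
    (now - timedelta(hours=10, minutes=30), "click"),
    (now - timedelta(hours=10, minutes=10), "click"),
]

# Near-real-time check feeding an alert (threshold is an assumption).
alert = recent_error_count(points, now) >= 1
# Historical aggregation feeding a product decision.
clicks = hourly_counts(points, "click")
```

The data is identical in both cases; only the time range and the aggregation differ.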
== Capability gaps
Data Platform and Observability provide different capabilities to users. Often those users would like to be able to use capabilities from either stack with the same data. Notably:
Data Platform is missing:
- **public dashboarding** -- Superset is internal only
Observability is missing:
- ability to **join datasets with SQL**
- **data pipelining** tools
== Ideas
Some ideas on how to address the capability gaps:
- Make Grafana able to query Data Platform systems.
-- But how? Via Presto? Druid? The Data Lake stores PII, so this might be a no-go without a 'public data lake'.
- Make (product related) operational metrics and logging ingestible into the Data Platform.
-- Emit OpenMetrics/Prometheus compatible events and consume them into Prometheus (similar to how statsv works). (TODO: create task).
-- {T355837}
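To make the idea of a 'Prometheus compatible event' more concrete, here is a minimal sketch of rendering one event as a sample line in the Prometheus text exposition format. The event layout, the metric name, and the label names are hypothetical. This shows only the formatting step, not how statsv or any existing WMF pipeline works.

```python
def escape_label_value(value):
    # The text format escapes backslash, double quote, and newline
    # inside label values.
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def to_exposition_line(metric, labels, value):
    """Render one sample as `name{label="value",...} value`."""
    rendered = ",".join(
        f'{key}="{escape_label_value(val)}"'
        for key, val in sorted(labels.items())
    )
    if rendered:
        return f"{metric}{{{rendered}}} {value}"
    return f"{metric} {value}"


# Hypothetical event as it might arrive from a producer.
event = {
    "metric": "button_clicks_total",
    "labels": {"wiki": "enwiki", "button": "blue"},
    "value": 1,
}

line = to_exposition_line(event["metric"], event["labels"], event["value"])
print(line)
```

Since Prometheus counters are cumulative, a real consumer would presumably also need to aggregate individual events before exposing them for scraping.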