Data Platform, SRE Observability, overlaps, use cases, and potential
Open, Needs Triage, Public

Description

There is a lot of overlap in the use cases that the Data Platform Engineering and SRE Observability teams support, especially when it comes to metrics and logging.

When emitting metric data, WMF engineers have to choose one platform or the other based on what they need to do with the data. As a result, some data is available in one platform, some in the other, and data is rarely available in both.

How might we rectify this?

This is a parent task to explore the issue and potential solutions together.

Data Platform | Observability

WMF Data Platform is a collection of systems and services that enable data producers and consumers to discover, use, and collect data to derive insights, conduct research, and build new data products. The Data Platform is primarily maintained by the Data Platform Engineering team.

The Data Platform Engineering team's mission is to empower Wiki Communities and the Wikimedia Foundation to gain insights, conduct research, and build compelling user experiences, through access to privacy-aware data and data platform services.

Observability is [...] the ability to collect data about programs' execution, modules' internal states and communication between components. Observability typically involves collecting metrics, logs, and traces from applications and infrastructure, and then using this data to monitor system health, troubleshoot issues, understand system behavior, and improve performance and reliability.

The SRE Observability team's mission is to equip teams across SRE and Technology with the tools, platforms, and insights they need to understand how their systems and services are performing.

Product Metrics | Operational Metrics

Product metrics and logging are about understanding the way users interact with and use product features.

Operational metrics and logging are about understanding how a system is operating 'under the hood'.

However, there is no fundamental difference between these two kinds of data. Both metrics and logs are 'temporal' or 'event-like': each data point records something that happened at a specific time.

E.g.

  • user A clicked the blue button at 05:00
  • wiki page B was edited at 06:00
  • encountered an error in the JS client at 16:00
  • the amount of used memory for the service was 500 MB at 13:00
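
To make the shared shape concrete, here is a minimal Python sketch that models all four examples above with a single record type. The `Event` class, its field names, the attribute keys, and the date are hypothetical and chosen only for illustration; they are not an existing WMF schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class Event:
    """Hypothetical record: something that happened at a specific time."""
    name: str
    timestamp: datetime
    attributes: dict[str, Any] = field(default_factory=dict)


def at(hour: int) -> datetime:
    # Arbitrary illustrative date; only the hour comes from the examples above.
    return datetime(2024, 1, 1, hour, 0, tzinfo=timezone.utc)


# All four examples fit the same structure, whatever they are later used for.
events = [
    Event("button_click", at(5), {"user": "A", "button": "blue"}),
    Event("page_edit", at(6), {"page": "B"}),
    Event("js_client_error", at(16)),
    Event("service_memory_used", at(13), {"used_memory_mb": 500}),
]

for event in sorted(events, key=lambda e: e.timestamp):
    print(event.timestamp.isoformat(), event.name, event.attributes)
```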

The difference between operational and product 'metrics' of this kind is not about the data, but about the ways in which the data is queried.

Operational metrics:

  • need to be queried close to real time
  • are primarily used to ensure that software systems are operating well
  • alerting is a primary need
  • historical data is rarely required

Product metrics:

  • rarely need to be queried in real time (hourly is usually more than timely enough)
  • are used to make product decisions
  • alerting is sometimes needed for data quality
  • historical data is often required
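
The two access patterns can be sketched against the same stream of data points. This is a minimal illustration in plain Python, assuming hypothetical `(timestamp, name, value)` tuples, a made-up five-minute window, and a made-up alert threshold; it does not reflect how either stack actually stores or queries data.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Hypothetical data points: (timestamp, name, value).
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
points = [
    (now - timedelta(days=30), "button_click", 1),
    (now - timedelta(hours=2, minutes=10), "button_click", 1),
    (now - timedelta(hours=2, minutes=5), "button_click", 1),
    (now - timedelta(minutes=3), "js_client_error", 1),
    (now - timedelta(minutes=2), "js_client_error", 1),
    (now - timedelta(minutes=1), "js_client_error", 1),
]


def should_alert(points, name, now, window=timedelta(minutes=5), threshold=3):
    """Operational-style query: only the most recent window matters."""
    recent = sum(
        value
        for ts, point_name, value in points
        if point_name == name and now - window <= ts <= now
    )
    return recent >= threshold


def hourly_counts(points, name):
    """Product-style query: aggregate the full history into hourly buckets."""
    buckets = Counter()
    for ts, point_name, value in points:
        if point_name == name:
            hour = ts.replace(minute=0, second=0, microsecond=0)
            buckets[hour] += value
    return dict(buckets)


print(should_alert(points, "js_client_error", now))
for hour, count in sorted(hourly_counts(points, "button_click").items()):
    print(hour.isoformat(), count)
```

The data is identical in both cases; only the time range and the aggregation differ.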

Capability gaps

Data Platform and Observability provide different capabilities to users. Often those users would like to be able to use capabilities from either stack with the same data. Notably:

Data Platform is missing:

  • public dashboarding -- Superset is internal only
  • alerting -- can only be done with manual emails

Observability is missing:

  • historical metric support -- can only produce metrics for the current time
  • ability to join datasets with SQL
  • data pipelining tools

Ideas

Some ideas on how to address the capability gaps:

  • Make Grafana able to query Data Platform systems.
    • But how? Via Presto? Druid? The Data Lake stores PII, so this might be a no-go without a 'public data lake'.

Overlap examples

Event Timeline

CC @Kappakayala this is a parent task that we can use to collect observability use cases that overlap with product metrics. Hoping this will come in handy during hypothesis drafting for next fiscal year.