IPoid: Define service level indicators and service level objectives
Closed, Resolved (Public)

Description

Per https://wikitech.wikimedia.org/wiki/SLO, we will need to create some SLIs/SLOs in collaboration with SRE, and link to them from https://wikitech.wikimedia.org/wiki/Service/IPoid.

See https://wikitech.wikimedia.org/wiki/SLO#Published_SLOs for some examples.

Event Timeline

I would suggest we delay this exercise until iPoid-Service (IPoid OpenSearch) work is underway, as that would probably end up with a different SLI/SLO.

kostajh renamed this task from Define service level indicators and service level objectives to IPoid: Define service level indicators and service level objectives. Feb 2 2026, 12:11 PM
MLechvien-WMF subscribed.

Tagging Data-Platform-SRE to assess when this can be completed

Gehel subscribed.

This first requires DPE SRE to define SLIs/SLOs for the OpenSearch on k8s clusters in general. Then we will be able to define additional application-specific SLIs/SLOs.

I have drafted some SLI definitions here: https://wikitech.wikimedia.org/wiki/SLO/OpenSearch_IPoid#Service_Level_Indicators_(SLIs)

I'll also include them in this ticket, so we can discuss them off-wiki and amend them on wikitech, should that be required.
The SLIs that I'm suggesting are as follows:

  • Availability: The percentage of search queries that return a successful HTTP status code (2xx), excluding 4xx client errors.
  • Latency: The percentage of requests served in under 100 milliseconds.
  • Freshness: The Airflow task that downloads and indexes the Spur data has executed successfully in both data centres within the last 24 hours.
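To make the two ratio-based definitions concrete, here is a minimal Python sketch of the arithmetic behind the availability and latency indicators. The function names and sample counts are hypothetical and are not taken from the IPoid service; the real measurements come from the Prometheus metrics.

```python
import math


def availability_pct(responses_by_code: dict[int, int]) -> float:
    """Percentage of 2xx responses among all responses that are not 4xx."""
    ok = sum(n for code, n in responses_by_code.items() if 200 <= code < 300)
    eligible = sum(
        n for code, n in responses_by_code.items() if not 400 <= code < 500
    )
    # With no eligible traffic the ratio is undefined, so report NaN.
    return 100.0 * ok / eligible if eligible else math.nan


def latency_pct(durations_ms: list[float], threshold_ms: float = 100.0) -> float:
    """Percentage of requests served within threshold_ms.

    The comparison is inclusive, mirroring a histogram bucket upper bound.
    """
    if not durations_ms:
        return math.nan
    fast = sum(1 for d in durations_ms if d <= threshold_ms)
    return 100.0 * fast / len(durations_ms)


# Hypothetical sample: the 404s are excluded from both numerator and denominator.
sample_codes = {200: 990, 404: 50, 503: 10}
sample_durations = [10.0, 50.0, 100.0, 250.0]
```

With these made-up numbers, the 50 client errors do not count against availability, while the 10 server errors do.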

The PromQL queries for these measurements are:

  • Availability:
sum by (destination_service_namespace) (
  rate(istio_requests_total{
    source_workload_namespace="istio-system",
    app="istio-ingressgateway",
    destination_service_namespace="opensearch-ipoid",
    response_code=~"2.."
  }[5m])
)
/
sum by (destination_service_namespace) (
  rate(istio_requests_total{
    source_workload_namespace="istio-system",
    app="istio-ingressgateway",
    destination_service_namespace="opensearch-ipoid",
    response_code!~"4.."
  }[5m])
)
* 100
  • Latency:
sum(rate(istio_request_duration_milliseconds_bucket{
  source_workload_namespace="istio-system",
  app="istio-ingressgateway",
  destination_service_namespace="opensearch-ipoid",
  le="100"
}[5m]))
/
sum(rate(istio_request_duration_milliseconds_count{
  source_workload_namespace="istio-system",
  app="istio-ingressgateway",
  destination_service_namespace="opensearch-ipoid"
}[5m]))
  • Freshness:
(
  sum(increase(airflow_ti_finish{
    dag_id="spur_download_and_index_anonymous_residential_eqiad",
    task_id="download_and_index_feed_eqiad",
    state="success"
  }[24h])) > bool 0
)
+
(
  sum(increase(airflow_ti_finish{
    dag_id="spur_download_and_index_anonymous_residential_codfw",
    task_id="download_and_index_feed_codfw",
    state="success"
  }[24h])) > bool 0
)
== bool 2
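The freshness query relies on boolean arithmetic: each `> bool 0` comparison yields 1 or 0, the two results are added, and the sum is compared with 2. The Python sketch below models that logic under stated assumptions. The function names are hypothetical, and `None` stands in for a PromQL expression that returns no series at all, which, as far as I understand PromQL's vector matching, makes the whole expression return no data rather than 0.

```python
from typing import Optional


def gt_bool_zero(successes: Optional[float]) -> Optional[int]:
    """Model `sum(increase(...[24h])) > bool 0`: 1 or 0, or None if no series."""
    if successes is None:
        return None
    return 1 if successes > 0 else 0


def freshness(
    eqiad_successes: Optional[float], codfw_successes: Optional[float]
) -> Optional[int]:
    """Model `(eqiad > bool 0) + (codfw > bool 0) == bool 2`."""
    eqiad = gt_bool_zero(eqiad_successes)
    codfw = gt_bool_zero(codfw_successes)
    if eqiad is None or codfw is None:
        # An empty operand leaves nothing to match, so there is no result.
        return None
    return 1 if eqiad + codfw == 2 else 0
```

If the "no data" case matters for alerting, it may be worth checking how the dashboard treats a missing value as opposed to an explicit 0.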

It's worth noting that, while working on the latency indicator, I could clearly see regular periods where the percentage of requests served in under 100 milliseconds drops to about 80%.
https://grafana.wikimedia.org/goto/ffjdreu1e4q9sd?orgId=1

image.png (2,380×1,182 px, 376 KB)

I believe that this probably coincides with when the indexing job happens, but we can work on correlating this and tuning the system to mitigate it.

There are also some draft SLOs here: https://wikitech.wikimedia.org/wiki/SLO/OpenSearch_IPoid#Service_Level_Objectives

If you're happy with these, then I can start to work on the SLO dashboard and the burn-down alerts.

I believe that I know the cause of these periodic dips in the latency indicator. We're not distinguishing the two types of elasticsearch traffic from each other at the ingressgateway:

  • client requests from mediawiki
  • bulk_index requests from the airflow jobs

Naturally, the bulk_index API calls take longer than the client requests, which is why the ratio of fast requests drops during the indexing periods.
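A small worked example shows how mixing the two traffic types can pull the ratio down even when client requests are uniformly fast. The counts below are invented purely to reproduce a figure of about 80%; they are not measurements from the service.

```python
def fast_ratio(fast: int, total: int) -> float:
    """Fraction of requests that completed under the latency threshold."""
    return fast / total


# Hypothetical counts for one indexing window.
client = {"fast": 800, "total": 800}  # every client request under 100 ms
bulk = {"fast": 0, "total": 200}      # every bulk request over 100 ms

# Ratio as seen when both traffic types share one metric.
combined = fast_ratio(
    client["fast"] + bulk["fast"], client["total"] + bulk["total"]
)

# Ratio as seen when only client traffic is measured.
client_only = fast_ratio(client["fast"], client["total"])
```

Here the combined ratio falls to 0.8 while the client-only ratio stays at 1.0, so the client experience is unchanged even though the indicator drops.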

I was wondering how we might go about filtering this, so that we can get just the latency for the client request traffic. We don't have the request_path in the istio_requests_total metric, so some potentially useful existing labels include:

  • destination_canonical_service
  • destination_app
  • destination_service
  • destination_workload

Perhaps I can modify the opensearch-cluster chart to populate these labels differently, depending on whether or not the request path matches .*/_bulk.*.
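The path-based split could be sketched as follows. This only illustrates the matching rule in Python; it is not how Istio or the opensearch-cluster chart would implement it. The class names `bulk_index` and `client`, and the example index name `ipoid`, are hypothetical.

```python
import re

# The pattern from the comment above, applied to the whole request path.
BULK_PATH = re.compile(r".*/_bulk.*")


def traffic_class(request_path: str) -> str:
    """Return a hypothetical label value for the given request path."""
    if BULK_PATH.fullmatch(request_path):
        return "bulk_index"
    return "client"


# Hypothetical request paths.
examples = {
    "/ipoid/_bulk": traffic_class("/ipoid/_bulk"),
    "/_bulk?refresh=true": traffic_class("/_bulk?refresh=true"),
    "/ipoid/_search": traffic_class("/ipoid/_search"),
}
```

Once requests carry a class like this, the latency query could filter on it so that only client traffic counts toward the SLI.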

I'll look into it.

I believe that these SLIs and SLOs are now defined at: https://wikitech.wikimedia.org/wiki/SLO/OpenSearch_IPoid

The bulk index requests are now separate from the other types of request, so that they no longer skew the latency SLI toward slower responses.

I have requested a review from the SRE-SLO working group, but I think that we might be able to call this part done.

We also have some general documentation of the availability expectations for OpenSearch on k8s: https://wikitech.wikimedia.org/wiki/Data_Platform/Systems/OpenSearch-on-K8s

Thanks all for the work on this! @kostajh, as you were originally assigned here, it sounds like we can close this task.