Investigate elevated P99/connection errors for opensearch-ipoid beginning 17 March 2026 at ~1600 UTC
Open, In Progress, Medium, Public

Description

Per this Slack thread with @kostajh, we are seeing increased P99 latency and timeouts from MediaWiki (hosted in wikikube) to the opensearch-ipoid OpenSearch cluster (hosted in dse-k8s-eqiad), starting on 17 March 2026 at ~1600 UTC and continuing as I write this (19 March 2026 at 1900 UTC).

Ref:

Creating this ticket to:
  • [x] Understand impact
  • Investigate
  • Decide next steps

Event Timeline

bking updated the task description.
bking changed the task status from Open to In Progress. Mar 19 2026, 7:35 PM
bking claimed this task.
bking triaged this task as Medium priority.
bking updated the task description.

I'm looking at this issue as an SRE who supports the dse-k8s infrastructure where opensearch-ipoid is hosted. Here's what we've found so far (see the Slack thread for more details):

Impact

  • Users: A very small percentage of users are seeing failures (see logstash links below).
  • Logs: Impact is still unclear. It's possible that event logs are missing IP data.

Metrics

Cross-team consultations

  • Discussion with ServiceOps in #wikimedia-serviceops on IRC. There was some concern about not being able to see which individual wikikube hosts are having the connection issues, but individual wikikube hosts are logged and visible in the above logstash link. From a not-overly-thorough look, I believe this is affecting all wikikube hosts.
  • Discussion with NetEng in #wikimedia-dcops on whether there were any changes to our network infrastructure around 17 March 2026 at ~1600 UTC. There were not.
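A more thorough version of the "is this affecting all wikikube hosts?" check could be scripted against an export of the logstash records. The following is only a minimal sketch: the record shape (`host`, `error`), the host names, and the `tolerance` threshold are all hypothetical and do not reflect the real log schema.

```python
from collections import Counter


def errors_per_host(records):
    """Count error records (timeouts, connection failures) per source host."""
    return Counter(r["host"] for r in records if r["error"])


def looks_evenly_spread(counts, all_hosts, tolerance=3.0):
    """Return True if every host reports errors and none exceeds tolerance x the mean."""
    if not counts or any(h not in counts for h in all_hosts):
        return False
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) <= tolerance * mean


# Hypothetical records; real ones would come from a logstash export.
records = [
    {"host": "host-a", "error": True},
    {"host": "host-b", "error": True},
    {"host": "host-a", "error": False},
    {"host": "host-c", "error": True},
    {"host": "host-b", "error": True},
]
all_hosts = {"host-a", "host-b", "host-c"}

counts = errors_per_host(records)
print(dict(counts))
print(looks_evenly_spread(counts, all_hosts))
```

If a handful of hosts dominated the counts, that would point toward a host-level or rack-level cause rather than something common to the whole path between wikikube and dse-k8s.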

Changes Made So Far

  • Upgrade all dse-k8s workers to 10G NICs (T414787)

Possible Next Steps

  • Force all iPoid traffic to CODFW. In other words, change iPoid's MediaWiki config to point to the CODFW endpoint. That would probably get rid of the errors, but it would also increase latency for most requests.
  • Enable the service mesh. While I think this is a good idea, it would be better to wait until we are on the newest version of the upstream helm chart (see T414217 for that work).
  • Do nothing. (Only if the impact is low enough)

Less plausible next steps
More general upkeep/performance fixes, unlikely to directly address the issue

  • Update OpenSearch from the 2.x branch to the latest 3.x branch. The Semantic Search project (ref T412338) is using OpenSearch 3.x on the same k8s cluster with success.

Reader, if I missed or misstated anything, feel free to respond here.

Since we enabled discovery records (GeoIP awareness) in T417698, the P50 of iPoid has dropped by half. We also haven't seen any P99 spikes since the enablement. That being said, it has only been about 45 minutes as I write this. If the P99 spikes go away after a day, I'll close out this ticket. Otherwise, we'll work on enabling the service mesh as recommended by ServiceOps (see T406876).

I think it may be too soon to say -- https://logstash.wikimedia.org/goto/d5e3ee8da2d48ef63f89709bb22d0b5c shows similar levels of timeouts before this change was made. I agree it would make sense to re-evaluate after the service mesh is enabled, though.

24 hours later, I can confirm that the P99 still suffers from latency spikes and that timeouts are still increasing. The next step will be to try integrating the service mesh (see the linked ticket above).
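For anyone wondering how the P50 could drop by half while the P99 spikes persist: the two percentiles are largely independent when only a small fraction of requests are slow. Here is a minimal sketch using synthetic latencies (made-up numbers, not real iPoid measurements) and a simple nearest-rank percentile, which is not necessarily how our dashboards compute it.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Synthetic latencies in milliseconds: 97% fast requests, 3% timing out.
before = [20] * 97 + [2000] * 3  # hypothetical state before the change
after = [10] * 97 + [2000] * 3   # fast path twice as quick, slow tail unchanged

print(percentile(before, 50), percentile(before, 99))  # 20 2000
print(percentile(after, 50), percentile(after, 99))    # 10 2000
```

In this toy data the typical request gets twice as fast, yet the P99 does not move because the slow tail is untouched. That is consistent with a change that shortens the common path without addressing whatever causes the timeouts.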