
Project deployment-prep instance deployment-cirrussearch12 is down
Closed, Invalid | Public

Description

Common information

  • summary: Project deployment-prep instance deployment-cirrussearch12 is down
  • alertname: InstanceDown
  • instance: deployment-cirrussearch12
  • job: node
  • project: deployment-prep
  • severity: warning

Firing alerts


  • summary: Project deployment-prep instance deployment-cirrussearch12 is down
  • alertname: InstanceDown
  • instance: deployment-cirrussearch12
  • job: node
  • project: deployment-prep
  • severity: warning

Event Timeline

bd808 added subscribers: bking, bd808.
bd808@mbp03:~$ ssh deployment-cirrussearch12.deployment-prep.eqiad1.wikimedia.cloud
bd808@deployment-cirrussearch12:~$ w
 00:32:57 up 358 days, 10:13,  5 users,  load average: 0.20, 0.07, 0.02
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
root     ttyS0    -                26Mar25 358days  0.58s  0.57s -bash
bd808    pts/0    172.16.17.143    00:32    1.00s  0.06s  0.00s w
root     pts/1    tmux(3521968).%0 30May25 293days  0.05s  0.05s -bash
root     pts/2    tmux(3521968).%1 30May25 293days  0.03s  0.03s -bash
bking    pts/3    tmux(3524570).%0 30May25 293days  0.05s  0.05s -bash

Prometheus recorded a data gap at 2026-03-19 18:48, followed by a load spike. Nothing especially interesting shows up in journalctl around that time, but there are a whole lot of log entries like:

Mar 19 19:00:31 deployment-cirrussearch12 opensearch[3155309]: [78641.802s][info ][safepoint] Application time: 1.0001225 seconds
Mar 19 19:00:31 deployment-cirrussearch12 opensearch[3155309]: [78641.802s][info ][safepoint] Entering safepoint region: Cleanup
Mar 19 19:00:31 deployment-cirrussearch12 opensearch[3155309]: [78641.802s][info ][safepoint] Leaving safepoint region
Mar 19 19:00:31 deployment-cirrussearch12 opensearch[3155309]: [78641.802s][info ][safepoint] Total time for which application threads were stopped: 0.0002797 seconds, Stopping threads took: 0.0000355 seconds
Mar 19 19:00:31 deployment-cirrussearch12 opensearch[3155309]: [78642.802s][info ][safepoint] Application time: 1.0002177 seconds

The JVM is entering a safepoint about once a second, which sort of looks like it is constantly invoking GC. Is there anything worth looking into more deeply here, @bking?
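
One way to check whether those safepoint entries actually matter is to add up how long application threads were stopped and compare that with the reported application time. The following is a rough sketch, not an established tool: the function names are made up for illustration, and the regexes assume the exact line wording shown in the excerpt above.

```python
import re
import sys

# Patterns assume the safepoint wording seen in the journal excerpt above.
APP_TIME = re.compile(r"Application time: ([0-9.]+) seconds")
STOPPED = re.compile(r"application threads were stopped: ([0-9.]+) seconds")


def summarize_safepoints(lines):
    """Return (total_app_seconds, total_stopped_seconds, pause_count)."""
    app_total = 0.0
    stopped_total = 0.0
    pauses = 0
    for line in lines:
        match = APP_TIME.search(line)
        if match:
            app_total += float(match.group(1))
            continue
        match = STOPPED.search(line)
        if match:
            stopped_total += float(match.group(1))
            pauses += 1
    return app_total, stopped_total, pauses


def stopped_fraction(app_total, stopped_total):
    """Fraction of the covered wall time spent with threads stopped."""
    total = app_total + stopped_total
    return stopped_total / total if total else 0.0


if __name__ == "__main__":
    # Reads journal lines from stdin and prints a one-line summary.
    app, stopped, count = summarize_safepoints(sys.stdin)
    print(f"pauses={count} app={app:.3f}s stopped={stopped:.6f}s "
          f"fraction={stopped_fraction(app, stopped):.6%}")
```

It could be fed from something like `journalctl -u opensearch --since "2026-03-19 18:30" | python3 safepoints.py`, where the unit name `opensearch` and the script filename are assumptions. A stopped fraction near zero would suggest the entries are log noise rather than real GC pressure.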