It went undetected for 12 hours last time; we ought to do slightly better, I guess.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Gehel | T109089 EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade)
Resolved | | Gehel | T110171 Alert when ES indexes are freezed for more than 30 minutes
Event Timeline
I guess that by frozen indices, we mean freezing the jobs that write to Elasticsearch, not closing the indices in Elasticsearch itself. I'm not actually sure how that freezing works; I'll dig into the code to see if I can understand it.
This hasn't been touched in quite a while, so lowering priority and putting in the "Later" column. If this is important somehow, please feel free to let me know and we can shuffle it around.
It's an explicit follow-up from an incident. These should be prioritized appropriately alongside other "fun/new" work (iow: not dropped).
@greg Good to know. I chatted to @EBernhardson about it before reprioritising, and he said it's unclear how relevant this still is, given how our rolling restarts work now. Hopefully @Gehel should know more. :-)
> This hasn't been touched in quite a while, so lowering priority
I know this is a general Phabricator workflow thing, but I never understood this logic. In other ticket systems, priority would be raised when things had not been touched in a long time, not the other way around.
It's a fair point. I generally use task priority as descriptive; using that lens, if something hasn't been touched for over a year, then it's not really high priority, and keeping it marked as such is misleading. If everything is high priority, then nothing is. :-)
Already implemented in https://gerrit.wikimedia.org/r/c/operations/puppet/+/431754 as part of T193605