User Details
- User Since
- Oct 15 2019, 4:02 PM (179 w, 6 d)
- Availability
- Available
- IRC Nick
- rzl
- LDAP User
- RLazarus
- MediaWiki User
- Unknown
Mon, Mar 20
Sun, Mar 12
Fri, Mar 10
Perfect, thank you! I started https://wikitech.wikimedia.org/wiki/Incidents/2023-02-11_logstash_latency and filled in what we know so far (as far as I know, anyway).
Tue, Mar 7
Mon, Mar 6
@akosiaris and @Clement_Goubert will come up with a cluster layout this week, and @Clement_Goubert wanted to try putting at least one or two into service themselves. Feel free to assign to me afterward to churn through the rest.
Thu, Mar 2
Adding @CDanis as we were just talking about something along these lines.
Feb 25 2023
Feb 24 2023
Feb 23 2023
Sure, we could look at adding a warmup step to the server repool process. Historically we haven't worried about it, because the impact of one cold host is much smaller than when the entire cluster is cold, but it's worth looking at. I'd rather make this change in place first, though; then it would be easy to also install something on each host. (We might choose to wait and do this until after mw-on-k8s, but we can definitely take a look.)
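For illustration only, here's a rough sketch of what a per-host warmup pass could look like, assuming a flat list of warmup URLs and direct HTTP access to the freshly repooled appserver. The hostnames, paths, and port below are placeholders, not the real tooling or URL list.

```
#!/usr/bin/env python3
"""Minimal sketch of a per-host warmup pass after repooling.

Assumptions (not the real tooling): a flat list of (Host header, path)
pairs, and direct HTTP access to the repooled appserver.
"""
import requests

# Hypothetical warmup targets; a real list would come from whatever
# URL file the switchover automation already uses.
WARMUP_URLS = [
    ("en.wikipedia.org", "/wiki/Main_Page"),
    ("www.wikidata.org", "/wiki/Wikidata:Main_Page"),
]


def warm_host(server: str, port: int = 80) -> None:
    """Send each warmup request directly to one freshly repooled server."""
    session = requests.Session()
    for host_header, path in WARMUP_URLS:
        url = f"http://{server}:{port}{path}"
        resp = session.get(url, headers={"Host": host_header}, timeout=30)
        print(f"{host_header}{path} -> {resp.status_code}")


if __name__ == "__main__":
    # Placeholder hostname, for illustration only.
    warm_host("mw9999.example.wmnet")
```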
Feb 21 2023
We decided we'll put these into service after the upcoming DC switchover, so we'll make a plan at the March 6 serviceops meeting.
Feb 15 2023
Feb 14 2023
Feb 9 2023
Perfect, thanks!
@Krinkle Are you aware of any current uses of warmup.js besides the DC switchover automation? Anywhere else I need to maintain compatibility, or adapt either humans or software to call the new script?
Feb 7 2023
POSTs will probably have to wait for the Python rewrite, but then they'll be easy. Can you recommend specific requests?
Feb 6 2023
I'm working on this.
In that case it sounds like yes, we do need cache warming in eqiad before repooling it -- and we'll need to add URLs to warm up s8, per this task.
Feb 3 2023
Agreed, we don't need it in order to switch the RW site to codfw.
Feb 2 2023
This is deployed! Thanks again for the patch, let me know if you need anything else.
Feb 1 2023
Jan 31 2023
Jan 27 2023
Jan 18 2023
(Back from vacation, sorry for the delay.) Yeah, I think we can close this. Thanks!
Dec 14 2022
Dec 2 2022
Thanks @Volans. DC ops, for mw1320, I wasn't able to manually shut it off -- please do just kill the power when you go in to unrack it. Thanks!
Over to dcops!
Dec 1 2022
Oops, yes, that should have been 306162. Thanks for the catch!
Nov 29 2022
Nov 24 2022
Changed my mind on this -- still going to look into other solutions, but I did bump the deadline to 60s so that it doesn't spuriously alert in the meantime.
Good find, thanks.
Oct 18 2022
Thanks John!
Oct 17 2022
Oct 14 2022
I was surprised to see SHOW PROCESSLIST didn't empty out after "very few minutes" as wt:MariaDB/troubleshooting#Depooling_a_replica suggests -- for the record, here's what it looked like, in full, 15-20 minutes after @Joe depooled it: P35486
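For anyone repeating this later, here's a rough sketch of one way to watch a depooled replica drain, assuming direct MySQL access via pymysql. The credentials and hostname are placeholders, and in practice running SHOW PROCESSLIST from the mysql client by hand is just as good.

```
"""Sketch of polling a depooled replica until client connections drain.

Placeholder credentials/hostname; not a supported tool.
"""
import time

import pymysql


def remaining_client_threads(conn) -> list:
    """Return PROCESSLIST rows that aren't replication/system threads."""
    with conn.cursor() as cur:
        cur.execute("SHOW PROCESSLIST")
        rows = cur.fetchall()
    # Column order: Id, User, Host, db, Command, Time, State, Info
    return [r for r in rows if r[1] not in ("system user", "event_scheduler")]


def wait_for_drain(host: str, interval: int = 60) -> None:
    conn = pymysql.connect(host=host, user="watch", password="...", read_timeout=10)
    while True:
        left = remaining_client_threads(conn)
        print(f"{len(left)} client threads still connected")
        if len(left) <= 1:  # only our own monitoring connection remains
            break
        time.sleep(interval)


# wait_for_drain("db9999.example.wmnet")  # placeholder hostname
```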
Oct 13 2022
Oh, one more angle to think about! There are two Maps services we're writing SLOs for. For Kartotherian, we're just planning a request latency target at the 50th and 95th percentiles -- i.e., the Prometheus query is the same except for the percentile. But for Tegola, we're measuring two different types of latency (HTTP request latency vs. cache operation time), each of them at p50 and p95. So we'd end up writing different queries, too.
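To make that concrete, here's a rough sketch of how the queries could be parameterized; the metric names and label matchers are made up for illustration, and the real ones would come from the services' dashboards.

```
"""Sketch of the SLI query shapes, with made-up metric names."""

# Kartotherian: one query shape, only the quantile changes.
KARTOTHERIAN_LATENCY = (
    "histogram_quantile({q}, sum by (le) ("
    "rate(kartotherian_request_duration_seconds_bucket[5m])))"
)

# Tegola: two distinct latency signals, each at two quantiles.
TEGOLA_HTTP_LATENCY = (
    "histogram_quantile({q}, sum by (le) ("
    "rate(tegola_http_request_duration_seconds_bucket[5m])))"
)
TEGOLA_CACHE_LATENCY = (
    "histogram_quantile({q}, sum by (le) ("
    "rate(tegola_cache_operation_duration_seconds_bucket[5m])))"
)


def kartotherian_queries() -> list:
    # Two queries, identical except for the percentile.
    return [KARTOTHERIAN_LATENCY.format(q=q) for q in (0.5, 0.95)]


def tegola_queries() -> list:
    # Four queries: two different latency types, each at p50 and p95.
    return [
        template.format(q=q)
        for template in (TEGOLA_HTTP_LATENCY, TEGOLA_CACHE_LATENCY)
        for q in (0.5, 0.95)
    ]
```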
Sep 21 2022
I was wondering if https://gerrit.wikimedia.org/r/c/operations/puppet/+/831230 might be related, but I couldn't work out exactly what's going on here.
Sep 14 2022
Aug 29 2022
T312947 already tracks the larger question of how to organize runbooks for ProbeDown effectively.
Aug 23 2022
That sounds right to me; it would give us the same distribution as codfw, which is probably as much work as we need to do on this. I don't think it's worth investing time into any deliberate benchmarking, but if the cluster happens to saturate unevenly again in a future incident, we can tweak opportunistically.
Aug 21 2022
Aug 5 2022
Aug 2 2022
!log rzl@cumin2002 START - Cookbook sre.hosts.remove-downtime for mc2038.codfw.wmnet
!log rzl@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2038.codfw.wmnet
Due to T309956 I'm moving ahead with mc2038 early, and using it to replace mc2024, which is currently out of service.
Jul 31 2022
Yep.
Jul 29 2022
Marking this resolved, thank you very much!
Jul 28 2022
@Papaul All yours!
Jul 27 2022
N.B. this is only seven hosts, mw225[1-5,7-8] -- mw2256 was already decommed in T263065.
Hi @CRoslof -- adding you here, per my email just now.
Jul 26 2022
With https://gerrit.wikimedia.org/r/813924 we ought to see smaller bursts in utilization, so I'm going to tentatively crank the shellbox replicas back down to 8, where they were before https://gerrit.wikimedia.org/r/803953.
Jul 25 2022
Yep, looks much better! Closing.
I'd like to redo @Legoktm's manual test first, and make sure I can't reproduce a spike -- I'll do that later today, and close afterward.