
In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes
Open, HighPublic

Description

The troubleshooting process around T370304 was needlessly painful.

Something as "simple" as being able to tell which k8s pods were talking with the s4 master required some really gross hacks. This made it very hard to even determine which MediaWiki deployment (mw-jobrunner vs mw-api-ext, basically) was the trigger of the outages each time.

It also took several outage cycles before we had a set of debugging tooling that any SRE could invoke manually or that worked automatically. (For example, see edit history on P67012)

This task is to discuss improvements we can make in the near-term (like, between ~hours and ~weeks?).

I have a lot of ideas and suggestions, but I'll file them as comments or subtasks rather than in this bug description.

Event Timeline

@Ladsgroup @Marostegui I have a mix of concrete proposals and naive questions for you re: MariaDB observability :)

Concrete proposal: monitor client connections as a load indicator for masters

As best I understand, the issue here wasn't any one particular query, but rather bursts of hundreds of queries hitting the master at once, and then causing enough lock contention within MariaDB to keep query processing clogged up for a long while. (Potentially also with some client-side retries compounding the issue?)

  1. Track client connections + disconnects with eBPF on the MariaDB masters.
    • In terms of performance impact/cost, this is much, much lower than strace, and lower still than perf record or similar. I'm pretty sure in-kernel tracing via kprobes doesn't add a context switch or even much synchronization cost.
    • There are existing bpfcc tools that do almost exactly what we need -- tcptracer-bpfcc will record all accepts and socket closes quite easily (P67435), and the similar-but-different tcplife-bpfcc script will even tell you how much data was sent each way plus the duration the connection was open (P67436). (This information is in the kernel's socket stats struct at close() time; it's not instrumenting read/write.)
    • We could write our own Python wrapper for one of these existing tools and add a Prometheus exporter, or we could explore packaging ebpf_exporter and writing the instrumentation + aggregation in that.
  2. Use that data to export two new counters to Prometheus:
    • Number of new connections accepted by the MariaDB master process, grouped by the namespace of the k8s pod (like mw-web or mw-jobrunner). Doing T344171 would make this grouping much easier, but we could do it other ways in the meantime. This captures "stampedes" of incoming queries from a source, regardless of the individual query cost.
    • Number of master-socket-seconds consumed, grouped again by k8s namespace. Basically, every time a connection is closed, add its duration to this counter. This metric aggregates an approximate per-query cost.
    • Both of these metrics will just be noise during a full master outage like the ones we were having, but I think will be very helpful as a potential load/saturation indicator while things are working, and particularly in the lead-up / trigger interval to an outage.
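As a sketch of what the aggregation side could look like (everything here is illustrative, not an existing exporter: the CIDR-to-namespace map and function names are made up, and a real implementation would use prometheus_client counters rather than plain dicts):

```python
"""Sketch: fold tcplife-style connection-close events into the two
proposed counters, grouped by a hypothetical pod-CIDR -> k8s-namespace map.
T344171 would make this mapping static per namespace; until then it could
be built from the k8s API."""
import ipaddress

# Hypothetical pod CIDRs; the real values would come from cluster config.
NAMESPACE_BY_CIDR = {
    ipaddress.ip_network("10.64.0.0/18"): "mw-web",
    ipaddress.ip_network("10.64.64.0/18"): "mw-jobrunner",
}

# The two proposed counters, modeled as dicts keyed by namespace.
connections_total = {}     # accepted connections (counted here at close)
socket_seconds_total = {}  # summed connection lifetimes

def namespace_for(ip_str):
    ip = ipaddress.ip_address(ip_str)
    for cidr, ns in NAMESPACE_BY_CIDR.items():
        if ip in cidr:
            return ns
    return "unknown"

def record_close(client_ip, duration_seconds):
    """Called once per closed connection, as tcplife-bpfcc would report it.

    Simplification: the connection counter is bumped at close() time here;
    a real exporter would bump it at accept() from tcptracer-bpfcc instead.
    """
    ns = namespace_for(client_ip)
    connections_total[ns] = connections_total.get(ns, 0) + 1
    socket_seconds_total[ns] = socket_seconds_total.get(ns, 0.0) + duration_seconds
```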
Naive question: how can we get better / any saturation signals?

Another feeling I had with this outage was basically "everything was fine until it wasn't" -- usage patterns that didn't cause massive outages months ago now caused them. But conventional load metrics like CPU utilization or disk IO don't show the issue. @Ladsgroup made a dashboard panel "Potential contention causes" which showed innodb_row_lock_waits as the big issue on s4, and eventually that let us verify a fix with the Cat-a-Lot change. But, like...

  • is that kind of metric something we can use as a signal of stress on the master?
  • is that kind of metric something we can use as a signal of long-term creeping increase of master load?
  • do we understand the "physical" meaning of some ranges of values for those contention counters?
Naive question: what needs to be done to propagate tracing context from Mediawiki to MariaDB?

It would be so nice to have an x-request-id (and ideally also a traceparent) accessible in the query text inside MariaDB. As I understand it, we need to be careful, or we'd break a bunch of aggregation that happens in Logstash, and also break information_schema (or things that use information_schema, like our slow query logger...?)
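For illustration, the common trick is to prepend the context as a SQL comment, which then shows up in the processlist and the slow query log; the aggregation breakage comes from every query text becoming unique unless the log pipeline strips the comment before grouping. A sketch (not how MediaWiki actually does it):

```python
import re

def annotate_query(sql, request_id, traceparent=None):
    """Prepend tracing context as a SQL comment so it is visible in
    SHOW PROCESSLIST and the slow query log."""
    ctx = f"x-request-id: {request_id}"
    if traceparent:
        ctx += f", traceparent: {traceparent}"
    return f"/* {ctx} */ {sql}"

def strip_annotation(sql):
    """What an aggregation pipeline (e.g. Logstash) would have to do
    before grouping identical queries -- otherwise every sample is unique."""
    return re.sub(r"^/\*.*?\*/\s*", "", sql)
```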

CDanis triaged this task as High priority.Aug 21 2024, 6:17 PM

@Ladsgroup @Marostegui I have a mix of concrete proposals and naive questions for you re: MariaDB observability :)

Concrete proposal: monitor client connections as a load indicator for masters

As best I understand, the issue here wasn't any one particular query, but rather bursts of hundreds of queries hitting the master at once, and then causing enough lock contention within MariaDB to keep query processing clogged up for a long while. (Potentially also with some client-side retries compounding the issue?)

  1. Track client connections + disconnects with eBPF on the MariaDB masters.
    • In terms of performance impact/cost, this is much, much lower than strace, and lower still than perf record or similar. I'm pretty sure in-kernel tracing via kprobes doesn't add a context switch or even much synchronization cost.
    • There are existing bpfcc tools that do almost exactly what we need -- tcptracer-bpfcc will record all accepts and socket closes quite easily (P67435), and the similar-but-different tcplife-bpfcc script will even tell you how much data was sent each way plus the duration the connection was open (P67436). (This information is in the kernel's socket stats struct at close() time; it's not instrumenting read/write.)
    • We could write our own Python wrapper for one of these existing tools and add a Prometheus exporter, or we could explore packaging ebpf_exporter and writing the instrumentation + aggregation in that.

I really like this idea. We'd need help from I/F, though, but I am happy to help as much as I can. I don't think it will have a huge impact on performance, but we can start trying it on misc masters. Ideally we should have this tooling/metric everywhere, not just masters. However, the amount of writes is way lower compared with reads, so we'd need to measure the impact on reads as well.
If we prefer to have this only on masters for performance reasons, we should make this controlled by puppet and the master role, which we assign in the hiera file of each master, so it gets enabled/disabled when we promote/demote a master.

We can of course explore the possibility to enable it for reads too.

  1. Use that data to export two new counters to Prometheus:
    • Number of new connections accepted by the MariaDB master process, grouped by the namespace of the k8s pod (like mw-web or mw-jobrunner). Doing T344171 would make this grouping much easier, but we could do it other ways in the meantime. This captures "stampedes" of incoming queries from a source, regardless of the individual query cost.
    • Number of master-socket-seconds consumed, grouped again by k8s namespace. Basically, every time a connection is closed, add its duration to this counter. This metric aggregates an approximate per-query cost.
    • Both of these metrics will just be noise during an outage like the ones we were having, but I think will be very helpful as a potential load/saturation indicator while things are working.

+1 - this would have been helpful during the s4 issues.

Naive question: how can we get better / any saturation signals?

Another feeling I had with this outage was basically "everything was fine until it wasn't" -- usage patterns that didn't cause massive outages months ago now caused them. But conventional load metrics like CPU utilization or disk IO don't show the issue. @Ladsgroup made a dashboard panel "Potential contention causes" which showed innodb_row_lock_waits as the big issue on s4, and eventually that let us verify a fix with the Cat-a-Lot change. But, like...

  • is that kind of metric something we can use as a signal of stress on the master?

This was a very specific case and it was hard to predict. In general, query latency monitoring (which we already graph) is a good indicator of when the master is starting to get overloaded.
Our thread_pool_size is set to 64 (so 64 concurrent RUNNING connections); measuring the number of them running can also be a good indicator: if they start to pile up, there's clearly contention (at some level) going on.

There are so many ways to stress a master that it is hard to tell, to be honest. This one is useful, but at the same time, in this specific case, we were also pretty blind during the outage (reminder: not even pt-heartbeat was able to write).
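For example, a simple saturation ratio could be derived from the thread-pool numbers above (a sketch; the 64 default mirrors the thread_pool_size mentioned, and the input is a dict parsed from SHOW GLOBAL STATUS):

```python
def thread_pool_saturation(status, thread_pool_size=64):
    """Fraction of the thread pool occupied by RUNNING connections,
    computed from a SHOW GLOBAL STATUS sample (values arrive as strings).
    A value at or above 1.0 means queries are piling up behind the pool."""
    return int(status["Threads_running"]) / thread_pool_size
```

Exported as a gauge, this would give an alertable "queries piling up" signal even when per-query latency graphs are ambiguous.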

  • is that kind of metric something we can use as a signal of long-term creeping increase of master load?

Long term, yes, but in very fast and concrete cases like s4, we'd probably not be fast enough to catch it anyway. Monitoring that along with query latency can give us early signs of masters starting to struggle.

  • do we understand the "physical" meaning of some ranges of values for those contention counters?

I don't get this question.

Something we should also do, and I've always wanted to do but we never found the momentum: take minutely snapshots of the masters (or any production DB, really) and store them somewhere for 1 or 2 days. Then we could see what was going on during the whole day and troubleshoot past issues, at least at 60-second granularity.
What to grab, probably, on a first iteration:

  • Show engine innodb status
  • Show global status
  • iostat/top
  • processlist

What do you think about that too?
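A sketch of what such a snapshot loop could look like (the directory path and the exact commands are placeholders; only the timestamped-file and retention logic is fleshed out):

```python
"""Sketch: minutely DB snapshots with a ~2-day retention window.
Paths and commands are illustrative placeholders, not a deployed tool."""
import pathlib
import subprocess
import time

SNAPSHOT_DIR = pathlib.Path("/var/log/db-snapshots")  # hypothetical path
RETENTION_SECONDS = 2 * 24 * 3600  # keep roughly 2 days of snapshots

# Placeholder command list per the proposal: SHOW ENGINE INNODB STATUS,
# SHOW GLOBAL STATUS, iostat/top, processlist.
COMMANDS = {
    "innodb-status": ["mysql", "-e", "SHOW ENGINE INNODB STATUS\\G"],
    "global-status": ["mysql", "-e", "SHOW GLOBAL STATUS"],
    "processlist": ["mysql", "-e", "SHOW FULL PROCESSLIST"],
    "iostat": ["iostat", "-x", "1", "1"],
}

def take_snapshot(now=None):
    """Run each collection command and write its output to a timestamped file."""
    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime(now or time.time()))
    SNAPSHOT_DIR.mkdir(parents=True, exist_ok=True)
    for name, cmd in COMMANDS.items():
        out = subprocess.run(cmd, capture_output=True, text=True).stdout
        (SNAPSHOT_DIR / f"{stamp}-{name}.txt").write_text(out)

def prune_old(dirpath, now=None, retention=RETENTION_SECONDS):
    """Delete snapshot files older than the retention window; return their names."""
    now = now or time.time()
    removed = []
    for f in pathlib.Path(dirpath).glob("*.txt"):
        if now - f.stat().st_mtime > retention:
            f.unlink()
            removed.append(f.name)
    return removed
```

Run from cron or a systemd timer once a minute, with prune_old keeping disk usage bounded; the "basic analysis" mentioned below could diff consecutive SHOW GLOBAL STATUS samples.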

Related: T198755: Log the query that caused a lock timeout

It would be so nice to have an x-request-id (and ideally also a traceparent) accessible in the query text inside MariaDB. As I understand it, we need to be careful, or we'd break a bunch of aggregation that happens in Logstash, and also break information_schema (or things that use information_schema, like our slow query logger...?)

It would also break the MariaDB query cache, I think (although apparently there are ways around that).

I think we do that already? Check the DB errors dashboard in Logstash.

It would be so nice to have an x-request-id (and ideally also a traceparent) accessible in the query text inside MariaDB. As I understand it, we need to be careful, or we'd break a bunch of aggregation that happens in Logstash, and also break information_schema (or things that use information_schema, like our slow query logger...?)

It would also break the MariaDB query cache, I think (although apparently there are ways around that).

We don't use query cache in production.

Concrete proposal: monitor client connections as a load indicator for masters

I really like this idea. We'd need help from I/F, though, but I am happy to help as much as I can. I don't think it will have a huge impact on performance, but we can start trying it on misc masters. Ideally we should have this tooling/metric everywhere, not just masters. However, the amount of writes is way lower compared with reads, so we'd need to measure the impact on reads as well.
If we prefer to have this only on masters for performance reasons, we should make this controlled by puppet and the master role, which we assign in the hiera file of each master, so it gets enabled/disabled when we promote/demote a master.
We can of course explore the possibility to enable it for reads too.

Great, thanks! I've suggested this as a possible project for next month's I/F team hackathon. I'll try to make sure it gets implemented and deployed to at least some masters while you're on sabbatical.

Naive question: how can we get better / any saturation signals?

Another feeling I had with this outage was basically "everything was fine until it wasn't" -- usage patterns that didn't cause massive outages months ago now caused them. But conventional load metrics like CPU utilization or disk IO don't show the issue. @Ladsgroup made a dashboard panel "Potential contention causes" which showed innodb_row_lock_waits as the big issue on s4, and eventually that let us verify a fix with the Cat-a-Lot change. But, like...

  • is that kind of metric something we can use as a signal of stress on the master?

This was a very specific case and it was hard to predict. In general, query latency monitoring (which we already graph) is a good indicator of when the master is starting to get overloaded.
Our thread_pool_size is set to 64 (so 64 concurrent RUNNING connections); measuring the number of them running can also be a good indicator: if they start to pile up, there's clearly contention (at some level) going on.
There are so many ways to stress a master that it is hard to tell, to be honest. This one is useful, but at the same time, in this specific case, we were also pretty blind during the outage (reminder: not even pt-heartbeat was able to write).

OK, interesting. To me this says two things:

  • the eBPF monitoring I suggested sounds really valuable
  • it might be interesting to have some metrics exposed in the MediaWiki client itself about performance of certain kinds of queries

  • do we understand the "physical" meaning of some ranges of values for those contention counters?

I don't get this question.

Basically I was asking what the increases in the row lock contention metric actually mean in terms of the code. Is it just the number of times that a query's execution had to pause on a row lock owned by someone else? Or is it, like, the number of thread-seconds/second that were spent waiting on locks? Or something else entirely?
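To make the question concrete: MariaDB exposes both Innodb_row_lock_waits (a count of times a statement had to wait) and Innodb_row_lock_time (total milliseconds spent waiting), so two SHOW GLOBAL STATUS samples taken an interval apart give an average wait per contention event, e.g.:

```python
def avg_row_lock_wait_ms(prev, curr):
    """Average milliseconds spent waiting per row-lock wait over an interval,
    from two SHOW GLOBAL STATUS samples (counters already parsed to ints).
    Innodb_row_lock_time is reported in milliseconds."""
    waits = curr["Innodb_row_lock_waits"] - prev["Innodb_row_lock_waits"]
    time_ms = curr["Innodb_row_lock_time"] - prev["Innodb_row_lock_time"]
    return time_ms / waits if waits else 0.0
```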


Something we should also do, and I've always wanted to do but we never found the momentum: take minutely snapshots of the masters (or any production DB, really) and store them somewhere for 1 or 2 days. Then we could see what was going on during the whole day and troubleshoot past issues, at least at 60-second granularity.
What to grab, probably, on a first iteration:

  • Show engine innodb status
  • Show global status
  • iostat/top
  • processlist

I think this sounds great too. And it would be a very simple thing to write as a script right on the master, including some basic analysis.

I think we do that already? Check DB errors dashboard in logstash.

We log the query that suffered a lock timeout. I don't think we log the one that *caused* it.

ABran-WMF lowered the priority of this task from High to Medium.Aug 26 2024, 6:20 AM
ABran-WMF raised the priority of this task from Medium to High.
ABran-WMF moved this task from Triage to In progress on the DBA board.

Great, thanks! I've suggested this as a possible project for next month's I/F team hackathon. I'll try to make sure it gets implemented and deployed to at least some masters while you're on sabbatical.

Let me know if I can help, I'd be happy to! This is a great idea indeed.

Something we should also do, and I've always wanted to do but we never found the momentum: take minutely snapshots of the masters (or any production DB, really) and store them somewhere for 1 or 2 days.

Let's talk about this when you get back; this is a great idea that seems doable.

What to get probably on a first iteration:

  • Show engine innodb status
  • Show global status
  • iostat/top
  • processlist

What do you think about that too?

I'll come up with a CR for this in T375263