
[4 hour spike] Investigate how to add logging to keep track of query times
Open, MediumPublicSpike

Description

As we add more clauses to the queries, they're going to take longer. Before we launch category tracking in production, we should look into how much slower our queries get with it.

Let's add logging to see how long "Update stats" takes every time someone updates stats for an event. That's the time taken to run all individual queries on each wiki included in the event.
It would be fantastic if we could surface those stats in the UI on the staging instance. If not, it is okay to have them in the logs.

Create tasks to complete the discovered work.

Event Timeline

Niharika triaged this task as Medium priority.

The Symfony profiler will give you such data, so we're able to do this locally. I don't know how to get that on staging without putting the app in dev mode, which is probably not a good idea.

If we wanted to do our own logging, we could do some runtime checks in [[ https://github.com/wikimedia/grantmetrics/blob/master/src/AppBundle/Service/EventProcessor.php#L74 | EventProcessor::process ]] and log the event ID if it took over N seconds to finish. We could also use [[ https://github.com/wikimedia/grantmetrics/blob/master/src/AppBundle/Repository/Repository.php#L383 | Repository::executeReplicaQuery ]] and [[ https://github.com/wikimedia/grantmetrics/blob/master/src/AppBundle/Repository/Repository.php#L402 | Repository::executeQueryBuilder ]] to log individual slow queries. I'm pretty confident in the current queries used to perform COUNTs. The slow part is not so much the individual queries as the number of them -- say, for events on *.wikipedia with participants who edit many Wikipedias, especially if some of those users have a high edit count on a given wiki. Grant Metrics has to query each wiki (that they collectively have edited) individually.
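The per-query timing idea could be sketched as follows. This is a minimal illustration in Python (the actual repository methods are PHP); `run_query` and the threshold value are assumptions, not names from the codebase:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("slow-queries")

# Hypothetical threshold; would need tuning against real event data.
SLOW_QUERY_THRESHOLD = 5.0  # seconds

def timed_query(run_query, sql, *args):
    """Run a query via the given callable; log it if it exceeds the threshold."""
    start = time.monotonic()
    result = run_query(sql, *args)
    elapsed = time.monotonic() - start
    if elapsed > SLOW_QUERY_THRESHOLD:
        # Truncate the SQL so log lines stay manageable.
        logger.warning("Slow query (%.2fs): %s", elapsed, sql[:200])
    return result
```

The same wrapper pattern would apply inside the repository methods linked above, wrapping each replica query with a timer.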

Do we have slow query logging on the replicas? There are many tools that can help with this on the database side but I'm not sure what is available in our production environments.

> Do we have slow query logging on the replicas? There are many tools that can help with this on the database side but I'm not sure what is available in our production environments.

I think the admins can see it somewhere. Nothing else is already in place on Toolforge as far as I know.

Adding our own logging shouldn't be hard, but it could get quite noisy. Overall I don't think slow queries are bad -- they are unavoidable for analytical tools like Grant Metrics (and XTools, etc.). There's probably room for improvement, but you should be able to simulate a worst-case scenario locally. Categories, meanwhile, should in theory make the queries faster, as there are fewer revisions to scan.

I think the goal would be to identify more than just slow queries; I agree those alone are potentially not actionable. Specifically, what we'd look for are queries that time out, or that come very close to the timeout threshold.

It's fine to simulate things but real-world usage is always different so having instrumentation to see and identify issues or potential issues is valuable. As you say, in a system like this, the DB is often the bottleneck and so we want to pay extra attention to it.

> I think the goal would be to identify more than just slow queries; I agree those alone are potentially not actionable. Specifically, what we'd look for are queries that time out, or that come very close to the timeout threshold.

Exactly. If our timeout threshold is 900 seconds and we discover that 90% of 'Update stats' requests complete within 500 seconds -- that's good.
If we discover that 90% of them are hitting the 900-second limit, then we have a problem.
Basically, I want us to have a way to figure out how long queries (for actual user-created events) take on average. That will tell us whether we need to focus more on performance or whether we're okay.
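That kind of check is easy once runtimes are logged. A hypothetical sketch, assuming we have a list of per-event 'Update stats' durations in seconds and the 900-second timeout mentioned above:

```python
TIMEOUT = 900  # seconds, per the discussion above

def runtime_report(durations, timeout=TIMEOUT, danger_ratio=0.9):
    """Summarize how close a set of logged runtimes comes to the timeout."""
    if not durations:
        return {"count": 0, "avg": 0.0, "near_timeout": 0}
    # Count runs at or above 90% of the timeout as "danger zone".
    near = sum(1 for d in durations if d >= timeout * danger_ratio)
    return {
        "count": len(durations),
        "avg": sum(durations) / len(durations),
        "near_timeout": near,
    }
```

For example, `runtime_report([100, 500, 890])` flags the 890-second run as near the timeout while the average stays under 500 seconds.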

It doesn't have to be in the UI. It can be in the logs which we can periodically go and look at. If we can have something better, that's great.

Just a note that you can grep the logs for `error-query-timeout` to find instances where queries timed out, since this throws an exception. It does not give you the SQL, but it does give you the URI. Thankfully there have been no such exceptions in Event Metrics history!
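Scanning the logs for those timeout exceptions could look like the following sketch. The `error-query-timeout` marker is from the comment above; the exact log line format here is an assumption:

```python
import re

def find_timeouts(log_lines):
    """Return request URIs from log lines that mention a query timeout."""
    uris = []
    for line in log_lines:
        if "error-query-timeout" in line:
            # Grab the first path-like token, assumed to be the request URI.
            match = re.search(r"(/\S+)", line)
            if match:
                uris.append(match.group(1))
    return uris
```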

jmatazzoni renamed this task from Add logging to keep track of query times to Investigate how to add logging to keep track of query times [4 hour spike]. Dec 11 2018, 12:44 AM

I wonder if we could have access to turn this on for the grantmetrics database: https://mariadb.com/kb/en/library/slow-query-log-overview/
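For reference, enabling that on a MariaDB server we control would look like the fragment below (it requires SUPER privileges, which we likely do not have on the Wiki Replicas; this is just a sketch of what we'd be asking the admins for):

```sql
-- Enable MariaDB's slow query log at runtime
SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 10;  -- seconds; queries slower than this get logged
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';
```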

MBinder_WMF renamed this task from Investigate how to add logging to keep track of query times [4 hour spike] to [4 hour spike] Investigate how to add logging to keep track of query times. Dec 11 2018, 12:48 AM
MBinder_WMF added a project: Spike.

We had a discussion about it today and decided to try using https://grafana-labs.wikimedia.org; the underlying statsd host is labmon1001.eqiad.wmnet.
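For the record, statsd timing metrics are plain UDP datagrams in `name:value|ms` form, so reporting an 'Update stats' runtime needs no special client. A sketch; the metric name is made up, and 127.0.0.1 stands in for the real host (labmon1001.eqiad.wmnet) so the example is runnable anywhere:

```python
import socket

STATSD_HOST = "127.0.0.1"  # placeholder; production would point at labmon1001.eqiad.wmnet
STATSD_PORT = 8125  # conventional statsd UDP port

def send_timing(metric, millis, host=STATSD_HOST, port=STATSD_PORT):
    """Send a statsd timing metric as a fire-and-forget UDP datagram."""
    payload = f"{metric}:{int(millis)}|ms"
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("ascii"), (host, port))
    finally:
        sock.close()
    return payload
```

Since UDP is fire-and-forget, this adds effectively no latency to the update process even if the metrics host is unreachable.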

I have a PR that sort of helps: it logs the runtime of each chunk of the update process. This is sent to a dedicated log file, so it will be easier to track down problems. https://github.com/wikimedia/eventmetrics/pull/216

MBinder_WMF changed the subtype of this task from "Task" to "Spike".Jun 11 2019, 11:24 PM