Page MenuHomePhabricator

Improve TransactionProfiler as replacement for tendril's slow queries
Closed, ResolvedPublic

Description

Notes that have been mentioned in T273054 and some other places:

  • Queries should not get truncated and normlized. At least the non-normalized query should be provided as well.
  • TrxId should be removed from messages and generalized queries so it can be aggregated
  • Fix rest of hard-coded values in warnings to enable aggregation
  • Build a slow queries dashboard in grafana for easy access.
  • Possible sorting?
  • Add req id
  • (add more)

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change 738985 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Make TransactionProfiler logs more useful

https://gerrit.wikimedia.org/r/738985

@Ladsgroup would that cope with T291419?

Change 739672 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Add request id and full query to transaction profiler

https://gerrit.wikimedia.org/r/739672

Change 738985 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Make TransactionProfiler logs more useful

https://gerrit.wikimedia.org/r/738985

I don't know how hard it will be, as it may will require complex code changes, but having a user, ip or actor id available would be nice too if reasonable (they used to be there, but it was lost at some point).

Change 739672 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Add full query to transaction profiler

https://gerrit.wikimedia.org/r/739672

I'm landing this, but look out for how this behaves in the case of large/complex queries. Given that this uses a non-standard field, I don't think we have truncation in placec here, so it seems likely that in the case of large queries, we will end up dropping the entire message (partily preserved as unparsed/invalid json under channel:jsonTruncated) and thus will not aggregate toward the normalized aggregate and not show up in the slow query dashboard.

If that goes wrong in the way I anticipate, we may want to truncate these explicitly in the TransactionProfiler class upto some arbitrary threshold.

Change 739672 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Add full query to transaction profiler

https://gerrit.wikimedia.org/r/739672

Change 741132 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.38.0-wmf.9] rdbms: Make TransactionProfiler logs more useful

https://gerrit.wikimedia.org/r/741132

Change 741132 merged by jenkins-bot:

[mediawiki/core@wmf/1.38.0-wmf.9] rdbms: Make TransactionProfiler logs more useful

https://gerrit.wikimedia.org/r/741132

Mentioned in SAL (#wikimedia-operations) [2021-11-24T16:35:56Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.9/includes/libs/rdbms/: Backport: [[gerrit:741132|rdbms: Make TransactionProfiler logs more useful (T295706)]] (duration: 00m 57s)

Change 741134 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.38.0-wmf.9] rdbms: Add full query to transaction profiler

https://gerrit.wikimedia.org/r/741134

Change 741134 merged by jenkins-bot:

[mediawiki/core@wmf/1.38.0-wmf.9] rdbms: Add full query to transaction profiler

https://gerrit.wikimedia.org/r/741134

Mentioned in SAL (#wikimedia-operations) [2021-11-24T17:35:03Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.9/includes/libs/rdbms/: Backport: [[gerrit:741134|rdbms: Add full query to transaction profiler (T295706)]] (duration: 00m 56s)

Now we got the full query \o/ https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-mediawiki-2021.11.24?id=K5sFU30BfkHq3kAMLVB2

Tomorrow I will start wrestling with kibana to build a dashboard for this ^^

It's ready-ish: https://logstash.wikimedia.org/app/dashboards#/view/43fcccd0-4df5-11ec-81e9-e1226573bad4

(if it doesn't work, search in dashboards for "slow queries")

Two notes so far that need fixing:

  • we need dbhost in the fields, mediawiki is not sending them :(
  • I want to sum over "actual" field but it's not being treated as number it seems.

Change 741927 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Add db host to transaction profiler logging

https://gerrit.wikimedia.org/r/741927

Change 742259 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.38.0-wmf.9] rdbms: Add DB host to TransactionProfiler logging and fix time fields

https://gerrit.wikimedia.org/r/742259

Change 741927 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Add DB host to TransactionProfiler logging and fix time fields

https://gerrit.wikimedia.org/r/741927

Change 742259 merged by jenkins-bot:

[mediawiki/core@wmf/1.38.0-wmf.9] rdbms: Add DB host to TransactionProfiler logging and fix time fields

https://gerrit.wikimedia.org/r/742259

Mentioned in SAL (#wikimedia-operations) [2021-11-29T17:58:58Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.9/includes/libs/rdbms/: Backport: [[gerrit:742259|rdbms: Add DB host to TransactionProfiler logging and fix time fields (T295706)]] (duration: 00m 56s)

Change 742259 merged by jenkins-bot:

[mediawiki/core@wmf/1.38.0-wmf.9] rdbms: Add DB host to TransactionProfiler logging and fix time fields

https://gerrit.wikimedia.org/r/742259

With this merge, it is a lot better now and much more useful, thank you!!!
Question, is there a quick way to filter for queries that took longer than 60 seconds?

It should be but for whatever reason logstash still thinks the fields are not numeric. I'm on it (by on it I mean I'm bugging Filippo)

It should be but for whatever reason logstash still thinks the fields are not numeric. I'm on it (by on it I mean I'm bugging Filippo)

I took another look at this, and the fields in _source are numbers as expected (I'm looking at this document) though the indexed fields are text. My expectation and recollection was that the type would be preserved, however this doesn't seem to be the case. I noticed we have some specific overrides in the logstash template for numeric fields at modules/profile/files/logstash/templates/logstash_7.0-1.json though. I'm looping in @colewhite to double check my expectation/diagnosis, thanks!

Change 742923 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] logstash: Add maxSeconds and actualSeconds as numeric fields

https://gerrit.wikimedia.org/r/742923

Thanks. I made a patch for it like I know what I'm doing (narrator: he didn't). So a review for you and/or Cole would be amazing.

Change 742923 merged by Ladsgroup:

[operations/puppet@production] logstash: Add maxSeconds and actualSeconds as numeric fields

https://gerrit.wikimedia.org/r/742923

Change 742923 merged by Ladsgroup:

[operations/puppet@production] logstash: Add maxSeconds and actualSeconds as numeric fields

https://gerrit.wikimedia.org/r/742923

This change was successful and the fields are now treated as numerics but the fields are now different types between indexes. Because the field is in type conflict, some type-dependent aggregation and filtering capabilities for these fields may be unavailable.

some type-dependent aggregation and filtering capabilities for these fields may be unavailable.

Which is the one I need :( Is there a way to fix the conflict?

Which is the one I need :( Is there a way to fix the conflict?

The keyword-typed indexes have to be deleted (194gb of mediawiki logs over 3 days) or expire (after 90 days). I'll try a reindex to preserve the data and speed up the process in codfw. We're planning to move there for a little while in the near future.

Thanks. Let me know when that happens.

@Ladsgroup we have moved to codfw. The fields are no longer in conflict on this cluster.

Ladsgroup moved this task from Blocked to Done on the DBA board.

The sum section has been added and you can also now filter it based on actualSeconds too. This is done. I will focus on improving DBError dashboard next.

This is excellent news! The dashboard looks great now and definitely ready to replace Tendril's one.
Great work!

hashar subscribed.

The full SQL query might be too large for our logging stack. The message ends up being truncated along the way and is not properly ingested in OpenSearch, thus we potentially miss warnings (though they can be found under the jsonTruncated channel. I have filed T349140 about it.