
tools.meta was creating high cpu load on wikireplicas, rate-limited
Closed, Resolved · Public


s52256 was stressing the wikireplicas with high CPU usage on the database. The tool has been rate-limited to avoid causing issues for other users (who complained about lag on IRC).

Please review the queries used and whether some changes are needed (it is not necessarily your fault; it may be due to MediaWiki structure changes that made the queries less efficient, or sometimes abuse from third parties).

It was also not necessarily the only cause of issues, but we have to start with the top user, and that was this tool at the time of the issue; load went almost back to normal after rate-limiting. Please work with us on figuring out what could be done: app-level rate limiting, etc.

Event Timeline

This is currently causing the popular Meta tools to malfunction. E.g. all the tools under, which are needed for steward tasks such as Stewards requests.

Unable to connect to s6.web.db.svc.eqiad.wmflabs, s8.web.db.svc.eqiad.wmflabs, ..

Exception: SQLSTATE[HY000] [1226] User 's52256' has exceeded the 'max_user_connections' resource (current value: 1)
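As an aside, tools sharing a rate-limited account can degrade more gracefully than the hard failure above. The sketch below is a hypothetical mitigation, not anything from the affected tools: it retries a connection with exponential backoff when MySQL error 1226 (ER_USER_LIMIT_REACHED, the code in the exception above) is hit, using a simulated connector stand-in rather than a real driver.

```python
# Hypothetical sketch: with max_user_connections capped at 1, concurrent
# tools sharing the s52256 credentials see MySQL error 1226. A client can
# treat that error as retryable instead of failing the whole request.
# connect_fn below is a simulated stand-in for a real connector call.
import time

MAX_USER_CONNECTIONS_ERR = 1226  # MySQL ER_USER_LIMIT_REACHED


class MySQLError(Exception):
    """Minimal simulated error carrying a MySQL error code."""
    def __init__(self, code, msg):
        super().__init__(msg)
        self.code = code


def connect_with_backoff(connect_fn, attempts=5, base_delay=0.01):
    """Retry a connection while the shared account is at its limit."""
    for attempt in range(attempts):
        try:
            return connect_fn()
        except MySQLError as err:
            if err.code != MAX_USER_CONNECTIONS_ERR or attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff


# Simulated connector: fails twice while "another tool" holds the slot.
state = {"calls": 0}


def fake_connect():
    state["calls"] += 1
    if state["calls"] < 3:
        raise MySQLError(1226, "User 's52256' has exceeded 'max_user_connections'")
    return "connection"


print(connect_with_backoff(fake_connect))  # eventually succeeds
```

This only papers over the symptom for interactive tools; the underlying fix is still reducing the load that triggered the limit.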

I don't see a relation between these tools and Pywikibot-catfiles, so perhaps one of them is using the wrong database credentials. Or, if they really are run from the same account, perhaps the bot should move to a different Tools project.

According to, s52256 (52256) belongs to tools.meta, not tools.drtrigonbot. Looks like that might be using the wrong db credentials indeed.

Krinkle renamed this task from Pywikibot-catfiles was creating high cpu load on wikireplicas, rate-limited to tools.meta was creating high cpu load on wikireplicas, rate-limited. Mar 8 2019, 3:06 PM
Krinkle assigned this task to Pathoschild.
Krinkle triaged this task as High priority.
Krinkle edited projects, added Tools; removed Pywikibot-catfiles.

Actually, it's not from Pywikibot-catfiles at all. That's just because does some kind of Phabricator search based on key words, and suggested an unrelated project.

Some of the potential queries causing the issue (there were multiple of these running in parallel):

                SELECT
                        ...,
                        comment_text AS ipb_reason,
                        DATE_FORMAT(ipb_timestamp, "%Y-%b-%d") AS timestamp,
                        DATE_FORMAT(ipb_expiry, "%Y-%b-%d") AS expiry
                FROM ipblocks
                LEFT JOIN comment ON ipb_reason_id = comment_id
                WHERE (ipb_range_start <= 'v6-260088003980255009B2521EDAE937EA' AND ipb_range_end >= 'v6-260088003980255009B2521EDAE937EA')
                   OR (ipb_range_start >= 'v6-260088003980255009B2521EDAE937EA' AND ipb_range_end <= 'v6-260088003980255009B2521EDAE937EA')
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ipblocks
         type: ALL
possible_keys: ipb_range
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1129010
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: comment
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: enwiki.ipblocks.ipb_reason_id
         rows: 1

I asked about this in #wikimedia-cloud, but unfortunately never got an answer. The only recent change is updating queries for T166732. The above query had two changes: (a) changed ipb_reason to comment_text AS ipb_reason, and (b) added LEFT JOIN comment ON ipb_reason_id = comment_id. I'm not sure why that makes the query so much slower, and it's hard to experiment without EXPLAIN access. Suggestions are welcome! I'd rather not revert the update entirely, since the ipb_reason column will be removed soon.
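For readers without EXPLAIN access on the replicas, the shape of this query can be studied on a scaled-down local copy. The sketch below is only illustrative: it rebuilds a tiny ipblocks/comment pair in stdlib sqlite3 (a different engine than the MariaDB wikireplicas, so the plan details will differ from the EXPLAIN output above) and prints the plan for the joined query.

```python
# Illustrative, scaled-down reproduction using sqlite3 rather than the
# MariaDB wikireplicas: table and column names mirror the MediaWiki schema,
# but the engine and data differ, so the plan is only indicative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ipblocks (
        ipb_id INTEGER PRIMARY KEY,
        ipb_range_start TEXT,
        ipb_range_end TEXT,
        ipb_reason_id INTEGER
    );
    CREATE INDEX ipb_range ON ipblocks (ipb_range_start, ipb_range_end);
    CREATE TABLE comment (
        comment_id INTEGER PRIMARY KEY,
        comment_text TEXT
    );
""")

# MariaDB reported type: ALL / key: NULL for ipblocks (a full scan of
# ~1.1M rows) plus an eq_ref lookup into comment; depending on the engine,
# the OR across both range columns may force a scan or an OR-optimization.
addr = "v6-260088003980255009B2521EDAE937EA"
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT comment_text AS ipb_reason
    FROM ipblocks
    LEFT JOIN comment ON ipb_reason_id = comment_id
    WHERE (ipb_range_start <= ? AND ipb_range_end >= ?)
       OR (ipb_range_start >= ? AND ipb_range_end <= ?)
""", (addr, addr, addr, addr)).fetchall()
for row in plan:
    print(row[3])  # the human-readable plan step
```

The join step itself is cheap (one primary-key lookup per driving row, matching the eq_ref row above); the cost comes from how many driving rows the scan of ipblocks produces.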

it's hard to experiment without EXPLAIN access

You may be able to use to check how the optimizer is treating your query.

Just to be clear about why action was taken: normally we allow tools to use whatever resources they want, but we got a complaint on IRC about lag on the wikireplica web service, and this tool was observed to be the #1 cause of CPU starvation.

Thanks! Could it be related to the comment view?

  • The old query performs a simple select on all ipblocks rows. It's not ideal, but I don't know of a more optimal way and it didn't cause issues before. The WHERE can be simplified to ipb_range_start = ? AND ipb_range_end = ? in this case, but it doesn't affect the query plan.
  • The new query is identical, but adds a left join on comment. Besides the expected eq_ref join, this also adds ten dependent subqueries on tables like revision and image. This version apparently causes high CPU.

Both take equivalent time though (≈2 seconds), even when I use SQL_NO_CACHE and change the WHERE values.
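Predicate rewrites like the simplification mentioned above can be sanity-checked exhaustively before touching the live query. The snippet below is an illustrative property test (not part of the tool): over a small ordered domain standing in for the hex range keys, it checks that, assuming stored ranges satisfy start <= end, the two-branch OR collapses to a plain containment test, because the second branch only matches when start = end = value.

```python
# Illustrative property check (not from the original tool): verify that,
# for well-formed ranges, the two-branch OR predicate used in the query
# is equivalent to a plain containment test on the target value.
from itertools import product


def query_predicate(start, end, v):
    """The WHERE clause as written in the slow query."""
    return (start <= v and end >= v) or (start >= v and end <= v)


def containment(start, end, v):
    """Simpler form: the block's range contains the address."""
    return start <= v <= end


# Exhaustive check over a small ordered domain stands in for the hex keys;
# assumes start <= end holds for stored ranges.
assert all(
    query_predicate(s, e, v) == containment(s, e, v)
    for s, e, v in product(range(8), repeat=3)
    if s <= e
)
print("predicates equivalent for well-formed ranges")
```

As noted above, though, such rewrites did not change the query plan here, so they would not have fixed the CPU usage on their own.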

I reverted the update temporarily. That should solve the high CPU usage if it's due to the query changes (rather than some other change on Toolforge), but the tools will break with the upcoming comment change.

Pathoschild closed this task as Resolved. Edited May 26 2021, 1:56 PM

This seems to be solved; I haven't seen any reported issues in the two years since the last comment.