
Disabling general.confirmedusers from dbreports for using up too many db resources
Closed, Resolved · Public

Description

I've killed the job and disabled the cronjob temporarily. We'll need a discussion with the DBAs about optimizing the report so that it doesn't consume so many db resources before it can be re-enabled.
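For context, stopping a runaway report involves two separate things: removing the grid job itself (it was submitted with jsub, so presumably deleted with qdel) and, if the server-side query thread is still executing, killing it from a MySQL session. A minimal sketch of the latter, using the thread id from the processlist excerpt below; this is an illustration, not the exact command that was run:

/* Hypothetical cleanup from a mysql session on the replica;
   21844654 is the thread id from the processlist output below. */
KILL 21844654;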

Event Timeline

The very long-running query was:

| 21844654 | p50380g50440    | 10.68.xx.xx:54190  | enwiki_p                    | Execute |  577307 | Sending data                                                                |
/* confirmedusers.py SLOW_OK */
SELECT
  CONVERT(user_name USING utf8),
  user_editcount,
  rev_timestamp
FROM user
JOIN user_groups
ON ug_user = user_id
JOIN revision
ON rev_user = user_id
AND ug_group = 'confirmed'
AND user_editcount > 9
AND (SELECT
       MIN(rev_timestamp)
     FROM revision
     WHERE rev_user = user_id) < DATE_FORMAT(DATE_SUB(NOW(),INTERVAL 4 DAY),'%Y%m%d%H%i%s')
AND rev_timestamp = (SELECT
                       MIN(rev_timestamp)
                     FROM revision
                     WHERE rev_user = user_id)

Here's the EXPLAIN; note the two dependent subqueries, which get no usable key and each do a full scan of the usertext_timestamp index (an estimated ~323 million revision rows) for every outer row:

+------+--------------------+-------------+--------+------------------+--------------------+---------+----------------------------+-----------+--------------------------+
| id   | select_type        | table       | type   | possible_keys    | key                | key_len | ref                        | rows      | Extra                    |
+------+--------------------+-------------+--------+------------------+--------------------+---------+----------------------------+-----------+--------------------------+
|    1 | PRIMARY            | user_groups | ref    | PRIMARY,ug_group | ug_group           | 257     | const                      |       497 | Using where; Using index |
|    1 | PRIMARY            | user        | eq_ref | PRIMARY          | PRIMARY            | 4       | enwiki.user_groups.ug_user |         1 | Using where              |
|    1 | PRIMARY            | revision    | ref    | rev_timestamp    | rev_timestamp      | 16      | func                       |         1 | Using where              |
|    3 | DEPENDENT SUBQUERY | revision    | index  | NULL             | usertext_timestamp | 541     | NULL                       | 322795979 | Using where; Using index |
|    2 | DEPENDENT SUBQUERY | revision    | index  | NULL             | usertext_timestamp | 541     | NULL                       | 322795979 | Using where; Using index |
+------+--------------------+-------------+--------+------------------+--------------------+---------+----------------------------+-----------+--------------------------+

https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Autoconfirmed_users_in_the_confirmed_user_group

MariaDB [enwiki_p]> select count(*) from user_groups where ug_group = 'confirmed';
+----------+
| count(*) |
+----------+
|      495 |
+----------+
1 row in set (0.00 sec)

Presumably we could just run 495 individual queries against enwiki_p.revision_userindex to get the rev_timestamps; the report would then generate a lot faster and would be significantly cheaper to update (see the sketch below).
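A minimal sketch of one such per-user lookup, assuming the standard revision_userindex view (where the user indexes are usable); the user id 42 is hypothetical and would in practice be each of the ~495 ug_user values from user_groups:

/* Hypothetical per-user first-edit lookup. Unlike the correlated
   subqueries above, this can use an index on (rev_user, rev_timestamp)
   and resolve the MIN() as a cheap point query. */
SELECT MIN(rev_timestamp)
FROM enwiki_p.revision_userindex
WHERE rev_user = 42;

Around 495 of these point lookups should be far cheaper than two dependent subqueries that each scan the full revision index for every outer row.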

tools.dbreps@tools-bastion-03:~$ crontab -l | grep -B1 confirmedusers
# Disabled for using too much Db Resource - YuviPanda (T131956)
# 0 1 * * 1,4 jsub -once -j y -mem 280m -N general.confirmedusers -o ~/var/log -quiet ~/bin/dbreps -r general.confirmedusers -s enwiki

Thank you for disabling, @yuvipanda. Sorry about the trouble. :-(

p50380g50440 was running several queries that were never going to finish, causing one day of replication lag on labsdb1001:

      Id: 3243394
    User: p50380g50440
    Host: 10.68.16.126:59189
      db: enwiki_p
 Command: Execute
    Time: 554131
   State: Copying to tmp table
    Info: /* linkedredlinkedcats.py SLOW_OK */
        SELECT
          CONVERT(cl_to USING utf8),
          COUNT(*)
        FROM categorylinks
        JOIN pagelinks ON pl_title = cl_to AND pl_namespace = 14
        JOIN page AS p1 ON pl_from = p1.page_id AND p1.page_namespace IN (0, 6, 10, 12, 14, 100)
        LEFT JOIN page AS p2 ON cl_to = p2.page_title AND p2.page_namespace = 14
        WHERE p2.page_title IS NULL
        GROUP BY 1 LIMIT 1000
Progress: 0.000

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: range
possible_keys: PRIMARY,name_title
          key: name_title
      key_len: 4
          ref: NULL
         rows: 24632789
        Extra: Using where; Using index; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: pagelinks
         type: ref
possible_keys: PRIMARY,pl_from,pl_namespace,pl_backlinks_namespace
          key: PRIMARY
      key_len: 8
          ref: enwiki.page.page_id,const
         rows: 2
        Extra: Using index
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: categorylinks
         type: ref
possible_keys: cl_timestamp,cl_sortkey
          key: cl_timestamp
      key_len: 257
          ref: enwiki.pagelinks.pl_title
         rows: 20
        Extra: Using index
*************************** 4. row ***************************
           id: 1
  select_type: SIMPLE
        table: page
         type: eq_ref
possible_keys: name_title
          key: name_title
      key_len: 261
          ref: const,enwiki.pagelinks.pl_title
         rows: 1
        Extra: Using where; Using index
        
      Id: 10706891
    User: p50380g50440
    Host: 10.68.18.208:39267
      db: commonswiki_p
 Command: Execute
    Time: 7075
   State: Queried about 5850000 rows
    Info: /* selfcatcats.py SLOW_OK */
        SELECT
          CONVERT(page_title USING utf8),
          cat_pages,
          cat_subcats
        FROM page
        JOIN categorylinks
        ON cl_to = page_title
        RIGHT JOIN category
        ON cat_title = page_title
        WHERE page_id = cl_from
        AND page_namespace = 14
Progress: 0.000

(In particular, note the first query: at 554131 seconds it had been running for over 6 days (554131 / 86400 ≈ 6.4), stuck doing a very expensive sort in the "Copying to tmp table" state.)
https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=labsdb1001&var-port=9104&panelId=11&fullscreen&from=now-7d&to=now

I have banned p50380g50440 from labsdb1001 (not from the other replicas or toolsdb; a warning had been sent in advance) until there is a guarantee that these queries will not happen again, and to let the replica catch up on replication.

Any objection to closing this ticket?
It is pretty old; the problematic job was disabled more than a year ago, and the old labsdb servers are going away.

chasemp claimed this task.
chasemp subscribed.

> Any objection to closing this ticket?
> It is pretty old; the problematic job was disabled more than a year ago, and the old labsdb servers are going away.

yup :)