Page MenuHomePhabricator

dbstore1002: Query running for two days
Closed, ResolvedPublic

Description

Hello Neil,

I have seen that there is a connection from stat1002 to dbstore1002 which has now been running for two days.
Looks like the connection has been issued from your user:

The connection is established on port 49663 and it is owned by your userid:

This is the query:

select left(a.ar_timestamp, 6) as month, count(*) as deleted_creations
from enwiki.archive a
inner join
(
select ar_title, min(ar_timestamp) as ar_timestamp
from enwiki.archive
group by ar_title
) b
using (ar_title, ar_timestamp)
where 
ar_namespace = 0 and
ar_comment not like "%redir%" and
ar_len > 100
group by left(a.ar_timestamp, 6);

Explain output:

*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: a
         type: ref
possible_keys: name_title_timestamp,analytics_timestamp
          key: name_title_timestamp
      key_len: 4
          ref: const
         rows: 23110456
        Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: <derived2>
         type: hash_ALL
possible_keys: key0
          key: #hash#key0
      key_len: 274
          ref: enwiki.a.ar_title,enwiki.a.ar_timestamp
         rows: 54432647
        Extra: Using join buffer (flat, BNLH join)
*************************** 3. row ***************************
           id: 2
  select_type: DERIVED
        table: archive
         type: index
possible_keys: NULL
          key: name_title_timestamp
      key_len: 277
          ref: NULL
         rows: 54432647
        Extra: Using index; Using temporary; Using filesort
3 rows in set (0.00 sec)

I have killed this query.

MariaDB DBSTORE localhost log > kill 592144656;
Query OK, 0 rows affected (0.00 sec)

Would you be able to take a look and try to improve it?
Let me know if you need help with that.

I have added the tag Editing-Analysis but I am unsure if it is correct for this scenario, feel free to change it or let me know if you want me to.

Thanks
Manuel.

Event Timeline

@Marostegui, thank you for catching this! I didn't expect it to take more than a couple of hours, so I started it before I left for the weekend and didn't check it afterward. Also, the tag is correct—thank you for using it.

I'm trying to figure out what proportion of articles that were created each month were later deleted (T149049). I already counted the article-creating entries in the revision table, and here I was trying to do the same thing with the archive table.

However, I wasn't able to simply select on ar_parent_id = 0 because that field is only populated from 2012-03 onwards (P4366), so I came up with the self-join.

One thing I can try is moving the where conditions around to cut down the rows returned by the inner select. From the explain output, it looks like that may be faster:

explain select left(a.ar_timestamp, 6) as month, count(*) as deleted_creations
from enwiki.archive a
inner join
(
select ar_title, min(ar_timestamp) as ar_timestamp
from enwiki.archive
where
ar_namespace = 0 and
convert(ar_comment using utf8) not like "%redir%" and
ar_len > 100
group by ar_title
) b
using (ar_title, ar_timestamp)
group by left(a.ar_timestamp, 6) \G;

*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: <derived2>
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 23117203
        Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: a
         type: ref
possible_keys: analytics_timestamp
          key: analytics_timestamp
      key_len: 16
          ref: b.ar_timestamp
         rows: 544373
        Extra: Using where
*************************** 3. row ***************************
           id: 2
  select_type: DERIVED
        table: archive
         type: ref
possible_keys: name_title_timestamp
          key: name_title_timestamp
      key_len: 4
          ref: const
         rows: 23117203
        Extra: Using where
3 rows in set (0.02 sec)

I'm going to try this now and keep an eye on it; if it doesn't speed things up, I'll have to take you up on your offer of help :)

Okay, looks like that revised version worked in a mere 5 hours! :)

Hey Neil!

Thanks a lot for spending time fixing this so quickly! Very nice one, from 2days (or more) to 5 hours :-)

Well done!
Manuel.

nshahquinn-wmf raised the priority of this task from High to Needs Triage.Mar 30 2018, 10:12 AM
nshahquinn-wmf moved this task from Neil's in progress to Done on the Contributors-Analysis board.