Page MenuHomePhabricator

Special:Block request makes the database timeout on db1055
Closed, ResolvedPublicPRODUCTION ERROR

Description

https://logstash.wikimedia.org/#dashboard/temp/AVX-rq3Ic8qLrUhX7ZVu

SELECT log_id,log_type,log_action,log_timestamp,log_user,log_user_text,log_namespace,log_title,log_comment,log_params,log_deleted,user_id,user_name,user_editcount,
(SELECT GROUP_CONCAT(ct_tag SEPARATOR ',') FROM `change_tag` WHERE ct_log_id=log_id ) AS `ts_tags` 
FROM `logging` 
LEFT JOIN `user` ON ((log_user=user_id)) 
WHERE log_action IN ('block','reblock','unblock') AND log_type = 'suppress' AND log_namespace = '2' AND log_title = '<IP or maybe user>' 
ORDER BY log_timestamp DESC LIMIT 11

Event Timeline

This was reported by user @Bsadowski1 on IRC as an anoying error when handling blocks.

I am not giving it high priority because the number of errors seem low, and probably it is not reproducible (it doesn't happen again on reload). Please report any change you experiment on it getting better or worse.

Happened to me a few minutes ago when attempting to block a user at the same time as another administrator completed the action. Page should have been a block conflict one.

Happened on Special:Undelete as well when trying to view a deleted article.

Bsadowski1:

block a user at the same time as another administrator completed the action
Special:Undelete

are probably separate issues, please do not mix all kinds of errors on the same ticket. If the others are reproducible, please file separate tickets.

Errors are not necessarily something bad, in the case of two or more people doing the same action at the same time, they protect from the state of the database being consistent (only 1 block); I agree that when possible, a better error message should be sent (block confict), but that may not be always possible.

I will see now if I can discover what causes the original issue: db1055 timeouts.

I can confirm this is a very rare occurence, only 50 times in a week.

Explain on db1057:

*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: logging
         type: ref
possible_keys: type_time,page_time,log_title_time,log_title_type_time
          key: log_title_type_time
      key_len: 52
          ref: const,const
         rows: 1
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: user
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: enwiki.logging.log_user
         rows: 1
        Extra: 
*************************** 3. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: change_tag
         type: ref
possible_keys: ct_log_id,change_tag_log_tag
          key: change_tag_log_tag
      key_len: 5
          ref: enwiki.logging.log_id
         rows: 1
        Extra: Using where; Using index
3 rows in set (0.00 sec)

Explain on db1055:

*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: logging
         type: ref
possible_keys: type_time,page_time,log_title_time,log_title_type_time
          key: type_time
      key_len: 34
          ref: const
         rows: 6
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: user
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: enwiki.logging.log_user
         rows: 1
        Extra: 
*************************** 3. row ***************************
           id: 2
  select_type: DEPENDENT SUBQUERY
        table: change_tag
         type: ref
possible_keys: ct_log_id,change_tag_log_tag
          key: ct_log_id
      key_len: 5
          ref: enwiki.logging.log_id
         rows: 17457
        Extra: Using where; Using index
3 rows in set (0.00 sec)

The wrong indexes seem to be in use, probably due to: T132416. I will continue digging.

Forcing log_title_type_time on logging seems to create the same query plan, and a much faster query. The reason for that is probably the special partitioning on db1055. That is strange, because precisely, logging is partitioned by user_id, something that makes some queries from impossible to bearable.

Probably is due to the fact that the partitioning helps reading rows by log_user, but not log_title, even if title is in fact a user (blocks on a particular user).

However, if I IGNORE INDEX(type_time), it goes with page_time, and still ignores log_title_type_time. I have to ignore both IGNORE INDEX(type_time, page_time)to convice it to use it.

Things get even more interesting if I try to run that on db1051- it prefers page_time (which I think is anyway as good as the alternative), then uses log_title_type_time.

This is probably a simple case of outdated table statistics. I am going to depool db1055, run analyze on it, and check that the issue is gone.

Change 301377 had a related patch set uploaded (by Jcrespo):
Depool db1055 for database maintenance

https://gerrit.wikimedia.org/r/301377

Change 301377 merged by Jcrespo:
Depool db1055 for database maintenance

https://gerrit.wikimedia.org/r/301377

Change 301384 had a related patch set uploaded (by Jcrespo):
Depool db1055 also from the rc/log role

https://gerrit.wikimedia.org/r/301384

Change 301384 merged by Jcrespo:
Depool db1055 also from the rc/log role

https://gerrit.wikimedia.org/r/301384

The server is running an older version of MariaDB, unlike the others, that could be related, too. I will also upgrade the server at the same time.

After upgrade + engine_dependent_statistics fix + ANALYZE table, this issue is solved. It prefers page_time, but the execution is 0 seconds. Once I repool the server, this issue should not reappear anymore.

Change 301586 had a related patch set uploaded (by Jcrespo):
Repool db1055 after maintenance

https://gerrit.wikimedia.org/r/301586

Change 301586 merged by jenkins-bot:
Repool db1055 after maintenance

https://gerrit.wikimedia.org/r/301586

I do not see any more errors on blocking users since 2016-07-27T12:05:33 UTC. I believe this to be fixed.

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:11 PM