
Deletion not working on English Wikipedia
Closed, ResolvedPublic

Description

See https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&diff=835707270&oldid=835707218 where an admin reported that deletion appeared to have been broken for about 20 minutes.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 10 2018, 8:59 AM
BethNaught triaged this task as Unbreak Now! priority.Apr 10 2018, 9:00 AM
Restricted Application added subscribers: Liuxinyu970226, TerraCodes. · View Herald TranscriptApr 10 2018, 9:00 AM

There is an ALTER TABLE running on the archive table on the enwiki master. It is almost done, though.
We saw no issues on the other masters when altering this table, so I went ahead with enwiki.

The ALTER TABLE has finished.

@BethNaught can you check if it works again?

Works for me at this time.

Thanks, it should not give any errors anymore.

jcrespo lowered the priority of this task from Unbreak Now! to Normal.Apr 10 2018, 9:27 AM
jcrespo added a subscriber: jcrespo.

Setting to Normal, as the incident should be resolved; we now have to research what actually happened.

Errors started at 8:40 UTC and lasted until 9:19 UTC. During that window, deletes and moves likely failed. They should all succeed now.

This is what was being executed:

SET SESSION innodb_lock_wait_timeout=1;
SET SESSION lock_wait_timeout=30;
ALTER TABLE archive
  MODIFY COLUMN ar_text mediumblob NULL,
  MODIFY COLUMN ar_flags tinyblob NULL;
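For context, my reading of the two session settings (this explanation is mine, not from the original maintenance script): `innodb_lock_wait_timeout` bounds how long a statement waits on InnoDB row locks, while `lock_wait_timeout` bounds waits on metadata (MDL) locks. Both are set low so the DDL fails fast instead of queueing behind long transactions:

```sql
-- Sketch: the two timeouts govern different lock types.
SET SESSION innodb_lock_wait_timeout = 1;  -- give up after 1s waiting on InnoDB row locks
SET SESSION lock_wait_timeout = 30;        -- give up after 30s waiting on metadata locks

-- With these set, the ALTER aborts quickly rather than holding its place in
-- the metadata-lock queue and blocking every later statement on the table.
ALTER TABLE archive
  MODIFY COLUMN ar_text  mediumblob NULL,
  MODIFY COLUMN ar_flags tinyblob NULL;
```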

Oshwah added a subscriber: Oshwah.Apr 10 2018, 9:30 AM
jcrespo added a comment.EditedApr 10 2018, 9:39 AM

What I saw was INSERTs into archive being blocked due to metadata locking, but that would not make sense except at the start of the command, or the command would fail in 30 seconds. Maybe it requires a second metadata lock under certain conditions?

What I saw was INSERTs into archive being blocked due to metadata locking, but that would not make sense except at the start of the command, or the command would fail in 30 seconds. Maybe it requires a second metadata lock under certain conditions?

Yeah, that's my surprise too. That would have been expected either at the start or at the end, but not during the alter itself. I didn't see that happening with the other masters.
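For illustration (this is a hypothetical sketch, not output from the incident logs; the session id and statement are made up): on MariaDB, sessions queued behind a metadata lock show up in the processlist with State = 'Waiting for table metadata lock', which is how the blocked INSERTs described above would look.

```sql
SHOW FULL PROCESSLIST;
-- A blocked writer would look roughly like:
--   Id    Command  Time  State                            Info
--   1234  Query    95    Waiting for table metadata lock  INSERT INTO archive (...) VALUES (...)
-- Long waits like this are possible because only the ALTER session set
-- lock_wait_timeout=30; other sessions keep the default, which is much higher.
```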

jcrespo added a comment.EditedApr 10 2018, 10:00 AM

These were the queries ongoing at that time:
{P6973}

I wonder whether this is a one-time thing due to an obscure combination of rename + alter + deletes, or whether there is a software bug.

Alternatively, it could be a new locking pattern due to the changes in comment storage: extra locking is needed there, creating more contention with regular maintenance.

jcrespo moved this task from Triage to In progress on the DBA board.

These are the versions of the previously altered masters:

s1: 10.0.28 (the one that caused this)
s2: 10.0.29
s3: 10.0.23
s4: 10.0.32
s5: 10.0.33
s6: 10.0.23
s7: 10.0.23

None of them gave any issues with that specific alter.
I have also confirmed that on 10.0.28 this alter is fully online:

MariaDB [test]> select @@version;
+-----------------+
| @@version       |
+-----------------+
| 10.0.28-MariaDB |
+-----------------+
1 row in set (0.00 sec)
MariaDB [test]> ALTER TABLE archive MODIFY COLUMN ar_text mediumblob NULL, MODIFY COLUMN ar_flags tinyblob NULL, ALGORITHM=INPLACE;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

There have been dozens of servers already altered with that specific alter, and none of them even showed lag, so it is indeed fully online.

So could it be a race condition with the other queries?
It is probably true that the enwiki archive table has a lot more activity than the rest of the already altered masters.

CC @Anomie. This is not directly related (maintenance was the direct cause), but I believe the new comment model may be creating worse locking patterns on deletion, with queries like:

SELECT rev_id, rev_page, rev_text_id, rev_timestamp, rev_minor_edit, rev_deleted,
       rev_len, rev_parent_id, rev_sha1,
       COALESCE(comment_rev_comment.comment_text, rev_comment) AS `rev_comment_text`,
       comment_rev_comment.comment_data AS `rev_comment_data`,
       comment_rev_comment.comment_id AS `rev_comment_cid`,
       rev_user, rev_user_text, NULL AS `rev_actor`,
       rev_content_format, rev_content_model
FROM `revision`
LEFT JOIN `revision_comment_temp` `temp_rev_comment`
       ON temp_rev_comment.revcomment_rev = rev_id
LEFT JOIN `comment` `comment_rev_comment`
       ON comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id
WHERE rev_page = 'X'
FOR UPDATE

Could the SELECT ... FOR UPDATE be restricted to the revision table and select the comment on a second query, without locking? My hypothesis is that the extra locking could affect deletions, because all of them would try to take an exclusive lock on the same "deletion reason" comment row, creating higher contention. So something like:

SELECT rev_id, rev_page, rev_text_id, rev_timestamp, rev_minor_edit, rev_deleted,
       rev_len, rev_parent_id, rev_sha1,
       rev_comment AS `rev_comment_text`,
       rev_user, rev_user_text, NULL AS `rev_actor`,
       rev_content_format, rev_content_model
FROM `revision`
WHERE rev_page = 'X'
FOR UPDATE

and then,

SELECT comment_rev_comment.comment_text AS `rev_comment_text`,
       comment_rev_comment.comment_data AS `rev_comment_data`,
       comment_rev_comment.comment_id AS `rev_comment_cid`
FROM `revision_comment_temp` `temp_rev_comment`
LEFT JOIN `comment` `comment_rev_comment`
       ON comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id
WHERE temp_rev_comment.revcomment_rev = $obtained_rev_id;

without for update. We can create a specific task for that.

We can create a specific task for that.

Please do.

Could the SELECT ... FOR UPDATE be restricted to the revision table and select the comment on a second query, without locking?

If we don't mind whatever the overhead is of doing multiple queries, sure.

Although at least for WikiPage::doDeleteArticleReal() we'll probably also want to lock the joined revision_comment_temp (and in the future revision_actor_temp) rows.

I don't know if we really want to do the second SELECT per row, though; that's a lot of back-and-forth. And splitting the tables would be a bit of a pain on the PHP side. It would be easier from PHP to do something like SELECT 1 FROM revision WHERE rev_page = 'X' FOR UPDATE to lock the rows, followed by the existing query (without FOR UPDATE) to fetch the actual data.
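A sketch of that lock-then-fetch pattern (illustrative only; table and column names are taken from the queries quoted earlier in this task, and 'X' stands for the page id):

```sql
BEGIN;

-- 1) Take the row locks on revision only. The comment and
--    revision_comment_temp tables are not touched, so the shared
--    "deletion reason" comment row is never locked exclusively.
SELECT 1 FROM `revision` WHERE rev_page = 'X' FOR UPDATE;

-- 2) Fetch the full data with the existing join, without FOR UPDATE.
SELECT rev_id, rev_timestamp,
       COALESCE(comment_rev_comment.comment_text, rev_comment) AS `rev_comment_text`
FROM `revision`
LEFT JOIN `revision_comment_temp` `temp_rev_comment`
       ON temp_rev_comment.revcomment_rev = rev_id
LEFT JOIN `comment` `comment_rev_comment`
       ON comment_rev_comment.comment_id = temp_rev_comment.revcomment_comment_id
WHERE rev_page = 'X';

-- ... do the archive/delete work inside the same transaction, then:
COMMIT;
```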

I agree with everything you said; my comment was a quick sketch, and what you proposed is what I really wanted. I have created T191892 to handle that there.

Note this is related to this ticket only because I think it was the trigger that, together with the maintenance, made things a bit worse than intended; the underlying issue has been happening for a long time. I have mentioned before that it made Commons deletes a bit less reliable.

We have started an Incident Report for this: https://wikitech.wikimedia.org/wiki/Incident_documentation/20180410-Deleting_a_page_on_enwiki
Please feel free to add any comments there.

jcrespo closed this task as Resolved.Apr 10 2018, 3:37 PM
jcrespo assigned this task to Marostegui.

I am going to close this ticket, as the initial report ("Deletion not working") was resolved as soon as the maintenance finished. We hope T191892 will mitigate the issue in the future, but we can only verify that once the change is deployed. We will monitor and reopen if we gather more information or the mitigation doesn't work. Feel free to use the Incident talk page for further questions and comments rather than this ticket.