Page MenuHomePhabricator

Links tables corrupted due to incorrectly parenthesized delete queries
Closed, ResolvedPublicPRODUCTION ERROR

Description

Following group1 promotion to 1.38.0-wmf.17 there was a large spike in read-only related errors and slow queries. Unfortunately I wasn't able to select a specific relevant root error to report, but it appears that many of the slow queries are related to INSERTs made to the pagelinks and templatelinks tables, suggesting a possible relation to the risky patch in T293958#7612230

See the slow queries logstash dashboard here as well as exceptions around the same time window as today's group1 rollout.

https://logstash.wikimedia.org/goto/fc93968e03e382d6ead41535bab7133f

Event Timeline

dduvall triaged this task as Unbreak Now! priority.Jan 12 2022, 8:58 PM
dduvall created this task.

Sorry for the very general report here but I wasn't able to pinpoint a specific root error in logstash to which to link. Please adjust as necessary.

This huge spike of lag correlates with a huge spike in writes (which generated lots of lag): https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=1642018237886&to=1642020538564
They all seem to match this deployment:

20:21 dduvall@deploy1002: Synchronized php: group1 wikis to 1.38.0-wmf.17 refs T293958 (duration: 01m 21s)

The number of row writes per second doesn't seem to have yet gone back to pre-deployment levels:

https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=7&orgId=1&from=1642018534577&to=1642023246589

Screenshot_20220112_224005.png (740×1 px, 215 KB)

Specifically on s4 and s8, still an active, ongoing issue related to wikibase, or just a very long tail (around 30K rows/s, while normally it is around 8K rows/s per section)? Someone should keep an eye on that- doesn't seem to have been fully solved (keeping the unbreaking now for now).

Pinging Tim, he should be coming online about now.

DELETE FROM `templatelinks` WHERE (tl_from = 9691118 AND (tl_namespace = 10 AND tl_title IN ('LangSwitch','Purge') ) OR (tl_namespace = 828 AND tl_title = 'LangSwitch'))

The query is insufficiently parenthesized. I believe it deletes all templatelinks to [[Module:LangSwitch]] regardless of source page.

Change 753581 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Database::factorConds(): fix insufficient parenthesization

https://gerrit.wikimedia.org/r/753581

Change 753498 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@wmf/1.38.0-wmf.17] Database::factorConds(): fix insufficient parenthesization

https://gerrit.wikimedia.org/r/753498

The current .16 background issues seem to be exceptions for failing to lock pages at:
/srv/mediawiki/php-1.38.0-wmf.16/includes/deferred/LinksUpdate.php
https://logstash.wikimedia.org/goto/b7d2781a9e247a0595896ba649c19a6b

I don't know if that is related to .17 deploy or not.

There will be additional insert traffic as refreshlinks jobs reinsert the deleted links.

tstarling renamed this task from Wikimedia\Rdbms\DBReadOnlyError: Database is read-only: The database is read-only until replication lag decreases. to Links tables corrupted due to incorrectly parenthesized delete queries.Jan 12 2022, 11:42 PM

Exceptions seem to be nice now after full revert. :-)

My suggestion is to either throttle/limit more aggressively the jobs after re-deploy, or run a repairing script in advance, to prevent overwhelming the primary dbs and forcing read only.

I mean the insert traffic will be there already, since the old version will slowly repair deleted links. So we should throttle the jobs now if they are a problem.

I mean the insert traffic will be there already

Ah, the traffic may be higher than usual for this time of the day, but more than managable, so then no actionable needed for that, thank you!

DELETE FROM `templatelinks` WHERE (tl_from = 9691118 AND (tl_namespace = 10 AND tl_title IN ('LangSwitch','Purge') ) OR (tl_namespace = 828 AND tl_title = 'LangSwitch'))

I changed this query to SELECT COUNT(*) and ran it, and it showed 40M rows. No write queries are still running. So I guess something killed the DELETE query before it managed to actually delete 40M rows.

Change 753581 merged by jenkins-bot:

[mediawiki/core@master] Database::factorConds(): fix insufficient parenthesization

https://gerrit.wikimedia.org/r/753581

I wrote this perl script: P18710. It makes a 500MB SQL file from the row-based binlog of commonswiki pagelinks deletions.

Change 753601 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Add --batch-size to sql.php

https://gerrit.wikimedia.org/r/753601

The plan is to undelete pagelinks on commonswiki which were deleted between 20:19:00 and 20:40:00 using the pasted perl script and sql.php with the batch size patch. The sooner the better, since the undeleted link data becomes more stale as time goes by. I will go ahead as soon as someone reviews the concept and voices approval.

Change 753499 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@wmf/1.38.0-wmf.16] maintenance: Add --batch-size to sql.php

https://gerrit.wikimedia.org/r/753499

I used the statement-based binlog to determine that for commonswiki pagelinks, the first affected query was at 20:20:40 and the last was at 20:33:23. So I'll narrow the range for undeletion accordingly.

I used

grep -B1 'DELETE .* FROM `pagelinks`.*AND (pl_namespace'

since the unaffected queries have a two parentheses before pl_namespace.

Change 753601 merged by jenkins-bot:

[mediawiki/core@master] maintenance: Add --batch-size to sql.php

https://gerrit.wikimedia.org/r/753601

Change 753499 merged by jenkins-bot:

[mediawiki/core@wmf/1.38.0-wmf.16] maintenance: Add --batch-size to sql.php

https://gerrit.wikimedia.org/r/753499

Mentioned in SAL (#wikimedia-operations) [2022-01-13T02:30:02Z] <TimStarling> on mwmaint1002: inserting 4221344 rows into commonswiki.pagelinks to clean up from T299095

  • Using the statement-based binlog, I confirmed that there were no affected deletes on commonswiki.iwlinks.
  • I confirmed that there were no affected deletes on commonswiki.templatelinks. Templatelinks is rarely affected because the bug is triggered when multiple target namespaces appear in a delete query.
  • On wikidatawiki there was only one bad query, and it deleted 1.7M rows linking to [[P156]]. I manually trimmed the unrelated rows from the undelete SQL and started the import.

Mentioned in SAL (#wikimedia-operations) [2022-01-13T03:33:10Z] <TimStarling> on mwmaint1002: inserting 1714288 into wikidatawiki.pagelinks for T299095

Mentioned in SAL (#wikimedia-operations) [2022-01-13T04:30:49Z] <TimStarling> on mwmaint1002: inserting 11565 rows into itwiki.pagelinks for T299095

  • In s2, only itwiki is affected. I undeleted 11565 rows.
  • In s3, the affected wikis were elwiktionary eowiktionary hiwiki incubatorwiki itwikiquote ruwiktionary simplewiki specieswiki trwiktionary. The total restored row count is 226335.

Mentioned in SAL (#wikimedia-operations) [2022-01-13T05:00:44Z] <TimStarling> doing T299095 restorations on s3 wikis

In s2 and s3 I used

grep -B3 'DELETE.*FROM `\(pagelinks\|templatelinks\).* (.._namespace.*namespace'

on the statement-based binlog to find affected queries and wikis. The regex detects a deletion with a single parenthesis before the first namespace, and with at least two namespaces in total. In s5-s7 I searched for the function comment instead of the single parenthesis to detect the code version.

  • s5: No affected queries found
  • s6: No affected queries found
  • s7: No affected queries found

That concludes the undeletion operations. I think the only thing left to do is the merge and deployment of the Database::factorConds() fix.

Change 753498 merged by jenkins-bot:

[mediawiki/core@wmf/1.38.0-wmf.17] Database::factorConds(): fix insufficient parenthesization

https://gerrit.wikimedia.org/r/753498

DELETE FROM `templatelinks` WHERE (tl_from = 9691118 AND (tl_namespace = 10 AND tl_title IN ('LangSwitch','Purge') ) OR (tl_namespace = 828 AND tl_title = 'LangSwitch'))

The query is insufficiently parenthesized. I believe it deletes all templatelinks to [[Module:LangSwitch]] regardless of source page.

How did you find the offending query? In the slow query log?

Thanks for taking care of this @tstarling. Great idea to look through the binlogs to find and revert the destructive queries! I wouldn't have thought of that.

How did you find the offending query? In the slow query log?

Yes.

Great idea to look through the binlogs to find and revert the destructive queries! I wouldn't have thought of that.

Credit for that goes to @jcrespo, who started off the project and handed it over to me at 1:45am his time.

It appears that this is no longer a blocker.

tstarling claimed this task.