Cognate has been disabled on WMF because it caused an outage on x1 by exceeding 10000 concurrent connections
Closed, ResolvedPublic

Description

https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1031&from=1493828837614&to=1493832461569

Not only did this make Cognate fail on the wiktionaries; Flow, Echo, translations and other things hosted on the x1 master failed as well.

Event Timeline

Change 351868 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] Enable Cognate for Wiktionary in Read Only mode

https://gerrit.wikimedia.org/r/351868

Change 351799 merged by Jcrespo:
[operations/mediawiki-config@master] db: Remove all read traffic from x1, es2 & es3-master-eqiad

https://gerrit.wikimedia.org/r/351799

Change 351867 merged by jenkins-bot:
[mediawiki/extensions/Cognate@wmf/1.29.0-wmf.21] Add read only mode

https://gerrit.wikimedia.org/r/351867

Mentioned in SAL (#wikimedia-operations) [2017-05-04T16:29:50Z] <thcipriani@tin> Synchronized php-1.29.0-wmf.21/extensions/Cognate: SWAT: [[gerrit:351867|Add read only mode]] T164407 (duration: 00m 56s)

Change 351868 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable Cognate for Wiktionary in Read Only mode

https://gerrit.wikimedia.org/r/351868

Mentioned in SAL (#wikimedia-operations) [2017-05-04T16:42:31Z] <thcipriani@tin> Synchronized wmf-config: [[gerrit:351868|Enable Cognate for Wiktionary in Read Only mode]] T164407 (duration: 00m 40s)

Mentioned in SAL (#wikimedia-operations) [2017-05-04T16:49:42Z] <thcipriani@tin> Synchronized wmf-config: Revert [[gerrit:351868|Enable Cognate for Wiktionary in Read Only mode]] T164407 (duration: 00m 40s)

Mentioned in SAL (#wikimedia-operations) [2017-05-04T17:01:51Z] <thcipriani@tin> Synchronized wmf-config: Revert revert [[gerrit:351868|Enable Cognate for Wiktionary in Read Only mode]] T164407 (duration: 00m 40s)

@Lydia_Pintscher @Lea_Lacroix_WMDE @MarcoSwart Cognate is now enabled on wiktionaries in read only mode.
This means that all interwiki links that were present and provided by Cognate before the DC switch will appear again after a page purge.
This also means that links will not automatically be created when pages are created and links will not automatically be removed when pages are deleted.
We will have to run the populatePages maint script every few days to try and keep everything as in sync as possible until we take Cognate out of read only mode.
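As a rough illustration of what read only mode means here, the sketch below models a link store that keeps serving existing links while skipping writes. The class and method names are made up for this example and are not Cognate's actual code.

```python
class LinkStoreSketch:
    """Hypothetical stand-in for a cross-wiki link store with a read-only switch."""

    def __init__(self, read_only=False):
        self.read_only = read_only
        self.pages = set()  # (wiki, title) pairs that have been recorded

    def insert_page(self, wiki, title):
        # In read-only mode the write is skipped, so new pages get no links.
        if self.read_only:
            return False
        self.pages.add((wiki, title))
        return True

    def delete_page(self, wiki, title):
        # In read-only mode stale entries stay until a cleanup script runs.
        if self.read_only:
            return False
        self.pages.discard((wiki, title))
        return True

    def links_for(self, title):
        # Reads are unaffected: links recorded earlier are still returned.
        return sorted(wiki for (wiki, t) in self.pages if t == title)


store = LinkStoreSketch()
store.insert_page("enwiktionary", "water")
store.read_only = True
store.insert_page("dewiktionary", "water")  # skipped while read-only
print(store.links_for("water"))
```

The skipped writes are why a periodic repopulation run is needed until write mode is restored.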

This is very strange, maybe my monitoring is bad, but I see all queries going to the master, and none to the slave (or maybe I am doing something wrong):

$ mysql -h db1031.eqiad.wmnet sys -e "SELECT sum(exec_count) FROM statement_analysis WHERE db = 'cognate_wiktionary'"
+-----------------+
| sum(exec_count) |
+-----------------+
|           65342 |
+-----------------+
$ mysql -h db1029.eqiad.wmnet sys -e "SELECT sum(exec_count) FROM statement_analysis WHERE db = 'cognate_wiktionary'"
+-----------------+
| sum(exec_count) |
+-----------------+
|            NULL |
+-----------------+
	# ExtensionStore shard1 - initially for AFTv5
	'extension1' => [
		'10.64.16.20' => 0, # db1031, master
		'10.64.16.18' => 1, # db1029
	],
$ dig +short -x 10.64.16.20
db1031.eqiad.wmnet.
$ dig +short -x 10.64.16.18
db1029.eqiad.wmnet.

Thanks for the interim solution. I have informed the WikiWoordenboek community and I will see if there is a fast way to purge all the pages involved.

Queries are actually running on db1029; for some reason they are just not being registered in the performance_schema tables. This is a monitoring bug, not a problem with Cognate, so you can ignore my last comment.
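For anyone reading the output above: a NULL from sum() means no rows matched at all, which is different from a count of zero. A minimal sketch with an in-memory SQLite table standing in for the sys view (the table here is a toy, not the real statement_analysis view):

```python
import sqlite3

# Toy stand-in for the monitoring view; only the two columns used above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statement_analysis (db TEXT, exec_count INTEGER)")
conn.execute("INSERT INTO statement_analysis VALUES ('otherdb', 12)")


def total_exec_count(conn, db):
    # sum() over an empty set of rows yields NULL, which Python sees as None.
    row = conn.execute(
        "SELECT sum(exec_count) FROM statement_analysis WHERE db = ?", (db,)
    ).fetchone()
    return row[0]


print(total_exec_count(conn, "cognate_wiktionary"))  # None: no rows registered
print(total_exec_count(conn, "otherdb"))             # 12
```

So a NULL result points at statements not being recorded for that schema rather than at zero traffic.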

We have also deployed some extra monitoring for the db-related parts of Cognate, which can be seen at https://grafana.wikimedia.org/dashboard/db/mediawiki-cognate

Change 351911 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] wgCognateReadOnly false for 'small' wikis

https://gerrit.wikimedia.org/r/351911

Change 351911 merged by jenkins-bot:
[operations/mediawiki-config@master] wgCognateReadOnly false for 'small' wikis

https://gerrit.wikimedia.org/r/351911

Mentioned in SAL (#wikimedia-operations) [2017-05-04T18:18:03Z] <addshore@tin> Synchronized wmf-config/InitialiseSettings.php: T164407 [[gerrit:351911|wgCognateReadOnly false for small wikis]] (duration: 00m 40s)

Thanks @Addshore. We informed en, de and fr communities about this update.

Change 351923 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] wgCognateReadOnly false for medium wikis

https://gerrit.wikimedia.org/r/351923

Change 351923 merged by jenkins-bot:
[operations/mediawiki-config@master] wgCognateReadOnly false for medium wikis

https://gerrit.wikimedia.org/r/351923

Mentioned in SAL (#wikimedia-operations) [2017-05-04T19:01:50Z] <addshore@tin> Synchronized wmf-config/InitialiseSettings.php: T164407 [[gerrit:351923|wgCognateReadOnly false for medium wikis]] (duration: 00m 39s)

So Cognate is currently fully enabled on small and medium wiktionaries, the lists of these can be found below:

This is how things will remain over the weekend and we will carry out further investigation at the start of next week.
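To make the current state concrete, here is a small sketch of a per-group read-only switch. The group names follow the patches above; the wiki names in the mapping are placeholders, and the real setting lives in the PHP config rather than in anything like this.

```python
# State after the 'small' and 'medium' patches: only 'large' is still read-only.
READ_ONLY_BY_GROUP = {"small": False, "medium": False, "large": True}

# Placeholder membership, purely for illustration (not the real lists).
GROUP_OF_WIKI = {
    "examplesmallwiktionary": "small",
    "examplemediumwiktionary": "medium",
    "examplelargewiktionary": "large",
}


def is_read_only(wiki, default=True):
    # Wikis without a known group fall back to the safe default (read-only).
    group = GROUP_OF_WIKI.get(wiki)
    return READ_ONLY_BY_GROUP.get(group, default)


for wiki in GROUP_OF_WIKI:
    print(wiki, "read-only" if is_read_only(wiki) else "writable")
```

Defaulting unknown wikis to read-only keeps the rollout conservative while the investigation continues.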

Some oddities may exist in the database for pages that were created or deleted while Cognate was either switched off or in read only mode. Some of these may be tackled tomorrow (especially for small and medium wikis), but this will most likely happen next week.

Change 352095 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Cognate@master] Add PurgeDeletedCognatePages maint script

https://gerrit.wikimedia.org/r/352095

Looking at the writes etc. that Cognate has been running over the same time period, it doesn't look like this is Cognate related.

https://grafana-admin.wikimedia.org/dashboard/db/mediawiki-cognate?refresh=1m&orgId=1&from=1493889687428&to=1493976087428

However execution time for the cognate writes has now dramatically decreased.

Change 352569 had a related patch set uploaded (by Addshore; owner: Addshore):
[operations/mediawiki-config@master] Put Cognate in write mode for all wiktionaries

https://gerrit.wikimedia.org/r/352569

Change 352095 merged by jenkins-bot:
[mediawiki/extensions/Cognate@master] Add PurgeDeletedCognatePages maint script

https://gerrit.wikimedia.org/r/352095

Change 352757 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Cognate@wmf/1.29.0-wmf.21] Add PurgeDeletedCognatePages maint script

https://gerrit.wikimedia.org/r/352757

@jcrespo @Marostegui waiting for your approval to switch this back on on 'large' wikis.
It would be great to get this done this week.

@Addshore I think you may be blocked on the wrong people. I handled the original outage (and I think nobody disagreed that was the right thing to do at the time) and created this ticket, but code deployments and MediaWiki configuration changes are generally handled by Release-Engineering-Team (especially given that this seems like a pure software/deployment issue, but correct me if I am wrong). I would say the same thing that I said to the cxtranslation developers on a very similar incident: the only thing that ops require to make us happy (and that I think the affected users deserve in general) is an incident report, https://wikitech.wikimedia.org/wiki/Incident_documentation, with its follow-ups to avoid this issue in the future, and I see none right now.

Well, we had discussed switching Cognate back on as it was pre-outage to see what it was doing to the db servers (without any code changes), but this would require you to revert the infrastructure / db changes that you said you had made at the end of last week.
If we no longer want to do this then I'll get this turned back on today and run the maintenance scripts to fix the entries in the DB.

Change 352569 merged by jenkins-bot:
[operations/mediawiki-config@master] Put Cognate in write mode for all wiktionaries

https://gerrit.wikimedia.org/r/352569

Mentioned in SAL (#wikimedia-operations) [2017-05-10T18:14:05Z] <dereckson@tin> Synchronized wmf-config/InitialiseSettings.php: Put Cognate in write mode for all wiktionaries (T164407) (duration: 00m 42s)

Lydia_Pintscher lowered the priority of this task from Unbreak Now! to Medium.May 11 2017, 8:48 AM

Lowering the priority as it is back in production on all Wiktionaries. Only things left to do: write incident report and run script to clean up any remaining entries in the db.

Marostegui raised the priority of this task from Medium to High.May 11 2017, 9:49 AM

After it was enabled yesterday, it is throwing lots of errors: https://logstash.wikimedia.org/goto/8713a44d76d7a211d3a404468d224ac7
So far there are no issues from a DB performance point of view, but this needs to be looked at.

Addshore lowered the priority of this task from High to Medium.May 11 2017, 12:57 PM

This is unrelated to the patch turning write mode on for all wikis.
In fact, per https://grafana.wikimedia.org/dashboard/db/mediawiki-cognate enabling it everywhere creates basically no increase in write queries to the cluster.

The logs are due to new code being deployed with wmf.1 and are handled in T165005.
Nothing to do with the outage / connection issues.

Change 353851 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Cognate@master] Add a clean-first option to populatePages script

https://gerrit.wikimedia.org/r/353851

Change 353860 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Cognate@wmf/1.30.0-wmf.1] Add a clear-first option to populatePages script

https://gerrit.wikimedia.org/r/353860

Change 353860 merged by jenkins-bot:
[mediawiki/extensions/Cognate@wmf/1.30.0-wmf.1] Add a clear-first option to populatePages script

https://gerrit.wikimedia.org/r/353860

Mentioned in SAL (#wikimedia-operations) [2017-05-15T13:19:24Z] <addshore@tin> Synchronized php-1.30.0-wmf.1/extensions/Cognate/src/CognateStore.php: SWAT: [[gerrit:353860|Add a clear-first option to populatePages script]] T164407 PT 1/2 (duration: 00m 40s)

Mentioned in SAL (#wikimedia-operations) [2017-05-15T13:20:29Z] <addshore@tin> Synchronized php-1.30.0-wmf.1/extensions/Cognate/maintenance/populateCognatePages.php: SWAT: [[gerrit:353860|Add a clear-first option to populatePages script]] T164407 PT 2/2 (duration: 00m 39s)

Change 353851 merged by jenkins-bot:
[mediawiki/extensions/Cognate@master] Add a clear-first option to populatePages script

https://gerrit.wikimedia.org/r/353851

Mentioned in SAL (#wikimedia-operations) [2017-05-16T10:28:51Z] <addshore> T164407 addshore@terbium mwscriptwikiset extensions/Cognate/maintenance/populateCognatePages.php wiktionary.dblist --batch-size=1000

Mentioned in SAL (#wikimedia-operations) [2017-05-20T17:29:39Z] <addshore> addshore@terbium:/srv/mediawiki/php-1.30.0-wmf.1$ mwscriptwikiset extensions/Cognate/maintenance/purgeDeletedCognatePages.php wiktionary.dblist --batch-size=1000 >> ~/purge.201705161230.log T164407
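For context on --batch-size=1000, the sketch below shows the general shape of a batched purge that pauses for replication between batches. This is an assumption about how such a script is typically structured, not a copy of purgeDeletedCognatePages.php; delete_batch and wait_for_replicas are hypothetical callbacks.

```python
from itertools import islice


def batched(items, batch_size):
    # Yield successive lists of at most batch_size items.
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def purge_in_batches(ids, batch_size, delete_batch, wait_for_replicas):
    # Delete one batch, then give replicas a chance to catch up before the next.
    deleted = 0
    for batch in batched(ids, batch_size):
        delete_batch(batch)
        wait_for_replicas()
        deleted += len(batch)
    return deleted


sizes = []
total = purge_in_batches(range(2500), 1000, lambda b: sizes.append(len(b)), lambda: None)
print(total, sizes)  # 2500 [1000, 1000, 500]
```

Smaller batches keep each write short and bound how far replicas can fall behind during the run.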

It looks like the above script failed once it got to etwiktionary with "Could not wait for replica DBs to catch up to db1054", so we should continue from there.
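Continuing from the failed wiki amounts to slicing the wiki list at that entry. A minimal sketch, assuming the list is processed in file order; the list contents below are illustrative and not the real wiktionary.dblist.

```python
def wikis_from(dblist, failed_wiki):
    """Return the failed wiki and everything after it, in list order."""
    if failed_wiki not in dblist:
        raise ValueError(f"{failed_wiki} is not in the list")
    return dblist[dblist.index(failed_wiki):]


# Illustrative ordering only.
dblist = ["dewiktionary", "enwiktionary", "etwiktionary", "frwiktionary"]
print(wikis_from(dblist, "etwiktionary"))  # ['etwiktionary', 'frwiktionary']
```

The failed wiki is included again because its run may have stopped partway through.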

If you have the hostname that failed, I can double-check whether something was going on with that host. Right now there is no lag or anything else on s2, so maybe it was a temporary issue.

Mentioned in SAL (#wikimedia-operations) [2017-05-23T07:48:16Z] <addshore> addshore@terbium:~$ ~/mymwscriptwikiset extensions/Cognate/maintenance/purgeDeletedCognatePages.php et+wiktionary.dblist --batch-size=1000 >> ~/purge.201705161230.log T164407

Mentioned in SAL (#wikimedia-operations) [2017-05-23T09:46:18Z] <addshore> addshore@terbium:/srv/mediawiki/php-1.30.0-wmf.1$ mwscriptwikiset extensions/Cognate/maintenance/purgeDeletedCognatePages.php wiktionary.dblist --batch-size=1000 >> ~/purge.201705161230.log T164407

purgeDeletedCognatePages.php has finished running, so the cleanup is now done as well.

Change 352757 abandoned by Addshore:
Add PurgeDeletedCognatePages maint script

https://gerrit.wikimedia.org/r/352757