Not only did it make Cognate fail on the Wiktionaries, it also broke Flow, Echo, translations and other things hosted on x1-master.
|Open||None||T164504 Tracking: Cleanup x1 database connection patterns|
|Resolved||Addshore||T164407 Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections|
|Resolved||Addshore||T165608 Write an incident report of wikimedia outage caused by Cognate extension|
|Open||None||T166122 Cognate does some updates synchronously, and others via JobQueue. That may lead to inconsistencies in the DB|
- Mentioned In
- T164504: Tracking: Cleanup x1 database connection patterns
rECOGe19eccd072b8: Add a clear-first option to populatePages script
rECOGbe91880dfea6: Add a clear-first option to populatePages script
rECOGf4d77c959a55: Add a clean-first option to populatePages script
T165005: Wikimedia\Rdbms\LoadBalancer::reuseConnection: got DBConnRef instance.
rECOGc590e4402f4d: Add PurgeDeletedCognatePages maint script
rECOGddc67124743f: Add PurgeDeletedCognatePages maint script
rECOGcc023332e30a: Add PurgeDeletedCognatePages maint script
rECOGaae56a431ed0: Add PurgeDeletedCognatePages maint script
rECOG06468e4158a9: Add PurgeDeletedCognatePages maint script
rECOG97f894225d22: Add PurgeDeletedCognatePages maint script
rECOG22f9f05d8384: Add PurgeDeletedCognatePages maint script
rECOGc576d4a5a350: Add PurgeDeletedCognatePages maint script
rECOGe786295eb5cc: Add PurgeDeletedCognatePages maint script
rECOGd9f9a3644436: Add read only mode
rECOG4970050d6d03: Add read only mode
rECOG5f20dc54a26e: Add read only mode
rECOGda8516540ff7: Add read only mode
rECOG42a1afb0f8e8: Add read only mode
rECOGbbea9b6fa85a: Add read only mode
rECOG8d92e97cc180: Add read only mode
rECOGee78136414c4: Add read only mode
rECOG23e888af1af9: Release connections as early as possible in CognateStore
T164451: Non-ASCII pages don't display correctly the links
rECOGcd6c35eead88: Fix selectSitesForPage to use getReadConnectionRef
T164417: Cognate extension does not work on eo.wikt
rECOGc5b9db207786: Do not use DB_MASTER to select
rECOGe2ba256bd8c6: Do not use DB_MASTER to select
T164406: Something weird going on with Flow in nowiki?
- Mentioned Here
- T165005: Wikimedia\Rdbms\LoadBalancer::reuseConnection: got DBConnRef instance.
@Lydia_Pintscher @Lea_Lacroix_WMDE @MarcoSwart Cognate is now enabled on wiktionaries in read only mode.
This means that all interwiki links that were present and provided by Cognate before the DC switch will appear again after a page purge.
This also means that links will not automatically be created when pages are created and links will not automatically be removed when pages are deleted.
We will have to run the populatePages maint script every few days to try and keep everything as in sync as possible until we take Cognate out of read only mode.
This is very strange, maybe my monitoring is bad, but I see all queries going to the master, and none to the slave (or maybe I am doing something wrong):
$ mysql -h db1031.eqiad.wmnet sys -e "SELECT sum(exec_count) FROM statement_analysis WHERE db = 'cognate_wiktionary'"
+-----------------+
| sum(exec_count) |
+-----------------+
|           65342 |
+-----------------+
$ mysql -h db1029.eqiad.wmnet sys -e "SELECT sum(exec_count) FROM statement_analysis WHERE db = 'cognate_wiktionary'"
+-----------------+
| sum(exec_count) |
+-----------------+
|            NULL |
+-----------------+

# ExtensionStore shard1 - initially for AFTv5
'extension1' => [
    '10.64.16.20' => 0, # db1031, master
    '10.64.16.18' => 1, # db1029
],

$ dig +short -x 10.64.16.20
db1031.eqiad.wmnet.
$ dig +short -x 10.64.16.18
db1029.eqiad.wmnet.
Queries are actually running on db1029; for some reason they are just not being registered in the performance_schema tables. This is a monitoring bug, not a problem with Cognate, so you can ignore my last comment.
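For readers puzzled by why all reads were expected on db1029: a minimal sketch, assuming the numbers in the 'extension1' section quoted above are read-load weights and that a read host is chosen by weighted random selection. The helper name `pick_read_host` and the fallback rule are hypothetical and only illustrate the idea; this is not MediaWiki's LoadBalancer code.

```python
import random

# Read-load weights copied from the 'extension1' section quoted above.
EXTENSION1_LOADS = {
    "10.64.16.20": 0,  # db1031, master
    "10.64.16.18": 1,  # db1029
}


def pick_read_host(loads, rng=random):
    """Hypothetical helper: pick a host for reads in proportion to its weight."""
    # Hosts with weight 0 are never candidates for ordinary reads.
    candidates = {host: weight for host, weight in loads.items() if weight > 0}
    if not candidates:
        # Assumption: with no weighted replica, fall back to the first
        # listed host, which is the master in the config above.
        return next(iter(loads))
    hosts = list(candidates)
    weights = [candidates[host] for host in hosts]
    return rng.choices(hosts, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(pick_read_host(EXTENSION1_LOADS))
```

Under these assumptions the master's weight of 0 means SELECTs should land on db1029, which matches the conclusion that the NULL count there was a monitoring artifact rather than missing traffic.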
So Cognate is currently fully enabled on small and medium wiktionaries, the lists of these can be found below:
This is how things will remain over the weekend and we will carry out further investigation at the start of next week.
Some oddities may exist in the database for pages that were created or deleted while Cognate was either switched off or in read only mode. Some of these may be tackled tomorrow (especially for small and medium wikis), but most likely this will happen next week.
Something happened yesterday at 1:50 that made most of the row write contention on x1-master disappear: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=19&fullscreen&orgId=1&from=1493889687428&to=1493976087428&var-dc=eqiad%20prometheus%2Fops&var-server=db1031
Looking at the writes etc. that Cognate has been running over the same time period, it doesn't look like this is Cognate-related.
However, the execution time for the Cognate writes has now dramatically decreased.
@Addshore I think you may be blocked on the wrong people. I handled the original outage (and I think nobody disagreed that was the right thing to do at the time) and created this ticket, but code deployments and MediaWiki configuration changes are generally handled by Release-Engineering-Team (especially given that this seems like a pure software/deployment issue, but correct me if I am wrong). I would say the same thing that I said to the cxtranslation developers on a very similar incident: the only thing that ops require to be happy (and that I think the affected users deserve) is an incident report at https://wikitech.wikimedia.org/wiki/Incident_documentation, with follow-ups to avoid this issue in the future, and I see none right now.
Well, we had discussed switching Cognate back on as it was pre-outage to see what it was doing to the db servers (without any code changes), but this would require you to revert the infrastructure / db changes that you said you had made at the end of last week.
If we no longer want to do this then I'll get this turned back on today and run the maintenance scripts to fix the entries in the DB.
After it was enabled yesterday, it is throwing lots of errors: https://logstash.wikimedia.org/goto/8713a44d76d7a211d3a404468d224ac7
So far there are no issues from a DB performance point of view, but this needs to be looked at.
This is unrelated to the patch turning write mode on for all wikis.
In fact, per https://grafana.wikimedia.org/dashboard/db/mediawiki-cognate, enabling it everywhere creates basically no increase in write queries to the cluster.
The logs are due to new code being deployed with wmf.1 and are handled by T165005.
Nothing to do with the outage / connection issues.