Investigate increase in "readonly" saving errors in Content Translation
Closed, Resolved · Public

Description

Since 2016-07-12 there has been a surge in the number of Content Translation saving errors with the "readonly" status.

"Saving" here doesn't mean "saving a MediaWiki page", but "saving a translation in progress in the wikishared database". (If I'm not mistaken, this is done by using the cxsave API action.)

These errors are logged in EventLogging. An example of a returned JSON trace is: {"servedby":"mw1207","error":{"code":"readonly","info":"The wiki is currently in read-only mode","readonlyreason":"The database has been automatically locked while the slave database servers… (Unfortunately the trace is truncated because of a limitation on string length in EventLogging.)

From 2015-11-01 until 2016-07-12 there were usually 0 of these errors per day, and occasionally a small number, up to 11. Since that date it has been between 30 and 90 per day.

@Reedy and @Nikerabbit found that the error probably comes from LoadBalancer.php, but further investigation is needed: why did this start happening? This is probably not a normal condition, and it affects the general stability of Content Translation and possibly other features of Wikimedia sites.

Event Timeline

Amire80 triaged this task as Medium priority.Jul 22 2016, 12:04 PM
Amire80 moved this task from Needs Triage to CX10 (July-Sep. 2016) on the ContentTranslation board.

This continues, with 74–93 errors each day since I created the bug.

No end users have complained about this, but the sudden increase from zero to dozens per day since July 12 is still concerning. If it's not actually a problem, I can just filter it out in my daily checks, but I'd like solid confirmation that it's not a problem.

Assuming this is x1, I cannot see any time when that server lagged more than 2 seconds in the last week (or any other strange pattern):
https://tendril.wikimedia.org/host/view/db1029.eqiad.wmnet/3306

There was, however, a cx_translation incident on the 13th that made x1 unavailable, and some code was changed at that time:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20160713-ContentTranslation

If you give me the full error (database), I can check other servers.

This error most likely comes from ApiMain::checkReadOnly, which calls wfReadOnly and wfReadOnlyReason; the latter calls wfGetLB()->getReadOnlyReason(), which means it checks the main database connection, not the shared database connection.
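
For reference, an abridged paraphrase of that call chain (simplified; the global $wgReadOnly / read-only-file checks that core also performs are omitted here):

  // Consults the *local* wiki's main load balancer (e.g. the fawiki section),
  // not the wikishared/x1 connection that cxsave actually writes to.
  function wfReadOnlyReason() {
      return wfGetLB()->getReadOnlyReason();
  }

  function wfReadOnly() {
      return wfReadOnlyReason() !== false;
  }

  // ApiMain::checkReadOnly() (paraphrased): for write modules such as cxsave,
  // if wfReadOnly() returns true the request dies with code "readonly" and the
  // reason string from above, i.e. the error shown in the task description.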

30 most recent cases:

+----------------+----------------------+
| timestamp      | event_targetLanguage |
+----------------+----------------------+
| 20160727193236 | fa                   |
| 20160727193236 | fa                   |
| 20160727193235 | fa                   |
| 20160727193235 | fa                   |
| 20160727193235 | fa                   |
| 20160727193235 | fa                   |
| 20160727193235 | uk                   |
| 20160727193233 | fa                   |
| 20160727193233 | fa                   |
| 20160727193233 | uk                   |
| 20160727193233 | es                   |
| 20160727162657 | tr                   |
| 20160727162651 | pt                   |
| 20160727152633 | pt                   |
| 20160727152012 | cs                   |
| 20160727152004 | pt                   |
| 20160727152004 | zh                   |
| 20160727152002 | cs                   |
| 20160727152001 | zh                   |
| 20160727151957 | pt                   |
| 20160727151957 | zh                   |
| 20160727151954 | pt                   |
| 20160727151954 | zh                   |
| 20160727151953 | zh                   |
| 20160727151952 | pt                   |
| 20160727151951 | pt                   |
| 20160727151948 | pt                   |
| 20160727151947 | pt                   |
| 20160727151940 | pt                   |
| 20160727151938 | pt                   |
+----------------+----------------------+

Logstash shows a couple of interesting events around that time for fawiki:

  • ReadOnly exceptions from ResourceLoaderModule::saveFileDependencies, confirming that the read-only condition really happened.
  • User '[censored]' (IP [censored]) tripped fawiki:limiter:cxsave:user:[censored] at 11 (limit 10 in 30s), confirming that our spike protector is working (maybe even too aggressively?).
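
For context, a minimal sketch of the kind of ping-limiter configuration that matches that log line (the 10-in-30-seconds figure comes from the message above; the rest is an assumption, not the actual production value):

  $wgRateLimits['cxsave'] = [
      // At most 10 cxsave requests per user in any 30-second window;
      // the 11th request within the window trips the limiter, as logged above.
      'user' => [ 10, 30 ],
  ];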

It is still unclear to me what could have caused the lag. Could it be the 10 saves we allowed to go through? Or the insert to blobs_cluster2N?

For fawiki, for example, in the last 128 hours there were 580 occasions on which a job called LinksUpdate::updateLinksTimestamp blocked replication for an average of 9 seconds and a maximum of 15 seconds, and over 400 occasions for HTMLCacheUpdateJob::invalidateTitles, taking between 10 and 35 seconds.

In db-eqiad.php, the max lag parameter seems to be set to 8 seconds, which, AFAIK, means that every time lag gets over that, the API goes read-only.
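
For illustration, roughly the shape such a setting takes in MediaWiki's load balancer configuration (a sketch, not a copy of db-eqiad.php; only the 8-second figure comes from the comment above):

  $wgLBFactoryConf['serverTemplate'] = [
      'type'    => 'mysql',
      'flags'   => DBO_DEFAULT,
      // Replicas lagging more than this many seconds are not used for reads;
      // if all of them exceed it, the load balancer reports the wiki as
      // read-only, which is what API write modules then return to the client.
      'max lag' => 8,
  ];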

You can probably report those as children of T95501 (and help fix them).

updateLinksTimestamp() just updates one row by PK. It might have lock contention on the master, but how would that affect the slaves?
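
For reference, roughly what that single-row update looks like (paraphrased, not copied from core; $pageId stands in for the page's primary key value):

  $dbw = wfGetDB( DB_MASTER );
  $dbw->update(
      'page',
      [ 'page_links_updated' => $dbw->timestamp() ],
      [ 'page_id' => $pageId ], // primary key, so exactly one row is touched
      __METHOD__
  );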

I should have clarified that those are the times I measured on the master, and they were individual queries, not transactions. They could (and probably do) take less time on the slaves, for example if they hit contention on the master that doesn't appear on the slaves. But unless something strange is happening (e.g. disk failures), and given the hardware selection for the master and the slaves, lag on the slaves < master transaction duration.

I do not have stats for transactions on the slaves, so this is a good approximation of the "probable offenders". The charts show that lag doesn't happen on a single slave, but on all of them at the same time:

https://tendril.wikimedia.org/chart?hosts=family%3Adb1041&vars=%5Eseconds_behind_master&mode=value

Also, if transactions take 1 second each but there are 100 of them happening one after another with no lag checks/waits, lag will still build up by aggregation (and I was the one who wanted to enable multi-channel writes! :-) ). The main issue with links-like handling is that we are probably not talking about 100 entries, but at times about updates of thousands or sometimes millions of titles at a time, in a very inefficient schema (but that is another story).
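
As a sketch of what "lag checks/waits" means here (batch size and variable names are illustrative, not taken from the actual job code):

  foreach ( array_chunk( $pageIdsToUpdate, 100 ) as $batch ) {
      foreach ( $batch as $pageId ) {
          // ... one small write per page ...
      }
      // Pause until replicas catch up, so that per-write lag cannot
      // aggregate into tens of seconds across thousands of titles.
      wfWaitForSlaves();
  }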

Do you know if lag waits take into account pooled but weight-0 slaves? We recently came across T141340, and maybe certain kinds of slaves are not accounted for there.
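
For illustration, the kind of configuration the question is about (a hypothetical excerpt; the section name, weights and group assignment are assumptions, only the server names appear elsewhere in this task):

  $wgLBFactoryConf['sectionLoads']['s7'] = [
      'db1041' => 0, // master (weight 0 by convention)
      'db1079' => 0, // pooled, but weight 0 for general reads
      // ... other replicas with positive weights ...
  ];
  $wgLBFactoryConf['groupLoadsBySection']['s7'] = [
      'api' => [
          'db1079' => 1, // receives 'api' group traffic despite its 0 weight above
      ],
  ];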

Was updateLinksTimestamp changed around 2016-07-12, when these errors started happening?

We did not pick this up for our current sprint because:

  • it seems that the lag is not caused by our queries (please clarify if not true)
  • there are no user reports about saving failures, which hints that this is not affecting a lot of users at the moment

This is probably caused, or made worse, by a configuration mistake: T142135 (lag not being monitored on certain servers). Once that is fixed, there is a high chance this problem goes away or is greatly minimized. While I am not 100% sure, it fits the lag observed on certain servers during job execution.

Change 303145 had a related patch set uploaded (by Jcrespo):
Pool db1079 as an api server; reduce main load

https://gerrit.wikimedia.org/r/303145

Change 303145 merged by Jcrespo:
Pool db1079 as an api server; reduce main load

https://gerrit.wikimedia.org/r/303145

The daily number of these errors has gone down sharply since August 12. It used to be over 70 per day, and now it's between 0 and 8. This brings us back to how it was before July 12 (maybe just a bit higher). Should this remain open?

I do not think so; the parts of the bug I mentioned above seem to have been fixed, which fixed the pattern.

aaron claimed this task.